Message boards :
Number crunching :
The Server Issues / Outages Thread - Panic Mode On! (118)
juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799

> NO - if the serving pot was open then that would be true, but there is a limit of 200 tasks in the pot, and if ONE cruncher grabs 100 of them there are fewer left for anyone else coming along after, and when the pot is empty there is a pause in delivery while it is refilled - which is why we see so many "project has no tasks" messages, even when there are thousands apparently available in the RTS.

Why not simply increase the size of this pot, and make it intelligent with a constant refilling function instead of a fixed size?

IMHO the real problem is that SETI, or BOINC itself, is obsolete (software and hardware) for the hardware and software today's volunteers use. Trying to serve both very slow devices (which take days to crunch a single WU) and superfast devices (a few of them spoofed, which crunch a WU in a couple of seconds) is the real source of all our constant problems. The untested release of the server WU limits just makes it all worse.

With a simple look at the SSP you see:

Results returned and awaiting validation 0 35,474 14,150,778

Why is this number so high? Surely not because of the superfast or spoofed hosts. It comes from the slow hosts (the vast majority of hosts) and the long WU deadlines.

Besides the impossible solution of buying new hardware (lots of $ and time), the only real way this could be solved is by reducing the WU limits on the slow devices even further and drastically reducing the WU deadlines. It makes little or no sense to allow a host that produces a couple of WUs per day to hold a 150 WU buffer. The number of WUs available to a host should be limited to its capacity to crunch and return valid WUs; maybe 1 day's worth would be enough.

Please remember... I'm talking about thousands of hosts that download large caches and will never return them on time. So please stop blaming the maybe 30-40 hosts that run spoofed clients, which download and return their WUs within a day or two. They actually help to clear the DB.

my 0.02
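A minimal sketch of the throughput-based quota being proposed above, in Python. The names, the one-day window, and the caps are illustrative assumptions for discussion, not actual BOINC scheduler code:

```python
# Hypothetical sketch: cap the number of tasks a host may have in progress
# at roughly what it has returned and validated over a recent window.
# Names and numbers are illustrative only.

from dataclasses import dataclass

@dataclass
class HostStats:
    valid_results_last_24h: int   # tasks returned and validated in the last day
    tasks_in_progress: int        # tasks currently assigned to the host

def allowed_new_tasks(host: HostStats,
                      window_days: float = 1.0,
                      minimum: int = 2,
                      hard_cap: int = 150) -> int:
    """Return how many new tasks this host may download right now."""
    # Quota scales with demonstrated throughput over the chosen window.
    quota = max(minimum, int(host.valid_results_last_24h * window_days))
    quota = min(quota, hard_cap)          # never exceed the existing per-host cap
    return max(0, quota - host.tasks_in_progress)

# Example: a host that validates 4 tasks/day gets a 4-task buffer,
# while a GPU host validating 500/day is still clamped at 150.
print(allowed_new_tasks(HostStats(valid_results_last_24h=4, tasks_in_progress=1)))    # 3
print(allowed_new_tasks(HostStats(valid_results_last_24h=500, tasks_in_progress=60))) # 90
```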
Unixchick · Joined: 5 Mar 12 · Posts: 815 · Credit: 2,361,516 · RAC: 22

> more of a whine / observation / OCD thing but why are there 5 'chunks' of data waiting to be processed since late 2019? would flushing everything out of the repositories have a potential cleansing effect? i'm no DB nor systems design guy, just wondering..... it has come close a few times only to have another coupla days of data pushed in front - like today.

Before we got the new blc73s I had hoped that we would split the old files in the queue, but alas that was not to be. I don't think files in the to-be-split queue affect the process much, and I don't think we are near the "limit" on things in that queue, as I've seen it much longer in the past. There is the possibility that files that are partway through splitting, but left hanging as the splitters move on to new files, might be an issue, but I have no idea.

They increased the number of splitting processes a couple of months ago. There are now 16 processes working on 11 files versus the previous 14 processes working on 9 files. I have no idea how the splitters choose which file to work on next; it doesn't follow a pattern anymore.

My opinion, based on nothing, is that they don't watch the splitting queue that much. We usually have to let them know when files have gotten stuck splitting; they just don't monitor it that closely. My guess is that they have no idea that there are 5 "stale" files hanging out in the queue. I truly think that there is no problem caused by the stale files, but I will admit it still bothers me.
Stephen "Heretic" · Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628

> Something else that could help is to end the spoofing. If a host has 4 GPUs they should only get WUs for those 4 GPUs. No more editing software to spoof that a host has 15, 20, 30 or more GPUs when they have no more than 8.

> Spoofing has actually allowed me to be nice to the servers and other users. I reduce my fake gpu count when the Tuesday outage has started so that when the outage ends, I'm still above my new cap, so I'm only reporting results but not competing with the other hosts for new tasks. When my host finally starts asking for new tasks, it is only asking for a few at a time, matching the number it reported. And when this happens, the post-outage congestion is over already.

. . Or do as many others do and set No New Tasks until the reporting crush has finished and the servers are getting back on their feet. These days that takes several hours. But either way the benefit is minimal, because the majority of volunteers do none of the above, and so the problem persists.

. . As far as spoofing goes, while telling the system that 2 GPUs are really 64 GPUs is ridiculously extreme, most of those spoofing such high numbers do have huge numbers of GPUs, like 6 to 8 physical units. And the total impact from this practice is also quite minimal, because those spoofing are a tiny, tiny fraction of overall volunteers. But that is just my 2c worth.

Stephen

< shrug >
Unixchick · Joined: 5 Mar 12 · Posts: 815 · Credit: 2,361,516 · RAC: 22

+1

If there is a limit of 10 days' worth of WUs that one can have (if you don't hit the #/cpu & #/gpu limit already), why can't they shorten the due date to 3 weeks?? I think this would get the WUs that are ghosted, or on machines that have left the project, back into circulation quicker. Some of my WUs have a due date 8 weeks out. Would this be an issue for those running multiple projects?? Is there some unintended consequence I'm overlooking??
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873

> Would this be an issue for those running multiple projects?? Is there some unintended consequence I'm overlooking??

No, none. Project deadlines are only applicable to each individual project. The effect if Seti shortened its deadlines would be that hosts would get through Seti work more quickly and would not get preempted as much by other projects' shorter-deadline tasks. Compared with the majority of projects, Seti has the longest deadlines, except for some outliers like WCG and Climate Prediction, which have almost year-long deadlines.

Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
kittyman · Joined: 9 Jul 00 · Posts: 51580 · Credit: 1,018,363,574 · RAC: 1,004

The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project. As it should be.

Meow.

"Time is simply the mechanism that keeps everything from happening all at once."
Stephen "Heretic" · Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628

> While it MAY have more work than can be processed (a claim for which there is NO evidence) then, if there is a problem delivering that work to the users then it make no sense to attempt to grab all one can, and so turn the average user away because they can't get work due to the greed of a very vocal minority.

. . Sorry Rob but I have to disagree with you there. We have seen on a couple of occasions that if we get all the data that Greenbank produces in one day, our whole setup is busy crunching for several months. The daily output from one GBT recorder takes several days to get through, and there are 64 potential recorders which we can process. If/when we finally start processing data from Parkes (to which I am looking forward) that output will double. While our processing capacity has grown many times over during the decades this program has been operating, so has the data which we need to process. If our daily capacity could grow 10-fold, we would still be lagging behind the data available. One thing in our favour is that we do not see all the data from every day: we see only some of the output from each observatory, and so far none from Parkes (had to throw that in). Eric and David have both stressed the need for more volunteers, and they ain't lying.

. . As far as being greedy, since when is it greedy to want to do work for somebody else at no charge, despite the cost to yourself? And these spoofing hosts you seem to feel are denying you the work you want are simply trying to keep the supply of work up to the level of productivity of which their machines are capable. I would call that efficiency, not greed, and working towards having the processing power that the project requires.

The issue is that the crunching power of the volunteer force has grown beyond the capacity of the server farm. While there is at least one very powerful new server under test, it appears there are problems which are delaying its deployment in main. And there is a new data storage unit under construction that will be many times larger, and several times faster, than the one currently available, but this could still be a long time away from being deployed.

The whole issue has been seriously aggravated by recent large runs of noisy data, such as the Blc35 and Blc41 series, which have in large part been about 90% noise bombs. The servers are choking on the output of a much greater crunching force than has ever before worked on the project. There are probably some other issues biting at the ankles of the system since the unfortunate BOINC upgrade that kicked off most of these problems prior to Christmas and required a 'roll back', which I have no doubt has left several time bombs in the machinery that could also be part of the problem.

The cure is a matter of time and money, and I cannot see the boffins at Berkeley actually asking all the volunteers to do less work.

. . Sorry but you tripped a trigger with that message ...

Stephen

< shrug >
juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799

> The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project.

Nice to see you around. Back to topic.

Even the slowest computers, or devices like cell phones, Pis, etc., could crunch a WU in less than a month... so why keep a deadline of 2-3 months? Remember the DB needs to keep track of the WU for all that time, and if at the end of that time it was not returned, rinse & repeat: another 2-3 months?

My point is simple: drop the fixed limit of WUs per GPU/CPU and allow a host to download only up to the number of WUs it crunches and returns valid within a specific period of time (1 day, for example). And reduce the deadlines. That will squeeze the DB to a more manageable size for sure.

Desperate times need desperate measures......
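A back-of-the-envelope illustration of the database arithmetic behind this argument, with assumed numbers standing in for real project statistics: in rough steady state, the rows the DB must track scale with how long each result stays outstanding, so the deadline mainly determines how long ghosted or abandoned tasks linger.

```python
# Illustrative sketch only: the issue rate, ghost fraction and turnaround are
# assumptions, not SETI@home figures.

issue_rate_per_day = 600_000        # assumed tasks sent per day
ghost_fraction = 0.05               # assumed share never returned
avg_turnaround_days = 1.5           # typical live-host turnaround

def outstanding_rows(deadline_days: float) -> int:
    # Live hosts hold a row only for their turnaround time...
    live = issue_rate_per_day * (1 - ghost_fraction) * avg_turnaround_days
    # ...but ghosted/abandoned tasks occupy a row until the deadline expires.
    ghosts = issue_rate_per_day * ghost_fraction * deadline_days
    return int(live + ghosts)

print(outstanding_rows(deadline_days=56))  # ~8-week deadline -> ~2.5 million rows
print(outstanding_rows(deadline_days=21))  # ~3-week deadline -> ~1.5 million rows
```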
kittyman · Joined: 9 Jul 00 · Posts: 51580 · Credit: 1,018,363,574 · RAC: 1,004

It would be grand if the project could meter out work allocation to the host computers based on their ability to return processed work. But that would require more programming, and a lot more work on the project servers to figure out what to send or not send on every single work request. Methinks the overhead would be too high to be worth it.

Meow.

"Time is simply the mechanism that keeps everything from happening all at once."
Stephen "Heretic" · Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628

> While it MAY have more work than can be processed (a claim for which there is NO evidence) then, if there is a problem delivering that work to the users then it make no sense to attempt to grab all one can, and so turn the average user away because they can't get work due to the greed of a very vocal minority.

. . Exactly, we are all in the same boat ...

Stephen

< shrug >
Ian&Steve C. · Joined: 28 Sep 99 · Posts: 4267 · Credit: 1,282,604,591 · RAC: 6,640

> The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project.

But when the number of computers that need 4-6 weeks to complete 1 WU is almost non-existent (even an RPi can complete a WU in under a day), and given the impact this kind of setting is having on the project... it's probably best to do what's in the best interest of the project.

Seti@Home classic workunits: 29,492 · CPU time: 134,419 hours
Ville Saari · Joined: 30 Nov 00 · Posts: 1158 · Credit: 49,177,052 · RAC: 82,530

> NO - if the serving pot was open then that would be true, but there is a limit of 200 tasks in the pot, and if ONE cruncher grabs 100 of them there are fewer left for anyone else coming along after,

> Why not simply increase the size of this pot, and make it intelligent with a constant refilling function instead of a fixed size?

The easiest solution would be to have a limit on the number of tasks given to one host at a time. This limit could start out the same as the size of the pot, but whenever a request had to be offered 0 tasks because the pot was empty, the limit would be reduced a bit. And whenever there were still tasks left in the pot when it was refilled, the limit would be increased a bit.

So it would scale dynamically: when everything is fine, every request would get what it asks for, but in a throttled situation like today, everyone would get a little bit in each request whenever there are tasks in the RTS queue, instead of the current situation where most requests get nothing and an occasional lucky one gets a lot.
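A minimal sketch of that feedback scheme in Python: start the per-request limit at the size of the feeder pot, shrink it a little every time a request has to be sent away empty, and grow it a little whenever the pot is refilled with tasks still left over. The names, step factors, and pot handling are illustrative assumptions, not actual BOINC feeder or scheduler code:

```python
POT_SIZE = 200          # tasks held in the feeder pot
STEP_DOWN = 0.9         # shrink factor after an empty-pot request
STEP_UP = 1.1           # growth factor after a refill with leftovers

per_request_limit = float(POT_SIZE)

def on_scheduler_request(requested: int, pot: list) -> list:
    """Serve one work request from the pot, adapting the limit as we go."""
    global per_request_limit
    if not pot:
        # Request got nothing: tighten the limit so everyone gets a little.
        per_request_limit = max(1.0, per_request_limit * STEP_DOWN)
        return []
    grant = min(requested, int(per_request_limit), len(pot))
    return [pot.pop() for _ in range(grant)]

def on_pot_refill(pot: list, new_tasks: list) -> None:
    """Refill the pot, relaxing the limit if the last potful wasn't exhausted."""
    global per_request_limit
    if pot:
        per_request_limit = min(float(POT_SIZE), per_request_limit * STEP_UP)
    pot.extend(new_tasks)

# Usage sketch: one refill, then a request for 100 tasks.
pot: list = []
on_pot_refill(pot, list(range(POT_SIZE)))
print(len(on_scheduler_request(100, pot)))   # 100 while the limit is still wide open
```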
Unixchick · Joined: 5 Mar 12 · Posts: 815 · Credit: 2,361,516 · RAC: 22

> The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project.

I totally agree on including as many people as possible! I guess I just figured that a device could return a WU in 21 days. There is a limit of 10 days' worth of WUs (I'm assuming, as this was a big deal when I was on my slower machine: I would hit this limit before my 100 WU limit. If this is no longer the case, let me know). If a machine takes longer than 21 days (my suggested time-out limit) to do 1 WU, would it only get 1 WU at a time? This brings up so many questions... what is the slowest system? how long does it take to do 1 WU??
juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799

> It would be grand if the project could meter out work allocation to the host computers based on their ability to return processed work.

That is actually done each day, for all hosts (from the slowest to the fastest); just look at the stats..... So no extra load on the servers from doing this.
Stephen "Heretic" · Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628

> NO - if the serving pot was open then that would be true, but there is a limit of 200 tasks in the pot, and if ONE cruncher grabs 100 of them there are fewer left for anyone else coming along after, and when the pot is empty there is a pause in delivery while it is refilled - which is why we see so many "project has no tasks" messages, even when there are thousands apparently available in the RTS.

. . True, but that is because EVERY host has the potential to empty that buffer with a limit of 150 WUs per device. It is in no way because of spoofing. What has been previously suggested, generally by the guys spoofing GPUs whom you hold accountable for the problem, is that in times of crisis such as now a work fetch limit be imposed, such as 10 or 20 WUs per request. This would reduce the impact on average and would be 'fairer' even by your definition. Slower hosts would refill their caches in a relatively short time and even the faster hosts would not be completely devoid of work, but the overall effectiveness would still be limited by the behaviour of the splitters and SETI servers.

Stephen

< shrug >
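A rough sketch of that simpler "crisis cap" in Python: while the ready-to-send queue is under pressure, clamp every work request to a small fixed number of tasks regardless of how many the client asks for. The function name, the thresholds, and the caps are made-up illustrations, not project settings:

```python
CRISIS_CAP = 20          # max tasks per request while congested
NORMAL_CAP = 150         # existing per-device limit otherwise

def tasks_to_send(requested: int, rts_queue_depth: int,
                  congestion_threshold: int = 1000) -> int:
    """Clamp a scheduler request depending on ready-to-send queue depth."""
    cap = CRISIS_CAP if rts_queue_depth < congestion_threshold else NORMAL_CAP
    return min(requested, cap)

# A host asking for 100 tasks gets 20 while the queue is starved,
# and up to 100 once the backlog has recovered.
print(tasks_to_send(100, rts_queue_depth=300))     # 20
print(tasks_to_send(100, rts_queue_depth=500000))  # 100
```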
Alien Seeker · Joined: 23 May 99 · Posts: 57 · Credit: 511,652 · RAC: 32

> With a simple look at the SSP you see:
> Results returned and awaiting validation 0 35,474 14,150,778

According to the site, your computer has an average turnaround time of 1.17/1.24 days (CPU/GPU), which isn't even twice as fast as my CPU-only, seriously-throttled, switched-off-at-night computers (1.50 days for one, 1.91 days for the other). So in the end, your superfast spoofed host keeps validation pending nearly as long as my slow computers do; it just crunches many more tasks in the same time. What weighs heavily on the number of tasks/workunits around is ghosts, and the more in-progress tasks you have at a given time, the more likely you are not to realise that some of them never actually reached your computer. Shortening the deadline to, say, 2 or 3 weeks would help a lot without affecting even slower systems.

Gazing at the skies, hoping for contact... Unlikely, but it would be such a fantastic opportunity to learn.
My alternative profile
Ville Saari · Joined: 30 Nov 00 · Posts: 1158 · Credit: 49,177,052 · RAC: 82,530

> The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project.

How old is an old computer? My older cruncher is 11 years old and its ancient Core 2 Duo CPU can crunch a slow AstroPulse task in 8 hours and other tasks in 1 to 2 hours. Single-thread performance of CPUs hasn't grown a lot in the last decade; they have just gained a lot more cores. That chip has the same wattage as the new 8-core Zen 2 chip in my other computer, which has about 10 times its crunching power. So using very old hardware for number crunching is bad for the climate (and for the wallet too).
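The arithmetic behind that efficiency point, with illustrative wattage and runtime figures standing in for the actual hosts described above:

```python
# Energy per task = power draw * time per task. With equal wattage but about
# 10x the throughput, the older chip burns roughly 10x the energy per WU.
# (Assumed numbers for illustration, not measurements of these hosts.)

chip_watts = 65.0                 # assumed package power for both CPUs
old_hours_per_task = 1.5          # Core 2 Duo, typical task
new_hours_per_task = 0.15         # Zen 2 at ~10x the crunching power

old_wh = chip_watts * old_hours_per_task   # ~97.5 Wh per task
new_wh = chip_watts * new_hours_per_task   # ~9.75 Wh per task
print(old_wh / new_wh)                     # ~10x the energy per task
```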
Stephen "Heretic" · Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628

> I just think the 'what tape shall I run next?' algorithm is running LIFO instead of FIFO.

. . It would certainly seem so ...

Stephen :)
Stephen "Heretic" · Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628

> The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project.

. . So how slow a computer would you need to take 12 weeks to process one WU?????

Stephen

. . Just curious .... :)
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874

As an example of the timing problems that a volunteer-based project like SETI has to navigate, I've just cleared a _5 task that has been hanging around for 9 days:

| Task | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Credit | Application |
|---|---|---|---|---|---|---|---|---|
| 8538194490 | 6551171 | 12 Feb 2020, 17:20:04 UTC | 13 Feb 2020, 11:18:10 UTC | Error while computing | 40.65 | 38.64 | --- | SETI@home v8 v8.22 (opencl_nvidia_SoG) windows_intelx86 |
| 8538194491 | 8889086 | 12 Feb 2020, 17:19:54 UTC | 13 Feb 2020, 2:56:07 UTC | Aborted | 0.00 | 0.00 | --- | SETI@home v8 v8.22 (opencl_nvidia_SoG) windows_intelx86 |
| 8539994504 | 8740693 | 13 Feb 2020, 3:29:55 UTC | 14 Feb 2020, 2:43:40 UTC | Completed and validated | 12,326.90 | 10,470.39 | 41.14 | SETI@home v8 v8.08 (alt) windows_x86_64 |
| 8541276844 | 8637291 | 13 Feb 2020, 11:53:47 UTC | 15 Feb 2020, 21:05:49 UTC | Error while computing | 2,728.58 | 13.72 | --- | SETI@home v8 Anonymous platform (NVIDIA GPU) |
| 8551093378 | 8687393 | 15 Feb 2020, 21:30:46 UTC | 21 Feb 2020, 19:36:02 UTC | Error while computing | 4,098.76 | 12.27 | --- | SETI@home v8 v8.22 (opencl_nvidia_SoG) windows_intelx86 |
| 8572354368 | 6910484 | 21 Feb 2020, 19:36:04 UTC | 21 Feb 2020, 19:46:26 UTC | Completed and validated | 36.09 | 33.22 | 41.14 | SETI@home v8 Anonymous platform (NVIDIA GPU) |

The three 'error' tasks were from:

NVIDIA GeForce GTX 960 (2048MB) driver: 436.48 OpenCL: 1.2
NVIDIA GeForce GTX 1080 Ti (4095MB) driver: 441.66 OpenCL: 1.2
NVIDIA GeForce GTX 1060 6GB (4095MB) driver: 441.66 OpenCL: 1.2

- so we're still suffering (and suffering badly) from NVidia's mistake. _3 and _4 between them held up the WU for eight of the nine days it's spent in the database. And just look at the runtime differential between the two valid instances.