The Server Issues / Outages Thread - Panic Mode On! (118)

Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033330 - Posted: 21 Feb 2020, 17:51:58 UTC - in response to Message 2033327.  
Last modified: 21 Feb 2020, 17:53:22 UTC

While it MAY have more work than can be processed (a claim for which there is NO evidence), if there is a problem delivering that work to the users then it makes no sense to attempt to grab all one can, and so turn the average user away because they can't get work due to the greed of a very vocal minority.
Every host has an equal chance of getting tasks in a scheduler request. In throttled situations like this it's the fast ones that suffer, because they need more work to keep running but only get the same trickle that everyone else gets.

If I wanted to grab an unfair share of the work, I wouldn't be spoofing my GPU count but running multiple instances of unmodified BOINC instead. That would allow me to spam scheduler requests more frequently and funnel the share of many computers into one. I actually considered that when I bought my current GPU and started to have trouble 'surviving' the Tuesday downtimes, as that trick would also have let me multiply my cache size. But that would have been too dirty a trick for my taste, so I modified my client to report imaginary GPUs instead.
ID: 2033330 · Report as offensive
rob smith · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22227
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2033332 - Posted: 21 Feb 2020, 18:01:27 UTC

NO - if the serving pot were open that would be true, but there is a limit of 200 tasks in the pot, and if ONE cruncher grabs 100 of them there are fewer left for anyone else coming along after. When the pot is empty there is a pause in delivery while it is refilled - which is why we see so many "project has no tasks" messages, even when there are thousands apparently available in the RTS.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2033332 · Report as offensive
Freewill · Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2033333 - Posted: 21 Feb 2020, 18:11:13 UTC - in response to Message 2033327.  

While it MAY have more work than can be processed (a claim for which there is NO evidence), if there is a problem delivering that work to the users then it makes no sense to attempt to grab all one can, and so turn the average user away because they can't get work due to the greed of a very vocal minority.

I seem to recall from another thread that SAH is only taking a few percent of the Breakthrough Listen data from Green Bank. That's my evidence. Plus, in my time here we have never run out of tapes, as far as I recall. Regardless, the servers cannot dish out the stack of tapes they have loaded: everyone's caches are dropping while I see plenty of tapes mounted and unprocessed.
ID: 2033333 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2033334 - Posted: 21 Feb 2020, 18:12:09 UTC - in response to Message 2033332.  

NO - if the serving pot were open that would be true, but there is a limit of 200 tasks in the pot, and if ONE cruncher grabs 100 of them there are fewer left for anyone else coming along after. When the pot is empty there is a pause in delivery while it is refilled - which is why we see so many "project has no tasks" messages, even when there are thousands apparently available in the RTS.
I think it's also worth ensuring that your work request is as 'quick to process' as possible. I happened to see an example of that this afternoon.

Fast cruncher had run itself dry while I was out:
21/02/2020 17:02:49 | SETI@home | [sched_op] NVIDIA GPU work request: 88128.00 seconds; 0.00 devices
21/02/2020 17:02:52 | SETI@home | Scheduler request completed: got 0 new tasks
So I turned down the work cache from 0.5 days to 0.05 days:
21/02/2020 17:24:24 | SETI@home | [sched_op] NVIDIA GPU work request: 10368.00 seconds; 0.00 devices
21/02/2020 17:24:26 | SETI@home | Scheduler request completed: got 96 new tasks
21/02/2020 17:24:26 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 5039 seconds
If it takes too long to carry out all the checks (have any of your other computers acted as a wingmate on this WU?), the available tasks are likely to have been grabbed by a more agile computer while your particular scheduler instance is still thinking about it.
ID: 2033334 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033336 - Posted: 21 Feb 2020, 18:23:15 UTC - in response to Message 2033250.  

More of a whine / observation / OCD thing, but why are there 5 'chunks' of data waiting to be processed since late 2019? Would flushing everything out of the repositories have a potential cleansing effect? I'm no DB nor systems design guy, just wondering..... it has come close a few times, only to have another coupla days of data pushed in front - like today.


. . There are 4 'tapes' that have been sitting on the splitters since October 2018 but never split, despite being the oldest tapes mounted. In the last couple of months another tape has joined this group, so now there are 2 x Blc22, 2 x Blc34 and 1 x Blc62 tapes that are very old but will not split. I have no idea whether the reason these tapes will not split has anything to do with the general malaise that is affecting the splitters and other functions, but I would still like to see them either kicked off to split, or just kicked off if the data is faulty.

Stephen

< shrug >
ID: 2033336 · Report as offensive
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2033338 - Posted: 21 Feb 2020, 18:29:57 UTC - in response to Message 2033328.  
Last modified: 21 Feb 2020, 18:32:57 UTC

Indeed. By my count it's <60 cores and ~1TB RAM total. You can do that on a SINGLE-socket Epyc board!
By my math that's 110 cores. Note that all the listed servers are dual-socket ones.

Edit - whoops, I did forget to double it. Yes, indeed, a dual-socket 64-core Epyc system would have more cores, all in one box.

I still think it's wiser to spread it across 2-3 systems, for the reasons previously mentioned.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2033338 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2033340 - Posted: 21 Feb 2020, 18:30:59 UTC - in response to Message 2033336.  

I just think the 'what tape shall I run next?' algorithm is running LIFO instead of FIFO.
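
A purely illustrative toy model of that hypothesis (not the real splitter code; the tape names and the one-tape-per-day pacing are made up):

    from collections import deque

    # Toy model only: not the real splitter scheduler. Tape names are made up,
    # and we assume one fresh tape is mounted and one tape is split per "day".
    def simulate(policy, days=5):
        queue = deque(["blc22_a", "blc22_b", "blc34_a", "blc34_b", "blc62_a"])
        split_order = []
        for day in range(days):
            queue.append("blc73_%d" % day)   # a freshly recorded tape arrives
            tape = queue.popleft() if policy == "fifo" else queue.pop()
            split_order.append(tape)         # this tape gets split today
        return split_order

    print("FIFO:", simulate("fifo"))  # the five stale tapes are split first
    print("LIFO:", simulate("lifo"))  # only the fresh blc73 tapes ever get picked

Under FIFO the oldest mounted tapes would be split first; under LIFO only the newly mounted tapes are ever picked while fresh ones keep arriving, which would match the stale Blc22/Blc34/Blc62 tapes described above.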
ID: 2033340 · Report as offensive
juan BFP · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2033341 - Posted: 21 Feb 2020, 18:32:20 UTC - in response to Message 2033332.  

NO - if the serving pot were open that would be true, but there is a limit of 200 tasks in the pot, and if ONE cruncher grabs 100 of them there are fewer left for anyone else coming along after. When the pot is empty there is a pause in delivery while it is refilled - which is why we see so many "project has no tasks" messages, even when there are thousands apparently available in the RTS.

Why not simply increase the size of this pot, and make it intelligent with a constant refilling function, without a fixed size?

IMHO the real problem with SETI, or with BOINC itself, is that they are obsolete (software & hardware) compared with the hardware and software used by the volunteers today. Trying to serve both very slow devices (which take days to crunch a single WU) and superfast devices (a few of them spoofed, and able to crunch a WU in a couple of seconds) is the real source of all our constant problems. Releasing the servers' WU limits without testing just makes everything worse.

With a simple look at the SSP you see: Results returned and awaiting validation: 0 / 35,474 / 14,150,778
Why is this number so high? Surely not because of the superfast or spoofed hosts. It comes from the slow hosts (the vast majority of hosts) and the long WU deadlines.

Besides the impossible solution of buying new hardware (lots of $ and time), the only real way this could be solved is by reducing the WU limits on the slow devices even further and drastically shortening the WU deadlines. It makes little or no sense to allow a host that produces a couple of WUs per day to hold a 150 WU buffer. The number of WUs available to a host should be limited to its capacity to crunch and return valid WUs; maybe 1 day's worth would be enough (see the sketch at the end of this post). Please remember... I'm talking about 1000's of hosts that DL large caches and will never return them on time.

So please stop blaming the maybe 30-40 hosts that run spoofed clients and DL and return their WUs in a day or two. They actually help to clear the DB.

my 0.02
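
A rough sketch of the idea above, purely hypothetical - not anything the SETI@home scheduler actually implements, and the function name, 7-day window, floor and ceiling are just example assumptions:

    # Hypothetical sketch: size each host's allowed buffer from the work it has
    # recently returned and validated, instead of a fixed per-device limit.
    def per_host_cap(valid_results_last_7_days: int,
                     buffer_days: float = 1.0,
                     floor: int = 10,
                     ceiling: int = 1000) -> int:
        """Allow roughly `buffer_days` worth of the host's own proven throughput."""
        daily_rate = valid_results_last_7_days / 7.0
        cap = int(daily_rate * buffer_days)
        return max(floor, min(cap, ceiling))

    print(per_host_cap(140))    # ~20 valid results/day   -> cap of 20 tasks
    print(per_host_cap(28000))  # ~4000 valid results/day -> capped at the ceiling, 1000

A host that validates about 20 results a day would get a buffer of about 20 tasks; the floor keeps very slow hosts from being starved, and the ceiling keeps even the fastest (or spoofed) hosts bounded.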
ID: 2033341 · Report as offensive
Unixchick · Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2033342 - Posted: 21 Feb 2020, 18:38:47 UTC - in response to Message 2033336.  

More of a whine / observation / OCD thing, but why are there 5 'chunks' of data waiting to be processed since late 2019? Would flushing everything out of the repositories have a potential cleansing effect? I'm no DB nor systems design guy, just wondering..... it has come close a few times, only to have another coupla days of data pushed in front - like today.


. . There are 4 'tapes' that have been sitting on the splitters since October 2018 but never split, despite being the oldest tapes mounted. In the last couple of months another tape has joined this group, so now there are 2 x Blc22, 2 x Blc34 and 1 x Blc62 tapes that are very old but will not split. I have no idea whether the reason these tapes will not split has anything to do with the general malaise that is affecting the splitters and other functions, but I would still like to see them either kicked off to split, or just kicked off if the data is faulty.

Stephen

< shrug >


Before we got the new blc73s I had hoped that we would split the old files in the queue, but alas that was not to be. I don't think files sitting in the to-be-split queue affect the process much, and I don't think we are near the "limit" on that queue, as I've seen it much longer in the past. There is the possibility that files which started splitting but were left hanging as the splitters moved on to new files might be an issue, but I have no idea. They increased the number of splitting processes a couple of months ago: there are now 16 processes working on 11 files, versus the previous 14 processes working on 9 files.

I have no idea how the splitters choose which file to work on next. It doesn't have a pattern anymore.

My opinion, based on nothing, is that they don't watch the splitting queue that much. We usually have to let them know when files have gotten stuck splitting. They just don't monitor it that closely. My guess is that they have no idea that there are 5 "stale" files hanging out in the queue. I truly think that there is no problem caused by the stale files, but I will admit it still bothers me.
ID: 2033342 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033344 - Posted: 21 Feb 2020, 18:40:49 UTC - in response to Message 2033309.  

Something else that could help is to end the spoofing. If a host has 4 GPUs they should only get WUs for those 4 GPUs. No more editing software to spoof that a host has 15, 20, 30 or more GPUs when they have no more than 8.
Spoofing has actually allowed me to be nicer to the servers and other users. I reduce my fake GPU count when the Tuesday outage starts, so that when the outage ends I'm still above my new cap and am only reporting results, not competing with the other hosts for new tasks. When my host finally starts asking for new tasks, it only asks for a few at a time, matching the number it has reported. And by the time that happens, the post-outage congestion is already over.
I have also configured my computers to report at most 100 results per scheduler request, so that they aren't flooding the server with a ridiculous bomb after the outage.


. . Or do as many others do and set No New Tasks until the reporting crush has finished and the servers are getting back on their feet; these days that takes several hours. But either way the benefit is minimal, because the majority of volunteers do none of the above and so the problem persists. As far as spoofing goes, while telling the system that 2 GPUs are really 64 GPUs is ridiculously extreme, most of those spoofing such high numbers do have huge numbers of GPUs, like 6 to 8 physical units. And the total impost from this practice is also quite minimal, because those spoofing are a tiny, tiny fraction of overall volunteers. But that is just my 2c worth.

Stephen

< shrug >
ID: 2033344 · Report as offensive
Unixchick · Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2033346 - Posted: 21 Feb 2020, 18:46:42 UTC - in response to Message 2033341.  
Last modified: 21 Feb 2020, 18:47:01 UTC



So please stop blaming the maybe 30-40 hosts that run spoofed clients and DL and return their WUs in a day or two. They actually help to clear the DB.

my 0.02

+1

If there is a limit of 10 days' worth of WUs one can have (if you don't hit the #/cpu & #/gpu limit first), why can't they shorten the due date to 3 weeks?? I think this would get the WUs that are ghosted, or on machines that have left the project, back into circulation quicker. Some of my WUs have a due date 8 weeks out.

Would this be an issue for those running multiple projects?? Is there some unintended consequence I'm overlooking??
ID: 2033346 · Report as offensive
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2033348 - Posted: 21 Feb 2020, 19:09:13 UTC - in response to Message 2033346.  

Would this be an issue for those running multiple projects?? Is there some unintended consequence I'm overlooking??

No, none. Project deadlines only apply within each individual project. The effect if Seti shortened its deadlines would be that hosts would do Seti work more quickly and would not get preempted as much by other projects' shorter-deadline tasks. Compared with the majority of projects, Seti has the longest deadlines, except for some outliers like WCG and Climate Prediction, which have almost year-long deadlines.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2033348 · Report as offensive
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51469
Credit: 1,018,363,574
RAC: 1,004
United States
Message 2033349 - Posted: 21 Feb 2020, 19:11:43 UTC

The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project.
As it should be.

Meow.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 2033349 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033351 - Posted: 21 Feb 2020, 19:23:32 UTC - in response to Message 2033327.  

While it MAY have more work than can be processed (a claim for which there is NO evidence), if there is a problem delivering that work to the users then it makes no sense to attempt to grab all one can, and so turn the average user away because they can't get work due to the greed of a very vocal minority.


. . Sorry Rob but I have to disagree with you there. We have seen on a couple of occasions that if we get all the data that Green Bank produces in one day, our whole setup is busy crunching for several months. The daily output from one GBT recorder takes several days to get through, and there are 64 potential recorders which we can process. If/when we finally start processing data from Parkes (to which I am looking forward), that output will double. While our processing capacity has grown manifold over the decades this program has been operating, so has the data which we need to process. If our daily capacity grew 10-fold we would still be lagging behind the data available. One thing in our favour is that we do not see all the data from every day; we only see some of the output from each observatory, and so far none from Parkes (had to throw that in). Eric and David have both stressed the need for more volunteers and they ain't lying.

. . As far as being greedy goes, since when is it greedy to want to do work for somebody else at no charge, despite the cost to yourself? And these spoofing hosts you seem to feel are denying you the work you want are simply trying to keep the supply of work up to the level of productivity their machines are capable of. I would call that efficiency, not greed, and working towards having the processing power that the project requires. The issue is that the crunching power of the volunteer force has grown beyond the capacity of the server farm. While there is at least one very powerful new server under test, it appears there are problems delaying its deployment in main. And there is a new data storage unit under construction that will be many times larger, and several times faster, than what is currently available, but this could still be a long time away from being deployed. The whole issue has been seriously aggravated by recent large runs of noisy data such as the Blc35 and Blc41 series, which have in large part been about 90% noise bombs. The servers are choking on the output of a far larger crunching force than has ever worked on the project before. There are probably some other issues biting at the ankles of the system since the unfortunate BOINC upgrade that kicked off most of these problems prior to Christmas and required a 'roll back', which I have no doubt has left several time bombs in the machinery that could also be part of the problem. The cure is a matter of time and money, and I cannot see the boffins at Berkeley actually asking all the volunteers to do less work.

. . Sorry but you tripped a trigger with that message ...

Stephen

< shrug >
ID: 2033351 · Report as offensive
juan BFP · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2033352 - Posted: 21 Feb 2020, 19:25:09 UTC - in response to Message 2033349.  
Last modified: 21 Feb 2020, 19:27:39 UTC

The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project.
As it should be.

Meow.

Nice to see you around.

Back to the topic. Even the slowest computers or devices like cell phones, Pis, etc. could crunch a WU in less than a month....
Then why keep a deadline of 2-3 months?
Remember the DB needs to keep track of the WU for all that time, and if at the end it has not been returned, rinse & repeat: another 2-3 months?

My point is simple: drop the fixed limit of WUs per GPU/CPU and allow a host to DL only up to the number of WUs it can crunch and return valid in a specific period of time (1 day, for example). And reduce the deadlines. That will squeeze the DB down to a more manageable size for sure.

Desperate times need desperate measures......
ID: 2033352 · Report as offensive
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51469
Credit: 1,018,363,574
RAC: 1,004
United States
Message 2033353 - Posted: 21 Feb 2020, 19:31:13 UTC

It would be grand if the project could meter work out to the host computers based on their ability to return processed work.
But that would require more programming, and a lot more work on the project servers to figure out what to send or not send on every single work request.

Methinks the overhead would be too high to be worth it.

Meow.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 2033353 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033354 - Posted: 21 Feb 2020, 19:32:30 UTC - in response to Message 2033329.  

While it MAY have more work than can be processed (a claim for which there is NO evidence), if there is a problem delivering that work to the users then it makes no sense to attempt to grab all one can, and so turn the average user away because they can't get work due to the greed of a very vocal minority.

I think everybody has about the same odds of hitting the servers when it has work in the RTS queue to hand out.
I am far short of having a full cache, and most work requests are getting the 'project has no tasks available' response.
But, about 20 minutes ago I got a 36 task hit to keep my cruncher going.
This does not help those who have mega-crunchers very much.
So, work is going out and being returned.
Wish things were better, but it is what it is.


Meow.


. . Exactly, we are all in the same boat ...

Stephen

< shrug >
ID: 2033354 · Report as offensive
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2033355 - Posted: 21 Feb 2020, 19:32:40 UTC - in response to Message 2033349.  

The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project.
As it should be.

Meow.

But when computers that need 4-6 weeks to complete 1 WU are almost non-existent (even a RPi can complete a WU in under a day), and this kind of setting is having an impact on the project... it's probably best to do what's in the best interest of the project.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2033355 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033357 - Posted: 21 Feb 2020, 19:35:30 UTC - in response to Message 2033341.  
Last modified: 21 Feb 2020, 19:35:50 UTC

NO - if the serving pot were open that would be true, but there is a limit of 200 tasks in the pot, and if ONE cruncher grabs 100 of them there are fewer left for anyone else coming along after.
Why not simply increase the size of this pot, and make it intelligent with a constant refilling function, without a fixed size?
The easiest solution would be to have a limit on the number of tasks given to one host at a time. This limit could start equal to the size of the pot, but whenever a request had to be offered 0 tasks because the pot was empty the limit would be reduced a bit, and whenever there were still tasks left in the pot when it was refilled the limit would be increased a bit.

So it would scale dynamically: when everything is fine, every request would get what it asks for, but in a throttled situation like today's, everyone would get a little in each request whenever there are tasks in the RTS queue, instead of the current situation where most requests get nothing and an occasional lucky one gets a lot.
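
Something like this, as a minimal sketch of the idea only - not actual BOINC feeder code, and the class name, step size and starting values are invented for illustration:

    # Hypothetical sketch of the adaptive per-request limit described above.
    class AdaptiveLimit:
        def __init__(self, pot_size: int = 200, step: int = 5):
            self.pot_size = pot_size
            self.step = step
            self.limit = pot_size     # start by allowing a full pot per request

        def on_empty_pot(self):
            """A request had to be sent away with 0 tasks: tighten the limit."""
            self.limit = max(1, self.limit - self.step)

        def on_refill_with_leftovers(self):
            """The pot was refilled while tasks were still left: relax the limit."""
            self.limit = min(self.pot_size, self.limit + self.step)

        def grant(self, requested: int, available: int) -> int:
            """Each request gets at most `limit` tasks, never more than are available."""
            return min(requested, self.limit, available)

    limiter = AdaptiveLimit()
    print(limiter.grant(requested=150, available=200))  # 150 while demand is low
    limiter.on_empty_pot()
    limiter.on_empty_pot()
    print(limiter.limit)                                # 190 after two dry requests

Each dry request nudges the per-request limit down and each refill with leftovers nudges it back up, so the limit settles wherever demand and the refill rate balance.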
ID: 2033357 · Report as offensive
Unixchick · Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2033358 - Posted: 21 Feb 2020, 19:36:47 UTC - in response to Message 2033349.  

The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project.
As it should be.

Meow.


I totally agree on including as many people as possible! I guess I just figured that a device could return a WU in 21 days. There is a limit of 10 days' worth of WUs (I'm assuming so, as this was a big deal when I was on my slower machine and I would hit this limit before my 100 WU limit; if this is no longer the case, let me know). If a machine takes longer than 21 days (my suggested time-out limit) to do 1 WU, does it only get 1 WU at a time?

This brings up so many questions... What is the slowest system? How long does it take to do 1 WU??
ID: 2033358 · Report as offensive