Message boards : Number crunching : Panic Mode On (115) Server Problems?
arkayn (Joined: 14 May 99, Posts: 4438, Credit: 55,006,323, RAC: 0)
Ghan-buri-Ghan Mike (Joined: 27 Dec 15, Posts: 123, Credit: 92,602,985, RAC: 172):
About 280 of the 300 or so download errors that came my way have cycled off the system. Thank you, sixth iteration wingmen....
Keith Myers (Joined: 29 Apr 01, Posts: 13164, Credit: 1,160,866,277, RAC: 1,873):
The purgers, validators and deleters made a good whack at the backlogs today. I can actually view my tasks today for the first time in a week without timing out. Good job, Seti staff!
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours. A proud member of the OFA (Old Farts Association)
halfempty (Joined: 2 Jun 99, Posts: 97, Credit: 35,236,901, RAC: 114):
Seeing as the old one was closing in on one thousand posts, I had to crunch a few units so I could start a new thread. Thank you for taking the time.
TBar (Joined: 22 May 99, Posts: 5204, Credit: 840,779,836, RAC: 2,768):
BTW, there seems to be a third cache all set and ready to be used immediately. Here, https://setiathome.berkeley.edu/show_server_status.php, you see a cache named SETI@home v7 # with a mere 71 outstanding tasks. Do whatever you wish with those 71 tasks, then change the name to BLC_guppi, point all the GBT splitters to it, and point the Scheduler to that new cache. Then let's see if the Scheduler still can't find tasks to send when there are hundreds of thousands in the caches. Since the Arecibo and BLC files use different splitters, it should follow that they also use a different cache. Hopefully any new data sources will be similar enough to use the BLC cache as well. Sounds good to me.

I wouldn't blame the Scheduler exactly for having problems with multiple WU types. It's more that its decision time for allocating tasks increases under high server load, causing timeouts on HTTP or data reads. It also seems to have a load watchdog; as things stand now, it likes to give only 1 task at a time per computer thread for download, likely due to too many dropped calls to it.

I remember not too long ago seeing return rates upward of 180-190k per hour during an overflow storm. That was with full services running, i.e. Scheduler, uploads and downloads, with the only problem being that the splitters couldn't keep up. Recent problems have happened with a much lower return rate, and persisted even though the Scheduler and download services were stopped once the problem began. Once the problem begins most of the server load stops, but the problem persists. Big difference there. Past tests show no problem with high server loads as long as there is just one type of task in the cache; the problem only happens when more than one type of task is in the cache. Those tests are why we are now running Arecibo VLARs on the GPUs, and the problem started when the Arecibo splitting began.
Cruncher-American (Joined: 25 Mar 02, Posts: 1513, Credit: 370,893,186, RAC: 340):
Yes, it is...
rob smith (Joined: 7 Mar 03, Posts: 22537, Credit: 416,307,556, RAC: 380):
I believe the decision was made to kill the v7 work and re-issue any outstanding tasks "at a later date". One can assume that 71 tasks was seen as so low that nothing further was done, as completing the job would take more effort than it was worth.
Bob Smith, Member of Seti PIPPS (Pluto is a Planet Protest Society). Somewhere in the (un)known Universe?
TBar (Joined: 22 May 99, Posts: 5204, Credit: 840,779,836, RAC: 2,768):
I decided to go back and see just how high that return rate got a while back. I started out thinking it was over 200,000 per hour, and remembering how my 12-GPU machine was returning around 50 to 70 tasks every 5 minutes. It was quite a show; I didn't know which machine would crash first, the server or my 12-GPU machine. It's here: Results received in last hour: 237,593. It lasted over 4 hours, and from the comments you can see the only tasks in the cache were BLCs. That's how fast you can go and survive with just BLCs in the cache. It would probably go even faster if the splitters could keep up. Just remember that the next time someone tries to tell you 120,000, without Scheduling & Downloads, is a heavy server load. Right now it's around 135k. So...... when do we get the new cache with just BLC tasks?
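A quick back-of-the-envelope check on those figures (the 237,593/hour number is the status-page value quoted above; the 50-70 tasks per 5 minutes is TBar's recollection of one 12-GPU host, so treat it as approximate):

```python
# Rough arithmetic on the rates quoted above; per-host figures are recollection, not logs.
project_per_hour = 237_593          # "Results received in last hour" from the status page
host_per_5min = (50, 70)            # one 12-GPU host, tasks returned per 5 minutes

host_per_hour = tuple(n * 12 for n in host_per_5min)       # 12 five-minute slots per hour
share = tuple(n / project_per_hour for n in host_per_hour)  # that host's share of the peak

print(f"Per-host rate: {host_per_hour[0]}-{host_per_hour[1]} tasks/hour")
print(f"Share of project peak: {share[0]:.2%}-{share[1]:.2%}")
# -> roughly 600-840 tasks/hour, about 0.25%-0.35% of the 237k/hour project-wide peak
```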
Grant (SSSF) (Joined: 19 Aug 99, Posts: 13855, Credit: 208,696,464, RAC: 304):
"So...... when do we get the New Cache with just BLC tasks?" Or just fix the system so it works as is? As things stand, splitter output is tapering off and the Ready-to-send buffer is falling rapidly. Grab them while you can.
Grant, Darwin NT
TBar (Joined: 22 May 99, Posts: 5204, Credit: 840,779,836, RAC: 2,768):
But, but, it's been messed up for around 3 years. What makes you think it's going to be fixed anytime soon? It kinda reminds me of the Linux BOINC Manager and the jumping task page that lasted around 3 years. I was able to fix the Manager; I can't do anything about the server though. Kind of a shame that running VLARs on the GPUs didn't totally fix it, but back then we were only receiving a few Arecibo files a week or so. It was common to go a week without any Arecibo work, and it did seem to fix it as long as there were only brief periods with Arecibo tasks in the cache. Seems it's a random problem that gets more common the longer you have Arecibo tasks in the cache, and lately there have always been Arecibos in the cache. I looked back for the earliest mention of the problem and only found it starting in Jan 2017. I'm pretty sure it was around in 2016, but I don't see it mentioned in the Panic Mode threads. At least it should work by just separating the caches; I don't remember seeing the problem with just BLCs in the cache, it seemed the Arecibos were always present. We'll have to keep a lookout when it happens again to see what's in the cache.
Grant (SSSF) (Joined: 19 Aug 99, Posts: 13855, Credit: 208,696,464, RAC: 304):
"I looked back for the earliest mention of the problem and only found it starting in Jan 2017. I'm pretty sure it was around in 2016, but I don't see it mentioned in the Panic Mode threads." From memory it was Dec 2016; there was some change in the Scheduler. People that chose to do AP work, and MB only if no AP was available, suddenly were no longer getting any work at all. They had to select MB as well; it couldn't just be a fall-back option any more. In my case, I did MB only, no AP at all, but then I started having issues getting any MB work. It became necessary to change the Application preference settings, allowing other work when preferred work was no longer available and allowing AP work (even though I didn't have the AP application installed), then change them back again later when that stopped working, then change them again when that stopped working too, etc, etc, etc. Then you came up with the triple update, which helped get work even with my original setting of no AP at all. Then even that stopped working, or it just got to be too much of a pain in the arse (can't remember which), so I just installed the AP application and selected AP as well as MB. That made it possible to get MB work without buggering around with Application preferences or triple-updating all the time. And here we are now.
Grant, Darwin NT
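For reference, the "triple update" mentioned above was nothing more exotic than forcing three scheduler requests in quick succession. A minimal sketch, assuming boinccmd is installed and can reach a local BOINC client (the 2-second pause is an arbitrary choice, not anything the project prescribes; add --passwd if your client requires the GUI RPC password):

```python
import subprocess
import time

PROJECT_URL = "https://setiathome.berkeley.edu/"  # project master URL

def triple_update(url=PROJECT_URL, delay=2):
    """Issue three back-to-back project update requests via boinccmd.

    Assumes boinccmd is on the PATH and the local client is running.
    """
    for _ in range(3):
        # "boinccmd --project URL update" forces an immediate scheduler contact
        subprocess.run(["boinccmd", "--project", url, "update"], check=True)
        time.sleep(delay)  # short pause so each request actually goes out

if __name__ == "__main__":
    triple_update()
```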
Brent Norman (Joined: 1 Dec 99, Posts: 2786, Credit: 685,657,289, RAC: 835):
You are confusing 2 different types of problems. Jan 2017 was when Arecibo VLARs were not being sent to GPUs. Not far away from that message is this one saying there were few BLC tasks to be had. The server sat there very happy in those days, with a full cache and decreased load, since the Nvidia cards were all but shut down with no tasks available to them. Now we have heavy server loads and a decreasing cache, with an abundance of Arecibo shorties in play and higher return rates than in the other case. You might as well be comparing apples to woodpeckers.
Grant (SSSF) (Joined: 19 Aug 99, Posts: 13855, Credit: 208,696,464, RAC: 304):
We have had extended periods where the return rate was over 140k, and the Splitters and Validators and Assimilators and Purgers were all able to keep up. Lately, even with return rates of less than 100k, things have been falling over regularly. There are issues above & beyond whatever happened 2 years ago, that's for sure.
Grant, Darwin NT
TBar (Joined: 22 May 99, Posts: 5204, Credit: 840,779,836, RAC: 2,768):
Right, if you say so. Meanwhile, I'll keep an eye out for something I've been working on for THREE YEARS. The last problem was exactly like the others: work requests were being ignored, and then every so often the server would send you around a dozen tasks or so. I posted logs about it a few times in the Panic Mode threads myself. They are still there:
https://setiathome.berkeley.edu/forum_thread.php?id=80573&postid=1842587#1842587
https://setiathome.berkeley.edu/forum_thread.php?id=80573&postid=1845428#1845428
https://setiathome.berkeley.edu/forum_thread.php?id=81086&postid=1851867#1851867
https://setiathome.berkeley.edu/forum_thread.php?id=81086&postid=1851885#1851885
There are more...
Grant (SSSF) (Joined: 19 Aug 99, Posts: 13855, Credit: 208,696,464, RAC: 304):
Been getting sticky downloads for a while now. Suspend & re-enable network activity & they clear OK.
Grant, Darwin NT
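The suspend-and-resume nudge described above can also be done from the command line. A minimal sketch, assuming a local client and boinccmd on the PATH; the 10-second pause is an arbitrary choice:

```python
import subprocess
import time

def kick_downloads(pause_seconds=10):
    """Toggle BOINC network activity off and back to auto via boinccmd.

    Command-line equivalent of Suspend/Resume network activity in the Manager;
    assumes boinccmd can reach the local client (add --passwd if needed).
    """
    subprocess.run(["boinccmd", "--set_network_mode", "never"], check=True)
    time.sleep(pause_seconds)  # give hung HTTP transfers time to abort
    subprocess.run(["boinccmd", "--set_network_mode", "auto"], check=True)
    # Tell the client the connection is available again so deferred transfers retry now
    subprocess.run(["boinccmd", "--network_available"], check=True)

if __name__ == "__main__":
    kick_downloads()
```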
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
. . Well only a 5 hour outage but the usual 3 or 4 hours of post outage famine. "No Tasks Available". Stephen :( |
Wiggo (Joined: 24 Jan 00, Posts: 36853, Credit: 261,360,520, RAC: 489):
Full caches here, I just have to get them to download.
Cheers.
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
Full caches here, I just have to get them to download. . . Sometimes I wonder if you and I are connected to the same project .... :) Stephen ? |
Wiggo (Joined: 24 Jan 00, Posts: 36853, Credit: 261,360,520, RAC: 489):
My main rig is finally downloading those w/u's. [edit] My other rig is now getting in on the act too.
Cheers.
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
My main rig is finally downloading those w/u's. . . OK I must be on lag here. While I was typing that last message one machine got a swag of new WUs. But now any more d/ls won't run. They are just hanging ... Stephen :( |