Panic Mode On (115) Server Problems?

Message boards : Number crunching : Panic Mode On (115) Server Problems?

Profile arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 1980181 - Posted: 13 Feb 2019, 22:54:53 UTC

Seeing as the old one was closing in on one-thousand posts, I had to crunch a few units so I could start a new thread.

ID: 1980181 · Report as offensive
Profile Ghan-buri-Ghan Mike

Joined: 27 Dec 15
Posts: 123
Credit: 92,602,985
RAC: 172
United States
Message 1980184 - Posted: 13 Feb 2019, 23:21:20 UTC

About 280 of the 300 or so download errors that came my way have cycled off the system. Thank you, sixth-iteration wingmen....
ID: 1980184 · Report as offensive
Profile Keith Myers - Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1980214 - Posted: 14 Feb 2019, 2:40:26 UTC - in response to Message 1980184.  

The purgers, validators and deleters made a good whack at the backlogs today. I can actually view my tasks for the first time in a week without timing out. Good job, SETI staff!
Seti@Home classic workunits: 20,676 - CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1980214 · Report as offensive
halfempty
Joined: 2 Jun 99
Posts: 97
Credit: 35,236,901
RAC: 114
United States
Message 1980219 - Posted: 14 Feb 2019, 3:24:33 UTC - in response to Message 1980181.  

Seeing as the old one was closing in on one-thousand posts, I had to crunch a few units so I could start a new thread.
Thank you for taking the time. 👍
ID: 1980219 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1980262 - Posted: 14 Feb 2019, 9:17:47 UTC

I wouldn't blame the scheduler, exactly, for having problems with multiple WU types. It's more that its decision time for allocating tasks increases under high server loads, which causes timeouts on HTTP or data reads. It also seems to have a load watchdog; right now it likes to give only 1 task at a time per computer thread for downloads, likely due to too many dropped calls to it.
I remember not too long ago seeing return rates upward of 180-190k per hour during an overflow storm. This was with full services running, i.e. Scheduler, Uploads and Downloads, with the only problem being that the splitters couldn't keep up. Recent problems have happened at a much lower return rate, and they have continued even though the Scheduler and Download services were stopped once the problem began. Once the problem begins, most of the server load stops, yet the problem persists. Big difference there. Past tests show no problem with high server loads as long as there is just one type of task in the cache; the problem only happens when more than one type of task is in the cache. Those tests are why we are now running Arecibo VLARs on the GPUs; the problem started when the Arecibo splitting began.
BTW, there seems to be a third cache all set and ready to be used immediately. Here, https://setiathome.berkeley.edu/show_server_status.php, you can see a cache named SETI@home v7 # with a mere 71 outstanding tasks. Do whatever you wish with those 71 tasks, then change the name to BLC_guppi, point all the GBT splitters to it, and point the Scheduler to that new cache. Then let's see if the Scheduler still can't find tasks to send when there are hundreds of thousands in the caches. Since the Arecibo and BLC files use different splitters, it follows that they could also use different caches. Hopefully any new data sources will be similar enough to use the BLC cache as well. Sounds good to me.
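To make the separate-cache idea concrete, here is a toy Python sketch. This is not the real BOINC feeder/scheduler code, just an illustration, and the blc_guppi name is simply borrowed from the suggestion above. The point is that with one ready-to-send queue per data source, a request can be filled from whichever queue has work, so one task type can't starve the other:

# Toy model only -- NOT how the SETI@home/BOINC feeder is actually written.
from collections import deque

rts = {
    "arecibo": deque(),    # filled by the Arecibo splitters
    "blc_guppi": deque(),  # filled by the GBT splitters (the proposed new cache)
}

def enqueue(source, task_id):
    """Splitter side: add a freshly split task to its own cache."""
    rts[source].append(task_id)

def fill_request(n_tasks):
    """Scheduler side: draw round-robin from whichever caches have work,
    instead of scanning one mixed cache for a suitable task."""
    sent = []
    while len(sent) < n_tasks and any(rts.values()):
        for source, queue in rts.items():
            if queue and len(sent) < n_tasks:
                sent.append((source, queue.popleft()))
    return sent

# Plenty of BLC work, no Arecibo work: requests are still filled promptly.
for i in range(5):
    enqueue("blc_guppi", f"blc_task_{i}")
print(fill_request(3))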
ID: 1980262 · Report as offensive
Cruncher-American - Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1980292 - Posted: 14 Feb 2019, 14:54:51 UTC - in response to Message 1980267.  


Isn't that the remains of the old Seti V7? We are now on V8.


Yes, it is...
ID: 1980292 · Report as offensive
rob smith - Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22436
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1980294 - Posted: 14 Feb 2019, 14:59:41 UTC

I believe that the decision was made to kill the v7 work and re-issue any outstanding tasks "at a later date". One can assume that 71 tasks were seen as so few that no further work was done, as completing the job would entail more effort than it was worth.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1980294 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1980519 - Posted: 15 Feb 2019, 15:41:53 UTC

I decided to go back and see just how high that return rate got a while back. I started out thinking it was over 200,000 per hour, and remembering how my 12-GPU machine was returning around 50 to 70 tasks every 5 minutes. It was quite a show; I didn't know which machine would crash first, the server or my 12-GPU machine. It's here: Results received in last hour: 237,593. It lasted over 4 hours, and from the comments you can see the only tasks in the cache were BLCs. That's how fast you can go, and survive, with just BLCs in the cache. It would probably go even faster if the splitters could keep up. Just remember that the next time someone tries to tell you that 120,000, without Scheduling & Downloads, is a heavy server load. Right now it's around 135k.
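Just to sanity-check those figures, a rough back-of-the-envelope in Python, using only the numbers quoted above (the per-host rate and the 237,593 fleet total both come straight from the post):

# Back-of-the-envelope check of the figures quoted above.
tasks_per_5_min = (50, 70)                 # what the 12-GPU host was returning per 5-minute window
windows_per_hour = 60 // 5                 # 12 five-minute windows in an hour
per_host_per_hour = [n * windows_per_hour for n in tasks_per_5_min]
print(per_host_per_hour)                   # -> [600, 840] tasks/hour for that one host

fleet_per_hour = 237_593                   # "Results received in last hour" during the storm
share = [n / fleet_per_hour for n in per_host_per_hour]
print([f"{s:.2%}" for s in share])         # roughly 0.25% to 0.35% of the project-wide total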

So...... when do we get the New Cache with just BLC tasks?
ID: 1980519 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13833
Credit: 208,696,464
RAC: 304
Australia
Message 1980589 - Posted: 15 Feb 2019, 22:25:20 UTC - in response to Message 1980519.  

So...... when do we get the New Cache with just BLC tasks?

Or just fix the system so it works as is?

As things stand, splitter output is tapering off and the Ready-to-send buffer is falling rapidly. Grab them while you can.
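For anyone wanting to watch that buffer without reloading the page by hand, a rough Python sketch is below. It assumes the server status page still carries a "Results ready to send" figure in its text; that label and the page layout may change without notice:

# Rough monitoring sketch, not official tooling: pull the ready-to-send
# count from the public server status page.
import re
import urllib.request

URL = "https://setiathome.berkeley.edu/show_server_status.php"

def ready_to_send():
    with urllib.request.urlopen(URL, timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    text = re.sub(r"<[^>]+>", " ", html)   # strip tags so label and number share a line
    m = re.search(r"Results ready to send\s+([\d,]+)", text)
    return int(m.group(1).replace(",", "")) if m else None

print(ready_to_send())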
Grant
Darwin NT
ID: 1980589 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1980603 - Posted: 15 Feb 2019, 23:16:43 UTC - in response to Message 1980589.  

But, but, it's been messed up for around 3 years. What makes you think it's going to be fixed anytime soon? It kinda reminds me of the Linux BOINC Manager and the jumping Tasks page that lasted around 3 years. I was able to fix the Manager; I can't do anything about the server though. Kind of a shame that 'Running VLARs on the GPUs' didn't totally fix it, but back then we were only receiving a few Arecibo files a week or so. It was common to go a week without any Arecibo work, and it did seem to fix it as long as there were only occasional periods of Arecibo tasks in the cache. It seems to be a random problem that gets more common the longer you have Arecibo tasks in the cache, and lately there have always been Arecibos in the cache. I looked back for the earliest mention of the problem and only found it starting in Jan 2017. I'm pretty sure it was around in 2016, but I don't see it mentioned in the Panic Mode threads.

At least it should work by just separating the caches. I don't remember getting the problem with just BLCs in the cache; it seemed the Arecibos were always present. We'll have to keep a lookout when it happens again to see what's in the cache.
ID: 1980603 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13833
Credit: 208,696,464
RAC: 304
Australia
Message 1980605 - Posted: 15 Feb 2019, 23:26:29 UTC - in response to Message 1980603.  

I looked back for the earliest mention of the problem and only found it starting in Jan 2017. I'm pretty sure it was around in 2016, but, I don't see it mentioned in the Panic Mode threads.

From memory it was Dec 2016; there was some change in the Scheduler.
People who chose to do AP work, and MB only if no AP was available, suddenly were no longer getting any work at all. They had to select MB as well; it couldn't just be a fallback option any more.
In my case I did MB only, no AP at all, but then I started having issues getting any MB work. It became necessary to change the Application preference settings to allow other work when preferred work was no longer available and to allow AP work (even though I didn't have the AP application installed), then change them back again later when that stopped working, then change them again when that stopped working, etc, etc, etc. Then you came up with the triple update, which helped get work even with my original setting of no AP at all.
Then even that stopped working, or it just got to be too much of a pain in the arse (can't remember which), so I just installed the AP application & selected AP as well as MB. That made it possible to get MB work without buggering around with Application preferences or triple Updating all the time.
And here we are now.
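For what it's worth, the triple update can be scripted rather than clicked; a minimal Python sketch using the standard boinccmd tool (the project URL is SETI@home's, but paths, host and GUI RPC password handling are left at defaults and may need adjusting for your install):

# Hypothetical helper, not official tooling: issue three project updates in a
# row ("triple update") via boinccmd.
import subprocess
import time

PROJECT_URL = "http://setiathome.berkeley.edu/"

for _ in range(3):
    # Equivalent to pressing Update in the Manager for this project.
    subprocess.run(["boinccmd", "--project", PROJECT_URL, "update"], check=False)
    time.sleep(5)  # short pause between requests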
Grant
Darwin NT
ID: 1980605 · Report as offensive
Profile Brent Norman - Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1980611 - Posted: 15 Feb 2019, 23:58:32 UTC - in response to Message 1980603.  

You are confusing 2 different types of problems.

Jan 2017 was when Arecibo VLARs were not being sent to GPUs. Not far from that message is this one, saying there were few BLC tasks to be had. The server sat there very happily in those days, with a full cache and a decreased load, since the Nvidia cards were all but shut down because no tasks were available to them.

Now we have heavy server loads and a decreasing cache, with an abundance of Arecibo shorties in play and higher return rates than in the other case.

You might as well be comparing apples to woodpeckers.
ID: 1980611 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13833
Credit: 208,696,464
RAC: 304
Australia
Message 1980618 - Posted: 16 Feb 2019, 0:23:47 UTC

We have had extended periods where the return rate was over 140k, and the Splitters, Validators, Assimilators and Purgers were all able to keep up. Lately, even with return rates of less than 100k, things have been falling over regularly.
There are issues above & beyond whatever happened 2 years ago, that's for sure.
Grant
Darwin NT
ID: 1980618 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1980621 - Posted: 16 Feb 2019, 0:27:02 UTC - in response to Message 1980611.  
Last modified: 16 Feb 2019, 0:54:14 UTC

Right, if you say so. Meanwhile, I'll keep an eye out for something I've been working on for THREE YEARS.
The last problem was exactly like the others. Work requests were being ignored, and then every so often the Server would send you around a dozen tasks or so.
I posted logs about it a few times in the Panic Mode thread myself. They are still there:
https://setiathome.berkeley.edu/forum_thread.php?id=80573&postid=1842587#1842587
https://setiathome.berkeley.edu/forum_thread.php?id=80573&postid=1845428#1845428
https://setiathome.berkeley.edu/forum_thread.php?id=81086&postid=1851867#1851867
https://setiathome.berkeley.edu/forum_thread.php?id=81086&postid=1851885#1851885
There are more...
ID: 1980621 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13833
Credit: 208,696,464
RAC: 304
Australia
Message 1980802 - Posted: 17 Feb 2019, 5:17:15 UTC

Been getting sticky downloads for a while now.
Suspend & re-enable network activity & they clear OK.
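The same suspend/re-enable trick can be done from the command line instead of the Manager; a small Python sketch using standard boinccmd network-mode options (host and GUI RPC password arguments are omitted and may be needed on your setup):

# Hypothetical helper, not official tooling: bounce network activity to
# clear stuck ("sticky") downloads, then ask the client to retry transfers.
import subprocess
import time

def boinccmd(*args):
    subprocess.run(["boinccmd", *args], check=False)

boinccmd("--set_network_mode", "never")   # suspend network activity
time.sleep(10)                            # give the stuck transfers time to drop
boinccmd("--set_network_mode", "auto")    # resume, network activity based on preferences
boinccmd("--network_available")           # retry deferred network communication now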
Grant
Darwin NT
ID: 1980802 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1981161 - Posted: 19 Feb 2019, 21:40:28 UTC

. . Well, only a 5-hour outage, but the usual 3 or 4 hours of post-outage famine. "No Tasks Available".

Stephen

:(
ID: 1981161 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 36314
Credit: 261,360,520
RAC: 489
Australia
Message 1981164 - Posted: 19 Feb 2019, 21:42:48 UTC

Full caches here, I just have to get them to download.

Cheers.
ID: 1981164 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1981165 - Posted: 19 Feb 2019, 21:46:03 UTC - in response to Message 1981164.  

Full caches here, I just have to get them to download.

Cheers.


. . Sometimes I wonder if you and I are connected to the same project .... :)

Stephen

?
ID: 1981165 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 36314
Credit: 261,360,520
RAC: 489
Australia
Message 1981166 - Posted: 19 Feb 2019, 22:03:49 UTC
Last modified: 19 Feb 2019, 22:05:57 UTC

My main rig is finally downloading those w/u's.

[edit] My other rig is now getting in on the act too.

Cheers.
ID: 1981166 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1981174 - Posted: 19 Feb 2019, 22:30:57 UTC - in response to Message 1981166.  

My main rig is finally downloading those w/u's.

[edit] My other rig is now getting in on the act too.

Cheers.


. . OK, I must be on lag here. While I was typing that last message, one machine got a swag of new WUs. But now any more d/ls won't run; they are just hanging ...

Stephen

:(
ID: 1981174 · Report as offensive


 