The Server Issues / Outages Thread - Panic Mode On! (118)

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13987
Credit: 208,696,464
RAC: 304
Australia
Message 2030836 - Posted: 5 Feb 2020, 7:18:10 UTC - in response to Message 2030834.  

My Windows system has managed to pick up almost 200 WUs in the last 20 minutes, my Linux system fewer than 6 since the resumption of services...
The same thing happens on the Mac. Back when I had a Windows machine next to the Mac, connected to the same router, I'd watch the Windows machine receive work every five minutes while the Mac was told there wasn't any work available. Only after the Windows machine had a full cache was there magically work available for the Mac. I watched this dozens of times, to the point where I was sure it wasn't a coincidence, and it hasn't changed one bit.
I'm wondering if this issue with handing out work to some systems & not others is related to the Anonymous Platform issue with the new Scheduler version?
Whatever it is that stops Anonymous Platform hosts from getting work (other requests having already been filled by the time the Scheduler gets around to the Anon Platform request) may already be at work in the present Scheduler when it processes work requests.
The order in which it determines eligibility for work results in certain platforms not getting any under certain load conditions, e.g. extremely high (250k+) return rates.
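As a toy illustration of how that ordering could starve one group (purely hypothetical numbers and logic, not the actual scheduler source): if eligible requests are filled in a fixed platform order from a small shared buffer of tasks, whichever group is checked last only ever sees leftovers.

# Hypothetical sketch of ordering starvation; not BOINC scheduler code.
from collections import deque

FEEDER_SLOTS = 100                       # tasks in the shared buffer per pass
requests = [("windows", 60), ("mac", 30), ("anonymous", 40)]  # (group, requests)

for round_no in range(3):
    buffer = deque(range(FEEDER_SLOTS))  # feeder refills the buffer each pass
    for platform, n in requests:         # groups processed in a fixed order
        granted = 0
        while buffer and granted < n:
            buffer.popleft()             # one task per satisfied request
            granted += 1
        print(round_no, platform, f"{granted}/{n} requests filled")

Run that and the last group gets 10 of its 40 requests filled every single pass, while the first group always gets everything, which is the shape of the behaviour being described.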
Grant
Darwin NT
ID: 2030836
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13987
Credit: 208,696,464
RAC: 304
Australia
Message 2030837 - Posted: 5 Feb 2020, 7:21:17 UTC

Return rate now down to around 235k, and both systems are now, very occasionally, getting some work.
Grant
Darwin NT
ID: 2030837
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030838 - Posted: 5 Feb 2020, 7:59:58 UTC - in response to Message 2030837.  

It's been this way for at least 8 years that I'm aware of. It doesn't make any difference whether it runs as Stock or Anonymous. Both of those machines ran as Stock for weeks after the Christmas SNAFU; one is still Stock, and there's no difference between 8 years ago and now. Is your Windows machine full yet? I'm finally getting a few downloads now; hopefully I'll get enough to keep the machines running soon.
ID: 2030838
Ville Saari

Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030839 - Posted: 5 Feb 2020, 8:02:00 UTC - in response to Message 2030836.  

I'm wondering if this issue with handing out work to some systems & not others is related to the Anonymous Platform issue with the new Scheduler version?
Whatever it is that stops Anonymous Platform hosts from getting work (other requests having already been filled by the time the Scheduler gets around to the Anon Platform request) may already be at work in the present Scheduler when it processes work requests.
The order in which it determines eligibility for work results in certain platforms not getting any under certain load conditions, e.g. extremely high (250k+) return rates.
I have often had one of my hosts getting work on every request while the other host stays dry. And they are both anonymous platform Linux boxes. My theory is that because the clients make scheduler requests on a regular five-minute cadence, if a big bunch of clients hits the server at the same time my host does, that same bunch will be competing with my host on its next request too. And if my other host hits the server at a quiet point in time, it'll keep hitting that same 'hole' on subsequent requests.
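To see how strong that phase-locking can be, here's a toy simulation I put together (my own sketch, not anything from the actual scheduler code): every client re-requests on a fixed 300-second cadence, the server can only satisfy a limited number of requests per one-second slot, and a client's competition never changes because its phase never changes.

import random
from collections import Counter

# Toy model, not BOINC code: 10,000 clients each ask for work every
# 300 seconds at a fixed phase chosen once at startup. The server can
# satisfy only CAPACITY requests per one-second slot; the rest get
# "no tasks available".
CADENCE = 300          # seconds between requests
CLIENTS = 10_000
CAPACITY = 25          # requests served per slot (arbitrary guess)
ROUNDS = 50            # number of 5-minute cycles to simulate

random.seed(1)
phase = [random.randrange(CADENCE) for _ in range(CLIENTS)]
crowd = Counter(phase)                 # clients sharing each slot
served = [0] * CLIENTS

for _ in range(ROUNDS):
    for c in range(CLIENTS):
        # Chance of being served = capacity / crowding of my slot.
        if random.random() < min(1.0, CAPACITY / crowd[phase[c]]):
            served[c] += 1

quiet = min(crowd, key=crowd.get)      # least crowded slot
busy = max(crowd, key=crowd.get)       # most crowded slot
for label, slot in (("quiet", quiet), ("busy", busy)):
    hosts = [c for c in range(CLIENTS) if phase[c] == slot]
    rate = sum(served[c] for c in hosts) / (len(hosts) * ROUNDS)
    print(f"{label} slot ({crowd[slot]} clients): served {rate:.0%} of requests")

The numbers are made up, but the shape of the result isn't: a host that lands in a quiet slot gets fed on nearly every request, while one in a busy slot starves round after round, exactly the 'hole' behaviour described above.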
ID: 2030839
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13987
Credit: 208,696,464
RAC: 304
Australia
Message 2030840 - Posted: 5 Feb 2020, 8:11:50 UTC - in response to Message 2030838.  
Last modified: 5 Feb 2020, 8:20:23 UTC

It's been this way for at least 8 years that I'm aware of. It doesn't make any difference whether it runs as Stock or Anonymous. Both of those machines ran as Stock for weeks after the Christmas SNAFU; one is still Stock, and there's no difference between 8 years ago and now.
Not just Stock vs. Anon, but also OS, OS version, GPU type, GPU driver etc.
All the things the Scheduler goes through when deciding what to give or not to give; when it's under a heavy load, the time it takes to run through all of them results in some systems getting work whereas others don't.

Is your Windows machine full yet? I'm finally getting a few downloads now; hopefully I'll get enough to keep the machines running soon.
Nope, not even close.
Since my earlier post it's only picked up a couple of dozen WUs, if that.
"Project has no tasks available" is the standard response, even though the return rate is now down to 130k, still not getting work.


My Linux system did get some, but we're back to sticky downloads: the timer counts away while not a bit gets transferred, and eventually it times out; rinse & repeat and you end up with extreme backoffs. A few retries managed to get those cleared.
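Those extreme backoffs build up quickly because the client backs off roughly exponentially on repeated transfer failures. A rough sketch of the shape of it (the constants here are illustrative guesses, not BOINC's exact policy):

import random

MIN_DELAY = 60          # assume first retry after about a minute
MAX_DELAY = 4 * 3600    # assume the backoff is capped at a few hours

def retry_delay(failures: int) -> float:
    """Delay before the next retry after `failures` consecutive failures."""
    base = min(MAX_DELAY, MIN_DELAY * 2 ** (failures - 1))
    return base * random.uniform(0.5, 1.0)   # jitter so hosts don't sync up

for n in range(1, 9):
    print(n, f"{retry_delay(n) / 60:.1f} min")

After half a dozen timed-out downloads you're already waiting hours between automatic retries, which is why manually retrying clears things so much faster.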


Edit: the Linux system just scored work on 2 consecutive requests (and downloaded it without assistance). The Windows system hasn't gotten any for over an hour.
*shrug*
Grant
Darwin NT
ID: 2030840
Ville Saari

Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030841 - Posted: 5 Feb 2020, 8:26:05 UTC

One thing that could affect the failure rate of scheduler requests under heavy load is the size of the work cache of your host.

When a client talks to the scheduler, it lists every task it has, not just the completed ones it is reporting. And this happens in quite verbose XML. So scheduler requests from hosts with big caches are huge, taking more time to transfer over the net and more processing from the scheduler. This gives them more opportunities to fail.
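For anyone who hasn't looked inside one, the request the client sends (the sched_request_*.xml file in the BOINC data directory) carries an entry for every task on the host, something along these lines. I'm quoting the tag names from memory and the task names are placeholders, so treat the exact structure as approximate:

<scheduler_request>
    <hostid>1234567</hostid>
    <work_req_seconds>41386.38</work_req_seconds>
    <!-- one <result> block per completed task being reported -->
    <result>
        <name>hypothetical_task_name_0</name>
        <!-- cpu time, stderr output, etc. -->
    </result>
    <!-- plus one entry per task still sitting in the cache -->
    <other_results>
        <other_result>
            <name>hypothetical_task_name_1</name>
        </other_result>
        <!-- repeated hundreds of times for a host with a big cache -->
    </other_results>
</scheduler_request>

A few hundred bytes per cached task adds up fast, so a host with a multi-day cache pushes a much bigger request through a congested server than a near-empty one, with correspondingly more chances for it to time out.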
ID: 2030841
Ville Saari

Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030842 - Posted: 5 Feb 2020, 8:33:25 UTC - in response to Message 2030840.  

"Project has no tasks available" is the standard response, even though the return rate is now down to 130k, still not getting work.
Return rate stabilizing only means that the hosts have cleared their backlogs of unreported work from the downtime. It doesn't mean they have filled their caches.

If the SSP showed the number of tasks handed out during the last hour, then that number stabilizing would mean the post-downtime congestion is over.
ID: 2030842
Jimbocous Project Donor
Volunteer tester

Joined: 1 Apr 13
Posts: 1861
Credit: 268,616,081
RAC: 1,349
United States
Message 2030844 - Posted: 5 Feb 2020, 8:43:05 UTC
Last modified: 5 Feb 2020, 8:44:06 UTC

What I see here is that the lower the client's RAC, the more likely it is that the box will get tasks. Platform doesn't seem to matter.
After an outage, my two heavy hitters will go 12-24 hours before getting any significant work, while the other two low producers will have full caches within a couple of hours. When the heavy hitters do start getting work, the lower of the two gets it first.
Too consistent to be coincidence.
ID: 2030844
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13987
Credit: 208,696,464
RAC: 304
Australia
Message 2030845 - Posted: 5 Feb 2020, 8:45:53 UTC

Windows system finally starting to get some work again.
Grant
Darwin NT
ID: 2030845
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13987
Credit: 208,696,464
RAC: 304
Australia
Message 2030846 - Posted: 5 Feb 2020, 8:48:50 UTC - in response to Message 2030842.  

"Project has no tasks available" is the standard response, even though the return rate is now down to 130k, still not getting work.
Return rate stabilizing only means that the hosts have cleared their backlogs of unreported work from the downtime. It doesn't mean they have filled their caches.
Yep. Going to be a long time before caches are refilled: 'In progress' is about 1 million below where it was before the outage.
And the splitters are yet to really get going: 20/s is better than nothing, but not a lot.

And the Validation backlog just keeps reaching new highs.
Grant
Darwin NT
ID: 2030846
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13987
Credit: 208,696,464
RAC: 304
Australia
Message 2030847 - Posted: 5 Feb 2020, 8:50:37 UTC - in response to Message 2030841.  

One thing that could affect the failure rate of scheduler requests under heavy load is the size of the work cache of your host.

When a client talks to the scheduler, it lists every task it has, not just the completed ones it is reporting. And this happens in quite verbose XML. So scheduler requests from hosts with big caches are huge, taking more time to transfer over the net and more processing from the scheduler. This gives them more opportunities to fail.
True, but at present the problem has been with systems that have no work at all. And then the system that got work gets more, while the other system still gets none.
Grant
Darwin NT
ID: 2030847
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030848 - Posted: 5 Feb 2020, 9:01:21 UTC - in response to Message 2030841.  
Last modified: 5 Feb 2020, 9:33:19 UTC

One thing that could affect the failure rate of scheduler requests under heavy load is the size of the work cache of your host.

When a client talks to the scheduler, it lists every task it has, not just the completed ones it is reporting. And this happens in quite verbose XML. So scheduler requests from hosts with big caches are huge, taking more time to transfer over the net and more processing from the scheduler. This gives them more opportunities to fail.
I think it's not just the size of the cache you have, it's also the size of the cache you want. I've had some success by turning down my cache request to maybe an hour or less when re-loading a fast machine from dry.

Get a few in, just to ensure the regular 'every 5 minutes' request, and then gradually ease the cache back upwards. Make it easy for the server: fewer potential candidate allocations to assess.

Might be the placebo effect, but it's worked again:
05/02/2020 09:21:29 | SETI@home | [sched_op] NVIDIA GPU work request: 41386.38 seconds; 0.00 devices
05/02/2020 09:21:31 | SETI@home | Scheduler request completed: got 0 new tasks
05/02/2020 09:26:37 | SETI@home | [sched_op] NVIDIA GPU work request: 16013.02 seconds; 0.00 devices
05/02/2020 09:26:40 | SETI@home | Scheduler request completed: got 75 new tasks
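For anyone who prefers doing the turn-down from the command line, the same thing can be scripted with a global_prefs_override.xml in the BOINC data directory. A sketch, assuming the stock client's tag names (the 0.04 days is roughly my 'an hour or less'):

<global_preferences>
    <work_buf_min_days>0.04</work_buf_min_days>
    <work_buf_additional_days>0.0</work_buf_additional_days>
</global_preferences>

Then 'boinccmd --read_global_prefs_override' makes the client pick it up without a restart, and the values can be eased back upwards once work starts flowing.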
ID: 2030848
Ville Saari

Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030849 - Posted: 5 Feb 2020, 9:23:57 UTC - in response to Message 2030846.  

And the Validation backlog just keeps reaching new highs.
The backlog of tasks I crunched during the downtime seems to have been validated now. My RAC is back where it was before the downtime.
ID: 2030849
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13987
Credit: 208,696,464
RAC: 304
Australia
Message 2030850 - Posted: 5 Feb 2020, 9:26:10 UTC

And we're back to sticking downloads again.
Grant
Darwin NT
ID: 2030850
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030851 - Posted: 5 Feb 2020, 9:46:26 UTC
Last modified: 5 Feb 2020, 10:02:34 UTC

Hey, this is nice. It seems the same setting that controls the Upload Retries also controls the Download Retries. Instead of Download retries in minutes, it's seconds, and Download 'Project Backoffs' are minutes instead of hours... this will work.

Except, as usual, we are now Out Of Work, and my machines are still out of work.
ID: 2030851
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2030855 - Posted: 5 Feb 2020, 10:55:22 UTC - in response to Message 2030809.  
Last modified: 5 Feb 2020, 11:07:50 UTC

Setting NNT until all work is reported has been very effective for me.


. . Reducing work report to 99 and setting NNT did not help here ... :(

Stephen

:(
ID: 2030855
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2030856 - Posted: 5 Feb 2020, 10:56:21 UTC - in response to Message 2030822.  

I just noticed we are back. And it wasn't a multi-day shutdown, just a basic long Tuesday.

Tom.


. . Hmmmm, 12 hours is a little more than a basic outage :(

Stephen

:(
ID: 2030856
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2030857 - Posted: 5 Feb 2020, 10:59:56 UTC - in response to Message 2030838.  

It's been this way for at least 8 years that I'm aware of. It doesn't make any difference whether it runs as Stock or Anonymous. Both of those machines ran as Stock for weeks after the Christmas SNAFU; one is still Stock, and there's no difference between 8 years ago and now. Is your Windows machine full yet? I'm finally getting a few downloads now; hopefully I'll get enough to keep the machines running soon.


. . I didn't start to get more than an odd task or 2 until 8:30am UTC. :(

Stephen

:(
ID: 2030857
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2030858 - Posted: 5 Feb 2020, 11:02:29 UTC - in response to Message 2030839.  

I'm wondering if this issue with handing out work to some systems & not others is related to the Anonymous Platform issue with the new Scheduler version?
Whatever it is that stops Anonymous Platform hosts from getting work (other requests having already been filled by the time the Scheduler gets around to the Anon Platform request) may already be at work in the present Scheduler when it processes work requests.
The order in which it determines eligibility for work results in certain platforms not getting any under certain load conditions, e.g. extremely high (250k+) return rates.
I have often had one of my hosts getting work on every request while the other host stays dry. And they are both anonymous platform Linux boxes. My theory is that because the clients make scheduler requests on a regular five-minute cadence, if a big bunch of clients hits the server at the same time my host does, that same bunch will be competing with my host on its next request too. And if my other host hits the server at a quiet point in time, it'll keep hitting that same 'hole' on subsequent requests.


. . My slowest Linux host seems to find that sweet spot regularly and will get regular downloads when the other 3 Linux machines are getting nothing. All on the same line ...

Stephen

? ?
ID: 2030858
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2030866 - Posted: 5 Feb 2020, 12:51:48 UTC

Game on, just got two healthy downloads back to back.
ID: 2030866