The Server Issues / Outages Thread - Panic Mode On! (118)
| Author | Message |
|---|---|
|
Grant (SSSF) Joined: 19 Aug 99 Posts: 13987 Credit: 208,696,464 RAC: 304
|
"I'm wondering if this issue with handing out work to some systems & not others is related to the Anonymous Platform issue with the new Scheduler version? My Windows system has managed to pick up almost 200 WUs in the last 20 min, my Linux system less than 6 since the resumption of services...." "The same thing happens on the Mac. Back when I had a Windows machine next to the Mac, connected to the same router, I'd watch the Windows machine receive work every five minutes while the Mac was told there wasn't any work available. After the Windows machine had a full cache, then there was magically work available for the Mac. I watched this dozens of times, to the point I was sure it wasn't a coincidence, and it hasn't changed one bit." Whatever it is that stops Anon Platform from getting work because other requests have already been filled by the time it gets around to the Anon Platform request may already be at work in the present Scheduler when it comes to processing work requests. The order in which it determines eligibility for work results in certain platforms not getting any under certain load conditions, e.g. extremely high (250k+) return rates. Grant Darwin NT |
|
Grant (SSSF) Joined: 19 Aug 99 Posts: 13987 Credit: 208,696,464 RAC: 304
|
Return rate now down to around 235k, and both systems are now, very occasionally, getting some work. Grant Darwin NT |
|
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
|
It's been this way for at least 8 years that I'm aware of. It doesn't make any difference whether it runs as Stock or Anonymous. Both of those machines ran as Stock for weeks after the Christmas SNAFU, and one is still Stock; no difference 8 years ago or now. Is your Windows machine full yet? I'm finally getting a few downloads now; hopefully I'll get enough to keep the machines running soon. |
|
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
|
"I'm wondering if this issue with handing out work to some systems & not others is related to the Anonymous Platform issue with the new Scheduler version?" I have often had one of my hosts getting work on every request while the other host stays dry. And they are both anonymous platform Linux boxes. My theory is that because the clients are doing scheduler requests in a regular five-minute cadence, if there is a big bunch of clients hitting the server at the same time my host hits it, this same bunch will be competing with my host on its next request too. And if my other host hits the server at a quiet point in time, it'll keep hitting this same 'hole' on the subsequent requests. |
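The cadence theory above can be illustrated with a toy simulation. This is only a sketch under assumptions that do not come from the thread: the function `simulate`, the per-second `capacity`, the host counts, and the random choice of winners within one second are all hypothetical, and the real Scheduler may behave quite differently.

```python
import random

PERIOD = 300  # seconds between requests: the five-minute cadence from the post


def simulate(phases, capacity, cycles, seed=0):
    """Count how often each client gets work.

    phases[i] is the second within the 300 s cycle at which client i calls in.
    capacity is the number of requests the server can fill per second
    (assumed constant, purely for illustration).
    """
    rng = random.Random(seed)
    wins = [0] * len(phases)
    # Group clients by arrival second. The grouping never changes, because
    # every client comes back exactly PERIOD seconds later.
    slots = {}
    for client, phase in enumerate(phases):
        slots.setdefault(phase % PERIOD, []).append(client)
    for _ in range(cycles):
        for clients in slots.values():
            # Assumption: winners within one second are picked at random.
            for client in rng.sample(clients, min(capacity, len(clients))):
                wins[client] += 1
    return wins


# 19 strangers plus host A all call in at second 10; host B is alone at second 200.
phases = [10] * 20 + [200]
wins = simulate(phases, capacity=2, cycles=50)
host_a, host_b = wins[19], wins[20]
print(f"host A (crowded slot): {host_a}/50, host B (quiet slot): {host_b}/50")
```

In this model host B is served on every request while host A keeps competing with the same crowd, which matches the "one host fed, the other dry" pattern described, without any platform difference between the two.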
|
Grant (SSSF) Joined: 19 Aug 99 Posts: 13987 Credit: 208,696,464 RAC: 304
|
"It's been this way for at least 8 years that I'm aware of. It doesn't make any difference whether it runs as Stock or Anonymous. Both of those machines ran as Stock for weeks after the Christmas SNAFU, and one is still Stock; no difference 8 years ago or now." Not just stock v Anon, but also OS, OS version etc., GPU type, GPU driver etc. The Scheduler goes through all of those things when deciding what to give or not give, and when it's under a heavy load the time it takes to run through them results in some systems getting work whereas others don't. "Is your Windows machine full yet? I'm finally getting a few downloads now; hopefully I'll get enough to keep the machines running soon." Nope, not even close. Since my earlier post it's only picked up a couple of dozen WUs, if that. "Project has no tasks available" is the standard response; even though the return rate is now down to 130k, it's still not getting work. My Linux system did get some, but we're back to sticky downloads: the timer counts away while not a bit gets transferred, and eventually it times out; rinse & repeat & you end up with extreme backoffs. A few retries managed to get those cleared. Edit: Linux system just scored work on 2 consecutive requests (and downloaded without assistance). Windows system hasn't gotten any for over an hour. *shrug* Grant Darwin NT |
|
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
|
One thing that could affect the failure rate of scheduler requests under heavy load is the size of the work cache of your host. When a client talks to the scheduler, it lists every task it has, not just the completed ones it is reporting. And this happens in quite verbose XML. So scheduler requests from hosts with big caches are huge, taking more time to transfer over the net and more processing by the scheduler. This gives them more opportunities to fail. |
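To get a feel for how quickly such a request grows, here is a rough sketch. The element names (`request`, `task`, `name`, `state`) and the function `build_request` are invented for illustration and are not BOINC's real request schema; only the general point, that every cached task adds its own block of XML, comes from the post above.

```python
import xml.etree.ElementTree as ET


def build_request(n_cached_tasks):
    """Build a toy scheduler request that lists every task the host holds.

    The element names are hypothetical, not BOINC's actual schema.
    """
    root = ET.Element("request")
    for i in range(n_cached_tasks):
        task = ET.SubElement(root, "task")
        ET.SubElement(task, "name").text = f"example_workunit_{i:06d}"
        ET.SubElement(task, "state").text = "in_progress"
    return ET.tostring(root)


for n in (0, 100, 1000):
    print(n, "cached tasks ->", len(build_request(n)), "bytes")
```

Even in this stripped-down form the request size grows linearly with the cache, so a host holding thousands of tasks sends a request many times larger than a host that has run dry.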
|
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
|
"'Project has no tasks available' is the standard response, even though the return rate is now down to 130k, still not getting work." Return rate stabilizing only means that the hosts have cleared their backlogs of unreported work from the downtime. It doesn't mean they have filled their caches. If the server status page (ssp) reported the number of tasks handed out during the last hour, then that number stabilizing would mean the post-downtime congestion is over. |
|
Jimbocous Joined: 1 Apr 13 Posts: 1861 Credit: 268,616,081 RAC: 1,349
|
What I see here is that the lower the client's RAC, the more likely it is that the box will get tasks. Platform doesn't seem to matter. After an outage, my two heavy hitters will go 12-24 hours before getting any significant work, while the other two, both low producers, will have full caches within a couple of hours. When the heavy hitters do start getting work, the lower of the two gets it first. Too consistent to be coincidence.
|
|
Grant (SSSF) Joined: 19 Aug 99 Posts: 13987 Credit: 208,696,464 RAC: 304
|
Windows system finally starting to get some work again. Grant Darwin NT |
|
Grant (SSSF) Joined: 19 Aug 99 Posts: 13987 Credit: 208,696,464 RAC: 304
|
"'Project has no tasks available' is the standard response, even though the return rate is now down to 130k, still not getting work." "Return rate stabilizing only means that the hosts have cleared their backlogs of unreported work from the downtime. It doesn't mean they have filled their caches." Yep. Going to be a long time before caches are refilled: In progress is about 1 million below where it was before the outage. And the splitters are yet to really get going; 20/s is better than nothing, but not a lot. And the Validation backlog just keeps reaching new highs. Grant Darwin NT |
|
Grant (SSSF) Joined: 19 Aug 99 Posts: 13987 Credit: 208,696,464 RAC: 304
|
"One thing that could affect the failure rate of scheduler requests under heavy load is the size of the work cache of your host." True, but presently the problem has been with systems that have no work at all. And then the system that got work gets more, while the other system still gets none. Grant Darwin NT |
|
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874
|
"One thing that could affect the failure rate of scheduler requests under heavy load is the size of the work cache of your host." I think it's not just the size of the cache you have - it's also the size of the cache you want. I've had some success by turning down my cache request to maybe an hour or less, when re-loading a fast machine from dry. Get a few in, just to ensure the regular 'every 5 minutes' request, and then gradually ease the cache back upwards. Make it easy for the server - fewer potential candidate allocations to assess. Might be the placebo effect, but it's worked again:
05/02/2020 09:21:29 | SETI@home | [sched_op] NVIDIA GPU work request: 41386.38 seconds; 0.00 devices
05/02/2020 09:21:31 | SETI@home | Scheduler request completed: got 0 new tasks
05/02/2020 09:26:37 | SETI@home | [sched_op] NVIDIA GPU work request: 16013.02 seconds; 0.00 devices
05/02/2020 09:26:40 | SETI@home | Scheduler request completed: got 75 new tasks |
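The "start small, then ease the cache back up" approach can be written down as a simple rule. This is a minimal sketch, not anything BOINC provides: the function `next_cache_hours`, the doubling `factor`, the one-hour `floor`, and the fall-back-to-floor behaviour on an empty reply are all assumptions made for illustration. In practice the cache setting is changed by hand in the client's preferences.

```python
def next_cache_hours(current, target, got_tasks, factor=2.0, floor=1.0):
    """Suggest the next cache setting, in hours of work to request.

    Assumed rule: after a reply that delivered tasks, grow the request by
    `factor` up to `target`; after an empty reply, fall back to `floor`.
    """
    if not got_tasks:
        return floor
    return min(target, current * factor)


# Replay a hypothetical run of scheduler replies (True = got new tasks).
replies = [True, True, False, True, True, True, True]
cache = 1.0
history = []
for got_tasks in replies:
    cache = next_cache_hours(cache, target=12.0, got_tasks=got_tasks)
    history.append(cache)
print(history)
```

The design choice is to keep each request cheap for the server while the host is empty and only ask for the full cache once work is flowing, which is the effect the log excerpt above hints at.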
|
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
|
"And the Validation backlog just keeps reaching new highs." The backlog of tasks I crunched during the downtime seems to have been validated now. My RAC is back where it was before the downtime. |
|
Grant (SSSF) Joined: 19 Aug 99 Posts: 13987 Credit: 208,696,464 RAC: 304
|
And we're back to sticking downloads again. Grant Darwin NT |
|
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
|
Hey, this is nice. Seems the same setting that controls the Upload Retries also controls the Download Retries. Instead of Download retries in minutes, it's seconds. Download 'Project Backoffs' are minutes instead of hours.... This will work. Except, as usual, we are now Out Of Work, and my machines are still out of work. |
|
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628
|
"Setting NNT until all work is reported has been very effective for me." . . Reducing work report to 99 and setting NNT did not help here ... :( Stephen :( |
|
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628
|
"I just noticed we are back. And it wasn't a multi-day shutdown. Just a basic long Tuesday" . . Hmmmm, 12 hours is a little more than a basic outage :( Stephen :( |
|
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628
|
"It's been this way for at least 8 years that I'm aware of. It doesn't make any difference whether it runs as Stock or Anonymous. Both of those machines ran as Stock for weeks after the Christmas SNAFU, and one is still Stock; no difference 8 years ago or now. Is your Windows machine full yet? I'm finally getting a few downloads now; hopefully I'll get enough to keep the machines running soon." . . I didn't start to get more than an odd task or 2 until 8:30am UTC. :( Stephen :( |
|
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628
|
"I'm wondering if this issue with handing out work to some systems & not others is related to the Anonymous Platform issue with the new Scheduler version?" "I have often had one of my hosts getting work on every request while the other host stays dry. And they are both anonymous platform Linux boxes. My theory is that because the clients are doing scheduler requests in a regular five-minute cadence, if there is a big bunch of clients hitting the server at the same time my host hits it, this same bunch will be competing with my host on its next request too. And if my other host hits the server at a quiet point in time, it'll keep hitting this same 'hole' on the subsequent requests." . . My slowest Linux host seems to find that sweet spot regularly and will get regular downloads when the other 3 Linux machines are getting nothing. All on the same line ... Stephen ? ? |
|
AllgoodGuy Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266
|
Game on, just got two healthy downloads back to back. |