Panic Mode On (111) Server Problems?

Message boards : Number crunching : Panic Mode On (111) Server Problems?
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1928356 - Posted: 6 Apr 2018, 17:08:13 UTC - in response to Message 1928346.  

Since the routine is working "correctly" on two of my four crunchers, and "incorrectly" on the other two, I would suggest there is something amiss in the communication between the cruncher and the calculation. It is worth noting that the two that are "incorrect" are my top two....
I see the greatest effect on my three fastest and most capable machines. The oldest and slowest crunchers in my farm are less affected and stay at their cache allotments for the longest.

Since they have been attached to the project the longest, maybe the servers have stabilized on the "correct" identification of system parameters and performance capabilities.
That doesn't feel likely to me. Tuning takes place via averaging over 100 tasks, which for fast machines takes hardly any time at all.

I suspect it's more likely that when the feeder is struggling to cache enough usable tasks (VHAR getting gobbled up very quickly, VLAR being inappropriate for the request), a big machine with a big request may simply take too long for the request to be processed, and find that all potential tasks have been claimed and hoovered out from under its feet by smaller, nimbler, machines.

See if it works any better with the new database settings and a clean, guppi-only, RTS.
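A rough way to picture that race, purely as a toy model (the buffer size, refill rate and host mix below are invented numbers, not the real feeder parameters): a fixed-size ready-to-send buffer refills slowly while many small requests and one large one are serviced in turn, and the large request ends up with almost nothing.

# Toy model of the RTS race described above: a fixed-size feeder buffer
# refills slowly while hosts claim tasks from it.  All numbers are invented
# for illustration only.
import random

FEEDER_SLOTS = 200      # size of the shared-memory buffer (invented)
REFILL_PER_TICK = 20    # tasks the feeder can add per pass (invented)
random.seed(1)
buffer_level = FEEDER_SLOTS

def serve(requests):
    """Refill the buffer, then serve requests in arrival order."""
    global buffer_level
    buffer_level = min(FEEDER_SLOTS, buffer_level + REFILL_PER_TICK)
    served = {}
    for host, wanted in requests:
        granted = min(wanted, buffer_level)
        buffer_level -= granted
        served[host] = granted
    return served

big_got = big_asked = 0
for _ in range(1000):
    small_hosts = [("small-%d" % i, random.randint(1, 4)) for i in range(60)]
    big_host = [("big", 100)]          # slower-to-process request goes last
    result = serve(small_hosts + big_host)
    big_got += result["big"]
    big_asked += 100

print("big host got %d of %d requested tasks (%.1f%%)"
      % (big_got, big_asked, 100.0 * big_got / big_asked))

With combined demand far above the refill rate, the host whose request is processed last gets essentially nothing, which is the behaviour the fast machines seem to be seeing.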
ID: 1928356
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1928357 - Posted: 6 Apr 2018, 17:25:51 UTC - in response to Message 1928356.  

Since the routine is working "correctly" on two of my four crunchers, and "incorrectly" on the other two, I would suggest there is something amiss in the communication between the cruncher and the calculation. It is worth noting that the two that are "incorrect" are my top two....
I see the greatest effect on my three fastest and most capable machines. The oldest and slowest crunchers in my farm are less affected and stay at their cache allotments for the longest.

Since they have been attached to the project the longest, maybe the servers have stabilized on the "correct" identification of system parameters and performance capabilities.
That doesn't feel likely to me. Tuning takes place via averaging over 100 tasks, which for fast machines takes hardly any time at all.

I suspect it's more likely that when the feeder is struggling to cache enough usable tasks (VHAR getting gobbled up very quickly, VLAR being inappropriate for the request), a big machine with a big request may simply take too long for the request to be processed, and find that all potential tasks have been claimed and hoovered out from under its feet by smaller, nimbler, machines.

See if it works any better with the new database settings and a clean, guppi-only, RTS.

Yes, we will see. The two Linux machines have full BLC-only task caches right now. We'll see whether they can maintain that once the RTS buffer has purged all the Arecibo tasks.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1928357
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1928374 - Posted: 6 Apr 2018, 19:54:17 UTC - in response to Message 1928357.  

I suspect it's more likely that when the feeder is struggling to cache enough usable tasks (VHAR getting gobbled up very quickly, VLAR being inappropriate for the request), a big machine with a big request may simply take too long for the request to be processed, and find that all potential tasks have been claimed and hoovered out from under its feet by smaller, nimbler, machines.

What's the chance of convincing the code maintainers that the old Arecibo VLAR restriction is no longer necessary on modern hardware?

If they are still hesitant because they want to allow older hardware to crunch, how hard would it be to reconfigure the code to apply that restriction only to, say, Fermi or Kepler and earlier, and allow Maxwell and Pascal to run the Arecibo VLARs? They obviously made the change way back when Fermi was the norm and Kepler was just being introduced. Where would I look for the code that puts the restriction in place? What would the variable be named so I can search for the code?
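One way to start hunting for it, given a local checkout of the scheduler source, would be a brute-force text search for likely strings such as "vlar" or "angle_range" - something along these lines, where the checkout path and the keyword list are only placeholders and guesses, not a statement of what the code actually contains:

import os

CHECKOUT = "./seti_boinc"                  # placeholder path to a local checkout
KEYWORDS = ("vlar", "angle_range")         # guesses at strings worth searching for

for root, _dirs, files in os.walk(CHECKOUT):
    for name in files:
        if not name.endswith((".cpp", ".c", ".h", ".inc", ".php")):
            continue
        path = os.path.join(root, name)
        try:
            with open(path, errors="replace") as src:
                for lineno, line in enumerate(src, 1):
                    if any(k in line.lower() for k in KEYWORDS):
                        print("%s:%d: %s" % (path, lineno, line.rstrip()))
        except OSError:
            pass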
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1928374
Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34257
Credit: 79,922,639
RAC: 80
Germany
Message 1928401 - Posted: 6 Apr 2018, 21:15:47 UTC - in response to Message 1928374.  

I suspect it's more likely that when the feeder is struggling to cache enough usable tasks (VHAR getting gobbled up very quickly, VLAR being inappropriate for the request), a big machine with a big request may simply take too long for the request to be processed, and find that all potential tasks have been claimed and hoovered out from under its feet by smaller, nimbler, machines.

What's the chance of convincing the code maintainers that the old Arecibo VLAR restriction is no longer necessary on modern hardware?

If they are still hesitant because they want to allow older hardware to crunch, how hard would it be to reconfigure the code to apply that restriction only to, say, Fermi or Kepler and earlier, and allow Maxwell and Pascal to run the Arecibo VLARs? They obviously made the change way back when Fermi was the norm and Kepler was just being introduced. Where would I look for the code that puts the restriction in place? What would the variable be named so I can search for the code?


I suggested that years ago, but some didn't like the idea.
I think new hardware can handle it.
AMD cards never had a problem with them.
OTOH, the chance of finding a signal is higher in VLAR tasks IMHO, so it would improve the science as well.


With each crime and every kindness we birth our future.
ID: 1928401
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 1928419 - Posted: 6 Apr 2018, 21:49:19 UTC - in response to Message 1928356.  
Last modified: 6 Apr 2018, 21:52:52 UTC

are you equipped to easily count exactly how many of each type of task are present on your machine? This has to be a local count - information from this website is no use here. BoincTasks is likely to be a much better tool for this purpose than BOINC Manager

Not easily; it would be a case of assuming I'm up against the Server side limit, then subtracting each WU reported, and not replaced, on each Scheduler request.
Will give BoincTasks/BoincView a go if it will help sort this issue out.
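In the meantime, a quick-and-dirty local count can be pulled from boinccmd, assuming it is on the path and assuming the usual naming pattern holds (names starting with "blc" being GBT/guppi tasks, and Arecibo VLAR names carrying a ".vlar" suffix - adjust the patterns if that assumption is wrong):

import subprocess
from collections import Counter

# Parse "boinccmd --get_tasks" output and bucket cached tasks by name pattern.
out = subprocess.run(["boinccmd", "--get_tasks"],
                     capture_output=True, text=True, check=True).stdout

counts = Counter()
for line in out.splitlines():
    line = line.strip()
    if not line.startswith("name:"):       # skip everything but result names
        continue
    name = line.split("name:", 1)[1].strip()
    if name.startswith("blc"):
        counts["guppi/BLC"] += 1
    elif ".vlar" in name:
        counts["Arecibo VLAR"] += 1
    else:
        counts["Arecibo (other)"] += 1

for kind, n in sorted(counts.items()):
    print("%-16s %d" % (kind, n))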


I suspect it's more likely that when the feeder is struggling to cache enough usable tasks (VHAR getting gobbled up very quickly, VLAR being inappropriate for the request), a big machine with a big request may simply take too long for the request to be processed, and find that all potential tasks have been claimed and hoovered out from under its feet by smaller, nimbler, machines.

The problem is that it occurs even when there is no Arecibo work. It also occurs with requests for CPU work (although that's nowhere near as noticeable due to the low rate of return from most CPU systems), and it appears to affect v7 BOINC clients, but not the older v6 clients.

It first began in Dec 2016, when Eric was sorting out the issue with the stock rollout of SoG v22/v23.
People who preferred AP work, and MB only if there was no AP work available, stopped getting any work when there was no AP work available; they had to change their preferences to accept both types of work in order to get MB work when no AP was available.
I frequently found my cache running out of work, even though I didn't do any AP work at all. Changing the application preference settings, saving, updating, changing them back, saving, updating, would usually get work to start flowing again. Then I found out about hitting Update 3 times in succession, which got the work flowing much faster.
In the end I installed the AP application & set my preferences to accept both types of work, and the issue of the falling cache levels was greatly mitigated. As yesterday showed, the issue is still there, but nowhere near as frequent, nor with nearly as great a drop in cache, as before I accepted AP work.

Prior to this issue, even with an Arecibo VLAR storm, the GPUs would always pick up some work after a few unsuccessful requests. Now both the CPU & GPUs can go through several requests, while returning completed work, and still not pick up any replacement work.


Having the relevant code from before the update, and comparing it to the current scheduler version, would (hopefully) make seeing what's going on much easier.
Looking at the Haveland graphs it's easy to see how significant the latest glitch was - a 200,000-task drop in work in progress before the Scheduler started dishing out work again to those affected.
Grant
Darwin NT
ID: 1928419
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1928421 - Posted: 6 Apr 2018, 21:58:29 UTC

You'd still have complaints if you set All the Hosts to run Arecibo VLAR on nVidia. It might work if you set it to run them on just the Anonymous platform Hosts. That way people would have a way to avoid them and the CUDA Special App only runs on Anonymous platform anyway. The only difference I see when running them with the Special App is they take about twice as long as the current versions of BLC tasks. Basically they run about the same as the original version of the BLC5 tasks. They are still nasty when running them with the other Apps.

So far my machines are maintaining a Full cache with just the BLC tasks; we'll see how long that lasts.
ID: 1928421
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1928439 - Posted: 6 Apr 2018, 22:29:40 UTC - in response to Message 1928312.  
Last modified: 6 Apr 2018, 23:10:38 UTC

. . In my case it is very easy ... ZERO. No new work for the past 3 to 4 hours and all tasks have finished and been reported; none left, but still just getting the message "no tasks sent". My CPU Q's are also filled with Arecibo VLARs.

. . So I have turned the machine off ...
Either I'm misunderstanding, or you're contradicting yourself. How can you say "all tasks have finished" and "my CPU Q's are filled" in the same breath?

If you're talking about two different machines, please make that clear.


. . Late at night and tired ... all GPU tasks had finished and the schedulers were refusing to send any more, but the CPU Q, on that machine and one other, was full and all Arecibo VLAR tasks. I have also discovered a flaw in the rescheduler I am using. If the GPU Q is completely empty and you attempt to move work from the CPU Q, it cannot "decide" which app to assign them to and simply trashes them, turning them into ghosts. I had that happen just prior to this week's outage and I had to fiddle around to recover the ghosted tasks when new work was finally available (a nasty catch-22 on both counts: zero GPU tasks means I cannot reschedule tasks and cannot perform ghost recovery ... aaarrghh!), which is why I gave up and shut it down.

. . The machine beside it, which is GPU only, had a full cache and was getting regular top ups. There is no way that is reasonable behaviour by the schedulers in my book: one machine gets work while the one beside it is starved. Both have work requests set to intervals/time amounts that match the server limit of 100 per GPU. The critical issue could be that one is crunching on CPU, which brings us back to this: if the CPU Q is full then no work is sent to the GPU, which is definitely a flaw if you ask me ...
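For what it's worth, the catch-22 above could be avoided with a simple guard in the rescheduling script. The sketch below is only an illustration of the idea, not code from any actual rescheduler: refuse to move anything when there is no existing GPU task to copy the app assignment from.

def plan_reschedule(cpu_tasks, gpu_tasks, how_many):
    """Pick CPU tasks to move to the GPU queue, or nothing if it would be unsafe."""
    if not gpu_tasks:
        # No template task to take the app/plan-class assignment from,
        # so moving work would only create ghosts.
        print("GPU queue empty - refusing to reschedule")
        return []
    return cpu_tasks[:how_many]

# An empty GPU queue makes the planner bail out instead of trashing tasks.
print(plan_reschedule(["arecibo_vlar_1", "arecibo_vlar_2"], [], 2))
print(plan_reschedule(["arecibo_vlar_1", "arecibo_vlar_2"], ["blc_task_7"], 2))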

Stephen

ID: 1928439
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1928446 - Posted: 6 Apr 2018, 22:39:52 UTC - in response to Message 1928342.  

I did the same just before setting out for my walk, at 15:25 local (14:25 UTC), and got two Arecibo VLARs. Everything else that downloaded while I was out (including the first at 15:30 local) has been guppies, so we must have been just on the end of it.

Log at All tasks for computer 7118033


. . I am seeing that too: the new tasks coming to the CPU are guppies and work is coming to the GPU now, so it may have been the culprit. But why then was the GPU-only cruncher getting regular re-supply?

Stephen
ID: 1928446
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 1928450 - Posted: 6 Apr 2018, 22:42:40 UTC - in response to Message 1928439.  

. . Late at night and tired ... all GPU tasks had finished and the schedulers were refusing to send any more, but the CPU Q was full and all Arecibo VLAR tasks. I have also discovered a flaw in the rescheduler I am using.

It's going to be a tricky issue to sort out (it's been over 12 months now since it started), and rescheduling causes all sorts of issues of its own. That's just going to further complicate something that's already complicated enough as it is.
Grant
Darwin NT
ID: 1928450
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1928456 - Posted: 6 Apr 2018, 23:01:40 UTC - in response to Message 1928450.  

I've basically stopped using the reschedulers for their original intended purpose. Given the lack of Arecibo tasks in the last couple of months compared to the steady influx of BLC work, I just don't see the need to move the few Arecibo tasks I get. If you move too many, you negatively influence the calculated APR and can run into "time limit exceeded" errors.
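Rough arithmetic of that failure mode, as I understand it: the client allows a task roughly rsc_fpops_bound divided by the device's estimated speed, and that estimate follows the APR. Every number below is made up purely to show the effect of an inflated APR.

rsc_fpops_bound = 1.0e15          # hypothetical per-task FLOP bound
actual_runtime_s = 4.0 * 3600     # what the task really needs on the CPU

def allowed_seconds(apr_gflops):
    """Approximate wall-clock budget the client derives from the APR."""
    return rsc_fpops_bound / (apr_gflops * 1e9)

for label, apr in [("honest APR (20 GFLOPS)", 20.0),
                   ("inflated APR (200 GFLOPS)", 200.0)]:
    limit = allowed_seconds(apr)
    verdict = "OK" if actual_runtime_s <= limit else "time limit exceeded"
    print("%-26s budget %5.1f h, task needs %.1f h -> %s"
          % (label, limit / 3600, actual_runtime_s / 3600, verdict))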

Where would the code commit for the V22/V23 app fix back in December 2016 be located? Would it be in work_fetch.cpp? What keyword would you search for?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1928456
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1928457 - Posted: 6 Apr 2018, 23:01:43 UTC - in response to Message 1928421.  

You'd still have complaints if you set All the Hosts to run Arecibo VLAR on nVidia. It might work if you set it to run them on just the Anonymous platform Hosts. That way people would have a way to avoid them and the CUDA Special App only runs on Anonymous platform anyway. The only difference I see when running them with the Special App is they take about twice as long as the current versions of BLC tasks. Basically they run about the same as the original version of the BLC5 tasks. They are still nasty when running them with the other Apps.

So far my machines are maintaining a Full cache with just the BLC tasks; we'll see how long that lasts.


. . That's a thought. But is there anything in the config that would let you choose to not take Arecibo VLARs on the GPUs? It would give protection to hosts running stock apps.

Stephen

. .
ID: 1928457
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1928462 - Posted: 6 Apr 2018, 23:18:48 UTC - in response to Message 1928450.  

. . Late at night and tired ... all GPU tasks had finished and the schedulers were refusing to send any more, but the CPU Q was full and all Arecibo VLAR tasks. I have also discovered a flaw in the rescheduler I am using.

It's going to be a tricky issue to sort out (it's been over 12 months now since it started), and rescheduling causes all sorts of issues of its own. That's just going to further complicate something that's already complicated enough as it is.


. . True, bulk rescheduling from CPU to GPU can cause issues, but needs must in a crisis. It certainly is NOT a long-term solution.

Stephen

:(
ID: 1928462
Sirius B Project Donor
Volunteer tester
Joined: 26 Dec 00
Posts: 24879
Credit: 3,081,182
RAC: 7
Ireland
Message 1928471 - Posted: 6 Apr 2018, 23:49:11 UTC

After getting my cache filled, I set NNT every time and then suspend Network activity. I wait until I have at least 10 tasks completed, then...

07/04/2018 00:29:21 | SETI@home | work fetch resumed by user
07/04/2018 00:29:25 | | Resuming network activity
07/04/2018 00:29:52 | SETI@home | Reporting 9 completed tasks
07/04/2018 00:29:52 | SETI@home | Requesting new tasks for CPU
07/04/2018 00:29:54 | SETI@home | Scheduler request completed: got 9 new tasks

...works every time :-)

Miscalculated this time, thought I had 10 :-)
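For anyone who wants to script that trick rather than click through it each time, something like this would do it with boinccmd. A sketch only, assuming boinccmd is on the path and that --get_tasks reports a "ready to report" field; the project URL and batch size are placeholders.

import subprocess, time

PROJECT_URL = "http://setiathome.berkeley.edu/"   # placeholder project URL
BATCH = 10                                        # report in batches of ten

def boinccmd(*args):
    return subprocess.run(["boinccmd", *args],
                          capture_output=True, text=True, check=True).stdout

def ready_to_report():
    # Count results the client has finished but not yet reported.
    out = boinccmd("--get_tasks")
    return sum("ready to report: yes" in line for line in out.splitlines())

boinccmd("--project", PROJECT_URL, "nomorework")   # NNT
boinccmd("--set_network_mode", "never")            # suspend network activity

while ready_to_report() < BATCH:
    time.sleep(60)                                 # poll once a minute

boinccmd("--set_network_mode", "auto")             # resume network
boinccmd("--project", PROJECT_URL, "allowmorework")
boinccmd("--project", PROJECT_URL, "update")       # report the batch, ask for work
print("reported a batch of at least", BATCH, "tasks")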
ID: 1928471
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 1928477 - Posted: 7 Apr 2018, 0:00:15 UTC - in response to Message 1928471.  

After getting my cache filled, I set NNT every time and then suspend Network activity. I wait until I have at least 10 tasks completed, then...

07/04/2018 00:29:21 | SETI@home | work fetch resumed by user
07/04/2018 00:29:25 | | Resuming network activity
07/04/2018 00:29:52 | SETI@home | Reporting 9 completed tasks
07/04/2018 00:29:52 | SETI@home | Requesting new tasks for CPU
07/04/2018 00:29:54 | SETI@home | Scheduler request completed: got 9 new tasks

...works every time :-)

Even when the Scheduler is having issues? (it's been OK for the last 10 hours or so).

Personally, it'd be nice if it worked the way it used to, without having to muck around with setting & unsetting NNT, or repeatedly hitting Update to get things flowing again. Just set and forget.
Grant
Darwin NT
ID: 1928477
Sirius B Project Donor
Volunteer tester
Joined: 26 Dec 00
Posts: 24879
Credit: 3,081,182
RAC: 7
Ireland
Message 1928499 - Posted: 7 Apr 2018, 0:48:01 UTC - in response to Message 1928477.  

Even when the Scheduler is having issues? (it's been OK for the last 10 hours or so).
Yes, but not as fast as it normally does.
ID: 1928499
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 1928500 - Posted: 7 Apr 2018, 0:56:50 UTC - in response to Message 1928499.  

Even when the Scheduler is having issues? (it's been OK for the last 10 hours or so).
Yes, but not as fast as it normally does.

So when others were having issues, you were getting work, but it was taking longer than normal for the Scheduler to respond to the request?
The usual response time for me from the Scheduler is 2 to 3 seconds, and that was also the response time when it wasn't giving out work. TBar, for instance, was getting responses within 1 second even when unable to get work.
Grant
Darwin NT
ID: 1928500
Sirius B Project Donor
Volunteer tester
Joined: 26 Dec 00
Posts: 24879
Credit: 3,081,182
RAC: 7
Ireland
Message 1928501 - Posted: 7 Apr 2018, 1:00:56 UTC - in response to Message 1928500.  

When it experienced bad issues, it took just over an hour.
5,1,1,78 then 15.
If it gets bad again, I'll keep an eye on the log.
ID: 1928501
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 1928506 - Posted: 7 Apr 2018, 1:17:38 UTC - in response to Message 1928501.  

If it gets bad again, I'll keep an eye on the log.

Thanks.
It may be of use.

Higher output systems send back a lot more data with a Scheduler request than lower output ones, and it's been the higher output machines that suffer from the problem more than the lower output ones (unfortunately my C2D has died so I couldn't see what, if any, effect this last hiccup would've had on it). Richard has suggested it may relate to the time taken to process the larger returned data for faster crunchers.

Yet the response for no work is coming within the same time frame that it normally takes to get work, and those still getting work are waiting longer for the response.
The fact that those not getting work get a quick response, while those getting work get a much longer one, makes me think it could relate to the order in which the Scheduler determines whether a host gets more work or not.
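If it helps to put numbers on that next time it plays up, the request-to-response latency (and whether work was actually granted) can be pulled out of a saved copy of the Event Log. The sketch below assumes the same message wording and dd/mm/yyyy timestamps as the excerpts quoted earlier in the thread; the log path is just a placeholder.

from datetime import datetime

LOG = "stdoutdae.txt"             # placeholder: point at a saved Event Log
FMT = "%d/%m/%Y %H:%M:%S"         # matches the dd/mm/yyyy stamps quoted above

pending = None
for raw in open(LOG, errors="replace"):
    parts = [p.strip() for p in raw.split("|", 2)]
    if len(parts) != 3 or parts[1] != "SETI@home":
        continue
    stamp, _project, msg = parts
    try:
        when = datetime.strptime(stamp, FMT)
    except ValueError:
        continue
    if msg.startswith("Sending scheduler request"):
        pending = when
    elif msg.startswith("Scheduler request completed") and pending:
        got_work = "got 0 new tasks" not in msg
        print("%s  %4.0f s  %s" % (when, (when - pending).total_seconds(),
                                   "work received" if got_work else "no work"))
        pending = None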
Grant
Darwin NT
ID: 1928506
Sirius B Project Donor
Volunteer tester
Joined: 26 Dec 00
Posts: 24879
Credit: 3,081,182
RAC: 7
Ireland
Message 1928516 - Posted: 7 Apr 2018, 1:43:18 UTC - in response to Message 1928506.  
Last modified: 7 Apr 2018, 1:44:53 UTC

Okay.

07/04/2018 02:39:10 | SETI@home | work fetch resumed by user
07/04/2018 02:39:15 | | Resuming network activity
07/04/2018 02:39:25 | SETI@home | Sending scheduler request: To fetch work.
07/04/2018 02:39:25 | SETI@home | Reporting 6 completed tasks
07/04/2018 02:39:25 | SETI@home | Requesting new tasks for CPU
07/04/2018 02:39:28 | SETI@home | Scheduler request completed: got 6 new tasks
07/04/2018 02:40:02 | SETI@home | work fetch suspended by user
07/04/2018 02:40:06 | | Suspending network activity - user request

I'll try a bigger hit next time around.
ID: 1928516
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 1928520 - Posted: 7 Apr 2018, 1:50:22 UTC - in response to Message 1928516.  
Last modified: 7 Apr 2018, 1:50:53 UTC

I'll try a bigger hit next time around.

Personally I'm most interested in how things are when the Scheduler is misbehaving, so I'd suggest relaxing till things fall over again.
:-)
Grant
Darwin NT
ID: 1928520