Panic Mode On (111) Server Problems?

Message boards : Number crunching : Panic Mode On (111) Server Problems?
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 1927189 - Posted: 30 Mar 2018, 0:02:25 UTC - in response to Message 1927188.  
Last modified: 30 Mar 2018, 0:04:32 UTC

Just a litany of "no work is available" messages.

Likewise.

Looking at the graphs, you can see the In-progress numbers taking a dive & the Ready-to-send buffer filling rapidly as work stops being sent out.
Grant
Darwin NT
ID: 1927189 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1927192 - Posted: 30 Mar 2018, 0:12:40 UTC - in response to Message 1927189.  

I think your diagnosis is correct. Whatever is stuffing up the work going out is rapidly reducing the tasks in progress, and the RTS buffer is growing overlarge again. The creation mechanism is not being stopped at the buffer limit that was supposed to be in place.
SETI@home classic workunits: 20,676 - CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1927192 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 1927196 - Posted: 30 Mar 2018, 0:27:06 UTC - in response to Message 1927192.  
Last modified: 30 Mar 2018, 0:44:23 UTC

The creation mechanism is not being stopped at the buffer limit that was supposed to be in place.

Looks like it just kicked in at 610k. Stopped dead in its tracks.
I still can't get any work; the triple update isn't having any effect this time around. Deletion & purge backlogs continue to grow, but since people can't get new work to replace what they've returned, they're growing more slowly.
And at least the Replica is starting to catch up again.

Whatever they did during the extended outage, it doesn't appear to have helped at all. It seems to have made things even worse.

Edit- I've managed to pick up 2 WUs in the last hour and a quarter.
Grant
Darwin NT
ID: 1927196 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1927203 - Posted: 30 Mar 2018, 0:53:00 UTC - in response to Message 1927196.  

I wondered where the limit was. Thought it was just north of 600K. I was down to 50 tasks on one machine. Triple Update wasn't doing anything. Changing preferences was no good either. Finally resorted to exiting BOINC, waiting five minutes, and then restarting it. Got 123 tasks at initialization. Sometimes that's the only way I can get the schedulers to recognize my task deficit.
SETI@home classic workunits: 20,676 - CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1927203 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 1927206 - Posted: 30 Mar 2018, 1:01:18 UTC - in response to Message 1927203.  

Finally resorted to exiting BOINC, waiting five minutes, and then restarting it. Got 123 tasks at initialization. Sometimes that's the only way I can get the schedulers to recognize my task deficit.

Or just more of the usual system weirdness. I just picked up 36 WUs over 2 requests after giving up on trying to get any.

I hope they sort things out soon. With deletions & purging not keeping up (let alone getting ahead), my Valids are over 2700 now. I keep expecting the whole thing to just lock up due to lack of disk space at any moment.
Grant
Darwin NT
ID: 1927206 · Report as offensive
juan BFP Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1927214 - Posted: 30 Mar 2018, 1:28:34 UTC

Just got 152 WUs; somebody must have kicked the servers.
ID: 1927214 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 1927291 - Posted: 30 Mar 2018, 9:12:44 UTC - in response to Message 1927214.  

Just got 152 WUs; somebody must have kicked the servers.

Yeah, after the last glitch had sorted itself out, the Scheduler has glitched again. Once more it's being random in its decision to give or not to give new work. Mostly the answer is not.
Grant
Darwin NT
ID: 1927291 · Report as offensive
Profile Stargate (SA)
Volunteer tester
Joined: 4 Mar 10
Posts: 1854
Credit: 2,258,721
RAC: 0
Australia
Message 1927292 - Posted: 30 Mar 2018, 9:56:50 UTC

I have no idea what you guys are talking about; when I need work it comes, whether it be CPU or GPU. I don't care as long as the cache is filled.
ID: 1927292 · Report as offensive
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9956
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1927297 - Posted: 30 Mar 2018, 10:49:47 UTC - in response to Message 1927292.  

I have no idea what you guys are talking about; when I need work it comes, whether it be CPU or GPU. I don't care as long as the cache is filled.

That is because, like me, you are not a "super-cruncher" with multiple GPUs that need constant "feeding", and some of them are ravenous :-)

Like you, my mid-range crunchers rarely have trouble filling their modest caches.

I almost feel embarrassed, till I look at their RACs :-0
ID: 1927297 · Report as offensive
juan BFP Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1927302 - Posted: 30 Mar 2018, 12:47:38 UTC - in response to Message 1927297.  
Last modified: 30 Mar 2018, 13:46:49 UTC

That is because, like me, you are not a "super-cruncher" with multiple GPUs that need constant "feeding", and some of them are ravenous :-)

Actually, you don't need to be a "super-cruncher" with a multi-GPU host to have "feeding" problems; you just need a 1070-or-better GPU (or an AMD equivalent), or a mid-range to top-end CPU.
The source of the problem is the 100-WU cache limit per GPU or CPU.
If you look at the crunching time for a WU on my host, you'll easily see what I'm talking about.
The GPU crunches a WU in about 2 minutes (some crunch even faster), so a 100-WU cache only holds about 3 1/2 hours of work; any problem on the server side empties the GPU very quickly.
Extrapolate that to a 1080 Ti host, which is at least 50% faster, and the cache holds for less than 2 hours.
Even without a GPU, the same thing happens with CPUs, just less intensely. With a 6-core (or better) CPU the cache holds a little longer, about 8 hours (if you run 12 CPU WUs at a time).
Now imagine a host with a Titan V or a 12+ core CPU.
So the real problem is that technology has reached the point where hosts are so fast they are pushing up against what the SETI servers can "feed" them through the 100-WU bottleneck. If a problem on the server side isn't fixed within a few hours, the host runs empty.
On the other hand, nobody ever promised us 24/7 work; we all know that. But for some of us (me very much included), seeing our hosts run empty makes us very sad. LOL

<edit> Please note, I'm not complaining about anything in the project (besides Creditscrew, but that's for another thread), just posting what I see from my user perspective. I clearly understand why the 100-WU limit exists.
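
A rough back-of-the-envelope check of those endurance figures (a sketch only: the 100-task in-progress limit and the per-WU run times are the figures quoted above, the ~1-hour CPU task time is inferred from the 8-hour estimate, and the helper function is just for illustration):

    # Sketch: how long a full cache lasts if the servers stop handing out work.
    def cache_hours(task_limit, minutes_per_task, concurrent_tasks):
        # Tasks completed per hour at the assumed run rate.
        tasks_per_hour = concurrent_tasks * 60.0 / minutes_per_task
        return task_limit / tasks_per_hour

    LIMIT = 100  # server-side in-progress limit per GPU (and per CPU)

    print(cache_hours(LIMIT, 2.0, 1))    # GTX 1070-class GPU, ~2 min/WU       -> ~3.3 h
    print(cache_hours(LIMIT, 1.3, 1))    # 1080 Ti-class GPU, ~50% faster      -> ~2.2 h
    print(cache_hours(LIMIT, 60.0, 12))  # 6-core CPU, 12 WUs at ~1 h each     -> ~8.3 h

Any server-side hiccup that lasts longer than those windows leaves the device sitting idle.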
ID: 1927302 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1927311 - Posted: 30 Mar 2018, 13:49:19 UTC - in response to Message 1927292.  

I have no idea what you guys are talking about; when I need work it comes, whether it be CPU or GPU. I don't care as long as the cache is filled.


. . It seems you and Wiggo both. Not so for all of us :(

Stephen

:)
ID: 1927311 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 1927377 - Posted: 30 Mar 2018, 21:22:52 UTC

And it appears that another Scheduler random-allocation period is upon us.
Looking at the Haveland graphs, there have been 2 instances since the last weekly outage, about 9 hours apart and each lasting about 90 minutes. This latest one appears to have started 12 hours after the previous one and is just getting under way. It will be interesting to see if it lasts as long as the other 2.
Grant
Darwin NT
ID: 1927377 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1927383 - Posted: 30 Mar 2018, 21:53:45 UTC - in response to Message 1927377.  

Well, I got in from you-know-where at about that time. I've got three machines that I babysit fairly closely: today I've let each of them fetch about once every four hours while I flush the shorty dross out of the systems.

So I let them suck, and got

68+10
67
56

for back up to 200 tasks per machine in just four fetches. I really feel that understanding how BOINC works, and working with it - rather than against it - makes life much simpler. I simply bump cache from 0.25 to 1.25 - it's the simplest one character edit - and that's more than enough. Anyone with fast GPUs asking for more than a day is also asking for trouble.
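
For reference (a sketch, not necessarily how Richard makes his edit): the cache value he's describing is BOINC's work-buffer preference ("Store at least X days of work"). It can also be set per host with a global_prefs_override.xml in the BOINC data directory; the 1.25/0.0 values here are just his example:

    <global_preferences>
       <work_buf_min_days>1.25</work_buf_min_days>
       <work_buf_additional_days>0.0</work_buf_additional_days>
    </global_preferences>

Then have the client re-read it (BOINC Manager: Options -> Read local prefs file, or boinccmd --read_global_prefs_override).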
ID: 1927383 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1927384 - Posted: 30 Mar 2018, 21:57:59 UTC - in response to Message 1927383.  

I wish I could reconcile what you state, Richard, with the reality of my machines. If I set the cache to 1 day, all I get is "your computer has reached a limit of tasks in progress" at each request. The caches quickly fall to zero.
SETI@home classic workunits: 20,676 - CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1927384 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1927386 - Posted: 30 Mar 2018, 22:13:38 UTC - in response to Message 1927384.  

I wish I could reconcile what you state, Richard, with the reality of my machines. If I set the cache to 1 day, all I get is "your computer has reached a limit of tasks in progress" at each request. The caches quickly fall to zero.
I can only suggest that you enable work_fetch_debug in your logs, and post the results here for us to analyse.

The immediate late-night, post-pub guesswork suggests that you have been filling up to the max on CPU tasks as well: I don't do that, I use my CPUs for other projects. So, just possibly, turning off CPU fetch while stuffed would get you GPU tasks, and let you get below 1 day cached for CPU tasks (work out how many CPU tasks you process per day, with normal runtimes, and leave CPU fetch off until you're balanced again). Then, do the math, and set the cache to the maximum that bumps the limit for GPUs, and accept the reduced CPU count.

But most of all - use your logs to work out what's going on.
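
For anyone wanting to follow that suggestion: work_fetch_debug is one of the standard BOINC event-log flags, switched on in cc_config.xml in the BOINC data directory. A minimal sketch (sched_op_debug is added here as an extra, often-useful flag, not something Richard specifically asked for):

    <cc_config>
      <log_flags>
        <work_fetch_debug>1</work_fetch_debug>
        <sched_op_debug>1</sched_op_debug>
      </log_flags>
    </cc_config>

Re-read the config files from the Manager (or restart the client) and the extra detail shows up in the Event Log / stdoutdae.txt.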
ID: 1927386 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 1927387 - Posted: 30 Mar 2018, 22:14:47 UTC - in response to Message 1927383.  

today I've let each of them fetch about once every four hours

Which is OK if the Scheduler is handing out work at that time. If it's not, then they won't get any till the Scheduler is back in the mood to give out work.
Since these latest issues are only lasting for about 90 minutes (so far - see the dips in In-progress work on the Haveland graphs), it's nowhere near as bad as when this first occurred (Dec 2016) & was lasting for significant portions of a day (12 hours or more).

And whatever this issue is, it seems to be different from the previous one, where mucking about with Application settings or triple updating would result in getting some work (for a while at least). With this current issue nothing we do will get work allocated.

Instead of 90 minutes, this more recent problem appears to have lasted for about 45-60 minutes. I'm now getting work with each request, and the In-progress numbers on the graphs are recovering.
Grant
Darwin NT
ID: 1927387 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1927388 - Posted: 30 Mar 2018, 22:44:15 UTC - in response to Message 1927386.  

I've posted my work_fetch_debug outputs for you in the past. You never could explain why I have such difficulty getting work when you've stated my debug output says I should be getting work.

I've never tried turning off CPU work. I process 250 CPU tasks per machine per day on my fastest machines. The Windows 7 machines do about 50 CPU tasks per day. Only SETI crunches CPU tasks, so turning off CPU work would obviously have some impact on RAC.

I will try turning off CPU work the next time I get into the task-limit situation, to see if that changes anything. So far today has been problem-free, except for the very brief episode Grant mentioned, which only lasted about 30 minutes for me, during which I got "no tasks are available" messages.
SETI@home classic workunits: 20,676 - CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1927388 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1927391 - Posted: 30 Mar 2018, 22:52:28 UTC - in response to Message 1927388.  

I've posted my work_fetch_debug outputs for you in the past.
Well, I popped your ID and 'work_fetch_debug' into an advanced search, and checked the last 12 months: 11 mentions (including tonight), but no actual logs. It's too late on this side of the pond to start work on it now, but it's still a useful tool, worth trying sometime.
ID: 1927391 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1927410 - Posted: 31 Mar 2018, 0:48:43 UTC - in response to Message 1927383.  

Well, I got in from you-know-where at about that time. I've got three machines that I babysit fairly closely: today I've let each of them fetch about once every four hours while I flush the shorty dross out of the systems.

So I let them suck, and got

68+10
67
56

for back up to 200 tasks per machine in just four fetches. I really feel that understanding how BOINC works, and working with it - rather than against it - makes life much simpler. I simply bump cache from 0.25 to 1.25 - it's the simplest one character edit - and that's more than enough. Anyone with fast GPUs asking for more than a day is also asking for trouble.


. . Hi Richard,

. . I'm not sure that this theory holds up. My 2 Linux machines are my most productive, and I have them set to 0.35/0.0 and 0.5/0.0, but they are having the same problems others are talking about. They do eventually replenish before running dry, but I often find them down to about 60% of their intended queue sizes, which are 200/16 and 100 respectively. Often "kicking" the servers with a premature work request will provoke a deluge of new work on the next legitimate request, but certainly not always.

. . The troubling responses are "no work available" when the schedulers have 600K tasks, or "you have reached your limit" when the caches are actually getting quite low. It seems evident to me that there is some kind of issue with the schedulers themselves, and I am not convinced that BOINC is the source of the continuing problem. You certainly know the system at SETI HQ better than I do, but I can only judge by what I observe on my own rigs.

. . The thing that is muddying the waters is the disparity amongst users in what provokes some relief from the work starvation. Tricks that work for some users do not work for others, and vice versa. I am guessing that is largely down to the differences between the versions of BOINC in use. Wiggo constantly reminds us that he is having no problems, and he is using an older version (all I can remember is that it ends in .60). I am almost tempted to try rolling back my version of BOINC :)

Stephen

? ?
ID: 1927410 · Report as offensive
Profile Chris904395093209d Project Donor
Volunteer tester

Joined: 1 Jan 01
Posts: 112
Credit: 29,923,129
RAC: 6
United States
Message 1927419 - Posted: 31 Mar 2018, 1:54:23 UTC

I checked 2 of my machines tonight; they both received the "project has no tasks available" message. My fastest PC takes 8-10 hours to run out of CPU work, so I'm sure my machines will eventually replenish. But I always find it strange when a known issue keeps happening with no real news as to why - at least none that I've seen anyway.
~Chris

ID: 1927419 · Report as offensive