Panic Mode On (104) Server Problems?

Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1843144 - Posted: 20 Jan 2017, 3:08:01 UTC

Well, I am finally at server limits on all machines, but it took 20 hours to fill the Windows 7 machines back to full strength. It will be interesting to see if I repeat this week's outage problem next Tuesday. I'm curious whether adding the RFC1323 options to the registry since then will have any effect.
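(For anyone wanting to try the same tweak: the exact registry entries Keith added aren't spelled out here, but the RFC1323 options, window scaling and timestamps, are classically controlled by the Tcp1323Opts DWORD under the Tcpip\Parameters key. A minimal Python sketch of setting it, assuming an elevated prompt and that Windows 7 still honours the classic value; a reboot is needed before it takes effect.)

```python
# Sketch only: sets the classic RFC 1323 option on Windows via the registry.
# Must be run from an elevated (Administrator) Python; reboot afterwards.
import winreg

TCP_PARAMS = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

# Tcp1323Opts: 0 = disabled, 1 = window scaling only, 2 = timestamps only, 3 = both
with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TCP_PARAMS, 0,
                    winreg.KEY_QUERY_VALUE | winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "Tcp1323Opts", 0, winreg.REG_DWORD, 3)
    value, _ = winreg.QueryValueEx(key, "Tcp1323Opts")
    print("Tcp1323Opts set to", value, "- reboot for the TCP stack to pick it up")
```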
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1843145 - Posted: 20 Jan 2017, 3:23:21 UTC - in response to Message 1843144.  

Well, I am finally at server limits on all machines, but it took 20 hours to fill the Windows 7 machines back to full strength. It will be interesting to see if I repeat this week's outage problem next Tuesday. I'm curious whether adding the RFC1323 options to the registry since then will have any effect.

Just don't forget that you cannot expect an instant recovery after the outage.
Especially if the dang thing runs over 11 hours like this week's did.
Such long outages create many, many hungry mouths to feed when coming back up.
So even if the ready-to-send cache starts to fill up, the scheduler and feeder are getting run ragged trying to fill work requests.
I realize I am pointing out the obvious.
And I am away at work during and after the outages, so I cannot monitor how quickly my rigs get their fill.
I do know that this week the outage had ended not long before I got home, and my caches were down.

Meow.
"Time is simply the mechanism that keeps everything from happening all at once."

Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1843149 - Posted: 20 Jan 2017, 3:33:27 UTC

Yes, you stated the obvious. I recognize the many mouths that need to be fed after the outage. What I can't explain is why the requests now are filled 4 tasks at a time instead of the normal 41-46 tasks at a time. Unless the feeder server is now fragmenting the requests to lower limits and feeding more simultaneous requests out of the 100 task buffer, I can't explain why the number of tasks delivered per request has dropped so significantly. Has the project increased the number of simultaneous connections to the server buffer?
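For a rough picture of why the per-request numbers can collapse: the feeder keeps only about 100 results at a time in a shared-memory buffer and the scheduler hands work out from there, so when a crowd of hungry hosts hits it right after an outage, each request only gets whatever slots happen to be filled at that moment. A toy model of that effect, with invented refill and request rates rather than anything taken from the real server code:

```python
BUFFER_SLOTS = 100   # size of the feeder's shared-memory result array (assumed)

def avg_tasks_per_request(requests_per_refill, want=46, cycles=1000):
    """Average tasks handed out per scheduler request, given how many
    requests arrive between consecutive feeder refills."""
    handed_out = []
    for _ in range(cycles):
        buffer = BUFFER_SLOTS                   # feeder tops the buffer back up
        for _ in range(requests_per_refill):    # hosts each asking for a full refill
            got = min(want, buffer)
            buffer -= got
            handed_out.append(got)
    return sum(handed_out) / len(handed_out)

print("quiet day  :", avg_tasks_per_request(2))    # ~46 tasks per request
print("post-outage:", avg_tasks_per_request(25))   # ~4 tasks per request
```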
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 11416
Credit: 29,581,041
RAC: 66
United States
Message 1843150 - Posted: 20 Jan 2017, 3:34:36 UTC

To those crunchers who run out of work during the outage, I have a suggestion: use E@H as your backup project. After all, Seti owes them a lot of love; they have been so generous as to allow us to run Nebula on their supercomputer.
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1856
Credit: 268,616,081
RAC: 1,349
United States
Message 1843162 - Posted: 20 Jan 2017, 4:26:18 UTC - in response to Message 1843149.  

... What I can't explain is why the requests now are filled 4 tasks at a time instead of the normal 41-46 tasks at a time. Unless the feeder server is now fragmenting the requests to lower limits and feeding more simultaneous requests out of the 100 task buffer, I can't explain why the number of tasks delivered per request has dropped so significantly. Has the project increased the number of simultaneous connections to the server buffer?

When I get a refill, I'm still seeing large numbers per session, like always, often ranging from the 40-50s up to the high 100s. Wondering if others who do not experience the issue see this as well? Assuming so, it seems doubtful it's any server changes.
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1843164 - Posted: 20 Jan 2017, 4:33:44 UTC - in response to Message 1843162.  

... What I can't explain is why the requests now are filled 4 tasks at a time instead of the normal 41-46 tasks at a time. Unless the feeder server is now fragmenting the requests to lower limits and feeding more simultaneous requests out of the 100 task buffer, I can't explain why the number of tasks delivered per request has dropped so significantly. Has the project increased the number of simultaneous connections to the server buffer?

When I get a refill, I'm still seeing large numbers per session, like always, often ranging from the 40-50s up to the high 100s. Wondering if others who do not experience the issue see this as well? Assuming so, it seems doubtful it's any server changes.

I doubt there have been any basic scheduler level changes made.
To what end?
Eric's got his hands full just trying to work out the v8 conversion and its foibles.
I doubt anybody would try to throw another wrench in the works by messing with the basic foundation.
"Time is simply the mechanism that keeps everything from happening all at once."

Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1843240 - Posted: 20 Jan 2017, 16:52:02 UTC - in response to Message 1843150.  

Already do. I have Einstein and MilkyWay as backup projects. The problem now with those projects is that they are forcing my computers into High Priority mode for the first time. Any work onboard from those projects seems to force HP, even newly requested work with deadlines 10-14 days out in the future. I've never seen BOINC behave like this before; it's lost its mind.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1843262 - Posted: 20 Jan 2017, 18:50:50 UTC - in response to Message 1843240.  

All I can think of is that it's showing HP mode to meet its Resource Share value. Time for NNT mode.

On another note, my Linux box was having problems getting tasks before the weekly maintenance. Yes/no, no/yes didn't seem to help. I tried switching to web-based prefs only, and switched location too; that didn't seem to help either. I moved everything back to normal and just left it, as I had other things to do.

When I came back and maintenance had started, I had a full 200 tasks (which only lasted 4.5 hours, LOL) and it has been fine ever since.

Maybe going to web prefs straightened it out? IDK, but something reset it.

Worth a try.
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1843268 - Posted: 20 Jan 2017, 19:21:23 UTC

Woke up to find the cache running down again. Changed cache settings to No, Yes, No for a few Scheduler requests, still running down. Changed them back, and work started flowing again.
Grant
Darwin NT
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1843269 - Posted: 20 Jan 2017, 19:23:30 UTC - in response to Message 1843262.  
Last modified: 20 Jan 2017, 19:24:19 UTC

MilkyWay has never been a problem before; it just gets tasks constantly because of the hard server limit of 160 tasks per machine, and I retire tasks faster than I ever let any come close to its deadline, which is two weeks out. I ran into issues with Einstein when they finished the BRP4G work and started on the FGRPB1G work which takes much longer per task and also takes a full CPU core now that it is OpenCL too. As usual, Einstein sent WAY TOO MUCH work initially and I quickly set it to NNT last month.

I only brought in 60 tasks on the last request for work on 1/17 on each machine, and all of them had a deadline of 1/31. This morning all machines thought they had to go HP on Einstein work. Really? Come on, 11 days till deadline and they think they have to finish the work within 3 days of getting it?

I don't understand why BOINC is getting so discombobulated in its process priority. When I turn on the Event Log options I don't see any red flags indicating priority problems, and I've never had any issues with High Priority before in running my 3 projects concurrently since 2011. This is something I've never experienced before. As I've said in this thread, BOINC has lost its mind.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1843302 - Posted: 20 Jan 2017, 22:05:05 UTC - in response to Message 1843268.  

Woke up to find the cache running down again. Changed cache settings to No, Yes, No for a few Scheduler requests, still running down. Changed them back, and work started flowing again.

A while later, the cache is running down again.
Flipped the application settings, and it fills back up & then stays there.
Grant
Darwin NT
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1843309 - Posted: 20 Jan 2017, 22:20:16 UTC - in response to Message 1843269.  
Last modified: 20 Jan 2017, 22:21:48 UTC

This is something I've never experienced before.

Maybe this is why?
I ran into issues with Einstein when they finished the BRP4G work and started on the FGRPB1G work which takes much longer per task and also takes a full CPU core now that it is OpenCL too. As usual, Einstein sent WAY TOO MUCH work initially


I don't understand why BOINC is getting so discombobulated in its process priority.

Maybe this is part of the reason why?
and I quickly set it to NNT last month



Longer-than-expected run times & the loss of a CPU core (or more) mean it would take a while for the Manager to figure out what's going on. Setting NNT would have impacted that process, the result being that it goes into High Priority while it tries to sort things out.
It would have been worth not setting NNT and seeing if it was able to sort itself out sooner. Going with NNT throws a whole new spin on resource allocation as the Manager tries to balance resource share, deadlines and cache settings.

And as Jason keeps pointing out, the estimate components of CreditNew are involved in allocating work. The last week or so of all-Arecibo work, then the return to the Arecibo/Guppie mix, would have thrown work-fetch estimations up, down & sideways, which, combined with the longer-than-estimated run times, you setting NNT and the loss of a CPU core or 2, is a recipe for Manager confusion while it tries to resolve all the contradictory & conflicting requirements.
Just a thought.
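As a rough illustration of that: the client drops into high priority when its internal scheduling simulation predicts a deadline miss, so inflated runtime estimates, a lost CPU core or reduced uptime can trip it even with days to spare. A much-simplified sketch of that check (the real client runs a full round-robin simulation, and every number below is invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Task:
    est_hours_left: float    # the client's current runtime estimate
    deadline_hours: float    # hours until this task's deadline

def predicts_miss(tasks, usable_cores, on_fraction=1.0):
    """True if, working through the tasks in deadline order across the
    usable cores, any task is predicted to finish after its deadline."""
    finish = 0.0
    for t in sorted(tasks, key=lambda t: t.deadline_hours):
        finish += t.est_hours_left / (usable_cores * on_fraction)
        if finish > t.deadline_hours:
            return True
    return False

# 60 tasks whose estimates have crept up to ~6 h each, all due in 11 days,
# one core lost to the OpenCL app's CPU thread, box crunching ~60% of the time:
tasks = [Task(est_hours_left=6.0, deadline_hours=11 * 24) for _ in range(60)]
print(predicts_miss(tasks, usable_cores=2, on_fraction=0.6))   # True -> run high priority
```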
Grant
Darwin NT
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1843341 - Posted: 20 Jan 2017, 23:43:31 UTC - in response to Message 1843309.  


Maybe this is part of the reason why?
and I quickly set it to NNT last month



Longer-than-expected run times & the loss of a CPU core (or more) mean it would take a while for the Manager to figure out what's going on. Setting NNT would have impacted that process, the result being that it goes into High Priority while it tries to sort things out.
It would have been worth not setting NNT and seeing if it was able to sort itself out sooner. Going with NNT throws a whole new spin on resource allocation as the Manager tries to balance resource share, deadlines and cache settings.

And as Jason keeps pointing out, the estimate components of CreditNew are involved in allocating work. The last week or so of all-Arecibo work, then the return to the Arecibo/Guppie mix, would have thrown work-fetch estimations up, down & sideways, which, combined with the longer-than-estimated run times, you setting NNT and the loss of a CPU core or 2, is a recipe for Manager confusion while it tries to resolve all the contradictory & conflicting requirements.
Just a thought.

Thanks for the comments, Grant. The only project I've ever had to set NNT on in the past is Einstein, as it consistently sends more work than can be finished by the deadlines. I never needed it with MilkyWay because of the hard server limits per machine and rapid task completions. For a while, with the BRP4G tasks and a trick of reducing my BOINC disk limits, I was able to avoid setting NNT on Einstein and still keep it from sending too much work, because of that project's large task sizes. Now, with the new work and MUCH smaller tasks, that trick doesn't work anymore and I'm forced to use NNT.

I think you have correctly deduced that CreditNew is the cause of my problems. What you said about the recent deluge of exclusively Arecibo work, and now the return of the normal Arecibo/Guppie mix, makes sense; that is what is confusing BOINC now. Unfortunately it is now impacting even MilkyWay work, which is unexpected because of that project's work limit.

In the meantime I have reduced my project usage by 50% on MilkyWay and Einstein to see if that helps. I probably won't see an immediate effect, because it takes so long for CreditNew to figure out the averages.
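That slowness is roughly what you would expect from the way the runtime estimates are averaged: each completed task only nudges a long-running average a little, so a sudden change in task length takes dozens of results to show up. A toy sketch of that behaviour, where the 0.1 smoothing weight is an assumption rather than the real CreditNew constant:

```python
def update_estimate(est, actual, weight=0.1):
    """Exponential moving average of task run time (hours)."""
    return est + weight * (actual - est)

est = 1.0      # estimate built up over weeks of ~1 h tasks
actual = 2.0   # the new work mix suddenly takes ~2 h per task
for completed in range(1, 31):
    est = update_estimate(est, actual)
    if completed % 10 == 0:
        print(f"after {completed:2d} completed tasks the estimate is {est:.2f} h")
# after 10 tasks ~1.65 h, after 20 ~1.88 h, after 30 ~1.96 h:
# it takes dozens of results before the estimate catches up.
```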
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1843434 - Posted: 21 Jan 2017, 6:00:37 UTC - in response to Message 1843341.  

More Centurion issues?

The Server Status page shows a full Ready-to-send buffer, and it shows 7 GBT splitters running, but the files-being-split display shows only 1 channel in progress. And I've been getting fewer and fewer Guppies over the last few hours.
Still getting new GBT work; it's just a considerably smaller percentage of the total than usual.
Grant
Darwin NT
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1843625 - Posted: 21 Jan 2017, 21:27:08 UTC - in response to Message 1843434.  

No splitters running could be the v8.23 ATI tasks going through. I saw a pile of resends from that earlier.
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1843626 - Posted: 21 Jan 2017, 21:30:51 UTC - in response to Message 1843625.  
Last modified: 21 Jan 2017, 21:31:26 UTC

No splitters running could be the v8.23 ATI tasks going through. I saw a pile of resends from that earlier.

Server Status page still shows only 1 GBT channel being split, but overnight most of the work I got was Guppie, so the earlier almost-all-Arecibo downloads have been offset by the last few hours of downloads. And almost all of it is new work, very few resends.


And so far this morning I haven't had to play with the application settings to keep the cache full (fingers crossed).
Grant
Darwin NT
Profile Bill G Special Project $75 donor
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1843632 - Posted: 21 Jan 2017, 21:37:29 UTC - in response to Message 1843626.  

No splitters running could be the v8.23 ATI tasks going through. I saw a pile of resends from that earlier.

Server Status page still shows only 1 GBT channel being split, but overnight most of the work I got was Guppie, so the earlier almost-all-Arecibo downloads have been offset by the last few hours of downloads. And almost all of it is new work, very few resends.


And so far this morning I haven't had to play with the application settings to keep the cache full (fingers crossed).

I do not know if it is just me, but every time I look I see 7 channels of GBT data being split???? or am I looking somewhere you are not?

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours
Profile Wiggo
Joined: 24 Jan 00
Posts: 36798
Credit: 261,360,520
RAC: 489
Australia
Message 1843633 - Posted: 21 Jan 2017, 21:43:48 UTC - in response to Message 1843632.  

No splitters running could be the v8.23 ATI tasks going through. I saw a pile of resends from that earlier.

Server Status page still shows only 1 GBT channel being split, but overnight most of the work I got was Guppie, so the earlier almost-all-Arecibo downloads have been offset by the last few hours of downloads. And almost all of it is new work, very few resends.


And so far this morning I haven't had to play with the application settings to keep the cache full (fingers crossed).

I do not know if it is just me, but every time I look I see 7 channels of GBT data being split???? or am I looking somewhere you are not?

The thickness of the dark green part at the beginning of a file's "being split" bar indicates how many splitters are working on that 1 file, and sometimes all of them can be working on just 1 file. ;-)

Cheers.
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1843635 - Posted: 21 Jan 2017, 21:48:01 UTC - in response to Message 1843632.  
Last modified: 21 Jan 2017, 21:50:12 UTC

I do not know if it is just me, but every time I look I see 7 channels of GBT data being split???? or am I looking somewhere you are not?

Must be a different place.

Computing, Server Status page. Scroll down to the splitter status, Breakthrough Listen.
It shows 7 files that have completed channels in them (light green), but only 1 channel in progress (actually being split - dark green).
Scroll down to Multibeam (Arecibo) and it shows 6 files with completed channels (light green), and 4 channels in progress (dark green).

Given the amount of GBT work I got overnight, and that the Ready-to-send buffer remains full, I suspect more than one channel is being split; it just isn't being displayed in the splitter status.
Grant
Darwin NT
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1843637 - Posted: 21 Jan 2017, 21:54:48 UTC - in response to Message 1843633.  

The thickness of the dark green part at the beginning of a file's "being split" bar indicates how many splitters are working on that 1 file, and sometimes all of them can be working on just 1 file. ;-)

That's probably it - there are so many channels in a GBT file that a single channel is too small to see. And all the GBT work I've got (other than resends) is from the one file.

So that little sliver of dark green at the start of blc2_2bit_guppi_57423_32060_HIP53824_0017 isn't a single splitter, but all of them on the one file.
Grant
Darwin NT