Panic Mode On (5) Server Problems!

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 642100 - Posted: 16 Sep 2007, 5:46:03 UTC


Yep, looks as though someone's been able to kick start things. I'll give it a few more hours to settle down & then enable network access again.
Grant
Darwin NT
ID: 642100
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 642127 - Posted: 16 Sep 2007, 6:37:04 UTC

I still have a couple stuck but most are getting through. Guess the two stuck are just being stubborn. Guess I'll worry about it when I sober.....uhh wake up.


PROUD MEMBER OF Team Starfire World BOINC
ID: 642127
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 642160 - Posted: 16 Sep 2007, 8:04:02 UTC


Just re-enabled network access; even though traffic is still extremely high (up & down), all results returned first try with no errors, & the same for new work downloading.
Grant
Darwin NT
ID: 642160
Clyde C. Phillips, III

Joined: 2 Aug 00
Posts: 1851
Credit: 5,955,047
RAC: 0
United States
Message 642493 - Posted: 16 Sep 2007, 18:09:20 UTC

It looks like Richard Haselgrove's picture was taken just before the end of the workday - at ten-'til-four, PM. It wasn't ten-twenty AM because the hour hand would've been positioned incorrectly. It would've been dark had the PMs and AMs been reversed. I conclude that the cops would have had a happy day unless all there had drunk in moderation or had taken a bus, taxi or train home.
ID: 642493
Dingo
Volunteer tester
Joined: 28 Jun 99
Posts: 104
Credit: 16,364,896
RAC: 1
Australia
Message 642567 - Posted: 16 Sep 2007, 19:43:59 UTC - in response to Message 642493.  

It looks like Richard Haselgrove's picture was taken just before the end of the workday - at ten-'til-four, PM. It wasn't ten-twenty AM because the hour hand would've been positioned incorrectly. It would've been dark had the PMs and AMs been reversed. I conclude that the cops would have had a happy day unless all there had drunk in moderation or had taken a bus, taxi or train home.


SETI usually comes back up quickly and it is very stable. I think that the admins do a great job under pressure from all of us jumping in at the first sign of work not coming to us.


Have a look at my WebCam
ID: 642567
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 642890 - Posted: 17 Sep 2007, 9:24:18 UTC


Just had a look at the network graphs - things aren't looking healthy at all.
Grant
Darwin NT
ID: 642890
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 642893 - Posted: 17 Sep 2007, 9:49:34 UTC

And the splitters all went splat before the server status page last updated itself, three hours ago....
ID: 642893
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 642899 - Posted: 17 Sep 2007, 10:05:54 UTC - in response to Message 642893.  

And the splitters all went splat before the server status page last updated itself, three hours ago....

Good thing it's Monday over there, another 4-5 hours & they can inspect the patient physically.
Grant
Darwin NT
ID: 642899
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 642904 - Posted: 17 Sep 2007, 10:31:23 UTC

Please....Mister....

How's about just a few "good ole" stuck wus. Ah...the stuck wu.....those were the "good days".....

LOL
ID: 642904
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65746
Credit: 55,293,173
RAC: 49
United States
Message 643035 - Posted: 17 Sep 2007, 15:52:32 UTC - in response to Message 642899.  

And the splitters all went splat before the server status page last updated itself, three hours ago....

Good thing it's Monday over there, another 4-5 hours & they can inspect the patient physically.

As long as the patient doesn't turn into a Munster named Herman - couldn't resist :D. Hopefully what is off can be turned back on.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 643035
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 643045 - Posted: 17 Sep 2007, 16:05:11 UTC - in response to Message 643035.  

And the splitters all went splat before the server status page last updated itself, three hours ago....

Good thing it's Monday over there, another 4-5 hours & they can inspect the patient physically.

As long as the patient doesn't turn into a Munster named Herman - couldn't resist :D. Hopefully what is off can be turned back on.

Maybe the patient should be named Abby Normal.

"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 643045
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 643226 - Posted: 17 Sep 2007, 21:37:19 UTC - in response to Message 643045.  

And the splitters all went splat before the server status page last updated itself, three hours ago....

Good thing it's Monday over there, another 4-5 hours & they can inspect the patient physically.

As long as the patient doesn't turn into a Munster named Herman - couldn't resist :D. Hopefully what is off can be turned back on.

Maybe the patient should be named Abby Normal.

... or Hans Delbrück.
ID: 643226
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 646250 - Posted: 22 Sep 2007, 2:01:27 UTC


What is it about the weekends? Are people making their caches even bigger still? That should increase the moaning about pending credit even further.

Network traffic has had a case of the stutters for about 19 hours; the Ready to Send buffer has been steadily dropping and the Result Creation Rate has increased in response, but the In Progress number still continues to climb (almost up to 2.2 million).
I've been getting quite a few short Work Units lately, but I wouldn't have thought that alone would cause such a large increase in the demand for work.
Grant
Darwin NT
ID: 646250
archae86

Joined: 31 Aug 99
Posts: 909
Credit: 1,582,816
RAC: 0
United States
Message 646268 - Posted: 22 Sep 2007, 2:44:19 UTC - in response to Message 646250.  

I've been getting quite a few short Work Units lately, but I wouldn't have thought that alone would cause such a large increase in the demand for work.

It would cause an absolutely huge swing in demand if everyone were running short queues.

I infer from the fairly rapid swing that an appreciable fraction of the compute resource is short-queued.

Since we spend a lot of time here decrying the "selfish" long-queue folks, perhaps this is a good moment to recognize that they are a moderating influence on this particular aspect of the problem.

ID: 646268
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65746
Credit: 55,293,173
RAC: 49
United States
Message 646295 - Posted: 22 Sep 2007, 3:21:37 UTC - in response to Message 646268.  
Last modified: 22 Sep 2007, 3:29:36 UTC

I've been getting quite a few short Work Units lately, but I wouldn't have thought that alone would cause such a large increase in the demand for work.

It would cause an absolutely huge swing in demand if everyone were running short queues.

I infer from the fairly rapid swing that an appreciable fraction of the compute resource is short-queued.

Since we spend a lot of time here decrying the "selfish" long-queue folks, perhaps this is a good moment to recognize that they are a moderating influence on this particular aspect of the problem.

The bandwidth does seem a bit limited during the last download; most of the time it's "system connect". It's not really affecting my quads, but my dual cores lose some processing ability, and PC5 (E4300) is trying to get 5 WUs, whereas PC1 (QX6700) only wants one WU right now. Below is an example from PC1. I put PC5 online to help increase my crunching, and so far it hasn't done much besides try to climb a slippery slope very slowly before sliding back some. :(

9/21/2007 7:42:14 PM|SETI@home|[file_xfer] Started download of file 04mr07ab.20521.4162.15.6.51
9/21/2007 7:42:18 PM|SETI@home|Sending scheduler request: To report completed tasks
9/21/2007 7:42:18 PM|SETI@home|Reporting 1 tasks
9/21/2007 7:42:23 PM|SETI@home|Scheduler RPC succeeded [server version 511]
9/21/2007 7:42:23 PM|SETI@home|Deferring communication 11 sec, because requested by project
9/21/2007 7:43:01 PM|SETI@home|[file_xfer] Finished download of file 04mr07ab.20521.4162.15.6.51
9/21/2007 7:43:01 PM|SETI@home|[file_xfer] Throughput 8273 bytes/sec

I almost feel like going to No New Tasks and aborting the potential downloads for a while. :(
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 646295
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 646333 - Posted: 22 Sep 2007, 5:29:27 UTC - in response to Message 646295.  

The bandwidth does seem a bit limited during the last download; most of the time it's "system connect".

Yep, starting to get some "system connect" errors here as well, although so far they are all downloading OK after a couple of attempts.

Grant
Darwin NT
ID: 646333
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 646463 - Posted: 22 Sep 2007, 10:16:08 UTC - in response to Message 646268.  

I've been getting quite a few short Work Units lately, but I wouldn't have thought that alone would cause such a large increase in the demand for work.

It would cause an absolutely huge swing in demand if everyone were running short queues.

I infer from the fairly rapid swing that an appreciable fraction of the compute resource is short-queued.

Since we spend a lot of time here decrying the "selfish" long-queue folks, perhaps this is a good moment to recognize that they are a moderating influence on this particular aspect of the problem.

I'm one of the ones who has been implicitly criticising excessively large cache sizes, but I don't think that cache size, per se, affects work demand in the way you imply.

Whether your cache is set at 0.1 day or 10 days, BOINC will request new work when the threshold is crossed - when the work on hand drops below 0.099 or 9.999 days respectively. It's the size of the WU that's issued that determines how far above the threshold the queue becomes, and hence how long your computer will 'go quiet' in terms of work requests. If the scheduler is issuing 3-hour WUs (at your estimated crunch speed), then your work requests will be inhibited for 3 hours. If the scheduler is issuing 30-minute WUs, you'll be back for more in half an hour.
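
To put rough numbers on that (a back-of-envelope sketch, not BOINC's actual work-fetch code; the function and figures below are only illustrative):

def hours_until_next_request(cache_days, wu_hours):
    # Once the buffer drops below the cache setting, the client asks for work
    # and is issued one WU estimated at wu_hours. The buffer is now the cache
    # setting plus that WU, so the next request comes only after the extra
    # work has been crunched away - i.e. after roughly wu_hours, whatever the
    # cache setting is.
    buffer_after_fetch = cache_days * 24 + wu_hours   # hours of work on hand
    threshold = cache_days * 24                       # ask again below this
    return buffer_after_fetch - threshold             # = wu_hours

for cache in (0.1, 10.0):
    for wu in (3.0, 0.5):
        print(f"{cache:>4} day cache, {wu} h WUs -> quiet for "
              f"{hours_until_next_request(cache, wu):.1f} h between requests")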

There's a secondary effect because of the variable accuracy of the 'crunchability' estimates for different WUs, reflected in RDCF. When you complete a particularly indigestible WU, RDCF jumps up. If you have a large cache, that makes a big (absolute) difference to the estimated amount of work on hand, and work requests will be inhibited for a long time - several hours or even days. Then, as you crunch on 'sweeter' WUs, RDCF will fall, and you will start to request top-up work at a slightly greater rate than your actual crunching speed. So the work requests of large-cache hosts will be slightly more erratic, but the variation depends on what is being crunched, not on what is being issued. Demand is decoupled from supply, and I don't think the RDCF effect has a significant impact on server performance.
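
In the same back-of-envelope spirit (again a simplification, not BOINC internals; the 20% jump is just an example figure), the estimated work on hand scales with RDCF, so the same jump stalls a large cache for far longer than a small one:

def inhibition_days(cache_days, rdcf_jump):
    # Extra days of *estimated* work created when RDCF rises by rdcf_jump on a
    # buffer that currently looks like cache_days days of work; work requests
    # are held off until roughly that much has been crunched away.
    return cache_days * (rdcf_jump - 1.0)

for cache in (0.5, 4.0, 10.0):
    print(f"{cache:>4} day cache, RDCF up 20% -> requests held off about "
          f"{inhibition_days(cache, 1.2):.1f} days")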

I'm concerned about the (very) large caches because of the stress they put on the Berkeley servers. Results "in progress" peaked at 2,155,052 overnight. That must put a tremendous strain on every component of the work allocation and download sub-system: everything from feeder queries to file system directory lookups. I would suggest that, even with the outages we've seen recently, a 2 or 3 day cache would be a more rational choice.
ID: 646463
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 646466 - Posted: 22 Sep 2007, 10:26:50 UTC - in response to Message 646463.  

I would suggest that, even with the outages we've seen recently, a 2 or 3 day cache would be a more rational choice.

My cache is set to 4 days, which, with the RDCF moving about due to different Work Units, gives me around a 3.5 day turnaround time.
I've only run out of work twice in the last 3 years or so.
Grant
Darwin NT
ID: 646466
archae86

Joined: 31 Aug 99
Posts: 909
Credit: 1,582,816
RAC: 0
United States
Message 646484 - Posted: 22 Sep 2007, 12:11:48 UTC - in response to Message 646463.  

Since we spend a lot of time here decrying the "selfish" long-queue folks, perhaps this is a good moment to recognize that they are a moderating influence on this particular aspect of the problem.

I'm one of the ones who has been implicitly criticising excessively large cache sizes, but I don't think that cache size, per se, affects work demand in the way you imply.

Whether your cache is set at 0.1 day or 10 days, BOINC will request new work when the threshold is crossed - when the work on hand drops below 0.099 or 9.999 days respectively. It's the size of the WU that's issued that determines how far above the threshold the queue becomes, and hence how long your computer will 'go quiet' in terms of work requests. If the scheduler is issuing 3-hour WUs (at your estimated crunch speed), then your work requests will be inhibited for 3 hours. If the scheduler is issuing 30-minute WUs, you'll be back for more in half an hour.

Thanks Richard. I think I had that wrong.


ID: 646484
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 646570 - Posted: 22 Sep 2007, 15:21:35 UTC - in response to Message 646484.  

Since we spend a lot of time here decrying the "selfish" long-queue folks, perhaps this is a good moment to recognize that they are a moderating influence on this particular aspect of the problem.

I'm one of the ones who has been implicitly criticising excessively large cache sizes, but I don't think that cache size, per se, affects work demand in the way you imply.

Whether your cache is set at 0.1 day or 10 days, BOINC will request new work when the threshold is crossed - when the work on hand drops below 0.099 or 9.999 days respectively. It's the size of the WU that's issued that determines how far above the threshold the queue becomes, and hence how long your computer will 'go quiet' in terms of work requests. If the scheduler is issuing 3-hour WUs (at your estimated crunch speed), then your work requests will be inhibited for 3 hours. If the scheduler is issuing 30-minute WUs, you'll be back for more in half an hour.

Thanks Richard. I think I had that wrong.

Consider a machine which does work of AR (angle range) 0.41 in about 3 hours and has a DCF (Duration Correction Factor) which happens to be right for that AR. Then suppose the splitters start producing AR 1.5 work. The estimates say they will take 1/6 the time of AR 0.41, or 1/2 hour. So a request for 3 hours of work will get 6 WUs. Because the actual crunch time is about 1/4 the time of AR 0.41, completion of one of the AR 1.5 WUs will increase DCF by about 50%. If the queue looked like 1 day just before completion of that WU, it looks like 1.5 days after. That inhibits work requests for about half a day if the desired queue was 1 day, and larger queue settings magnify that effect.

However, a host with a queue of 2 days may download a lot of the AR 1.5 WUs before actually crunching one. A host with maximized queue settings may go into EDF (Earliest Deadline First) and get the DCF adjustment fairly soon.
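
Putting those numbers into a quick sketch (assuming the simplified rule that DCF jumps straight to actual/estimated time when a task runs long - an approximation, not the exact client code):

ar041_hours = 3.0                  # actual crunch time for AR 0.41 on this host
est_ar15    = ar041_hours / 6      # estimated time for AR 1.5 work = 0.5 h
actual_ar15 = ar041_hours / 4      # actual crunch time for AR 1.5 = 0.75 h

dcf_factor = actual_ar15 / est_ar15        # = 1.5, i.e. DCF rises about 50%
wus_per_request = ar041_hours / est_ar15   # a "3 hours of work" request gets 6 WUs

queue_before = 1.0                         # days the queue appeared to hold
queue_after  = queue_before * dcf_factor   # now looks like 1.5 days

print(f"{wus_per_request:.0f} WUs per request, DCF x{dcf_factor}, "
      f"queue looks like {queue_before} -> {queue_after} days "
      f"(work requests held off roughly {queue_after - queue_before} days)")
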
Joe
ID: 646570