Panic Mode On (5) Server Problems!

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 642100 - Posted: 16 Sep 2007, 5:46:03 UTC


Yep, looks as though someone's been able to kick start things. I'll give it a few more hours to settle down & then enable network access again.
Grant
Darwin NT
ID: 642100
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 642127 - Posted: 16 Sep 2007, 6:37:04 UTC

I still have a couple stuck but most are getting through. Guess the two stuck are just being stubborn. Guess I'll worry about it when I sober.....uhh wake up.


PROUD MEMBER OF Team Starfire World BOINC
ID: 642127
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 642160 - Posted: 16 Sep 2007, 8:04:02 UTC


Just re-enabled network access; even though traffic is still extremely high (up & down), all results returned first try with no errors, & the same for new work downloading.
Grant
Darwin NT
ID: 642160
Clyde C. Phillips, III

Joined: 2 Aug 00
Posts: 1851
Credit: 5,955,047
RAC: 0
United States
Message 642493 - Posted: 16 Sep 2007, 18:09:20 UTC

It looks like Richard Haselgrove's picture was taken just before the end of the workday - at ten-'til-four, PM. It wasn't ten-twenty AM because the hour hand would've been positioned incorrectly. It would've been dark had the PMs and AMs been reversed. I conclude that the cops would have had a happy day unless all there had drunk in moderation or had taken a bus, taxi or train home.
ID: 642493
Dingo
Volunteer tester
Joined: 28 Jun 99
Posts: 104
Credit: 16,364,896
RAC: 1
Australia
Message 642567 - Posted: 16 Sep 2007, 19:43:59 UTC - in response to Message 642493.  

It looks like Richard Haselgrove's picture was taken just before the end of the workday - at ten-'til-four, PM. It wasn't ten-twenty AM because the hour hand would've been positioned incorrectly. It would've been dark had the PMs and AMs been reversed. I conclude that the cops would have had a happy day unless all there had drunk in moderation or had taken a bus, taxi or train home.


SETI usually comes back up quickly and it is very stable. I think that the admins do a great job under pressure from all of us jumping in at the first sign of work not coming to us.


Have a look at my WebCam
ID: 642567
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 642890 - Posted: 17 Sep 2007, 9:24:18 UTC


Just had a look at the network graphs - things aren't looking healthy at all.
Grant
Darwin NT
ID: 642890
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 642893 - Posted: 17 Sep 2007, 9:49:34 UTC

And the splitters all went splat before the server status page last updated itself, three hours ago....
ID: 642893
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 642899 - Posted: 17 Sep 2007, 10:05:54 UTC - in response to Message 642893.  

And the splitters all went splat before the server status page last updated itself, three hours ago....

Good thing it's Monday over there, another 4-5 hours & they can inspect the patient physically.
Grant
Darwin NT
ID: 642899
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 642904 - Posted: 17 Sep 2007, 10:31:23 UTC

Please....Mister....

How's about just a few "good ole" stuck wus. Ah...the stuck wu.....those were the "good days".....

LOL
ID: 642904
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65746
Credit: 55,293,173
RAC: 49
United States
Message 643035 - Posted: 17 Sep 2007, 15:52:32 UTC - in response to Message 642899.  

And the splitters all went splat before the server status page last updated itself, three hours ago....

Good thing it's Monday over there, another 4-5 hours & they can inspect the patient physically.

As long as the patient doesn't turn into a Munster named Herman - couldn't resist :D. Hopefully what is off can be turned back on.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 643035
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 643045 - Posted: 17 Sep 2007, 16:05:11 UTC - in response to Message 643035.  

And the splitters all went splat before the server status page last updated itself, three hours ago....

Good thing it's Monday over there, another 4-5 hours & they can inspect the patient physically.

As long as the patient doesn't turn into a Munster named Herman - couldn't resist :D. Hopefully what is off can be turned back on.

Maybe the patient should be named Abby Normal.

"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 643045
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 643226 - Posted: 17 Sep 2007, 21:37:19 UTC - in response to Message 643045.  

And the splitters all went splat before the server status page last updated itself, three hours ago....

Good thing it's Monday over there, another 4-5 hours & they can inspect the patient physically.

As long as the patient doesn't turn into a Munster named Herman - couldn't resist :D. Hopefully what is off can be turned back on.

Maybe the patient should be named Abby Normal.

... or Hans Delbrück.
ID: 643226
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 646250 - Posted: 22 Sep 2007, 2:01:27 UTC


What is it about the weekends? Are people making their caches even bigger still? That should increase the moaning about pending credit even further.

Network traffic has had a case of the stutters for about 19 hours; the Ready to Send buffer has been steadily dropping and the Result Creation Rate has increased in response, but the In Progress number still continues to climb (almost up to 2.2 million).
I've been getting quite a few short Work Units lately, but I wouldn't have thought that alone would cause such a large increase in the demand for work.
Grant
Darwin NT
ID: 646250
archae86

Joined: 31 Aug 99
Posts: 909
Credit: 1,582,816
RAC: 0
United States
Message 646268 - Posted: 22 Sep 2007, 2:44:19 UTC - in response to Message 646250.  

I've been getting quite a few short Work Units lately, but I wouldn't have thought that alone would cause such a large increase in the demand for work.

It would cause an absolutely huge swing in demand if everyone were running short queues.

I infer from the fairly rapid swing that an appreciable fraction of the compute resource is short-queued.

Since we spend a lot of time here decrying the "selfish" long-queue folks, perhaps this is a good moment to recognize that they are a moderating influence on this particular aspect of the problem.

ID: 646268
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65746
Credit: 55,293,173
RAC: 49
United States
Message 646295 - Posted: 22 Sep 2007, 3:21:37 UTC - in response to Message 646268.  
Last modified: 22 Sep 2007, 3:29:36 UTC

I've been getting quite a few short Work Units lately, but I wouldn't have thought that alone would cause such a large increase in the demand for work.

It would cause an absolutely huge swing in demand if everyone were running short queues.

I infer from the fairly rapid swing that an appreciable fraction of the compute resource is short-queued.

Since we spend a lot of time here decrying the "selfish" long-queue folks, perhaps this is a good moment to recognize that they are a moderating influence on this particular aspect of the problem.

The bandwidth does seem a bit limited during the last download; most of the time it's "system connect". It's not really affecting my quads, but my dual cores lose some processing ability, and PC5 (E4300) is trying to get 5 WUs, whereas PC1 (QX6700) only wants one WU right now. Below is an example from PC1. I put PC5 online to help increase my crunching, and so far it hasn't done much besides try to climb a slippery slope very slowly before sliding back some. :(

9/21/2007 7:42:14 PM|SETI@home|[file_xfer] Started download of file 04mr07ab.20521.4162.15.6.51
9/21/2007 7:42:18 PM|SETI@home|Sending scheduler request: To report completed tasks
9/21/2007 7:42:18 PM|SETI@home|Reporting 1 tasks
9/21/2007 7:42:23 PM|SETI@home|Scheduler RPC succeeded [server version 511]
9/21/2007 7:42:23 PM|SETI@home|Deferring communication 11 sec, because requested by project
9/21/2007 7:43:01 PM|SETI@home|[file_xfer] Finished download of file 04mr07ab.20521.4162.15.6.51
9/21/2007 7:43:01 PM|SETI@home|[file_xfer] Throughput 8273 bytes/sec

I almost feel like going to No New Tasks and aborting the potential downloads for a while. :(
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 646295
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 646333 - Posted: 22 Sep 2007, 5:29:27 UTC - in response to Message 646295.  

The bandwidth does seem a bit limited during the last download; most of the time it's "system connect".

Yep, starting to get some "system connect" errors here as well, although so far they are all downloading OK after a couple of attempts.

Grant
Darwin NT
ID: 646333
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 646463 - Posted: 22 Sep 2007, 10:16:08 UTC - in response to Message 646268.  

I've been getting quite a few short Work Units lately, but I wouldn't have thought that alone would cause such a large increase in the demand for work.

It would cause an absolutely huge swing in demand if everyone were running short queues.

I infer from the fairly rapid swing that an appreciable fraction of the compute resource is short-queued.

Since we spend a lot of time here decrying the "selfish" long-queue folks, perhaps this is a good moment to recognize that they are a moderating influence on this particular aspect of the problem.

I'm one of the ones who has been implicitly criticising excessively large cache sizes, but I don't think that cache size, per se, affects work demand in the way you imply.

Whether your cache is set at 0.1 day or 10 days, BOINC will request new work when the threshold is crossed - when the work on hand drops below 0.099 or 9.999 days respectively. It's the size of the WU that's issued that determines how far above the threshold the queue becomes, and hence how long your computer will 'go quiet' in terms of work requests. If the scheduler is issuing 3-hour WUs (at your estimated crunch speed), then your work requests will be inhibited for 3 hours. If the scheduler is issuing 30-minute WUs, you'll be back for more in half an hour.
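
To put rough numbers on that (a back-of-envelope sketch, not BOINC's actual work-fetch code; the function and figures below are only illustrative):

def hours_until_next_request(cache_days, wu_hours):
    # Once the buffer drops below the cache setting, the client asks for work
    # and is issued one WU estimated at wu_hours. The buffer is now the cache
    # setting plus that WU, so the next request comes only after the extra
    # work has been crunched away - i.e. after roughly wu_hours, whatever the
    # cache setting is.
    buffer_after_fetch = cache_days * 24 + wu_hours   # hours of work on hand
    threshold = cache_days * 24                       # ask again below this
    return buffer_after_fetch - threshold             # = wu_hours

for cache in (0.1, 10.0):
    for wu in (3.0, 0.5):
        print(f"{cache:>4} day cache, {wu} h WUs -> quiet for "
              f"{hours_until_next_request(cache, wu):.1f} h between requests")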

There's a secondary effect because of the variable accuracy of the 'crunchability' estimates for different WUs, reflected in RDCF. When you complete a particularly indigestible WU, RDCF jumps up. If you have a large cache, that makes a big (absolute) difference to the estimated amount of work on hand, and work requests will be inhibited for a long time - several hours or even days. Then, as you crunch on 'sweeter' WUs, RDCF will fall, and you will start to request top-up work at a slightly greater rate than your actual crunching speed. So the work requests of large-cache hosts will be slightly more erratic, but the variation depends on what is being crunched, not on what is being issued. Demand is decoupled from supply, and I don't think the RDCF effect has a significant impact on server performance.
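
In the same back-of-envelope spirit (again a simplification, not BOINC internals; the 20% jump is just an example figure), the estimated work on hand scales with RDCF, so the same jump stalls a large cache for far longer than a small one:

def inhibition_days(cache_days, rdcf_jump):
    # Extra days of *estimated* work created when RDCF rises by rdcf_jump on a
    # buffer that currently looks like cache_days days of work; work requests
    # are held off until roughly that much has been crunched away.
    return cache_days * (rdcf_jump - 1.0)

for cache in (0.5, 4.0, 10.0):
    print(f"{cache:>4} day cache, RDCF up 20% -> requests held off about "
          f"{inhibition_days(cache, 1.2):.1f} days")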

I'm concerned about the (very) large caches because of the stress they put on the Berkeley servers. Results "in progress" peaked at 2,155,052 overnight. That must put a tremendous strain on every component of the work allocation and download sub-system: everything from feeder queries to file system directory lookups. I would suggest that, even with the outages we've seen recently, a 2 or 3 day cache would be a more rational choice.
ID: 646463
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 646466 - Posted: 22 Sep 2007, 10:26:50 UTC - in response to Message 646463.  

I would suggest that, even with the outages we've seen recently, a 2 or 3 day cache would be a more rational choice.

My cache is set to 4 days, which, with the RDCF moving about due to different Work Units, gives me around a 3.5 day turnaround time.
I've only run out of work twice in the last 3 years or so.
Grant
Darwin NT
ID: 646466
archae86

Joined: 31 Aug 99
Posts: 909
Credit: 1,582,816
RAC: 0
United States
Message 646484 - Posted: 22 Sep 2007, 12:11:48 UTC - in response to Message 646463.  

Since we spend a lot of time here decrying the "selfish" long-queue folks, perhaps this is a good moment to recognize that they are a moderating influence on this particular aspect of the problem.

I'm one of the ones who has been implicitly criticising excessively large cache sizes, but I don't think that cache size, per se, affects work demand in the way you imply.

Whether your cache is set at 0.1 day or 10 days, BOINC will request new work when the threshold is crossed - when the work on hand drops below 0.099 or 9.999 days respectively. It's the size of the WU that's issued that determines how far above the threshold the queue becomes, and hence how long your computer will 'go quiet' in terms of work requests. If the scheduler is issuing 3-hour WUs (at your estimated crunch speed), then your work requests will be inhibited for 3 hours. If the scheduler is issuing 30-minute WUs, you'll be back for more in half an hour.

Thanks Richard. I think I had that wrong.


ID: 646484
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 646570 - Posted: 22 Sep 2007, 15:21:35 UTC - in response to Message 646484.  

Since we spend a lot of time here decrying the "selfish" long-queue folks, perhaps this is a good moment to recognize that they are a moderating influence on this particular aspect of the problem.

I'm one of the ones who has been implicitly criticising excessively large cache sizes, but I don't think that cache size, per se, affects work demand in the way you imply.

Whether your cache is set at 0.1 day or 10 days, BOINC will request new work when the threshold is crossed - when the work on hand drops below 0.099 or 9.999 days respectively. It's the size of the WU that's issued that determines how far above the threshold the queue becomes, and hence how long your computer will 'go quiet' in terms of work requests. If the scheduler is issuing 3-hour WUs (at your estimated crunch speed), then your work requests will be inhibited for 3 hours. If the scheduler is issuing 30-minute WUs, you'll be back for more in half an hour.

Thanks Richard. I think I had that wrong.

Consider a machine which does work of AR (angle range) 0.41 in about 3 hours and has a DCF (Duration Correction Factor) which happens to be right for that AR. Then suppose the splitters start producing AR 1.5 work. The estimates say they will take 1/6 the time of AR 0.41, or 1/2 hour. So a request for 3 hours of work will get 6 WUs. Because the actual crunch time is about 1/4 the time of AR 0.41, completion of one of the AR 1.5 WUs will increase DCF by about 50%. If the queue looked like 1 day just before completion of that WU, it looks like 1.5 days after. That inhibits work requests for about half a day if the desired queue was 1 day, and larger queue settings magnify that effect.

However, a host with a queue of 2 days may download a lot of the AR 1.5 WUs before actually crunching one. A host with maximized queue settings may go into EDF (Earliest Deadline First) and get the DCF adjustment fairly soon.
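
Putting those numbers into a quick sketch (assuming the simplified rule that DCF jumps straight to actual/estimated time when a task runs long - an approximation, not the exact client code):

ar041_hours = 3.0                  # actual crunch time for AR 0.41 on this host
est_ar15    = ar041_hours / 6      # estimated time for AR 1.5 work = 0.5 h
actual_ar15 = ar041_hours / 4      # actual crunch time for AR 1.5 = 0.75 h

dcf_factor = actual_ar15 / est_ar15        # = 1.5, i.e. DCF rises about 50%
wus_per_request = ar041_hours / est_ar15   # a "3 hours of work" request gets 6 WUs

queue_before = 1.0                         # days the queue appeared to hold
queue_after  = queue_before * dcf_factor   # now looks like 1.5 days

print(f"{wus_per_request:.0f} WUs per request, DCF x{dcf_factor}, "
      f"queue looks like {queue_before} -> {queue_after} days "
      f"(work requests held off roughly {queue_after - queue_before} days)")
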
Joe
ID: 646570