Panic Mode On (116) Server Problems?

Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1990934 - Posted: 21 Apr 2019, 13:22:30 UTC - in response to Message 1990888.  

For those interested in roughly how much processing power is needed to keep up with the incoming work, a post dated 16 Jan 19 says:
According to BOINCstats, SETI@home has a RAC of around 200,000,000.
A Titan V or GTX 2080 Ti can get close to 2 credits per second.
So 100,000,000 seconds of compute on one of those cards,
in an optimized OS using an optimized application, could do all of one day's SETI@home crunching.
So 1,157 Titan V or GTX 2080 Ti GPUs, crunching full-time, could do all the crunching.

That comes from the thread "Real time number crunching?"; there is more interesting information in that thread.


. . But that is the crunching that the whole project is presently completing in one day, not the data that GBT is producing in one day :). Since the daily output from one GBT channel (blcnn) currently takes about a week to 10 days to get through, and there are at least 24 channels being recorded, one day of GBT data takes about 170 days or more of project throughput to complete. That translates to something like 200,000-plus Titan V or GTX 2080 Ti cards running at full tilt to clear it in one day, which would be "real time" processing. So when we have at least 100,000 active volunteers, each running at least 2 of the above cards 24/7/52, we will be able to keep up with the input data from GBT. Of course we will need the same again to cater for the data from Parkes when it finally comes online, and another smaller contingent to deal with the data that comes from Arecibo. And when we actually do have the volunteer workforce to do all that, where the heck do we find the servers to cope with it????

. . Sadly 'real time' processing is at this point in time nothing more than a pipe dream.
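
. . The arithmetic in both posts can be checked in a few lines. A rough sketch in Python follows; every input is one of the estimates quoted above, not a measured value:

SECONDS_PER_DAY = 24 * 60 * 60                 # 86,400

project_rac = 200_000_000                      # project-wide credits/day (BOINCstats figure)
credits_per_second = 2                         # Titan V / GTX 2080 Ti, optimized apps (estimate)

# GPU-seconds of compute behind one day of project-wide credit
gpu_seconds = project_rac / credits_per_second          # 100,000,000 s

# Cards needed to match one day of current project output
cards_for_project = gpu_seconds / SECONDS_PER_DAY       # ~1,157

# Scaling to the raw GBT data rate: one channel's daily output takes at
# least a week of project throughput, and at least 24 channels are recorded.
days_per_channel = 7                                    # lower bound of "a week to 10 days"
channels = 24
project_days_per_gbt_day = days_per_channel * channels  # 168 days

cards_for_real_time = cards_for_project * project_days_per_gbt_day

print(f"Cards to match current project output: {cards_for_project:,.0f}")   # 1,157
print(f"Cards for real-time GBT processing: {cards_for_real_time:,.0f}")    # ~194,000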

Stephen

:(
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 1991025 - Posted: 22 Apr 2019, 1:39:55 UTC - in response to Message 1990934.  

I am sure that if Jeff, Eric or Matt could give us a list of the parts required, there would be people prepared to start a fundraiser to put together a system that could cope with the work volume.
Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9954
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1991089 - Posted: 22 Apr 2019, 13:18:50 UTC
Last modified: 22 Apr 2019, 13:19:33 UTC

That was easy. I just got 10 of my friends to stop crunching for SETI.


Interesting reaction; mine was to convert my Linux box to Windows 10. I just bought a new SSD and a Win 10 licence (I am letting my tasks run down rather than aborting them), and I will spruce up my 2 old Dells, possibly with new MBs, SSDs and modern processors, during the summer so that I don't have to put the heating on next winter ;-)
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22199
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1991152 - Posted: 23 Apr 2019, 4:55:21 UTC

Right, after a few hours to cool things down, this thread is back to life.
Now remember, don't stray too far from discussing server issues, and do obey all the forum rules.
And above all else, have fun.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 1991173 - Posted: 23 Apr 2019, 9:14:21 UTC
Last modified: 23 Apr 2019, 9:15:51 UTC

It is nice to see the higher return rate of over 127,000; I can only assume this means that the multibeam work is being returned. My pending validation is the highest it has been in a long time, at 180. This is due to having lots of multibeam tasks, which form a band between 3 and 5 minutes per task.
Sleepy
Volunteer tester
Joined: 21 May 99
Posts: 219
Credit: 98,947,784
RAC: 28,360
Italy
Message 1991196 - Posted: 23 Apr 2019, 18:54:40 UTC

We are back!
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1991210 - Posted: 23 Apr 2019, 21:59:07 UTC - in response to Message 1991196.  

We are back!


HURRAY!!!!
A proud member of the OFA (Old Farts Association).
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1991308 - Posted: 24 Apr 2019, 15:53:43 UTC

I'm seeing stalled downloads on all machines.
Of course, you can't download any new work with a stalled download...
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1991309 - Posted: 24 Apr 2019, 16:04:00 UTC - in response to Message 1991308.  

Yes, all my hosts have stalled downloads, with an accumulated 6 hours of elapsed time spent trying to download. So right around 3 AM my local time the servers couldn't access the requested work. I've had this problem for about a week now. The solution is to stop and restart BOINC on the affected hosts; the tasks then finally come down at restart.
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1991313 - Posted: 24 Apr 2019, 16:14:33 UTC - in response to Message 1991309.  

Yes, all my hosts have stalled downloads, with an accumulated 6 hours of elapsed time spent trying to download. So right around 3 AM my local time the servers couldn't access the requested work. I've had this problem for about a week now. The solution is to stop and restart BOINC on the affected hosts; the tasks then finally come down at restart.

I'd be interested to hear how you diagnosed cause and effect: what was stalled, and how did restarting BOINC clear it? FWIW, I've had a look around the systems here (seven machines, four projects) and nothing is stuck, either upload or download. None of the machines has had a BOINC restart since Patch Wednesday, except the single one I'm testing #3076 on. Oh, and one I moved to a different room last week.
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1991314 - Posted: 24 Apr 2019, 16:22:26 UTC

Well, earlier this morning all was well. It just began recently, and it does clear them if I select all of them and abuse the retry button.
However, it is recurring:
Wed 24 Apr 2019 12:17:17 PM EDT | SETI@home | Reporting 5 completed tasks
Wed 24 Apr 2019 12:17:17 PM EDT | SETI@home | Requesting new tasks for NVIDIA GPU
Wed 24 Apr 2019 12:17:17 PM EDT | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Wed 24 Apr 2019 12:17:17 PM EDT | SETI@home | [sched_op] NVIDIA GPU work request: 4619.87 seconds; 0.00 devices
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | Scheduler request completed: got 4 new tasks
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | [sched_op] Server version 709
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | Project requested delay of 303 seconds
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | [sched_op] estimated total CPU task duration: 0 seconds
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 1647 seconds
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task 21ap19ac.2151.10701.5.32.164_1
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task blc35_2bit_guppi_58406_10622_HIP3805_0058.31543.0.23.46.245.vlar_0
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task blc35_2bit_guppi_58406_08938_And_XIV_0053.31575.0.23.46.145.vlar_1
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task blc35_2bit_guppi_58406_08938_And_XIV_0053.31555.818.23.46.194.vlar_1
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task blc35_2bit_guppi_58406_11643_And_XI_0061.31652.0.23.46.252.vlar_0
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | [sched_op] Deferring communication for 00:05:03
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | [sched_op] Reason: requested by project
Wed 24 Apr 2019 12:17:21 PM EDT | SETI@home | Started download of 23ap19aa.6731.16018.10.37.59.vlar
Wed 24 Apr 2019 12:17:21 PM EDT | SETI@home | Started download of 23ap19aa.27470.4566.8.35.63
Wed 24 Apr 2019 12:17:21 PM EDT | SETI@home | Started download of 23ap19aa.6731.16018.10.37.32.vlar
Wed 24 Apr 2019 12:17:21 PM EDT | SETI@home | Started download of 23ap19aa.5757.16427.6.33.68.vlar
Wed 24 Apr 2019 12:17:23 PM EDT | SETI@home | Temporarily failed download of 23ap19aa.6731.16018.10.37.59.vlar: transient HTTP error
Wed 24 Apr 2019 12:17:23 PM EDT | SETI@home | Backing off 00:03:45 on download of 23ap19aa.6731.16018.10.37.59.vlar
Wed 24 Apr 2019 12:17:23 PM EDT | SETI@home | Temporarily failed download of 23ap19aa.27470.4566.8.35.63: transient HTTP error
Wed 24 Apr 2019 12:17:23 PM EDT | SETI@home | Backing off 00:02:14 on download of 23ap19aa.27470.4566.8.35.63
Wed 24 Apr 2019 12:17:23 PM EDT | SETI@home | Temporarily failed download of 23ap19aa.6731.16018.10.37.32.vlar: transient HTTP error
Wed 24 Apr 2019 12:17:23 PM EDT | SETI@home | Backing off 00:03:37 on download of 23ap19aa.6731.16018.10.37.32.vlar
Wed 24 Apr 2019 12:17:23 PM EDT | SETI@home | Temporarily failed download of 23ap19aa.5757.16427.6.33.68.vlar: transient HTTP error
Wed 24 Apr 2019 12:17:23 PM EDT | SETI@home | Backing off 00:03:12 on download of 23ap19aa.5757.16427.6.33.68.vlar
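
The pattern in that log (an instant failure, then a randomized wait of a few minutes before retry) is the client's usual response to a transient HTTP error. A minimal Python sketch of that kind of randomized backoff follows; it mimics the observed behaviour and is NOT BOINC's actual backoff code:

import random

# Illustrative randomized retry backoff, like the waits of roughly 2-4
# minutes after "transient HTTP error" in the log above.

MIN_BACKOFF = 60          # seconds; never retry sooner than this
MAX_BACKOFF = 4 * 3600    # cap so repeated failures can't defer retries forever

def next_backoff(consecutive_failures: int) -> float:
    """Pick a randomized wait for the nth consecutive download failure."""
    ceiling = min(MAX_BACKOFF, MIN_BACKOFF * 2 ** consecutive_failures)
    return random.uniform(MIN_BACKOFF, ceiling)

if __name__ == "__main__":
    for n in range(1, 6):
        print(f"failure {n}: retry in {next_backoff(n) / 60:.1f} min")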
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1991317 - Posted: 24 Apr 2019, 16:41:26 UTC - in response to Message 1991314.  

So, which 'transient HTTP error' is it throwing this time?
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1991319 - Posted: 24 Apr 2019, 16:52:44 UTC - in response to Message 1991317.  

So, which 'transient HTTP error' is it throwing this time?


. . That I cannot say, but I just noticed that it has happened here too at about 1:40 am AEST. It seems to have lasted for a short while but somehow self-corrected.

Stephen

? ?
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 1991320 - Posted: 24 Apr 2019, 19:48:04 UTC

Looks like we're back (forums, at least).
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1991321 - Posted: 24 Apr 2019, 19:59:46 UTC - in response to Message 1991313.  

Yes, all my hosts have stalled downloads, with an accumulated 6 hours of elapsed time spent trying to download. So right around 3 AM my local time the servers couldn't access the requested work. I've had this problem for about a week now. The solution is to stop and restart BOINC on the affected hosts; the tasks then finally come down at restart.
I'd be interested to hear how you diagnosed cause and effect: what was stalled, and how did restarting BOINC clear it? FWIW, I've had a look around the systems here (seven machines, four projects) and nothing is stuck, either upload or download. None of the machines has had a BOINC restart since Patch Wednesday, except the single one I'm testing #3076 on. Oh, and one I moved to a different room last week.

What I am describing is not the normal "stalled" download. The tasks are labelled "active" with no backoff and 0% progress. The elapsed timer continues to count, and by the time I notice them it is on the order of several hours, since they stall during the night while I'm asleep. The fix is to stop BOINC, wait for the client to fully stop in the System Monitor, and then restart BOINC. The host finishes the "stuck" downloads as soon as the client initializes. Normally fewer than half a dozen are affected. The host keeps downloading and uploading normally all during this time except for these "stuck" downloads, which are ignored until BOINC is stopped and restarted.
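
For anyone who wants to nudge stuck transfers without a full client restart, the stock boinccmd tool can list pending transfers and ask the client to retry one. A rough sketch follows; it assumes boinccmd is on the PATH and that local RPC is allowed, and whether the retry operation clears this particular stuck state is untested:

import subprocess

# Rough sketch: poke stuck transfers via the stock boinccmd tool instead of
# restarting the whole client. The output format of --get_file_transfers
# varies between client versions, so no parsing is attempted here.

def list_transfers() -> str:
    """Return the client's pending file transfers as raw text."""
    result = subprocess.run(
        ["boinccmd", "--get_file_transfers"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def retry_transfer(project_url: str, filename: str) -> None:
    """Ask the client to retry one stalled transfer."""
    subprocess.run(
        ["boinccmd", "--file_transfer", project_url, filename, "retry"],
        check=True,
    )

if __name__ == "__main__":
    print(list_transfers())
    # e.g. retry_transfer("http://setiathome.berkeley.edu/",
    #                     "23ap19aa.6731.16018.10.37.59.vlar")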
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1991322 - Posted: 24 Apr 2019, 20:03:25 UTC - in response to Message 1991319.  

So, which 'transient HTTP error' is it throwing this time?


. . That I cannot say, but I just noticed that it has happened here too at about 1:40 am AEST. It seems to have lasted for a short while but somehow self-corrected.

Stephen

? ?

I wonder if the cause was a bad hard drive in the Seti servers, which is probably the reason for the RAID rebuild. A bad hard drive could explain why my requests for a download go unfulfilled by the servers until I ask again with a BOINC restart.
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1991326 - Posted: 24 Apr 2019, 21:07:10 UTC

Looking at the Haveland graphs, I see that the servers have been having conniptions again for the last couple of hours.

I myself didn't have any stalled downloads, just the instant-timeout-resulting-in-daily-backoffs variety. Much abuse of the retry button seems to be clearing the backlog, and it also managed to produce 3 stalled downloads (where the Elapsed time counts down but nothing is actually happening); after about 2 minutes they eventually downloaded.
Grant
Darwin NT
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1991329 - Posted: 24 Apr 2019, 21:40:31 UTC

Well, I haven't had any download backoffs for the last hour now, so I guess the problem from earlier has somehow been fixed.

Cheers.
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1991330 - Posted: 24 Apr 2019, 21:43:07 UTC - in response to Message 1991329.  

Well, I haven't had any download backoffs for the last hour now, so I guess the problem from earlier has somehow been fixed.

Or at least resolved itself.
The last couple of requests for work have downloaded without assistance. Download speeds are still slower than usual, but at least they are downloading.
Grant
Darwin NT
Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 1991334 - Posted: 24 Apr 2019, 22:49:33 UTC

Could the download problems be caused by too many people trying to get WUs all at the same time? Too many connections at once? Does it usually happen after an outage like we had today and yesterday?