Panic Mode On (116) Server Problems?

Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1990934 - Posted: 21 Apr 2019, 13:22:30 UTC - in response to Message 1990888.  

For those interested in roughly how much processing power is needed to keep up with the incoming work, a post dated 16 Jan 19 says:
According to BOINCstats, SETI@home has a RAC of around 200,000,000.
A Titan V or GTX 2080 Ti can get close to 2 credits per second.
So 100,000,000 seconds of compute on one of those cards,
in an optimized OS using an optimized application, could do all of one day's SETI@home crunching.
So 1,157 Titan V or GTX 2080 Ti GPUs, crunching full-time, could do all the crunching.

That comes from the thread "Real time number crunching?"; there is more interesting information in that thread.


. . But that is the crunching that the whole project is presently completing in one day, not the data that GBT is producing in one day :). Since the daily output from one GBT channel (blcnn) currently takes about a week to 10 days to get through, and there are at least 24 channels being recorded, one day of GBT data takes about 170 days or more of project throughput to complete. That translates to something like 200,000-plus Titan V or GTX 2080 Ti cards running at full tilt to clear it in one day, which would be "real time" processing. So when we have at least 100,000 active volunteers, each running at least 2 of the above cards 24/7/52, we will be able to keep up with the input data from GBT. Of course we will need the same again to cater for the data from Parkes when it finally comes online, and another smaller contingent to deal with the data that comes from Arecibo. And when we actually do have the volunteer workforce to do all that, where the heck do we find the servers to cope with it????

. . Sadly 'real time' processing is at this point in time nothing more than a pipe dream.
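
. . The arithmetic in both posts can be checked in a few lines. A rough sketch in Python follows; every input is one of the estimates quoted above, not a measured value:

SECONDS_PER_DAY = 24 * 60 * 60                 # 86,400

project_rac = 200_000_000                      # project-wide credits/day (BOINCstats figure)
credits_per_second = 2                         # Titan V / GTX 2080 Ti, optimized apps (estimate)

# GPU-seconds of compute behind one day of project-wide credit
gpu_seconds = project_rac / credits_per_second          # 100,000,000 s

# Cards needed to match one day of current project output
cards_for_project = gpu_seconds / SECONDS_PER_DAY       # ~1,157

# Scaling to the raw GBT data rate: one channel's daily output takes at
# least a week of project throughput, and at least 24 channels are recorded.
days_per_channel = 7                                    # lower bound of "a week to 10 days"
channels = 24
project_days_per_gbt_day = days_per_channel * channels  # 168 days

cards_for_real_time = cards_for_project * project_days_per_gbt_day

print(f"Cards to match current project output: {cards_for_project:,.0f}")   # 1,157
print(f"Cards for real-time GBT processing: {cards_for_real_time:,.0f}")    # ~194,000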

Stephen

:(
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 1991025 - Posted: 22 Apr 2019, 1:39:55 UTC - in response to Message 1990934.  

I am sure that if Jeff, Eric or Matt could give us a list of the parts required, there would be people prepared to start a fundraiser to put together a system that could cope with the work volume.
Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9954
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1991089 - Posted: 22 Apr 2019, 13:18:50 UTC
Last modified: 22 Apr 2019, 13:19:33 UTC

That was easy. I just got 10 of my friends to stop crunching for SETI.


Interesting reaction; mine was to convert my Linux box to Windows 10. I just bought a new SSD and a Win 10 licence (I am letting my tasks run down rather than aborting them), and I will spruce up my 2 old Dells, possibly with new MBs, SSDs and modern processors, during the summer so that I don't have to put the heating on next winter ;-)
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22199
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1991152 - Posted: 23 Apr 2019, 4:55:21 UTC

Right, after a few hours to cool things down, this thread is back to life.
Now remember, don't stray too far from discussing server issues, and do obey all the forum rules.
And above all else, have fun.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 1991173 - Posted: 23 Apr 2019, 9:14:21 UTC
Last modified: 23 Apr 2019, 9:15:51 UTC

It is nice to see the higher return rate of over 127,000; I can only assume this means that the multibeam work is being returned. My pending validation is the highest it has been in a long time, at 180. This is due to having lots of multibeam tasks, which form a band between 3 and 5 minutes per task.
Sleepy
Volunteer tester
Joined: 21 May 99
Posts: 219
Credit: 98,947,784
RAC: 28,360
Italy
Message 1991196 - Posted: 23 Apr 2019, 18:54:40 UTC

We are back!
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1991210 - Posted: 23 Apr 2019, 21:59:07 UTC - in response to Message 1991196.  

We are back!


HURRAY!!!!
A proud member of the OFA (Old Farts Association).
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1991308 - Posted: 24 Apr 2019, 15:53:43 UTC

I'm seeing stalled downloads on all machines.
Of course, you can't download any new work with a stalled download...
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1991309 - Posted: 24 Apr 2019, 16:04:00 UTC - in response to Message 1991308.  

Yes, all my hosts have stalled downloads, with an accumulated 6 hours of elapsed time spent trying to download. So right around 3 AM my local time the servers couldn't access the requested work. I've had this problem for about a week now. The solution is to stop and restart BOINC on the affected hosts; the tasks then finally come down at restart.
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1991313 - Posted: 24 Apr 2019, 16:14:33 UTC - in response to Message 1991309.  

Yes, all my hosts have stalled downloads, with an accumulated 6 hours of elapsed time spent trying to download. So right around 3 AM my local time the servers couldn't access the requested work. I've had this problem for about a week now. The solution is to stop and restart BOINC on the affected hosts; the tasks then finally come down at restart.

I'd be interested to hear how you diagnosed cause and effect: what was stalled, and how did restarting BOINC clear it? FWIW, I've had a look around the systems here (seven machines, four projects) and nothing is stuck, either upload or download. None of the machines has had a BOINC restart since Patch Wednesday, except the single one I'm testing #3076 on. Oh, and one I moved to a different room last week.
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1991314 - Posted: 24 Apr 2019, 16:22:26 UTC

Well, earlier this morning all was well. It just began recently, and it does clear them if I select all of them and abuse the retry button.
However, it is recurring:
Wed 24 Apr 2019 12:17:17 PM EDT | SETI@home | Reporting 5 completed tasks
Wed 24 Apr 2019 12:17:17 PM EDT | SETI@home | Requesting new tasks for NVIDIA GPU
Wed 24 Apr 2019 12:17:17 PM EDT | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Wed 24 Apr 2019 12:17:17 PM EDT | SETI@home | [sched_op] NVIDIA GPU work request: 4619.87 seconds; 0.00 devices
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | Scheduler request completed: got 4 new tasks
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | [sched_op] Server version 709
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | Project requested delay of 303 seconds
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | [sched_op] estimated total CPU task duration: 0 seconds
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 1647 seconds
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task 21ap19ac.2151.10701.5.32.164_1
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task blc35_2bit_guppi_58406_10622_HIP3805_0058.31543.0.23.46.245.vlar_0
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task blc35_2bit_guppi_58406_08938_And_XIV_0053.31575.0.23.46.145.vlar_1
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task blc35_2bit_guppi_58406_08938_And_XIV_0053.31555.818.23.46.194.vlar_1
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task blc35_2bit_guppi_58406_11643_And_XI_0061.31652.0.23.46.252.vlar_0
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | [sched_op] Deferring communication for 00:05:03
Wed 24 Apr 2019 12:17:19 PM EDT | SETI@home | [sched_op] Reason: requested by project
Wed 24 Apr 2019 12:17:21 PM EDT | SETI@home | Started download of 23ap19aa.6731.16018.10.37.59.vlar
Wed 24 Apr 2019 12:17:21 PM EDT | SETI@home | Started download of 23ap19aa.27470.4566.8.35.63
Wed 24 Apr 2019 12:17:21 PM EDT | SETI@home | Started download of 23ap19aa.6731.16018.10.37.32.vlar
Wed 24 Apr 2019 12:17:21 PM EDT | SETI@home | Started download of 23ap19aa.5757.16427.6.33.68.vlar
Wed 24 Apr 2019 12:17:23 PM EDT | SETI@home | Temporarily failed download of 23ap19aa.6731.16018.10.37.59.vlar: transient HTTP error
Wed 24 Apr 2019 12:17:23 PM EDT | SETI@home | Backing off 00:03:45 on download of 23ap19aa.6731.16018.10.37.59.vlar
Wed 24 Apr 2019 12:17:23 PM EDT | SETI@home | Temporarily failed download of 23ap19aa.27470.4566.8.35.63: transient HTTP error
Wed 24 Apr 2019 12:17:23 PM EDT | SETI@home | Backing off 00:02:14 on download of 23ap19aa.27470.4566.8.35.63
Wed 24 Apr 2019 12:17:23 PM EDT | SETI@home | Temporarily failed download of 23ap19aa.6731.16018.10.37.32.vlar: transient HTTP error
Wed 24 Apr 2019 12:17:23 PM EDT | SETI@home | Backing off 00:03:37 on download of 23ap19aa.6731.16018.10.37.32.vlar
Wed 24 Apr 2019 12:17:23 PM EDT | SETI@home | Temporarily failed download of 23ap19aa.5757.16427.6.33.68.vlar: transient HTTP error
Wed 24 Apr 2019 12:17:23 PM EDT | SETI@home | Backing off 00:03:12 on download of 23ap19aa.5757.16427.6.33.68.vlar
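
The pattern in that log (an instant failure, then a randomized wait of a few minutes before retry) is the client's usual response to a transient HTTP error. A minimal Python sketch of that kind of randomized backoff follows; it mimics the observed behaviour and is NOT BOINC's actual backoff code:

import random

# Illustrative randomized retry backoff, like the waits of roughly 2-4
# minutes after "transient HTTP error" in the log above.

MIN_BACKOFF = 60          # seconds; never retry sooner than this
MAX_BACKOFF = 4 * 3600    # cap so repeated failures can't defer retries forever

def next_backoff(consecutive_failures: int) -> float:
    """Pick a randomized wait for the nth consecutive download failure."""
    ceiling = min(MAX_BACKOFF, MIN_BACKOFF * 2 ** consecutive_failures)
    return random.uniform(MIN_BACKOFF, ceiling)

if __name__ == "__main__":
    for n in range(1, 6):
        print(f"failure {n}: retry in {next_backoff(n) / 60:.1f} min")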
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1991317 - Posted: 24 Apr 2019, 16:41:26 UTC - in response to Message 1991314.  

So, which 'transient HTTP error' is it throwing this time?
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1991319 - Posted: 24 Apr 2019, 16:52:44 UTC - in response to Message 1991317.  

So, which 'transient HTTP error' is it throwing this time?


. . That I cannot say, but I just noticed that it has happened here too at about 1:40 am AEST. It seems to have lasted for a short while but somehow self-corrected.

Stephen

? ?
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 1991320 - Posted: 24 Apr 2019, 19:48:04 UTC

Looks like we're back (forums, at least).
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1991321 - Posted: 24 Apr 2019, 19:59:46 UTC - in response to Message 1991313.  

Yes, all my hosts have stalled downloads, with an accumulated 6 hours of elapsed time spent trying to download. So right around 3 AM my local time the servers couldn't access the requested work. I've had this problem for about a week now. The solution is to stop and restart BOINC on the affected hosts; the tasks then finally come down at restart.
I'd be interested to hear how you diagnosed cause and effect: what was stalled, and how did restarting BOINC clear it? FWIW, I've had a look around the systems here (seven machines, four projects) and nothing is stuck, either upload or download. None of the machines has had a BOINC restart since Patch Wednesday, except the single one I'm testing #3076 on. Oh, and one I moved to a different room last week.

What I am describing is not the normal "stalled" download. The tasks are labelled "active" with no backoff and 0% progress. The elapsed timer continues to count, and by the time I notice them it is on the order of several hours, since they stall during the night while I'm asleep. The fix is to stop BOINC, wait for the client to fully stop in the System Monitor, and then restart BOINC. The host finishes the "stuck" downloads as soon as the client initializes. Normally fewer than half a dozen are affected. The host keeps downloading and uploading normally all during this time except for these "stuck" downloads, which are ignored until BOINC is stopped and restarted.
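
For anyone who wants to nudge stuck transfers without a full client restart, the stock boinccmd tool can list pending transfers and ask the client to retry one. A rough sketch follows; it assumes boinccmd is on the PATH and that local RPC is allowed, and whether the retry operation clears this particular stuck state is untested:

import subprocess

# Rough sketch: poke stuck transfers via the stock boinccmd tool instead of
# restarting the whole client. The output format of --get_file_transfers
# varies between client versions, so no parsing is attempted here.

def list_transfers() -> str:
    """Return the client's pending file transfers as raw text."""
    result = subprocess.run(
        ["boinccmd", "--get_file_transfers"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def retry_transfer(project_url: str, filename: str) -> None:
    """Ask the client to retry one stalled transfer."""
    subprocess.run(
        ["boinccmd", "--file_transfer", project_url, filename, "retry"],
        check=True,
    )

if __name__ == "__main__":
    print(list_transfers())
    # e.g. retry_transfer("http://setiathome.berkeley.edu/",
    #                     "23ap19aa.6731.16018.10.37.59.vlar")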
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1991322 - Posted: 24 Apr 2019, 20:03:25 UTC - in response to Message 1991319.  

So, which 'transient HTTP error' is it throwing this time?


. . That I cannot say, but I just noticed that it has happened here too at about 1:40 am AEST. It seems to have lasted for a short while but somehow self-corrected.

Stephen

? ?

I wonder if the cause was a bad hard drive in the Seti servers, which is probably the reason for the RAID rebuild. A bad hard drive could explain why my requests for a download go unfulfilled by the servers until I ask again with a BOINC restart.
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1991326 - Posted: 24 Apr 2019, 21:07:10 UTC

Looking at the Haveland graphs, I see that the servers have been having conniptions again for the last couple of hours.

I myself didn't have any stalled downloads, just the instant-timeout-resulting-in-daily-backoffs variety. Much abuse of the retry button seems to be clearing the backlog, and it also managed to produce 3 stalled downloads (where the Elapsed time counts down but nothing is actually happening); after about 2 minutes they eventually downloaded.
Grant
Darwin NT
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1991329 - Posted: 24 Apr 2019, 21:40:31 UTC

Well, I haven't had any download backoffs for the last hour now, so I guess the problem from earlier has somehow been fixed.

Cheers.
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1991330 - Posted: 24 Apr 2019, 21:43:07 UTC - in response to Message 1991329.  

Well, I haven't had any download backoffs for the last hour now, so I guess the problem from earlier has somehow been fixed.

Or at least resolved itself.
The last couple of requests for work have downloaded without assistance. Download speeds are still slower than usual, but at least they are downloading.
Grant
Darwin NT
Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 1991334 - Posted: 24 Apr 2019, 22:49:33 UTC

Could the download problems be caused by too many people trying to get WUs all at the same time? Too many connections at once? Does it usually happen after an outage like we had today and yesterday?