Panic Mode On (108) Server Problems?

betreger Project Donor
Joined: 29 Jun 99
Posts: 11408
Credit: 29,581,041
RAC: 66
United States
Message 1901199 - Posted: 15 Nov 2017, 16:34:13 UTC

When reporting I got this:
11/15/2017 8:32:51 AM | SETI@home | Project is temporarily shut down for maintenance

ID: 1901199
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1901201 - Posted: 15 Nov 2017, 16:47:49 UTC

One server tuneup, coming right up.
Meow.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1901201
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14672
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1901205 - Posted: 15 Nov 2017, 17:06:42 UTC - in response to Message 1901201.  

One server tuneup, coming right up.
Meow.
And already (provisionally) complete.
ID: 1901205
Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1382
Credit: 54,506,847
RAC: 60
United States
Message 1901240 - Posted: 15 Nov 2017, 22:12:11 UTC - in response to Message 1900207.  

We had a drive in the array that holds the workunits that was generating errors. I've kicked it out of the array. We'll see if the rebuild fixes the problem.

I don't like the looks of WU 2739307008. One host successfully downloaded and ran it. Everybody else is getting D/L errors on it, 5 hosts so far. My machine's Event Log shows:

11/9/2017 7:17:34 AM | SETI@home | Started download of blc24_2bit_guppi_57895_43958_HIP91357_0024.29905.818.23.46.0.vlar
11/9/2017 7:17:38 AM | SETI@home | Finished download of blc24_2bit_guppi_57895_43958_HIP91357_0024.29905.818.23.46.0.vlar
11/9/2017 7:17:38 AM | SETI@home | [error] MD5 check failed for blc24_2bit_guppi_57895_43958_HIP91357_0024.29905.818.23.46.0.vlar
11/9/2017 7:17:38 AM | SETI@home | [error] expected 4bb0fee3928609f2b1df21e44ac13b4e, got 450a32005c6700d7ab95284edc959572
11/9/2017 7:17:38 AM | SETI@home | [error] Checksum or signature error for blc24_2bit_guppi_57895_43958_HIP91357_0024.29905.818.23.46.0.vlar
I just downloaded the WU manually and didn't seem to have any errors. The file doesn't appear to be truncated, either, ending with "" as the last line.
I just downloaded it as well. I got a manual MD5 of 450a32005c6700d7ab95284edc959572, the same as BOINC calculated for yours: that would suggest that the MD5 stored in the database when the file was created (so that the comparison can be done) might be corrupted.

EXCEPT: the person who downloaded the _0 replication got a clean download. _1 was a file size error (my download was 720,530 bytes, which is close enough without knowing the precise number of bytes expected): all the others got the MD5 error.

Which suggests that something was messing with either the database MD5 values or the stored files on disk. Neither really bears thinking about, and both are a long way outside our control. As Rob says, take the day off.
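For anyone who wants to repeat the manual check described above, here is a minimal Python sketch. It assumes you have saved the workunit file locally; the function names `md5_of_file` and `diagnose` are made up for this illustration and are not part of BOINC. The two hash constants are copied from the Event Log lines quoted in this post.

```python
import hashlib

# Values taken from the "[error] expected ..., got ..." Event Log line above.
EXPECTED = "4bb0fee3928609f2b1df21e44ac13b4e"
GOT = "450a32005c6700d7ab95284edc959572"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def diagnose(expected_md5, computed_md5s):
    """Classify several independent downloads of the same file.

    expected_md5  -- the digest the server advertised
    computed_md5s -- digests computed locally, one per download
    """
    if not computed_md5s:
        raise ValueError("need at least one computed digest")
    unique = {d.lower() for d in computed_md5s}
    if unique == {expected_md5.lower()}:
        return "all downloads match the advertised MD5"
    if len(unique) == 1:
        return "downloads agree with each other but not with the advertised MD5"
    return "downloads disagree with each other"
```

With the two manual downloads reported here, `diagnose(EXPECTED, [GOT, GOT])` returns "downloads agree with each other but not with the advertised MD5". That only says the copies now being served are consistent with one another; it cannot tell whether the stored MD5 or the stored file is the one that changed.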

@SETIEric@qoto.org (Mastodon)

ID: 1901240
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 1901242 - Posted: 15 Nov 2017, 22:20:41 UTC

I see we've got some blc_25 WUs now. Looks like they require more work than the blc_24s did.
Grant
Darwin NT
ID: 1901242
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14672
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1901245 - Posted: 15 Nov 2017, 22:32:12 UTC - in response to Message 1901240.  

We had a drive in the array that holds the workunits that was generating errors. I've kicked it out of the array. We'll see if the rebuild fixes the problem.
More disk drives was a line item in the fall fundraising drive, wasn't it?
ID: 1901245
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1901253 - Posted: 15 Nov 2017, 23:16:14 UTC - in response to Message 1901240.  

We had a drive in the array that holds the workunits that was generating errors. I've kicked it out of the array. We'll see if the rebuild fixes the problem.


. . Thanks for the update, Eric.

Stephen

:)
ID: 1901253
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 1901770 - Posted: 19 Nov 2017, 2:30:34 UTC

The WU file deleters have been having issues for a few days now; however, things appear to be getting worse, with the backlog reaching new highs.
Hopefully we won't get to the point of running out of disk space till well after everyone's back at work at Berkeley.
Grant
Darwin NT
ID: 1901770
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1901825 - Posted: 19 Nov 2017, 9:00:38 UTC - in response to Message 1901253.  

We had a drive in the array that holds the workunits that was generating errors. I've kicked it out of the array. We'll see if the rebuild fixes the problem.


. . Thanks for the update, Eric.

Stephen

:)


. . For what it's worth, since the issue with the derelict RAID drive, and hopefully no jinxes involved here, I have not had to play kick the servers at all. Work requests are being met and my caches are full. It would seem, on the surface, that the malfunctioning drive may have been at the heart of the issue for some time.

Stephen

:)
ID: 1901825
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1901888 - Posted: 19 Nov 2017, 17:32:02 UTC - in response to Message 1901825.  

We had a drive in the array that holds the workunits that was generating errors. I've kicked it out of the array. We'll see if the rebuild fixes the problem.


. . Thanks for the update, Eric.

Stephen

:)


. . For what it's worth, since the issue with the derelict RAID drive, and hopefully no jinxes involved here, I have not had to play kick the servers at all. Work requests are being met and my caches are full. It would seem, on the surface, that the malfunctioning drive may have been at the heart of the issue for some time.

Stephen

:)

I too was wondering the same, or, more likely, whether it was the re-apportioning of the memory Eric mentioned. I haven't had to kick the servers either. Smooth sailing for once, and I really like it.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1901888
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1901941 - Posted: 19 Nov 2017, 22:44:51 UTC - in response to Message 1901888.  


. . For what it's worth, since the issue with the derelict RAID drive, and hopefully no jinxes involved here, I have not had to play kick the servers at all. Work requests are being met and my caches are full. It would seem, on the surface, that the malfunctioning drive may have been at the heart of the issue for some time.
Stephen
:)

I too was wondering the same, or, more likely, whether it was the re-apportioning of the memory Eric mentioned. I haven't had to kick the servers either. Smooth sailing for once, and I really like it.


. . Hi Keith,

. . Nice to know the effect is not just with my rigs. I wonder if things have improved for Grant too; he seemed to be suffering from the issue a lot.

Stephen

:)
ID: 1901941
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 1902009 - Posted: 20 Nov 2017, 6:52:21 UTC - in response to Message 1901941.  

. . Nice to know the effect is not just with my rigs. I wonder if things have improved for Grant too; he seemed to be suffering from the issue a lot.

About the same.
OK for days or weeks at a time, then the cache runs down for a bit and TBar's triple update gets it going again.
The issue generally occurs when there is no AP work, or when the work is mostly just GBT or Arecibo. The fact that there has been a steady flow of AP work for a while now is most likely why we're not having issues getting work.
Grant
Darwin NT
ID: 1902009
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1856
Credit: 268,616,081
RAC: 1,349
United States
Message 1902243 - Posted: 22 Nov 2017, 0:38:04 UTC

Back up over 90 minutes, still can't get any tasks for any of my three.
May be time to head to Einstein?
ID: 1902243
Cruncher-American Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1902246 - Posted: 22 Nov 2017, 0:43:50 UTC - in response to Message 1902243.  

Once I noticed that the servers were back up, I went over to my crunchers and forced a request-for-work. I started getting work right away (first 1, then 100+ a couple of times). Check to see if your machines are on a delay because they requested work when none was available.
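If you would rather script the forced request than click through a manager, the sketch below builds a `boinccmd --project <URL> update` call in Python. It is only a sketch under assumptions: `PROJECT_URL` is assumed to be the URL your client shows for SETI@home, `build_update_command` and `force_update` are names invented for this example, and `boinccmd` has to be on the PATH and able to authenticate to the local client (for example by running it from the BOINC data directory).

```python
import shutil
import subprocess

# Assumption: the project URL exactly as listed by your own BOINC client.
PROJECT_URL = "http://setiathome.berkeley.edu/"


def build_update_command(project_url=PROJECT_URL, boinccmd="boinccmd"):
    """Return the argument list for a project update request."""
    return [boinccmd, "--project", project_url, "update"]


def force_update(project_url=PROJECT_URL):
    """Ask the local client to contact the project now.

    Returns None when boinccmd is not installed or not on the PATH,
    otherwise the CompletedProcess so the caller can inspect the result.
    """
    exe = shutil.which("boinccmd")
    if exe is None:
        return None
    return subprocess.run(
        build_update_command(project_url, exe),
        capture_output=True,
        text=True,
        check=False,  # leave error handling to the caller
    )
```

Calling `force_update()` once per machine after an outage is the scripted equivalent of pressing Update; whether any work comes back still depends on what the servers have ready to send.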
ID: 1902246
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1856
Credit: 268,616,081
RAC: 1,349
United States
Message 1902248 - Posted: 22 Nov 2017, 1:04:15 UTC - in response to Message 1902246.  
Last modified: 22 Nov 2017, 1:30:27 UTC

Once I noticed that the servers were back up, I went over to my crunchers and forced a request-for-work. I started getting work right away (first 1, then 100+ a couple of times). Check to see if your machines are on a delay because they requested work when none was available.

Thanks, the first thing I do when it's back up is force an update on all boxes via BOINCTasks.
Every 305 seconds getting "No work available"
===
Disregard, flowing now ...
ID: 1902248
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1902250 - Posted: 22 Nov 2017, 1:19:01 UTC - in response to Message 1902243.  
Last modified: 22 Nov 2017, 1:28:05 UTC

Back up over 90 minutes, still can't get any tasks for any of my three.
May be time to head to Einstein?


. . I was surprised that I got some new work on the first try after the outage, but since then nada. Looking at the server page the RTS tasks are below 90K but the creation rate is only 8/sec. Oh well!

. . Have fun at Einstein :)

[edit] .. OK, so while I was reading and typing, a work request got an "unable to communicate with project, server may be down" message, followed on the next attempt by a large download. Maybe you should try again now :)

Stephen

:)
ID: 1902250
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 1903134 - Posted: 27 Nov 2017, 8:08:05 UTC
Last modified: 27 Nov 2017, 8:08:21 UTC

AP Validators don't appear to be working too well: the numbers of AP WUs awaiting Validation and Assimilation are heading for orbit.
Grant
Darwin NT
ID: 1903134
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 1903290 - Posted: 28 Nov 2017, 6:05:26 UTC

AP Awaiting Validation and WU Awaiting Assimilation continue to climb.
Grant
Darwin NT
ID: 1903290
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1903299 - Posted: 28 Nov 2017, 6:57:42 UTC - in response to Message 1903290.  

I guess no one from the project has noticed. Hopefully it gets fixed with maintenance tomorrow.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1903299
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 1903319 - Posted: 28 Nov 2017, 8:48:02 UTC - in response to Message 1903299.  

I guess no one from the project has noticed. Hopefully it gets fixed with maintenance tomorrow.

Yep.
Hopefully the restart after the system shut down will kick things along.
Grant
Darwin NT
ID: 1903319
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.