Panic Mode On (108) Server Problems?

Author	Message
betreger Send message Joined: 29 Jun 99 Posts: 11361 Credit: 29,581,041 RAC: 66	Message 1901199 - Posted: 15 Nov 2017, 16:34:13 UTC When reporting I got this: 11/15/2017 8:32:51 AM | SETI@home | Project is temporarily shut down for maintenance ID: 1901199 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 1901201 - Posted: 15 Nov 2017, 16:47:49 UTC One server tuneup, coming right up. Meow. "Freedom is just Chaos, with better lighting." Alan Dean Foster ID: 1901201 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1901205 - Posted: 15 Nov 2017, 17:06:42 UTC - in response to Message 1901201. One server tuneup, coming right up. Meow. And already (provisionally) complete. ID: 1901205 ·

Eric Korpela Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60	Message 1901240 - Posted: 15 Nov 2017, 22:12:11 UTC - in response to Message 1900207.
We had a drive in the array that holds the workunits that was generating errors. I've kicked it out of the array. We'll see if the rebuild fixes the problem.
I don't like the looks of WU 2739307008. One host successfully downloaded and ran it. Everybody else is getting D/L errors on it, 5 hosts so far. My machine's Event Log shows:
11/9/2017 7:17:34 AM | SETI@home | Started download of blc24_2bit_guppi_57895_43958_HIP91357_0024.29905.818.23.46.0.vlar
11/9/2017 7:17:38 AM | SETI@home | Finished download of blc24_2bit_guppi_57895_43958_HIP91357_0024.29905.818.23.46.0.vlar
11/9/2017 7:17:38 AM | SETI@home | [error] MD5 check failed for blc24_2bit_guppi_57895_43958_HIP91357_0024.29905.818.23.46.0.vlar
11/9/2017 7:17:38 AM | SETI@home | [error] expected 4bb0fee3928609f2b1df21e44ac13b4e, got 450a32005c6700d7ab95284edc959572
11/9/2017 7:17:38 AM | SETI@home | [error] Checksum or signature error for blc24_2bit_guppi_57895_43958_HIP91357_0024.29905.818.23.46.0.vlar
I just downloaded the WU manually and didn't seem to have any errors. The file doesn't appear to be truncated, either, ending with "" as the last line.
I just downloaded it as well. I got a manual MD5 of 450a32005c6700d7ab95284edc959572, the same as BOINC calculated for yours: that would suggest that the MD5 stored in the database when the file was created (so that the comparison can be done) might be corrupted.
EXCEPT: the person who downloaded the _0 replication got a clean download. _1 was a file size error (my download was 720,530 bytes, which is close enough without knowing the precise number of bytes expected): all the others got the MD5 error.
Which suggests that something was messing with, either, the database MD5 values, or, the stored files on disk. Neither really bears thinking about, and both are a long way outside our control. As Rob says, take the day off.
@SETIEric@qoto.org (Mastodon) ID: 1901240 ·
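For anyone wanting to repeat the manual MD5 check described in the post above, here is a minimal sketch in Python. It assumes the workunit file has already been downloaded to local disk. The function names `md5_of_file` and `check_download` are made up for illustration and are not part of BOINC; the expected value would be whatever BOINC prints in its "[error] expected ..." line.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read() until it returns b"" (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_download(path, expected_md5):
    """Hypothetical helper: compare a local file's MD5 against an expected hex string.

    Returns (matches, actual_md5) so a mismatch can be reported in the same
    "expected X, got Y" form as the Event Log lines.
    """
    actual = md5_of_file(path)
    return actual == expected_md5.strip().lower(), actual
```

Calling `check_download(path_to_file, "4bb0fee3928609f2b1df21e44ac13b4e")` on a local copy of the workunit would show whether that copy matches the value the server expected or the 450a3200... value that BOINC and the manual check both computed. A mismatch only says the two disagree; it cannot say whether the stored MD5 or the stored file is the one at fault.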

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1901242 - Posted: 15 Nov 2017, 22:20:41 UTC I see we've got some blc_25 WUs now. Looks like they require more work than the blc_24s did. Grant Darwin NT ID: 1901242 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1901245 - Posted: 15 Nov 2017, 22:32:12 UTC - in response to Message 1901240. We had a drive in the array that holds the workunits that was generating errors. I've kicked it out of the array. We'll see if the rebuild fixes the problem. More disk drives was a line item in the fall fundraising drive, wasn't it? ID: 1901245 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1901253 - Posted: 15 Nov 2017, 23:16:14 UTC - in response to Message 1901240. We had a drive in the array that holds the workunits that was generating errors. I've kicked it out of the array. We'll see if the rebuild fixes the problem. . . Thanks for the update, Eric. Stephen :) ID: 1901253 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1901770 - Posted: 19 Nov 2017, 2:30:34 UTC The WU File deleters have been having issues for a few days now, however things appear to be getting worse with the backlog reaching new highs. Hopefully we won't get to the point of running out of disk space till well after everyone's back at work at Berkeley. Grant Darwin NT ID: 1901770 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1901825 - Posted: 19 Nov 2017, 9:00:38 UTC - in response to Message 1901253. We had a drive in the array that holds the workunits that was generating errors. I've kicked it out of the array. We'll see if the rebuild fixes the problem. . . Thanks for the update, Eric. Stephen :) . . For what it's worth, since the issue with the derelict RAID drive, and hopefully no jinxes involved here, I have not had to play kick the servers at all. Work requests are being met and my caches are full. It would seem, on the surface, that the malfunctioning drive may have been at the heart of the issue for some time. Stephen :) ID: 1901825 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1901888 - Posted: 19 Nov 2017, 17:32:02 UTC - in response to Message 1901825. We had a drive in the array that holds the workunits that was generating errors. I've kicked it out of the array. We'll see if the rebuild fixes the problem. . . Thanks for the update, Eric. Stephen :) . . For what it's worth, since the issue with the derelict RAID drive, and hopefully no jinxes involved here, I have not had to play kick the servers at all. Work requests are being met and my caches are full. It would seem, on the surface, that the malfunctioning drive may have been at the heart of the issue for some time. Stephen :) I too was wondering the same, or more likely it was the re-apportioning of the memory Eric mentioned. I haven't had to kick the servers either. Smooth sailing for once and I really like it. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1901888 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1901941 - Posted: 19 Nov 2017, 22:44:51 UTC - in response to Message 1901888. . . For what it's worth, since the issue with the derelict RAID drive, and hopefully no jinxes involved here, I have not had to play kick the servers at all. Work requests are being met and my caches are full. It would seem, on the surface, that the malfunctioning drive may have been at the heart of the issue for some time. Stephen :) I too was wondering the same, or more likely it was the re-apportioning of the memory Eric mentioned. I haven't had to kick the servers either. Smooth sailing for once and I really like it. . . Hi Keith, . . Nice to know the effect is not just with my rigs. I wonder if things have improved for Grant too, he seemed to be suffering from the issue a lot. Stephen :) ID: 1901941 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1902009 - Posted: 20 Nov 2017, 6:52:21 UTC - in response to Message 1901941. . . Nice to know the effect is not just with my rigs. I wonder if things have improved for Grant too, he seemed to be suffering from the issue a lot. About the same. OK for days or weeks at a time, then the cache runs down for a bit and TBar's triple update gets it going again. The issue generally occurs when there is no AP work, or when the work is mostly just GBT or Arecibo. The fact that there has been a steady flow of AP work for a while now is most likely why we're not having issues getting work. Grant Darwin NT ID: 1902009 ·

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 1902243 - Posted: 22 Nov 2017, 0:38:04 UTC Back up over 90 minutes, still can't get any tasks for any of my three. May be time to head to Einstein? ID: 1902243 ·

Cruncher-American Send message Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340	Message 1902246 - Posted: 22 Nov 2017, 0:43:50 UTC - in response to Message 1902243. Once I noticed that the servers were back up, I went over to my crunchers and forced a request-for-work. I started getting work right away (first 1, then 100+ a couple of times). Check to see if your machines are on a delay because they requested work when none was available. ID: 1902246 ·

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 1902248 - Posted: 22 Nov 2017, 1:04:15 UTC - in response to Message 1902246. Last modified: 22 Nov 2017, 1:30:27 UTC Once I noticed that the servers were back up, I went over to my crunchers and forced a request-for-work. I started getting work right away (first 1, then 100+ a couple of times). Check to see if your machines are on a delay because they requested work when none was available. Thanks, the first thing I do when it's back up is force an update on all boxes via BOINCTasks. Every 305 seconds I'm getting "No work available". === Disregard, flowing now ... ID: 1902248 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1902250 - Posted: 22 Nov 2017, 1:19:01 UTC - in response to Message 1902243. Last modified: 22 Nov 2017, 1:28:05 UTC Back up over 90 minutes, still can't get any tasks for any of my three. May be time to head to Einstein? . . I was surprised that I got some new work on the first try after the outage, but since then nada. Looking at the server page the RTS tasks are below 90K but the creation rate is only 8/sec. Oh well! . . Have fun at Einstein :) [edit] .. OK. so while I was reading and typing, a work request got "unable to communicate with project, server may be down" message followed on the next attempt by a large download. Maybe you should try again now :) Stephen :) ID: 1902250 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1903134 - Posted: 27 Nov 2017, 8:08:05 UTC Last modified: 27 Nov 2017, 8:08:21 UTC AP Validators don't appear to be working too well: the numbers of AP WUs awaiting Validation and Assimilation are heading for orbit. Grant Darwin NT ID: 1903134 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1903290 - Posted: 28 Nov 2017, 6:05:26 UTC AP Awaiting Validation and WU Awaiting Assimilation continue to climb. Grant Darwin NT ID: 1903290 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1903299 - Posted: 28 Nov 2017, 6:57:42 UTC - in response to Message 1903290. I guess no one from the project has noticed. Hopeful that it gets fixed with maintenance tomorrow. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1903299 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1903319 - Posted: 28 Nov 2017, 8:48:02 UTC - in response to Message 1903299. I guess no one from the project has noticed. Hopeful that it gets fixed with maintenance tomorrow. Yep. Hopefully the restart after the system shut down will kick things along. Grant Darwin NT ID: 1903319 ·

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.