Message boards :
Number crunching :
Panic Mode On (101) Server Problems?
Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
Drat. My lowly single-core machine picked up a re-send for one of the terminally-ill MBs. There goes the consecutive valid count. http://setiathome.berkeley.edu/workunit.php?wuid=1954413233 edit: Question: if the auto-corr config values (as mentioned by Richard) are zero instead of the values they should be... then theoretically, couldn't one just open the WU in a hex editor and put those values back to something non-zero so it would crunch properly? Surely it's not that simple of a fix, though... edit2: I think I just understood from re-reading... that's from the output result file. So then it would still have to be something in the header of the WU itself that decides it can't run auto-corr, right? Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up) |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
Gotta love the perversity of chance. Like how my Pentium D just sucked down APs, but the quad-core Xeon, nope ... |
rob smith Send message Joined: 7 Mar 03 Posts: 22202 Credit: 416,307,556 RAC: 380 |
Well it looks as if the splitters are behaving themselves just now.... Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
Well it looks as if the splitters are behaving themselves just now.... I had been thinking that their splitting rate has been higher than it's been for some time, pretty much since the PFB splitters came in. (Generally, with anything less than 5 splitters running, output is barely 27/s; with multiple splitters on the one channel, even less. So far all 7 splitters have been running on only 4 (even down to 2) files & still they're pumping out the work.) I just didn't want to tempt fate. To add to the perversity of chance, now that so many caches are pretty much empty, 90% of the work I've been getting has been shorties. Although there are some GPU WUs I'll keep an eye on. 04mr11ae Estimated completion times for longer running GPU WUs are usually not much more than 35min. These ones are all around 45min. Grant Darwin NT |
qbit Send message Joined: 19 Sep 04 Posts: 630 Credit: 6,868,528 RAC: 0 |
It was a really nice flow lately; lots of APs gave me a nice RAC, but now it's over again. No APs, no MBs, lots of invalid tasks >>>>> had to power down my cruncher once again. Just wish this project would be a bit more stable. |
Darth Beaver Send message Joined: 20 Aug 99 Posts: 6728 Credit: 21,443,075 RAC: 3 |
NX-01, you can probably blame Matt; he was probably the sucker left to do the programming changes. It's called passing the buck hehehehehe. Sorry Matt, couldn't resist that one :-) |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
edit: Question: if the auto-corr config values (as mentioned by Richard) are zero instead of the values they should be... then theoretically, couldn't one just open the WU in a hex editor and put those values back to something non-zero so it would crunch properly? Surely it's not that simple of a fix though... Yes, the autocorr settings I quoted were lifted from the downloaded data file before it was crunched. They could be edited between downloading and crunching, so that the proper analysis was done and reported. But there are two flies in that ointment - one potential, one certain. Potential: editing the WU data file would change its MD5 checksum. I think that's only checked as the download completes, but it might get checked when the task is launched as well (it probably should be). BOINC would be within its rights to reject the file for tampering. Certain: unless you could be certain that your wingmate had also edited the data, your result would be different from all the others, and would fail validation. |
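The integrity check Richard describes boils down to hashing the data file and comparing against the checksum the server supplied at download time. A minimal sketch in Python (the function names here are mine for illustration; BOINC's actual client is C++, and compares against the <md5_cksum> it records in client_state.xml):

```python
import hashlib

def md5_of_file(path: str) -> str:
    """Compute the MD5 digest of a file, reading in 64 KiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: str, expected_md5: str) -> bool:
    """True if the file still matches the checksum recorded at download.

    Any edit to the workunit data (e.g. patching the autocorr fields
    in a hex editor) changes the digest and makes this check fail.
    """
    return md5_of_file(path) == expected_md5
```

Whether the client re-runs this check at task launch, as Richard suggests it probably should, is the open question; the download-time check alone would not catch an edit made after the file arrived.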
qbit Send message Joined: 19 Sep 04 Posts: 630 Credit: 6,868,528 RAC: 0 |
NX-01 you can probably blame Matt he probably was the sucker left to do the programming changes It's called passing the buck hehehehehe Don't they test changes on beta anymore before they go live on main? |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
Don't they test changes on beta anymore before they go live on main? Posted earlier in this very thread, Just so you know we're working on the splitter problem - a new bit of splitter code was put into play yesterday. It was working well enough in beta, but apparently it still wasn't ready for prime time. We have some debugging and cleaning up to do but we'll be back soon enough with more workunits.... - Matt Grant Darwin NT |
qbit Send message Joined: 19 Sep 04 Posts: 630 Credit: 6,868,528 RAC: 0 |
Well, that's strange then. BTW: Everything was running fine before, at least for me, so I wonder what problem they are trying to fix with the new code? (Sorry if the answer is already in this thread somewhere.) |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Well, that's strange then. They've been working slowly behind the scenes for most of this year, preparing the entire processing chain (telescope --> recorder --> splitter --> application(s) --> validator --> assimilator) to handle observations made at the Green Bank observatory. The new splitters are dual-purpose, designed to handle either Arecibo or Green Bank data as required. Edit - none of that is particularly new, I'm just repeating what Matt has posted in Technical News. See, for example, Jun 23 2015 and Aug 31 2015. |
qbit Send message Joined: 19 Sep 04 Posts: 630 Credit: 6,868,528 RAC: 0 |
Ok, thx Richard, hope they can fix everything soon. |
Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
But there are two flies in that ointment - one potential, one certain. The MD5s are easy enough... just make the change to the header, re-MD5 the file, and put the new hash into client_state. If that MD5 is cross-checked with the scheduler upon contact (I would hope that it is), then I can see that being a problem. As far as wingmates... I know there's hardly any guarantee that random wingmates would ever respond, and it would be even less likely that anyone who does respond would know how to fix their WU the same way you did. In the case of some of these WUs, you just need two of them out of the total of 10 to match. So if 6 or 7 of the wingmates never respond to PMs about it... you just need one out of the total of 9 to respond and know how to do this. Of course, this is all hypothetical at best anyway, because I believe this totally falls into the category of tampering, which is not only unethical but also prohibited. I was just wondering if there was technically something that could be done on the client side to fix these broken WUs, and I suppose I already got my answer... theoretically, yes; realistically, no. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up) |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
Caches are slowly filling, only another 300,000 to go. Returned-per-hour is right up there, over 100,000/hr for the last 6 hours. Hopefully some of the new files will give a few more longer running WUs. Help reduce the load a bit. Grant Darwin NT |
Starman Send message Joined: 15 May 99 Posts: 204 Credit: 81,351,915 RAC: 25 |
I'm still getting an unusually high number of invalids. Just me, or are others getting them as well? Thanks |
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
There are a lot of invalid tasks floating through the system due to a coding error, which I believe has been fixed. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
There are a lot of invalid tasks floating through the system due to a coding error, which I believe has been fixed. I haven't noticed any errors on the WUs since they had a play with the splitter code to sort it out. Although I've already had several _9s on my systems, those automatic error WUs will be floating around for months. Grant Darwin NT |
Mike Send message Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80 |
I also got a lot more invalids today. Most of them still have no autocorr. With each crime and every kindness we birth our future. |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
I'm still getting an unusually high number of invalids. Since the original splitter problem seems to have been fixed, everything you get from now on for those WUs will be resends, tasks _2 thru _9. Whereas a "good" WU only requires 2 hosts to put it to bed, these suckers need 5 times that many, all of it just wasted host processing. In the absence of any action by the admins to block the resends and stop wasting resources, a lot of those WUs will be circling the drain for many weeks to come. After getting stuck with about 30 invalids on my xw9400 in the initial wave, I've since managed to abort about 150 of those garbage tasks before they could run, freeing up a lot of processing time for actual productive work! |
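For a rough picture of why these WUs are so wasteful, here is a toy sketch of the quorum behaviour Jeff describes (a minimum quorum of 2 matching results, up to 10 tasks issued in total). The function and the exact-match rule are simplifications of mine, not BOINC's actual validator code:

```python
from collections import Counter

def needs_resend(results, min_quorum=2, max_total=10):
    """Decide whether a workunit needs another task issued.

    results: outcomes returned so far (None = host never replied).
    The WU validates once min_quorum results agree; it is abandoned
    once max_total tasks have been issued without reaching a quorum.
    """
    counts = Counter(r for r in results if r is not None)
    if counts and counts.most_common(1)[0][1] >= min_quorum:
        return False  # quorum reached, WU validates
    return len(results) < max_total  # otherwise resend, up to task _9

# A healthy WU: the first two hosts agree, job done.
# A broken WU: every result differs, so resends _2 through _9 all get
# issued and crunched for nothing before the WU finally gives up.
```

Under these assumptions a good WU stops at 2 tasks, while a terminally broken one burns through all 10, which is the 5x waste mentioned above.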
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
I also got a lot more invalids today. Resends (or the dregs of your cache, depending on how fast you process work), will take months to clear them all out. And probably 90% of all your current Inconclusives will end up being Invalid as well. Grant Darwin NT |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.