Two wrongs make a right

Author	Message
Jeff Buck (Volunteer tester; Joined: 11 Feb 00; Posts: 1441; Credit: 148,764,870; RAC: 0)	Message 1460370 - Posted: 3 Jan 2014, 18:25:44 UTC
Here's one of those situations where runaway rigs become more than just an annoyance. On WU 1392974451, 3 out of 4 reporting hosts got -9 overflows (mine being the only one that didn't). As luck would have it, 2 of those hosts apparently got the same Autocorr count of 30 and validated against each other. Both of these hosts (6772486 and 6062303) appear to be generating nothing but -9 overflows, so I tend to think that the result my machine returned is actually the accurate one. Unfortunately, the bad result got validated, and that's what will end up in the science database.
Looking at the task list for 6062303, I can see at least a couple more WUs (1392177317 and 1384362440) where the non-overflow result was marked Invalid while the apparently bogus overflow result received the validator's blessing. I didn't take the time to dig any deeper, because I find the whole situation rather depressing.
I sure wish the staff would take a more serious look at the way these runaway rigs are pushing aside good results and corrupting the science database.

jason_gee (Volunteer developer, Volunteer tester; Joined: 24 Nov 06; Posts: 7489; Credit: 91,093,184; RAC: 0)	Message 1460371 - Posted: 3 Jan 2014, 18:35:54 UTC - in response to Message 1460370.
On the CUDA side of things, your result looks solid, and I definitely feel that failsafes in overflow situations are warranted, so I have been working toward implementing them. It's non-trivial and taking time, but definitely worthwhile.
Are there known issues or particular modes of failure for those agreeing ATI/AMD-style configs? I find it pretty odd that two obviously failing runs would overflow on autocorrelations when there are none, and then match at least weakly.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions

Josef W. Segur (Volunteer developer, Volunteer tester; Joined: 30 Oct 99; Posts: 4504; Credit: 1,414,761; RAC: 0)	Message 1460391 - Posted: 3 Jan 2014, 20:50:35 UTC
Autocorrs with peak powers over 100 are so unlikely that a power of 100 might be considered a sanity-check level. Similar "too good to be true" levels for other signal types would be possible.
Even without that, the simple fact that CPU processing didn't overflow is a strong indicator that the GPU results are wrong. IMO it would be good to insert a check in the Validator so that it does not accept overflowed results if there's a non-overflow one. That's not really targeted against GPU work; even CPU processing gone bad tends to create false signals and overflow.
Joe
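To make the proposal in the post above concrete, here is a minimal Python sketch of such a check. It is not the project's actual validator code: the `Result` class, its fields, and the function names are all hypothetical, and treating a peak power of 100 as a hard cutoff is an assumption drawn from the post. The counts for the overflow results (Autocorr 30, everything else 0) are likewise assumed for illustration.

```python
from dataclasses import dataclass

# Sanity level suggested in the post above; using it as a hard cutoff is an assumption.
AUTOCORR_SANITY_POWER = 100.0


@dataclass(frozen=True)
class Result:
    """Hypothetical summary of one returned task (not the real validator's data model)."""
    host_id: int
    counts: tuple  # (spike, autocorr, pulse, triplet, gaussian)
    overflow: bool  # True if the task ended in a -9 overflow
    max_autocorr_power: float = 0.0


def is_suspect(result):
    """Flag a result whose strongest autocorrelation looks too good to be true."""
    return result.max_autocorr_power > AUTOCORR_SANITY_POWER


def eligible_results(results):
    """Drop suspect results, then drop overflows if any non-overflow result remains."""
    pool = [r for r in results if not is_suspect(r)]
    non_overflow = [r for r in pool if not r.overflow]
    return non_overflow if non_overflow else pool


def find_agreeing_pair(results):
    """Return the first two eligible results with identical counts, or None."""
    pool = eligible_results(results)
    for i, first in enumerate(pool):
        for second in pool[i + 1:]:
            if first.counts == second.counts:
                return first, second
    return None


if __name__ == "__main__":
    # Two overflow results that agree with each other, plus one non-overflow result.
    overflow_a = Result(6772486, (0, 30, 0, 0, 0), overflow=True)
    overflow_b = Result(6062303, (0, 30, 0, 0, 0), overflow=True)
    clean = Result(1, (1, 0, 0, 1, 0), overflow=False)
    # With the guard in place, the overflow pair can no longer outvote the clean result.
    print(find_agreeing_pair([overflow_a, overflow_b, clean]))
```

Under this rule, a work unit with two matching overflows and one non-overflow result stays unvalidated until another non-overflow result arrives to confirm or contradict the first. If every result overflowed, the overflows can still validate against each other, so genuinely noisy work units are not blocked.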

Jeff Buck (Volunteer tester)	Message 1461947 - Posted: 8 Jan 2014, 22:22:19 UTC - Last modified: 8 Jan 2014, 22:31:15 UTC
In looking again at the task list for 6062303, I can see that he's still occasionally getting validations when his -9 overflows (w/ Autocorr: 30) happen to get paired with another ATI card that's doing the same thing, such as on WU 1397034403. That leaves what should be the valid result (in this case: 9,1,0,10,0) from an otherwise reliable host out in the cold.
Interestingly, I noticed another WU (1395333372) where the two overflow tasks validated and the non-overflow one got left out, but there's still another task "in progress", despite the validation. Perhaps this is an artifact from the server issues of the last couple of days, but it might be interesting to see what happens if this "extra" task gets returned. Can it validate against the task that's already been marked Invalid?
Edit: Actually, now I see 2 more WUs (1395333360 and 1395333366) in exactly the same situation, and with what appears to be the same cast of characters. The poor host #6942594 has gotten the shaft all 3 times (and so, in all probability, has the science database).

Jeff Buck (Volunteer tester)	Message 1462774 - Posted: 10 Jan 2014, 21:11:39 UTC - in response to Message 1461947.
Quoting Message 1461947: "Interestingly, I noticed another WU (1395333372) where the two overflow tasks validated and the non-overflow one got left out, but there's still another task "in progress", despite the validation. Perhaps this is an artifact from the server issues of the last couple of days, but it might be interesting to see what happens if this "extra" task gets returned. Can it validate against the task that's already been marked Invalid?"
Well, it appears that the verdict is in, at least for WU 1395333372. The _3 came back with the same counts as the _1:
Spike count: 1, Autocorr count: 0, Pulse count: 0, Triplet count: 1, Gaussian count: 0
But since the _1 had already been marked as Invalid due to the two -9 overflows (w/ Autocorr count: 30) validating against each other, the _3 was also marked as Invalid. I suspect the same thing will happen shortly to the other two WUs I referenced.
To my mind, that's JUST PLAIN WRONG! Apparently good results are getting thrown out and bad results are polluting the science database. And unless runaway rigs somehow get choked off, it'll just keep on happening.

David S (Volunteer tester; Joined: 4 Oct 99; Posts: 18352; Credit: 27,761,924; RAC: 12)	Message 1463941 - Posted: 13 Jan 2014, 14:52:33 UTC
Has anybody talked to Eric about this?
David
Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri.

Jeff Buck (Volunteer tester)	Message 1464453 - Posted: 15 Jan 2014, 2:15:05 UTC - in response to Message 1463941.
Quoting Message 1463941: "Has anybody talked to Eric about this?"
From the resounding silence, I gather the answer to that is probably "no".
Another check of the current task list for 6062303 shows 4 newly "validated" WUs (1402088696, 1402797374, 1402238793, and 1402238781) where ATI overflows with Autocorr=30 validated while a more plausible result was marked Invalid. In addition, I also see several WUs (1403003019, 1403003013, 1403003083, 1403003077, 1403003071, 1403003065) where two ATI Autocorr=30 overflows validated on the _0 and _1 tasks, so there wasn't even a chance for a 3rd opinion.
This represents the damage done to the science database by just one runaway machine in a 24-hour stretch. I would guess that if you looked at the hosts that have validated against 6062303 in these instances (which I haven't, as yet), you could probably build a whole chain of hosts doing much the same thing.
I REALLY think the admins/scientists need to take a serious look at this.
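One way to spot the kind of runaway host described in this thread is to look at each host's overflow rate over its recent tasks. The sketch below is only an illustration in Python: the function name, the `(host_id, overflowed)` input format, and both thresholds are hypothetical, and nothing here reflects how the project's servers actually track hosts.

```python
from collections import Counter


def runaway_hosts(tasks, min_tasks=20, min_overflow_rate=0.9):
    """Return sorted host IDs whose recent tasks are almost all overflows.

    tasks: iterable of (host_id, overflowed) pairs, one per returned task.
    min_tasks and min_overflow_rate are illustrative defaults, not project values.
    """
    totals = Counter()
    overflows = Counter()
    for host_id, overflowed in tasks:
        totals[host_id] += 1
        if overflowed:
            overflows[host_id] += 1
    # Require a minimum sample so a host with a few unlucky tasks is not flagged.
    return sorted(
        host
        for host, count in totals.items()
        if count >= min_tasks and overflows[host] / count >= min_overflow_rate
    )


if __name__ == "__main__":
    # Made-up task log: one host that only overflows, one that rarely does.
    log = [(6062303, True)] * 25 + [(6942594, False)] * 24 + [(6942594, True)]
    print(runaway_hosts(log))
```

A list like this could feed a manual review or a reduced task quota. An occasional overflow is legitimate on a noisy work unit, so the rate threshold should stay high to avoid penalizing reliable hosts.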
