Invalid, due to difference of opinion.
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
A WU result was marked Invalid for me (CUDA50) but validated for the others (OpenCL ATI). I came up with 4 triplets; they came up with 30 Autocorrelation counts. Luckily 0.28 isn't a great loss, even with the poor pay rate of CreditNew. Grant Darwin NT |
Rasputin42 Send message Joined: 25 Jul 08 Posts: 412 Credit: 5,834,661 RAC: 0 |
Hi, I had a WU marked "invalid". When I checked it, one wingman had finished it and a third was still calculating. If a WU gets two different results, why is it marked Invalid BEFORE the third one confirms which is right? http://setiathome.berkeley.edu/result.php?resultid=4115671844 |
BilBg Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
If a WU get two different results ... http://setiathome.berkeley.edu/workunit.php?wuid=1773860844 They look the same according to stderr_txt (Spike count: 3). I can only guess that your result file (which is a different file from stderr.txt) had some garbage or was truncated (missing some lines/bytes at the end). Result files have names like: 26no12ag.13645.20272.438086664198.12.68_1_0 - ALF - "Find out what you don't do well ..... then don't do it!" :) |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
WU http://setiathome.berkeley.edu/workunit.php?wuid=1777150544 Worth getting the task for checking with a CPU. http://boinc2.ssl.berkeley.edu/sah/download_fanout/291/21jn12aa.27468.9985.438086664200.12.14 (unfortunately, the task has already been removed) |
BilBg Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
Since both 'other' computers: http://setiathome.berkeley.edu/results.php?hostid=6158143&offset=0&show_names=0&state=5&appid= http://setiathome.berkeley.edu/results.php?hostid=6775319&offset=0&show_names=0&state=5&appid= ... have many 'Invalid' tasks, and the 'Valid' GPU tasks have the same short 'Run time' of < 1 minute, you can guess which result was OK ;) - ALF - "Find out what you don't do well ..... then don't do it!" :) |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Since both 'other' computers: Thanks for the observation. Also, in the initial result both ATi devices are the same model: BeaverCreek. Worth checking, then, whether those devices tend to produce invalids or whether it was just an unfortunate coincidence. The situation where false positives validate against each other is not new. We saw the same with NV GPUs under CUDA already, for Spikes if I recall correctly. It seems the same has now been detected with ATi GPUs and Autocorrelations. Worth spotting anything that could act as a sign of the issue, so a sanity check could stop such false positives from reaching the validator. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Also, this example is a good illustration of how the current quota management system acts. IMHO, it acts incorrectly. The host has many invalids when checked against different devices. But once in a while an unfortunate event happens: it is matched against a similarly broken device. And what happens? The false result validates, which immediately gives a great boost to the host's daily quota. And that, in turn, raises the chances of finding a similarly broken host again. Hence, the quota system allows a positive feedback loop here that acts like an invalid-results amplifier. IMO it's worth approaching the BOINC devs with the idea of additional measures to ensure device variety. Currently we have such variety at the host-ID level only: the same host can't get tasks from the same WU. But in the case of a broken device model, such devices can reside in different hosts. Maybe it's worth adding variety at the device level too (as usual, we come to the conclusion that the basic entity BOINC should manage is the device, not the host. A conclusion that was reached over 10 years ago...). |
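The feedback loop described above can be shown with a toy model. The doubling/cut-back rule below is a simplification for illustration, not BOINC's actual per-app-version quota code, and `MAX_QUOTA` is an assumed project setting:

```python
# Toy model of the daily-quota feedback loop: a quota that grows on
# validated results and shrinks on invalid ones. This is a sketch of
# the general idea, not BOINC's real quota-management code.
MAX_QUOTA = 100  # assumed project cap, for illustration only

def update_quota(quota, validated):
    """Grow the quota on a validated result, cut it back on an invalid one."""
    if validated:
        return min(quota * 2, MAX_QUOTA)
    return max(quota // 2, 1)

def simulate(outcomes, start=1):
    """Replay a sequence of True (validated) / False (invalid) results."""
    quota = start
    history = []
    for ok in outcomes:
        quota = update_quota(quota, ok)
        history.append(quota)
    return history
```

Under such a rule, a single lucky validation against a similarly broken wingman multiplies the broken host's quota, letting it emit more bad results and raising the odds of another broken-vs-broken match.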
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Also, this example is good illustration of how current quota management system acts. [...] Hence, quota system allows positive feedback loop here that acts like invalid results amplifier. BOINC (at the server/administration level) already has two features available: Homogeneous Redundancy and Homogeneous App Version. What you're suggesting is effectively the inverse of those - a sort of anti-homogeneous redundancy, or enforced divergence. That might well work for this project, which does indeed have a wide range of divergent hardware and software available for processing. But just as with Einstein's Locality Scheduling, I'd be worried that the extra decision-making in the scheduler would add to the server workload, and possibly delay work allocation until an allowable host requested work if, say, a malformed workunit caused a succession of errors. It's an interesting idea, and by all means float it past Eric and the rest of the BOINC development team, but be prepared for the response that it might cause more problems than it solves. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
0 No homogeneous redundancy (all hosts are numerically equivalent)
1 A fine-grained classification with 80 classes (4 OS and 20 CPU types).
2 A coarse-grained classification in which there are 4 classes: Windows, Linux, Mac-PPC and Mac-Intel.
As one can see, no mention of GPUs (!). And GPU diversity is much greater than CPU diversity, which makes this feature quite inefficient, at least in its current form. What I propose is indeed quite the opposite direction, but I don't see how it could increase scheduler logic and overhead much over this particular option. For this option to work, one needs to classify hosts by device, then choose hosts with the proper devices (in this case, similar ones). My proposal requires the same classification by device; up to that point the two are identical. Then, instead of similar devices, one needs dissimilar ones. Same expense. Moreover, to take the best of both approaches, one could keep the coarse-grained similarity approach (to ensure numerical similarity) but require different devices, to reduce the chance of hitting model-specific issues. But that would require finer graining, which could indeed increase overhead.
Again, this could work with the addition of diversity by device inside the same app class (inside the same GPU vendor) to reduce numerical divergence. But AFAIK the SETI project has not enabled either of these options, so we have all that numerical diversity (and from time to time suffer from it via an increased inconclusive rate). Hence, as a first approach, we could just get something similar to one of those options but enabled for GPUs (and with the opposite sign) (hence perhaps it should be based on the second one). Also, no need to think in absolute terms: if the server can't find an appropriate combo it could just send to any host, but preferably send according to such criteria. This would not worsen task allocation and queues. And of course, just as with any policy, it should be made switchable: if a project needs it, it uses it. It's not meant as a hardwired default, of course.
Any volunteers to write this up and promote it for BOINC team review? |
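The "anti-homogeneous redundancy" selection sketched in the two posts above comes down to a soft preference in wingman selection. The function and data shapes below are hypothetical, not BOINC scheduler code; the fallback to "any host" follows the suggestion that the server should not delay allocation when no dissimilar device is available:

```python
# Sketch of device-diverse wingman selection: prefer a host whose GPU
# model differs from the device that produced the first result, but
# fall back to any available host rather than hold up allocation.
# All names and structures here are illustrative assumptions.
def pick_wingman(candidates, first_device):
    """candidates: list of (host_id, device_model) tuples.

    Returns the preferred host_id, or None if no candidates exist.
    """
    for host_id, device in candidates:
        if device != first_device:
            return host_id  # preferred: a dissimilar device model
    # No dissimilar device available: soft preference only, send to any.
    return candidates[0][0] if candidates else None
```

With two BeaverCreek hosts and one NV host in the candidate pool, the NV host would be chosen, so two instances of the same broken device model could not validate each other.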
Cavalary Send message Joined: 15 Jul 99 Posts: 104 Credit: 7,507,548 RAC: 38 |
I've been wondering about this actually, whenever I checked my results and saw an inconclusive due to a wingman's 30 spikes on a GPU that outputs loads of invalids of that type, and wondered what would happen if the third came back the same. I don't recall seeing 30 autocorrs, but lately I do see a fair number of inconclusives due to 1-2 autocorrs (or 1-2 more) on a GPU, though in those cases all validate after the third, since that seems to be the only difference. So yes, Raistmer's idea seems very good, and I say try to pair GPU and CPU if possible; if not, and since GPUs can crunch so much more, at least different chipset makers, possibly different generations too. Oh, and likely different driver versions even before generations. |
betreger Send message Joined: 29 Jun 99 Posts: 11361 Credit: 29,581,041 RAC: 66 |
Good ideas, but due to the cost of the manpower and machinery I don't see it happening. Keep in mind this project is big science on a small budget. |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
This might be the same problem I brought up over 15 months ago in Two wrongs make a right. At one point I had a list of about a dozen of these ATI rigs that were validating against each other with Autocorr counts of 30 that were clearly bogus, driving out legitimate results. Since there didn't seem to be any interest in fixing the problem, I gave up trying to track them. I think most of those have either been cleaned up or gone away by now, but new ones like these two have probably come along. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
This might be the same problem I brought up over 15 months ago in Two wrongs make a right. I reckon it is; I remembered seeing a post about something similar but couldn't remember just what or when it was. And I didn't realise it was made so long ago; I thought it was only a few months ago. Grant Darwin NT |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Actually, your highlighting of that issue, along with similar reports for iGPUs, ultimately led to the development of additional checks inside the app itself to prevent false results from even reaching the validator. This bunch of checks we internally call sanity checks. Some sanity checks were implemented for both MultiBeam and AstroPulse, with the main constraints developed by Josef. So, not useless. The other side of the issue is that not all such results can be easily spotted and marked as invalid programmatically. A human can look at other hosts' results and decide from the whole set of data what is right and what is not; the application has to decide during the task's computation. Hence I asked for any signs that can distinguish false positives of this kind from a regular overflow (the overflow itself can't be enough). |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Those sanity checks were completed at about rev 2421 last June, so in line with the Official Seti v7 binary vs. Optimized one thread it would be good to get the release builds updated past their current rev 1831. The first step will have to be getting the Beta splitter configuration fixed so the pfb_splitters used there consistently produce WUs with the intended analysis_cfg parameters. When I first saw this thread yesterday the task details had already been purged. If anyone managed to capture those I'd be very interested in the signals as shown in the stderr sections of the OpenCL ATi tasks. Those would reveal whether the Autocorr sanity check would have been effective for this specific case. Assuming nobody saved those details, consider this a request to do so in future similar cases. Joe |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
When I first saw this thread yesterday the task details had already been purged. If any one managed to capture those I'd be very interested in the signals as shown in the stderr sections of the OpenCL ATi tasks. Those would reveal whether the Autocorr sanity check would have been effective for this specific case. Assuming nobody saved those details, consider this a request to do so in future similar cases. Joe Okay, I found my old list and quickly found a host, 6772486, that still appears to be doing its damage. The first example I see is WU 1776725339, which has two overflow tasks (29 Autocorr, 1 triplet) from consistently bad ATI rigs causing a non-overflow task (3 spikes, 4 pulses, 3 triplets) from a normally clean rig to be marked Invalid. If I take the time, I can probably find many more. |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
When I first saw this thread yesterday the task details had already been purged. If any one managed to capture those I'd be very interested in the signals as shown in the stderr sections of the OpenCL ATi tasks. Those would reveal whether the Autocorr sanity check would have been effective for this specific case. Assuming nobody saved those details, consider this a request to do so in future similar cases. Joe Yes, hosts 6772486 and 7278254 are definitely producing false overflows on Autocorrs which would be caught by the added sanity check. Also host 6320677, which I found while looking at results from the other two. The Autocorr sanity check for OpenCL builds after rev 2421 triggers on an overflow with one or more Autocorr peaks above 100; given that pair of conditions the task will be errored out. It's done that way because I did spot one case where a single Autocorr peak well above 100 was found by reliable hosts on a non-overflow task, one running a CPU build IIRC. Joe |
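Joe's two-condition rule is concrete enough to sketch. The function below is an illustrative restatement in Python, not the actual C++ from the application; the threshold of 100 and the overflow condition come directly from his description:

```python
# Sketch of the Autocorr sanity check described for post-rev 2421
# OpenCL builds: a task is errored out only when it is an overflow
# AND at least one Autocorr peak exceeds 100. Illustrative only,
# not the application's real code.
AUTOCORR_PEAK_LIMIT = 100.0

def autocorr_sanity_check(is_overflow, autocorr_peaks):
    """Return True if the task should be errored out as a false overflow."""
    suspicious = any(p > AUTOCORR_PEAK_LIMIT for p in autocorr_peaks)
    return is_overflow and suspicious
```

Requiring both conditions preserves the legitimate case Joe mentions: a single Autocorr peak well above 100 found by reliable hosts on a non-overflow task is left alone.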
cliff Send message Joined: 16 Dec 07 Posts: 625 Credit: 3,590,440 RAC: 0 |
Hi Cavalary,
Bear in mind that in the case of Nvidia GPUs, if a user has a modern/recent model GPU and a much older one in their rig, they can only install the drivers for the new model, and cannot revert to an older driver. So it may well be that NV users will pretty much all have 3xx.xx series drivers installed. What the score is for ATI GPUs I have no idea; I've never used one. So for NV at least there probably isn't much driver diversity. Regards, Cliff, Been there, Done that, Still no damm T shirt! |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Those sanity checks were completed at about rev 2421 last June, so in line with the Official Seti v7 binary vs. Optimized one thread it would be good to get the release builds updated past their current rev 1831. First step will have to be getting the Beta splitter configuration fixed so the pfb_splitters used there consistently produce WUs with the intended analysis_cfg parameters. Joe Any progress in getting those sanity checks implemented? The reason I ask is that I found this morning that I was on the losing end of yet another WU where two ATI rigs with Autocorr counts of 30 invalidated my task with 2 Spikes and 6 Triplets. In following the trail of those two rigs, I turned up 11 more (before I quit looking) which are happily validating against each other, often trashing what are likely good science results. At least 3 of those 11 are new machines that were signed up in the last couple of months, and none of them were on the list I made early last year. For what it's worth, the IDs are: 7456351, 7537898, 6889285, 6641343, 7453260, 7114470, 7010985, 7556063, 7504170, 7084572, 6759774, 7553275, 6722861. Of course, while the sanity checks would be a big help, the real solution might be to figure out what could be causing all these ATI rigs, and only ATI rigs, to produce these wacko (yet consistent) Autocorr results in the first place. |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Those sanity checks were completed at about rev 2421 last June, so in line with the Official Seti v7 binary vs. Optimized one thread it would be good to get the release builds updated past their current rev 1831. First step will have to be getting the Beta splitter configuration fixed so the pfb_splitters used there consistently produce WUs with the intended analysis_cfg parameters. Joe All 13 of the ATI hosts you found are using Catalyst 11.10 or 11.11 drivers, which have AMD APP SDK 2.5 support. AMD APP SDK 2.6 or better (Catalyst 11.12+) is needed for the Windows SaHv7 OpenCL 7.03 builds to do proper Autocorr processing. Those 7.03 builds are from rev 1831; builds from rev 1870 or later will error out all SaHv7 tasks on too-old drivers. That will drive the "Max tasks per day" for their SaHv7 OpenCL app versions down to 1, so even if the user never updates the drivers, the only remaining issue would be relatively few tasks being sent to the host and returned as errors. The current Windows 7.07 OpenCL ATI versions under test at Beta are from rev 2929. Joe |
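The driver gate Joe describes can also be sketched: builds from rev 1870 onward refuse to run SaHv7 tasks when the driver's AMD APP SDK support is older than 2.6. The version-tuple representation and helper names below are assumptions for illustration, not the application's real interfaces:

```python
# Sketch of the driver gate: error out SaHv7 tasks when the AMD APP
# SDK support level is below 2.6 (i.e. Catalyst older than 11.12).
# Illustrative only; version handling in the real app differs.
MIN_SDK = (2, 6)  # minimum SDK for correct Autocorr processing

def sdk_ok(sdk_version):
    """sdk_version: (major, minor), e.g. (2, 5) for Catalyst 11.10/11.11."""
    return sdk_version >= MIN_SDK

def handle_task(sdk_version):
    """Run the task on a capable driver, otherwise error it out.

    Erroring out immediately drives the host's 'Max tasks per day'
    down, limiting how many tasks a stale-driver host can receive.
    """
    return "run" if sdk_ok(sdk_version) else "error"
```

The design choice here is to convert silently wrong results into loud errors: an error feeds back into the quota system and throttles the host, whereas a plausible-looking false overflow can validate and do damage.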
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.