Strange result, how is this possible?
Author | Message |
---|---|
Joined: 25 Dec 00 Posts: 31122 Credit: 53,134,872 RAC: 32 |
IIRC -9 errors, if real, indicate the work unit was polluted with RFI. IIRC the system grants credit for the time spent crunching even though the result is unusable because crunch time was needed to know to throw the W/U away. |
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"How can the Cuda machine here wuid=666294818, validate and be given credit, when its result is a '-9 result_overflow', when the other Cuda which btw also overflowed, got an invalid result?" A few extra things to consider: - '-9 overflow' is not technically an 'error' but an 'Informational Message', indicating there are more reportable signals present than the allocated result space can hold. This would often indicate RFI (as mentioned) but can have other causes (not just GPUs either). - The selected 'Canonical' result, i.e. the one for entry into the science database, was one of the CPU ones; the other two that were granted credit were at least 'weakly similar'; the remaining erroneous one was not (i.e. was 'different'). - There are known precision and signal ordering differences that can be exposed with the Cuda apps compared to the CPU apps, particularly in the case of overflow ('correct' or not), and with signals near thresholds. This, IMO, is directly related to the CPU codebase being relatively mature, having some 10-11 years (x tens to hundreds of people's contributions) toward refinement, with the Cuda codebase having more like on the order of 6 months x a few people. - Given that the selected Canonical result was clearly 'the right' one (in this particular case), the science is not polluted by the clearly erroneous result, or even the weak similarity overflow one. To my mind this case is actually an example of the system working as it should. Jason [Edit:] a bit more info from: http://www.boinc-wiki.info/Canonical_Result So the only things you can be certain about are: "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
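
For anyone trying to picture the 'canonical' / 'weakly similar' distinction in the post above, here is a rough sketch in Python. It is not the real SETI@home or Boinc validator: the `Result` class, the `strongly_similar` / `weakly_similar` functions, the idea of comparing signals as exact set members, and the 50% overlap threshold are all assumptions made up for illustration. It only shows the general shape: a pair of closely agreeing results yields the canonical one, other results get credit if they overlap enough with it, and with no agreeing pair another result is needed.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Result:
    """Hypothetical stand-in for one returned result."""
    host: str
    signals: frozenset  # e.g. {("spike", 0), ("spike", 1), ...}


def strongly_similar(a, b):
    # Assumption: "strong" agreement means the two signal sets are identical.
    return a.signals == b.signals


def weakly_similar(a, b, min_overlap=0.5):
    # Assumption: "weak" agreement means at least min_overlap of the
    # smaller signal set also appears in the other result.
    smaller = min(len(a.signals), len(b.signals))
    if smaller == 0:
        return len(a.signals) == len(b.signals)
    return len(a.signals & b.signals) / smaller >= min_overlap


def validate(results):
    """Return (canonical, valid_hosts, invalid_hosts), or None when no pair
    strongly agrees and another result is needed to decide."""
    for a, b in combinations(results, 2):
        if strongly_similar(a, b):
            canonical = a  # assumption: first member of the first agreeing pair
            valid = [r.host for r in results if weakly_similar(canonical, r)]
            invalid = [r.host for r in results if r.host not in valid]
            return canonical, valid, invalid
    return None


if __name__ == "__main__":
    spikes = frozenset(("spike", i) for i in range(20))
    more_spikes = spikes | frozenset(("spike", i) for i in range(20, 30))
    pulses = frozenset(("pulse", i) for i in range(7))
    batch = [
        Result("cpu1", spikes),
        Result("cpu2", spikes),
        Result("cuda_a", more_spikes),  # overflowed, but overlaps the CPU results
        Result("cuda_b", pulses),       # overflowed with unrelated signals
    ]
    print(validate(batch))
```

With made-up data shaped like this thread's case (two matching CPU results, one overflow result that contains the same signals plus extras, one overflow result with unrelated signals), the sketch picks a CPU result as canonical, grants the overlapping overflow result, and rejects the unrelated one. With only two disagreeing results it returns `None`, which corresponds to an inconclusive work unit waiting for a third result.
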
Dave Stegner Joined: 20 Oct 04 Posts: 540 Credit: 65,583,328 RAC: 27 |
Sten-Arne's post got me to looking around at some of my pendings. I sure have a lot of inconclusive, never have had before. My machines have not changed configuration in over a year and have never had issues with validating. But, pair it with a Cuda and look out. http://setiathome.berkeley.edu/workunit.php?wuid=668605751 I guess I will need to monitor this closely, as I am not interested in spending money on machines and electricity for unstable results. Things should get really fun when ATI apps are released, MB will be more complicated and AP will have a chance to become unstable also. Dave |
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"Sten-Arne's post got me to looking around at some of my pendings. I sure have a lot of inconclusive, never have had before. My machines have not changed configuration in over a year and have never had issues with validating. But, pair it with a Cuda and look out." Now this example is a much clearer case of one host generating spurious results (for whatever reason). The validator needs a third result to decide which is really 'correct' :) |
Dave Stegner Joined: 20 Oct 04 Posts: 540 Credit: 65,583,328 RAC: 27 |
I agree, in the interest of the science we need a third. But, as stated 1 machine is running a stable and proven app the other is running Cuda. If the third proves that the stable app is correct, nothing will be done about the cuda machine. I think that was the op's point. It is not only a waste of resources on the client's part, it is a huge waste on Seti's part. Dave |
Joined: 4 Oct 00 Posts: 9541 Credit: 50,759,529 RAC: 60 |
"How can the Cuda machine here wuid=666294818, validate and be given credit, when its result is a '-9 result_overflow', when the other Cuda which btw also overflowed, got an invalid result?" It appears that the results are similar enough that it granted you credit. The two you matched with both had 20 spikes, you had 30, and the other CUDA found 7 pulses. In a rich man's house there is no place to spit but his face. Diogenes Of Sinope |
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"I agree, in the interest of the science we need a third." OK, I agree in principle the owner of the machine needs to look at his system, no argument there. But let's put all this in a little scientific perspective. We're talking about a data reduction mechanism using redundancy to validate the data. That inbuilt redundancy in itself could be viewed as an inefficiency, since if every host was truly reliable we wouldn't need such waste. I view it as the Boinc mechanisms protecting the scientific integrity of the data, and using a resource that it has ample supply of to do so (compute power). Since very few of us CPU users use ECC RAM, fault tolerant hard drive arrays, etc., there are occasions CPU hosts can go haywire too... Just ask msattler, LoL. When push comes to shove, none of the signals in any of those results means more than 'there might have been something at that point at this time'. To be a 'valuable' detection of a candidate, potentially for re-observation, the project has determined there must be 'persistency' as well, which incidentally is something the famous 'Wow! signal' never achieved. Boinc is designed to inherently mistrust the results being returned, and these are examples where the system is working, rather than broken, IMO. Show me a clear example of a bodgy result being selected as canonical and I'll join you jumping up and down. I'm sure some exist, but am fairly confident other steps in the science catch such 'issues'. Having said all that, there is a tradeoff going on. New potent compute power is being introduced that is based on historic supercomputer designs, but is fundamentally (relatively) new to this particular application. These are vastly different programming techniques to traditional CPU programming, and there are liable to be growing pains.
No doubt some have expectations for software to be 'perfect' before any kind of release, but sadly it just doesn't work that way, and I suggest those expectations for something this complex are not realistic. So the choices become to either write the progress off as 'a bad idea', or go through the growing pains with sufficient safeguards in place to avoid 'contamination'. The reason there isn't really a 'middle ground' in this process, as such, is that you cannot find & fix errors that don't occur. Jason |
Joined: 25 Dec 00 Posts: 31122 Credit: 53,134,872 RAC: 32 |
FYI I believe the canonical result is the first of the matching results returned. There are two possible errors that could get into the science: a false positive or a false negative. In the first case it will sort itself out at the re-observation stage. In the second case, we might miss ET unless he keeps sending. |
Dave Stegner Joined: 20 Oct 04 Posts: 540 Credit: 65,583,328 RAC: 27 |
If I am reading the intent of the original post correctly, we are talking about machines that create issues. It does not matter if a machine is using an old app (which creates problems also) or a bleeding edge supercomputer; if it is creating issues it should be cut off. Dave |
Kevin Olley Joined: 3 Aug 99 Posts: 906 Credit: 261,085,289 RAC: 572 |
"Sten-Arne's post got me to looking around at some of my pendings. I sure have a lot of inconclusive, never have had before. My machines have not changed configuration in over a year and have never had issues with validating. But, pair it with a Cuda and look out." I was worried about trashing units when I upgraded, and yes I trashed two (on CPU), entirely my fault. But I tried to keep damage to a minimum. Yes, I do get a few "-9 result_overflow" results. Hopefully these are caused by the actual workunits, not by errors on my machine, on which I keep a regular watch; that includes monitoring temps on both GPUs and CPU. Dust bunnies are regularly evicted. I did not realise that some machines could be performing that badly. Kevin |
Joined: 16 May 99 Posts: 10436 Credit: 110,373,059 RAC: 54 |
"Sten-Arne's post got me to looking around at some of my pendings. I sure have a lot of inconclusive, never have had before. My machines have not changed configuration in over a year and have never had issues with validating. But, pair it with a Cuda and look out." I too keep a close eye on my machines. I've dug around some wingmen's hosts and sometimes it's downright scary how some trash work by the thousands. And most won't even answer a PM. They just happily trash work. I'm wondering how many have the new Fermis and installed the wrong opt apps on them? Old James |
JohnDK Joined: 28 May 00 Posts: 1222 Credit: 451,243,443 RAC: 1,127 |
Seems this computer is one of those Sten-Arne is talking about http://setiathome.berkeley.edu/show_host_detail.php?hostid=5293938 |
Kevin Olley Joined: 3 Aug 99 Posts: 906 Credit: 261,085,289 RAC: 572 |
I am running Lunatics, not stock apps, on a pair of 470's, so I actually have the new Fermis with non-stock apps running without trashing work by the thousands. The machine mentioned above is not running Fermis and is running stock apps, so it may not be down to the actual software or hardware but how some users actually set up their hardware or software. It ain't what you got, it's the way that you use it. Kevin |
Joined: 26 May 99 Posts: 9958 Credit: 103,452,613 RAC: 328 |
Which I believe was Sten-Arne's point. But how do we ensure this doesn't happen? I recently had a problem on a couple of my machines, but worked hard to recover the WUs and in the end succeeded. I have to admit it is odd to spend time and money on a fast cruncher and then never check it. Bernie |
Kevin Olley Joined: 3 Aug 99 Posts: 906 Credit: 261,085,289 RAC: 572 |
There are some of us that have built or upgraded machines for doing SETI or other Boinc work, and then there are those that have got fast machines (probably mainly for gaming) that think they are being helpful. Some of these will hopefully mature into dedicated crunchers, but others will probably forget they even installed it in the first place, and the only time we will lose them is when they upgrade their machines and forget to install it again. Unfortunately this is probably a never-ending cycle. Kevin |
Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
The host in question does run Fermis, GTX480 (2x); maybe it's running too many WUs at a time, see this: Stderr output <core_client_version>6.10.58</core_client_version> <![CDATA[ <stderr_txt> setiathome_CUDA: Found 2 CUDA device(s): Device 1 : GeForce GTX 480 totalGlobalMem = 1576468480 sharedMemPerBlock = 49152 regsPerBlock = 32768 warpSize = 32 memPitch = 2147483647 maxThreadsPerBlock = 1024 clockRate = 810000 totalConstMem = 65536 major = 2 minor = 0 textureAlignment = 512 deviceOverlap = 1 multiProcessorCount = 15 Device 2 : GeForce GTX 480 totalGlobalMem = 1576468480 sharedMemPerBlock = 49152 regsPerBlock = 32768 warpSize = 32 memPitch = 2147483647 maxThreadsPerBlock = 1024 clockRate = 810000 totalConstMem = 65536 major = 2 minor = 0 textureAlignment = 512 deviceOverlap = 1 multiProcessorCount = 15 setiathome_CUDA: CUDA Device 1 specified, checking... Device 1: GeForce GTX 480 is okay SETI@home using CUDA accelerated device GeForce GTX 480 V12 modification by Raistmer Priority of worker thread rised successfully Priority of process adjusted successfully Total GPU memory 1576468480 free GPU memory 1063374848 setiathome_enhanced 6.02 Visual Studio/Microsoft C++ Build features: Non-graphics CUDA VLAR autokill enabled FFTW USE_SSE x86 CPUID: Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz Cache: L1=64K L2=256K CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 libboinc: 6.3.22 Work Unit Info: ............... WU true angle range is : 0.426153 After app init: total GPU memory 1576468480 free GPU memory 969003008 SETI@Home Informational message -9 result_overflow NOTE: The number of results detected exceeds the storage space allocated. Flopcounter: 204566200.959168 Spike count: 0 Pulse count: 31 Triplet count: 0 Gaussian count: 0 Something is using a lot of memory...?! |
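
To put numbers on the "something is using memory" remark, here is a quick bit of arithmetic in Python using only the three byte counts that appear in the stderr output above. The variable names are my own, and reading the difference as "memory taken before this task started" versus "memory taken by this task's initialisation" is an interpretation of the log rather than anything the log states.

```python
MIB = 1024 * 1024  # bytes per mebibyte

# Figures copied from the stderr output quoted above.
total_gpu_mem = 1576468480      # "Total GPU memory"
free_at_start = 1063374848      # "free GPU memory" when the app started
free_after_init = 969003008     # "free GPU memory" after app init

# Memory already occupied on the device before this task allocated anything
# (interpretation: desktop, driver, or other tasks on the same GPU).
in_use_before_app = total_gpu_mem - free_at_start

# Additional memory that disappeared during this task's initialisation.
taken_during_init = free_at_start - free_after_init

print(f"total on device:      {total_gpu_mem / MIB:8.2f} MiB")
print(f"in use before app:    {in_use_before_app / MIB:8.2f} MiB")
print(f"taken during init:    {taken_during_init / MIB:8.2f} MiB")
print(f"free after app init:  {free_after_init / MIB:8.2f} MiB")
```

So roughly 489 MiB of the card's 1503 MiB was already gone before this task allocated its own 90 MiB, which fits the guess that something else was occupying the GPU, though the log alone does not say what.
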
Richard Haselgrove Joined: 4 Jul 99 Posts: 14687 Credit: 200,643,578 RAC: 874 |
"The host in question, does run FERMI's; GTX480(2x), maybe it's running too many WU's at a time, see this:" No, somebody has deliberately installed an 'optimised' (tweaked to reduce errors, but no real speedup) application, incompatible with their graphics card. Raistmer, do you have any way to autokill your autokill application? ;-))) |
Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
You're right, this doesn't look good and will put the wrong result in the DataBase!
You were just ahead of me ;-), Richard.......And this is even worse! |
Kevin Olley Joined: 3 Aug 99 Posts: 906 Credit: 261,085,289 RAC: 572 |
"The host in question, does run FERMI's; GTX480(2x), maybe it's running too many WU's at a time, see this:" The machine that I was looking at was http://setiathome.berkeley.edu/workunit.php?wuid=668605751 5625026: Owner *******; Created 2 Dec 2010 6:54:43 UTC; Total credit 101,626; Average credit 3,606.75; CPU type GenuineIntel Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz [Family 6 Model 26 Stepping 5]; Number of processors 8; Coprocessors [4] NVIDIA GeForce GTX 295 (869MB) driver: 26099; Operating System Microsoft Windows 7 Ultimate x64 Edition, (06.01.7600.00); BOINC version 6.10.58; Memory 12279.12 MB; Cache 256 KB; Measured floating point speed 2911.97 million ops/sec; Measured integer speed 9213.48 million ops/sec; Average upload rate 39.93 KB/sec; Average download rate 381.27 KB/sec; Average turnaround time 0.03 days; Tasks 4300. ATM he has 6 valid tasks from 17 Dec. Kevin |
JohnDK Joined: 28 May 00 Posts: 1222 Credit: 451,243,443 RAC: 1,127 |
"The host in question, does run FERMI's; GTX480(2x), maybe it's running too many WU's at a time, see this:" Or how about the SETI project tries to contact these host owners to correct the problem, and if that doesn't help for whatever reason, block these hosts? Can they block hosts if they want? |
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.