Outnumbered by cuda errors?

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 845356 - Posted: 26 Dec 2008, 17:45:08 UTC - in response to Message 845352. How can I find if a work unit is VHAR or not? If it is VLAR I look for <rsc_fpops_est>80360000000000.000000</rsc_fpops_est> in client_state.xml and cancel those immediately. Is there a way to know if a work unit is VHAR beforehand? Thanks AR is bigger than ~2.5. But don't abort all VHARs. Some of them don't give overflow. We still need to figure out the true VHAR - overflow relations for CUDA. So just look at them closely. BTW, here is some nice script for fast VLAR/VHAR tasks finding in BOINC cache ID: 845356 ·
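The client_state.xml check described in the question above can be sketched as follows. This is a minimal illustration, not the script linked in the post: the function name `find_workunits_by_fpops` is hypothetical, and it assumes each task appears in client_state.xml as a `<workunit>` block containing a `<name>` and an `<rsc_fpops_est>` element. The estimate value is the one quoted in the post; whether it reliably marks VLAR tasks is the poster's observation, not something verified here.

```python
import re

# Estimate quoted in the post as the marker the poster uses for VLAR tasks.
VLAR_FPOPS_EST = 80360000000000.0


def find_workunits_by_fpops(state_text, target=VLAR_FPOPS_EST):
    """Return names of <workunit> blocks whose rsc_fpops_est equals target.

    Assumes (hypothetically) that each block holds one <name> and one
    <rsc_fpops_est> element; blocks missing either are skipped.
    """
    names = []
    for block in re.findall(r"<workunit>(.*?)</workunit>", state_text, re.DOTALL):
        name = re.search(r"<name>(.*?)</name>", block)
        est = re.search(r"<rsc_fpops_est>([^<]+)</rsc_fpops_est>", block)
        if name and est and float(est.group(1)) == target:
            names.append(name.group(1).strip())
    return names
```

To use it, read client_state.xml into a string and pass it in; the returned names are candidates to inspect, not a list to abort blindly.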

maceda Volunteer tester Send message Joined: 27 Sep 99 Posts: 3 Credit: 25,114,284 RAC: 0	Message 845397 - Posted: 26 Dec 2008, 20:49:33 UTC - in response to Message 845356. AR is bigger than ~2.5. But don't abort all VHARs. Some of them don't give overflow. We still need to figure out the true VHAR - overflow relations for CUDA. So just look at them closely. BTW, here is some nice script for fast VLAR/VHAR tasks finding in BOINC cache OK. I'll leave VLAR for now, but I'm killing all VHAR work units I receive. By the way, someone at Seti might have noticed this since today I have only received 3 VHAR work units vs. dozens for yesterday and the day before. It should be fairly trivial for them not to send VHAR work units to cuda clients. Thanks. ID: 845397 ·

Sutaru Tsureku Volunteer tester Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5	Message 845409 - Posted: 26 Dec 2008, 21:07:43 UTC - in response to Message 845322. Last modified: 26 Dec 2008, 21:14:52 UTC A teammate get errors and -9 result_overflow's with the CUDA.. hostid=4710849 Only known bugs? "Known" are VLAR or VHAR related. Look on "true angle range" output in result's stderr. VLAR are <= 0.05 ? VHAR are >= 2.5 ? BTW. Why we are now testing BUG-app here in MAIN? It's not possible to test again in BETA? If two -9_result_overflow-error will compared.. hey - and the WOW-signal was in this WU.. nobody will know it.. ID: 845409 ·

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 845416 - Posted: 26 Dec 2008, 21:24:11 UTC - in response to Message 845409. Last modified: 26 Dec 2008, 21:25:28 UTC VLAR are <= 0.05 ? VHAR are >= 2.5 ? Approx. Why we are now testing BUG-app here in MAIN? Because beta corrupted by adaptive replication mode.... And because we ALREADY have CUDA release here, on main. With all its bugs onboard. It's not possible to test again in BETA? Possible but testing hindered (again, adaptive replication mode). Actually it's possible to use my mod both on main and beta, I will do this for example (w/o AP beta testing though). If two -9_result_overflow-error will compared.. hey - and the WOW-signal was in this WU.. nobody will know it.. Yes! And it's the great evil :) But as I already said we already have CUDA MB here with that bug inside it. So the sooner we eliminate it or at least will know what tasks we should avoid while doing task with CUDA MB the sooner this dreadful possibility will be diminished. ID: 845416 ·
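The approximate thresholds confirmed above (VLAR at or below 0.05, VHAR at or above 2.5) can be written as a small helper. This is only a sketch: the cutoffs are the approximate values from the thread, and the function name and the "mid" label are made up for illustration.

```python
# Approximate cutoffs quoted in the thread; "Approx." per the reply above.
VLAR_MAX = 0.05
VHAR_MIN = 2.5


def classify_angle_range(ar):
    """Label a work unit's true angle range as VLAR, VHAR, or mid-range."""
    if ar <= VLAR_MAX:
        return "VLAR"
    if ar >= VHAR_MIN:
        return "VHAR"
    return "mid"
```

The value to pass in is the "true angle range" figure printed in a result's stderr output.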

Sutaru Tsureku Volunteer tester Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5	Message 845445 - Posted: 26 Dec 2008, 22:36:41 UTC - in response to Message 845416. Last modified: 26 Dec 2008, 22:45:19 UTC If two -9_result_overflow-error will compared.. hey - and the WOW-signal was in this WU.. nobody will know it.. Yes! And it's the great evil :) But as I already said we already have CUDA MB here with that bug inside it. So the sooner we eliminate it or at least will know what tasks we should avoid while doing task with CUDA MB the sooner this dreadful possibility will be diminished. But to eliminate this possible worst case.. it would be better to 'call back' the SETI@home-CUDA-app here in MAIN until it's BUG-free. BTW. Is your app less buggy than the official app? ID: 845445 ·

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 845450 - Posted: 26 Dec 2008, 23:00:48 UTC - in response to Message 845445. It's just equally buggy with stock app :) But it has logging ability now and allows full using of CPU+GPU combo. ID: 845450 ·

SATAN Send message Joined: 27 Aug 06 Posts: 835 Credit: 2,129,006 RAC: 0	Message 845458 - Posted: 26 Dec 2008, 23:28:19 UTC Two of the main problems are shown in this extract from a work unit of mine. CPU time 15.39063 stderr out <core_client_version>6.5.0</core_client_version> <![CDATA[ <stderr_txt> cudaAcc_initializeDevice: Found 1 CUDA device(s): Device 1 : GeForce 8800 GT cudaAcc_initializeDevice is determiming what CUDA device to use... user specified SETI to use CUDA device 1: GeForce 8800 GT SETI@home using CUDA accelerated device GeForce 8800 GT setiathome_enhanced 6.02 Visual Studio/Microsoft C++ libboinc: 6.3.22 Work Unit Info: ............... WU true angle range is : 0.299785 Optimal function choices: ----------------------------------------------------- name ----------------------------------------------------- v_BaseLineSmooth (no other) v_GetPowerSpectrum 0.00019 0.00000 v_ChirpData 0.01489 0.00000 v_Transpose4 0.00445 0.00000 FPU opt folding 0.00289 0.00000 SETI@Home Informational message -9 result_overflow NOTE: The number of results detected exceeds the storage space allocated. Flopcounter: 27859105038.632969 Spike count: 23 Pulse count: 7 Triplet count: 0 Gaussian count: 0 called boinc_finish </stderr_txt> ]]> Validate state Valid Claimed credit 0.0917045547844134 Granted credit 77.1124957440435 The other two results for this task both had 1 spike, 1 pulse, 0 triples and 2 gaussian. So fair enough they should get credit, however results which are clearly invalid, even my own should clearly not be. ID: 845458 ·
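A result's stderr text like the extract above can be checked mechanically for the angle range, the -9 result_overflow marker, and the signal counts. The sketch below relies only on the line formats visible in that extract ("WU true angle range is :", "result_overflow", "Spike count:" and so on); the function name `parse_stderr` and the returned dictionary layout are hypothetical.

```python
import re


def parse_stderr(text):
    """Extract angle range, overflow flag, and signal counts from stderr text."""
    ar = re.search(r"WU true angle range is :\s*([0-9.]+)", text)
    # Signal counts appear as e.g. "Spike count: 23".
    counts = {
        kind.lower(): int(value)
        for kind, value in re.findall(
            r"(Spike|Pulse|Triplet|Gaussian) count:\s*(\d+)", text
        )
    }
    return {
        "angle_range": float(ar.group(1)) if ar else None,
        "overflow": "result_overflow" in text,
        "counts": counts,
    }
```

Run on the extract in the post above, this would report an angle range of 0.299785 with the overflow flag set, which is the kind of mid-range overflow the thread is trying to explain.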

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 845475 - Posted: 27 Dec 2008, 0:39:16 UTC - in response to Message 845458. Last modified: 27 Dec 2008, 0:50:38 UTC Just illustration to probability of 2 CUDA results validating against each other: http://setiathome.berkeley.edu/workunit.php?wuid=385313683 My GPU found 2 signals, 8800 gave overflowed result. Interesting, third host will be CUDA too?... Something wrong with that 3% estimation IMHO... And another 2-CUDA And another http://setiathome.berkeley.edu/workunit.php?wuid=385313673 One more http://setiathome.berkeley.edu/workunit.php?wuid=385313677 And more http://setiathome.berkeley.edu/workunit.php?wuid=385313649 All these WUs are 2-CUDA results comparison, and all failed because 8800GT returned overflow while my GPU returned some signals but non-overflow. 1) We can't count on 3% total CUDA share. It's non independent probability! Just recall - BOINC pairs similar to similar. So CUDA almost SHOULD be paired with another CUDA! It's VERY PROBABLE that CUDA result will validate against another CUDA result. So chances of database pollution MUCH HIGHER than 10e-3! 2) One GPU returned overflow while another returned non-overflowed result. What does it mean? At least some hardware dependence for this error! Maybe that 8800GT overheated? Maybe most of these overflows from hardware instability still? .... ID: 845475 ·

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 845482 - Posted: 27 Dec 2008, 0:57:59 UTC - in response to Message 845475. Last modified: 27 Dec 2008, 0:59:30 UTC And look at this. CUDA result: Spike count: 2 Pulse count: 1 Triplet count: 1 Gaussian count: 2 CPU result: Spike count: 2 Pulse count: 1 Triplet count: 2 Gaussian count: 2 Results differ by one triplet count. And CPU host was restarted twice, not CUDA (!) Restart can underestimate reported signals but it can't overestimate them. Fortunately, I have this task in storage so will do standalone testing for this WU. ADDON: Just keep in mind, my GPU highly underclocked. So hardware problems are very unlikely. If even such GPU will give errors from time to time, what about heavily OCed gamers GPUs... ID: 845482 ·

SATAN Send message Joined: 27 Aug 06 Posts: 835 Credit: 2,129,006 RAC: 0	Message 845490 - Posted: 27 Dec 2008, 1:34:07 UTC My 8800GT is at stock speed, so no OC. I also have the fans a little higher RPM when crunching to keep temps down. ID: 845490 ·

alpina Send message Joined: 18 Dec 08 Posts: 22 Credit: 32,011 RAC: 0	Message 845493 - Posted: 27 Dec 2008, 1:40:19 UTC - in response to Message 845490. My 8800GT is at stock speed, so no OC. I also have the fans a little higher RPM when crunching to keep temps down. And still, you seem to have a very high failure rate. How hot does your GPU get? Just to exclude the possibility that overheating is causing this. ID: 845493 ·

Josef W. Segur Volunteer developer Volunteer tester Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 845551 - Posted: 27 Dec 2008, 5:09:41 UTC - in response to Message 845458. Two of the main problems are shown in this extract from a work unit of mine. ... Spike count: 23 Pulse count: 7 Triplet count: 0 Gaussian count: 0 called boinc_finish </stderr_txt> ]]> Validate state Valid Claimed credit 0.0917045547844134 Granted credit 77.1124957440435 The other two results for this task both had 1 spike, 1 pulse, 0 triples and 2 gaussian. So fair enough they should get credit, however results which are clearly invalid, even my own should clearly not be. I'd say maybe 3, since you haven't volunteered to be a Beta tester. But since you are doing Beta testing, I think it is wise for the project to run a script to grant the credit; there are many who will only continue testing if they get credits for it. Note that on December 17th, the project received a revised set of the CUDA source code from an NVIDIA engineer. Those sources were used to produce the version 6.06 being tested at SETI Beta, but testing was obviously incomplete on the 6.05 build. I believe that's why 6.05 was released here, and the project is running a credit granting script to make it pay. Only cases where a dubious result is chosen as canonical are of any scientific concern, and the project design requires persistence to consider any potential signal worth a second look. Joe ID: 845551 ·

Riil Volunteer tester Send message Joined: 9 Mar 04 Posts: 9 Credit: 327,611 RAC: 9	Message 845611 - Posted: 27 Dec 2008, 9:24:49 UTC I've got 8800GT. It's about 56 C when busy. It gets only short WUs to crunch properly. Bigger WUs are crunched with errors :/ Time to quit with CUDA ??? ID: 845611 ·

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 845647 - Posted: 27 Dec 2008, 12:33:33 UTC - in response to Message 845490. My 8800GT is at stock speed, so no OC. I also have the fans a little higher RPM when crunching to keep temps down. "Stock" freq is just the freq setted by card manufacturer. No guaranties that very your chip can do permanent calculations on such frequency. In general, CUDA is some new mode for video cards, maybe they just not good enough to support this mode as it should be. Nobody gaming 24/7, right? And if after many hours of gaming someone discovers few invalid dots on the screen he will think that it's "pink elephants" from fatigue, not GPU failures ;) :))))) ID: 845647 ·

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 845648 - Posted: 27 Dec 2008, 12:37:21 UTC - in response to Message 845551. Note that on December 17th, the project received a revised set of the CUDA source code from an NVIDIA engineer. Those sources were used to produce the version 6.06 being tested at SETI Beta, but testing was obviously incomplete on the 6.05 build. I believe that's why 6.05 was released here, and the project is running a credit granting script to make it pay. Only cases where a dubious result is chosen as canonical are of any scientific concern, and the project design requires persistence to consider any potential signal worth a second look. Joe Joe, rev380 dated 17 December. My build based on this revision... And it manifests all these bugs too. So 6.06 doesn't fix this VLAR/overflow issues. ID: 845648 ·

Jord Volunteer tester Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3	Message 845657 - Posted: 27 Dec 2008, 13:13:48 UTC - in response to Message 845647. My 8800GT is at stock speed, so no OC. I also have the fans a little higher RPM when crunching to keep temps down. "Stock" freq is just the freq setted by card manufacturer. No guaranties that very your chip can do permanent calculations on such frequency. In case the GPU overheats to over a maximum set by the combination of the drivers and the VBIOS, it'll clock down automatically on clock speed and voltage. As long as you don't have another program constantly running that'll keep the clock speed and voltage up, that is. Just as your CPU needs adequate cooling, your GPU needs it as well. Especially when you use passive cooling (a heat sink, no fan). When there is a fan on your GPU, it needs to be able to get rid of the heated air and suck in cooler air. So any obstructions around the card are bad. Obstructions are: other cards, cables, RAM, the CPU, the case. As for gaming 24/7 as a comparison, even if you were throwing games at it 24/7, the GPU would not be under constant load. I have tested playing Need for Speed Most Wanted, Oblivion, Fallout3, Far Cry 2, Crysis, Crysis: Warhead and Red Alert 3 on my Sapphire HD3850 512MB, while I had GPU-Z on in the background -- it logging to a file on the hard drive. Checking the file I see that the GPU load never comes above 60%, while it's not continuously either. It happens in bursts, with enough pauses between to see the temperature go down. Maximum temperature was something in the region of 88C, on a 750MB map in Crysis. If you want to compare Seti CUDA to something, then compare it to a heavy 3D gaming benchmark. ID: 845657 ·

Matthias Lehmkuhl Volunteer tester Send message Joined: 5 Oct 99 Posts: 28 Credit: 10,832,348 RAC: 53	Message 845659 - Posted: 27 Dec 2008, 13:37:06 UTC I got also different results on one MB WU wuid=384773618 MB CUDA result (wingman) SETI@Home Informational message -9 result_overflow Flopcounter: 331517032.000000 Spike count: 30 Pulse count: 0 Triplet count: 0 Gaussian count: 0 called boinc_finish MB R-2.4V|xB|FFT:IPP_SSE2|Ben-Joe (my) Spikes Pulses Triplets Gaussians Flops 2 3 0 0 19390523747313 third result is sent out, but not finished/reported yet. To no CUDA computer. Matthias ID: 845659 ·

SATAN Send message Joined: 27 Aug 06 Posts: 835 Credit: 2,129,006 RAC: 0	Message 845663 - Posted: 27 Dec 2008, 13:48:11 UTC I ran the CUDA app through BootCamp. I turned all fans up to 1500RPM in order to keep things cool as it was the first time the card would do anything of note. Everything else was running normally, there was no lag with anything else. I didn't even alter the performance settings on the card, they remained on a mid point between quality and performance. Has anyone else tried to run CUDA through BootCamp? Or am I the first idiot to do so? ID: 845663 ·

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 845671 - Posted: 27 Dec 2008, 14:33:27 UTC - in response to Message 845663. Last modified: 27 Dec 2008, 14:34:46 UTC Has anyone else tried to run CUDA through BootCamp? Or am I the first idiot to do so? :) unknown app for me. I underclock and monitor GPU through Asus SmartDoctor utility supplied with videocard. Now I discovered that RivaTuner can underclock even further not to 450MHz but even to 300MHz of engine frequency. I try to slowdown GPU as possible to rule out any slight possibility of hardware failures. This card has nice big cooler, no passive cooling. ID: 845671 ·

Sutaru Tsureku Volunteer tester Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5	Message 845751 - Posted: 27 Dec 2008, 20:08:52 UTC - in response to Message 845409. Last modified: 27 Dec 2008, 20:13:38 UTC ... errors and -9 result_overflow's with the CUDA.. ... ... "Known" are VLAR or VHAR related. Look on "true angle range" output in result's stderr. VLAR are <= 0.05 VHAR are >= 2.5 ... Maybe the SETI@home-CUDA-app is more buggy.. Two -9 result_overflow-error with AR 0.415774: resultid=1091939868 resultid=1091939855 ID: 845751 ·
