Outnumbered by cuda errors?
Author | Message |
---|---|
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
How can I find if a work unit is VHAR or not? AR is bigger than ~2.5. But don't abort all VHARs. Some of them don't give overflow. We still need to figure out the true VHAR - overflow relation for CUDA. So just look at them closely. BTW, here is a nice script for quickly finding VLAR/VHAR tasks in the BOINC cache |
maceda Joined: 27 Sep 99 Posts: 3 Credit: 25,114,284 RAC: 0 |
OK. I'll leave VLAR for now, but I'm killing all VHAR work units I receive. By the way, someone at Seti might have noticed this since today I have only received 3 VHAR work units vs. dozens for yesterday and the day before. It should be fairly trivial for them not to send VHAR work units to cuda clients. Thanks. |
Sutaru Tsureku Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
A teammate gets errors and -9 result_overflows with CUDA.. VLAR are <= 0.05 ? VHAR are >= 2.5 ? BTW, why are we now testing a buggy app here on MAIN? Isn't it possible to test it again on BETA? If two -9 result_overflow errors are compared against each other.. hey, and the WOW signal was in this WU.. nobody will ever know.. |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
VLAR are <= 0.05 ? Approx.
Because beta is corrupted by adaptive replication mode.... And because we ALREADY have the CUDA release here, on main. With all its bugs on board.
Possible, but testing is hindered (again, adaptive replication mode). Actually it's possible to use my mod on both main and beta; I will do this, for example (w/o AP beta testing though).
Yes! And it's the great evil :) But as I already said, we already have CUDA MB here with that bug inside it. So the sooner we eliminate it, or at least know which tasks we should avoid doing with CUDA MB, the sooner this dreadful possibility will be diminished. |
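For anyone sorting their cache by hand, the angle range cutoffs mentioned in the posts above can be sketched as a small Python helper. This is only an illustration: the two thresholds are the approximate values quoted in the thread (VLAR <= 0.05, VHAR >= 2.5), and the function and constant names are made up for this example, not taken from any BOINC or SETI@home tool.

```python
# Approximate cutoffs quoted in the thread; both are described there as rough values.
VLAR_MAX_AR = 0.05
VHAR_MIN_AR = 2.5


def classify_ar(angle_range: float) -> str:
    """Return 'VLAR', 'VHAR' or 'mid' for a work unit's true angle range."""
    if angle_range <= VLAR_MAX_AR:
        return "VLAR"
    if angle_range >= VHAR_MIN_AR:
        return "VHAR"
    return "mid"


if __name__ == "__main__":
    # Sample angle ranges; 0.299785 and 0.415774 appear in posts in this thread.
    for ar in (0.02, 0.299785, 0.415774, 2.7):
        print(ar, classify_ar(ar))
```

Since not every VHAR task overflows, a label like this is better used for picking tasks to watch than for aborting them automatically.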
Sutaru Tsureku Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
But to eliminate this possible worst case.. it would be better to 'call back' the SETI@home CUDA app here on MAIN until it's bug-free. BTW, is your app less buggy than the official app? |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
It's just as buggy as the stock app :) But it has logging ability now and allows full use of the CPU+GPU combo. |
SATAN Joined: 27 Aug 06 Posts: 835 Credit: 2,129,006 RAC: 0 |
Two of the main problems are shown in this extract from a work unit of mine.
CPU time 15.39063
stderr out
<core_client_version>6.5.0</core_client_version>
<![CDATA[
<stderr_txt>
cudaAcc_initializeDevice: Found 1 CUDA device(s):
Device 1 : GeForce 8800 GT
cudaAcc_initializeDevice is determiming what CUDA device to use...
user specified SETI to use CUDA device 1: GeForce 8800 GT
SETI@home using CUDA accelerated device GeForce 8800 GT
setiathome_enhanced 6.02 Visual Studio/Microsoft C++
libboinc: 6.3.22
Work Unit Info:
...............
WU true angle range is : 0.299785
Optimal function choices:
----------------------------------------------------- name -----------------------------------------------------
v_BaseLineSmooth (no other)
v_GetPowerSpectrum 0.00019 0.00000
v_ChirpData 0.01489 0.00000
v_Transpose4 0.00445 0.00000
FPU opt folding 0.00289 0.00000
SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected exceeds the storage space allocated.
Flopcounter: 27859105038.632969
Spike count: 23
Pulse count: 7
Triplet count: 0
Gaussian count: 0
called boinc_finish
</stderr_txt>
]]>
Validate state Valid
Claimed credit 0.0917045547844134
Granted credit 77.1124957440435
The other two results for this task both had 1 spike, 1 pulse, 0 triplets and 2 Gaussians. So fair enough, they should get credit; however, results which are clearly invalid, even my own, clearly should not. |
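A rough Python sketch of how a stderr extract like the one above could be scanned for the interesting fields (angle range, the -9 result_overflow flag and the four signal counts). The field labels are copied from the posted output; the function name, the returned keys and the trimmed sample text are assumptions made for this example, and other app versions may format their output differently.

```python
import re

# Trimmed sample using the same labels as the posted stderr extract.
SAMPLE_STDERR = """\
WU true angle range is :  0.299785
SETI@Home Informational message -9 result_overflow
Flopcounter: 27859105038.632969
Spike count:    23
Pulse count:    7
Triplet count:  0
Gaussian count: 0
"""


def summarize_stderr(text: str) -> dict:
    """Pull the angle range, overflow flag and signal counts out of stderr text."""
    summary = {"overflow": "result_overflow" in text}

    ar = re.search(r"WU true angle range is\s*:\s*([0-9.]+)", text)
    summary["angle_range"] = float(ar.group(1)) if ar else None

    for kind in ("Spike", "Pulse", "Triplet", "Gaussian"):
        match = re.search(rf"{kind} count:\s*(\d+)", text)
        summary[kind.lower()] = int(match.group(1)) if match else None
    return summary


if __name__ == "__main__":
    print(summarize_stderr(SAMPLE_STDERR))
```

Fields that are missing from the text come back as None rather than raising, so a partial log still yields a usable summary.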
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Just an illustration of the probability of 2 CUDA results validating against each other: http://setiathome.berkeley.edu/workunit.php?wuid=385313683 My GPU found 2 signals, the 8800 gave an overflowed result. Interesting, will the third host be CUDA too?... Something is wrong with that 3% estimation IMHO... And another 2-CUDA And another http://setiathome.berkeley.edu/workunit.php?wuid=385313673 One more http://setiathome.berkeley.edu/workunit.php?wuid=385313677 And more http://setiathome.berkeley.edu/workunit.php?wuid=385313649 All these WUs are 2-CUDA result comparisons, and all failed because the 8800GT returned overflow while my GPU returned some signals but no overflow. 1) We can't count on the 3% total CUDA share. It's not an independent probability! Just recall - BOINC pairs similar with similar. So CUDA almost SHOULD be paired with another CUDA! It's VERY PROBABLE that a CUDA result will validate against another CUDA result. So the chances of database pollution are MUCH HIGHER than 10e-3! 2) One GPU returned overflow while another returned a non-overflowed result. What does it mean? At least some hardware dependence for this error! Maybe that 8800GT overheated? Maybe most of these overflows are still from hardware instability? .... |
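The independence point in 1) can be put as a line of arithmetic. If hosts were paired at random, a 3% CUDA share would give 0.03 x 0.03 = 0.0009 for both results of a pair being CUDA. If similar hosts tend to be paired together, the second factor becomes a conditional probability that can be far larger than 3%. The Python sketch below just shows that calculation; the 0.5 used for the correlated case is a made-up number for illustration only, since the thread gives no measured value, and the function name is hypothetical.

```python
def both_cuda_probability(cuda_share: float, wingman_cuda_given_cuda: float) -> float:
    """P(both results are CUDA) = P(first is CUDA) * P(wingman is CUDA | first is CUDA)."""
    return cuda_share * wingman_cuda_given_cuda


if __name__ == "__main__":
    share = 0.03  # the 3% CUDA share discussed in the thread

    # Random pairing: the wingman is CUDA with the same 3% chance.
    independent = both_cuda_probability(share, share)

    # Similar-with-similar pairing: 0.5 is purely illustrative, not a measurement.
    correlated = both_cuda_probability(share, 0.5)

    print(f"independent pairing: {independent:.4f}")
    print(f"correlated pairing:  {correlated:.4f}")
    print(f"ratio: {correlated / independent:.1f}x")
```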
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
And look at this. CUDA result: Spike count: 2 Pulse count: 1 Triplet count: 1 Gaussian count: 2 CPU result: Spike count: 2 Pulse count: 1 Triplet count: 2 Gaussian count: 2 The results differ by one triplet. And it was the CPU host that was restarted twice, not the CUDA one (!) A restart can underestimate reported signals but it can't overestimate them. Fortunately, I have this task in storage so I will do standalone testing for this WU. ADDON: Just keep in mind, my GPU is highly underclocked. So hardware problems are very unlikely. If even such a GPU gives errors from time to time, what about heavily OCed gamers' GPUs... |
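A quick way to spot this kind of mismatch is to compare the reported counts field by field. The Python sketch below does only that, using the two sets of counts from the post above; the function name and dictionary keys are made up for this example. It compares counts, as the posts here do, and says nothing about how the project's validator compares the signals themselves.

```python
SIGNAL_KINDS = ("spike", "pulse", "triplet", "gaussian")


def count_differences(a: dict, b: dict) -> dict:
    """Return {kind: (count_in_a, count_in_b)} for every signal kind that differs."""
    return {kind: (a[kind], b[kind]) for kind in SIGNAL_KINDS if a[kind] != b[kind]}


# Counts as reported in the post above.
cuda_result = {"spike": 2, "pulse": 1, "triplet": 1, "gaussian": 2}
cpu_result = {"spike": 2, "pulse": 1, "triplet": 2, "gaussian": 2}

if __name__ == "__main__":
    print(count_differences(cuda_result, cpu_result))  # {'triplet': (1, 2)}
```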
SATAN Joined: 27 Aug 06 Posts: 835 Credit: 2,129,006 RAC: 0 |
My 8800GT is at stock speed, so no OC. I also have the fans a little higher RPM when crunching to keep temps down. |
alpina Joined: 18 Dec 08 Posts: 22 Credit: 32,011 RAC: 0 |
My 8800GT is at stock speed, so no OC. I also have the fans a little higher RPM when crunching to keep temps down. And still, you seem to have a very high failure rate. How hot does your GPU get? Just to exclude the possibility that overheating is causing this. |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Two of the main problems are shown in this extract from a work unit of mine. I'd say maybe 3, since you haven't volunteered to be a Beta tester. But since you are doing Beta testing, I think it is wise for the project to run a script to grant the credit; there are many who will only continue testing if they get credits for it. Note that on December 17th, the project received a revised set of the CUDA source code from an NVIDIA engineer. Those sources were used to produce the version 6.06 being tested at SETI Beta, but testing was obviously incomplete on the 6.05 build. I believe that's why 6.05 was released here, and the project is running a credit granting script to make it pay. Only cases where a dubious result is chosen as canonical are of any scientific concern, and the project design requires persistence to consider any potential signal worth a second look. Joe |
Riil Joined: 9 Mar 04 Posts: 9 Credit: 327,611 RAC: 9 |
I've got an 8800GT. It's about 56 C when busy. It only crunches short WUs properly. Bigger WUs are crunched with errors :/ Time to quit CUDA ??? |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
My 8800GT is at stock speed, so no OC. I also have the fans a little higher RPM when crunching to keep temps down. "Stock" freq is just the freq set by the card manufacturer. No guarantees that your particular chip can do continuous calculations at that frequency. In general, CUDA is a new mode for video cards; maybe they're just not good enough to support this mode as they should. Nobody games 24/7, right? And if after many hours of gaming someone discovers a few invalid dots on the screen, he will think it's "pink elephants" from fatigue, not GPU failures ;) :))))) |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Joe, rev380 is dated 17 December. My build is based on this revision... And it manifests all these bugs too. So 6.06 doesn't fix these VLAR/overflow issues. |
Jord Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
My 8800GT is at stock speed, so no OC. I also have the fans a little higher RPM when crunching to keep temps down. In case the GPU overheats to over a maximum set by the combination of the drivers and the VBIOS, it'll clock down automatically on clock speed and voltage. As long as you don't have another program constantly running that'll keep the clock speed and voltage up, that is. Just as your CPU needs adequate cooling, your GPU needs it as well. Especially when you use passive cooling (a heat sink, no fan). When there is a fan on your GPU, it needs to be able to get rid of the heated air and suck in cooler air. So any obstructions around the card are bad. Obstructions are: other cards, cables, RAM, the CPU, the case. As for gaming 24/7 as a comparison, even if you were throwing games at it 24/7, the GPU would not be under constant load. I have tested playing Need for Speed Most Wanted, Oblivion, Fallout 3, Far Cry 2, Crysis, Crysis: Warhead and Red Alert 3 on my Sapphire HD3850 512MB, while I had GPU-Z on in the background, logging to a file on the hard drive. Checking the file I see that the GPU load never goes above 60%, and it's not continuous either. It happens in bursts, with enough pauses in between to see the temperature go down. Maximum temperature was something in the region of 88C, on a 750MB map in Crysis. If you want to compare Seti CUDA to something, then compare it to a heavy 3D gaming benchmark. |
Matthias Lehmkuhl Joined: 5 Oct 99 Posts: 28 Credit: 10,832,348 RAC: 53 |
I also got different results on one MB WU, wuid=384773618 MB CUDA result (wingman) SETI@Home Informational message -9 result_overflow Flopcounter: 331517032.000000 Spike count: 30 Pulse count: 0 Triplet count: 0 Gaussian count: 0 called boinc_finish MB R-2.4V|xB|FFT:IPP_SSE2|Ben-Joe (my) Spikes Pulses Triplets Gaussians Flops 2 3 0 0 19390523747313 The third result has been sent out, but is not finished/reported yet. To a non-CUDA computer. Matthias |
SATAN Joined: 27 Aug 06 Posts: 835 Credit: 2,129,006 RAC: 0 |
I ran the CUDA app through BootCamp. I turned all fans up to 1500RPM in order to keep things cool, as it was the first time the card would do anything of note. Everything else was running normally, there was no lag with anything else. I didn't even alter the performance settings on the card, they remained on a mid point between quality and performance. Has anyone else tried to run CUDA through BootCamp? Or am I the first idiot to do so? |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
:) unknown app for me. I underclock and monitor the GPU through the Asus SmartDoctor utility supplied with the video card. Now I've discovered that RivaTuner can underclock even further, not just to 450MHz but even to 300MHz engine frequency. I try to slow the GPU down as much as possible to rule out any slight possibility of hardware failures. This card has a nice big cooler, no passive cooling. |
Sutaru Tsureku Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
... errors and -9 result_overflow's with the CUDA.. Maybe the SETI@home-CUDA-app is more buggy.. Two -9 result_overflow-error with AR 0.415774: resultid=1091939868 resultid=1091939855 |