Outnumbered by cuda errors?

Message boards : Number crunching : Outnumbered by cuda errors?

Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 845356 - Posted: 26 Dec 2008, 17:45:08 UTC - in response to Message 845352.  

How can I find if a work unit is VHAR or not?

If it is VLAR I look for <rsc_fpops_est>80360000000000.000000</rsc_fpops_est> in client_state.xml and cancel those immediately.

Is there a way to know if a work unit is VHAR beforehand?

Thanks

A VHAR task has an AR bigger than ~2.5. But don't abort all VHARs; some of them don't give an overflow. We still need to figure out the true VHAR-to-overflow relation for CUDA.
So just look at them closely.
BTW, here is a nice script for quickly finding VLAR/VHAR tasks in the BOINC cache
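
The linked script is not reproduced in this thread. As a rough illustration of the cache check described in the question above (looking up a given <rsc_fpops_est> value in client_state.xml), here is a minimal Python sketch. It assumes that client_state.xml parses as XML and that each task appears as a <workunit> element with <name> and <rsc_fpops_est> children; the function name and the default value (the figure quoted in the question) are only illustrative, and this is not Raistmer's script.

```python
import xml.etree.ElementTree as ET

# Estimate value quoted in the question above as marking VLAR tasks at the time.
# Treat it as an assumption; it is not verified here.
VLAR_FPOPS_EST = 80360000000000.0


def find_tasks_by_fpops(state_xml, fpops_est=VLAR_FPOPS_EST):
    """Return the names of <workunit> entries whose <rsc_fpops_est> equals fpops_est."""
    root = ET.fromstring(state_xml)
    names = []
    for wu in root.iter("workunit"):
        name = wu.findtext("name")
        est = wu.findtext("rsc_fpops_est")
        if name is None or est is None:
            continue  # skip incomplete entries
        try:
            value = float(est.strip())
        except ValueError:
            continue  # skip entries with an unparsable estimate
        if value == fpops_est:
            names.append(name.strip())
    return names
```

To try it, read the file from the BOINC data directory (its location depends on the installation) and pass the text to find_tasks_by_fpops. The sketch only lists task names; it does not abort anything.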
ID: 845356 · Report as offensive
maceda
Volunteer tester

Joined: 27 Sep 99
Posts: 3
Credit: 25,114,284
RAC: 0
Mexico
Message 845397 - Posted: 26 Dec 2008, 20:49:33 UTC - in response to Message 845356.  


A VHAR task has an AR bigger than ~2.5. But don't abort all VHARs; some of them don't give an overflow. We still need to figure out the true VHAR-to-overflow relation for CUDA.
So just look at them closely.
BTW, here is a nice script for quickly finding VLAR/VHAR tasks in the BOINC cache


OK. I'll leave VLAR for now, but I'm killing all VHAR work units I receive. By the way, someone at SETI might have noticed this, since today I have received only 3 VHAR work units versus dozens yesterday and the day before. It should be fairly trivial for them not to send VHAR work units to CUDA clients.

Thanks.

ID: 845397 · Report as offensive
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 845409 - Posted: 26 Dec 2008, 21:07:43 UTC - in response to Message 845322.  
Last modified: 26 Dec 2008, 21:14:52 UTC

A teammate gets errors and -9 result_overflows with CUDA..

hostid=4710849

Only known bugs?

The "known" ones are VLAR or VHAR related. Look at the "true angle range" output in the result's stderr.


VLAR are <= 0.05 ?
VHAR are >= 2.5 ?


BTW.
Why are we now testing a buggy app here in MAIN?
Isn't it possible to test again in BETA?
If two -9 result_overflow errors get compared against each other.. hey, and the WOW signal was in this WU.. nobody will ever know..
ID: 845409 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 845416 - Posted: 26 Dec 2008, 21:24:11 UTC - in response to Message 845409.  
Last modified: 26 Dec 2008, 21:25:28 UTC

VLAR are <= 0.05 ?
VHAR are >= 2.5 ?

Approx.
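
Those approximate limits can be turned into a quick check. The following is a minimal Python sketch, assuming the angle range is read from a stderr line of the form "WU true angle range is : 0.299785" (as printed in the result logs in this thread) and taking 0.05 and 2.5 as rough cut-offs only. The function names and the "mid" label are made up for illustration.

```python
import re

# Approximate limits quoted in this thread; they are not exact project constants.
VLAR_MAX = 0.05
VHAR_MIN = 2.5

# Matches the angle-range line printed in a result's stderr.
_AR_LINE = re.compile(r"WU true angle range is\s*:\s*([0-9]*\.?[0-9]+)")


def angle_range(stderr_text):
    """Return the true angle range found in stderr_text, or None if absent."""
    match = _AR_LINE.search(stderr_text)
    return float(match.group(1)) if match else None


def classify_ar(ar):
    """Label an angle range as 'VLAR', 'VHAR' or 'mid' using the rough limits."""
    if ar <= VLAR_MAX:
        return "VLAR"
    if ar >= VHAR_MIN:
        return "VHAR"
    return "mid"
```

For example, classify_ar(angle_range(text)) on a stderr containing "WU true angle range is : 0.299785" gives "mid", so an overflow at that angle range falls outside both "known" groups.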


Why are we now testing a buggy app here in MAIN?

Because beta is corrupted by adaptive replication mode....
And because we ALREADY have the CUDA release here, on main, with all its bugs on board.


Isn't it possible to test again in BETA?

Possible, but testing is hindered (again, adaptive replication mode).
Actually it's possible to use my mod on both main and beta; I will do this, for example (w/o AP beta testing though).


If two -9 result_overflow errors get compared against each other.. hey, and the WOW signal was in this WU.. nobody will ever know..

Yes! And it's the great evil :) But as I already said, we already have CUDA MB here with that bug inside it. So the sooner we eliminate it, or at least know which tasks we should avoid doing with CUDA MB, the sooner this dreadful possibility will be diminished.
ID: 845416 · Report as offensive
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 845445 - Posted: 26 Dec 2008, 22:36:41 UTC - in response to Message 845416.  
Last modified: 26 Dec 2008, 22:45:19 UTC


If two -9 result_overflow errors get compared against each other.. hey, and the WOW signal was in this WU.. nobody will ever know..

Yes! And it's the great evil :) But as I already said, we already have CUDA MB here with that bug inside it. So the sooner we eliminate it, or at least know which tasks we should avoid doing with CUDA MB, the sooner this dreadful possibility will be diminished.


But to eliminate this possible worst case.. it would be better to 'call back' the SETI@home CUDA app here in MAIN until it's bug-free.


BTW.
Is your app less buggy than the official app?
ID: 845445 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 845450 - Posted: 26 Dec 2008, 23:00:48 UTC - in response to Message 845445.  

It's just as buggy as the stock app :)
But it has logging ability now and allows full use of the CPU+GPU combo.
ID: 845450 · Report as offensive
Profile SATAN
Joined: 27 Aug 06
Posts: 835
Credit: 2,129,006
RAC: 0
United Kingdom
Message 845458 - Posted: 26 Dec 2008, 23:28:19 UTC

Two of the main problems are shown in this extract from a work unit of mine.

CPU time 15.39063
stderr out
<core_client_version>6.5.0</core_client_version>
<![CDATA[
<stderr_txt>
cudaAcc_initializeDevice: Found 1 CUDA device(s):
Device 1 : GeForce 8800 GT
cudaAcc_initializeDevice is determiming what CUDA device to use...
user specified SETI to use CUDA device 1: GeForce 8800 GT
SETI@home using CUDA accelerated device GeForce 8800 GT
setiathome_enhanced 6.02 Visual Studio/Microsoft C++
libboinc: 6.3.22

Work Unit Info:
...............
WU true angle range is : 0.299785
Optimal function choices:
-----------------------------------------------------
name
-----------------------------------------------------
v_BaseLineSmooth (no other)
v_GetPowerSpectrum 0.00019 0.00000
v_ChirpData 0.01489 0.00000
v_Transpose4 0.00445 0.00000
FPU opt folding 0.00289 0.00000
SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected exceeds the storage space allocated.

Flopcounter: 27859105038.632969

Spike count: 23
Pulse count: 7
Triplet count: 0
Gaussian count: 0
called boinc_finish

</stderr_txt>
]]>
Validate state Valid
Claimed credit 0.0917045547844134
Granted credit 77.1124957440435


The other two results for this task both had 1 spike, 1 pulse, 0 triplets and 2 gaussians. So fair enough, they should get credit; however, results which are clearly invalid, even my own, clearly should not.


ID: 845458 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 845475 - Posted: 27 Dec 2008, 0:39:16 UTC - in response to Message 845458.  
Last modified: 27 Dec 2008, 0:50:38 UTC

Just an illustration of the probability of 2 CUDA results validating against each other:
http://setiathome.berkeley.edu/workunit.php?wuid=385313683
My GPU found 2 signals, the 8800 gave an overflowed result. Interesting, will the third host be CUDA too?...
Something is wrong with that 3% estimation IMHO...
And another 2-CUDA
And another http://setiathome.berkeley.edu/workunit.php?wuid=385313673
One more http://setiathome.berkeley.edu/workunit.php?wuid=385313677
And more http://setiathome.berkeley.edu/workunit.php?wuid=385313649

All these WUs are 2-CUDA result comparisons, and all failed because the 8800GT returned an overflow while my GPU returned some signals but no overflow.

1) We can't count on the 3% total CUDA share. It's not an independent probability! Just recall: BOINC pairs similar with similar. So a CUDA host almost SHOULD be paired with another CUDA host! It's VERY PROBABLE that a CUDA result will be validated against another CUDA result. So the chances of database pollution are MUCH HIGHER than 10e-3!

2) One GPU returned an overflow while another returned a non-overflowed result. What does it mean? At least some hardware dependence for this error! Maybe that 8800GT overheated? Maybe most of these overflows still come from hardware instability? ....
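
The arithmetic behind point 1 can be written out as a small Python sketch. It assumes the figure of roughly 10^-3 comes from treating both hosts of a pair as independent draws from a 3% CUDA share (0.03 * 0.03 = 0.0009); that reading is a guess, not something stated in the thread. The function name and the 0.5 used below are purely illustrative, not measured values.

```python
def p_both_cuda(cuda_share, p_wingman_cuda=None):
    """Probability that both results of a pair come from CUDA hosts.

    cuda_share      -- fraction of results returned by CUDA hosts (0.03 in the thread)
    p_wingman_cuda  -- probability that the wingman is CUDA given that the first host is;
                       defaults to cuda_share, i.e. independent pairing
    """
    if p_wingman_cuda is None:
        p_wingman_cuda = cuda_share
    return cuda_share * p_wingman_cuda


# Independent pairing: about 0.0009, i.e. roughly 10^-3.
independent = p_both_cuda(0.03)

# If the scheduler tends to pair similar hosts, the conditional probability rises.
# With a made-up value of 0.5 the pair probability becomes 0.015.
correlated = p_both_cuda(0.03, 0.5)
```

Under these made-up numbers the correlated case is more than ten times the independent one, which is the sense in which the estimate is too optimistic if similar hosts really are paired.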
ID: 845475 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 845482 - Posted: 27 Dec 2008, 0:57:59 UTC - in response to Message 845475.  
Last modified: 27 Dec 2008, 0:59:30 UTC

And look at this.
CUDA result:
Spike count: 2
Pulse count: 1
Triplet count: 1
Gaussian count: 2

CPU result:
Spike count: 2
Pulse count: 1
Triplet count: 2
Gaussian count: 2

The results differ by one triplet count.
And it was the CPU host that was restarted twice, not the CUDA one (!) A restart can underestimate the reported signals, but it can't overestimate them.

Fortunately, I have this task in storage, so I will do standalone testing for this WU.

ADDON: Just keep in mind, my GPU is highly underclocked, so hardware problems are very unlikely. If even such a GPU gives errors from time to time, what about heavily OCed gamers' GPUs...
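
Comparing the signal counts of two results by hand, as done above, can also be scripted. This is a minimal Python sketch that assumes the stderr text contains lines like "Spike count: 2" and, for overflowed results, the string "result_overflow", as in the logs pasted in this thread. It only compares the summary counts; it is not how the project's validator works, and the function names are made up for illustration.

```python
import re

SIGNAL_KINDS = ("Spike", "Pulse", "Triplet", "Gaussian")

# Matches summary lines such as "Spike count: 23" at the start of a line.
_COUNT_LINE = re.compile(r"^(Spike|Pulse|Triplet|Gaussian) count:\s*(\d+)", re.MULTILINE)


def parse_counts(stderr_text):
    """Return the signal counts and the overflow flag found in a result's stderr."""
    counts = {kind: int(number) for kind, number in _COUNT_LINE.findall(stderr_text)}
    counts["overflow"] = "result_overflow" in stderr_text
    return counts


def diff_counts(first, second):
    """Return {kind: (first_count, second_count)} for every signal kind that differs."""
    return {
        kind: (first.get(kind), second.get(kind))
        for kind in SIGNAL_KINDS
        if first.get(kind) != second.get(kind)
    }
```

Feeding it the CUDA and CPU counts listed above yields {"Triplet": (1, 2)}, matching the one-triplet difference noted in the post.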
ID: 845482 · Report as offensive
Profile SATAN
Joined: 27 Aug 06
Posts: 835
Credit: 2,129,006
RAC: 0
United Kingdom
Message 845490 - Posted: 27 Dec 2008, 1:34:07 UTC

My 8800GT is at stock speed, so no OC. I also have the fans a little higher RPM when crunching to keep temps down.
ID: 845490 · Report as offensive
alpina

Joined: 18 Dec 08
Posts: 22
Credit: 32,011
RAC: 0
Belgium
Message 845493 - Posted: 27 Dec 2008, 1:40:19 UTC - in response to Message 845490.  

My 8800GT is at stock speed, so no OC. I also have the fans a little higher RPM when crunching to keep temps down.

And still, you seem to have a very high failure rate. How hot does your GPU get? Just to exclude the possibility that overheating is causing this.
ID: 845493 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 845551 - Posted: 27 Dec 2008, 5:09:41 UTC - in response to Message 845458.  

Two of the main problems are shown in this extract from a work unit of mine.
...
Spike count: 23
Pulse count: 7
Triplet count: 0
Gaussian count: 0
called boinc_finish

</stderr_txt>
]]>
Validate state Valid
Claimed credit 0.0917045547844134
Granted credit 77.1124957440435

The other two results for this task both had 1 spike, 1 pulse, 0 triplets and 2 gaussians. So fair enough, they should get credit; however, results which are clearly invalid, even my own, clearly should not.

I'd say maybe 3, since you haven't volunteered to be a Beta tester. But since you are doing Beta testing, I think it is wise for the project to run a script to grant the credit; there are many who will only continue testing if they get credits for it.

Note that on December 17th, the project received a revised set of the CUDA source code from an NVIDIA engineer. Those sources were used to produce the version 6.06 being tested at SETI Beta, but testing was obviously incomplete on the 6.05 build. I believe that's why 6.05 was released here, and the project is running a credit granting script to make it pay. Only cases where a dubious result is chosen as canonical are of any scientific concern, and the project design requires persistence to consider any potential signal worth a second look.
                                                             Joe
ID: 845551 · Report as offensive
Riil
Volunteer tester

Joined: 9 Mar 04
Posts: 9
Credit: 327,611
RAC: 9
Poland
Message 845611 - Posted: 27 Dec 2008, 9:24:49 UTC

I've got an 8800GT. It's at about 56 C when busy. It only manages to crunch short WUs properly; bigger WUs are crunched with errors :/ Time to quit CUDA ???
ID: 845611 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 845647 - Posted: 27 Dec 2008, 12:33:33 UTC - in response to Message 845490.  

My 8800GT is at stock speed, so no OC. I also have the fans a little higher RPM when crunching to keep temps down.


"Stock" freq is just the freq set by the card manufacturer. There is no guarantee that your particular chip can do sustained calculations at that frequency.
In general, CUDA is a new mode for video cards; maybe they are just not good enough to support this mode as it should be.
Nobody games 24/7, right? And if after many hours of gaming someone discovers a few invalid dots on the screen, he will think it's "pink elephants" from fatigue, not GPU failures ;) :)))))
ID: 845647 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 845648 - Posted: 27 Dec 2008, 12:37:21 UTC - in response to Message 845551.  


Note that on December 17th, the project received a revised set of the CUDA source code from an NVIDIA engineer. Those sources were used to produce the version 6.06 being tested at SETI Beta, but testing was obviously incomplete on the 6.05 build. I believe that's why 6.05 was released here, and the project is running a credit granting script to make it pay. Only cases where a dubious result is chosen as canonical are of any scientific concern, and the project design requires persistence to consider any potential signal worth a second look.
                                                             Joe


Joe, rev380 is dated 17 December. My build is based on this revision... and it manifests all these bugs too. So 6.06 doesn't fix these VLAR/overflow issues.
ID: 845648 · Report as offensive
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 845657 - Posted: 27 Dec 2008, 13:13:48 UTC - in response to Message 845647.  

My 8800GT is at stock speed, so no OC. I also have the fans a little higher RPM when crunching to keep temps down.


"Stock" freq is just the freq set by the card manufacturer. There is no guarantee that your particular chip can do sustained calculations at that frequency.

If the GPU heats up beyond a maximum set by the combination of the drivers and the VBIOS, it will automatically reduce its clock speed and voltage. As long as you don't have another program constantly running that keeps the clock speed and voltage up, that is.

Just as your CPU needs adequate cooling, your GPU needs it as well. Especially when you use passive cooling (a heat sink, no fan). When there is a fan on your GPU, it needs to be able to get rid of the heated air and suck in cooler air. So any obstructions around the card are bad. Obstructions are: other cards, cables, RAM, the CPU, the case.

As for gaming 24/7 as a comparison, even if you were throwing games at it 24/7, the GPU would not be under constant load. I have tested playing Need for Speed Most Wanted, Oblivion, Fallout3, Far Cry 2, Crysis, Crysis: Warhead and Red Alert 3 on my Sapphire HD3850 512MB while GPU-Z ran in the background, logging to a file on the hard drive. Checking the file, I see that the GPU load never goes above 60%, and it isn't continuous either. It happens in bursts, with enough pauses in between to see the temperature go down. Maximum temperature was somewhere in the region of 88C, on a 750MB map in Crysis.

If you want to compare Seti CUDA to something, then compare it to a heavy 3D gaming benchmark.
ID: 845657 · Report as offensive
Matthias Lehmkuhl
Volunteer tester

Joined: 5 Oct 99
Posts: 28
Credit: 10,832,348
RAC: 53
Germany
Message 845659 - Posted: 27 Dec 2008, 13:37:06 UTC

I also got different results on one MB WU
wuid=384773618

MB CUDA result (wingman)
SETI@Home Informational message -9 result_overflow
Flopcounter: 331517032.000000

Spike count: 30
Pulse count: 0
Triplet count: 0
Gaussian count: 0
called boinc_finish


MB R-2.4V|xB|FFT:IPP_SSE2|Ben-Joe (my)

Spikes Pulses Triplets Gaussians Flops
2 3 0 0 19390523747313

The third result has been sent out, but is not finished/reported yet. It went to a non-CUDA computer.

Matthias

ID: 845659 · Report as offensive
Profile SATAN
Joined: 27 Aug 06
Posts: 835
Credit: 2,129,006
RAC: 0
United Kingdom
Message 845663 - Posted: 27 Dec 2008, 13:48:11 UTC

I ran the CUDA app through BootCamp. I turned all fans up to 1500 RPM in order to keep things cool, as it was the first time the card would do anything of note. Everything else was running normally; there was no lag with anything else. I didn't even alter the performance settings on the card; they remained at a midpoint between quality and performance.

Has anyone else tried to run CUDA through BootCamp? Or am I the first idiot to do so?



ID: 845663 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 845671 - Posted: 27 Dec 2008, 14:33:27 UTC - in response to Message 845663.  
Last modified: 27 Dec 2008, 14:34:46 UTC


Has anyone else tried to run CUDA through BootCamp? Or am I the first idiot to do so?


:) Unknown app to me. I underclock and monitor the GPU through the Asus SmartDoctor utility supplied with the video card. Now I have discovered that RivaTuner can underclock even further, not just to 450 MHz but down to 300 MHz engine frequency. I try to slow the GPU down as much as possible to rule out any slight possibility of hardware failures. This card has a nice big cooler, no passive cooling.
ID: 845671 · Report as offensive
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 845751 - Posted: 27 Dec 2008, 20:08:52 UTC - in response to Message 845409.  
Last modified: 27 Dec 2008, 20:13:38 UTC

... errors and -9 result_overflow's with the CUDA..

...

...

The "known" ones are VLAR or VHAR related. Look at the "true angle range" output in the result's stderr.


VLAR are <= 0.05
VHAR are >= 2.5

...


Maybe the SETI@home CUDA app is more buggy than that..

Two -9 result_overflow errors with AR 0.415774:

resultid=1091939868
resultid=1091939855
ID: 845751 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.