Better error than wrong overflow for CUDA MB


Message boards : Number crunching : Better error than wrong overflow for CUDA MB

Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3397
Credit: 46,360,978
RAC: 9,939
Russia
Message 849602 - Posted: 5 Jan 2009, 9:20:49 UTC
Last modified: 5 Jan 2009, 9:37:15 UTC

There are already many examples where two invalid overflow results from CUDA MB outnumber a valid CPU app result and get validated, while the non-overflowed CPU result is declared invalid.

IMHO, to avoid confusing the validator with these invalid overflows from CUDA MB, it's better to return a "computational error" for an overflowed result than the invalid overflowed result itself. That way the task will be reassigned to another host, and there is a chance it will be processed correctly. This keeps the database from being polluted by invalid overflows.

How to achieve that (these recommendations are for CUDA MB only; an overflow from the CPU app is not an error):
1) Run BOINC from the task cache with the network connection disabled (to keep control over which results get uploaded).
2) Once every few days (depending on your cache settings), look in the project folder and delete all result files with a size of about 35 KB or more. These are overflows.
3) Then enable the network connection for uploading, reporting and refilling the cache.
4) Those overflows can't be uploaded (the result files are deleted), so BOINC will retry the upload again and again. When all other uploads/downloads have finished, select all these invalid result transfers and abort them.
5) The server will mark the corresponding results as errored and will not try to validate the wrong results.
6) This method has an additional advantage over aborting tasks before any computation: stderr will be reported (but not the result itself), so it is easy to establish the reason for the abort, since stderr will say "overflow".
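
For anyone who would rather not eyeball file sizes by hand, step 2 could be partly automated along these lines. This is only a rough sketch, not part of any BOINC tool: the function name, the 35,000-byte threshold and the file-name pattern (result files ending in `_<number>_<number>`, as the names in this thread do) are all assumptions. It only lists candidates and deletes nothing.

```python
import re
from pathlib import Path

# Assumption: "near 35kb (or greater)" is taken as >= 35,000 bytes.
OVERFLOW_MIN_BYTES = 35_000

# Assumption: result files end in _<result#>_<file#>, like the names in this
# thread; the workunit input file has no such suffix and is skipped.
RESULT_NAME = re.compile(r".+_\d+_\d+$")


def find_overflow_candidates(project_dir, min_bytes=OVERFLOW_MIN_BYTES):
    """Return result-like files in project_dir that are at least min_bytes big."""
    candidates = []
    for path in sorted(Path(project_dir).iterdir()):
        if not path.is_file():
            continue
        if not RESULT_NAME.match(path.name):
            continue
        if path.stat().st_size >= min_bytes:
            candidates.append(path)
    return candidates
```

Calling `find_overflow_candidates(...)` with the path of your own SETI project folder returns the files to inspect; check them (and the task's stderr) before deleting anything.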

Example:

stderr out <core_client_version>6.4.5</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 1 CUDA device(s):
Device 1 : GeForce 9600 GSO
totalGlobalMem = 402653184
sharedMemPerBlock = 16384
regsPerBlock = 8192
warpSize = 32
memPitch = 262144
maxThreadsPerBlock = 512
clockRate = 1700000
totalConstMem = 65536
major = 1
minor = 1
textureAlignment = 256
deviceOverlap = 0
multiProcessorCount = 12
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce 9600 GSO is okay
SETI@home using CUDA accelerated device GeForce 9600 GSO
Rise priority modification by Raistmer based on rev380 of SETI@home sources
Priority of worker thread rised successfully
Total GPU memory 402653184 free GPU memory 355926016
setiathome_enhanced 6.02 Visual Studio/Microsoft C++
libboinc: 6.3.22

Work Unit Info:
...............
WU true angle range is : 4.307220
Optimal function choices:
-----------------------------------------------------
name
-----------------------------------------------------
v_BaseLineSmooth (no other)
v_GetPowerSpectrum 0.00020 0.00000
v_ChirpData 0.01351 0.00000
v_Transpose4 0.00342 0.00000
FPU opt folding 0.00285 0.00000
SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected exceeds the storage space allocated.

Flopcounter: 1777564580.377074

Spike count: 0
Pulse count: 0
Triplet count: 31
Gaussian count: 0
called boinc_finish

</stderr_txt>
<message>
<file_xfer_error>
<file_name>03no08aa.12883.266870.6.11.168_2_0</file_name>
<error_code>-197</error_code>
<error_message>user requested transfer abort</error_message>
</file_xfer_error>

</message>
<upload_error>
<file_name>03no08aa.12883.266870.6.11.168_2_0</file_name>
<error_code>-197</error_code>
</upload_error>
]]>

Validate state Invalid
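
The stderr text above can also be checked mechanically. Here is a minimal sketch, assuming the overflow is always announced by the `result_overflow` marker and that the signal counts are printed as `<Kind> count: <n>` exactly as in this example; the function name is made up for illustration.

```python
import re

# Marker taken from the stderr example above.
OVERFLOW_MARKER = "result_overflow"

# Matches lines such as "Triplet count: 31".
COUNT_LINE = re.compile(
    r"^\s*(Spike|Pulse|Triplet|Gaussian) count:\s*(\d+)", re.MULTILINE
)


def summarize_stderr(stderr_text):
    """Report whether stderr mentions an overflow, plus the signal counts."""
    counts = {kind: int(value) for kind, value in COUNT_LINE.findall(stderr_text)}
    return {"overflow": OVERFLOW_MARKER in stderr_text, "counts": counts}
```

Fed the stderr shown above, this would flag the overflow and report the 31 triplets, which makes the reason for the aborted upload easy to spot.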

Profile Byron S Goodgame
Volunteer tester
Joined: 16 Jan 06
Posts: 1151
Credit: 3,936,993
RAC: 0
United States
Message 850758 - Posted: 8 Jan 2009, 6:32:09 UTC

Just bringing this thread back up closer to the top.

Profile Byron S Goodgame
Volunteer tester
Joined: 16 Jan 06
Posts: 1151
Credit: 3,936,993
RAC: 0
United States
Message 851462 - Posted: 9 Jan 2009, 23:41:31 UTC

Guess I've done something wrong, even though I think I've followed the instructions. I have this task which produced an overflow. I went into the seti folder and deleted the file 16no08ai.21753.20931.9.8.156_0_0 which was 40 KB. I still see 16no08ai.21753.20931.9.8.156 in the seti folder with a size of 367 KB. The file is no longer in the transfer window, and the Tasks window shows 16no08ai.21753.20931.9.8.156_0 with a status of Uploading. I've tried to abort the task now and get no change.

Sorry, this is the first one I've come across that I've had to do this with; I thought it seemed simple enough. I still have the file I deleted sitting in the recycle bin and the other one still in the Seti folder, so what do I do now and what did I do wrong?

Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3397
Credit: 46,360,978
RAC: 9,939
Russia
Message 851641 - Posted: 10 Jan 2009, 11:31:26 UTC - in response to Message 851462.
Last modified: 10 Jan 2009, 11:33:53 UTC

Guess I've done something wrong, even though I think I've followed the instructions. I have this task which produced an overflow. I went into the seti folder and deleted the file 16no08ai.21753.20931.9.8.156_0_0 which was 40 KB. I still see 16no08ai.21753.20931.9.8.156 in the seti folder with a size of 367 KB. The file is no longer in the transfer window, and the Tasks window shows 16no08ai.21753.20931.9.8.156_0 with a status of Uploading. I've tried to abort the task now and get no change.

Sorry, this is the first one I've come across that I've had to do this with; I thought it seemed simple enough. I still have the file I deleted sitting in the recycle bin and the other one still in the Seti folder, so what do I do now and what did I do wrong?


Yes, you can't abort a job that is marked "uploading"; that's OK.
BUT when you connect to the internet, this task's result can't be uploaded (there is no file). BOINC will complain with something like "upload handle missing" and set the result to retry the upload. Then you can abort this result's transfer (not the job itself) in the Transfers tab.
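
On the three similar names in the question: judging only from the names quoted in this thread, the result file appears to be the task name plus a file index, and the task name the workunit name plus a result number. On that reading the 367 KB file with no suffix would be the workunit input and not the result. A small sketch of that assumed pattern (the helper name is hypothetical):

```python
def split_upload_name(file_name):
    """Split '<workunit>_<result#>_<file#>' into (workunit, task, file_index).

    Assumption: the naming pattern is inferred from the file names quoted in
    this thread, not taken from BOINC documentation.
    """
    task, sep_file, file_index = file_name.rpartition("_")
    workunit, sep_task, result_index = task.rpartition("_")
    if not (sep_file and sep_task and file_index.isdigit() and result_index.isdigit()):
        raise ValueError(f"not a result file name: {file_name!r}")
    return workunit, task, int(file_index)
```

For the deleted file, `split_upload_name("16no08ai.21753.20931.9.8.156_0_0")` gives the workunit `16no08ai.21753.20931.9.8.156` and the task `16no08ai.21753.20931.9.8.156_0`, which matches the entries seen in the folder and in the Tasks window.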

(BTW, give my new build a try: it should produce fewer overflows, although the VLAR bugs and other issues still remain)

Profile Byron S Goodgame
Volunteer tester
Joined: 16 Jan 06
Posts: 1151
Credit: 3,936,993
RAC: 0
United States
Message 851642 - Posted: 10 Jan 2009, 11:43:26 UTC - in response to Message 851641.
Last modified: 10 Jan 2009, 11:52:27 UTC

Thanks Raistmer, I appreciate the info. I think the problem was that after I deleted the file I rebooted some time later, because my system had been running for a few days and I wanted to give it a better chance at completing more tasks without overflows. When I brought BOINC back up, the task was no longer in the transfer tab for me to abort. It didn't occur to me that would happen.

Already have the new app on and running. Put it in the middle of a task I was running and it did fine. So far so good, but I haven't done any of the tasks that usually give me overflows yet, so I don't have any info on that for ya yet. Hopefully there will be no news to report :)

Edit: Also BTW I've uploaded the entire app in a zip file onto the web and left a link for it in one of the threads here, for some folks that were having trouble getting email back from Lunatics, but still needed the app. I also have a link to your message telling about it so they can also visit Lunatics as well. If it's a problem that I've posted let me know, and I'll take it down.

Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3397
Credit: 46,360,978
RAC: 9,939
Russia
Message 851644 - Posted: 10 Jan 2009, 11:49:54 UTC - in response to Message 851642.

So far so good, but I haven't done any of the tasks that usually give me overflows yet, so I don't have any info on that for ya yet. Hopefully there will be no news to report :)

Actually, there will be; no miracles yet ;)

These WUs from my collection of test cases are now completed OK by CUDA MB:

03no08aa.11005.10706.3.11.75
03no08aa.5874.273823.14.11.250
03no08aa.5874.274232.14.11.89

and these still show some bugs (including a driver crash on VLAR):

03dc08ad.15767.890.15.8.213
03dc08ae.15056.890.16.8.52
03no08aa.5874.274232.14.11.84
15no08ac.10856.20256.16.8.135
23ap08aa.4504.481.10.11.187
23ap08aa.4504.890.10.11.63

(see the Lunatics thread to download these test cases)

Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3397
Credit: 46,360,978
RAC: 9,939
Russia
Message 851645 - Posted: 10 Jan 2009, 11:52:51 UTC - in response to Message 851642.
Last modified: 10 Jan 2009, 11:53:32 UTC


Edit: Also BTW I've uploaded the entire app in a zip file onto the web and left a link for it in one of the threads here, for some folks that were having trouble getting email back from Lunatics, but still needed the app. I also have a link to your message telling about it so they can also visit Lunatics as well. If it's a problem that I've posted let me know, and I'll take it down.


No problem with me, at least :) The final goal is to produce more valid results for SETI, and your link serves that aim very well, thanks :)


Copyright © 2014 University of California