Better error than wrong overflow for CUDA MB
Author | Message |
---|---|
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
There are already many cases where two invalid CUDA MB overflow results outnumber a valid CPU app result and receive validation, while the non-overflowed CPU result is declared invalid. IMHO, to avoid confusing the validator with these invalid overflows from CUDA MB, it's better to return a "computational error" for an overflowed result than to return the invalid overflowed result itself. That way the task will be reassigned to another host, and there is a chance it will be processed correctly. This keeps the database from being polluted by invalid overflows.
How to achieve that (these recommendations are for CUDA MB only; an overflow from the CPU app is not an error):
1) Run BOINC from the task cache with the network connection disabled (to keep control over which results get uploaded).
2) Once every few days (depending on your cache settings), look in the project folder and delete all result files that are about 35 KB or larger. These are overflows.
3) Then enable the network connection for uploading, reporting and refilling the cache.
4) Those overflows can't be uploaded (the result files were deleted), so BOINC will retry the upload again and again. When all other uploads/downloads have finished, select all of these invalid result transfers and abort them.
5) The server will mark the corresponding results as errored and will not try to validate the wrong results.
6) This method has an additional advantage over aborting tasks before any computation: stderr will still be reported (but not the result itself). So it is easy to establish the reason for the task abortion: stderr will say "overflow".
Example: stderr out
<core_client_version>6.4.5</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 1 CUDA device(s):
Device 1 : GeForce 9600 GSO
totalGlobalMem = 402653184
sharedMemPerBlock = 16384
regsPerBlock = 8192
warpSize = 32
memPitch = 262144
maxThreadsPerBlock = 512
clockRate = 1700000
totalConstMem = 65536
major = 1
minor = 1
textureAlignment = 256
deviceOverlap = 0
multiProcessorCount = 12
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce 9600 GSO is okay
SETI@home using CUDA accelerated device GeForce 9600 GSO
Rise priority modification by Raistmer based on rev380 of SETI@home sources
Priority of worker thread rised successfully
Total GPU memory 402653184 free GPU memory 355926016
setiathome_enhanced 6.02 Visual Studio/Microsoft C++
libboinc: 6.3.22
Work Unit Info:
...............
WU true angle range is : 4.307220
Optimal function choices:
-----------------------------------------------------
name
-----------------------------------------------------
v_BaseLineSmooth (no other)
v_GetPowerSpectrum 0.00020 0.00000
v_ChirpData 0.01351 0.00000
v_Transpose4 0.00342 0.00000
FPU opt folding 0.00285 0.00000
SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected exceeds the storage space allocated.
Flopcounter: 1777564580.377074
Spike count: 0
Pulse count: 0
Triplet count: 31
Gaussian count: 0
called boinc_finish
</stderr_txt>
<message>
<file_xfer_error>
<file_name>03no08aa.12883.266870.6.11.168_2_0</file_name>
<error_code>-197</error_code>
<error_message>user requested transfer abort</error_message>
</file_xfer_error>
</message>
<upload_error>
<file_name>03no08aa.12883.266870.6.11.168_2_0</file_name>
<error_code>-197</error_code>
</upload_error>
]]>
Validate state Invalid |
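Step 2 of the procedure above (finding result files of about 35 KB or more) can be sketched as a small read-only script. This is only an illustration under stated assumptions, not part of the original post: the function names, the size cutoff of 34 KiB (set slightly below the quoted 35 KB to allow for "near"), and the rule that result files end in `_<n>_<n>` (inferred from names such as `03no08aa.12883.266870.6.11.168_2_0` in this thread) are all hypothetical. The script only lists candidates and deletes nothing; the second helper checks a stderr text for the `result_overflow` marker shown in the example.

```python
import re
from pathlib import Path

# Assumption: cutoff slightly below the ~35 KB mentioned in the post.
SIZE_THRESHOLD_BYTES = 34 * 1024

# Assumption: result files end in "_<digits>_<digits>", while workunit
# files (e.g. 16no08ai.21753.20931.9.8.156) carry no such suffix.
RESULT_NAME = re.compile(r"_\d+_\d+$")


def find_oversized_results(project_dir, threshold=SIZE_THRESHOLD_BYTES):
    """Return (path, size) pairs for result-like files at or above threshold.

    Nothing is deleted; the caller decides what to do with the list.
    """
    candidates = []
    for entry in Path(project_dir).iterdir():
        if not entry.is_file():
            continue
        if not RESULT_NAME.search(entry.name):
            continue  # skip workunits, executables and other files
        size = entry.stat().st_size
        if size >= threshold:
            candidates.append((entry, size))
    # Largest files first, so the most obvious overflows come on top.
    return sorted(candidates, key=lambda item: item[1], reverse=True)


def stderr_reports_overflow(stderr_text):
    """True if the stderr text contains the overflow marker from the example."""
    return "result_overflow" in stderr_text
```

Calling `find_oversized_results` with the path of the SETI@home project folder returns the candidate files for manual review. Since size alone is only a heuristic, it seems safer to inspect the list by hand before deleting anything.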
Byron S Goodgame Send message Joined: 16 Jan 06 Posts: 1145 Credit: 3,936,993 RAC: 0 |
Guess I've done something wrong, even though I think I've followed the instructions. I have this task which produced an overflow. I went into the seti folder and deleted the file 16no08ai.21753.20931.9.8.156_0_0, which was 40 KB. I still see 16no08ai.21753.20931.9.8.156 in the seti folder with a size of 367 KB. The file is no longer in the transfer window, and the Tasks window shows 16no08ai.21753.20931.9.8.156_0 with a status of Uploading. I've tried to abort the task now and get no change. Sorry, this is the first one I've come across that I've had to do this with; I thought it seemed simple enough. I still have the file I deleted sitting in the recycle bin and the other one still in the Seti folder, so what do I do now, and what did I do wrong? |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Guess I've done something wrong, even though I think I've followed the instructions. I have this task which produced an overflow. I went into the seti folder and deleted the file 16no08ai.21753.20931.9.8.156_0_0, which was 40 KB. I still see 16no08ai.21753.20931.9.8.156 in the seti folder with a size of 367 KB. The file is no longer in the transfer window, and the Tasks window shows 16no08ai.21753.20931.9.8.156_0 with a status of Uploading. I've tried to abort the task now and get no change.
Yes, you can't abort a job that is marked "uploading"; that's OK. BUT when you connect to the internet, this task's result can't be uploaded (no file). BOINC will complain with something like upload handle missing and set this result to retry the upload. Then you can abort this result's transfer (not the job itself, but the transfer) in the transfer tab. (BTW, give my new build a try: it should have fewer overflows, although VLAR bugs and other issues still remain.) |
Byron S Goodgame Send message Joined: 16 Jan 06 Posts: 1145 Credit: 3,936,993 RAC: 0 |
Thanks Raistmer, I appreciate the info. I think the problem was that, after I deleted the task, I rebooted some time later because my system had been running for a few days and I wanted to give it a better chance at completing more tasks without overflows. When I brought BOINC back up, the task was no longer in the transfer tab for me to abort. It didn't occur to me that would happen. I already have the new app on and running. I put it in the middle of a task I was running and it did fine. So far so good, but I haven't done any of the tasks that usually give me overflows yet, so I don't have any info on that for ya yet. Hopefully there will be no news to report :)
Edit: Also, BTW, I've uploaded the entire app in a zip file onto the web and left a link for it in one of the threads here, for some folks who were having trouble getting email back from Lunatics but still needed the app. I also have a link to your message telling about it so they can visit Lunatics as well. If it's a problem that I've posted it, let me know and I'll take it down. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
So far so good, but I haven't done any of the tasks that usually give me overflows yet, so I don't have any info on that for ya yet. Hopefully there will be no news to report :)
Actually there will be; no miracles yet ;)
These WUs from my test case collection are now completed OK by CUDA MB:
03no08aa.11005.10706.3.11.75
03no08aa.5874.273823.14.11.250
03no08aa.5874.274232.14.11.89
and these still have some bugs (including a driver crash on VLAR):
03dc08ad.15767.890.15.8.213
03dc08ae.15056.890.16.8.52
03no08aa.5874.274232.14.11.84
15no08ac.10856.20256.16.8.135
23ap08aa.4504.481.10.11.187
23ap08aa.4504.890.10.11.63
(see the Lunatics thread to download these test cases) |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
No prob with me at least :) Final goal is to produce more valid results for SETI - your link serves that aim very well, thanks :) |