Modified SETI MB CUDA + opt AP package for full GPU utilization

Byron S Goodgame Volunteer tester Joined: 16 Jan 06 Posts: 1145 Credit: 3,936,993 RAC: 0	Message 848202 - Posted: 2 Jan 2009, 14:35:00 UTC - in response to Message 848184. Last modified: 2 Jan 2009, 15:07:19 UTC Yes, I figured it would be something along those lines, though I haven't had any problems with any tasks the times I've run it since having it available. I do however know I wasn't using the batch file in this case, because it was this task that made me decide to work on the batch file and get one going. Up till then I didn't have enough of them to motivate me to do it, but I didn't like the feel of that WU :) and I thought at the time it was a VLAR affecting me more than they had in the past. I've also had several tasks that had a sluggish/freezing effect, but that usually only lasts about 10-15 seconds and then the tasks continue without error.

Odan Joined: 8 May 03 Posts: 91 Credit: 15,331,177 RAC: 0	Message 848621 - Posted: 3 Jan 2009, 10:40:53 UTC - in response to Message 847261. 12/31/08 10:40:50|SETI@home|Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 0 completed tasks 12/31/08 10:40:56|SETI@home|Scheduler request completed: got 0 new tasks Again, thanks for any suggestions or "silly boy it's this" answers. Your host receives just what it requests - that is - nothing :) You probably got too many AP and now BOINC thinks you don't need any more work. Try to increase cache size or just wait until a few APs are crunched. (And yes, I know AP runs on the CPU and your GPU is free and idle - but the current BOINC version is too silly to understand this fact ;) ) Thanks, Raistmer. That was really the question I was asking, "why is BOINC not requesting work". I might be daft but I'm not stupid ;) I hadn't actually realised that BOINC was being so simplistic and not treating the 2 apps as separate. I had already tried increasing my cache a bit without success but I whacked it right up & the pipe opened. I now have both & I'm controlling the AP manually by stopping requests in my account preferences once I have a decent stock. Not ideal but it works. Thanks again & also thanks for calling BOINC silly & not me :)
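
The behaviour described in this post, where BOINC requests 0 seconds of work because the queued AP tasks already fill the cache even though the GPU is idle, can be illustrated with a toy model. This is a simplified sketch and not BOINC's actual work-fetch code: the function name, the single shared queue, and the sample task lengths are all assumptions made for illustration.

```python
def seconds_to_request(cache_days, ncpus, queued_task_seconds):
    """Toy model: request only enough work to fill one shared cache.

    Hypothetical simplification: CPU and GPU work are counted in the
    same queue, which is the flaw discussed in the thread.
    """
    cache_seconds = cache_days * 86400 * ncpus
    queued = sum(queued_task_seconds)
    return max(0, cache_seconds - queued)

# Two long AP tasks (assumed 40 hours each) overfill a 1-day, 2-CPU cache...
ap_tasks = [40 * 3600, 40 * 3600]
small_cache = seconds_to_request(1, 2, ap_tasks)
# ...so only a much larger cache makes the client ask for work again.
large_cache = seconds_to_request(5, 2, ap_tasks)
print(small_cache, large_cache)
```

In this model the small cache yields a request of zero seconds, which matches the "Requesting 0 seconds of work" log line, while the larger cache leaves room for new tasks.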

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 848652 - Posted: 3 Jan 2009, 11:56:35 UTC - in response to Message 848621. That was really the question I was asking, "why is BOINC not requesting work". I might be daft but I'm not stupid ;) Sorry, it was just a joke :) Actually it's a pretty new situation for BOINC too - to have heterogeneous computing, no doubt these scheduling flaws will be removed in the next versions :)

kittyman Volunteer tester Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 848693 - Posted: 3 Jan 2009, 13:44:49 UTC - in response to Message 848652. That was really the question I was asking, "why is BOINC not requesting work". I might be daft but I'm not stupid ;) Sorry, it was just a joke :) Actually it's a pretty new situation for BOINC too - to have heterogeneous computing, no doubt these scheduling flaws will be removed in the next versions :) LOL....you have a lotta faith there, my friend........ "Freedom is just Chaos, with better lighting." Alan Dean Foster

Odan Joined: 8 May 03 Posts: 91 Credit: 15,331,177 RAC: 0	Message 848887 - Posted: 3 Jan 2009, 21:03:45 UTC - in response to Message 848652. That was really the question I was asking, "why is BOINC not requesting work". I might be daft but I'm not stupid ;) Sorry, it was just a joke :) Actually it's a pretty new situation for BOINC too - to have heterogeneous computing, no doubt these scheduling flaws will be removed in the next versions :) LOL, I got the joke, it's OK, no need to apologise! Things flowing well now. The additions to stop disaster following memory leaks seem to work quite well BTW. Do you still want samples of "bad" units?

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 848890 - Posted: 3 Jan 2009, 21:07:53 UTC - in response to Message 848887. That was really the question I was asking, "why is BOINC not requesting work". I might be daft but I'm not stupid ;) Sorry, it was just a joke :) Actually it's a pretty new situation for BOINC too - to have heterogeneous computing, no doubt these scheduling flaws will be removed in the next versions :) LOL, I got the joke, it's OK, no need to apologise! Things flowing well now. The additions to stop disaster following memory leaks seem to work quite well BTW. Do you still want samples of "bad" units? Only if they belong to some new type of error. Already "known" bugs are posted here: http://lunatics.kwsn.net/gpu-crunching/wus-that-cuda-mb-cant-do-correctly.0.html If you find some new type of bug - please, post it there too in the same format (a comment about what is new in this example, the attached task itself, the online result or a link to it, the standalone result and log (if you did a standalone run too)).

Robi Joined: 24 Oct 00 Posts: 33 Credit: 886,890 RAC: 1	Message 848904 - Posted: 3 Jan 2009, 21:20:58 UTC - in response to Message 847093. when are the graphics coming back for the screensaver ? I miss the screen saver graphics. you seem to be running the regular apps, thus you should be able to have the graphics, unless you installed BOINC as a service (runs if user is logged off) where the graphics are not enabled. Robi

Loony Joined: 8 Dec 99 Posts: 5 Credit: 3,193,475 RAC: 78	Message 848992 - Posted: 4 Jan 2009, 0:25:05 UTC Tried CUDA version over past 3 days.... Ensured video card had latest drivers (GeForce 8400 GS, driver version 178.24) Estimated times for workunits looked much shorter... but kept crashing my video card.... Vista machine running quad core.... Normally 2 cores assigned to BOINC Rolled back to earlier NON cuda version..

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 848994 - Posted: 4 Jan 2009, 0:34:04 UTC - in response to Message 848992. Last modified: 4 Jan 2009, 0:34:38 UTC Rolled back to earlier NON cuda version.. If you want a set-and-forget box - it's the correct decision for now. CUDA MB still requires additional work from the user to run more or less smoothly (actually I still get a dotted screen from time to time, although I try to disable/abort tasks that are known to be buggy on CUDA MB). If you want to participate in debugging - you could abort already known buggy tasks manually or via script and try to find new bugs.

(retired account) Volunteer tester Joined: 5 May 99 Posts: 30 Credit: 91,116 RAC: 0	Message 849066 - Posted: 4 Jan 2009, 3:24:56 UTC Last modified: 4 Jan 2009, 3:45:59 UTC @MarkJ, Josef and Raistmer: thanks for your replies. I also read your PM now, Raistmer, sorry for that, a bit late. For now I just want to note the following two results, which might be interesting, since they have the same AR but a different outcome: resultid=5163428 resultid=5163425 The AR is 0.138212 and both workunits are even from the same series, which is 03no08aa.12883.270960.6.11.xxx. 5163428 caused the video driver to crash and resulted in the usual bunch of various error messages, then 5163425 finished o.k. and was validated against a result calculated on the CPU. It shows again (this has already been discussed) that there is no AR threshold where you could say that above that value it's fine and below it the CUDA-accelerated application will crash. The reported free GPU memory was also the same for both results. Personal note: I'll be quite occupied with 'real life' the coming days or maybe weeks, so I guess I will keep my GTX260 happy with some GPUgrid workunits and won't read/post here much. I wish you all a good start into the New Year and hope for some interesting things to come with CUDA development here at SETI@home as well as for other projects. Read you later g! Regards Alex

OzzFan Volunteer tester Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28	Message 849124 - Posted: 4 Jan 2009, 5:28:15 UTC - in response to Message 848992. @Loony: Please don't keep posting the same thing in several threads. Those of us who read these boards will see your message the first time; we don't need to read it several times. Thanks.

MarkJ Volunteer tester Joined: 17 Feb 08 Posts: 1139 Credit: 80,854,192 RAC: 5	Message 849204 - Posted: 4 Jan 2009, 10:43:49 UTC - in response to Message 847075. I still haven't seen what it will do when it finishes the last cuda task. My last cuda task should be done in 30 mins or so... Well it seems that it behaves in that regard. There are issues with exiting/shutting down science apps but then it is a development version. At least 6.5.0 is better with cuda tasks than 6.4.5. The underlying issues with work fetch are being discussed at the moment, and then there will be a new BOINC. How long this will take is anyone's guess. There are also server-side changes from what I gather of the proposal. That still leaves Seti with a science app that doesn't behave. BOINC blog

Vipin Palazhi Joined: 29 Feb 08 Posts: 286 Credit: 167,386,578 RAC: 0	Message 849354 - Posted: 4 Jan 2009, 18:06:22 UTC Last modified: 4 Jan 2009, 18:10:39 UTC After a short gap, I decided to restart CUDA crunching. I used the modified package with cc_config file set to 2 cpus. The system downloaded 2 MB and 2 AP work units. When I checked in the tasks tab, I found that the first MB unit got trashed. For the second MB unit, GPU was being used, but the CPU utilization was 100%. The AP unit was almost at a standstill and the MB unit was being crunched by the CPU (1.453% after 9 min and 33 sec). Tried restarting, but no change. Then I paused the MB unit and it immediately went to 100%. Both were reported back as success. The error message is as follows - Cuda error 'cudaMalloc((void*) &dev_PoT' in file 'd:/BTR/seticuda/Berkeley_rep/client/cuda/cudaAcceleration.cu' in line 334 : out of memory. setiathome_CUDA: CUDA runtime ERROR in device memory allocation (Step 1 of 3). Falling back to HOST CPU processing... Guess I will revert back to non-CUDA crunching after I finish the two AP units. Edit: I also used the VB script, and as popandbob mentioned, I had to change the log file names set logFile = FS.OpenTextFile(logFileName,8,True) and set objLogFile = FS.GetFile(LogFileName)

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 849407 - Posted: 4 Jan 2009, 21:19:08 UTC - in response to Message 849354. Last modified: 4 Jan 2009, 21:22:05 UTC After a short gap, I decided to restart CUDA crunching. I used the modified package with cc_config file set to 2 cpus. The system downloaded 2 MB and 2 AP work units. When I checked in the tasks tab, I found that the first MB unit got trashed. For the second MB unit, GPU was being used, but the CPU utilization was 100%. The AP unit was almost at a standstill and the MB unit was being crunched by the CPU (1.453% after 9 min and 33 sec). Tried restarting, but no change. Then I paused the MB unit and it immediately went to 100%. Both were reported back as success. The error message is as follows - Cuda error 'cudaMalloc((void*) &dev_PoT' in file 'd:/BTR/seticuda/Berkeley_rep/client/cuda/cudaAcceleration.cu' in line 334 : out of memory. setiathome_CUDA: CUDA runtime ERROR in device memory allocation (Step 1 of 3). Falling back to HOST CPU processing... Guess I will revert back to non-CUDA crunching after I finish the two AP units. Edit: I also used the VB script, and as popandbob mentioned, I had to change the log file names set logFile = FS.OpenTextFile(logFileName,8,True) and set objLogFile = FS.GetFile(LogFileName) Total GPU memory 939196416 free GPU memory 143332864 It seems your GPU has 1GB of memory but only ~100MB were free at the start of the task. So it said "out of memory". Did you run any 3D graphics while using CUDA? Why so little free memory? For another task: Total GPU memory 939196416 free GPU memory 358807552 Again, where is the rest of the GPU memory?.... What driver do you use? Please try to answer these questions and try to crunch a few more CUDA tasks after an OS reboot. It's a pretty interesting case :)
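
The two "Total GPU memory / free GPU memory" pairs quoted in this post are raw byte counts, which are hard to compare by eye. A small sketch of the conversion follows; the helper name `report` is made up for illustration, and only the four byte values come from the thread.

```python
MIB = 1024 * 1024

def report(total_bytes, free_bytes):
    """Return (total MiB, free MiB, free fraction) for one task's log line."""
    return total_bytes / MIB, free_bytes / MIB, free_bytes / total_bytes

# Byte counts as reported in the two task logs quoted in the thread.
first = report(939196416, 143332864)
second = report(939196416, 358807552)
for total_mib, free_mib, frac in (first, second):
    print(f"total {total_mib:.1f} MiB, free {free_mib:.1f} MiB ({frac:.0%} free)")
```

Read this way, the card reports a little under 896 MiB in total, with roughly 137 MiB free for the task that ran out of memory and roughly 342 MiB free for the other one.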

Maik Joined: 15 May 99 Posts: 163 Credit: 9,208,555 RAC: 0	Message 849415 - Posted: 4 Jan 2009, 21:45:36 UTC - in response to Message 849354. Edit: I also used the VB script, and as popandbob mentioned, I had to change the log file names set logFile = FS.OpenTextFile(logFileName,8,True) and set objLogFile = FS.GetFile(LogFileName) What's the error message you are getting if you use the original version? And second question: What is written in the first line of the script you are using?

Byron S Goodgame Volunteer tester Joined: 16 Jan 06 Posts: 1145 Credit: 3,936,993 RAC: 0	Message 849481 - Posted: 5 Jan 2009, 1:16:45 UTC Last modified: 5 Jan 2009, 1:25:35 UTC I think this result is a potential problem here. The canonical result was given to a CUDA task since it was verified by another CUDA machine as an overflow. Problem is, there's a wingman using the stock cpu application that came out with another result that wasn't an overflow. I'm pretty sure if there had been another wingman assigned that didn't use CUDA the canonical result would have been different, and not an overflow. Are there other cases like this? I don't think this is script generated to award people for using CUDA, since CUDA was given the canonical result status.

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 849501 - Posted: 5 Jan 2009, 1:59:42 UTC - in response to Message 849481. I think this result is a potential problem here. The canonical result was given to a CUDA task since it was verified by another CUDA machine as an overflow. Problem is, there's a wingman using the stock cpu application that came out with another result that wasn't an overflow. I'm pretty sure if there had been another wingman assigned that didn't use CUDA the canonical result would have been different, and not an overflow. Are there other cases like this? I don't think this is script generated to award people for using CUDA, since CUDA was given the canonical result status. This situation is described in this thread http://setiathome.berkeley.edu/forum_thread.php?id=50937 and it's the most dangerous one, because an invalid result goes to the science database. That's why it is better to abort even a "good" task from the suspected range of AR than to give it a chance to produce an invalid overflow and outnumber the valid CPU result...

Maik Joined: 15 May 99 Posts: 163 Credit: 9,208,555 RAC: 0	Message 849502 - Posted: 5 Jan 2009, 2:00:27 UTC Last modified: 5 Jan 2009, 2:02:49 UTC worst case: me and wingman using cuda -> result overflow ... next one ... ... and so on ... let's watch this one. could be one like yours. or this ... lol this ok, it's enough now ;)

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 849506 - Posted: 5 Jan 2009, 2:04:51 UTC - in response to Message 849502. Yes, those 3% Matt mentioned don't apply here. CUDA MB is fast so it tends to pair with another CUDA MB host... and then we get a comparison between 2 invalid overflows (and we now have very good knowledge of how fast these overflows can be generated by the CUDA app) and then the invalid result goes to the database... really bad situation. And it should be repaired as soon as possible. For now CUDA MB just should not be used unattended....
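
The failure mode described in this post, two fast CUDA hosts agreeing on a bogus overflow before a CPU host can disagree, can be illustrated with a toy quorum check. This is a hypothetical simplification and not the project's real validator: the function name, the quorum of 2, and the outcome labels are assumptions made for illustration.

```python
from collections import Counter

def canonical_result(results, quorum=2):
    """Toy validator: the first outcome reported by `quorum` hosts wins.

    `results` is a list of (host_type, outcome) pairs in arrival order.
    Hypothetical simplification of quorum-based validation.
    """
    counts = Counter()
    for _host_type, outcome in results:
        counts[outcome] += 1
        if counts[outcome] >= quorum:
            return outcome
    return None  # no agreement yet

# Fast CUDA hosts report first; both produced a bogus overflow.
arrivals = [("cuda", "overflow"), ("cuda", "overflow"), ("cpu", "signals_ok")]
print(canonical_result(arrivals))

# With a CPU wingman in the initial pair the overflow is not confirmed.
mixed = [("cuda", "overflow"), ("cpu", "signals_ok")]
print(canonical_result(mixed))
```

In the first case the overflow becomes canonical and the later CPU result is simply outvoted; in the second case no outcome reaches the quorum, so nothing is accepted until another host reports.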

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 849509 - Posted: 5 Jan 2009, 2:12:14 UTC - in response to Message 849506. BTW, I run CUDA MB from cache with the network disabled, so I can check which results are good and which are overflowed ones. Any proposals for how to effectively abort results with overflows (already processed but not yet uploaded or reported from the host)? It is better for the server to receive computational errors than wrong results that could easily deceive the validator...
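
One possible first step toward the proposal requested in this post is simply identifying which finished results overflowed while the network is still disabled. The sketch below only does that detection step; how to actually abort such results is the open question in the thread. The marker string, the function name, and the sample task names and log text are all assumptions for illustration, so check them against real logs before relying on them.

```python
OVERFLOW_MARKER = "result_overflow"  # assumed marker text; verify in your own logs

def find_overflowed(stderr_by_task, marker=OVERFLOW_MARKER):
    """Return the sorted names of tasks whose stderr text contains the marker."""
    return sorted(name for name, text in stderr_by_task.items() if marker in text)

# Hypothetical task names and placeholder log text, for illustration only.
sample = {
    "task_a": "... result_overflow ...",
    "task_b": "... finished normally ...",
}
flagged = find_overflowed(sample)
print(flagged)
```

The marker is kept as a parameter so the same check can be pointed at whatever text a given application version actually writes.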
