Modified SETI MB CUDA + opt AP package for full GPU utilization

Profile Byron S Goodgame
Volunteer tester
Joined: 16 Jan 06
Posts: 1145
Credit: 3,936,993
RAC: 0
United States
Message 848202 - Posted: 2 Jan 2009, 14:35:00 UTC - in response to Message 848184.  
Last modified: 2 Jan 2009, 15:07:19 UTC

Yes, I figured it would be something along those lines, though I haven't had any problems with any tasks the times I've run it since having it available.

I do however know I wasn't using the batch file in this case, because it was this task that made me decide to work on the batch file and get one going. Up till then I didn't have enough of them to motivate me to do it, but I didn't like the feel of that WU :) and I thought at the time it was a VLAR affecting me more than they had in the past.

I've also had several tasks that had a sluggish/freezing effect, but that usually only lasts about 10-15 seconds and then the tasks continue without error.
ID: 848202 · Report as offensive
Profile Odan

Joined: 8 May 03
Posts: 91
Credit: 15,331,177
RAC: 0
United Kingdom
Message 848621 - Posted: 3 Jan 2009, 10:40:53 UTC - in response to Message 847261.  

12/31/08 10:40:50|SETI@home|Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 0 completed tasks
12/31/08 10:40:56|SETI@home|Scheduler request completed: got 0 new tasks
Again, thanks for any suggestions or "silly boy it's this" answers.


Your host receives just what it requests, that is, nothing :)
You probably got too many AP tasks, and now BOINC thinks you don't need any more work.
Try increasing the cache size, or just wait until a few APs have been crunched.

(And yes, I know the APs run on the CPU while your GPU is free and idle, but the current BOINC version is too silly to understand this fact ;) )


Thanks, Raistmer.
That was really the question I was asking, "why is BOINC not requesting work". I might be daft but I'm not stupid ;)
I hadn't actually realised that BOINC was being so simplistic and not treating the 2 apps as separate. I had already tried increasing my cache a bit without success but I whacked it right up & the pipe opened. I now have both & I'm controlling the AP manually by stopping requests in my account preferences once I have a decent stock. Not ideal but it works.

Thanks again & also thanks for calling BOINC silly & not me :)
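
To illustrate why a larger cache opened the pipe, here is a toy model of a client that keeps one work queue for all applications. It is only a sketch under assumed numbers: `work_request_seconds` and every value passed to it are hypothetical, and this is not BOINC's actual work-fetch code.

```python
def work_request_seconds(cache_days, ncpus, queued_task_hours):
    """Seconds of work a single-queue client would ask for (toy model)."""
    wanted = cache_days * 86400 * ncpus                 # buffer the client tries to keep
    queued = sum(h * 3600 for h in queued_task_hours)   # work already on the host
    return max(0, wanted - queued)                      # never request a negative amount

# Hypothetical host: 2 CPUs, four long AP tasks of 30 hours each, idle GPU.
ap_tasks = [30, 30, 30, 30]
small_cache = work_request_seconds(1, 2, ap_tasks)
large_cache = work_request_seconds(5, 2, ap_tasks)
print(small_cache, large_cache)
```

With the small cache the queued AP work already exceeds the buffer, so the request is 0 seconds even though the GPU has nothing to do. Raising the cache makes the request positive again.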
ID: 848621 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 848652 - Posted: 3 Jan 2009, 11:56:35 UTC - in response to Message 848621.  

That was really the question I was asking, "why is BOINC not requesting work". I might be daft but I'm not stupid ;)

Sorry, it was just a joke :)
Actually, heterogeneous computing is a pretty new situation for BOINC too; no doubt these scheduling flaws will be removed in the next versions :)
ID: 848652 · Report as offensive
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 848693 - Posted: 3 Jan 2009, 13:44:49 UTC - in response to Message 848652.  

That was really the question I was asking, "why is BOINC not requesting work". I might be daft but I'm not stupid ;)

Sorry, it was just a joke :)
Actually, heterogeneous computing is a pretty new situation for BOINC too; no doubt these scheduling flaws will be removed in the next versions :)

LOL....you have a lotta faith there, my friend........
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 848693 · Report as offensive
Profile Odan

Joined: 8 May 03
Posts: 91
Credit: 15,331,177
RAC: 0
United Kingdom
Message 848887 - Posted: 3 Jan 2009, 21:03:45 UTC - in response to Message 848652.  

That was really the question I was asking, "why is BOINC not requesting work". I might be daft but I'm not stupid ;)

Sorry, it was just a joke :)
Actually, heterogeneous computing is a pretty new situation for BOINC too; no doubt these scheduling flaws will be removed in the next versions :)


LOL, I got the joke, it's OK, no need to apologise!
Things are flowing well now. The additions to stop disaster following memory leaks seem to work quite well BTW.

Do you still want samples of "bad" units?
ID: 848887 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 848890 - Posted: 3 Jan 2009, 21:07:53 UTC - in response to Message 848887.  

That was really the question I was asking, "why is BOINC not requesting work". I might be daft but I'm not stupid ;)

Sorry, it was just a joke :)
Actually, heterogeneous computing is a pretty new situation for BOINC too; no doubt these scheduling flaws will be removed in the next versions :)


LOL, I got the joke, it's OK, no need to apologise!
Things are flowing well now. The additions to stop disaster following memory leaks seem to work quite well BTW.

Do you still want samples of "bad" units?


Only if they belong to some new type of error. The already "known" bugs are posted here:
http://lunatics.kwsn.net/gpu-crunching/wus-that-cuda-mb-cant-do-correctly.0.html
If you find some new type of bug, please post it there too in the same format (a comment about what is new in this example, the attached task itself, the online result or a link to it, and the standalone result and log, if you did a standalone run too).
ID: 848890 · Report as offensive
Profile Robi

Joined: 24 Oct 00
Posts: 33
Credit: 886,890
RAC: 1
United States
Message 848904 - Posted: 3 Jan 2009, 21:20:58 UTC - in response to Message 847093.  

When are the graphics coming back for the screensaver? I miss the screensaver graphics.

you seem to be running the regular apps, thus you should be able to have the graphics, unless you installed BOINC as a service (runs if user is logged off) where the graphics are not enabled.
Robi
ID: 848904 · Report as offensive
Profile Loony
Joined: 8 Dec 99
Posts: 5
Credit: 3,193,475
RAC: 78
United Kingdom
Message 848992 - Posted: 4 Jan 2009, 0:25:05 UTC

Tried CUDA version over past 3 days....

Ensured Video card had latest drivers (Geforce 8400 GS, driver version 178.24)

Estimated times for workunits looked much shorter... but kept crashing my video card....

Vista machine running quad core....
Normally 2 cores assigned to BOINC

Rolled back to earlier NON cuda version..
ID: 848992 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 848994 - Posted: 4 Jan 2009, 0:34:04 UTC - in response to Message 848992.  
Last modified: 4 Jan 2009, 0:34:38 UTC



Rolled back to earlier NON cuda version..

If you want a set-and-forget box, that's the correct decision for now. CUDA MB still requires additional work from the user to run more or less smoothly (actually, I still get a dotted screen from time to time, although I try to disable/abort tasks that are known to be buggy on CUDA MB).
If you want to participate in debugging, you could abort the already known buggy tasks manually or via script and try to find new bugs.
ID: 848994 · Report as offensive
(retired account)
Volunteer tester

Joined: 5 May 99
Posts: 30
Credit: 91,116
RAC: 0
Message 849066 - Posted: 4 Jan 2009, 3:24:56 UTC
Last modified: 4 Jan 2009, 3:45:59 UTC

@MarkJ, Josef and Raistmer: thanks for your replies. I also read your PM now, Raistmer, sorry for that, a bit late.

For now I just want to note the following two results, which might be interesting, since they have the same AR but a different outcome:

resultid=5163428
resultid=5163425

The AR is 0.138212 and both workunits are even from the same series, which is 03no08aa.12883.270960.6.11.xxx. 5163428 caused the video driver to crash and resulted in the usual bunch of various error messages, then 5163425 finished o.k. and was validated against a result calculated on the CPU. It shows again (this has already been discussed) that there is no threshold concerning the AR, where you could say, above that value it's fine and below it will crash when using the CUDA-accelerated application. The reported free GPU memory was also the same for both results.
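
The point about there being no AR threshold can be checked mechanically. The sketch below is only an illustration: the helper name `separating_threshold_exists` is made up, and the data are just the two results quoted in this post.

```python
def separating_threshold_exists(results):
    """True if some AR cutoff puts every crash at or below it and every success above it."""
    crashed = [ar for ar, ok in results if not ok]
    finished = [ar for ar, ok in results if ok]
    if not crashed or not finished:
        return True  # only one kind of outcome, so any cutoff works trivially
    return max(crashed) < min(finished)

# (AR, finished_ok) for resultid 5163428 (crashed) and 5163425 (finished o.k.).
observed = [(0.138212, False), (0.138212, True)]
print(separating_threshold_exists(observed))
```

Because the crashed and the finished result share the same AR, no cutoff can separate them, and the check reports that for these two results.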

Personal note: I'll be quite occupied with 'real life' the coming days or maybe weeks, so I guess I will keep my GTX260 happy with some GPUgrid workunits and won't read/post here much. I wish you all a good start into the New Year and hope for some interesting things to come with CUDA development here at SETI@home as well as for other projects. Read you later *g*!

Regards
Alex
ID: 849066 · Report as offensive
OzzFan (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 849124 - Posted: 4 Jan 2009, 5:28:15 UTC - in response to Message 848992.  

@Loony: Please don't keep posting the same thing in several threads. Those of us who read these boards will see your message the first time; we don't need to read it several times. Thanks.
ID: 849124 · Report as offensive
MarkJ (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 17 Feb 08
Posts: 1139
Credit: 80,854,192
RAC: 5
Australia
Message 849204 - Posted: 4 Jan 2009, 10:43:49 UTC - in response to Message 847075.  

I still haven't seen what it will do when it finishes the last cuda task. My last cuda task should be done in 30 mins or so...


Well it seems that it behaves in that regard. There are issues with exiting/shutting down science apps but then it is a development version. At least 6.5.0 is better with cuda tasks than 6.4.5.

The underlying issues with work fetch are being discussed at the moment, and then there will be a new BOINC. How long this will take is anyone's guess. There are also server-side changes, from what I gather of the proposal.

That still leaves Seti with a science app that doesn't behave.
BOINC blog
ID: 849204 · Report as offensive
Profile Vipin Palazhi
Joined: 29 Feb 08
Posts: 286
Credit: 167,386,578
RAC: 0
India
Message 849354 - Posted: 4 Jan 2009, 18:06:22 UTC
Last modified: 4 Jan 2009, 18:10:39 UTC

After a short gap, I decided to restart CUDA crunching. I used the modified package with cc_config file set to 2 cpus. The system downloaded 2 MB and 2 AP work units.

When I checked in the tasks tab, I found that the first MB unit got trashed. For the second MB unit, GPU was being used, but the CPU utilization was 100%.

The AP unit was almost at a standstill and the MB unit was being crunched by the CPU (1.453% after 9 min and 33 sec). Tried restarting, but no change. Then I paused the MB unit and it immediately went to 100%. Both were reported back as success.

The error message is as follows -

Cuda error 'cudaMalloc((void**) &dev_PoT' in file 'd:/BTR/seticuda/Berkeley_rep/client/cuda/cudaAcceleration.cu' in line 334 : out of memory.
setiathome_CUDA: CUDA runtime ERROR in device memory allocation (Step 1 of 3). Falling back to HOST CPU processing...


Guess I will revert back to non-CUDA crunching after I finish the two AP units.

Edit: I also used the VB script, and as popandbob mentioned, I had to change the log file names
set logFile = FS.OpenTextFile(logFileName,8,True)
and
set objLogFile = FS.GetFile(LogFileName)
______________


ID: 849354 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 849407 - Posted: 4 Jan 2009, 21:19:08 UTC - in response to Message 849354.  
Last modified: 4 Jan 2009, 21:22:05 UTC

After a short gap, I decided to restart CUDA crunching. I used the modified package with cc_config file set to 2 cpus. The system downloaded 2 MB and 2 AP work units.

When I checked in the tasks tab, I found that the first MB unit got trashed. For the second MB unit, GPU was being used, but the CPU utilization was 100%.

The AP unit was almost at a standstill and the MB unit was being crunched by the CPU (1.453% after 9 min and 33 sec). Tried restarting, but no change. Then I paused the MB unit and it immediately went to 100%. Both were reported back as success.

The error message is as follows -

Cuda error 'cudaMalloc((void**) &dev_PoT' in file 'd:/BTR/seticuda/Berkeley_rep/client/cuda/cudaAcceleration.cu' in line 334 : out of memory.
setiathome_CUDA: CUDA runtime ERROR in device memory allocation (Step 1 of 3). Falling back to HOST CPU processing...


Guess I will revert back to non-CUDA crunching after I finish the two AP units.

Edit: I also used the VB script, and as popandbob mentioned, I had to change the log file names
set logFile = FS.OpenTextFile(logFileName,8,True)
and
set objLogFile = FS.GetFile(LogFileName)


Total GPU memory 939196416 free GPU memory 143332864

It seems your GPU has 1GB of memory, but only about 140MB was free at the start of the task. That is why it said "out of memory".
Did you run any 3D graphics while using CUDA? Why is free memory so low?

For another task:
Total GPU memory 939196416 free GPU memory 358807552

Again, where is the rest of the GPU memory?.... What driver do you use?

Please try to answer these questions and try to crunch a few more CUDA tasks after an OS reboot. It's a pretty interesting case :)
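
For anyone who wants to compare these figures across tasks, the two memory lines quoted above can be converted to MiB and a free fraction. This is only a small sketch: the function name `free_memory_report` is made up, and it assumes the stderr line has exactly the wording shown in this post.

```python
import re

LINE = re.compile(r"Total GPU memory (\d+) free GPU memory (\d+)")

def free_memory_report(line):
    """Return (total MiB, free MiB, free fraction) parsed from one stderr line."""
    match = LINE.search(line)
    if match is None:
        raise ValueError("line does not contain GPU memory figures")
    total, free = (int(g) for g in match.groups())
    mib = 1024 * 1024
    return total / mib, free / mib, free / total

for line in (
    "Total GPU memory 939196416 free GPU memory 143332864",
    "Total GPU memory 939196416 free GPU memory 358807552",
):
    total_mib, free_mib, fraction = free_memory_report(line)
    print(f"{total_mib:.0f} MiB total, {free_mib:.0f} MiB free ({fraction:.0%})")
```

For the two tasks above this works out to roughly 137 MiB and 342 MiB free out of about 896 MiB.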
ID: 849407 · Report as offensive
Maik

Joined: 15 May 99
Posts: 163
Credit: 9,208,555
RAC: 0
Germany
Message 849415 - Posted: 4 Jan 2009, 21:45:36 UTC - in response to Message 849354.  

Edit: I also used the VB script, and as popandbob mentioned, I had to change the log file names
set logFile = FS.OpenTextFile(logFileName,8,True)
and
set objLogFile = FS.GetFile(LogFileName)


What's the error message you get if you use the original version?
And a second question: what is written in the first line of the script you are using?
ID: 849415 · Report as offensive
Profile Byron S Goodgame
Volunteer tester
Joined: 16 Jan 06
Posts: 1145
Credit: 3,936,993
RAC: 0
United States
Message 849481 - Posted: 5 Jan 2009, 1:16:45 UTC
Last modified: 5 Jan 2009, 1:25:35 UTC

I think this result is a potential problem here. The canonical result was given to a CUDA task, since it was verified as an overflow by another CUDA machine. The problem is, there's a wingman using the stock CPU application that came out with another result that wasn't an overflow. I'm pretty sure that if another wingman had been assigned that didn't use CUDA, the canonical result would have been different, and not an overflow. Are there other cases like this? I don't think this is script generated to award people for using CUDA, since CUDA was given the canonical result status.
ID: 849481 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 849501 - Posted: 5 Jan 2009, 1:59:42 UTC - in response to Message 849481.  

I think this result is a potential problem here. The canonical result was given to a CUDA task, since it was verified as an overflow by another CUDA machine. The problem is, there's a wingman using the stock CPU application that came out with another result that wasn't an overflow. I'm pretty sure that if another wingman had been assigned that didn't use CUDA, the canonical result would have been different, and not an overflow. Are there other cases like this? I don't think this is script generated to award people for using CUDA, since CUDA was given the canonical result status.


This situation is described in this thread: http://setiathome.berkeley.edu/forum_thread.php?id=50937 and it is the most dangerous one, because an invalid result goes into the science database. That's why it's better to abort even a "good" task from the suspected AR range than to give it a chance to produce an invalid overflow and outnumber a valid CPU result...
ID: 849501 · Report as offensive
Maik

Joined: 15 May 99
Posts: 163
Credit: 9,208,555
RAC: 0
Germany
Message 849502 - Posted: 5 Jan 2009, 2:00:27 UTC
Last modified: 5 Jan 2009, 2:02:49 UTC

Worst case: me and my wingman both using CUDA -> result overflow ... next one ... ... and so on ...

Let's watch this one. It could be one like yours. Or this ... lol this

ok, its enough now ;)
ID: 849502 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 849506 - Posted: 5 Jan 2009, 2:04:51 UTC - in response to Message 849502.  

Yes, those 3% Matt mentioned don't apply here. CUDA MB is fast, so it tends to pair with another CUDA MB host... and then we get a comparison between 2 invalid overflows (and we now know very well how fast these overflows can be generated by the CUDA app), and then the invalid result goes to the database... a really bad situation, and it should be repaired as soon as possible. For now CUDA MB just should not be used unattended....
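
The pairing problem can be shown with a toy quorum check. This is a deliberately simplified sketch, not the project's real validator: `canonical_result`, the host names and the outcome labels are all hypothetical, and the quorum of 2 is an assumption.

```python
from collections import Counter

def canonical_result(reports, quorum=2):
    """Pick the first outcome reported by at least `quorum` hosts, else None."""
    counts = Counter()
    for _host, outcome in reports:   # reports are in order of arrival
        counts[outcome] += 1
        if counts[outcome] >= quorum:
            return outcome
    return None

# Hypothetical workunit: both fast CUDA hosts overflow, the CPU host reports later.
reports = [
    ("cuda_host_a", "overflow"),
    ("cuda_host_b", "overflow"),
    ("cpu_host", "normal"),
]
print(canonical_result(reports))
```

In this model the two matching overflows reach the quorum before the CPU result arrives, so the overflow becomes canonical. If two CPU hosts agree first, the lone overflow is simply outvoted.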
ID: 849506 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 849509 - Posted: 5 Jan 2009, 2:12:14 UTC - in response to Message 849506.  

BTW, I run CUDA MB from cache with the network disabled, so I can check which results are good and which are overflowed.
Any proposals on how to effectively abort results with overflows (already processed, but not yet uploaded or reported from the host)? It would be better for the server to receive computation errors than wrong results that could easily deceive the validator...
ID: 849509 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.