CUDA MB V12b rebuild supposed to work with Fermi GPUs

MarkJ
Volunteer tester
Message 990092 - Posted: 18 Apr 2010, 12:00:39 UTC - in response to Message 990083.  

Well, AR 0.44, the midrange one that's usually best for the GPU, fails with -9 on the 480, and the Lunatics site is down (at least I can't reach it).
Let's try another trick for now, then.
Try disabling all GPUs but one (this can be done via BOINC settings, by physically removing the other GPUs, or by suspending all CUDA MB tasks but one).
Will a single GPU produce the same error on these tasks?

EDIT: also, try updating to the 197.45 driver if it's available for your GPU.

EDIT2:
Rather high-AR tasks fail too, for example:
WU true angle range is : 1.373955
met invalid overflow.


197.45 doesn't support the GTX 470 or GTX 480 according to NVIDIA's documentation. They recommend 197.41 for the 400 series at the moment.
Raistmer
Volunteer developer
Volunteer tester
Message 990095 - Posted: 18 Apr 2010, 12:21:54 UTC - in response to Message 990092.  


197.45 doesn't support the GTX 470 or GTX 480 according to NVIDIA's documentation. They recommend 197.41 for the 400 series at the moment.

Thanks for the info. That route is ruled out, then.
Raistmer
Volunteer developer
Volunteer tester
Message 990096 - Posted: 18 Apr 2010, 12:24:43 UTC

@Todd Hebert
Please check your PM for a link to Jason's V13 hybrid build, linked against the CUDA 3.0 SDK. Maybe it will help with the overflow issue.
Todd Hebert
Volunteer tester
Message 990151 - Posted: 18 Apr 2010, 17:12:42 UTC

OK, I'll run it here shortly and post back the behavior I find with the build.
Raistmer
Volunteer developer
Volunteer tester
Message 990173 - Posted: 18 Apr 2010, 18:00:15 UTC

I looked at the latest results - it seems V13 works OK; it produces correct results for the ARs that errored out with V12b.
The next step is to gauge the current host's productivity.
jason_gee
Volunteer developer
Volunteer tester
Message 990175 - Posted: 18 Apr 2010, 18:01:28 UTC

fascinating :D
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 990175 · Report as offensive
W-H-Oami
Message 990180 - Posted: 18 Apr 2010, 18:10:20 UTC

Question: the new Fermi GPUs are multi-tasking, multi-core, etc. Does that mean they will be able to run multiple CUDA23 tasks at the same time?
If so, how many?
jason_gee
Volunteer developer
Volunteer tester
Message 990181 - Posted: 18 Apr 2010, 18:19:58 UTC - in response to Message 990180.  
Last modified: 18 Apr 2010, 18:21:13 UTC

Question: the new Fermi GPUs are multi-tasking, multi-core, etc. Does that mean they will be able to run multiple CUDA23 tasks at the same time?
If so, how many?


'Probably', but the overhead is unknown at this stage .. theoretically up to 16, though there may not be enough RAM for that, and the overheads may be too high. Most likely the best approach will involve implementing concurrent kernel execution within a single Fermi-specific app (increasing locality while reducing overhead), because the device is almost certainly under-utilised with the current builds. I've been looking at how to approach that for quite some time (basically since the Fermi whitepaper release last September or so), and once basic operation is confirmed/fixed we can explore the options a bit more (see the capability-check sketch below).

Jason
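
For reference, a minimal sketch of that runtime capability check. cudaDeviceProp.concurrentKernels is part of the CUDA 3.0 runtime API; the program around it is purely illustrative, not from the MB sources:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        /* concurrentKernels is nonzero on Fermi (compute 2.0) parts
           when the driver & runtime allow kernels from different
           streams to execute concurrently. */
        printf("Device %d: %s, compute %d.%d, concurrentKernels=%d\n",
               d, prop.name, prop.major, prop.minor,
               prop.concurrentKernels);
    }
    return 0;
}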
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 990181 · Report as offensive
Todd Hebert
Volunteer tester
Message 990182 - Posted: 18 Apr 2010, 18:20:44 UTC - in response to Message 990180.  

I don't believe that would be possible - it would be very challenging to isolate the cores at that level. Just think how long it has taken ordinary applications to use multi-core CPUs properly. But given the right SDK anything is possible - it would just take a long time, and then a new method would come along.
jason_gee
Volunteer developer
Volunteer tester
Message 990184 - Posted: 18 Apr 2010, 18:26:47 UTC - in response to Message 990182.  
Last modified: 18 Apr 2010, 18:28:28 UTC

I don't believe that would be possible - it would be very challenging to isolate the cores at that level. Just think how long it has taken ordinary applications to use multi-core CPUs properly. But given the right SDK anything is possible - it would just take a long time, and then a new method would come along.


When the appropriate non-default-mode concurrent streams are used (which they aren't yet), the driver & hardware are 'supposed' to pack the kernels 'Tetris-style' to fill the execution resources. That's one of the major design progressions of this architecture over any before, which traditionally could execute only one device kernel at a time. That meant most of the cores sat idle on, for example, a GTX 285 running pulsefind kernels in a very low angle range task, which can get as narrow in execution width as one part of a core, and very long in execution time.
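
To make that concrete, a minimal sketch of launching work into distinct non-default streams. The kernel, buffer sizes, and stream count here are hypothetical placeholders rather than anything from the MB builds:

#include <cuda_runtime.h>

/* Hypothetical stand-in for a real MB kernel. */
__global__ void workKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;
}

int main(void)
{
    const int kStreams = 4;       /* Fermi can overlap up to 16 kernels */
    const int n = 1 << 20;
    cudaStream_t streams[kStreams];
    float *buf[kStreams];

    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc((void **)&buf[s], n * sizeof(float));
    }

    /* Kernels launched into distinct non-default streams are allowed
       to execute concurrently on compute 2.0 hardware; in the default
       stream (stream 0) they would serialize. */
    for (int s = 0; s < kStreams; ++s)
        workKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);

    cudaThreadSynchronize();      /* CUDA 3.0-era synchronization call */

    for (int s = 0; s < kStreams; ++s) {
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}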
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 990184 · Report as offensive
Todd Hebert
Volunteer tester
Message 990187 - Posted: 18 Apr 2010, 18:30:43 UTC

Everything changes, doesn't it :) The technology must progress, and with it the complexity.
Raistmer
Volunteer developer
Volunteer tester
Message 990188 - Posted: 18 Apr 2010, 18:34:48 UTC

BTW, for now it would be interesting to set the coproc count to 0.5 and see how the Fermi performs with two V13 apps at once.
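
For anyone trying this, a sketch of the anonymous-platform app_info.xml fragment involved. The version number and filename below are placeholders, but <count> inside <coproc> is the standard BOINC field for fractional GPU usage:

<app_version>
    <app_name>setiathome_enhanced</app_name>
    <version_num>608</version_num>              <!-- placeholder version -->
    <coproc>
        <type>CUDA</type>
        <count>0.5</count>  <!-- half a GPU per task: two tasks share one GPU -->
    </coproc>
    <file_ref>
        <file_name>MB_V13_mod.exe</file_name>   <!-- placeholder filename -->
        <main_program/>
    </file_ref>
</app_version>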
jason_gee
Volunteer developer
Volunteer tester
Message 990190 - Posted: 18 Apr 2010, 18:36:55 UTC

LoL, let it run for a while, see if it actually works, then fiddle after ;)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 990190 · Report as offensive
Todd Hebert
Volunteer tester
Message 990191 - Posted: 18 Apr 2010, 18:41:29 UTC

Let me know if you would like it changed - it only takes a second.
Raistmer
Volunteer developer
Volunteer tester
Message 990194 - Posted: 18 Apr 2010, 18:47:53 UTC - in response to Message 990190.  

LoL, let it run for a while, see if it actually works, then fiddle after ;)

Yeah, a day or two in the current mode to check stability and baseline performance, then tweaking :)
jason_gee
Volunteer developer
Volunteer tester
Message 990197 - Posted: 18 Apr 2010, 18:50:48 UTC - in response to Message 990191.  
Last modified: 18 Apr 2010, 18:51:50 UTC

That's up to you (it's your machine ;) ). A smattering of different angle ranges might paint some sort of picture we could analyse with the current setup. I suggest a day as is, then trying 0.5 (then, if that works, maybe 0.25 ...). That way we might be able to determine where any change occurred & gauge its size (if any).

I've already figured out that the high CPU usage is due to the PTX JIT compiler in the driver being used ... embedding Fermi-native kernels in future builds will reduce that time.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 990197 · Report as offensive
Todd Hebert
Volunteer tester
Message 990205 - Posted: 18 Apr 2010, 18:59:39 UTC

I think for the moment I will leave it as is and make a change later tonight. So for the next 6-8 hours it will stay the same, to maintain stability.

I can tell you this much - the fans on these cards are LOUD when running at 100% - not something I would want to sit next to all day. And my ears are tempered from working in server rooms.
Raistmer
Volunteer developer
Volunteer tester
Message 990235 - Posted: 18 Apr 2010, 19:51:14 UTC - in response to Message 990197.  


I've already figured out that the high CPU usage is due to the PTX JIT compiler in the driver being used ... embedding Fermi-native kernels in future builds will reduce that time.

When I built V12b with CUDA 3.0, two targets were generated, for compute capabilities 1.0 and 2.0. So it seems the JIT compiler is not used for Fermi, at least when one builds with the provided build rule file.
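
For reference, one plausible shape for such a build rule under CUDA 3.0: embed native code for compute 1.0 and 2.0 devices, plus compute 2.0 PTX as a JIT fallback for future architectures. The exact flags in the real rule file aren't shown here, and the source filename is a placeholder:

nvcc -O3 \
     -gencode=arch=compute_10,code=sm_10 \
     -gencode=arch=compute_20,code=sm_20 \
     -gencode=arch=compute_20,code=compute_20 \
     -c cuda_kernels.cu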
W-H-Oami
Message 990236 - Posted: 18 Apr 2010, 19:53:19 UTC - in response to Message 990181.  
Last modified: 18 Apr 2010, 19:56:54 UTC

'Probably', but the overhead is unknown at this stage .. theoretically up to 16, though there may not be enough RAM for that, and the overheads may be too high.


Perhaps we should ask NVIDIA to design a Fermi with 256 or 512 KB of RAM per SM.
Todd Hebert
Volunteer tester
Message 990254 - Posted: 18 Apr 2010, 21:18:41 UTC

Wow! That would be expensive with 512 cores per GPU, and the target market would be very limited.