CUDA and the BLUE SCREEN OF DEATH

Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 852081 - Posted: 11 Jan 2009, 5:41:34 UTC - in response to Message 852019.  

I distinctly disagree with any suggestion that CUDA should be paired with CUDA now; that would put too many doubtful results in the master science database. Once the bugs are fixed I doubt many will care what pairings happen, but more flexibility would at least make sense.
Joe

Another point: if CUDA is giving so many "doubtful" results, as you said, why is it being used in mainstream crunching and not back in beta testing, where it should remain until it works? Why can't I also crunch SETI on my TI-84 calculator, seeing as the quality of results obviously doesn't matter to SETI?

By "doubtful" I meant that the bugs have not yet been fully characterized; many of the CUDA results do validate against CPU results, but a significant fraction don't. Having comparisons with CPU results provides both the validation needed to keep the science database reasonably clean and the data to determine which bugs need to be squashed. What constitutes "too many" is of course open to judgement, but it's clearly better to have 3% of 3% uncertainty rather than the straight 3% which would come from always pairing CUDA with CUDA.
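
To make the arithmetic in the paragraph above concrete, here is a minimal sketch under one possible reading of it. It assumes, purely for illustration, that about 3% of CUDA results are doubtful (the figure quoted above, not a measured rate), that a doubtful result can only slip past validation when it is paired with another CUDA result, and that only a small share of pairings end up CUDA with CUDA. The function name `doubtful_fraction` and the 3% pairing share are hypothetical.

```python
def doubtful_fraction(cuda_pair_share, doubtful_rate):
    """Fraction of validated results that stay doubtful under the assumed model.

    cuda_pair_share: assumed share of work units whose two results both
                     come from CUDA hosts (1.0 means always CUDA with CUDA).
    doubtful_rate:   assumed share of CUDA results that are doubtful.
    """
    return cuda_pair_share * doubtful_rate


# Always pairing CUDA with CUDA: every pair can hide a doubtful result.
always_cuda_pairs = doubtful_fraction(1.0, 0.03)

# Mixed pairing (assumed 3% CUDA-with-CUDA pairs): "3% of 3%".
mixed_pairs = doubtful_fraction(0.03, 0.03)

print(f"always CUDA+CUDA: {always_cuda_pairs:.2%}")
print(f"mixed pairing:    {mixed_pairs:.2%}")
```

Under these assumptions the mixed scheme leaves roughly 0.09% of results doubtful, against 3% when CUDA is always paired with CUDA.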

As a client-side participant, I never expected to be able to run the project and can't even begin to answer "why" questions. I simply don't know enough. I did express my doubts that having the CUDA app here was sensible immediately after it appeared, but I don't think any amount of complaining by the small minority of participants who frequent these forums is going to affect the project's decisions.

I'm sure the project would be glad to have an application for the TI-84 if you think you can produce one which will meet deadlines. Almost any hand calculator has sufficient accuracy for the calculations; it's primarily a matter of handling large amounts of data quickly enough. If you could even come close, the project might well be willing to produce smaller WUs more suited to the hardware. When the Boincoid programmers inquired about reduced S@H WUs for their Java/Android version, which may run on some cell phones, Eric asked how much reduction they thought would be appropriate.
Joe
ID: 852081
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester

Joined: 9 Jul 00
Posts: 51470
Credit: 1,018,363,574
RAC: 1,004
United States
Message 852105 - Posted: 11 Jan 2009, 6:25:40 UTC - in response to Message 852081.  

Geez.........

I can't get reproducible results fast enough on my slide rule to make the date..........
Will SETI slow down enough for me to fit in???
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 852105
-= Vyper =-
Volunteer tester

Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 852143 - Posted: 11 Jan 2009, 9:47:52 UTC
Last modified: 11 Jan 2009, 9:50:45 UTC

Well, I tried to get the most out of the system with both CPU and CUDA running together, and yesterday it BSODed on me. And what do you know, my profile got erased, so Vista automatically made me a new one with the files that I had on the desktop.

Other things got damaged too, for instance the c:\programdata\boinc folder, so when I got back into the system my client detected a faulty client_state file and automatically detached everything.

With Shadow Copy I managed to restore the file and am now emptying my cache just as it was before, but I get occasional red messages saying that the result I just crunched had already been returned. If I crunch 100 results and send them in I don't get any errors, but I can't see whether the server changes the line "detached" to the date it received the file instead.

I just wonder: do the S@H servers know when I restore files? Now that I have everything back and am crunching and sending in results, will those results get validated later on, EVEN if more than a thousand lines say "client detached", and will I be awarded the credit later on?

Does anyone know how that works?

This is my very first "Darn CUDA" complaint that I need to point out.

Kind regards Vyper

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 852143
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 852196 - Posted: 11 Jan 2009, 13:22:23 UTC - in response to Message 852143.  
Last modified: 11 Jan 2009, 13:25:58 UTC

Nope, once the project says 'Client Detached' they are dead for you.

Since you said you like to get everything there is to be had from your host, I'd detach/reattach again for real to get rid of the clinkers. This will clear out the SAH directory, so you'd have to reinstall your opti apps, custom app_info file, etc as well.

If there aren't that many, you can go through them manually and abort the ones which are showing as detached on the project.

Alinator
ID: 852196
MarkJ (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester

Joined: 17 Feb 08
Posts: 1139
Credit: 80,854,192
RAC: 5
Australia
Message 852612 - Posted: 12 Jan 2009, 9:54:07 UTC - in response to Message 851632.  

but if you don't ask you don't get. It's a good idea, so I'll put the suggestion in and we will see.


Added Trac ticket 822 with the suggestion.


Response to the request for a throttling mechanism from the "Nvidia developer" (whoever that is).

Can the GPU be throttled in another way? (BOINC uses a pause system to throttle CPU calculations, if set by the user. It then pauses all of BOINC for the duration of a second or more. I'm wondering what effect that has on the GPU's lifespan.) In other words, can the GPU be set to use only half its capacity (50% comparable CPU cycles) or not?

I don't know of any supervisory way to limit a CUDA app to a percentage of the GPU throughput. You could always throttle the CPU thread that feeds the GPU to effectively limit the GPU.

Ah, but the problem here is that it uses so little CPU already. What does it take, 3 to 4% of the CPU? And then it only uses the CPU when data is transferred from the GPU's memory to disk and from the disk to the GPU's memory. The rest is done solely by the GPU.

The CPU usage is fairly small in SETI, but that’s only because we sleep in the driver waiting for the GPU to complete its current task. Programmatically pausing the CPU thread that feeds the CUDA kernel functions will effectively reduce the GPU usage rate, because you’re starving the GPU of data to crunch. The downside is that it will slow the app down.

But the drivers, in combination with the VBIOS, should already monitor temperatures and regulate the fan and GPU clocks accordingly. This may not work on a deliberately overclocked GPU.
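
The throttling approach described in the reply above, pausing the CPU thread that feeds the GPU so that the GPU is starved of work part of the time, can be sketched as a simple duty-cycle loop. This is an illustration only, not code from the SETI@home CUDA app or from BOINC: `run_throttled` and `submit_and_wait` are hypothetical names, and `submit_and_wait` stands in for whatever call launches a batch of GPU work and blocks until it completes.

```python
import time


def run_throttled(submit_and_wait, batches, duty_cycle,
                  clock=time.monotonic, sleep=time.sleep):
    """Run `batches` units of work, idling between them.

    After each batch the feeding thread sleeps long enough that the busy
    time is roughly `duty_cycle` of the total wall time (0 < duty_cycle <= 1).
    Returns (busy_seconds, idle_seconds).
    """
    if not 0.0 < duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be in (0, 1]")

    busy = 0.0
    idle = 0.0
    for _ in range(batches):
        start = clock()
        submit_and_wait()  # hypothetical: launch GPU work, block until done
        elapsed = clock() - start
        busy += elapsed

        # Idle in proportion to the time the batch took, so that
        # busy / (busy + idle) is approximately duty_cycle.
        pause = elapsed * (1.0 - duty_cycle) / duty_cycle
        if pause > 0.0:
            sleep(pause)
            idle += pause
    return busy, idle
```

With `duty_cycle=0.5` the thread sleeps for as long as each batch took, so the GPU is busy roughly half the time and, as noted above, the app takes correspondingly longer to finish.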

BOINC blog
ID: 852612
Francois Piednoel

Joined: 14 Jun 00
Posts: 898
Credit: 5,969,361
RAC: 0
United States
Message 853622 - Posted: 15 Jan 2009, 2:46:20 UTC

I expect more and more chips to go bye-bye; they are not designed to run for so long. See Charlie's articles in The Inquirer about this.


This is my personal opinion.

Who?
ID: 853622
Raistmer
Volunteer developer
Volunteer tester

Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 853649 - Posted: 15 Jan 2009, 4:24:36 UTC - in response to Message 853622.  

The current BSoD issues are in general not connected with GPU overheating or any other hardware trouble. It's just imperfect OS-driver interaction; that is, a software issue, not a hardware one.
I still haven't seen any reports of hardware failures.
Who? is just spreading blame and panic (as usual :P )
ID: 853649
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 853653 - Posted: 15 Jan 2009, 4:33:12 UTC - in response to Message 853649.  

Hmmm...

Perhaps you should review the literature some more. In this case, Francois' observation about the hardware difficulties of nVidia's current GPU family is well taken (IMHO).

Of course, that doesn't mean some of the troubles we are seeing here on SAH are a direct result of that, but it sure can't be helping.

Alinator
ID: 853653
Raistmer
Volunteer developer
Volunteer tester

Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 853655 - Posted: 15 Jan 2009, 4:37:30 UTC - in response to Message 853653.  

Sure, hardware failures are possible, but most of them will come from weak or poor fans on cheap boards.
And I stress, it's too early to say that "more chip failures will come". There can't be a second when there hasn't been a first yet.
ID: 853655
Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 853656 - Posted: 15 Jan 2009, 4:39:00 UTC


Links please! :-)

...where can I read this about CUDA GPUs?

ID: 853656
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 853658 - Posted: 15 Jan 2009, 4:44:38 UTC - in response to Message 853655.  
Last modified: 15 Jan 2009, 4:46:57 UTC

Agreed, to a point. However one big difference here is the form factor you are trying to cram that 'heat factory' into. A graphics card is a far more constrained environment to be pumping 100 watts plus steady state into (a lot plus in some cases).

One only has to look at how well some of the ultra-thin laptops handle having the hammer dropped when running BOINC flat out to see a similar effect for CPUs. Not all of them were what I would call 'cheap'.

@ Sutaru: Just look for Charlie Demerjian's articles in 'The Inquirer'. You can find more if you root around a bit.

Alinator
ID: 853658
Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 853659 - Posted: 15 Jan 2009, 4:44:45 UTC - in response to Message 853655.  
Last modified: 15 Jan 2009, 4:48:45 UTC

I don't want to advertise for any particular manufacturer...

For example, Zotac overclocks its GPUs in the AMP! series.
Maybe it's not recommended to buy an OCed GPU for CUDA crunching?

Hey, to my knowledge they have a 5-year warranty... ;-D
...if a GPU can't calculate well with CUDA, is that a warranty issue?
ID: 853659
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 853662 - Posted: 15 Jan 2009, 4:54:11 UTC - in response to Message 853659.  
Last modified: 15 Jan 2009, 4:54:41 UTC



Point well taken there. I'm sure the problem is dependent on the particular card vendor to some extent.

However, the takeaway from a lot of the articles is that this is really new ground the GPU manufacturers are moving into. Perhaps this is a case where the ATi-AMD merger has a real positive benefit, since the kinds of hot-spot thermal management issues mentioned are something CPU manufacturers have had to deal with from day one out of necessity.

Alinator
ID: 853662
Edboard
Volunteer tester

Joined: 4 Jun 08
Posts: 9
Credit: 1,043,577
RAC: 0
Spain
Message 853814 - Posted: 15 Jan 2009, 16:11:20 UTC
Last modified: 15 Jan 2009, 16:12:44 UTC

I have been trying SETI@home CUDA units on two PCs (each with a non-overclocked GTX 280), and almost every day I get a reset or a video driver failure on some unit. The temperatures are relatively low (compared with the temperatures I get crunching Folding and GPUGRID, which are higher and cause no problems). I think I'll wait until the units delivered are safer to crunch.
ID: 853814
perryjay
Volunteer tester

Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 853818 - Posted: 15 Jan 2009, 16:21:51 UTC - in response to Message 853814.  
Last modified: 15 Jan 2009, 16:22:25 UTC

Raistmer has released a new version that he says fixes the VLAR bug, if you would like to try it.

http://setiathome.berkeley.edu/forum_thread.php?id=50829


PROUD MEMBER OF Team Starfire World BOINC
ID: 853818
Edboard
Volunteer tester

Joined: 4 Jun 08
Posts: 9
Credit: 1,043,577
RAC: 0
Spain
Message 853847 - Posted: 15 Jan 2009, 17:31:45 UTC - in response to Message 853818.  

Installed V5 opt. One of my PCs has just done a reset. It seems that it doesn't work for me.
ID: 853847
Richard Haselgrove (Project Donor)
Volunteer tester

Joined: 4 Jul 99
Posts: 14661
Credit: 200,643,578
RAC: 874
United Kingdom
Message 853854 - Posted: 15 Jan 2009, 17:44:51 UTC - in response to Message 853847.  

Installed V5 opt. One of my PCs has just done a reset. It seems that it doesn't work for me.

Details please? Especially model of CUDA card installed, and operating system version?
ID: 853854
perryjay
Volunteer tester

Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 853869 - Posted: 15 Jan 2009, 18:08:28 UTC - in response to Message 853847.  

The version I linked is version 6. It has been vastly improved over version 5.


PROUD MEMBER OF Team Starfire World BOINC
ID: 853869
Edboard
Volunteer tester

Joined: 4 Jun 08
Posts: 9
Credit: 1,043,577
RAC: 0
Spain
Message 853884 - Posted: 15 Jan 2009, 18:48:47 UTC - in response to Message 853854.  

Both PCs have a GTX 280 running under 32-bit Windows Vista Home Premium.
ID: 853884
Richard Haselgrove (Project Donor)
Volunteer tester

Joined: 4 Jul 99
Posts: 14661
Credit: 200,643,578
RAC: 874
United Kingdom
Message 853888 - Posted: 15 Jan 2009, 18:59:40 UTC - in response to Message 853884.  

Both PCs have a GTX 280 running under 32-bit Windows Vista Home Premium.

Have a try of the v6 then. That build (as yet relatively untested) does at least run through to the end of the difficult tasks. Be prepared for long-ish running times for some tasks, and perhaps more screen-lag while they run, but the latest bug-fix has been deliberately designed to avoid the driver reset problem.
ID: 853888

 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.