@Pre-FERMI nVidia GPU users: Important warning

Profile Wiggo
Avatar

Send message
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1579037 - Posted: 28 Sep 2014, 13:02:31 UTC

Doctor's prescription, take at least 2 beers and chill out before you give yourself a heart attack or stroke. ;-)

Cheers.
ID: 1579037 · Report as offensive
Profile Fred J. Verster
Volunteer tester
Avatar

Send message
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1579040 - Posted: 28 Sep 2014, 13:14:03 UTC - in response to Message 1579037.  

Doctor's prescription, take at least 2 beers and chill out before you give yourself a heart attack or stroke. ;-)

Cheers.


I can only agree with this prescription; perhaps an Ativan would do the trick, but I'm not a physician. ;^)
ID: 1579040 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1579043 - Posted: 28 Sep 2014, 13:45:41 UTC - in response to Message 1579040.  
Last modified: 28 Sep 2014, 13:49:56 UTC

Doctor's prescription, take at least 2 beers and chill out before you give yourself a heart attack or stroke. ;-)

Cheers.


I can only agree with this prescription; perhaps an Ativan would do the trick, but I'm not a physician. ;^)



FWIW, numbers sometimes help too. The current top-host Astropulse v6 inconclusive-to-pending ratio (a holistic indicator of host, app and project health) is currently ~4.9%, which is about twice as good as, or better than, it used to be (well over ~10%). I'd guess this apparently low impact is partly because a lower proportion of pre-Fermis tend to run AP anyway, along with other factors like app improvements and better floating-point support in the newer cards that remain. Not forgetting that pre-Fermi throughput is a lot lower to start with.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1579043 · Report as offensive
Profile Jord
Volunteer tester
Avatar

Send message
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1579053 - Posted: 28 Sep 2014, 14:11:03 UTC - in response to Message 1579025.  

Attention, if you have a nVidia card that's 4 years old or older, and have updated to Driver 340.xx

People usually do know what kind of card they have, or can look that up. So then you point them to https://developer.nvidia.com/cuda-gpus. Any GPU with compute capability 1.0, 1.1, 1.2 or 1.3 will be affected.
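
For anyone who'd rather query the card than look it up in a list, a minimal standalone sketch using the CUDA runtime API will do (this assumes the CUDA toolkit is installed and the file is built with nvcc; it is not part of any SETI app):

// check_cc.cu -- list CUDA devices and flag pre-Fermi (compute capability < 2.0) ones.
// Build with: nvcc check_cc.cu -o check_cc
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "No CUDA-capable devices found.\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        bool preFermi = prop.major < 2;  // compute capability 1.0 - 1.3
        printf("Device %d: %s, compute capability %d.%d%s\n",
               i, prop.name, prop.major, prop.minor,
               preFermi ? "  <-- pre-Fermi, affected by the 340.xx issue" : "");
    }
    return 0;
}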
ID: 1579053 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1579054 - Posted: 28 Sep 2014, 14:14:06 UTC - in response to Message 1579043.  
Last modified: 28 Sep 2014, 14:56:39 UTC

Doctor's prescription, take at least 2 beers and chill out before you give yourself a heart attack or stroke. ;-)

Cheers.


I can only agree with this prescription; perhaps an Ativan would do the trick, but I'm not a physician. ;^)



FWIW, numbers sometimes help too. The current top-host Astropulse v6 inconclusive-to-pending ratio (a holistic indicator of host, app and project health) is currently ~4.9%, which is about twice as good as, or better than, it used to be (well over ~10%). I'd guess this apparently low impact is partly because a lower proportion of pre-Fermis tend to run AP anyway, along with other factors like app improvements and better floating-point support in the newer cards that remain. Not forgetting that pre-Fermi throughput is a lot lower to start with.

I suppose I don't need to remind you of what the science community thinks about ignoring known faulty data. I just looked at some of my inconclusives, and this one stands out: Validation inconclusive (105). That's one host; here's Another. The driver was only released a few weeks ago, and the numbers are multiplying by the day...
ID: 1579054 · Report as offensive
Profile HAL9000
Volunteer tester
Avatar

Send message
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1579059 - Posted: 28 Sep 2014, 14:25:28 UTC - in response to Message 1579053.  

Attention, if you have a nVidia card that's 4 years old or older, and have updated to Driver 340.xx

People usually do know what kind of card they have, or can look that up. So then you point them to https://developer.nvidia.com/cuda-gpus. Any GPU with compute capability 1.0, 1.1, 1.2 or 1.3 will be affected.

More importantly, most users never even look at the message boards.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1579059 · Report as offensive
Profile Jord
Volunteer tester
Avatar

Send message
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1579071 - Posted: 28 Sep 2014, 14:56:36 UTC - in response to Message 1579059.  

More importantly, most users never even look at the message boards.

Maybe they don't read these boards, and maybe they won't ask for help here, but I have found people asking for help in the weirdest of locations. You just need to get some info out there; once Google/Bing has picked it up, others in those weird locations will be able to find it and help those people out.

The info I just posted, I posted on the first of September on the BOINC forums. It goes to show that people here don't look there either, or make use of a search engine to look up information. Of course, all you then need to know is which CUDA version we're talking about, in this case 6.5. So go on, type "CUDA 6.5 BOINC" without quotes into your preferred search engine...

Even Goodsearch has the thread as the third result. The only thing that may throw you is that it's a thread about Mac OS X and CUDA 6.5, but just consider that Nvidia's drivers are essentially the same for whichever operating system is out there.
ID: 1579071 · Report as offensive
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Avatar

Send message
Joined: 26 May 99
Posts: 9954
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1579100 - Posted: 28 Sep 2014, 16:13:15 UTC

There have been complaints of personal attacks and off topic posts here.

To hide all the posts concerned would rob the thread of useful info as the posts have been "quoted" a lot.

Please can I ask for a bit more control? This is Number Crunching, not Politics, and this is a serious topic.

Thanks.
ID: 1579100 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1579101 - Posted: 28 Sep 2014, 16:14:23 UTC - in response to Message 1579054.  
Last modified: 28 Sep 2014, 16:20:43 UTC


I suppose I don't need to remind you of what the science community thinks about ignoring known faulty data. I just looked at some of my inconclusives, and this one stands out: Validation inconclusive (105). That's one host; here's Another. The driver was only released a few weeks ago, and the numbers are multiplying by the day...


@TBar
First of all, thanks for attempting to attract attention to this issue. And I can assure you that this issue is not being ignored by the "scientific community"; just keep in mind that donating computation resources to the project does not automatically imply having the right training and views regarding how that donated time should be used and how results should be verified (so there's no need to take too much to heart the reaction to this issue from some other participants, who express exclusively their own point of view).

FYI, the issue has been reported to nVidia's bug tracking system, a test case has been supplied, and the issue has been confirmed/reproduced by nVidia specialists. A fix is in progress, I hope.
ID: 1579101 · Report as offensive
Profile Jeff Buck · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1579135 - Posted: 28 Sep 2014, 18:19:33 UTC - in response to Message 1579011.  

If the task doesn't have any single pulses it will validate. I've even seen cases where the WingPerson found 1 single pulse and the affected card that didn't find the single pulse still validated.

I was just looking at some of the Valid AP tasks for host 7339909, which you referenced in Message 1579054. I do see several where his single pulse count of 0 validated against other non-zero single pulse counts. In one case, the other hosts actually found 9. Fortunately, what I've also seen is that the canonical result always seems to go to one of the hosts with the non-zero count, even if 7339909's was the _0 task. So, even if the offending host does get credit, its results aren't actually getting into the science database. On the other hand, I would expect that there are cases where two of the old-card, new-driver hosts validate against each other (much like those ATI hosts with the 30 Autocorr overflows that irritate me); their result will then end up in the science database, without even an opportunity for another host to crunch the WU and possibly report a non-zero result.
ID: 1579135 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Send message
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1579211 - Posted: 28 Sep 2014, 23:47:00 UTC
Last modified: 28 Sep 2014, 23:49:58 UTC

A feature added to the AP Validator in late 2003 makes it ignore single pulses which aren't at least 1% above threshold (THRESHOLD_FUDGE is set to 1.01). That was obviously an attempt to avoid the old problem of a signal very close to threshold being reported by one host but not the wingmate, even though the calculations produced very nearly equal values. The implementation simply ignores those lower signals completely, however, so in effect it moved the critical-level problem server-side.

In relation to the failure to find/report any single pulses being discussed here, it means that for some of the good results with single pulses reported there wouldn't be any single pulse comparisons anyhow, so there's no way to eliminate the results with no single pulses. That's an additional way that the faulty results may become canonical.

I worked up an improvement to that logic which provides a one-way comparison if only one of the signal reports is above the fudge level, and reduces that level to 1.001 (just enough to match the allowed tolerance for peak_power). Eric intends to try it out at Beta during the upcoming work week. He also has in mind a statistical method of checking for significant differences between reported signals; that would be a further improvement if/when he actually has time to code it. The shoestring budget tends to delay such things.
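
As a rough sketch of the logic involved (not the actual validator source, just the behaviour described above):

// Sketch only -- not the real AP validator code.
// 'threshold' is the single-pulse detection threshold for the scale in question.
const double FUDGE_OLD = 1.01;    // current server-side THRESHOLD_FUDGE
const double FUDGE_NEW = 1.001;   // proposed value, matching the peak_power tolerance

// Current logic: a reported single pulse is dropped from comparison entirely
// unless it clears threshold * 1.01, so two results can "agree" even when one
// of them reported no single pulses at all.
bool counted_for_comparison_old(double peak_power, double threshold)
{
    return peak_power >= threshold * FUDGE_OLD;
}

// Proposed logic: if at least one side clears threshold * 1.001, a one-way
// comparison is still made, so a result that silently missed the pulse can
// no longer slip through as a match.
bool comparison_made_new(double peak_a, double peak_b, double threshold)
{
    const double cutoff = threshold * FUDGE_NEW;
    return peak_a >= cutoff || peak_b >= cutoff;
}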

Meanwhile, recent Lunatics builds (including some being used as stock) provide enough detail about the reported signals to usually judge which single pulses are included in validation and which aren't. Here's a little table I made for my own quick reference when looking at the peak_power of single pulses in those stderr sections:

          Single pulse    fudge          stderr
           threshold     (thresh*1.01)   critical
          ------------  --------------  --------
scale=0    29.107864     29.39894264     29.4
scale=1    31.568300     31.88398300     31.88
scale=2    37.925224     38.30447624     38.3
scale=3    49.470177     49.96487877     49.96
scale=4    61.368561     61.98224661     61.98
scale=5    86.990051     87.85995151     87.86
scale=6   128.857971    130.14655071    130.1
scale=7   212.633087    214.75941787    214.8
scale=8   361.960083    365.57968383    365.6
scale=9   648.259888    654.74248688    654.7

For now, those peak_power values in stderr are rounded to 4 significant digits, which leaves uncertainty when the value is at the level shown in the "stderr critical" column. Raistmer has checked in changes so future Lunatics builds will show those more precisely.
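
A quick sketch of why that rounding is ambiguous right at the critical level (the %.4g formatting is only an assumption about how the 4-digit rounding is done):

#include <cstdio>

int main()
{
    const double threshold = 29.107864;        // scale=0 single-pulse threshold (table above)
    const double cutoff    = threshold * 1.01; // 29.39894264, the validator's fudged level

    // Two peak_power values straddling the cutoff both display as "29.4"
    // when rounded to 4 significant digits.
    printf("%.4g\n", 29.395);                  // just below the cutoff -> 29.4
    printf("%.4g\n", 29.402);                  // just above the cutoff -> 29.4
    printf("cutoff = %.8f\n", cutoff);
    return 0;
}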
                                                                  Joe
ID: 1579211 · Report as offensive
Profile Wiggo
Avatar

Send message
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1579246 - Posted: 29 Sep 2014, 2:55:31 UTC

You also can't just say "400 series" cards are fine either, as I know of at least one card in that series that is still based on the older GTxxx core: the GT405, which is found in a lot of cheap slimline mATX OEM PCs from that era. ;-)

Cheers.
ID: 1579246 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1579322 - Posted: 29 Sep 2014, 9:12:05 UTC - in response to Message 1579135.  

If the task doesn't have any single pulses it will validate. I've even seen cases where the WingPerson found 1 single pulse and the affected card that didn't find the single pulse still validated.

I was just looking at some of the Valid AP tasks for host 7339909, which you referenced in Message 1579054. I do see several where his single pulse count of 0 validated against other non-zero single pulse counts. In one case, the other hosts actually found 9. Fortunately, what I've also seen is that the canonical result always seems to go to one of the hosts with the non-zero count, even if 7339909's was the _0 task. So, even if the offending host does get credit, its results aren't actually getting into the science database. On the other hand, I would expect that there are cases where two of the old-card, new-driver hosts validate against each other (much like those ATI hosts with the 30 Autocorr overflows that irritate me); their result will then end up in the science database, without even an opportunity for another host to crunch the WU and possibly report a non-zero result.

It appears Host 7339909 has dropped back to Driver 337.88. Now his cards are once again finding single pulses: http://setiathome.berkeley.edu/result.php?resultid=3755432810 One down, how many to go?

I'm surprised there still isn't a sticky post warning nVidia owners about driver 340.xx. If people are not informed about the problem, they won't have any reason not to update their driver. A sticky post seems the least SETI could do...
ID: 1579322 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1580975 - Posted: 2 Oct 2014, 15:37:22 UTC

Unfortunately, this issue can only be completely solved by a total ban on the 340.52 driver, for all types of NV GPUs.
This has to be done because BOINC can't differentiate between FERMI and non-FERMI GPUs client-side once a task has been received. Hence hosts with mixed NV GPUs can receive a task under the FERMI plan class and then produce an invalid result on a pre-FERMI GPU.

Sad, but it's a BOINC design flaw.
ID: 1580975 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1582444 - Posted: 6 Oct 2014, 7:53:14 UTC
Last modified: 6 Oct 2014, 8:02:12 UTC

FYI: nVidia has refused to fix this issue.
Draw your own conclusions about the future of OpenCL on this platform, and about the future of pre-FERMI cards as a whole...

EDIT: Please make this thread sticky, because it's the only way to help anonymous-platform users with this issue.
ID: 1582444 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1582490 - Posted: 6 Oct 2014, 11:42:26 UTC
Last modified: 6 Oct 2014, 11:52:09 UTC

Not sure this will help in this situation, but the way I've handled pre-Fermis being unsupported in Cuda 6.5 altogether is to skip devices with compute capability < 2.0 (Fermi) during device enumeration; the project can then restrict the device minimum at its leisure later. Since you use a dedicated plan class for NV OpenCL, you can link in a Cuda driver API call and compare the compute capability and driver version. Non-ideal, as is the only slightly less complicated Cuda situation with such devices, but probably better than relying on users not to update to the transitional broken driver, or than waiting to figure out a more ideal solution once the full picture is clearer.

#if CUDART_VERSION >= 6050
    // Check the supported major revision to ensure it's valid and not some pre-Fermi
    if (cDevProp[i].major < 2)
    {
        fprintf(stderr, "setiathome_CUDA: device %d is Pre-Fermi CUDA 2.x compute compatibility, only has %d.%d\n",
            i+1, cDevProp[i].major, cDevProp[i].minor);
        continue; // Skips initialising this device....
    }
#else
...


Having no usable device at all would fall through multiple temporary-exit retries (further on in the surrounding logic), and eventually hard-error when BOINC decides enough is enough (fingers crossed, anyway). Cuda initialisation is done that convoluted way because devices may disappear and come back depending on user switching, so the interaction with temporary exits is complex.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1582490 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1582496 - Posted: 6 Oct 2014, 11:51:36 UTC - in response to Message 1582490.  

Agreed, I will block all such devices from the next release (this can perhaps be done via the OpenCL runtime itself). But because BOINC now assigns the execution device, just not enumerating them internally seems not enough; boinc_temporary_exit with a user notification about the reason for the exit would perhaps be safer.
Or the logic could be more complex: don't enumerate 1.x devices when a 2.0 device is present (there is something to run on, though overcommitted), and exit cleanly when only 1.x devices are available (nothing good to run on at all).
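
Something along these lines, perhaps; only a sketch, and it assumes the cl_nv_device_attribute_query extension constant and a boinc_temporary_exit(delay, reason) call from the BOINC API of that era, not code from any actual release:

#include <cstdio>
#include <vector>
#include <CL/cl.h>
#include "boinc_api.h"

#ifndef CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV
#define CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV 0x4000
#endif

// Return only NVIDIA GPU devices with compute capability >= 2.0 (Fermi or later).
std::vector<cl_device_id> usable_nv_devices(cl_platform_id platform)
{
    cl_uint n = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &n);
    std::vector<cl_device_id> all(n), good;
    if (n) clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, n, all.data(), NULL);

    for (cl_uint i = 0; i < n; i++) {
        cl_uint cc_major = 0;
        // Fails harmlessly (cc_major stays 0) if the NV extension query isn't supported.
        clGetDeviceInfo(all[i], CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV,
                        sizeof(cc_major), &cc_major, NULL);
        if (cc_major >= 2) good.push_back(all[i]);
        else fprintf(stderr, "Skipping pre-Fermi device %u (CC %u.x)\n", i, cc_major);
    }
    return good;
}

// If nothing usable remains, back off instead of producing bad results.
void require_usable_device(const std::vector<cl_device_id>& good)
{
    if (good.empty())
        boinc_temporary_exit(300, "Only pre-Fermi NVIDIA GPUs found; "
                                  "driver 340.xx produces invalid results on them");
}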
ID: 1582496 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1582498 - Posted: 6 Oct 2014, 11:54:47 UTC - in response to Message 1582496.  
Last modified: 6 Oct 2014, 11:55:43 UTC

Agreed, I will block all such devices from the next release (this can perhaps be done via the OpenCL runtime itself). But because BOINC now assigns the execution device, just not enumerating them internally seems not enough; boinc_temporary_exit with a user notification about the reason for the exit would perhaps be safer.
Or the logic could be more complex: don't enumerate 1.x devices when a 2.0 device is present (there is something to run on, though overcommitted), and exit cleanly when only 1.x devices are available (nothing good to run on at all).


Yep, I was editing to add that: because of user switching, there is surrounding logic with temp exits. It's been working as desired here, so I avoided fiddling with it more than I had to in order to add the device skip to its complete enumeration (didn't want to break what appeared to be working).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1582498 · Report as offensive
David S
Volunteer tester
Avatar

Send message
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1582725 - Posted: 6 Oct 2014, 20:57:09 UTC - in response to Message 1579053.  

Attention, if you have a nVidia card that's 4 years old or older, and have updated to Driver 340.xx

People usually do know what kind of card they have, or can look that up. So then you point them to https://developer.nvidia.com/cuda-gpus. Any GPU with compute capability 1.0, 1.1, 1.2 or 1.3 will be affected.

I do not know, off the top of my head, whether my card is a Fermi or how old it is. I just barely have enough interest to look up my 440 on the given link. I am grateful that I don't need to, because TBar posted that big, bold statement that 400s are okay. It could easily be amended to note "(except the 405)" and would be a lot more useful to people who have less interest than I do.

Make it a News item so it will appear on the project home page. Maybe with the title "Certain older video cards produce bad science with latest Nvidia driver" and the first line (which will also show up on the home page) can say "300 series and earlier cards, plus others listed here, are affected." Then list cards known to be bad and give the link to look up others to see if they might be.
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1582725 · Report as offensive
Profile Borgholio
Avatar

Send message
Joined: 2 Aug 99
Posts: 654
Credit: 18,623,738
RAC: 45
United States
Message 1583466 - Posted: 8 Oct 2014, 14:33:05 UTC

I am running a 9800GT and the Nvidia 340.52 drivers. I noticed three AP tasks that quickly errored out when using the GPU. My normal S@H tasks seem to be working fine. So it seems I have the problem that is described in this thread.

Trouble is, I have never had anything but pain when it comes to downgrading video card drivers. I do not want to downgrade just for this. Is there a way to stop using the GPU for AP tasks while still allowing it to be used for S@H?
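
One possible route, assuming BOINC 7.x and that Astropulse's short application name in your client_state.xml is astropulse_v6 (worth double-checking), is an <exclude_gpu> entry in cc_config.xml, something like:

<cc_config>
  <options>
    <!-- Keep the GPU available for MultiBeam, but stop it from running Astropulse -->
    <exclude_gpu>
      <url>http://setiathome.berkeley.edu/</url>
      <type>NVIDIA</type>
      <app>astropulse_v6</app>
    </exclude_gpu>
  </options>
</cc_config>

Then have the client re-read its config files (or restart BOINC). Leaving out the <app> tag would instead pull the GPU out of SETI work entirely.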
You will be assimilated...bunghole!

ID: 1583466 · Report as offensive