Cuda is as Cuda does,,,,,,,,,,,,,,,,,

Message boards : Number crunching : Cuda is as Cuda does,,,,,,,,,,,,,,,,,


Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14677
Credit: 200,643,578
RAC: 874
United Kingdom
Message 851921 - Posted: 10 Jan 2009, 22:56:40 UTC - in response to Message 851909.  

When I do crunch with CUDA it's fine. The errors are me ****ing around. It is faster, though with the optimized AP apps things are clearly moving forward. Things will work in about 18 months.

Although all the points you have made, Mark, have been perfectly valid.

Duh............

I raise a stink............
The points I make are indeed valid..........
But I still get whacked by those who think they know it all.........you should see me PNs..........

I am about one tad from bowing out of here........and you know that takes a lot.........

Mark,

I'm not bashing CUDA precisely because I don't have all the facts.

From what I understand from Matt's post, CUDA is about three percent of all work returned.

If we assume that CUDA never validates against the CPU app, and that there is a 0.03 × 0.03 chance (0.0009, or 0.09%) that two CUDAs will pair up, then the odds of an invalid CUDA result hitting the database are about 1 in 1,000.

....

Ned,

You've quoted that single anecdotal snapshot of Matt's four or five times now. I think you're trying to build too powerful an argument from it - and making the elementary statistical error of assuming that the events are independent, and hence the probabilities can be multiplied.

We know that the ARs of WUs split are not random - heck, they come in long runs of similar AR, depending on the research project controlling Arecibo at the time of the recording. We know that power crunchers tend to collect their tasks in blocks of 20. And we have persistent, though largely anecdotal, reports that the CUDA app errors consistently on VLAR tasks, and then fails to recover - generating false -9 reports in huge numbers, very quickly, until the host is rebooted. Lots of scope for non-randomness there, so the probabilities fly out of the window.
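To put rough, purely illustrative numbers on that last point (none of these figures are project measurements): a CUDA host stuck in a false -9 loop "finishes" tasks almost instantly, so while the storm lasts CUDA hosts dominate the pool of hosts returning and re-requesting work, and the chance that both replicas of a WU land on CUDA hosts is nowhere near the naive product. A minimal sketch in Python:

```python
# Illustrative sketch only -- every number here is an assumption, not SETI@home data.
# A healthy host returns about 1 task/hour; a CUDA host in a -9 error storm burns
# through tasks in seconds. The storm makes CUDA hosts dominate the pool of hosts
# returning work, so the two replicas of a WU are not independent 3% draws.

CPU_HOSTS, CUDA_HOSTS = 97, 3   # assumed populations: 3% of hosts are CUDA
CPU_RATE = 1.0                  # assumed tasks per hour for a healthy host
STORM_RATE = 100.0              # assumed rate for a host erroring everything out

steady = CUDA_HOSTS * CPU_RATE / ((CUDA_HOSTS + CPU_HOSTS) * CPU_RATE)
storm = CUDA_HOSTS * STORM_RATE / (CUDA_HOSTS * STORM_RATE + CPU_HOSTS * CPU_RATE)

print(f"CUDA share of returned results, steady state: {steady:.2%}")    # 3.00%
print(f"CUDA share during a -9 error storm:           {storm:.2%}")     # ~75.6%
print(f"naive both-replicas-CUDA chance:              {steady**2:.2%}") # 0.09%
print(f"both-replicas-CUDA chance during the storm:   {storm**2:.2%}")  # ~57%
```

Under those assumptions the multiplication is off by orders of magnitude, which is exactly the independence problem.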

We can have no idea of the overall CUDA rate from Matt's "currently roughly" remark. I suspect some snapshots would show lower, others higher - but how much lower or higher, neither of us has any way of knowing. An alternative factoid would be the List of recently connected client types, currently showing 14.25% Windows v6.4.5 (and hence potentially CUDA-capable).

As it happens, I strongly agree with Mark that the CUDA release was a deeply flawed technical exercise. He has indicated in public that he has been told the behind-the-scenes reason for the release, but has been forbidden to repeat the information in public. I have heard much the same story in private from other people. It deeply saddens me that such secrecy has come about in an international, academic, scientific, publicly-funded (as in us, the public) research effort such as this.

I think that both the technical, and the public relations, aspects of the CUDA release have been extremely poor, and need urgent remedial action. It spoke volumes that no member of the SETI@home project staff was prepared to be named, or quoted, in NVidia's press release on SETI CUDA launch day.
ID: 851921 · Report as offensive
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 851935 - Posted: 10 Jan 2009, 23:55:59 UTC - in response to Message 851921.  


Ned,

You've quoted that single anecdotal snapshot of Matt's four or five times now. I think you're trying to build too powerful an argument from it - and making the elementary statistical error of assuming that the events are independent, and hence the probabilities can be multiplied.

We know that the ARs of WUs split are not random - heck, they come in long runs of similar AR, depending on the research project controlling Arecibo at the time of the recording. We know that power crunchers tend to collect their tasks in blocks of 20. And we have persistent, though largely anecdotal, reports that the CUDA app errors consistently on VLAR tasks, and then fails to recover - generating false -9 reports in huge numbers, very quickly, until the host is rebooted. Lots of scope for non-randomness there, so the probabilities fly out of the window.

We can have no idea of the overall CUDA rate from Matt's "currently roughly" remark. I suspect some snapshots would show lower, others higher - but how much lower or higher, neither of us has any way of knowing. An alternative factoid would be the List of recently connected client types, currently showing 14.25% Windows v6.4.5 (and hence potentially CUDA-capable).

As it happens, I strongly agree with Mark that the CUDA release was a deeply flawed technical exercise. He has indicated in public that he has been told the behind-the-scenes reason for the release, but has been forbidden to repeat the information in public. I have heard much the same story in private from other people. It deeply saddens me that such secrecy has come about in an international, academic, scientific, publicly-funded (as in us, the public) research effort such as this.

I think that both the technical, and the public relations, aspects of the CUDA release have been extremely poor, and need urgent remedial action. It spoke volumes that no member of the SETI@home project staff was prepared to be named, or quoted, in NVidia's press release on SETI CUDA launch day.

Actually, I'm trying to make the point that we don't have a good statistical basis for any argument, good or bad.

I'm not sure of the relevance of your angle range comment. I don't know (and I'm willing to be corrected) of anything that makes a work unit somehow "acceptable" or "eligible" for CUDA. From the complaints I've seen, it appears that a given work unit can be distributed to either CUDA or non-CUDA hosts.

Your observation that 14.25% are running BOINC 6.4.5 sets the upper boundary. We know that more than 85% of the clients are not running CUDA.

I'm running 6.4.5, but none of my machines have a CUDA capable card. The actual number of CUDA machines (right video card, correct driver, no APP_INFO.XML and right version of BOINC) could be anywhere from a bit more than 0% to 14.25%.

... and we know that some report that CUDA is slower, and others report that it's faster.

So, if any given result can go to either client, the ratio of CUDA to non-CUDA results is clearly a function of the processing speed and population of the two types. My best guess, with all the unknown variables, is somewhere between practically none and something over 30%.
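That relationship can be written down directly. A minimal sketch, assuming purely for illustration that every host crunches continuously and that a GPU is a fixed factor faster per task:

```python
# Sketch of the point above: the CUDA share of *returned results* depends on
# both the population split and the relative speed. All numbers are assumed
# for illustration, not measured.

def cuda_result_share(cuda_hosts: float, cpu_hosts: float, speedup: float) -> float:
    """Fraction of returned results from CUDA hosts, assuming continuous
    crunching and a CUDA host `speedup` times faster per task."""
    cuda_throughput = cuda_hosts * speedup
    return cuda_throughput / (cuda_throughput + cpu_hosts)

# If 3% of hosts were CUDA-capable and a GPU were, say, 2x as fast per task:
print(f"{cuda_result_share(3, 97, 2.0):.1%}")          # ~5.8% of results
# At the 14.25% upper bound with the same assumed speedup:
print(f"{cuda_result_share(14.25, 85.75, 2.0):.1%}")   # ~24.9% of results
```

Plug in different populations and speedups and the share swings across that whole "practically none to over 30%" range.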

You have to look at the actual returned results to see how many are CUDA and how many are non-CUDA. I assume that Matt can do that easily, and that that is how he got the 3%.

The next question is: how often does CUDA not validate?

The assumption seems to be that CUDA and CPU apps do not provide the same results -- that all of the work returned by CUDA is bad.

This one validated -- and I could not find one from this host that did not validate.

So, it seems that at least some of the CUDA work is valid, and validates against the CPU application.

... and that is all I'm trying to say. There are a great number of posts, and for the most part they're highly negative. The casual reader is going to see all of these complaints about CUDA, and get the impression that much of the work being done is invalid because CUDA is turning out trash at an alarming rate -- and that the Science suffers because CUDA validates against CUDA, the CPU app gets outvoted, and we're missing our extra-terrestrial "I Love Lucy" episodes.

All of the evidence we have is anecdotal. Except for Matt's 3% and your 14.25%, all of the other factors are pure guesses.

A quick, informal sampling says the odds of "bad" CUDA work are less than 100% (and by quite a bit, but my sample is too small to support any better conclusion).

Absent some good solid statistics, produced by sampling as many results as possible, all we have is anecdotal evidence that sometimes CUDA is flawed.
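For what it's worth, there is a standard way to quantify "too small": if you check n results and find zero invalid, the "rule of three" puts the 95% upper bound on the true invalid rate at roughly 3/n. A minimal sketch, with assumed sample sizes (no one has reported actual counts):

```python
# Rule of three: after 0 failures in n independent checks, the true failure
# rate is below about 3/n with 95% confidence. Sample sizes here are assumed.

def rule_of_three_upper_bound(n_checked: int) -> float:
    """Approximate 95% upper bound on a failure rate after 0 failures in n trials."""
    return 3.0 / n_checked

for n in (10, 50, 300):
    print(f"0 invalid out of {n:3d} checked -> invalid rate likely below "
          f"{rule_of_three_upper_bound(n):.1%}")
# Checking 10 results only rules out rates above 30%; it takes roughly 300
# clean results to claim the invalid rate is under 1%.
```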

-- Ned
ID: 851935 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14677
Credit: 200,643,578
RAC: 874
United Kingdom
Message 851947 - Posted: 11 Jan 2009, 0:20:31 UTC - in response to Message 851935.  

I'm not sure of the relevance of your angle range comment. I don't know (and I'm willing to be corrected) of anything that makes a work unit somehow "acceptable" or "eligible" for CUDA. From the complaints I've seen, it appears that a given work unit can be distributed to either CUDA or non-CUDA hosts.

It wasn't a question of acceptability or eligibility. You are quite right: any AR can be distributed to any app.

The question is what the app does with a task of a particular AR once it has received it and started processing. Up until now, every app has been capable of processing every AR. No longer: the CUDA app errors (crashes) at VLAR. That alone should render it unfit for main project distribution and relegate it to Beta.
ID: 851947 · Report as offensive
Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9958
Credit: 103,452,613
RAC: 328
United Kingdom
Message 851967 - Posted: 11 Jan 2009, 1:12:39 UTC

I have 3 machines ready to go on CUDA. I started 2 off, but the number of errors was just silly. After the outage last week, one machine ran through task after task in minutes with "compute error". Yes, I suspect they were VLAR tasks, but it should not be up to me to have to work out which tasks I can and can't crunch. I have had to "detach" both machines to get them to stop accepting CUDA. But at least they are returning VALID results without me having to babysit them.

The home page is still happily advertising CUDA to everyone, and so far I have not seen one official comment that there are ANY problems at all.

Overall I am very disappointed in the way this has been handled. I've been with this project almost since the beginning, but I feel let down now.

I now have one machine crunching full time for The Clean Energy Project; if this proves to be stable, then the others may follow.



ID: 851967 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 851975 - Posted: 11 Jan 2009, 1:33:03 UTC - in response to Message 851971.  
Last modified: 11 Jan 2009, 1:36:10 UTC

I happen to think that CUDA is the cat's meow.

I've found that other than having to abort VLAR WUs so as not to confuse my poor GPU, I've had no errors to speak of. True, aborting WUs is a drain on bandwidth, but the actual hit probably isn't much greater than someone losing a 10-day cache due to extreme OC'ing. I have faith in the science team's ability to keep the database free from errors. I find that my credit claims are somewhat higher than most CPU-only credit claims per WU. One positive result of all this is that more AP WUs are getting done.

YMMV.

Oh, save me........
I have seen more reports of Cuda gone wrong than anything else..........
The kittyman is NOT impressed..............
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 851975 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 851981 - Posted: 11 Jan 2009, 1:55:57 UTC - in response to Message 851980.  


Oh, save me........
I have seen more reports of Cuda gone wrong than anything else..........
The kittyman is NOT impressed..............


It is true that once you get CUDA working well, one is less inclined to laud its merits and post.

Geez, you may even have a frozen GTX 295 in your future.

Cheers.
LOL.....guidoman
I may beat you all at this game one day. I have my sources........

"Time is simply the mechanism that keeps everything from happening all at once."

ID: 851981 · Report as offensive
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 66284
Credit: 55,293,173
RAC: 49
United States
Message 851985 - Posted: 11 Jan 2009, 2:06:31 UTC - in response to Message 851981.  


Oh, save me........
I have seen more reports of Cuda gone wrong than anything else..........
The kittyman is NOT impressed..............


It is true that once you get CUDA working well, one is less inclined to laud its merits and post.

Geez, you may even have a frozen GTX 295 in your future.

Cheers.
LOL.....guidoman
I may beat you all at this game one day. I have my sources........

You would. ;)
Savoir-Faire is everywhere!
The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST

ID: 851985 · Report as offensive
Voyager
Volunteer tester
Joined: 2 Nov 99
Posts: 602
Credit: 3,264,813
RAC: 0
United States
Message 852007 - Posted: 11 Jan 2009, 3:07:39 UTC

Sources or sorcerers?
ID: 852007 · Report as offensive
SATAN
Joined: 27 Aug 06
Posts: 835
Credit: 2,129,006
RAC: 0
United Kingdom
Message 852012 - Posted: 11 Jan 2009, 3:16:06 UTC

Well, I now have 5 cores crunching on a 4-core machine. Not bad; let's see if any errors occur now that it's crunching 4 AP and CUDA together.
ID: 852012 · Report as offensive
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 66284
Credit: 55,293,173
RAC: 49
United States
Message 852032 - Posted: 11 Jan 2009, 4:13:33 UTC - in response to Message 852007.  

Sources or sorcerers?

One Culture's Technology is another Culture's Magic.
Savoir-Faire is everywhere!
The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST

ID: 852032 · Report as offensive
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 852037 - Posted: 11 Jan 2009, 4:22:44 UTC - in response to Message 851947.  

I'm not sure of the relevance of your angle range comment. I don't know (and I'm willing to be corrected) of anything that makes a work unit somehow "acceptable" or "eligible" for CUDA. From the complaints I've seen, it appears that a given work unit can be distributed to either CUDA or non-CUDA hosts.

It wasn't a question of acceptability or eligibility. You are quite right: any AR can be distributed to any app.

The question is what the app does with a task of a particular AR once it has received it and started processing. Up until now, every app has been capable of processing every AR. No longer: the CUDA app errors (crashes) at VLAR. That alone should render it unfit for main project distribution and relegate it to Beta.

Yes, that should have been caught in beta.

... and as a developer, I'm always worried when my projects move from beta to release because you're never entirely sure when "enough" testing is enough.

Since I haven't found one of these in my limited searching, what exactly happens when a VLAR WU crashes the app? Does it affect the next WU?
ID: 852037 · Report as offensive
SATAN
Joined: 27 Aug 06
Posts: 835
Credit: 2,129,006
RAC: 0
United Kingdom
Message 852056 - Posted: 11 Jan 2009, 4:57:21 UTC

Ned, it can do many wonderful things. The main ones are causing the drivers to crash and making fractals appear on screen. It can screw up the whole cache of units if not caught quickly; it will cause them all to compute error.

When we have units like those going through at the minute, there is no problem, but when they are VLAR or VHAR, all the problems appear.
ID: 852056 · Report as offensive
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 852080 - Posted: 11 Jan 2009, 5:39:09 UTC - in response to Message 852056.  

Ned, it can do many wonderful things. The main ones are causing the drivers to crash and making fractals appear on screen. It can screw up the whole cache of units if not caught quickly; it will cause them all to compute error.

When we have units like those going through at the minute, there is no problem, but when they are VLAR or VHAR, all the problems appear.

Questions like this always trigger my "software engineer" instincts, which is why I'm asking questions.

Is it known for certain that VLAR or VHAR work units will always error?

Is it possible that some versions of the driver and/or some video chips do not error?

My big objection is that the bulk of the CUDA discussion has been "CUDA is bad," and if we want CUDA fixed, we need to get to "when this work unit (by number) is crunched with this model video chip and this driver, it fails."

Does anyone know the limits? Is there a known "safe" range??

I'm really not interested in "how bad" because that doesn't lead to a solution. Defining the problem is the first step in getting it solved.
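As a sketch of what that problem definition could look like in practice (the record layout and the example values below are hypothetical, not any actual SETI@home or BOINC schema):

```python
# Hypothetical sketch: the minimum fields needed to turn "CUDA is bad" into a
# reproducible bug report. Nothing here mirrors a real SETI@home/BOINC schema;
# the example values are made up for illustration.

from dataclasses import dataclass

@dataclass
class CudaFailureReport:
    workunit_id: int      # the WU number, so others can re-run it
    angle_range: float    # AR of the task, so the VLAR suspicion is testable
    gpu_model: str        # e.g. "GeForce 9800 GT"
    driver_version: str   # e.g. "180.48"
    boinc_version: str    # e.g. "6.4.5"
    outcome: str          # "ok", "-9 overflow", "compute error", "driver crash"

reports = [
    CudaFailureReport(1234567, 0.0084, "GeForce 9800 GT", "180.48", "6.4.5",
                      "-9 overflow"),
    CudaFailureReport(1234568, 0.4100, "GeForce 9800 GT", "180.48", "6.4.5",
                      "ok"),
]

# With enough of these, failures can be cross-tabulated against AR, chip and
# driver instead of argued from anecdote.
failures = [r for r in reports if r.outcome != "ok"]
print(f"{len(failures)} of {len(reports)} sampled tasks failed")
```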
ID: 852080 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 852087 - Posted: 11 Jan 2009, 5:55:04 UTC - in response to Message 852037.  

...
The question is what the app does with a task of a particular AR once it has received it and started processing. Up until now, every app has been capable of processing every AR. No longer: the CUDA app errors (crashes) at VLAR. That alone should render it unfit for main project distribution and relegate it to Beta.

Yes, that should have been caught in beta.
...

It was. Problem with VLAR WUs on CUDA MB started Dec. 13th, confirmed with ample data Dec. 14th.
                                                                 Joe
ID: 852087 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 852101 - Posted: 11 Jan 2009, 6:15:49 UTC - in response to Message 852087.  

...
The question is what the app does with a task of a particular AR once it has received it and started processing. Up until now, every app has been capable of processing every AR. No longer: the CUDA app errors (crashes) at VLAR. That alone should render it unfit for main project distribution and relegate it to Beta.

Yes, that should have been caught in beta.
...

It was. Problem with VLAR WUs on CUDA MB started Dec. 13th, confirmed with ample data Dec. 14th.
                                                                 Joe

Then why the launch, Joe, why the launch????????


I know...do you?
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 852101 · Report as offensive
Beau
Joined: 24 Feb 08
Posts: 50
Credit: 129,080
RAC: 0
United States
Message 852177 - Posted: 11 Jan 2009, 12:19:13 UTC - in response to Message 852101.  

Maybe they already cashed the check and had to launch...? Just a thought...

ID: 852177 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14677
Credit: 200,643,578
RAC: 874
United Kingdom
Message 852180 - Posted: 11 Jan 2009, 12:27:23 UTC - in response to Message 852177.  

Maybe they already cashed the check and had to launch...? Just a thought...

The largest donation in December 2008 was only $1000.00, so it can't have been a very big check.... ;-)
ID: 852180 · Report as offensive
SATAN
Joined: 27 Aug 06
Posts: 835
Credit: 2,129,006
RAC: 0
United Kingdom
Message 852181 - Posted: 11 Jan 2009, 12:31:49 UTC

Richard, if it was made via other sources, it might not appear on that list. That list is only what they want us to see.
ID: 852181 · Report as offensive
SATAN
Joined: 27 Aug 06
Posts: 835
Credit: 2,129,006
RAC: 0
United Kingdom
Message 852182 - Posted: 11 Jan 2009, 12:33:37 UTC

Ned, from what I have seen personally, the safe range for crunching is between 0.005 and 2.9. Anything else seems to cause f' ups!
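Taking those anecdotal bounds at face value, the guard a cruncher would want is simple to sketch (the cutoffs below are one user's observation, not project-confirmed limits):

```python
# Sketch of a client-side guard based on the anecdotal bounds quoted above.
# The 0.005 and 2.9 cutoffs are one user's observation, not confirmed limits.

SAFE_AR_LOW = 0.005   # below this: VLAR territory, reported to crash the CUDA app
SAFE_AR_HIGH = 2.9    # above this: VHAR territory, also reported problematic

def safe_for_cuda(angle_range: float) -> bool:
    """Return True if a task's angle range falls inside the reported safe band."""
    return SAFE_AR_LOW <= angle_range <= SAFE_AR_HIGH

for ar in (0.004, 0.41, 3.1):
    action = "crunch on GPU" if safe_for_cuda(ar) else "abort / leave to a CPU host"
    print(f"AR {ar:5.3f}: {action}")
```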
ID: 852182 · Report as offensive
Beau
Joined: 24 Feb 08
Posts: 50
Credit: 129,080
RAC: 0
United States
Message 852184 - Posted: 11 Jan 2009, 12:37:50 UTC - in response to Message 852181.  

And they obviously aren't being truthful and telling everything about the cluster that was the forced release of CUDA well before it was ready for a production environment, so what else are they not saying?
ID: 852184 · Report as offensive


 