Cuda is as Cuda does,,,,,,,,,,,,,,,,,

Message boards : Number crunching : Cuda is as Cuda does,,,,,,,,,,,,,,,,,


Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14677
Credit: 200,643,578
RAC: 874
United Kingdom
Message 851921 - Posted: 10 Jan 2009, 22:56:40 UTC - in response to Message 851909.  

When I do crunch with CUDA it's fine. The errors are me ****ing around. It is faster, though with the optimized AP apps things are clearly moving forward. Things will work in about 18 months.

Although all the points you have made, Mark, have been perfectly valid.

Duh............

I raise a stink............
The points I make are indeed valid..........
But I still get whacked by those who think they know it all.........you should see me PNs..........

I am about one tad from bowing out of here........and you know that takes a lot.........

Mark,

I'm not bashing CUDA precisely because I don't have all the facts.

From what I understand from Matt's post, CUDA is about three percent of all work returned.

If we assume that CUDA never validates against the CPU app, and that there is a 0.03 × 0.03 chance (0.0009, or 0.09%) that two CUDAs will pair up, then the odds of an invalid CUDA result hitting the database are about 1 in 1,000.

....

Ned,

You've quoted that single anecdotal snapshot of Matt's four or five times now. I think you're trying to build too powerful an argument from it - and making the elementary statistical error of assuming that the events are independent, and hence the probabilities can be multiplied.

We know that the ARs of WUs split are not random - heck, they come in long runs of similar AR, depending on the research project controlling Arecibo at the time of the recording. We know that power crunchers tend to collect their tasks in blocks of 20. And we have persistent, though largely anecdotal, reports that the CUDA app errors consistently on VLAR tasks, and then fails to recover - generating false -9 reports in huge numbers, very quickly, until the host is rebooted. Lots of scope for non-randomness there, so the probabilities fly out of the window.
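To put rough, purely illustrative numbers on that last point (none of these figures are project measurements): a CUDA host stuck in a false -9 loop "finishes" tasks almost instantly, so while the storm lasts CUDA hosts dominate the pool of hosts returning and re-requesting work, and the chance that both replicas of a WU land on CUDA hosts is nowhere near the naive product. A minimal sketch in Python:

```python
# Illustrative sketch only -- every number here is an assumption, not SETI@home data.
# A healthy host returns about 1 task/hour; a CUDA host in a -9 error storm burns
# through tasks in seconds. The storm makes CUDA hosts dominate the pool of hosts
# returning work, so the two replicas of a WU are not independent 3% draws.

CPU_HOSTS, CUDA_HOSTS = 97, 3   # assumed populations: 3% of hosts are CUDA
CPU_RATE = 1.0                  # assumed tasks per hour for a healthy host
STORM_RATE = 100.0              # assumed rate for a host erroring everything out

steady = CUDA_HOSTS * CPU_RATE / ((CUDA_HOSTS + CPU_HOSTS) * CPU_RATE)
storm = CUDA_HOSTS * STORM_RATE / (CUDA_HOSTS * STORM_RATE + CPU_HOSTS * CPU_RATE)

print(f"CUDA share of returned results, steady state: {steady:.2%}")    # 3.00%
print(f"CUDA share during a -9 error storm:           {storm:.2%}")     # ~75.6%
print(f"naive both-replicas-CUDA chance:              {steady**2:.2%}") # 0.09%
print(f"both-replicas-CUDA chance during the storm:   {storm**2:.2%}")  # ~57%
```

Under those assumptions the multiplication is off by orders of magnitude, which is exactly the independence problem.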

We can have no idea of the overall CUDA rate from Matt's "currently roughly" remark. I suspect some snapshots would show lower, others higher - but how much lower or higher, neither of us has any way of knowing. An alternative factoid would be the List of recently connected client types, currently showing 14.25% Windows v6.4.5 (and hence potentially CUDA-capable).

As it happens, I strongly agree with Mark that the CUDA release was a deeply flawed technical exercise. He has indicated in public that he has been told the behind-the-scenes reason for the release, but has been forbidden to repeat the information in public. I have heard much the same story in private from other people. It deeply saddens me that such secrecy has come about in an international, academic, scientific, publicly-funded (as in us, the public) research effort such as this.

I think that both the technical, and the public relations, aspects of the CUDA release have been extremely poor, and need urgent remedial action. It spoke volumes that no member of the SETI@home project staff was prepared to be named, or quoted, in NVidia's press release on SETI CUDA launch day.
ID: 851921 · Report as offensive
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 851935 - Posted: 10 Jan 2009, 23:55:59 UTC - in response to Message 851921.  


Ned,

You've quoted that single anecdotal snapshot of Matt's four or five times now. I think you're trying to build too powerful an argument from it - and making the elementary statistical error of assuming that the events are independent, and hence the probabilities can be multiplied.

We know that the ARs of WUs split are not random - heck, they come in long runs of similar AR, depending on the research project controlling Arecibo at the time of the recording. We know that power crunchers tend to collect their tasks in blocks of 20. And we have persistent, though largely anecdotal, reports that the CUDA app errors consistently on VLAR tasks, and then fails to recover - generating false -9 reports in huge numbers, very quickly, until the host is rebooted. Lots of scope for non-randomness there, so the probabilities fly out of the window.

We can have no idea of the overall CUDA rate from Matt's "currently roughly" remark. I suspect some snapshots would show lower, others higher - but how much lower or higher, neither of us has any way of knowing. An alternative factoid would be the List of recently connected client types, currently showing 14.25% Windows v6.4.5 (and hence potentially CUDA-capable).

As it happens, I strongly agree with Mark that the CUDA release was a deeply flawed technical exercise. He has indicated in public that he has been told the behind-the-scenes reason for the release, but has been forbidden to repeat the information in public. I have heard much the same story in private from other people. It deeply saddens me that such secrecy has come about in an international, academic, scientific, publicly-funded (as in us, the public) research effort such as this.

I think that both the technical, and the public relations, aspects of the CUDA release have been extremely poor, and need urgent remedial action. It spoke volumes that no member of the SETI@home project staff was prepared to be named, or quoted, in NVidia's press release on SETI CUDA launch day.

Actually, I'm trying to make the point that we don't have a good statistical basis for any argument, good or bad.

I'm not sure of the relevance of your angle range comment. I don't know (and I'm willing to be corrected) of anything that makes a work unit somehow "acceptable" or "eligible" for CUDA. From the complaints I've seen, it appears that a given work unit can be distributed to either CUDA or non-CUDA hosts.

Your observation that 14.25% are running BOINC 6.4.5 sets the upper boundary. We know that more than 85% of the clients are not running CUDA.

I'm running 6.4.5, but none of my machines have a CUDA capable card. The actual number of CUDA machines (right video card, correct driver, no APP_INFO.XML and right version of BOINC) could be anywhere from a bit more than 0% to 14.25%.

... and we know that some report that CUDA is slower, and others report that it's faster.

So, if any given result can go to either client, the ratio of CUDA to non-CUDA results is clearly a function of the processing speed and population of the two types. My best guess, with all the unknown variables, is somewhere between practically none and something over 30%.
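That relationship can be written down directly. A minimal sketch, assuming purely for illustration that every host crunches continuously and that a GPU is a fixed factor faster per task:

```python
# Sketch of the point above: the CUDA share of *returned results* depends on
# both the population split and the relative speed. All numbers are assumed
# for illustration, not measured.

def cuda_result_share(cuda_hosts: float, cpu_hosts: float, speedup: float) -> float:
    """Fraction of returned results from CUDA hosts, assuming continuous
    crunching and a CUDA host `speedup` times faster per task."""
    cuda_throughput = cuda_hosts * speedup
    return cuda_throughput / (cuda_throughput + cpu_hosts)

# If 3% of hosts were CUDA-capable and a GPU were, say, 2x as fast per task:
print(f"{cuda_result_share(3, 97, 2.0):.1%}")          # ~5.8% of results
# At the 14.25% upper bound with the same assumed speedup:
print(f"{cuda_result_share(14.25, 85.75, 2.0):.1%}")   # ~24.9% of results
```

Plug in different populations and speedups and the share swings across that whole "practically none to over 30%" range.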

You have to look at the actual returned results to see how many are CUDA and how many are non-CUDA. I assume that Matt can do that easily, and that that is how he got the 3%.

The next question is: how often does CUDA not validate?

The assumption seems to be that CUDA and CPU apps do not provide the same results -- that all of the work returned by CUDA is bad.

This one validated -- and I could not find one from this host that did not validate.

So, it seems that at least some of the CUDA work is valid, and validates against the CPU application.

... and that is all I'm trying to say. There are a great number of posts, and for the most part they're highly negative. The casual reader is going to see all of these complaints about CUDA, and get the impression that much of the work being done is invalid because CUDA is turning out trash at an alarming rate -- and that the Science suffers because CUDA validates against CUDA, the CPU app gets outvoted, and we're missing our extra-terrestrial "I Love Lucy" episodes.

All of the evidence we have is anecdotal. Except for Matt's 3% and your 14.25%, all of the other factors are pure guesses.

A quick, informal sampling says the odds of "bad" CUDA work are less than 100% (and by quite a bit, but my sample is too small to support any better conclusion).

Absent some good solid statistics, produced by sampling as many results as possible, all we have is anecdotal evidence that sometimes CUDA is flawed.
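For what it's worth, there is a standard way to quantify "too small": if you check n results and find zero invalid, the "rule of three" puts the 95% upper bound on the true invalid rate at roughly 3/n. A minimal sketch, with assumed sample sizes (no one has reported actual counts):

```python
# Rule of three: after 0 failures in n independent checks, the true failure
# rate is below about 3/n with 95% confidence. Sample sizes here are assumed.

def rule_of_three_upper_bound(n_checked: int) -> float:
    """Approximate 95% upper bound on a failure rate after 0 failures in n trials."""
    return 3.0 / n_checked

for n in (10, 50, 300):
    print(f"0 invalid out of {n:3d} checked -> invalid rate likely below "
          f"{rule_of_three_upper_bound(n):.1%}")
# Checking 10 results only rules out rates above 30%; it takes roughly 300
# clean results to claim the invalid rate is under 1%.
```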

-- Ned
ID: 851935 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14677
Credit: 200,643,578
RAC: 874
United Kingdom
Message 851947 - Posted: 11 Jan 2009, 0:20:31 UTC - in response to Message 851935.  

I'm not sure of the relevance of your angle range comment. I don't know (and I'm willing to be corrected) of anything that makes a work unit somehow "acceptable" or "eligible" for CUDA. From the complaints I've seen, it appears that a given work unit can be distributed to either CUDA or non-CUDA hosts.

It wasn't a question of acceptability or eligibility. You are quite right: any AR can be distributed to any app.

The question is what the app does with a task of a particular AR once it has received it and started processing. Up until now, every app has been capable of processing every AR. No longer: the CUDA app errors (crashes) at VLAR. That alone should render it unfit for main project distribution and relegate it to Beta.
ID: 851947 · Report as offensive
Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9958
Credit: 103,452,613
RAC: 328
United Kingdom
Message 851967 - Posted: 11 Jan 2009, 1:12:39 UTC

I have 3 machines ready to go on CUDA. I started 2 off, but the number of errors was just silly. After the outage last week, one machine ran through task after task in minutes with "compute error". Yes, I suspect they were VLAR tasks, but it should not be up to me to have to work out which tasks I can and can't crunch. I have had to "detach" both machines to get them to stop accepting CUDA. But at least they are returning VALID results without me having to babysit them.

The home page is still happily advertising CUDA to everyone, and so far I have not seen one official comment that there are ANY problems at all.

Overall I am very disappointed in the way this has been handled. I've been with this project almost since the beginning, but I feel let down now.

I now have one machine crunching full time for The Clean Energy Project; if this proves to be stable, then the others may follow.



ID: 851967 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 851975 - Posted: 11 Jan 2009, 1:33:03 UTC - in response to Message 851971.  
Last modified: 11 Jan 2009, 1:36:10 UTC

I happen to think that CUDA is the cat's meow.

I've found that other than having to abort VLAR WUs so as not to confuse my poor GPU, I've had no errors to speak of. True, aborting WUs is a drain on bandwidth, but the actual hit probably isn't much greater than someone losing a 10-day cache due to extreme OC'ing. I have faith in the science team's ability to keep the database free from errors. I find that my credit claims are somewhat higher than most CPU-only credit claims per WU. One positive result of all this is that more AP WUs are getting done.

YMMV.

Oh, save me........
I have seen more reports of Cuda gone wrong than anything else..........
The kittyman is NOT impressed..............
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 851975 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 851981 - Posted: 11 Jan 2009, 1:55:57 UTC - in response to Message 851980.  


Oh, save me........
I have seen more reports of Cuda gone wrong than anything else..........
The kittyman is NOT impressed..............


It is true that once you get CUDA working well, one is less inclined to laud its merits and post.

Geez, you may even have a frozen GTX 295 in your future.

Cheers.
LOL.....guidoman
I may beat you all at this game one day. I have my sources........

"Time is simply the mechanism that keeps everything from happening all at once."

ID: 851981 · Report as offensive
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 66284
Credit: 55,293,173
RAC: 49
United States
Message 851985 - Posted: 11 Jan 2009, 2:06:31 UTC - in response to Message 851981.  


Oh, save me........
I have seen more reports of Cuda gone wrong than anything else..........
The kittyman is NOT impressed..............


It is true that once you get CUDA working well, one is less inclined to laud its merits and post.

Geez, you may even have a frozen GTX 295 in your future.

Cheers.
LOL.....guidoman
I may beat you all at this game one day. I have my sources........

You would. ;)
Savoir-Faire is everywhere!
The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST

ID: 851985 · Report as offensive
Voyager
Volunteer tester
Joined: 2 Nov 99
Posts: 602
Credit: 3,264,813
RAC: 0
United States
Message 852007 - Posted: 11 Jan 2009, 3:07:39 UTC

Sources or sorcerers?
ID: 852007 · Report as offensive
SATAN
Joined: 27 Aug 06
Posts: 835
Credit: 2,129,006
RAC: 0
United Kingdom
Message 852012 - Posted: 11 Jan 2009, 3:16:06 UTC

Well, I now have 5 cores crunching on a 4-core machine. Not bad; let's see if any errors occur now that it's crunching 4 AP and CUDA together.
ID: 852012 · Report as offensive
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 66284
Credit: 55,293,173
RAC: 49
United States
Message 852032 - Posted: 11 Jan 2009, 4:13:33 UTC - in response to Message 852007.  

Sources or sorcerers?

One Culture's Technology is another Culture's Magic.
Savoir-Faire is everywhere!
The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST

ID: 852032 · Report as offensive
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 852037 - Posted: 11 Jan 2009, 4:22:44 UTC - in response to Message 851947.  

I'm not sure of the relevance of your angle range comment. I don't know (and I'm willing to be corrected) of anything that makes a work unit somehow "acceptable" or "eligible" for CUDA. From the complaints I've seen, it appears that a given work unit can be distributed to either CUDA or non-CUDA hosts.

It wasn't a question of acceptability or eligibility. You are quite right: any AR can be distributed to any app.

The question is what the app does with a task of a particular AR once it has received it and started processing. Up until now, every app has been capable of processing every AR. No longer: the CUDA app errors (crashes) at VLAR. That alone should render it unfit for main project distribution and relegate it to Beta.

Yes, that should have been caught in beta.

... and as a developer, I'm always worried when my projects move from beta to release because you're never entirely sure when "enough" testing is enough.

Since I haven't found one of these in my limited searching, what exactly happens when a VLAR WU crashes the app? Does it affect the next WU?
ID: 852037 · Report as offensive
SATAN
Joined: 27 Aug 06
Posts: 835
Credit: 2,129,006
RAC: 0
United Kingdom
Message 852056 - Posted: 11 Jan 2009, 4:57:21 UTC

Ned, it can do many wonderful things. The main ones are causing the drivers to crash and making fractals appear on screen. It can screw up the whole cache of units if not caught quickly; it will cause them all to compute error.

When we have units like those going through at the minute, there is no problem, but when they are VLAR or VHAR, all the problems appear.
ID: 852056 · Report as offensive
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 852080 - Posted: 11 Jan 2009, 5:39:09 UTC - in response to Message 852056.  

Ned, it can do many wonderful things. The main ones are causing the drivers to crash and making fractals appear on screen. It can screw up the whole cache of units if not caught quickly; it will cause them all to compute error.

When we have units like those going through at the minute, there is no problem, but when they are VLAR or VHAR, all the problems appear.

Questions like this always trigger my "software engineer" instincts, which is why I'm asking questions.

Is it known for certain that VLAR or VHAR work units will always error?

Is it possible that some versions of the driver and/or some video chips do not error?

My big objection is that the bulk of the CUDA discussion has been "CUDA is bad," and if we want CUDA fixed, we need to get to "when this work unit (by number) is crunched with this model video chip and this driver, it fails."

Does anyone know the limits? Is there a known "safe" range??

I'm really not interested in "how bad" because that doesn't lead to a solution. Defining the problem is the first step in getting it solved.
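As a sketch of what that problem definition could look like in practice (the record layout and the example values below are hypothetical, not any actual SETI@home or BOINC schema):

```python
# Hypothetical sketch: the minimum fields needed to turn "CUDA is bad" into a
# reproducible bug report. Nothing here mirrors a real SETI@home/BOINC schema;
# the example values are made up for illustration.

from dataclasses import dataclass

@dataclass
class CudaFailureReport:
    workunit_id: int      # the WU number, so others can re-run it
    angle_range: float    # AR of the task, so the VLAR suspicion is testable
    gpu_model: str        # e.g. "GeForce 9800 GT"
    driver_version: str   # e.g. "180.48"
    boinc_version: str    # e.g. "6.4.5"
    outcome: str          # "ok", "-9 overflow", "compute error", "driver crash"

reports = [
    CudaFailureReport(1234567, 0.0084, "GeForce 9800 GT", "180.48", "6.4.5",
                      "-9 overflow"),
    CudaFailureReport(1234568, 0.4100, "GeForce 9800 GT", "180.48", "6.4.5",
                      "ok"),
]

# With enough of these, failures can be cross-tabulated against AR, chip and
# driver instead of argued from anecdote.
failures = [r for r in reports if r.outcome != "ok"]
print(f"{len(failures)} of {len(reports)} sampled tasks failed")
```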
ID: 852080 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 852087 - Posted: 11 Jan 2009, 5:55:04 UTC - in response to Message 852037.  

...
The question is what the app does with a task of a particular AR once it has received it and started processing. Up until now, every app has been capable of processing every AR. No longer: the CUDA app errors (crashes) at VLAR. That alone should render it unfit for main project distribution and relegate it to Beta.

Yes, that should have been caught in beta.
...

It was. Problem with VLAR WUs on CUDA MB started Dec. 13th, confirmed with ample data Dec. 14th.
                                                                 Joe
ID: 852087 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 852101 - Posted: 11 Jan 2009, 6:15:49 UTC - in response to Message 852087.  

...
The question is what the app does with a task of a particular AR once it has received it and started processing. Up until now, every app has been capable of processing every AR. No longer: the CUDA app errors (crashes) at VLAR. That alone should render it unfit for main project distribution and relegate it to Beta.

Yes, that should have been caught in beta.
...

It was. Problem with VLAR WUs on CUDA MB started Dec. 13th, confirmed with ample data Dec. 14th.
                                                                 Joe

Then why the launch, Joe, why the launch????????


I know...do you?
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 852101 · Report as offensive
Beau
Joined: 24 Feb 08
Posts: 50
Credit: 129,080
RAC: 0
United States
Message 852177 - Posted: 11 Jan 2009, 12:19:13 UTC - in response to Message 852101.  

Maybe they already cashed the check and had to launch...? Just a thought...

ID: 852177 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14677
Credit: 200,643,578
RAC: 874
United Kingdom
Message 852180 - Posted: 11 Jan 2009, 12:27:23 UTC - in response to Message 852177.  

Maybe they already cashed the check and had to launch...? Just a thought...

The largest donation in December 2008 was only $1000.00, so it can't have been a very big check.... ;-)
ID: 852180 · Report as offensive
SATAN
Joined: 27 Aug 06
Posts: 835
Credit: 2,129,006
RAC: 0
United Kingdom
Message 852181 - Posted: 11 Jan 2009, 12:31:49 UTC

Richard, if it was made via other sources, it might not appear on that list. That list is only what they want us to see.
ID: 852181 · Report as offensive
SATAN
Joined: 27 Aug 06
Posts: 835
Credit: 2,129,006
RAC: 0
United Kingdom
Message 852182 - Posted: 11 Jan 2009, 12:33:37 UTC

Ned, from what I have seen personally, the safe range for crunching is between 0.005 and 2.9. Anything else seems to cause f' ups!
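Taking those anecdotal bounds at face value, the guard a cruncher would want is simple to sketch (the cutoffs below are one user's observation, not project-confirmed limits):

```python
# Sketch of a client-side guard based on the anecdotal bounds quoted above.
# The 0.005 and 2.9 cutoffs are one user's observation, not confirmed limits.

SAFE_AR_LOW = 0.005   # below this: VLAR territory, reported to crash the CUDA app
SAFE_AR_HIGH = 2.9    # above this: VHAR territory, also reported problematic

def safe_for_cuda(angle_range: float) -> bool:
    """Return True if a task's angle range falls inside the reported safe band."""
    return SAFE_AR_LOW <= angle_range <= SAFE_AR_HIGH

for ar in (0.004, 0.41, 3.1):
    action = "crunch on GPU" if safe_for_cuda(ar) else "abort / leave to a CPU host"
    print(f"AR {ar:5.3f}: {action}")
```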
ID: 852182 · Report as offensive
Beau
Joined: 24 Feb 08
Posts: 50
Credit: 129,080
RAC: 0
United States
Message 852184 - Posted: 11 Jan 2009, 12:37:50 UTC - in response to Message 852181.  

And they obviously aren't being truthful and telling everything about the cluster that was the forced release of CUDA well before it was ready for a production environment, so what else are they not saying?
ID: 852184 · Report as offensive


 