Cuda is as Cuda does,,,,,,,,,,,,,,,,,
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14677 · Credit: 200,643,578 · RAC: 874

> When I do crunch with CUDA it's fine. The errors are me ****ing around. It is faster, though with the optimized AP apps things are clearly moving forward. Things will work in about 18 months.

Ned, you've quoted that single anecdotal snapshot of Matt's four or five times now. I think you're trying to build too powerful an argument from it - and making the elementary statistical error of assuming that the events are independent, and hence that the probabilities can be multiplied.

We know that the ARs of the WUs split are not random - heck, they come in long runs of similar AR, depending on the research project controlling Arecibo at the time of the recording. We know that power crunchers tend to collect their tasks in blocks of 20. And we have persistent, though largely anecdotal, reports that the CUDA app errors consistently on VLAR tasks and then fails to recover - generating false -9 reports in huge numbers, very quickly, until the host is rebooted. Lots of scope for non-randomness there, so the probabilities fly out of the window.

We can have no idea of the overall CUDA rate from Matt's "currently roughly" remark. I suspect some snapshots would show lower, others higher - but how much lower or higher, neither of us has any way of knowing. An alternative factoid would be the list of recently connected client types, currently showing 14.25% Windows v6.4.5 (and hence potentially CUDA-capable).

As it happens, I strongly agree with Mark that the CUDA release was a deeply flawed technical exercise. He has indicated in public that he has been told the behind-the-scenes reason for the release, but has been forbidden to repeat the information in public. I have heard much the same story in private from other people. It deeply saddens me that such secrecy has come about in an international, academic, scientific, publicly-funded (as in us, the public) research effort such as this.

I think that both the technical and the public-relations aspects of the CUDA release have been extremely poor, and need urgent remedial action. It spoke volumes that no member of the SETI@home project staff was prepared to be named, or quoted, in NVidia's press release on SETI CUDA launch day.
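To put a number on the independence point, here is a toy calculation; the 5% VLAR fraction is assumed for illustration, not project data. If similar-AR tasks arrive in long runs, the chance of a fetched block being all-VLAR is nothing like the product of per-task probabilities.

```python
# Toy figures, assumed for illustration -- not measured project data.
p_vlar = 0.05   # assumed fraction of split work that is VLAR
block = 20      # power crunchers fetch tasks in blocks of 20

# If each task were an independent draw, an all-VLAR block of 20 would be
# vanishingly rare:
p_independent = p_vlar ** block
print(f"independent draws: {p_independent:.1e}")   # ~9.5e-27

# But similar-AR tasks come off the splitters in long runs. When a run is
# much longer than a block, a block is usually either all-VLAR or VLAR-free,
# so the all-VLAR probability collapses to roughly the run frequency itself:
p_clustered = p_vlar
print(f"clustered runs:    {p_clustered:.1e}")     # ~5.0e-2
```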
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0

Actually, I'm trying to make the point that we don't have a good statistical basis for any argument, good or bad.

I'm not sure of the relevance of your angle-range comment. I don't know (and I'm willing to be corrected) that there is anything different between work units that are somehow "acceptable" or "eligible" for CUDA. From the complaints I've seen, it appears that a given work unit can be distributed to CUDA or non-CUDA.

Your observation that 14.25% are running BOINC 6.4.5 sets the upper boundary. We know that more than 85% of the clients are not running CUDA. I'm running 6.4.5, but none of my machines has a CUDA-capable card. The actual number of CUDA machines (right video card, correct driver, no APP_INFO.XML, and right version of BOINC) could be anywhere from a bit more than 0% to 14.25%.

... and we know that some report that CUDA is slower, and others report that it's faster. So, if any given result can go to either client, the ratio of CUDA to non-CUDA is clearly a function of the processing speed and population of the two types. My best guess, with all the unknown variables, is somewhere between practically none and something over 30%. You have to look at the actual returned results to see how many are CUDA and how many are non-CUDA. I assume that Matt can do that easily, and that this is how he got the 3%.

The next question is: how often does CUDA not validate? The assumption seems to be that the CUDA and CPU apps do not produce the same results -- that all of the work returned by CUDA is bad. This one validated -- and I could not find one from this host that did not validate. So it seems that at least some of the CUDA work is valid, and validates against the CPU application.

... and that is all I'm trying to say. There are a great number of posts, and for the most part they're highly negative. The casual reader is going to see all of these complaints about CUDA and get the impression that much of the work being done is invalid because CUDA is turning out trash at an alarming rate -- and that the science suffers because CUDA validates against CUDA, the CPU app gets outvoted, and we're missing our extra-terrestrial "I Love Lucy" episodes.

All of the evidence we have is anecdotal. Except for Matt's 3% and your 14.25%, all of the other factors are pure guesses. A quick, informal sampling says the odds of "bad" CUDA work are less than 100% (and by quite a bit, but my sample is too small to make any better conclusion). Absent some good solid statistics, produced by sampling as many results as possible, all we have is anecdotal evidence that sometimes CUDA is flawed.

-- Ned
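A back-of-the-envelope sketch of that "function of processing speed and population" point; the host fractions and speedups below are assumptions picked for illustration, not measurements:

```python
def cuda_result_share(cuda_fraction: float, speedup: float) -> float:
    """Fraction of returned results coming from CUDA hosts, assuming all
    hosts draw from the same work pool and differ only in throughput."""
    cuda_rate = cuda_fraction * speedup      # results per unit time, CUDA hosts
    cpu_rate = (1.0 - cuda_fraction) * 1.0   # results per unit time, CPU hosts
    return cuda_rate / (cuda_rate + cpu_rate)

# Assumed, illustrative numbers only: 3% of hosts at 10x CPU speed would
# already return ~24% of all results...
print(f"{cuda_result_share(0.03, 10.0):.3f}")    # 0.236
# ...while the full 14.25% of hosts at 3x would return ~33% -- bracketing
# "somewhere between practically none and something over 30%".
print(f"{cuda_result_share(0.1425, 3.0):.3f}")   # 0.333
```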
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14677 · Credit: 200,643,578 · RAC: 874

> I'm not sure of the relevance of your angle-range comment. I don't know (and I'm willing to be corrected) that there is anything different between work units that are somehow "acceptable" or "eligible" for CUDA. From the complaints I've seen, it appears that a given work unit can be distributed to CUDA or non-CUDA.

It wasn't a question of acceptability or eligibility. You are quite right: any AR can be distributed to any app. The question is what the app does with a task of a particular AR once it has received it and started processing. Up until now, every app has been capable of processing every AR. No longer. The CUDA app errors (crashes) at VLAR. That alone should render it unfit for main-project distribution, and relegate it to Beta.
Bernie Vine · Joined: 26 May 99 · Posts: 9958 · Credit: 103,452,613 · RAC: 328

I have 3 machines ready to go on CUDA. I started 2 off, but the number of errors was just silly. After the outage last week, one machine ran through task after task in minutes with "compute error". Yes, I suspect they were VLAR tasks, but it should not be up to me to have to work out which tasks I can and can't crunch.

I have had to "detach" both machines to get them to stop accepting CUDA. But at least they are returning VALID results without me having to babysit them.

The home page is still happily advertising CUDA to everyone, and so far I have not seen one official comment that there are ANY problems at all. Overall I am very disappointed in the way this has been handled. I have been with this project almost since the beginning, but I feel let down now.

I now have one machine crunching full time for The Clean Energy Project; if this proves to be stable, then the others may follow.
kittyman · Joined: 9 Jul 00 · Posts: 51477 · Credit: 1,018,363,574 · RAC: 1,004

> I happen to think that CUDA is the cat's meow.

Oh, save me... I have seen more reports of CUDA gone wrong than anything else. The kittyman is NOT impressed.

"Time is simply the mechanism that keeps everything from happening all at once."
kittyman · Joined: 9 Jul 00 · Posts: 51477 · Credit: 1,018,363,574 · RAC: 1,004

LOL... guidoman, I may beat you all at this game one day. I have my sources...

"Time is simply the mechanism that keeps everything from happening all at once."
zoom3+1=4 · Joined: 30 Nov 03 · Posts: 66284 · Credit: 55,293,173 · RAC: 49

> LOL... guidoman

You would. ;)

Savoir-Faire is everywhere!
The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST
Voyager · Joined: 2 Nov 99 · Posts: 602 · Credit: 3,264,813 · RAC: 0

Sources or sorcerers?
SATAN · Joined: 27 Aug 06 · Posts: 835 · Credit: 2,129,006 · RAC: 0

Well, I now have 5 cores crunching on a 4-core machine. Not bad; let's see if any errors occur now that it's crunching 4 AP and CUDA together.
zoom3+1=4 · Joined: 30 Nov 03 · Posts: 66284 · Credit: 55,293,173 · RAC: 49

> Sources or sorcerers?

One culture's technology is another culture's magic.

Savoir-Faire is everywhere!
The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0

> I'm not sure of the relevance of your angle-range comment. I don't know (and I'm willing to be corrected) that there is anything different between work units that are somehow "acceptable" or "eligible" for CUDA. From the complaints I've seen, it appears that a given work unit can be distributed to CUDA or non-CUDA.

Yes, that should have been caught in beta.

... and as a developer, I'm always worried when my projects move from beta to release, because you're never entirely sure when "enough" testing is enough.

Since I haven't found one of these in my limited searching, what exactly happens when a VLAR WU crashes the app? Does it affect the next WU?
SATAN · Joined: 27 Aug 06 · Posts: 835 · Credit: 2,129,006 · RAC: 0

Ned, it can do many wonderful things. The main one is causing the drivers to crash and fractals to appear on screen. It can screw up the whole cache of units if not caught quickly; it will cause them all to compute-error. When we have units like those going through at the minute there is no problem, but when they are VLAR or VHAR then all the problems appear.
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0

> Ned, it can do many wonderful things. The main one is causing the drivers to crash and fractals to appear on screen. It can screw up the whole cache of units if not caught quickly; it will cause them all to compute-error.

Questions like this always trigger my "software engineer" instincts, which is why I'm asking questions.

Is it known for certain that VLAR or VHAR work units will always error? Is it possible that some versions of the driver and/or some video chips do not error?

My big objection is that the bulk of the CUDA discussion has been "CUDA is bad", and if we want CUDA fixed then we need to get to "when this work unit (by number) is crunched with this model video chip and this driver, it fails."

Does anyone know the limits? Is there a known "safe" range?

I'm really not interested in "how bad", because that doesn't lead to a solution. Defining the problem is the first step in getting it solved.
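The kind of report being asked for could be as simple as one structured row per task; the field names and the sample values in this sketch are invented for illustration, not any project format:

```python
import csv
from dataclasses import dataclass, asdict, fields

@dataclass
class CudaFailureReport:
    """One row per task outcome -- a hypothetical schema, not a project format."""
    workunit: str        # task name as shown by the client
    angle_range: float   # AR of the task
    gpu_model: str       # e.g. "GeForce 9800 GT"
    driver: str          # display driver version
    outcome: str         # "valid", "compute error", "-9 overflow", ...

def append_report(path: str, report: CudaFailureReport) -> None:
    """Append one row to a CSV so failures can later be tabulated
    by AR, chip, and driver."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(report)])
        if f.tell() == 0:          # write the header once, on first use
            writer.writeheader()
        writer.writerow(asdict(report))

# Sample values are made up for illustration:
append_report("cuda_reports.csv", CudaFailureReport(
    "01jan09ab.1234.5678.9.10", 0.0032, "GeForce 9800 GT", "180.48", "compute error"))
```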
Josef W. Segur · Joined: 30 Oct 99 · Posts: 4504 · Credit: 1,414,761 · RAC: 0

> ...

It was. The problem with VLAR WUs on CUDA MB started Dec. 13th, and was confirmed with ample data on Dec. 14th.

Joe
kittyman · Joined: 9 Jul 00 · Posts: 51477 · Credit: 1,018,363,574 · RAC: 1,004

> ...

Then why the launch, Joe, why the launch? I know... do you?

"Time is simply the mechanism that keeps everything from happening all at once."
Beau · Joined: 24 Feb 08 · Posts: 50 · Credit: 129,080 · RAC: 0

Maybe they already cashed the check and had to launch...? Just a thought...
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14677 · Credit: 200,643,578 · RAC: 874

> Maybe they already cashed the check and had to launch...? Just a thought...

The largest donation in December 2008 was only $1000.00, so it can't have been a very big check... ;-)
SATAN · Joined: 27 Aug 06 · Posts: 835 · Credit: 2,129,006 · RAC: 0

Richard, if it was made via other sources, it might not appear on that list. That list is only what they want us to see.
SATAN · Joined: 27 Aug 06 · Posts: 835 · Credit: 2,129,006 · RAC: 0

Ned, from what I have seen personally, the safe range for crunching is between 0.005 and 2.9. Anything else seems to cause f' ups!
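Taking those numbers at face value, a client-side guard could look something like the sketch below; the `<true_angle_range>` tag is my assumption about the workunit header, and the limits are just the anecdotal window quoted above, not official values:

```python
import re

# Observed-safe window quoted above -- an anecdotal range, not official limits.
SAFE_AR_MIN = 0.005
SAFE_AR_MAX = 2.9

def angle_range(workunit_text: str) -> float:
    """Pull the angle range out of a workunit header.
    Assumes a <true_angle_range> tag; adjust if the real format differs."""
    m = re.search(r"<true_angle_range>\s*([0-9.eE+-]+)", workunit_text)
    if m is None:
        raise ValueError("no angle range found in workunit")
    return float(m.group(1))

def safe_for_cuda(workunit_text: str) -> bool:
    """True only when the task's AR falls inside the observed-safe window."""
    return SAFE_AR_MIN <= angle_range(workunit_text) <= SAFE_AR_MAX

# A VLAR task like this one would stay on the CPU:
print(safe_for_cuda("<true_angle_range>0.0032</true_angle_range>"))  # False
```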
Beau · Joined: 24 Feb 08 · Posts: 50 · Credit: 129,080 · RAC: 0

And they obviously aren't being truthful and telling everything about the cluster of a forced release of CUDA well before it was ready for a production environment, so what else are they not saying?