Modified SETI MB CUDA + opt AP package for full GPU utilization

Message boards : Number crunching : Modified SETI MB CUDA + opt AP package for full GPU utilization

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 845391 - Posted: 26 Dec 2008, 20:36:16 UTC - in response to Message 845369.  

...
Almost all tasks that do run are valid.
And a cuda task that sometimes takes up 50% of the CPU time .....
Sometimes it eats away 20 at a time, but I have my doubts about the validation system of SETI. I have seen results that are 100% in error and got 40 points for it. I see tasks done by 3 users where 1 is in error and all get the points of the 2 valid tasks.

Eric has always been willing to give credit for Beta work by using a script, and that's obviously happening here too. The CUDA code is very new and problems can be expected; it wouldn't be fair to penalize users for weaknesses of the application. That kind of credit granting cannot affect the science. The occasional problem where two CUDA apps get a "strongly similar" result may cause false signals to be added to the science database, but the possibility they'll be part of a persistency match is vanishingly small.
                                                               Joe
ID: 845391

Crunch3r
Volunteer tester
Joined: 15 Apr 99
Posts: 1546
Credit: 3,438,823
RAC: 0
Germany
Message 845393 - Posted: 26 Dec 2008, 20:44:19 UTC - in response to Message 845386.  


Ah, but the second one has "WU true angle range is : 0.083363", which is a very helpful indication that the problem extends beyond the 0.05 true VLAR range. Anything with angle range 0.03 to 0.35 is quite rare, and there are variations in array sizes and other details of the computations for anything above 0.05. It's quite possible that an 0.079 might be OK even though the 0.083 is bad, for instance. I did spot some 0.147 range work which seemed OK a couple of days ago.
                                                                  Joe


Should be possible to find out which AR triggers that error. Just modify the AR of a test wu and try running the app in standalone mode.
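A minimal sketch of such a sweep, assuming a hypothetical `run_standalone(ar)` helper that patches the test WU's angle range, runs the app in standalone mode, and returns True when the result overflowed. The helper, its name, and the grid values are assumptions for illustration only; nothing here comes from the actual package. Because an 0.079 might be fine while an 0.083 is bad, the sketch scans a grid of ARs instead of bisecting.

```python
from typing import Callable, Iterable, List, Tuple


def sweep_angle_ranges(
    run_standalone: Callable[[float], bool],
    angle_ranges: Iterable[float],
) -> List[Tuple[float, bool]]:
    """Run the (hypothetical) standalone test once per AR, in ascending order."""
    return [(ar, run_standalone(ar)) for ar in sorted(angle_ranges)]


def bad_intervals(results: List[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Collapse consecutive failing ARs into (first_bad, last_bad) intervals."""
    intervals: List[Tuple[float, float]] = []
    start = prev = None
    for ar, overflowed in results:
        if overflowed:
            if start is None:
                start = ar
            prev = ar
        elif start is not None:
            intervals.append((start, prev))
            start = prev = None
    if start is not None:
        intervals.append((start, prev))
    return intervals


if __name__ == "__main__":
    # Stand-in for a real run: pretend ARs from 0.08 to 0.10 overflow.
    def fake_run(ar: float) -> bool:
        return 0.08 <= ar <= 0.10

    grid = [0.05, 0.079, 0.083, 0.09, 0.147, 0.35]
    print(bad_intervals(sweep_angle_ranges(fake_run, grid)))
```

A finer grid around each reported interval would then narrow down where the behaviour changes.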



Join BOINC United now!
ID: 845393

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 845410 - Posted: 26 Dec 2008, 21:08:58 UTC - in response to Message 845391.  
Last modified: 26 Dec 2008, 21:09:32 UTC

...
Almost all tasks that do run are valid.
And a cuda task that sometimes takes up 50% of the CPU time .....
Sometimes it eats away 20 at a time, but I have my doubts about the validation system of SETI. I have seen results that are 100% in error and got 40 points for it. I see tasks done by 3 users where 1 is in error and all get the points of the 2 valid tasks.

Eric has always been willing to give credit for Beta work by using a script, and that's obviously happening here too. The CUDA code is very new and problems can be expected; it wouldn't be fair to penalize users for weaknesses of the application. That kind of credit granting cannot affect the science. The occasional problem where two CUDA apps get a "strongly similar" result may cause false signals to be added to the science database, but the possibility they'll be part of a persistency match is vanishingly small.
                                                               Joe


Not so small. It has already happened with an overflowed result.
CUDA can generate such results at amazing speed until the host is rebooted. So, left unattended, such a host can prepare a pretty big field for such false "validations".
ID: 845410

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 845413 - Posted: 26 Dec 2008, 21:10:53 UTC - in response to Message 845410.  
Last modified: 26 Dec 2008, 21:32:59 UTC

Here is a new build that logs the AR of overflowed tasks.
It will create the file r_debug.txt on C:\ (and append to it on later runs) and will write the AR of each overflowed task there.
Sure, there will be "legal" overflows too, but any overflow that does not validate against a CPU result is worth reporting.

Just replace the CUDA MB executable from my last package with this one. The name remains the same.
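For anyone collecting these logs, here is a rough sketch of how the overflow ARs could be tallied. The format of r_debug.txt is not described in this thread, so the sketch assumes only that each line contains one decimal number (the AR); the function name and sample lines are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed line shape: any text containing one decimal number, e.g. "AR=2.7201".
_FLOAT = re.compile(r"\d+\.\d+")


def count_overflows_by_ar(lines: Iterable[str], digits: int = 2) -> Counter:
    """Count logged overflows per angle range, rounded to `digits` decimals."""
    counts: Counter = Counter()
    for line in lines:
        match = _FLOAT.search(line)
        if match:  # skip blank or unparseable lines
            counts[round(float(match.group()), digits)] += 1
    return counts


if __name__ == "__main__":
    sample = ["AR=2.7201", "AR=2.7198", "", "AR=0.0834"]
    for ar, n in sorted(count_overflows_by_ar(sample).items()):
        print(f"AR {ar:.2f}: {n} overflow(s)")
```

To run it on a real log, pass an open file instead of the sample list, for example `count_overflows_by_ar(open(r"C:\r_debug.txt"))`.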
ID: 845413

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 845417 - Posted: 26 Dec 2008, 21:29:24 UTC - in response to Message 845410.  

...
The occasional problem where two CUDA apps get a "strongly similar" result may cause false signals to be added to the science database, but the possibility they'll be part of a persistency match is vanishingly small.
                                                               Joe

Not so small. It has already happened with an overflowed result.
CUDA can generate such results at amazing speed until the host is rebooted. So, left unattended, such a host can prepare a pretty big field for such false "validations".

I didn't mean to imply I think the problem is negligible. Matt said that 3% of validations involve CUDA processing, so the rate at which two CUDA apps are paired is about 1 in 1000. If about half the work on CUDA apps is getting false signal overflows, that's a lot of bad data going into the science database. However, any that say result_overflow are flagged as such when being put into the database, and the likelihood that another observation from another time will match the sky position, frequency, and signal type is small anyhow. NTPCKR will have more data to chew on, but I don't think it is going to identify these as possible candidates for reobservation.
                                                                Joe
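The "about 1 in 1000" figure can be checked with a short sketch. It assumes the 3% is the share of results coming from CUDA apps and that the two results of a workunit are assigned independently; both are simplifying assumptions, not something the project has stated. The second function applies the "about half are false overflows" guess the same way.

```python
def both_cuda_probability(cuda_share: float) -> float:
    """Chance that both results of a two-result workunit come from CUDA apps."""
    return cuda_share ** 2


def bad_pair_probability(cuda_share: float, overflow_rate: float) -> float:
    """Chance that both results are CUDA and both are false overflows."""
    return (cuda_share * overflow_rate) ** 2


if __name__ == "__main__":
    p = both_cuda_probability(0.03)
    print(f"both CUDA: {p:.4%} (about 1 in {round(1 / p)})")
    q = bad_pair_probability(0.03, 0.5)
    print(f"both CUDA and both overflowed: {q:.4%}")
```

With a 3% share this gives 0.03 squared, or 0.0009, which is roughly 1 in 1100. A host stuck producing overflows back to back would not behave independently, so the real rate could be higher than this estimate.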
ID: 845417

kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51527
Credit: 1,018,363,574
RAC: 1,004
United States
Message 845419 - Posted: 26 Dec 2008, 21:32:17 UTC

In my humble opinion.....

Cuda should be withdrawn from Seti Main back into the beta stage from whence it came.....until the 'bugs' are worked out.

In a scientific project, there is little room for tossing known bad data into the results, thereby invalidating what otherwise would be good work.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 845419

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 845425 - Posted: 26 Dec 2008, 21:40:23 UTC - in response to Message 845419.  

In my humble opinion.....

Cuda should be withdrawn from Seti Main back into the beta stage from whence it came.....until the 'bugs' are worked out.

In a scientific project, there is little room for tossing known bad data into the results, thereby invalidating what otherwise would be good work.


Agreed. But this can be done only by project staff. We all do what we can do....
ID: 845425

kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51527
Credit: 1,018,363,574
RAC: 1,004
United States
Message 845427 - Posted: 26 Dec 2008, 21:43:12 UTC - in response to Message 845425.  

In my humble opinion.....

Cuda should be withdrawn from Seti Main back into the beta stage from whence it came.....until the 'bugs' are worked out.

In a scientific project, there is little room for tossing known bad data into the results, thereby invalidating what otherwise would be good work.


Agreed. But this can be done only by project staff. We all do what we can do....

Of course, Raistmer.
I am not saying anything bad about your efforts to work with what was put forth.

I am suggesting that the Admins yank it from Main until it has proven itself in Beta.....it never should have been released here.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 845427

popandbob
Volunteer tester
Joined: 19 Mar 05
Posts: 551
Credit: 4,673,015
RAC: 0
Canada
Message 845429 - Posted: 26 Dec 2008, 21:48:34 UTC - in response to Message 845378.  

OMG... look here at this result of yours. It's an absolute record for the number of errors in a single result %)

It seems you should check your GPU stability before doing any OCing...


Hmmm.. I only set the low 3d clocks and 2d clocks to the same speed as performance 3d clocks... ATI tools claims it is stable (after a 30 min test)

WAIT!! I know why that happened... Your cuda app requested to connect to the internet... It was waiting for my permission... (Firewall)

Destination IP 75.154.132.100:53

Any Ideas why?


Do you Good Search for Seti@Home? http://www.goodsearch.com/?charityid=888957
Or Good Shop? http://www.goodshop.com/?charityid=888957
ID: 845429

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 845432 - Posted: 26 Dec 2008, 21:55:24 UTC - in response to Message 845429.  
Last modified: 26 Dec 2008, 22:00:42 UTC

OMG... look here at this result of yours. It's an absolute record for the number of errors in a single result %)

It seems you should check your GPU stability before doing any OCing...


Hmmm.. I only set the low 3d clocks and 2d clocks to the same speed as performance 3d clocks... ATI tools claims it is stable (after a 30 min test)

WAIT!! I know why that happened... Your cuda app requested to connect to the internet... It was waiting for my permission... (Firewall)

Destination IP 75.154.132.100:53

Any Ideas why?


Hm... no idea.
This IP resolves to cachednsab04.nssi.telus.com
Certainly not my host :)
Try checking the file with every antivirus tool you can get. Maybe my MSVC production host is infected by some trojan horse virus? ... I will check this too...

UPDATE: My current NOD32 version says the file is clean.
And I have an idea of what could have happened: could it be an app crash, with Windows launching its error reporting?
ID: 845432

popandbob
Volunteer tester
Joined: 19 Mar 05
Posts: 551
Credit: 4,673,015
RAC: 0
Canada
Message 845440 - Posted: 26 Dec 2008, 22:27:18 UTC - in response to Message 845432.  

And I have an idea of what could have happened: could it be an app crash, with Windows launching its error reporting?


No CUDA app errors in event viewer...
Haven't seen an app crash causing an error report screen (only computer crashes)



Do you Good Search for Seti@Home? http://www.goodsearch.com/?charityid=888957
Or Good Shop? http://www.goodshop.com/?charityid=888957
ID: 845440

Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9958
Credit: 103,452,613
RAC: 328
United Kingdom
Message 845443 - Posted: 26 Dec 2008, 22:32:47 UTC

I have installed the updated file. Will run with it and let you know.

Bernie
ID: 845443

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 845453 - Posted: 26 Dec 2008, 23:04:03 UTC - in response to Message 845443.  
Last modified: 26 Dec 2008, 23:51:02 UTC

I have installed the updated file. Will run with it and let you know.

Bernie

Fine :)
Just don't forget to check this executable with your latest antivirus software (as should be done for everything downloaded from the internet).

@popandbob
What did your antivirus say?

I don't like the idea that my server's security could be compromised in such a way... :/

UPDATE: DrWEB CureIT! says "clean" too.
ID: 845453

john deneer
Volunteer tester
Joined: 16 Nov 06
Posts: 331
Credit: 20,996,606
RAC: 0
Netherlands
Message 845460 - Posted: 26 Dec 2008, 23:50:55 UTC - in response to Message 845419.  

In my humble opinion.....

Cuda should be withdrawn from Seti Main back into the beta stage from whence it came.....until the 'bugs' are worked out.

In a scientific project, there is little room for tossing known bad data into the results, thereby invalidating what otherwise would be good work.


I fully agree. Imagine the scenario that somebody would build an optimized application for cpu crunching which would give a 10x faster performance compared to stock but resulting in incorrect results. This application would run on each and every system, resulting in speed increases on all systems, but generating wrong results on all systems as well. The guys building it would most likely not be inclined to distribute it, feeling a responsibility to distribute a correctly working program only. And if they did distribute it, their program would have to be installed 'manually' by crunchers anyway, and thus it would most probably get distributed only to the 5% or so that are 'on the ball' anyway. Since most of these faulty wu's would be paired with a wingman using stock, they would be discarded. So even if the developers of this faulty program behaved irresponsibly, the damage would be limited since most faulty results would be discarded.

Now look at the scenario that has been established for the cuda application. This thing is generating faulty results all over the place. If you upgrade to the newest version of boinc, and you are the happy owner of a cuda capable card (and there are a lot of computer enthusiasts using nvidia cards for gaming), you have automatically enabled the use of this card for crunching seti, since using the card is the default preference enabled for all users. There are people in the message boards who are stunned to find that their computer is crunching seti using the graphics card.

If the resulting havoc had been caused by some cruncher developing a great killer application in his spare time, thinking he was doing everybody a favor, he would have been called an idiot, somebody who obviously had no idea of what he was doing. The problem is that when you open the stage curtains what you get is not an idiot but the people who should have known better (at least, that's how I think about it).

Okay, I think I'll stop ranting now. I had to blow off some steam I guess :-)

Regards,
John.

PS: I'm actually using the cuda application on a machine that has 2 gt8800's. It seems to be producing reliable results, but who can tell. I'm using this system in the hope that it produces as many faulty results as possible. Unfortunately my dcf on that rig is such that I receive only very few wu's. Producing as many faulty results as possible will hopefully get the message across ..... That's the Christmas spirit for you. Oh sorry, still ranting :-)

ID: 845460

Byron S Goodgame
Volunteer tester
Joined: 16 Jan 06
Posts: 1145
Credit: 3,936,993
RAC: 0
United States
Message 845461 - Posted: 26 Dec 2008, 23:51:10 UTC
Last modified: 27 Dec 2008, 0:10:07 UTC

Got first validated result using new modified app.

382164460

This task wasn't a problem, but it's the first one that validated, even though I have other tasks that should have validated before this one.

using the new build released today now.

Edit: even though the "other task" isn't 2.7 AR I'm still curious why it isn't validating and no wingman is being sent out on it if there is a problem?
ID: 845461

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 845462 - Posted: 27 Dec 2008, 0:01:51 UTC - in response to Message 845461.  

Well, AR=2.7 and "valid" w/o overflows... fine.
ID: 845462

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 845502 - Posted: 27 Dec 2008, 1:46:21 UTC - in response to Message 845461.  

...even though the "other task" isn't 2.7 AR I'm still curious why it isn't validating and no wingman is being sent out on it if there is a problem?

Since about 1 pm Dec. 25 Berkeley time, the Validator has been falling behind, now "Workunits waiting for validation 197,689". I think it's still working, but slowly enough that it might be running as much as a day late.
                                                                  Joe
ID: 845502

Byron S Goodgame
Volunteer tester
Joined: 16 Jan 06
Posts: 1145
Credit: 3,936,993
RAC: 0
United States
Message 845505 - Posted: 27 Dec 2008, 1:51:27 UTC - in response to Message 845502.  

...even though the "other task" isn't 2.7 AR I'm still curious why it isn't validating and no wingman is being sent out on it if there is a problem?

Since about 1 pm Dec. 25 Berkeley time, the Validator has been falling behind, now "Workunits waiting for validation 197,689". I think it's still working, but slowly enough that it might be running as much as a day late.
                                                                  Joe

Thanks Joe, I thought there might be something I wasn't aware of, wrong with the task, and was concerned about doing more till I found out what it was.

I'm at ease now.
ID: 845505

popandbob
Volunteer tester
Joined: 19 Mar 05
Posts: 551
Credit: 4,673,015
RAC: 0
Canada
Message 845506 - Posted: 27 Dec 2008, 1:58:54 UTC - in response to Message 845462.  
Last modified: 27 Dec 2008, 2:04:09 UTC

Well, AR=2.7 and "valid" w/o overflows... fine.


I've had [edit] some [/edit] 2.71's w/o overflow and [edit] most [/edit] 2.72's with overflows
Still waiting on the validator to see if they are valid or not...

My anti-virus (zone alarm) Says my PC is clear of all spyware/viruses...
Maybe this will have to go under the "Microsoft Mystery" folder..lol


Do you Good Search for Seti@Home? http://www.goodsearch.com/?charityid=888957
Or Good Shop? http://www.goodshop.com/?charityid=888957
ID: 845506

enusbaum
Volunteer tester
Joined: 29 Apr 00
Posts: 15
Credit: 5,921,750
RAC: 0
United States
Message 845548 - Posted: 27 Dec 2008, 4:59:52 UTC

I also think the issue is partly driver related.

I'm running an 8800GTX on the 180.84 beta drivers and started getting nothing but -9 overflows until I rebooted and that seemed to also fix the problem.

So this all might be just a combination of immature drivers and the API perhaps not clearing out resources after a crash? I'm not too familiar with the CUDA API.
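If the card really does stay in a bad state until a reboot, the visible symptom is a run of back-to-back -9 overflows. Below is a minimal sketch of a check for that, assuming a host's recent results are available as a list of booleans (True means overflow). The function names and the threshold of 5 are arbitrary choices for illustration, and nothing here talks to BOINC or the CUDA API.

```python
from typing import Iterable


def longest_overflow_streak(overflowed: Iterable[bool]) -> int:
    """Length of the longest run of consecutive overflow results."""
    longest = current = 0
    for flag in overflowed:
        current = current + 1 if flag else 0
        longest = max(longest, current)
    return longest


def needs_attention(overflowed: Iterable[bool], threshold: int = 5) -> bool:
    """True when the streak is long enough to suspect a stuck GPU or driver."""
    return longest_overflow_streak(overflowed) >= threshold


if __name__ == "__main__":
    recent = [False, True, True, True, True, True, True]
    if needs_attention(recent):
        print("Long overflow streak: consider rebooting before crunching more.")
```

Isolated "legal" overflows would not trip the check; only an unbroken run would.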
ID: 845548


 
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.