Broken CUDA host - lots of inconclusives, few seconds run time

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1423409 - Posted: 3 Oct 2013, 2:42:36 UTC - in response to Message 1423115.  
Last modified: 3 Oct 2013, 2:45:41 UTC

Yes, invalid 30 overflows occur on pulses for that GPU, it seems.
But this and many other hosts show quite clearly that the current quota management system requires reconsideration. I wrote to the BOINC dev mailing list about it. Ignored so far (not a single reply).

Another direction would be pushing NV to properly detect such an error condition and report it via the CUDA runtime so the app could deal with it (most probably some memory buffer corruption here). Currently there are no errors to handle from the app's point of view. Unfortunately, we have the same situation with some broken OpenCL environments, so I think it's still easier and more realistic to provide a defense at the BOINC level.


I've been exploring options here for some time (of which there are many), which include an extremely small amount of reprocessing host side. I've come to recognise that the template BOINC response to client-side fault tolerance is usually "probably more trouble than it's worth". We already know that approach comes undone with thread safety (exit handling), and there are others. So I am likely to embed several levels of monitoring & corrective actions for the x42 series, which should include *some method* of alerting the user to obvious problems if they so choose.
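
One way to picture "an extremely small amount of reprocessing host side" is a spot check: recompute a handful of the GPU's outputs on the CPU and compare them. The sketch below is only an illustration in Python, not the x42 design. The names `spot_check` and `power`, the sample size, and the tolerances are all hypothetical.

```python
import math
import random


def spot_check(inputs, gpu_outputs, cpu_reference, sample_size=4,
               rel_tol=1e-5, rng=None):
    """Recompute a few outputs on the host and compare with the GPU's.

    Returns the sorted indices whose GPU value disagrees with the host
    recomputation. An empty list means the sample looked sane.
    All names and thresholds here are hypothetical.
    """
    if len(inputs) != len(gpu_outputs):
        raise ValueError("inputs and gpu_outputs must have equal length")
    rng = rng or random.Random()
    count = min(sample_size, len(inputs))
    bad = []
    for i in sorted(rng.sample(range(len(inputs)), count)):
        expected = cpu_reference(inputs[i])
        got = gpu_outputs[i]
        # NaN or infinity coming back from the device counts as corruption.
        if not math.isfinite(got) or not math.isclose(
                got, expected, rel_tol=rel_tol, abs_tol=1e-9):
            bad.append(i)
    return bad


def power(sample):
    """Stand-in for a real kernel: power of one complex sample."""
    return sample.real ** 2 + sample.imag ** 2


if __name__ == "__main__":
    data = [complex(k, -k) for k in range(8)]
    good = [power(z) for z in data]
    corrupt = list(good)
    corrupt[3] = float("nan")  # simulate a trashed buffer entry
    print(spot_check(data, good, power, sample_size=8))
    print(spot_check(data, corrupt, power, sample_size=8))
```

A non-empty return value is where an application could choose its corrective action, for example retrying, falling back, or alerting the user. Which of those is appropriate is left open here.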
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1423409
Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 1423674 - Posted: 3 Oct 2013, 17:56:00 UTC

Looked at my newest inconclusive:

http://setiathome.berkeley.edu/workunit.php?wuid=1313648942


So I looked at the machine:

http://setiathome.berkeley.edu/results.php?hostid=6253478

Then at one of his valids:

http://setiathome.berkeley.edu/result.php?resultid=3016631119

It was validated against another gpu:

http://setiathome.berkeley.edu/show_host_detail.php?hostid=6253478

With this in the end of his std error:

Device cannot be used
Cuda initialisation FAILED, Initiating Boinc temporary exit (180 secs)
Preemptively Acknowledging temporary exit -> boinc_exit(): requesting safe worker shutdown ->
boinc_exit(): received safe worker shutdown acknowledge ->

Does not look so good to me.



Dave

ID: 1423674
Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 1423769 - Posted: 3 Oct 2013, 20:34:33 UTC - in response to Message 1423674.  

Looked at my newest inconclusive:

http://setiathome.berkeley.edu/workunit.php?wuid=1313648942


So I looked at the machine:

http://setiathome.berkeley.edu/results.php?hostid=6253478

Then at one of his valids:

http://setiathome.berkeley.edu/result.php?resultid=3016631119

It was validated against another gpu:

http://setiathome.berkeley.edu/show_host_detail.php?hostid=6253478

With this in the end of his std error:

Device cannot be used
Cuda initialisation FAILED, Initiating Boinc temporary exit (180 secs)
Preemptively Acknowledging temporary exit -> boinc_exit(): requesting safe worker shutdown ->
boinc_exit(): received safe worker shutdown acknowledge ->

Does not look so good to me.




EDIT:

Sorry, the last link should have been:


http://setiathome.berkeley.edu/show_host_detail.php?hostid=6818618
Dave

ID: 1423769
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1423819 - Posted: 3 Oct 2013, 23:45:06 UTC - in response to Message 1423769.  

The "valid" status for those WU 1253963841 tasks is apparently from a credit granting script. Note there's no canonical result shown, and it's a SaH v6 WU which hasn't been reissued although the third wingmate timed out.
                                                                   Joe
ID: 1423819
Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 1423838 - Posted: 4 Oct 2013, 0:16:24 UTC

I see all that you are saying.

But,

How can an error and an apparently valid result be validated?

Are we polluting the database with issues like this?

Dave

ID: 1423838
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1423917 - Posted: 4 Oct 2013, 5:18:31 UTC - in response to Message 1423838.  

I see all that you are saying.

But,

How can an error and an apparently valid result be validated?

Are we polluting the database with issues like this?

In order for BOINC to grant credit, the tasks MUST be marked as valid. But because there was no canonical result, nothing could be assimilated. If the third task had come back and actually compared strongly similar to the non-errored Task 3016631120 then there would have been a correct canonical result chosen.

Basically there is no harm done by the credit granting script, other than giving a faint odor of rotting fish to all credits by sometimes granting unearned credit for tasks which errored.
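
The point about the canonical result can be pictured with a toy model. This is a sketch in Python under stated assumptions, not the actual BOINC or SETI@home validator: `Task`, `strongly_similar`, `pick_canonical`, and the single-number comparison with a relative tolerance are all invented for illustration.

```python
import math
from typing import NamedTuple, Optional, Sequence


class Task(NamedTuple):
    task_id: int
    errored: bool
    value: Optional[float]  # stand-in for the real list of signals


def strongly_similar(a: Task, b: Task, rel_tol: float = 1e-3) -> bool:
    """Toy comparison; a real validator compares the reported signals."""
    return math.isclose(a.value, b.value, rel_tol=rel_tol)


def pick_canonical(tasks: Sequence[Task], quorum: int = 2) -> Optional[Task]:
    """Return a canonical task if `quorum` non-errored tasks agree."""
    usable = [t for t in tasks if not t.errored and t.value is not None]
    for candidate in usable:
        agreeing = [t for t in usable if strongly_similar(candidate, t)]
        if len(agreeing) >= quorum:
            return candidate
    return None  # no canonical result, so nothing to assimilate


if __name__ == "__main__":
    errored = Task(1, True, None)
    good = Task(2, False, 42.0)
    late = Task(3, False, 42.0001)
    print(pick_canonical([errored, good]))        # one usable task only
    print(pick_canonical([errored, good, late]))  # two usable tasks agree
```

With one errored task and one good task there is no quorum, so the function returns `None`. Marking tasks valid for credit is a separate step that this model leaves out.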
                                                                   Joe
ID: 1423917
Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1426841 - Posted: 10 Oct 2013, 21:39:24 UTC

Host http://setiathome.berkeley.edu/results.php?hostid=6798051 is still going headstrong into producing mostly invalids and errors. I'm all for something like what CPDN is doing: produce only errors and your system gets a technical time-out from the server; you won't get any more work until you've proven to the moderators that your system can do productive work again.
ID: 1426841
Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 1426925 - Posted: 11 Oct 2013, 2:31:26 UTC

I agree with the previous.

Valuable project resources are being consumed by crunchers who are producing junk.

Dave

ID: 1426925
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1427057 - Posted: 11 Oct 2013, 13:27:02 UTC - in response to Message 1426841.  

Host http://setiathome.berkeley.edu/results.php?hostid=6798051 is still going headstrong into producing mostly invalids and errors. I'm all for something like what CPDN is doing: produce only errors and your system gets a technical time-out from the server; you won't get any more work until you've proven to the moderators that your system can do productive work again.

I don't know if human involvement on the project end is a good idea. Too many potential contacts having to be handled.

Maybe something where Boinc Manager throws up a flag in the system tray and the user has to go through a short list of things to check and possibly correct. Then the host is allowed to work again on a probationary basis.

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1427057
Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1427147 - Posted: 11 Oct 2013, 15:50:13 UTC - in response to Message 1427057.  

Maybe something where Boinc Manager throws up a flag in the system tray and the user has to go through a short list of things to check and possibly correct. Then the host is allowed to work again on a probationary basis.

Too easily circumvented. Anything not blocked by the server, written in stone in the database, will easily be circumvented by those who know how to. Otherwise they'll ask on the forums until they find someone who does.

CPDN isn't too small either, with half a million hosts. And with work models that usually take several days, if not months, you do not want hosts to trash that work as if it's peanuts. They've got a compact group of moderators who will email their administrator(s) when they find one or more hosts that are trashing work, and then doing so by the hundreds or thousands.

Then the administrator will just manually set work fetch capability for such a host to -1. Meaning that host won't be able to fetch work. The user will also be emailed, that the project has blocked work fetch on that host, and told what steps to follow before work is allowed again. Even anonymous system owners will be reached this way, since the administrator can read the email address the person registered with.

I'd think that being emailed by the administrator of the project you run hits home a little harder than General Joe and all his flunkies PMing you about it. ;-)

Translated to Seti that could be:
1. A thread in which people report computers going too far.
2. The moderators will read that thread and check into the systems being reported.
3. When found viable, forward the information about the host to the administrator.
4. When it's convenient for the administrator, block the host and send out an email to its owner with steps on how to remedy the troubles of that host. Normally this means, come post for help in the forums.
5. When the user has been helped, he can contact the administrator and his system can be freed.

Not many contacts needed there.
ID: 1427147
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1428565 - Posted: 14 Oct 2013, 17:38:32 UTC - in response to Message 1427147.  

Maybe something where Boinc Manager throws up a flag in the system tray and the user has to go through a short list of things to check and possibly correct. Then the host is allowed to work again on a probationary basis.

Too easily circumvented. Anything not blocked by the server, written in stone in the database, will easily be circumvented by those who know how to. Otherwise they'll ask on the forums until they find someone who does.

As I said, the host would go back to work on a probationary basis. The servers would keep a closer than normal watch on it and if it started doing trash again after the owner had done something (and, I suppose, somehow acknowledged being advised of the problem and that they had done something about it), then it would be more severely restricted and the contact escalated. I can't see any motive for anyone to deliberately return to doing bad work after being told about it.
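
The probationary idea described above can be sketched as a small state machine. This is a hypothetical illustration in Python. The state names, the event names, and the transition table are assumptions, and nothing like this is known to exist in BOINC.

```python
from enum import Enum


class HostState(Enum):
    NORMAL = "normal"
    FLAGGED = "flagged"        # user notified, waiting for acknowledgement
    PROBATION = "probation"    # working again, watched more closely
    RESTRICTED = "restricted"  # trashed work again, contact escalated


# (current state, event) -> next state; all names are hypothetical.
TRANSITIONS = {
    (HostState.NORMAL, "bad_results"): HostState.FLAGGED,
    (HostState.FLAGGED, "user_acknowledged"): HostState.PROBATION,
    (HostState.PROBATION, "good_results"): HostState.NORMAL,
    (HostState.PROBATION, "bad_results"): HostState.RESTRICTED,
}


def next_state(state: HostState, event: str) -> HostState:
    """Advance a host one step; unknown combinations leave it unchanged."""
    return TRANSITIONS.get((state, event), state)


if __name__ == "__main__":
    state = HostState.NORMAL
    for event in ("bad_results", "user_acknowledged", "bad_results"):
        state = next_state(state, event)
        print(event, "->", state.value)
```

Keeping the table on the server side is the design choice that matters for the objection quoted above: the client only reports results, and the decision about what the host may do next is never left to the client.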

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1428565
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1428579 - Posted: 14 Oct 2013, 17:56:34 UTC

One of the problems facing SETI@Home is that a number of users do not use "real" email addresses, but rather "time expired" ones. Thus contacting them is next to impossible. One thought is that BOINC accounts should be "life limited", say to one year. Thus after a year you would get an email requesting that you validate that you are a living, responding human.
In the same vein I would like to see the automatic banishment of "anonymous" accounts, as they appear to be disproportionately represented in the "WU Trashers League".
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1428579
Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 1428707 - Posted: 14 Oct 2013, 22:25:44 UTC - in response to Message 1428579.  

In the same vein I would like to see the automatic banishment of "anonymous" accounts, as they appear to be disproportionately represented in the "WU Trashers League".

(Most of) those aren't anonymous accounts; the server just names them that way if a non-anonymous user has his/her computers hidden (otherwise the hiding wouldn't work ;-).

Gruß
Gundolf
ID: 1428707
Uli
Volunteer tester
Joined: 6 Feb 00
Posts: 10923
Credit: 5,996,015
RAC: 1
Germany
Message 1428744 - Posted: 15 Oct 2013, 1:36:43 UTC

I don't like hidden computers. I agree with Rob, a viable e-mail addy should be a must.
Then again, the Admins set the rules.
Pluto will always be a planet to me.

Seti Ambassador
Not too late to order an Anni Shirt
ID: 1428744
Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 1428895 - Posted: 15 Oct 2013, 7:55:51 UTC - in response to Message 1428744.  

I don't like hidden computers. I agree with Rob, a viable e-mail addy should be a must.
Then again, the Admins set the rules.

And the admins can see all email addresses, even those of users with hidden computers. It's just that we 'normal' users can't see the addresses, even if the hosts are not hidden.

Only users with expired or invalid email addresses aren't reachable, but those would just have to ask on the fora if they see their hosts being cut off.

Gruß
Gundolf
ID: 1428895
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1428957 - Posted: 15 Oct 2013, 20:30:31 UTC

I'm not too worried about seeing an email address, but there are times when being able to use the PM system to contact the user of an errant PC could be of use.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1428957
Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 1429098 - Posted: 16 Oct 2013, 4:51:08 UTC

Here is another one for the list.

http://setiathome.berkeley.edu/results.php?hostid=452306

State: All (127) · In progress (32) · Validation pending (4) · Validation inconclusive (5) · Valid (0) · Invalid (1) · Error (85)
Application: All (127) · AstroPulse v6 (10) · SETI@home Enhanced (0) · SETI@home v7 (117)
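
For anyone collecting hosts for such a list, the summary line above can be turned into numbers. The following is a rough Python sketch. `parse_counts` and `error_share` are hypothetical helper names, and the parsing assumes the line keeps the exact "Label (N) · Label (N)" shape shown here.

```python
import re

STATE_LINE = ("State: All (127) · In progress (32) · Validation pending (4) · "
              "Validation inconclusive (5) · Valid (0) · Invalid (1) · Error (85)")


def parse_counts(line):
    """Parse 'Label (N)' pairs from a task-list summary line into a dict."""
    _, _, body = line.partition(":")
    return {name.strip(): int(num)
            for name, num in re.findall(r"([^·()]+)\((\d+)\)", body)}


def error_share(counts):
    """Fraction of all listed tasks that ended in error."""
    return counts["Error"] / counts["All"] if counts["All"] else 0.0


if __name__ == "__main__":
    counts = parse_counts(STATE_LINE)
    print(counts)
    print(f"error share: {error_share(counts):.1%}")
```

For this host the counts give 85 errors out of 127 listed tasks, roughly two thirds, with no valid results at all.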
Dave

ID: 1429098
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1434876 - Posted: 28 Oct 2013, 19:03:14 UTC - in response to Message 1428579.  


In the same vein I would like to see the automatic banishment of "anonymous" accounts, as they appear to be disproportionately represented in the "WU Trashers League".

My own "gut" feeling, too, was that Anonymous accounts seemed to show up more frequently as "bad" wingmen than did identifiable accounts. However, I finally got around to pulling some numbers from my own WU database, and found that they're actually not that much worse, only a little bit!

Out of about 80,000 WUs that I've processed in the last 6 months or so, I've had 89,948 wingmen whose tasks are either completed and validated, or have failed in some fashion. (I've excluded those that are still in progress or are still in an inconclusive state.)

Of those, 12,613 wingmen were Anonymous, and of that number, 10,989 were completed and validated, while 1,624 failed in some fashion (computation errors, invalids, download errors, abortions, abandonments, time outs, etc.). That represents a 12.88% failure rate.

On the other hand, 77,335 wingmen were identifiable, with 69,090 completed and validated, while 8,245 failed. That represents a 10.66% failure rate. Not all that much better!
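
The arithmetic behind those two figures is easy to re-run. A minimal Python sketch follows, with a hypothetical `failure_rate` helper and the counts copied from the post. The results should agree with the quoted percentages to within rounding.

```python
def failure_rate(validated: int, failed: int) -> float:
    """Failed tasks as a fraction of all finished tasks."""
    total = validated + failed
    if total == 0:
        raise ValueError("no finished tasks")
    return failed / total


# Counts as quoted in the post.
anonymous = failure_rate(validated=10_989, failed=1_624)
identifiable = failure_rate(validated=69_090, failed=8_245)

if __name__ == "__main__":
    print(f"anonymous:    {anonymous:.2%}")
    print(f"identifiable: {identifiable:.2%}")
    print(f"difference:   {(anonymous - identifiable) * 100:.2f} percentage points")
```

The gap is a little over two percentage points, which fits the conclusion that Anonymous wingmen are only slightly worse.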
ID: 1434876
Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 1439290 - Posted: 7 Nov 2013, 2:03:13 UTC

http://setiathome.berkeley.edu/results.php?hostid=7100749

Does not look like his GPU is doing so well on MB tasks
Dave

ID: 1439290
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1439316 - Posted: 7 Nov 2013, 2:41:30 UTC - in response to Message 1434876.  
Last modified: 7 Nov 2013, 2:41:58 UTC

Thanks for those numbers! :) I did try to probe the project directly for some reliability information some time back, but it resulted in silence (they are busy of course).

There are some simple, very low computational cost, approaches I'm looking at designing into x42 Cuda series, that might even aid in the '560ti' situation (as the host indicated by Dave in the post before mine).

Frankly, for a long time, I was baffled by the apparent difficulty running these, and wondered whether there were some weird unknown application bugs, etc., specific to the GPU. The apparent near flawless performance of my own 560ti here made things even more interesting.

In short, there are ways that can be used application side to detect the majority & take appropriate measures (whatever they be). Over time it's become clear that the 'performance-on-a-budget' market, in particular, is the most sensitive to basic system build issues... something like street hot-dog vendors selling F15 Eagle fighter aircraft.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1439316


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.