Broken CUDA host - lots of inconclusives, few seconds run time
Author | Message |
---|---|
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Yes, invalid 30 overflows occur on pulses for that GPU, it seems. I've been exploring options here for some time (of which there are many), which include an extremely small amount of host-side reprocessing. I've come to recognise that the template BOINC response to client-side fault tolerance is usually "probably more trouble than it's worth". We already know that approach comes undone with thread safety (exit handling), and there are other cases. So I am likely to embed several levels of monitoring and corrective actions for the x42 series, which should include *some method* of alerting users to obvious problems, if they so choose. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Dave Stegner Send message Joined: 20 Oct 04 Posts: 540 Credit: 65,583,328 RAC: 27 |
Looked at my newest inconclusive: http://setiathome.berkeley.edu/workunit.php?wuid=1313648942 So I looked at the machine: http://setiathome.berkeley.edu/results.php?hostid=6253478 Then at one of his valids: http://setiathome.berkeley.edu/result.php?resultid=3016631119 It was validated against another GPU: http://setiathome.berkeley.edu/show_host_detail.php?hostid=6253478 With this at the end of its stderr output: Device cannot be used Cuda initialisation FAILED, Initiating Boinc temporary exit (180 secs) Preemptively Acknowledging temporary exit -> boinc_exit(): requesting safe worker shutdown -> boinc_exit(): received safe worker shutdown acknowledge -> Does not look so good to me. Dave |
Dave Stegner Send message Joined: 20 Oct 04 Posts: 540 Credit: 65,583,328 RAC: 27 |
Looked at my newest inconclusive: EDIT: Sorry, the last link should have been: http://setiathome.berkeley.edu/show_host_detail.php?hostid=6818618 Dave |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
The "valid" status for those WU 1253963841 tasks is apparently from a credit granting script. Note there's no canonical result shown, and it's a SaH v6 WU which hasn't been reissued although the third wingmate timed out. Joe |
Dave Stegner Send message Joined: 20 Oct 04 Posts: 540 Credit: 65,583,328 RAC: 27 |
I see all that you are saying. But how can an error and an apparently valid result be validated? Are we polluting the database with issues like this? Dave |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Quote: "I see all that you are saying." In order for BOINC to grant credit, the tasks MUST be marked as valid. But because there was no canonical result, nothing could be assimilated. If the third task had come back and actually compared strongly similar to the non-errored Task 3016631120 then there would have been a correct canonical result chosen. Basically there is no harm done by the credit granting script, other than giving a faint odor of rotting fish to all credits by sometimes granting unearned credit for tasks which errored. Joe |
Jord Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
Host http://setiathome.berkeley.edu/results.php?hostid=6798051 is still going headlong into producing mostly invalids and errors. I'm all for something like what CPDN is doing: produce only errors and your system gets a technical time-out from the server, and you won't get any more work until you've proven to the moderators that your system can do productive work again. |
Dave Stegner Send message Joined: 20 Oct 04 Posts: 540 Credit: 65,583,328 RAC: 27 |
I agree with the previous. Valuable project resources are being consumed by crunchers who are producing junk. Dave |
David S Send message Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12 |
Quote: "Host http://setiathome.berkeley.edu/results.php?hostid=6798051 is still going headlong into producing mostly invalids and errors. I'm all for something like what CPDN is doing: produce only errors and your system gets a technical time-out from the server, and you won't get any more work until you've proven to the moderators that your system can do productive work again." I don't know if human involvement on the project end is a good idea. Too many potential contacts would have to be handled. Maybe something where Boinc Manager throws up a flag in the system tray and the user has to go through a short list of things to check and possibly correct. Then the host is allowed to work again on a probationary basis. David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. |
Jord Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
Quote: "Maybe something where Boinc Manager throws up a flag in the system tray and the user has to go through a short list of things to check and possibly correct. Then the host is allowed to work again on a probationary basis." Too easily circumvented. Anything not blocked by the server, written in stone in the database, will easily be circumvented by those who know how to. Or else they'll ask on forums until they find someone who knows how to. CPDN isn't too small either, with half a million hosts. And with work models that usually take several days, if not months, you do not want hosts to trash that work as if it's peanuts. They've got a compact group of moderators who will email their administrator(s) when they find one or more hosts that are trashing work, and doing so by the hundreds or thousands. Then the administrator will just manually set work fetch capability for such a host to -1, meaning that host won't be able to fetch work. The user will also be emailed that the project has blocked work fetch on that host, and told what steps to follow before work is allowed again. Even anonymous system owners will be reached this way, since the administrator can read the email address the person registered with. I'd think that being emailed by the administrator of the project you run hits home a little harder than General Joe and all his flunkies PMing you about it. ;-) Translated to Seti that could be: 1. A thread in which people report computers going too far. 2. The moderators will read that thread and check into the systems being reported. 3. When the reports are found to be valid, forward the information about the host to the administrator. 4. When it's convenient for the administrator, block the host and send out an email to its owner with steps on how to remedy the troubles of that host. Normally this means, come post for help in the forums. 5. When the user has been helped, he can contact the administrator and his system can be freed. Not many contacts needed there. |
David S Send message Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12 |
Quote: "Maybe something where Boinc Manager throws up a flag in the system tray and the user has to go through a short list of things to check and possibly correct. Then the host is allowed to work again on a probationary basis." As I said, the host would go back to work on a probationary basis. The servers would keep a closer than normal watch on it and if it started doing trash again after the owner had done something (and, I suppose, somehow acknowledged being advised of the problem and that they had done something about it), then it would be more severely restricted and the contact escalated. I can't see any motive for anyone to deliberately return to doing bad work after being told about it. David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. |
rob smith Send message Joined: 7 Mar 03 Posts: 22241 Credit: 416,307,556 RAC: 380 |
One of the problems facing SETI@Home is that a number of users do not use "real" email addresses, but rather "time expired" ones. Thus contacting them is next to impossible. One thought is that BOINC accounts should be "life limited", say to one year. Thus after a year you would get an email requesting that you validate that you are a living, responding human. In the same vein I would like to see the automatic banishment of "anonymous" accounts, as they appear to be disproportionately represented in the "WU Trashers League". Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Gundolf Jahn Send message Joined: 19 Sep 00 Posts: 3184 Credit: 446,358 RAC: 0 |
Quote: "In the same vein I would like to see the automatic banishment of "anonymous" accounts, as they appear to be disproportionately represented in the "WU Trashers League"." (Most of) those aren't anonymous accounts; they are just labelled as such by the server if the non-anonymous user has hidden his/her computers (otherwise the hiding wouldn't work ;-). Gruß Gundolf |
Uli Send message Joined: 6 Feb 00 Posts: 10923 Credit: 5,996,015 RAC: 1 |
I don't like hidden computers. I agree with Rob, a viable e-mail addy should be a must. Then again, the Admins set the rules. Pluto will always be a planet to me. Seti Ambassador Not too late to order an Anni Shirt |
Gundolf Jahn Send message Joined: 19 Sep 00 Posts: 3184 Credit: 446,358 RAC: 0 |
Quote: "I don't like hidden computers. I agree with Rob, a viable e-mail addy should be a must." And the admins can see all email addresses, even those of users with hidden computers. It's just that we 'normal' users can't see the addresses, even if the hosts are not hidden. Only users with expired or invalid email addresses aren't reachable, but those would just have to ask on the fora if they see their hosts being cut off. Gruß Gundolf |
rob smith Send message Joined: 7 Mar 03 Posts: 22241 Credit: 416,307,556 RAC: 380 |
I'm not too worried about seeing an email address, but there are times when being able to use the PM system to contact the user of an errant PC could be of use. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Dave Stegner Send message Joined: 20 Oct 04 Posts: 540 Credit: 65,583,328 RAC: 27 |
Here is another one for the list. http://setiathome.berkeley.edu/results.php?hostid=452306 State: All (127) · In progress (32) · Validation pending (4) · Validation inconclusive (5) · Valid (0) · Invalid (1) · Error (85) Application: All (127) · AstroPulse v6 (10) · SETI@home Enhanced (0) · SETI@home v7 (117) Dave |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
My own "gut" feeling, too, was that Anonymous accounts seemed to show up more frequently as "bad" wingmen than did identifiable accounts. However, I finally got around to pulling some numbers from my own WU database, and found that they're actually not that much worse, only a little bit! Out of about 80,000 WUs that I've processed in the last 6 months or so, I've had 89,948 wingmen whose tasks are either completed and validated, or have failed in some fashion. (I've excluded those that are still in progress or are still in an inconclusive state.) Of those, 12,613 wingmen were Anonymous, and of that number, 10,989 were completed and validated, while 1,624 failed in some fashion (computation errors, invalids, download errors, abortions, abandonments, time outs, etc.). That represents a 12.88% failure rate. On the other hand, 77,335 wingmen were identifiable, with 69,090 completed and validated, while 8,245 failed. That represents a 10.66% failure rate. Not all that much better! |
Dave Stegner Send message Joined: 20 Oct 04 Posts: 540 Credit: 65,583,328 RAC: 27 |
http://setiathome.berkeley.edu/results.php?hostid=7100749 Does not look like his GPU is doing so well on MB tasks Dave |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Thanks for those numbers! :) I did try to probe the project directly for some reliability information some time back, but it resulted in silence (they are busy of course). There are some simple, very low computational cost approaches I'm looking at designing into the x42 Cuda series that might even aid in the '560ti' situation (as with the host indicated by Dave in the post before mine). Frankly, for a long time, I was baffled by the apparent difficulty running these, and wondered whether there were some weird unknown application bugs etc. specific to the GPU. The apparent near flawless performance of my own 560ti here made things even more interesting. In short, there are ways that can be used application side to detect the majority of such problems and take appropriate measures (whatever they be). Over time it's become clear that the 'performance-on-a-budget' market, in particular, is the most sensitive to basic system build issues... something like street hot-dog vendors selling F15 Eagle fighter aircraft. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
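
The failure rates Jeff Buck quotes above can be recomputed from his raw counts. What follows is only a minimal Python sketch, not anything taken from his WU database: the `WingmanGroup` class and its field names are hypothetical, and the only figures that come from the thread are the four validated/failed counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WingmanGroup:
    """Hypothetical container for one class of wingmen (names are illustrative)."""
    label: str
    validated: int  # tasks completed and validated
    failed: int     # errors, invalids, aborts, time outs, etc.

    @property
    def total(self) -> int:
        return self.validated + self.failed

    @property
    def failure_rate_pct(self) -> float:
        # Failed tasks as a percentage of all finished tasks in the group.
        return 100.0 * self.failed / self.total


# Counts as quoted in the post; in-progress and inconclusive tasks are excluded.
anonymous = WingmanGroup("Anonymous", validated=10_989, failed=1_624)
identifiable = WingmanGroup("Identifiable", validated=69_090, failed=8_245)

if __name__ == "__main__":
    for group in (anonymous, identifiable):
        print(f"{group.label:<13} total={group.total:>6} "
              f"failure rate={group.failure_rate_pct:.2f}%")
    # Gap between the two groups, in percentage points.
    gap = anonymous.failure_rate_pct - identifiable.failure_rate_pct
    print(f"Difference: {gap:.2f} percentage points")
```

The two group totals add up to the 89,948 wingmen mentioned in the post, which is a quick sanity check that no category was counted twice.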