Message boards : Number crunching : Major problem with a user
Cosmic_Ocean (Joined: 23 Dec 00, Posts: 3027, Credit: 13,516,867, RAC: 13)

I apologize if my thoughts were misunderstood. I was thinking of perhaps another way to get the attention of those few who tend to "set and forget", as well as those who feel it is "beneath them" to respond to sincere inquiries about the possible failure of their systems. I think we all pretty much agree on that one. Yes, there are "set and forget" people who will never look into anything ever again. They signed up, got crunching, and as I said before, they see their computer is always doing something and there are usually a bunch of WUs in the cache, so they figure they're doing just fine. They don't bother looking at their task pages to see that they are just spewing out invalids.

On the other hand, there are machines that probably were reliable and stable, but either got incompatible drivers, or the GPU is overheating now, or the GPU is just on its way to death, so it spews out invalids.

In either case, having some kind of automated system to limit the amount of damage such hosts can do is the best solution (that's the job of the quota system). Yes, there are apps coming through the pipeline that can detect some of the common signs of a runaway task and abort it with a non-zero exit code, which will keep invalid -9 overflows out of the DB a bit better, but you can't make an app that handles any and all situations. Every app is going to have compatibility issues with certain drivers along the way, so that's where the quota system steps in and does damage control.

As stated a few posts back, like what they do on CPDN, if a machine goes rogue and the person cares enough to fix it, they can be allowed to try again. Except in this case, the quota system is automated and self-governing: if you trash a ton of WUs, you'll be stuck with 1 WU/day until you fix the problem. Once you fix the problem, you crunch your 1 WU, return it, it is valid, +2 gets added to your quota, and now you have 3 to try. All three of those are good, you get +2 for each one, now you're up to 9. Do those 9, they're good, you're up to 27, and so on. Start trashing them again... back down to 1 in a hurry.

With that kind of system, if someone gets dropped down to 1/day, those set-and-forget people who don't think anything is wrong will see that their machine isn't crunching anything anymore, or that they only have one task, and they'll come at least to the front page here to find out what's going on. That means that for a month or so after the quota system gets a few values changed and applied, there will need to be a notice on the front page about the new quota system that says "if you have only one task per day, something is wrong with your machine. Head over to Number Crunching and ask what's wrong, and somebody will help you get it sorted out."

Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up)
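The quota arithmetic described above can be sketched as a small simulation. This is not the actual BOINC server code; the +2-per-valid-result rule and the recovery sequence (1 → 3 → 9 → 27) come from the post, while the cut-on-error rule, the floor of 1, and the `MAX_QUOTA` cap are assumptions for illustration.

```python
# Sketch of the self-governing daily-quota scheme described above.
# Assumption: an errored/invalid result halves the quota (any "drops fast"
# rule would do); the post only says you end up "back down to 1 in a hurry".

MAX_QUOTA = 100  # hypothetical per-host daily cap


def update_quota(quota: int, result_valid: bool) -> int:
    """Return the new daily quota after one reported result."""
    if result_valid:
        return min(quota + 2, MAX_QUOTA)
    # Trashing work drops the quota fast; the floor of 1 means the host
    # always gets one task per day to prove itself with.
    return max(quota // 2, 1)


def simulate(quota: int, results: list) -> int:
    """Apply a batch of reported results to a host's quota."""
    for ok in results:
        quota = update_quota(quota, ok)
    return quota


# A fixed host recovering: 1 valid task -> quota 3 -> all 3 valid -> 9 -> 27.
q = 1
for _ in range(3):
    q = simulate(q, [True] * q)
print(q)  # 27
```

Note the geometric recovery: if all `q` tasks in a day validate, the new quota is `q + 2q = 3q`, which reproduces the 1, 3, 9, 27 sequence in the post.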
Brent Norman (Joined: 1 Dec 99, Posts: 2786, Credit: 685,657,289, RAC: 835)

I think one problem with BOINC/SETI is...
- user adds SETI
- user watches times
- user overclocks to improve times
- user has no idea that they are all errors (since they never look at the web page)
- user thinks they are doing well with great throughput

In the end, BOINC doesn't tell you that you are running errors, and a new user has no idea what stats are normal. If BOINC could implement a "Report Back" system to show errors, maybe things would change.
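The "Report Back" idea above could look something like the sketch below: keep the outcomes of the last N reported tasks on the client side and raise a warning the user can't miss once the failure fraction crosses a threshold. The class name, window size, and threshold are all invented for illustration; nothing here is a real BOINC API.

```python
# Hypothetical client-side monitor for the "Report Back" idea: warn the
# user when too many recently reported tasks failed or were invalid.
from collections import deque


class OutcomeMonitor:
    def __init__(self, window: int = 50, threshold: float = 0.5):
        self.window = deque(maxlen=window)  # True = good, False = error/invalid
        self.threshold = threshold

    def report(self, task_ok: bool) -> None:
        """Record the outcome of one reported task."""
        self.window.append(task_ok)

    def should_warn(self) -> bool:
        """Warn once the bad fraction in the window crosses the threshold."""
        if len(self.window) < 10:  # too little data to judge a new host
            return False
        bad = self.window.count(False)
        return bad / len(self.window) >= self.threshold


# An overclocked host erroring everything out would trip the warning:
mon = OutcomeMonitor()
for _ in range(20):
    mon.report(False)
print(mon.should_warn())  # True
```

The point of the window is exactly the scenario in the post: a user who never opens the web page still sees a local warning instead of assuming great throughput.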
Wiggo (Joined: 24 Jan 00, Posts: 34744, Credit: 261,360,520, RAC: 489)

Damn, I've been screwed again. :-(
http://setiathome.berkeley.edu/workunit.php?wuid=1822350755
http://setiathome.berkeley.edu/results.php?hostid=6772486
http://setiathome.berkeley.edu/results.php?hostid=6759774
Cheers.
Richard Haselgrove (Joined: 4 Jul 99, Posts: 14650, Credit: 200,643,578, RAC: 874)

"Doesn't BOINC tell you that under the advanced tab, and event log?"

The trouble is, there are two different types of failure. Some cause an actual application crash, which is clearly visible locally in BOINC Manager, the event log, and other monitoring tools like BoincTasks. It only requires modest observation skills to pick those up (assuming the monitor is on and there's a person sitting in front of it, though that's unlikely with server farms).

But most of the complaints in this thread refer to the other sort: the ones which all local tools report as having completed successfully, but which turn out to contain complete garbage and don't validate (or worse, do validate against another garbage-producer). Those can only be identified from the website.
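The two failure types above can be captured in a tiny classifier. The field names (`exit_status`, `validated_ok`) are stand-ins for illustration, not real BOINC structures: a loud failure has a non-zero exit status and shows up in local tools, while a quiet one exits cleanly and is only exposed when the server's validator rejects the result.

```python
# Sketch of the two failure modes described above (invented field names).
def classify_failure(exit_status: int, validated_ok: bool) -> str:
    if exit_status != 0:
        return "client-visible crash"  # visible in BOINC Manager / event log
    if not validated_ok:
        return "silent garbage"        # "success" locally; invalid on the website
    return "good result"


print(classify_failure(-9, False))  # client-visible crash
print(classify_failure(0, False))   # silent garbage
print(classify_failure(0, True))    # good result
```

The second branch is the troublesome one: no local tool can distinguish "silent garbage" from "good result", which is Richard's point.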
TBar (Joined: 22 May 99, Posts: 5204, Credit: 840,779,836, RAC: 2,768)

The last time I looked at this thread, the major problem was identified as a very old ATI/AMD app, namely r1831, not stopping hosts with out-of-date drivers. Also, r1831 doesn't compile new kernels when the driver changes, so the driver can be fine but the r1831 kernels don't match the driver. I think I remember someone saying that was corrected around r1870, but for some reason we are still using r1831. r1870 was a very long time ago; we are now at r2929. Why we are still using r1831 is a mystery.

From my experience the ATI/AMD OpenCL app r2929 now on Beta seems to be working well; the same can't be said about the nVidia OpenCL app, though. Still, the last time I checked, you don't have to wait on the nVidia app to deploy the ATI app. Instead of all this consternation and gnashing of body parts, why don't we just wait for the new ATI app and see how that works? I guarantee deploying a new ATI app will be much easier than a few of the recent suggestions in this thread. I suggest waiting for the new release and reevaluating the situation afterwards.
jason_gee (Joined: 24 Nov 06, Posts: 7489, Credit: 91,093,184, RAC: 0)

Some could argue (and I happen to agree with them) that the BOINC mechanism is supposed to tolerate the inevitable unreliable users/hosts/applications. How well it does that without polluting the science, I suppose, would be a matter for each project. On the user side, I think all the GPU applications, along with the client and monitoring support, have a long way to go before being close to set-and-forget friendly. How much fault tolerance is ideal on the user side is still a pretty open question, but more than none seems to be working out better.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Darth Beaver (Joined: 20 Aug 99, Posts: 6728, Credit: 21,443,075, RAC: 3)

Another one that I've had before:
http://setiathome.berkeley.edu/results.php?hostid=5754408
Add this one:
http://setiathome.berkeley.edu/results.php?hostid=7114470
One more:
http://setiathome.berkeley.edu/results.php?hostid=7443139
I'll look at the others in the morning.
bluestar (Joined: 5 Sep 12, Posts: 7020, Credit: 2,084,789, RAC: 3)

Before I sit down and relax, perhaps I could make the following point. Many users are supposed to behave and be responsible when it comes to handling the tasks they are given. Errors from tasks are most often a result of technical issues, like heat and dust, as well as errors in the applications, including the tasks available for CUDA processing. So when a computer appears to be going completely haywire or running havoc, it is often because quite a large number of tasks were downloaded and the user may not be able to deal with such a big cache, whether for lack of technical skill or of personal insight. There should therefore be some administrative responsibility in place to ensure that such a thing does not happen. If the computer is not to blame, then most likely the user running it is.
Darth Beaver (Joined: 20 Aug 99, Posts: 6728, Credit: 21,443,075, RAC: 3)

Bluestar, the ones I mention are repeat offenders I've caught before, but were given the benefit of the doubt, and they are going through 3000+ units. PMs from wingmen get no response most times.
Raistmer (Joined: 16 Jun 01, Posts: 6325, Credit: 106,370,077, RAC: 121)

New apps released: http://setiathome.berkeley.edu/apps.php
Check if the number of false positives declines.
Darth Beaver (Joined: 20 Aug 99, Posts: 6728, Credit: 21,443,075, RAC: 3)

"new apps released"

Sorry Raistmer, not sure if you're talking to me, but if you are, I'm not understanding what you're asking me to do. :)
Raistmer (Joined: 16 Jun 01, Posts: 6325, Credit: 106,370,077, RAC: 121)

"new apps released"

It's just info relevant to this whole thread; it's not a direct answer to your last post.
OTS (Joined: 6 Jan 08, Posts: 369, Credit: 20,533,537, RAC: 0)

And how does one machine obtain 9190 WUs in progress? I would really like to know his secret. ;)
http://setiathome.berkeley.edu/results.php?hostid=7544616&offset=0&show_names=0&state=0&appid=
woohoo (Joined: 30 Oct 13, Posts: 972, Credit: 165,671,404, RAC: 5)

I'm sure he will make the deadline.
Brent Norman (Joined: 1 Dec 99, Posts: 2786, Credit: 685,657,289, RAC: 835)

"And how does one machine obtain 9190 WUs in progress? I would really like to know his secret. ;)"

And his app details page doesn't show the errors.
http://setiathome.berkeley.edu/host_app_versions.php?hostid=7544616
Zalster (Joined: 27 May 99, Posts: 5517, Credit: 528,817,460, RAC: 242)

He only has 2 work units awaiting validation: In progress (9140), Error (1096), SETI@home v7 (10238) total. Looks like he's figured out how to circumvent the limits placed on computers. The first time he did this and got the 1096, they all timed out. There's no way he's going to make a dent in those 9140; he's just going to force all those work units to be resent once the deadline passes.
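As a quick sanity check, the counts quoted above are internally consistent: the in-progress, errored, and awaiting-validation tasks together account for every SETI@home v7 task on that host.

```python
# Sanity check on the task counts quoted above.
in_progress = 9140
errored = 1096
awaiting_validation = 2
total_v7 = 10238

print(in_progress + errored + awaiting_validation == total_v7)  # True
```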
rob smith (Joined: 7 Mar 03, Posts: 22190, Credit: 416,307,556, RAC: 380)

Gross rescheduling. He believes the way to get a good score is to have a monumental cache, and that it doesn't matter that a large proportion of the results returned are rubbish: he's winning because he has the biggest cache.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
TBar (Joined: 22 May 99, Posts: 5204, Credit: 840,779,836, RAC: 2,768)

Does anyone really think someone would try to hoard MBs when they are always available? Not to mention he's only run two tasks in the last month. Quite possibly something has gone astray. From my recent experience I suspect the version of BOINC. "(MAY BE UNSTABLE - USE ONLY FOR TESTING)" can mean many things, most of them not good.
woohoo (Joined: 30 Oct 13, Posts: 972, Credit: 165,671,404, RAC: 5)

I was running that preview build recently and I did run into a few instances where my entire cache would disappear, so I'm back on the stable version for now.
Darth Beaver (Joined: 20 Aug 99, Posts: 6728, Credit: 21,443,075, RAC: 3)

Let's hope siu77 looks at his machine and reverts to the earlier version. Wow, 3 GB worth of units trashed. Lucky they're not APs.
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.