Major problem with a user

Message boards : Number crunching : Major problem with a user

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1692911 - Posted: 18 Jun 2015, 3:23:28 UTC - in response to Message 1692833.  
Last modified: 18 Jun 2015, 3:25:42 UTC

I apologize if my thoughts were misunderstood. I was thinking of perhaps another way to get the attention of those few who tend to "set and forget" as well as those who feel it is "beneath them" to respond to sincere inquiries regarding the possible failure of their systems.

I meant no disrespect to those who, despite the responsible tending of their systems, compile invalid or inconclusive WUs due to circumstances beyond their control.

Personally I will always welcome any advice on how to make my system run more efficiently. Feel free to examine my work and ask questions about my rig, as it is not hidden.

I think we all pretty much agree on that one. Yes, there are "set and forget" people who will never ever look into anything ever again. They signed up, got crunching, and as I said before, they see their computer is always doing something and there are usually a bunch of WUs in the cache, so they figure they're doing just fine. They don't bother looking at their task pages to see that they are just spewing out invalids.

On the other hand, there are machines that probably were reliable and stable, but either got incompatible drivers, or the GPU is overheating now, or the GPU is just on its way to death, so it spews out invalids.

In either case, having some kind of automated system to limit the amount of damage such hosts can do is the best solution (that's the job of the quota system). Yes, there are apps coming through the pipeline that can detect some of the common signs of a runaway task and actually abort the task with a non-zero exit error, which will keep invalid -9 overflows out of the DB a bit better, but you can't make an app that handles any and all situations. Every app is going to have compatibility issues with certain drivers along the way, so that's where the quota system steps in and does damage control.

As stated a few posts back, like what they do on CPDN, if a machine goes rogue and the person cares enough to fix it, they can be allowed to try again. Except in this case, the quota system is automated and self-governing: if you trash a ton of WUs, you'll be stuck with 1 WU/day until you fix the problem. Once you fix the problem, you crunch your 1 WU, return it, it is valid, +2 gets added to your quota, now you have 3 to try. All three of those are good, you get +2 for each one, now you're up to 9. Do those 9, they're good, you're up to 27, and so on and so forth. Start trashing them again... back down to 1 in a hurry.
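The recovery arithmetic above can be sketched roughly as follows. This only illustrates the numbers in this post and is not BOINC's actual scheduler code: the function names, the +2 bonus exposed as a parameter, and especially the rule that any invalid result drops the host straight back to 1 are assumptions made for the example.

```python
def update_quota(quota, valid, invalid, bonus=2, floor=1):
    """Return the next daily quota after one batch of results.

    Assumption: any invalid result sends the host straight back to
    the floor ("back down to 1 in a hurry"); the post does not say
    exactly how fast the real system drops.
    """
    if invalid > 0:
        return floor
    # Each valid result adds `bonus` to the quota (+2 in the post).
    return quota + bonus * valid


def recovery(batches, start=1):
    """Quota history for a fixed host that returns every task valid."""
    quota = start
    history = [quota]
    for _ in range(batches):
        quota = update_quota(quota, valid=quota, invalid=0)
        history.append(quota)
    return history


if __name__ == "__main__":
    print(recovery(3))               # [1, 3, 9, 27]
    print(update_quota(27, 5, 1))    # 1
```

With +2 per valid result and a full batch returned each time, the quota triples per batch, which is where the 1, 3, 9, 27 sequence comes from.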

With that kind of system, if someone gets dropped down to 1/day, those set and forget people that don't think there's anything wrong with what they're doing, they'll look and see that their machine isn't crunching anything anymore, or that they only have one task, and they'll come to at least the front page here to try to find out what's going on. Which means that for a month or so after the quota system gets a few values changed and applied, there will need to be a notice on the front page about the new quota system that says "if you have only one task per day, you have something wrong with your machine. Head over to Number Crunching and ask what's wrong, and somebody will help you get it sorted out."
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1692911
Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1692929 - Posted: 18 Jun 2015, 4:50:24 UTC

I think one problem with BOINC/Seti is ...

- user adds seti
- user watches times
- user overclocks to improve times
- user has no idea that they are all errors (since they never look at web page)
- user thinks they are doing well with great throughput

In the end, BOINC doesn't tell you that you are running errors, and a new user has no idea what stats are normal.

If BOINC could implement a "Report Back" system to show errors, maybe things would change.
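As a rough idea of what such a "Report Back" check might look like, here is a small sketch. Everything in it is hypothetical: BOINC has no `report_back` function, and the outcome labels, the 10% threshold, and the 20-task minimum are made-up values chosen only to show the idea of warning the user when too many recent tasks fail or don't validate.

```python
from collections import Counter


def report_back(outcomes, threshold=0.10, min_tasks=20):
    """Return a warning string if too many recent tasks went bad.

    `outcomes` is a list of labels such as "valid", "invalid" or
    "error" (hypothetical labels, not a real BOINC data format).
    Returns None when there is too little data or the host looks fine.
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total < min_tasks:
        return None  # not enough results to judge the host
    bad = counts["invalid"] + counts["error"]
    fraction = bad / total
    if fraction > threshold:
        return (f"Warning: {bad} of {total} recent tasks "
                f"({fraction:.0%}) failed or did not validate.")
    return None


if __name__ == "__main__":
    recent = ["valid"] * 10 + ["invalid"] * 30
    print(report_back(recent))
```

The point is only that the client would need the validation outcomes sent back from the server, since an overclocked host sees nothing wrong locally.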
ID: 1692929
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1693038 - Posted: 18 Jun 2015, 9:55:47 UTC

ID: 1693038
Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1693042 - Posted: 18 Jun 2015, 10:07:45 UTC - in response to Message 1693020.  

Doesn't Boinc tell you that under advanced tab, and event log?

The trouble is, there are two different types of failure.

Some cause an actual application crash, which is clearly visible locally in BOINC Manager, the Event log, and other monitoring tools like BOINC Tasks. It only requires modest observation skills to pick those up (assuming the monitor is on, and there's a person sitting in front of it - though that's unlikely with server farms).

But most of the complaints in this thread refer to the other sort - the ones which all local tools report as having completed successfully, but turn out to contain complete garbage and don't validate (or worse, do validate against another garbage-producer). Those can only be identified from the website.
ID: 1693042
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1693048 - Posted: 18 Jun 2015, 10:49:07 UTC

The last time I looked at this thread I thought the major problem was identified as a very old ATI/AMD App, namely r1831, not stopping hosts with out of date drivers. Also, r1831 doesn't compile new kernels when the driver changes, so, the driver can be fine but the r1831 kernels don't match the driver. I think I remember someone saying that was corrected around r1870 but for some reason we are still using r1831. r1870 was a very long time ago, we are now at r2929. Why we are still using r1831 is a mystery.

From my experience the ATI/AMD OpenCL App r2929 now on Beta seems to be working well, the same can't be said about the nVidia OpenCL App though. Well, the last time I checked you don't have to wait on the nVidia App to deploy the ATI App. Instead of all this consternation and gnashing of body parts why don't we just wait for the new ATI App and see how that works?

I guarantee deploying a new ATI App will be much easier than a few of the recent suggestions in this thread. I suggest waiting for the new release and reevaluating the situation afterwards.
ID: 1693048
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1693095 - Posted: 18 Jun 2015, 14:10:31 UTC
Last modified: 18 Jun 2015, 14:11:24 UTC

Some could argue (and I happen to agree with them) that the BOINC mechanism is supposed to tolerate the inevitable unreliable users/hosts/applications. How well it does that without polluting the science is, I suppose, a matter for each project. On the user side, I think all the GPU applications, along with the client and monitoring support, have a long way to go before being close to set-and-forget friendly. What the ideal level of fault tolerance on the user side is remains, I think, a pretty open question, but more than none seems to be working out better.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1693095
Darth Beaver (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Joined: 20 Aug 99
Posts: 6728
Credit: 21,443,075
RAC: 3
Australia
Message 1694328 - Posted: 21 Jun 2015, 14:34:25 UTC

ID: 1694328
bluestar
Joined: 5 Sep 12
Posts: 7020
Credit: 2,084,789
RAC: 3
Message 1694588 - Posted: 22 Jun 2015, 14:38:19 UTC
Last modified: 22 Jun 2015, 15:22:16 UTC

Before I sit down and relax, perhaps I could make the following point.

Many users are supposed to behave and be responsible when it comes to the dealing and handling of given tasks.

Errors from tasks are most often the result of technical issues, like heat and dust, as well as errors in the applications, including the tasks available for CUDA processing.

Therefore, when a computer appears to be going completely haywire or wreaking havoc, it is because quite a large number of tasks were downloaded and the user may not be able to deal with such a big cache, whether for lack of technical skill or of personal competence and insight.

Therefore there should be an administrative responsibility to ensure that such a thing does not happen.

If the computer is not to blame, then it is most likely the user running it.
ID: 1694588
Darth Beaver (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Joined: 20 Aug 99
Posts: 6728
Credit: 21,443,075
RAC: 3
Australia
Message 1694633 - Posted: 22 Jun 2015, 17:04:06 UTC

Bluestar, the ones I mention are repeat offenders that I've come across before but given the benefit of the doubt, and they are going through 3000+ units.

PMs from wingmen get no response most of the time.
ID: 1694633
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1696077 - Posted: 26 Jun 2015, 23:51:32 UTC

new apps released
http://setiathome.berkeley.edu/apps.php
check if number of false positives decline
ID: 1696077
Darth Beaver (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Joined: 20 Aug 99
Posts: 6728
Credit: 21,443,075
RAC: 3
Australia
Message 1696183 - Posted: 27 Jun 2015, 7:53:31 UTC - in response to Message 1696077.  

new apps released
http://setiathome.berkeley.edu/apps.php
check if number of false positives decline


Sorry Raistmer, not sure if you're talking to me, but I'm not understanding what you're asking me to do if you are talking to me. :)
ID: 1696183
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1696186 - Posted: 27 Jun 2015, 8:25:36 UTC - in response to Message 1696183.  

new apps released
http://setiathome.berkeley.edu/apps.php
check if number of false positives decline


Sorry Raistmer, not sure if you're talking to me, but I'm not understanding what you're asking me to do if you are talking to me. :)


It's just info relevant to this whole thread; it's not a direct answer to your last post.
ID: 1696186
OTS
Volunteer tester
Joined: 6 Jan 08
Posts: 369
Credit: 20,533,537
RAC: 0
United States
Message 1697581 - Posted: 2 Jul 2015, 3:44:15 UTC

And how does one machine obtain 9190 WUs in progress? I would really like to know his secret. ;)


http://setiathome.berkeley.edu/results.php?hostid=7544616&offset=0&show_names=0&state=0&appid=
ID: 1697581
woohoo
Volunteer tester
Joined: 30 Oct 13
Posts: 972
Credit: 165,671,404
RAC: 5
United States
Message 1697583 - Posted: 2 Jul 2015, 3:46:36 UTC

I'm sure he will make the deadline.
ID: 1697583
Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1697586 - Posted: 2 Jul 2015, 3:57:42 UTC - in response to Message 1697581.  

And how does one machine obtain 9190 WUs in progress? I would really like to know his secret. ;)


http://setiathome.berkeley.edu/results.php?hostid=7544616&offset=0&show_names=0&state=0&appid=


And his app details don't show the errors.
http://setiathome.berkeley.edu/host_app_versions.php?hostid=7544616
ID: 1697586
Zalster (Special Project $250 donor)
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1697589 - Posted: 2 Jul 2015, 4:11:47 UTC - in response to Message 1697586.  

He only has 2 work units awaiting validation.

In progress (9140)

Error (1096)

SETI@home v7 (10238)

Looks like he's figured out how to circumvent the limits placed on computers.

The first time he did this and got the 1096 they all timed out.

Those 9140 (no way he's going to make a dent in those)

He's just going to force all those work units to be resent once the deadline passes.
ID: 1697589
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22190
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1697605 - Posted: 2 Jul 2015, 4:59:20 UTC

Gross rescheduling.
He believes the way to get a good score is to have a monumental cache, and that it doesn't matter that a large proportion of the results returned are rubbish; he's winning because he has the biggest cache.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1697605
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1697612 - Posted: 2 Jul 2015, 5:06:19 UTC
Last modified: 2 Jul 2015, 5:09:31 UTC

Does anyone really think someone would try to hoard MBs when they are always available? Not to mention he's only run two tasks in the last month. Quite possibly something has gone astray. From my recent experience I suspect the version of BOINC. "(MAY BE UNSTABLE - USE ONLY FOR TESTING)" can mean many things, most of them not good.
ID: 1697612
woohoo
Volunteer tester
Joined: 30 Oct 13
Posts: 972
Credit: 165,671,404
RAC: 5
United States
Message 1697614 - Posted: 2 Jul 2015, 5:08:38 UTC

I was running that preview build recently and I did run into a few instances where my entire cache would disappear, so I'm back on the stable version for now.
ID: 1697614
Darth Beaver (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Joined: 20 Aug 99
Posts: 6728
Credit: 21,443,075
RAC: 3
Australia
Message 1697620 - Posted: 2 Jul 2015, 5:42:24 UTC

Let's hope siu77 looks at his machine and reverts to the earlier version. Wow, 3 gig worth of units trashed.

Lucky they're not APs.
ID: 1697620
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.