Major problem with a user

Message boards : Number crunching : Major problem with a user

Werecow
Joined: 13 Mar 05
Posts: 56
Credit: 4,917,657
RAC: 3
United States
Message 1685616 - Posted: 29 May 2015, 15:04:28 UTC - in response to Message 1685584.  

I doubt that Glenn did, or indeed ever will. I had an (unpleasant) exchange with him some time ago. His GTX560 is monstrously overclocked, and "it works fine for games, so why not for S@H?" (only with a pile of four letter words). If ever there was a computer and user that deserved to be cut off at its roots then this combination would be my nomination.

Whilst I can see that might be his initial reaction, surely if he looks at his recent results and sees that only 11 of the last 883 tasks are valid, wouldn't that suggest a problem? He has a 560 and an RAC of 200!

In reality why is he bothering to run SETI at all?


One of my antique CPU-only machines would give him a guaranteed 50% RAC boost. I wonder if I could convince him to trade.. ;-)
ID: 1685616
Profile Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1685635 - Posted: 29 May 2015, 16:30:57 UTC - in response to Message 1685616.  
Last modified: 29 May 2015, 16:34:49 UTC

I doubt that Glenn did, or indeed ever will. I had an (unpleasant) exchange with him some time ago. His GTX560 is monstrously overclocked, and "it works fine for games, so why not for S@H?" (only with a pile of four letter words). If ever there was a computer and user that deserved to be cut off at its roots then this combination would be my nomination.

Whilst I can see that might be his initial reaction, surely if he looks at his recent results and sees that only 11 of the last 883 tasks are valid, wouldn't that suggest a problem? He has a 560 and an RAC of 200!

In reality why is he bothering to run SETI at all?

One of my antique CPU-only machines would give him a guaranteed 50% RAC boost. I wonder if I could convince him to trade.. ;-)

Yeah, I've got an old P4HT/XP box that RACs 415 right now. If he's a gamer, that's what he really cares about, and he's not likely to change anything to improve his S@H performance. Would be nice if there was a mechanism to boot folks such as him off the project.......
Donald
Infernal Optimist / Submariner, retired
ID: 1685635
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1685664 - Posted: 29 May 2015, 18:30:15 UTC

A one in a million.

No real harm is done.
Your computers are running just fine.
Keep tuning.

When the time is done.
Yours will still be running,
tuned and humming.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1685664
Rasputin42
Volunteer tester

Joined: 25 Jul 08
Posts: 412
Credit: 5,834,661
RAC: 0
United States
Message 1685691 - Posted: 29 May 2015, 19:37:08 UTC

This guy's tasks are all erroring out.
http://setiathome.berkeley.edu/show_host_detail.php?hostid=5295453

What to do?
ID: 1685691
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1685692 - Posted: 29 May 2015, 19:38:33 UTC

I have a lot of wingmen who just destroy all tasks/the science ... (which makes me very sad!)

Just two wingmen on one task ...
http://setiathome.berkeley.edu/workunit.php?wuid=1801074356

The 'AMD Radeon HD 4670 (256MB) OpenCL: 1.0' destroys all tasks (errors) ...
http://setiathome.berkeley.edu/show_host_detail.php?hostid=7426251

The 'AMD AMD Radeon HD 6200/6300/7200/7300 series (Wrestler) (384MB) driver: 1.4.1589 OpenCL: 1.1' produces only '-9 overflows' ...
http://setiathome.berkeley.edu/show_host_detail.php?hostid=7504170

Wrong driver?
If so, why do they get all these fresh tasks just to destroy the science?
ID: 1685692
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1685705 - Posted: 29 May 2015, 20:00:44 UTC - in response to Message 1685692.  
Last modified: 29 May 2015, 20:06:09 UTC

26no12ac.15576.20108.438086664204.12.102
http://setiathome.berkeley.edu/workunit.php?wuid=1802092176

Two results with '-9 overflow' (two AMD VGA cards, stock SETI@home v7 v7.03 (opencl_ati_sah)):
http://setiathome.berkeley.edu/show_host_detail.php?hostid=6889787
http://setiathome.berkeley.edu/show_host_detail.php?hostid=7504170

And a correct result (stock CPU app):
http://setiathome.berkeley.edu/show_host_detail.php?hostid=7522258
Spike count: 7
Autocorr count: 2
Pulse count: 3
Triplet count: 3
Gaussian count: 1

It is now marked as 'invalid' and the science is destroyed, because the '-9 overflow' is now in the database.
ID: 1685705
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1685722 - Posted: 29 May 2015, 20:31:02 UTC - in response to Message 1685705.  

26no12ac.15576.20108.438086664204.12.102
http://setiathome.berkeley.edu/workunit.php?wuid=1802092176

Two results with '-9 overflow' (two AMD VGA cards, stock SETI@home v7 v7.03 (opencl_ati_sah)):
http://setiathome.berkeley.edu/show_host_detail.php?hostid=6889787
http://setiathome.berkeley.edu/show_host_detail.php?hostid=7504170

And a correct result (stock CPU app):
http://setiathome.berkeley.edu/show_host_detail.php?hostid=7522258
Spike count: 7
Autocorr count: 2
Pulse count: 3
Triplet count: 3
Gaussian count: 1

It is now marked as 'invalid' and the science is destroyed, because the '-9 overflow' is now in the database.


The science is not destroyed.

We might be looking in the wrong direction,

at the wrong wavelength,

trying to find completely the wrong kind of signal (it takes a lot of power to transmit at one frequency - easy to detect - TDMA - FDMA).

...

and those now-missed signals can be recalculated in a day with the HW available ten years from now. (We're reprocessing some right now, I guess.)
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1685722
Rasputin42
Volunteer tester

Joined: 25 Jul 08
Posts: 412
Credit: 5,834,661
RAC: 0
United States
Message 1685726 - Posted: 29 May 2015, 20:36:12 UTC

ID: 1685726
Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 11361
Credit: 29,581,041
RAC: 66
United States
Message 1685732 - Posted: 29 May 2015, 20:45:58 UTC - in response to Message 1685726.  

http://setiathome.berkeley.edu/show_host_detail.php?hostid=7044454

is another one....
What is going on?

This looks like it is another 560ti gone astray.
ID: 1685732
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1685734 - Posted: 29 May 2015, 20:51:23 UTC - in response to Message 1685705.  
Last modified: 29 May 2015, 21:05:27 UTC

If you click on one of these 'bad' hosts and then on the overview of its 'valid' tasks, you will see wrong '-9 overflow' results that are marked as valid, and wingmen with good results that are marked as invalid (on these WUs).

Examples:

22my12ab.20524.6202.438086664200.12.217
http://setiathome.berkeley.edu/workunit.php?wuid=1795776824
Bad hosts which are wingmen:
http://setiathome.berkeley.edu/show_host_detail.php?hostid=6889787
http://setiathome.berkeley.edu/show_host_detail.php?hostid=6780580
And a good result from: http://setiathome.berkeley.edu/show_host_detail.php?hostid=7460773
Spike count: 4
Autocorr count: 0
Pulse count: 2
Triplet count: 0
Gaussian count: 0

24fe13ab.31742.13564.438086664207.12.86
http://setiathome.berkeley.edu/workunit.php?wuid=1801693399
Bad hosts which are wingmen:
http://setiathome.berkeley.edu/show_host_detail.php?hostid=6656511
http://setiathome.berkeley.edu/show_host_detail.php?hostid=6889787
And a good result from: http://setiathome.berkeley.edu/show_host_detail.php?hostid=7481053
Spike count: 2
Autocorr count: 2
Pulse count: 0
Triplet count: 2
Gaussian count: 0

26no12ac.15576.20108.438086664204.12.96
http://setiathome.berkeley.edu/workunit.php?wuid=1802092170
Bad hosts which are wingmen:
http://setiathome.berkeley.edu/show_host_detail.php?hostid=6889787
http://setiathome.berkeley.edu/show_host_detail.php?hostid=7504170
And a good result from: http://setiathome.berkeley.edu/show_host_detail.php?hostid=6505277
Spike count: 12
Autocorr count: 0
Pulse count: 0
Triplet count: 1
Gaussian count: 0

25oc12ad.27789.18472.438086664204.12.109
http://setiathome.berkeley.edu/workunit.php?wuid=1802156502
Bad hosts which are wingmen:
http://setiathome.berkeley.edu/show_host_detail.php?hostid=6889787
http://setiathome.berkeley.edu/show_host_detail.php?hostid=7504170
And a good result from: http://setiathome.berkeley.edu/show_host_detail.php?hostid=5967851
Spike count: 4
Autocorr count: 2
Pulse count: 1
Triplet count: 0
Gaussian count: 1

And every time, the '-9 overflow' ends up in the database and the good results are destroyed.

Maybe it would be better if, whenever a VGA card sends back a '-9 overflow' result, the task were calculated again on a CPU to check whether there really is a '-9 overflow'.
If it's true, it takes just a few seconds.
If it's not true, the science is rescued.
ID: 1685734
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1685744 - Posted: 29 May 2015, 21:14:27 UTC - in response to Message 1685734.  

Well, you know what they say: "A broken clock is right twice a day."
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1685744
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1685746 - Posted: 29 May 2015, 21:19:31 UTC

ID: 1685746
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1685768 - Posted: 29 May 2015, 22:07:13 UTC - in response to Message 1685705.  

26no12ac.15576.20108.438086664204.12.102
http://setiathome.berkeley.edu/workunit.php?wuid=1802092176

Two results with '-9 overflow' (two AMD VGA cards, stock SETI@home v7 v7.03 (opencl_ati_sah)):
http://setiathome.berkeley.edu/show_host_detail.php?hostid=6889787
http://setiathome.berkeley.edu/show_host_detail.php?hostid=7504170

And a correct result (stock CPU app):
http://setiathome.berkeley.edu/show_host_detail.php?hostid=7522258
Spike count: 7
Autocorr count: 2
Pulse count: 3
Triplet count: 3
Gaussian count: 1

It is now marked as 'invalid' and the science is destroyed, because the '-9 overflow' is now in the database.

Both hosts running the stock (opencl_ati_sah) are using Catalyst 11.10 drivers (AMD OpenCL SDK 2.5), which are known to cause such problems. That stock build is rev. 1831; we need to get newer OpenCL_ATi builds deployed here which will refuse to run with anything older than Catalyst 11.12 (AMD OpenCL SDK 2.6). Such apps are under test at SETI BETA, of course, and I hope those who care about quality will devote some of their crunching resources there.
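
As a rough illustration of the driver gate described above (not the actual app code), a minimum-version check might look like the sketch below. Only the Catalyst 11.12 minimum comes from the post; the function names and the 'major.minor' version string format are assumptions made for the example.

```python
from typing import Tuple

# Hypothetical sketch of a minimum-driver gate; not the real OpenCL_ATi app logic.
MIN_CATALYST = (11, 12)  # from the post: refuse anything older than Catalyst 11.12


def parse_catalyst(version: str) -> Tuple[int, int]:
    """Turn a 'major.minor' version string such as '11.10' into (11, 10)."""
    major, minor = version.strip().split(".")[:2]
    return int(major), int(minor)


def driver_is_acceptable(version: str) -> bool:
    """Return True when the driver is at least the minimum version."""
    return parse_catalyst(version) >= MIN_CATALYST
```

Comparing versions as integer tuples rather than floats matters here: as a float, 11.9 would sort above 11.12, while as a tuple (11, 9) correctly sorts below (11, 12).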

In addition to the check of drivers, recent MB7 OpenCL apps also have a sanity check designed to error out for such Autocorr overflows. Getting that tuned so it does error out reliably on bad cases but is unlikely to do so on a good result has been challenging. The apps at Beta are currently too sensitive in that respect, but there's a refinement in the pipeline.
                                                                  Joe
ID: 1685768
Profile Blurf
Volunteer tester

Joined: 2 Sep 06
Posts: 8962
Credit: 12,678,685
RAC: 0
United States
Message 1685772 - Posted: 29 May 2015, 22:13:24 UTC

Maybe send a list of the "bad" users to the mods and have them send it up to Eric? Just a thought....


ID: 1685772
Werecow
Joined: 13 Mar 05
Posts: 56
Credit: 4,917,657
RAC: 3
United States
Message 1685873 - Posted: 30 May 2015, 4:20:24 UTC - in response to Message 1685772.  

Rogue hosts like these were being tracked in this thread, with PMs sent to users. A few got sorted that way. Unfortunately, the thread doesn't seem to be monitored much now, if at all.
ID: 1685873
Profile Dimly Lit Lightbulb 😀
Volunteer tester
Joined: 30 Aug 08
Posts: 15399
Credit: 7,423,413
RAC: 1
United Kingdom
Message 1686042 - Posted: 30 May 2015, 15:06:24 UTC - in response to Message 1685772.  

Maybe send a list of the "bad" users to the mods and have them send it up to Eric? Just a thought....

That's admin stuff; we already have enough paperwork to do around here.

Member of the People Encouraging Niceness In Society club.

ID: 1686042
Cavalary

Joined: 15 Jul 99
Posts: 104
Credit: 7,507,548
RAC: 38
Romania
Message 1686049 - Posted: 30 May 2015, 15:21:52 UTC

*skimmed thread*

This should be automated. Besides the already-suggested mandatory CPU validation when 2 GPUs agree on an overflow (or at least on an overflow of spikes and/or autocorrs, since if a task gets past those with a full match it's likely right... I think), hosts with many invalids should receive notices, possibly be required to do something to confirm they saw them and intend to fix the issue, and, if the situation doesn't improve, be stopped from getting more WUs, at least for the type of processor that causes the invalids. And this is a matter to be tackled at the BOINC level, not the project level. I'm quite shocked it hasn't been, really, especially since I think it wouldn't be hard to implement.
ID: 1686049
rob smith
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22188
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1686237 - Posted: 31 May 2015, 9:46:24 UTC

Doing what you suggest would not work given the ratio of tasks that are returned by GPU:CPU.
A far better solution would be to restrict the number of tasks allowed in the cache of an errant processor, and the number of tasks permitted per day, when it exceeds a given number (or ratio) of invalid tasks. Reduce both by a factor of ten (ten in the cache per GPU or CPU, whichever is "offending"). Continued lack of remedial action would result in another factor of ten... If that processor started to return valid tasks, then slowly ramp the cache and daily limits back to normal levels, taking days/weeks to get there.
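
A minimal sketch of that throttling idea follows, purely to make the proposal concrete. Every number in it (the normal limits, the trigger ratio, the recovery step) and every name is an assumption for illustration, not an actual SETI@home or BOINC setting.

```python
from dataclasses import dataclass

# All values below are illustrative assumptions, not real project settings.
NORMAL_CACHE = 100            # assumed normal per-processor cache limit
NORMAL_DAILY = 1000           # assumed normal per-processor daily limit
INVALID_RATIO_TRIGGER = 0.5   # assumed "given ratio" of invalid tasks
RAMP_FACTOR = 1.25            # assumed slow recovery step per good review period


@dataclass
class ProcessorLimits:
    cache: int = NORMAL_CACHE
    daily: int = NORMAL_DAILY


def review(limits, valid, invalid):
    """Apply one review period's task counts and return the new limits."""
    total = valid + invalid
    if total and invalid / total > INVALID_RATIO_TRIGGER:
        # Errant processor: cut both limits by a factor of ten, never below 1.
        return ProcessorLimits(max(1, limits.cache // 10),
                               max(1, limits.daily // 10))
    # Healthy period: ramp slowly back toward the normal levels.
    # The "+ 1" alternative guarantees progress when a limit is very small.
    cache = min(NORMAL_CACHE, max(limits.cache + 1, int(limits.cache * RAMP_FACTOR)))
    daily = min(NORMAL_DAILY, max(limits.daily + 1, int(limits.daily * RAMP_FACTOR)))
    return ProcessorLimits(cache, daily)
```

Calling review() once per period with the host's valid and invalid counts keeps cutting a persistently bad processor by ten each time, while a repaired one climbs back in small steps.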
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1686237
Darth Beaver
Joined: 20 Aug 99
Posts: 6728
Credit: 21,443,075
RAC: 3
Australia
Message 1686270 - Posted: 31 May 2015, 13:42:34 UTC - in response to Message 1685515.  

No, Donald, still no contact. Sorry, I have been busy and this is the first chance I've had to check things out.
ID: 1686270
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1686369 - Posted: 31 May 2015, 20:58:39 UTC
Last modified: 31 May 2015, 21:03:51 UTC

The quota system used to be pretty good at putting a stop to these runaway hosts... but then someone decided the absolute bottom value must be 33, so basically all that does is just beg for large amounts of problems.

It used to (I think..?) start for a new host at something like.. 10. For every good task they returned, it would +5, for every bad result, it would divide by 2. You could go all the way up to 100, and go as low as 1. If you had a runaway machine, you would be limited to 1 task/day until you fixed the problem.
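
Taken at face value, the old rule as recalled above could be sketched like this. The numbers are the ones in the post (which is itself unsure of them); the function name is made up for the example.

```python
OLD_START = 10   # starting quota for a new host, as recalled in the post
OLD_MAX = 100
OLD_MIN = 1


def old_quota_step(quota, task_ok):
    """Return the daily quota after one returned task under the old rule."""
    if task_ok:
        return min(OLD_MAX, quota + 5)   # +5 per good task, capped at 100
    return max(OLD_MIN, quota // 2)      # halve per bad task, never below 1
```

With these numbers a runaway host bottoms out at 1 task/day after a handful of bad results, which is the behaviour described above.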

Okay, so it took a while to be able to get a reasonable cache by only gaining +5 for every good task, but it meant that those machines that had made it to 100 were reliable and dependable.

Now though... I think it is something like it doubles for every one task that is good up to 100, then +1 for every one beyond that, and halves for every bad one, but cannot go below 33. This is made worse by the fact that there is no upper-limit on number of tasks/day, so if a good machine has been going for a while and is theoretically allowed thousands of tasks/day and then something goes wrong and it starts trashing WUs, it takes a while to get down to 33 from 10,000+ by dividing by 2 each time. What's more is.. when these -9 overflows validate against other -9s, that counts as "good" and raises your daily limit again.
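
For comparison, here is a sketch of the current rule as described above. How the doubling behaves right at 100 is an interpretation of the post, and the function names are hypothetical; only the floor of 33, the halving, and the lack of an upper limit are taken directly from the text.

```python
QUOTA_FLOOR = 33  # the minimum daily quota described in the post


def current_quota_step(quota, task_ok):
    """Return the daily quota after one returned task under the current rule."""
    if task_ok:
        if quota < 100:
            return min(100, quota * 2)  # assumed: doubling stops at 100
        return quota + 1                # +1 per good task beyond 100, no cap
    return max(QUOTA_FLOOR, quota // 2)  # halve per bad task, floor of 33


def bad_results_to_floor(quota):
    """Count consecutive bad results needed to bring a quota down to the floor."""
    steps = 0
    while quota > QUOTA_FLOOR:
        quota = current_quota_step(quota, False)
        steps += 1
    return steps
```

Note that a '-9 overflow' that validates against another '-9 overflow' would be passed in with task_ok=True, so it pushes the quota back up instead of down.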


Long story short.. I don't particularly think this is specifically an issue with not letting two overflow results go without a CPU's opinion (because all that will end up happening is the CPU's result won't match the majority, and the CPU's result will be discarded as invalid--which is what already happens quite frequently anyway). The best solution to this problem is a simple one: have a very strict, draconian, unforgiving quota system. Something like:

- Start every machine at a daily limit of 10
- For every valid task, +2 to the quota
- For every bad task, /2 to the quota if below 200, /5 if above 200
- Allow quota to go all the way down to 1, no upper limit.
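
The four rules above translate almost directly into code. This is only a sketch of the proposal: the function name is hypothetical, and since the list does not say what happens at exactly 200, the example treats 200 as "not above" and halves it.

```python
START_QUOTA = 10  # every machine starts at a daily limit of 10


def proposed_quota_step(quota, task_ok):
    """Return the daily quota after one returned task under the proposed rule."""
    if task_ok:
        return quota + 2                  # +2 per valid task, no upper limit
    divisor = 5 if quota > 200 else 2     # assumed: exactly 200 is halved
    return max(1, quota // divisor)       # quota may fall all the way to 1
```

Because the floor is 1 rather than 33, a host that keeps returning bad tasks ends up with a single task per day until its owner fixes it.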

This would feel agonizing for everyone at first, but the good, reliable machines would quickly build their quota up to a point where the server-side cache limits become the limiting factor, just like it is now. This would drastically limit these runaway hosts down to such a small amount of work that they can't really do much of any damage at all. If the owners of these machines want to complain about how they can't get any work, the response should be "then fix whatever problem your machine has, and then you'll get more work."

That's the way the quota system is supposed to be: rewarded for good work, punished for bad work. The way it is right now, there is no punishment, because you can still get tons of tasks to trash and there's not really anything that can stop you from doing so.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1686369
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.