When can some better controls be put on runaway rigs?

Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1446504 - Posted: 25 Nov 2013, 6:19:59 UTC
Last modified: 25 Nov 2013, 6:30:37 UTC

They all seem at first to get to around 13-16K total tasks before the server restrictions start to kick in.

Here's the latest example to plague us, http://setiathome.berkeley.edu/results.php?hostid=5173830.

Surely something can be done to restrict these things better.

[edit] BTW, it also takes ages to get them back down to a respectable (sort of, if you can call it that) level of non-annoyance. [/edit]

Cheers.
Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1446528 - Posted: 25 Nov 2013, 10:17:54 UTC - in response to Message 1446504.  
Last modified: 25 Nov 2013, 10:19:17 UTC

Surely something can be done to restrict these things better.

1. Count invalid results as errors, i.e. decrease the quota for those too.
2. Don't double the quota for each valid result. If we decrease the quota by 1 for each error, we should not increase it by more than 1 for each valid result. Even then a host would need only 50% valid results to keep its quota close to the default value (I'd prefer a system which requires at least 80%, i.e. 5 down, 1 up; that's really not too much to expect from a properly working machine; see the sketch after this list).
3. Since we all know that some GPUs need a reboot sometimes, introduce a "reset quota" button on the application page of a host, so when someone fixes his machine, he can get some results to test it. Of course there should be a limit on how often this button can be used.
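
A minimal sketch of how rules 1 and 2 could be wired together, assuming a per-host, per-app_version quota field similar to "Max tasks per day"; all names and constants here are hypothetical, not actual BOINC server code:

[code]
#include <algorithm>

// Hypothetical per-host, per-app_version record (names assumed).
struct HostAppVersion {
    int daily_quota;  // current "Max tasks per day"
};

const int QUOTA_DEFAULT = 100;  // project baseline (assumed value)
const int QUOTA_MIN     = 1;    // never cut a host off completely

// Rule 1: invalid results count the same as errors.
// Rule 2: "5 down, 1 up" - a bad result costs 5, a valid one earns
// back only 1, so a host needs about five valid results per bad one
// (roughly 80% valid) just to hold its quota steady.
void on_result(HostAppVersion& hav, bool valid) {
    if (valid) {
        hav.daily_quota = std::min(hav.daily_quota + 1, QUOTA_DEFAULT);
    } else {
        hav.daily_quota = std::max(hav.daily_quota - 5, QUOTA_MIN);
    }
}
[/code]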
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1446887 - Posted: 26 Nov 2013, 1:51:26 UTC - in response to Message 1446504.  

They all seem at first to get to around 13-16K total tasks before the server restrictions start to kick in.

Here's the latest example to plague us, http://setiathome.berkeley.edu/results.php?hostid=5173830.
...

Hmm, that host is at 2298 in the Top Computers list sorted by RAC, and at 3019 when sorted by Total credit. The 70 or so valid results per day are far more than the average host achieves, so perhaps from the project perspective that host is more of an asset than a liability in spite of producing 3800 or so invalid results per day. Obviously if there were very many hosts doing the same it could become a problem for the servers to handle the extra traffic, but as a relatively rare case it at least suggests thinking about something better than severely restricting work delivery to the host.

If the project did want to be more restrictive, changing their setting for gpu_multiplier to 1 would make sense (as long as the 100 in progress limit is maintained). Then each GPU would have a quota of "Max tasks per day" for each app_version it can use (rather than 8x "Max tasks per day" as it is now).
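
To make the arithmetic concrete, a sketch of how the effective daily GPU quota falls out of that setting (function and parameter names are illustrative, not the scheduler's actual code):

[code]
// Daily GPU task quota for one app_version on a host (illustrative).
int effective_gpu_quota(int max_tasks_per_day, int gpu_multiplier, int ngpus) {
    return max_tasks_per_day * gpu_multiplier * ngpus;
}

// Current setting:   gpu_multiplier = 8 -> 8x "Max tasks per day" per GPU.
// Suggested setting: gpu_multiplier = 1 -> exactly "Max tasks per day"
// per GPU, with the 100-in-progress limit still capping the queue.
[/code]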
                                                                   Joe
Blurf
Volunteer tester
Joined: 2 Sep 06
Posts: 8962
Credit: 12,678,685
RAC: 0
United States
Message 1446900 - Posted: 26 Nov 2013, 2:44:22 UTC

I suspect the answer lies in... when the SETI staff actually have time to address the concerns.


Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1446956 - Posted: 26 Nov 2013, 9:24:25 UTC - in response to Message 1446887.  
Last modified: 26 Nov 2013, 9:31:28 UTC

Hmm, that host is at 2298 in the Top Computers list sorted by RAC, and at 3019 when sorted by Total credit. The 70 or so valid results per day are far more than the average host achieves, so perhaps from the project perspective that host is more of an asset than a liability in spite of producing 3800 or so invalid results per day.

It's known that fake result overflows from such machines can validate against each other. So if we are doing some science here and not just generating heat, I don't see how having bad results in the database, and maybe missing some interesting signals, would be beneficial for the project.

And if the project gets so little processing power that it has to keep such hosts active at any price, maybe it would be better to stop such hosts from trashing WUs and to increase the limits, so that machines which actually do something useful for the project don't run dry every time there's some little issue with the servers (or the weekly maintenance). Considering the reason why we have limits, i.e. too many results in the database, such hosts are indeed a liability.
Sp@ceNv@der Project Donor
Joined: 10 Jul 05
Posts: 41
Credit: 117,366,167
RAC: 152
Belgium
Message 1446964 - Posted: 26 Nov 2013, 10:19:18 UTC

In my personal experience with these invalid results, I've discovered they are mainly due to overclocking the GPU too much. I started seeing those results, and they stopped when I re-adjusted some GPU settings, i.e. lowered values such as core speed and memory speed. I had no problems whatsoever when gaming or rendering, but crunching SETI@home sometimes resulted in invalids. My cooling is more than OK, but the level of overclocking was just too high for this project. Now all settings on all cards are optimal for the SETI project, yet still overclocked for everything else...
To boldly crunch ...
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1447298 - Posted: 27 Nov 2013, 6:32:03 UTC - in response to Message 1446964.  

In my personal experience with these invalid results, I've discovered they are mainly due to overclocking the GPU too much. I started seeing those results, and they stopped when I re-adjusted some GPU settings, i.e. lowered values such as core speed and memory speed. I had no problems whatsoever when gaming or rendering, but crunching SETI@home sometimes resulted in invalids. My cooling is more than OK, but the level of overclocking was just too high for this project. Now all settings on all cards are optimal for the SETI project, yet still overclocked for everything else...

The thing is that many GPUs these days are overclocked before you get them, the GTX 560 Ti being the prime example.

Cheers.
Sp@ceNv@der Project Donor
Joined: 10 Jul 05
Posts: 41
Credit: 117,366,167
RAC: 152
Belgium
Message 1447308 - Posted: 27 Nov 2013, 6:41:38 UTC - in response to Message 1447298.  

Yes, I agree; I buy nothing but those, but using overclocking tools you can push the clocks even higher. I've never had problems with factory-overclocked cards as far as SETI@home is concerned, and should it be the case, the same overclocking tools can be used to lower the clocks. To be perfectly clear: I'm speaking about NVIDIA hardware and using MSI Afterburner to achieve my goals.

Greez
To boldly crunch ...
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1447318 - Posted: 27 Nov 2013, 6:54:46 UTC

You may not have had problems with them, but sadly there are plenty of others out there who do, and they don't respond to PMs.

Cheers.
Sp@ceNv@der Project Donor
Joined: 10 Jul 05
Posts: 41
Credit: 117,366,167
RAC: 152
Belgium
Message 1447334 - Posted: 27 Nov 2013, 7:36:00 UTC - in response to Message 1447318.  

I agree on that too... not everybody is a technical expert either... I've tweaked my hardware to benefit the project in the best possible way, which also means no harm being done to your wingmates... People also need the willingness to listen and react when others try to help...

Greez.
To boldly crunch ...
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1447931 - Posted: 28 Nov 2013, 15:42:37 UTC - in response to Message 1446887.  
Last modified: 28 Nov 2013, 15:48:47 UTC

They all seem at first to get to around 13-16K total tasks before the server restrictions start to kick in.

Here's the latest example to plague us, http://setiathome.berkeley.edu/results.php?hostid=5173830.
...

Hmm, that host is at 2298 in the Top Computers list sorted by RAC, and at 3019 when sorted by Total credit. The 70 or so valid results per day are far more than the average host achieves, so perhaps from the project perspective that host is more of an asset than a liability in spite of producing 3800 or so invalid results per day.


BIG if here. IF those 70 or so valid results are scientifically valid, and didn't just pass an imperfect validator procedure.
The biggest issue with CUDA GPUs is that when one fails, it produces a similar pattern of results, not just random ones. That makes such failing hosts much more dangerous: they can validate against each other, leaving the project database undefended; the validator doesn't help in this situation.
I would vote for a change to the scheduler: never pair CUDA GPUs against each other, just as hosts of the same owner are never paired against each other.
But today too much work is performed by CUDA GPUs, so such a restriction could slow the project down.
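
A sketch of what that scheduler rule might look like, using simplified host and workunit records; this is illustrative only, not the actual BOINC feasibility-check code:

[code]
#include <vector>

// Simplified records (hypothetical; the real scheduler types differ).
struct Host {
    bool has_cuda_gpu;
};

struct Workunit {
    std::vector<Host> hosts_with_replicas;  // hosts already holding a copy
};

// Suggested rule: refuse to send a replica to a CUDA host when another
// replica of the same workunit already sits on a CUDA host, mirroring
// the existing "never pair hosts of the same owner" restriction.
bool violates_cuda_pairing(const Workunit& wu, const Host& candidate) {
    if (!candidate.has_cuda_gpu) return false;
    for (const Host& h : wu.hosts_with_replicas) {
        if (h.has_cuda_gpu) return true;  // both replicas would be CUDA
    }
    return false;
}
[/code]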

The current quota system is definitely inadequate for the GPU world. Unfortunately, the BOINC staff refused to discuss or react to this topic; I tried twice to bring it to their attention. And I don't have the impressive patience Claggy has, pursuing them for more than half a year just to get one bugfix made... Maybe someone else with great patience could pick this up?...
SETI apps news
We're not gonna fight them. We're gonna transcend them.
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1447997 - Posted: 28 Nov 2013, 18:14:22 UTC - in response to Message 1447931.  
Last modified: 28 Nov 2013, 18:15:21 UTC

They all seem at first to get to around 13-16K total tasks before the server restrictions start to kick in.

Here's the latest example to plague us, http://setiathome.berkeley.edu/results.php?hostid=5173830.
...

Hmm, that host is at 2298 in the Top Computers list sorted by RAC, and at 3019 when sorted by Total credit. The 70 or so valid results per day are far more than the average host achieves, so perhaps from the project perspective that host is more of an asset than a liability in spite of producing 3800 or so invalid results per day.


BIG if here. IF those 70 or so valid results are scientifically valid, and didn't just pass an imperfect validator procedure.
The biggest issue with CUDA GPUs is that when one fails, it produces a similar pattern of results, not just random ones. That makes such failing hosts much more dangerous: they can validate against each other, leaving the project database undefended; the validator doesn't help in this situation.
I would vote for a change to the scheduler: never pair CUDA GPUs against each other, just as hosts of the same owner are never paired against each other.
But today too much work is performed by CUDA GPUs, so such a restriction could slow the project down.

The current quota system is definitely inadequate for the GPU world. Unfortunately, the BOINC staff refused to discuss or react to this topic; I tried twice to bring it to their attention. And I don't have the impressive patience Claggy has, pursuing them for more than half a year just to get one bugfix made... Maybe someone else with great patience could pick this up?...


Been examining this situation for quite some time. For x42 a three-pronged approach is planned: user education, new tools for monitoring and benchmark/stability testing, and application fault detection and tolerance. I've tended to steer away from directions that would increase server complexity or load, because quality in the server codebase is already a significant problem, and that code doesn't get a lot of input from volunteer devs.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Philhnnss
Volunteer tester
Joined: 22 Feb 08
Posts: 63
Credit: 30,694,327
RAC: 162
United States
Message 1448241 - Posted: 29 Nov 2013, 7:31:17 UTC

Well, shoot. It looks like one of my machines is adding to the invalid work. I only crunch during the winter, so I just brought this one back online a few weeks ago. It had been working fine till now, and I have not changed anything that I know of. The only variable I can think of is that it has warmed back up a little, but according to my video card monitoring software, both cards (Asus ENGTS450 Direct) run around 110 F. I have run them at higher temps before without problems.

I've shut that machine down; I don't want to add to the problem. Any thoughts or ideas would be greatly appreciated.

Here is the computer:

http://setiathome.berkeley.edu/show_host_detail.php?hostid=7060336
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1448255 - Posted: 29 Nov 2013, 8:29:58 UTC - in response to Message 1448241.  
Last modified: 29 Nov 2013, 8:30:36 UTC

Well, shoot. It looks like one of my machines is adding to the invalid work. I only crunch during the winter, so I just brought this one back online a few weeks ago. It had been working fine till now, and I have not changed anything that I know of. The only variable I can think of is that it has warmed back up a little, but according to my video card monitoring software, both cards (Asus ENGTS450 Direct) run around 110 F. I have run them at higher temps before without problems.

I've shut that machine down; I don't want to add to the problem. Any thoughts or ideas would be greatly appreciated.

Here is the computer:

http://setiathome.berkeley.edu/show_host_detail.php?hostid=7060336



Something's nagging at my memory about this:

NVIDIA GeForce GTS 450 (1023MB) driver: 320.49


Possibly something from around the 700 series introduction and some significant Windows platform updates. IIRC, that period had some expected teething problems, with radical tech changes all around (including in Windows).

Since r331 WHQL drivers are available for XP 64-bit, I'd suggest checking that all Windows updates are applied, then updating the video driver to match.

A mismatch between driver technology and Windows platform updates seems to result in some pretty strange issues. At the very least, updating shouldn't make things worse.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22200
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1448341 - Posted: 29 Nov 2013, 15:19:00 UTC

DO NOT allow Windows automatic update to update your video driver - the Microsoft versions of the drivers have lots of functionality removed. Use only the NVIDIA version and do a clean install.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Andy Westcott
Joined: 8 Nov 00
Posts: 101
Credit: 1,282,556
RAC: 0
United Kingdom
Message 1448362 - Posted: 29 Nov 2013, 16:28:43 UTC

If the invalid results could be due to overheating, could you simply set the processor usage to a lower figure than 100%? Just a thought, as sometimes it is easy to overlook the simpler solutions.
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1448368 - Posted: 29 Nov 2013, 16:40:23 UTC - in response to Message 1448362.  

If the invalid results could be due to overheating, could you simply set the processor usage to a lower figure than 100%? Just a thought, as sometimes it is easy to overlook the simpler solutions.

Be careful about recommending that GPU users throttle their CPUs by reducing processor usage - especially when they are using recent (v7.0.64 or later) versions of BOINC.

There appear to be unresolved side-effects that I'm still trying to get the developers to consider.
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1448443 - Posted: 29 Nov 2013, 19:10:34 UTC - in response to Message 1448368.  

If the invalid results could be due to overheating, could you simply set the processor usage to a lower figure than 100%? Just a thought, as sometimes it is easy to overlook the simpler solutions.

Be careful about recommending that GPU users throttle their CPUs by reducing processor usage - especially when they are using recent (v7.0.64 or later) versions of BOINC.

There appear to be unresolved side-effects that I'm still trying to get the developers to consider.


To clarify, the web [url=http://setiathome.berkeley.edu/prefs.php?subset=global]Computing preferences[/url] have these two lines:

On multiprocessors, use at most: 100% of the processors (enforced by version 6.1+)
Use at most: 100% of CPU time (can be used to reduce CPU heat)

It's the "100% of CPU time" setting which controls the throttling; that throttling may or may not also be applied to GPU tasks, depending on exactly which version of BOINC is being used (and when applied it may have side-effects). AFAIK, the "100% of the processors" setting can be reduced with no unexpected side effects.
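
As an illustration of why the two settings behave so differently, here is a minimal sketch of "% of CPU time" duty-cycle throttling; the loop structure and names are assumed, not the actual BOINC client code:

[code]
#include <chrono>
#include <thread>

// "Use at most N% of CPU time" (sketch): compute runs for N% of each
// one-second period and is suspended for the rest. It is this on/off
// cycling, when also applied to GPU tasks, that can produce side-effects;
// "% of the processors" merely runs fewer tasks and involves no cycling.
void throttled_compute(double cpu_time_pct) {
    using namespace std::chrono;
    const auto period = milliseconds(1000);
    const auto run = milliseconds(
        static_cast<long>(period.count() * cpu_time_pct / 100.0));
    while (true) {
        const auto t0 = steady_clock::now();
        while (steady_clock::now() - t0 < run) {
            // do_work_slice();  // hypothetical unit of computation
        }
        std::this_thread::sleep_for(period - run);  // task suspended
    }
}
[/code]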
                                                                  Joe
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1448444 - Posted: 29 Nov 2013, 19:16:41 UTC - in response to Message 1448443.  

It's the "100% of CPU time" setting which controls the throttling; that throttling may or may not also be applied to GPU tasks, depending on exactly which version of BOINC is being used (and when applied it may have side-effects). AFAIK, the "100% of the processors" setting can be reduced with no unexpected side effects.
                                                                  Joe

Agreed, and quite correct. Thanks, Joe.