When can some better controls be put on runaway rigs?

Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1446504 - Posted: 25 Nov 2013, 6:19:59 UTC
Last modified: 25 Nov 2013, 6:30:37 UTC

They all seem at first to get to around 13-16K total tasks before the server restrictions start to kick in.

Here's the latest example to plague us, http://setiathome.berkeley.edu/results.php?hostid=5173830.

Surely something can be done to restrict these things better.

[edit] BTW, it also takes ages to get them back down to a respectable (sort of, if you can call it that) level of non-annoyance. [/edit]

Cheers.
Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1446528 - Posted: 25 Nov 2013, 10:17:54 UTC - in response to Message 1446504.  
Last modified: 25 Nov 2013, 10:19:17 UTC

Surely something can be done to restrict these things better.

1. Count invalid results as errors, i.e. decrease the quota for those too.
2. Don't double the quota for each valid result. If we decrease the quota by 1 for each error, we should not increase it by more than 1 for each valid result. Even then a host would need only 50% valid results to keep its quota close to the default value (I'd prefer a system which requires at least 80%, i.e. 5 down, 1 up; that's really not too much to expect from a properly working machine; see the sketch after this list).
3. Since we all know that some GPUs need a reboot sometimes, introduce a "reset quota" button on the application page of a host, so when someone fixes his machine, he can get some results to test it. Of course there should be a limit on how often this button can be used.
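
A minimal sketch of how rules 1 and 2 could be wired together, assuming a per-host, per-app_version quota field similar to "Max tasks per day"; all names and constants here are hypothetical, not actual BOINC server code:

[code]
#include <algorithm>

// Hypothetical per-host, per-app_version record (names assumed).
struct HostAppVersion {
    int daily_quota;  // current "Max tasks per day"
};

const int QUOTA_DEFAULT = 100;  // project baseline (assumed value)
const int QUOTA_MIN     = 1;    // never cut a host off completely

// Rule 1: invalid results count the same as errors.
// Rule 2: "5 down, 1 up" - a bad result costs 5, a valid one earns
// back only 1, so a host needs about five valid results per bad one
// (roughly 80% valid) just to hold its quota steady.
void on_result(HostAppVersion& hav, bool valid) {
    if (valid) {
        hav.daily_quota = std::min(hav.daily_quota + 1, QUOTA_DEFAULT);
    } else {
        hav.daily_quota = std::max(hav.daily_quota - 5, QUOTA_MIN);
    }
}
[/code]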
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1446887 - Posted: 26 Nov 2013, 1:51:26 UTC - in response to Message 1446504.  

They all seem at first to get to around 13-16K total tasks before the server restrictions start to kick in.

Here's the latest example to plague us, http://setiathome.berkeley.edu/results.php?hostid=5173830.
...

Hmm, that host is at 2298 in the Top Computers list sorted by RAC, and at 3019 when sorted by Total credit. The 70 or so valid results per day are far more than the average host achieves, so perhaps from the project perspective that host is more of an asset than a liability in spite of producing 3800 or so invalid results per day. Obviously if there were very many hosts doing the same it could become a problem for the servers to handle the extra traffic, but as a relatively rare case it at least suggests thinking about something better than severely restricting work delivery to the host.

If the project did want to be more restrictive, changing their setting for gpu_multiplier to 1 would make sense (as long as the 100 in progress limit is maintained). Then each GPU would have a quota of "Max tasks per day" for each app_version it can use (rather than 8x "Max tasks per day" as it is now).
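
To make the arithmetic concrete, a sketch of how the effective daily GPU quota falls out of that setting (function and parameter names are illustrative, not the scheduler's actual code):

[code]
// Daily GPU task quota for one app_version on a host (illustrative).
int effective_gpu_quota(int max_tasks_per_day, int gpu_multiplier, int ngpus) {
    return max_tasks_per_day * gpu_multiplier * ngpus;
}

// Current setting:   gpu_multiplier = 8 -> 8x "Max tasks per day" per GPU.
// Suggested setting: gpu_multiplier = 1 -> exactly "Max tasks per day"
// per GPU, with the 100-in-progress limit still capping the queue.
[/code]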
                                                                   Joe
Blurf
Volunteer tester
Joined: 2 Sep 06
Posts: 8962
Credit: 12,678,685
RAC: 0
United States
Message 1446900 - Posted: 26 Nov 2013, 2:44:22 UTC

I suspect the answer lies in... when the SETI staff actually have time to address the concerns.


Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1446956 - Posted: 26 Nov 2013, 9:24:25 UTC - in response to Message 1446887.  
Last modified: 26 Nov 2013, 9:31:28 UTC

Hmm, that host is at 2298 in the Top Computers list sorted by RAC, and at 3019 when sorted by Total credit. The 70 or so valid results per day are far more than the average host achieves, so perhaps from the project perspective that host is more of an asset than a liability in spite of producing 3800 or so invalid results per day.

It's known that fake result overflows from such machines can validate against each other. So if we are doing some science here and not just generating heat, I don't see how having bad results in the database, and maybe missing some interesting signals, would be beneficial for the project.

And if the project gets so little processing power that it has to keep such hosts active at any price, maybe it would be better to stop such hosts from trashing WUs and to increase the limits, so that machines which actually do something useful for the project don't run dry every time there's some little issue with the servers (or the weekly maintenance). Considering the reason why we have limits, i.e. too many results in the database, such hosts are indeed a liability.
Sp@ceNv@der Project Donor
Joined: 10 Jul 05
Posts: 41
Credit: 117,366,167
RAC: 152
Belgium
Message 1446964 - Posted: 26 Nov 2013, 10:19:18 UTC

In my personal experience with these invalid results, I've discovered they are mainly due to overclocking the GPU too much. I started seeing those results, and they stopped when I re-adjusted some GPU settings, i.e. lowered values such as core speed and memory speed. I had no problems whatsoever when gaming or rendering, but crunching SETI@home sometimes resulted in invalids. My cooling is more than OK, but the level of overclocking was just too high for this project. Now all settings on all cards are optimal for the SETI project, yet still overclocked for everything else...
To boldly crunch ...
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1447298 - Posted: 27 Nov 2013, 6:32:03 UTC - in response to Message 1446964.  

In my personal experience with these invalid results, I've discovered they are mainly due to overclocking the GPU too much. I started seeing those results, and they stopped when I re-adjusted some GPU settings, i.e. lowered values such as core speed and memory speed. I had no problems whatsoever when gaming or rendering, but crunching SETI@home sometimes resulted in invalids. My cooling is more than OK, but the level of overclocking was just too high for this project. Now all settings on all cards are optimal for the SETI project, yet still overclocked for everything else...

The thing is that many GPUs these days are overclocked before you get them, the GTX 560 Ti being the prime example.

Cheers.
Sp@ceNv@der Project Donor
Joined: 10 Jul 05
Posts: 41
Credit: 117,366,167
RAC: 152
Belgium
Message 1447308 - Posted: 27 Nov 2013, 6:41:38 UTC - in response to Message 1447298.  

Yes, I agree; I buy nothing but those, but using overclocking tools you can push the clocks even higher. I've never had problems with factory-overclocked cards as far as SETI@home is concerned, and should it be the case, the same overclocking tools can be used to lower the clocks. To be perfectly clear: I'm speaking about NVIDIA hardware and using MSI Afterburner to achieve my goals.

Greez
To boldly crunch ...
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1447318 - Posted: 27 Nov 2013, 6:54:46 UTC

You may not have had problems with them, but sadly there are plenty of others out there who do, and they don't respond to PMs.

Cheers.
Sp@ceNv@der Project Donor
Joined: 10 Jul 05
Posts: 41
Credit: 117,366,167
RAC: 152
Belgium
Message 1447334 - Posted: 27 Nov 2013, 7:36:00 UTC - in response to Message 1447318.  

I agree on that too... not everybody is a technical expert either... I've tweaked my hardware to benefit the project in the best possible way, which also means no harm being done to your wingmates... People also need the willingness to listen and react when others try to help...

Greez.
To boldly crunch ...
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1447931 - Posted: 28 Nov 2013, 15:42:37 UTC - in response to Message 1446887.  
Last modified: 28 Nov 2013, 15:48:47 UTC

They all seem at first to get to around 13-16K total tasks before the server restrictions start to kick in.

Here's the latest example to plague us, http://setiathome.berkeley.edu/results.php?hostid=5173830.
...

Hmm, that host is at 2298 in the Top Computers list sorted by RAC, and at 3019 when sorted by Total credit. The 70 or so valid results per day are far more than the average host achieves, so perhaps from the project perspective that host is more of an asset than a liability in spite of producing 3800 or so invalid results per day.


BIG if here. IF those 70 or so valid results are scientifically valid, and didn't just pass an imperfect validator procedure.
The biggest issue with CUDA GPUs is that when one fails, it produces a similar pattern of results, not just random ones. That makes such failing hosts much more dangerous: they can validate against each other, leaving the project database undefended; the validator doesn't help in this situation.
I would vote for a change to the scheduler: never pair CUDA GPUs against each other, just as hosts of the same owner are never paired against each other.
But today too much work is performed by CUDA GPUs, so such a restriction could slow the project down.
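
A sketch of what that scheduler rule might look like, using simplified host and workunit records; this is illustrative only, not the actual BOINC feasibility-check code:

[code]
#include <vector>

// Simplified records (hypothetical; the real scheduler types differ).
struct Host {
    bool has_cuda_gpu;
};

struct Workunit {
    std::vector<Host> hosts_with_replicas;  // hosts already holding a copy
};

// Suggested rule: refuse to send a replica to a CUDA host when another
// replica of the same workunit already sits on a CUDA host, mirroring
// the existing "never pair hosts of the same owner" restriction.
bool violates_cuda_pairing(const Workunit& wu, const Host& candidate) {
    if (!candidate.has_cuda_gpu) return false;
    for (const Host& h : wu.hosts_with_replicas) {
        if (h.has_cuda_gpu) return true;  // both replicas would be CUDA
    }
    return false;
}
[/code]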

The current quota system is definitely inadequate for the GPU world. Unfortunately, the BOINC staff refused to discuss or react to this topic; I tried twice to bring it to their attention. And I don't have the impressive patience Claggy has, pursuing them for more than half a year just to get one bugfix made... Maybe someone else with great patience could pick this up?...
SETI apps news
We're not gonna fight them. We're gonna transcend them.
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1447997 - Posted: 28 Nov 2013, 18:14:22 UTC - in response to Message 1447931.  
Last modified: 28 Nov 2013, 18:15:21 UTC

They all seem at first to get to around 13-16K total tasks before the server restrictions start to kick in.

Here's the latest example to plague us, http://setiathome.berkeley.edu/results.php?hostid=5173830.
...

Hmm, that host is at 2298 in the Top Computers list sorted by RAC, and at 3019 when sorted by Total credit. The 70 or so valid results per day are far more than the average host achieves, so perhaps from the project perspective that host is more of an asset than a liability in spite of producing 3800 or so invalid results per day.


BIG if here. IF those 70 or so valid results are scientifically valid, and didn't just pass an imperfect validator procedure.
The biggest issue with CUDA GPUs is that when one fails, it produces a similar pattern of results, not just random ones. That makes such failing hosts much more dangerous: they can validate against each other, leaving the project database undefended; the validator doesn't help in this situation.
I would vote for a change to the scheduler: never pair CUDA GPUs against each other, just as hosts of the same owner are never paired against each other.
But today too much work is performed by CUDA GPUs, so such a restriction could slow the project down.

The current quota system is definitely inadequate for the GPU world. Unfortunately, the BOINC staff refused to discuss or react to this topic; I tried twice to bring it to their attention. And I don't have the impressive patience Claggy has, pursuing them for more than half a year just to get one bugfix made... Maybe someone else with great patience could pick this up?...


Been examining this situation for quite some time. For x42 a three-pronged approach is planned: user education, new tools for monitoring and benchmark/stability testing, and application fault detection and tolerance. I've tended to steer away from directions that would increase server complexity or load, because quality in the server codebase is already a significant problem, and that code doesn't get a lot of input from volunteer devs.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Philhnnss
Volunteer tester
Joined: 22 Feb 08
Posts: 63
Credit: 30,694,327
RAC: 162
United States
Message 1448241 - Posted: 29 Nov 2013, 7:31:17 UTC

Well, shoot. It looks like one of my machines is adding to the invalid work. I only crunch during the winter, so I just brought this one back online a few weeks ago. It had been working fine till now, and I have not changed anything that I know of. The only variable I can think of is that it has warmed back up a little, but according to my video card monitoring software, both cards (Asus ENGTS450 Direct) run around 110 F. I have run them at higher temps before without problems.

I've shut that machine down; I don't want to add to the problem. Any thoughts or ideas would be greatly appreciated.

Here is the computer:

http://setiathome.berkeley.edu/show_host_detail.php?hostid=7060336
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1448255 - Posted: 29 Nov 2013, 8:29:58 UTC - in response to Message 1448241.  
Last modified: 29 Nov 2013, 8:30:36 UTC

Well, shoot. It looks like one of my machines is adding to the invalid work. I only crunch during the winter, so I just brought this one back online a few weeks ago. It had been working fine till now, and I have not changed anything that I know of. The only variable I can think of is that it has warmed back up a little, but according to my video card monitoring software, both cards (Asus ENGTS450 Direct) run around 110 F. I have run them at higher temps before without problems.

I've shut that machine down; I don't want to add to the problem. Any thoughts or ideas would be greatly appreciated.

Here is the computer:

http://setiathome.berkeley.edu/show_host_detail.php?hostid=7060336



Something's nagging at my memory about this:

NVIDIA GeForce GTS 450 (1023MB) driver: 320.49


Possibly something from around the 700 series introduction and some significant Windows platform updates. IIRC, that period had some expected teething problems, with radical tech changes all around (including in Windows).

Since r331 WHQL drivers are available for XP 64-bit, I'd suggest checking that all Windows updates are applied, then updating the video driver to match.

A mismatch between driver technology and Windows platform updates seems to result in some pretty strange issues. At the very least, updating shouldn't make things worse.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22200
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1448341 - Posted: 29 Nov 2013, 15:19:00 UTC

DO NOT allow Windows automatic update to update your video driver - the Microsoft versions of the drivers have lots of functionality removed. Use only the NVIDIA version and do a clean install.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Andy Westcott
Joined: 8 Nov 00
Posts: 101
Credit: 1,282,556
RAC: 0
United Kingdom
Message 1448362 - Posted: 29 Nov 2013, 16:28:43 UTC

If the invalid results could be due to overheating, could you simply set the processor usage to a lower figure than 100%? Just a thought, as sometimes it is easy to overlook the simpler solutions.
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1448368 - Posted: 29 Nov 2013, 16:40:23 UTC - in response to Message 1448362.  

If the invalid results could be due to overheating, could you simply set the processor usage to a lower figure than 100%? Just a thought, as sometimes it is easy to overlook the simpler solutions.

Be careful about recommending that GPU users throttle their CPUs by reducing processor usage - especially when they are using recent (v7.0.64 or later) versions of BOINC.

There appear to be unresolved side-effects that I'm still trying to get the developers to consider.
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1448443 - Posted: 29 Nov 2013, 19:10:34 UTC - in response to Message 1448368.  

If the invalid results could be due to overheating, could you simply set the processor usage to a lower figure than 100%? Just a thought, as sometimes it is easy to overlook the simpler solutions.

Be careful about recommending that GPU users throttle their CPUs by reducing processor usage - especially when they are using recent (v7.0.64 or later) versions of BOINC.

There appear to be unresolved side-effects that I'm still trying to get the developers to consider.


To clarify, the web [url=http://setiathome.berkeley.edu/prefs.php?subset=global]Computing preferences[/url] have these two lines:

On multiprocessors, use at most: 100% of the processors (enforced by version 6.1+)
Use at most: 100% of CPU time (can be used to reduce CPU heat)

It's the "100% of CPU time" setting which controls the throttling; that throttling may or may not also be applied to GPU tasks, depending on exactly which version of BOINC is being used (and when applied it may have side-effects). AFAIK, the "100% of the processors" setting can be reduced with no unexpected side effects.
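
As an illustration of why the two settings behave so differently, here is a minimal sketch of "% of CPU time" duty-cycle throttling; the loop structure and names are assumed, not the actual BOINC client code:

[code]
#include <chrono>
#include <thread>

// "Use at most N% of CPU time" (sketch): compute runs for N% of each
// one-second period and is suspended for the rest. It is this on/off
// cycling, when also applied to GPU tasks, that can produce side-effects;
// "% of the processors" merely runs fewer tasks and involves no cycling.
void throttled_compute(double cpu_time_pct) {
    using namespace std::chrono;
    const auto period = milliseconds(1000);
    const auto run = milliseconds(
        static_cast<long>(period.count() * cpu_time_pct / 100.0));
    while (true) {
        const auto t0 = steady_clock::now();
        while (steady_clock::now() - t0 < run) {
            // do_work_slice();  // hypothetical unit of computation
        }
        std::this_thread::sleep_for(period - run);  // task suspended
    }
}
[/code]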
                                                                  Joe
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1448444 - Posted: 29 Nov 2013, 19:16:41 UTC - in response to Message 1448443.  

It's the "100% of CPU time" setting which controls the throttling; that throttling may or may not also be applied to GPU tasks, depending on exactly which version of BOINC is being used (and when applied it may have side-effects). AFAIK, the "100% of the processors" setting can be reduced with no unexpected side effects.
                                                                  Joe

Agreed, and quite correct. Thanks, Joe.