Message boards :
Number crunching :
Major problem with a user
Message board moderation
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
I'm thinking of two things. If an error is reported, send them a test file; if the results are correct, send them work again. I'm not sure what the percentage split of CPU/GPU work is, but send _0 to CPU and _1 to GPU, and if a result isn't valid, then resends go to CPU only. I think there is more CPU work being done, but I could be wrong on that. I would love to see the percentage of CPU/GPU work being done. |
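The routing rule described above can be sketched in a few lines. This is purely illustrative (the function name and return values are made up, and the real BOINC scheduler does nothing like this today): the _0 replication goes to a CPU host, the _1 replication to a GPU host, and any resend after a validation failure goes back to CPU only.

```python
def route_task(replica_suffix: int, is_resend: bool) -> str:
    """Hypothetical routing: _0 to CPU, _1 to GPU, resends to CPU only."""
    if is_resend:
        # After a validation failure, only CPU hosts get the resend.
        return "cpu"
    return "cpu" if replica_suffix == 0 else "gpu"

print(route_task(0, False), route_task(1, False), route_task(1, True))
```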
Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
Damn, dudded by another pair of bad ATi/AMD hosts. :-( http://setiathome.berkeley.edu/workunit.php?wuid=1804279411 http://setiathome.berkeley.edu/results.php?hostid=7492638 http://setiathome.berkeley.edu/results.php?hostid=7024228 Now, we've seen certain classes of Nvidia GPUs being blocked from getting work when using bad/unsuitable drivers, so why can't the same be done for AMD/ATi cards? (And there seem to be a lot more of them around lately.) Cheers. |
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
"That's the worst thing when it comes to these runaway computers. The science is getting polluted more and more each day." So true. |
betreger Send message Joined: 29 Jun 99 Posts: 11361 Credit: 29,581,041 RAC: 66 |
"That's the worst thing when it comes to these runaway computers. The science is getting polluted more and more each day." And when they are looked at more closely (think ntpckr, for example) they will get tossed out. IMO not a big deal in the cosmic picture of things, just a small waste of resources. |
HAL9000 Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
"Damn, dudded by another pair of bad ATi/AMD hosts. :-(" I think it was Eric who said on Beta that AMD/ATI doesn't make it easy to detect the driver version. What is currently used is the driver's CAL version, which is, or at least I think was, being used to block drivers that are too old. However, with the current generation of cards dropping CAL support, BOINC doesn't report any driver version back to the server. So it is up to the BOINC dev team to implement a new driver detection scheme. The information that is provided by clinfo in "Driver version:" is really what we need to know. Another issue is that the OpenCL component is separate from the driver, and it may not get updated when the driver does if something like DDU is not used. With a mismatch of driver and runtime, bad things can happen. The CAL version that BOINC reports (and that the server can be set to allow/block on) will show the most recent, but with an older runtime, tasks often just spit out garbage. I don't know if Nvidia has the same issue with their OpenCL support, but I haven't seen that kind of issue with CUDA support. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url] |
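For anyone curious what "the information clinfo provides" would look like in practice, here is a minimal sketch of extracting the "Driver version" field from clinfo's text output. The sample excerpt is made up for illustration, clinfo's field layout varies by vendor and version, so the regex is deliberately loose:

```python
import re

def parse_driver_versions(clinfo_output: str) -> list[str]:
    """Pull every 'Driver version' value out of clinfo-style text output."""
    return re.findall(r"Driver version:?\s+(\S+)", clinfo_output)

# Made-up excerpt shaped like typical clinfo output, for illustration only.
sample = (
    "  Platform Name: AMD Accelerated Parallel Processing\n"
    "  Driver version: 1445.5 (VM)\n"
)
print(parse_driver_versions(sample))
```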
Sutaru Tsureku Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
Cosmic_Ocean wrote: (...) I would suggest (if it's technically possible server-side) that each '-9 overflow' result which comes from a GPU should additionally be sent to a CPU (a CPU-only host, or the CPU of a host) to check whether it's really a '-9 overflow'. If two GPUs each send a '-9 overflow' result, two CPUs check whether they are correct. So for such a WU: two GPUs with '-9 overflow' and two CPUs with good results. This way the science is rescued. |
HAL9000 Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Cosmic_Ocean wrote: (...) I think it was said before that messages like the "-9 overflow" are just informative messages, only included in the stderr_txt. So the validators never see it, as they only look at the result file. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url] |
rob smith Send message Joined: 7 Mar 03 Posts: 22202 Credit: 416,307,556 RAC: 380 |
As a heavy GPU user I would suggest that the majority of issues with -9 overflows are from mismanaged or misconfigured systems. This problem is NOT the sole domain of GPUs; I recently saw some coming from a seriously overclocked CPU... Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Cosmic_Ocean wrote: (...) The validator logic does check whether the stderr section for the canonical result contains "result_overflow". If so, a flag bit is set in what gets assimilated, and the runtime_outlier flag is set so the run time won't affect the averages used to estimate host speed. What's technically possible probably includes all of the ideas which have been proposed, but what's practically possible without huge changes to the BOINC server code is much less, and even if such changes were made they might be too much of a burden on the BOINC database to be practical here. Getting newer MB7 OpenCL_ATi builds released here should reduce the problem to a very rare occurrence. The errors those builds will produce if the host has too-old drivers will drive the host's app_version quota down to 1 very quickly, and it will remain there until newer drivers are installed. That's not as nice as somehow making the app_plan not send tasks to those hosts, but it will get the job done. Joe |
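The validator-side check Joe describes can be sketched like this. Note this is a reconstruction from his description, not the actual BOINC validator code; the flag bit's position is hypothetical, and the real implementation is C++ server code:

```python
OVERFLOW_FLAG = 0x1  # hypothetical bit position in the assimilated flags

def check_overflow(canonical_stderr: str) -> tuple[int, bool]:
    """If the canonical result's stderr contains "result_overflow",
    set a flag bit for assimilation and mark the result a runtime
    outlier so its run time is excluded from host-speed averages."""
    flags = 0
    runtime_outlier = False
    if "result_overflow" in canonical_stderr:
        flags |= OVERFLOW_FLAG
        runtime_outlier = True
    return flags, runtime_outlier
```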
Rasputin42 Send message Joined: 25 Jul 08 Posts: 412 Credit: 5,834,661 RAC: 0 |
Whatever is implemented, the sooner the better, even if it is not perfect. |
Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
(...) Although it would be really nice if the apps could detect that something went awry and give the task a non-zero error status, it sounds like that would be more involved than just changing a few lines of code for the daily quotas.

Having the daily quota bottom out at 33 leaves a minimum of 33 tasks/day for a bad host to trash. If it were able to go all the way down to 1, that would definitely make a noticeable difference. Okay, so that's not fixing the problem, it is basically just damage control, but the quotas are already in place; they just need some very minor adjustments (being able to go down to 1, /2 for every bad task if the quota is below 200, /5 if above 200, and +2 for every good task).

However... I'm not even sure the quotas are enforced or applied. I remember not terribly long ago, shortly after APv7 was released, my quota was in the mid-40s, and I got almost 70 tasks in the course of about an hour within the same day UTC. I should have been limited to my daily quota, but I wasn't. That suggests to me that the quotas are calculated and displayed on the application details page, but not enforced.

So.. A) let's start enforcing them, and B) let them go back down to 1 instead of 33. These runaway machines will quickly become basically a non-issue. If you want to add an extra layer of complexity to the code, have something automated so that when a host gets down to a quota of 1, the system sends them an email to let them know that something is amiss, that they should look into it, and that they can report here to Number Crunching if they have any questions. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up) |
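The proposed quota adjustments can be written down in a few lines. This is a sketch of the rules as stated in the post above (floor of 1, /2 on a bad task at or below 200, /5 above 200, +2 on a good task), not of what the BOINC server actually does:

```python
MIN_QUOTA = 1  # the proposed floor (down from 33)

def update_quota(quota: int, task_ok: bool) -> int:
    """Apply the proposed per-task quota adjustment."""
    if task_ok:
        return quota + 2
    # Bad task: harsher division when the quota is still high.
    divisor = 5 if quota > 200 else 2
    return max(MIN_QUOTA, quota // divisor)
```

With these rules a runaway host at quota 1000 drops to 200 after one bad task, then to 100, 50, and so on, while a healthy host climbs back only 2 tasks at a time.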
Rasputin42 Send message Joined: 25 Jul 08 Posts: 412 Credit: 5,834,661 RAC: 0 |
This all seems to be a fairly recent phenomenon. So, what changed and when? |
Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
"This all seems to be a fairly recent phenomenon." A wider variety of GPUs in the past two years or so, which means more points of failure. Plus the minimum value for the daily quota per app per machine got raised to 33 (to help people with GPUs get a reasonable number of tasks more quickly), but the catch-22 is that it allows runaway machines to continue running away. Plus the fact that I'm pretty sure the quota isn't even applied/enforced anyway, so it becomes moot.

These -9 overflow tasks did happen before, but they were seldom seen, because there were fewer GPUs crunching a few years ago, so it was much more likely that two GPUs would not be paired together on one WU. Since GPUs are becoming relatively mainstream, it is becoming much more frequent, increased further by the aforementioned daily quota mechanism being essentially useless.

Although, it would not surprise me at all if instead of the quota being per app per machine, it was per device using that app per machine (so if a CPU was limited to, say, 20/day, but it has 16 cores, then it would be able to get 320 tasks/day). It needs to just be "per app, per machine" regardless of the number of devices/instances using that app. That's about all I've got to say on the subject. I'll quit ranting about it now. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up) |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
It could be done once the criteria for selection were formulated. |
Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
I just got diddled again by a bad pair of AMD/ATi users. :-( http://setiathome.berkeley.edu/workunit.php?wuid=1810647956 http://setiathome.berkeley.edu/results.php?hostid=6805978 http://setiathome.berkeley.edu/results.php?hostid=7442023 Cheers. |
Darth Beaver Send message Joined: 20 Aug 99 Posts: 6728 Credit: 21,443,075 RAC: 3 |
You can add this user: http://setiathome.berkeley.edu/results.php?hostid=6754859 and this one: http://setiathome.berkeley.edu/results.php?hostid=7337318 and this one: http://setiathome.berkeley.edu/show_host_detail.php?hostid=5469669 This guy is just trashing thousands and does need to be suspended: http://setiathome.berkeley.edu/results.php?hostid=4026248 Most of these wingmen are trashing several gigs of data between them. I have sent them PMs; don't expect to hear from them, I just hope they stop. |
Darth Beaver Send message Joined: 20 Aug 99 Posts: 6728 Credit: 21,443,075 RAC: 3 |
Another one trashing thousands: 1 gig of data trashed, and it popped up in my inconclusives today. http://setiathome.berkeley.edu/results.php?hostid=7190564 |
Kathy Send message Joined: 5 Jan 03 Posts: 338 Credit: 27,877,436 RAC: 0 |
"you can add this user" I had 7337318 as a wingman two days ago and did a print screen. They had 4116 WUs, 99.9% of which were overflows. Only 12 were valid, and of the valid WUs, one was an overflow! |
rob smith Send message Joined: 7 Mar 03 Posts: 22202 Credit: 416,307,556 RAC: 380 |
...Makes the efforts of my errant cruncher look very tame in comparison; it wrecks one or two per day. I'm looking forward to getting home tonight and sorting it out one way or the other. (It's got a sad GTX760, so I can pull the card and see if the dust bunnies are rampant, re-seat it, and check the drivers; if worst comes to the worst I'll just get another GTX980...) Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Graham Middleton Send message Joined: 1 Sep 00 Posts: 1520 Credit: 86,815,638 RAC: 0 |
"you can add this user" It looks like some of these errant guys might even be going out of their way to sabotage the science, building and configuring rigs that intentionally trash WUs as fast as possible! Happy Crunching, Graham |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.