Major problem with a user

Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1686405 - Posted: 31 May 2015, 22:54:55 UTC

I'm thinking 2 things.

If an error is reported, send them a test file; if the results are correct, send them work again.

I'm not sure what the percentage of CPU/GPU work is, but send _0 to CPU, _1 to GPU, if not valid, then resends go to CPU only. I think there is more CPU work being done, but I could be wrong on that.

I would love to see the percentage of CPU/GPU work being done.
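
A rough sketch of the routing idea above, in Python. Everything in it is hypothetical: `pick_device_class`, `replica_index` and `is_resend` are names made up for illustration and are not BOINC scheduler code.

```python
def pick_device_class(replica_index: int, is_resend: bool) -> str:
    """Return the device class ("cpu" or "gpu") for one copy of a task.

    replica_index is the number after the underscore in the task name
    (_0, _1, ...). is_resend is True when the copy is issued because
    the earlier results did not validate.
    """
    if is_resend:
        return "cpu"   # resends go to CPU only
    if replica_index == 0:
        return "cpu"   # _0 goes to a CPU
    return "gpu"       # _1 goes to a GPU


if __name__ == "__main__":
    for index, resend in [(0, False), (1, False), (2, True)]:
        print(f"_{index} resend={resend} -> {pick_device_class(index, resend)}")
```

Whether the scheduler can actually steer copies this way is a separate question; this only spells out the rule.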
ID: 1686405 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1686411 - Posted: 31 May 2015, 23:11:37 UTC
Last modified: 31 May 2015, 23:12:28 UTC

Damn, dudded by another pair of bad ATi/AMD hosts. :-(

http://setiathome.berkeley.edu/workunit.php?wuid=1804279411

http://setiathome.berkeley.edu/results.php?hostid=7492638

http://setiathome.berkeley.edu/results.php?hostid=7024228

Now we've seen certain classes of Nvidia GPUs being blocked from getting work when using bad/unsuitable drivers, so why can't the same be done for AMD/ATi cards? (And there seem to be a lot more of them around lately.)

Cheers.
ID: 1686411 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1686430 - Posted: 1 Jun 2015, 1:43:40 UTC - in response to Message 1686412.  

That's the worst thing when it comes to these runaway computers. The science is getting polluted, more and more each day.



So true
ID: 1686430 · Report as offensive
Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 11361
Credit: 29,581,041
RAC: 66
United States
Message 1686433 - Posted: 1 Jun 2015, 2:10:57 UTC - in response to Message 1686430.  

That's the worst thing when it comes to these runaway computers. The science is getting polluted, more and more each day.



So true
And when they are looked at more closely (think ntpckr, for example) they will get tossed out. IMO not a big deal in the cosmic picture of things, just a small waste of resources.
ID: 1686433 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1686438 - Posted: 1 Jun 2015, 2:37:24 UTC - in response to Message 1686411.  

Damn, dudded by another pair of bad ATi/AMD hosts. :-(

http://setiathome.berkeley.edu/workunit.php?wuid=1804279411

http://setiathome.berkeley.edu/results.php?hostid=7492638

http://setiathome.berkeley.edu/results.php?hostid=7024228

Now we've seen certain classes of Nvidia GPUs being blocked from getting work when using bad/unsuitable drivers, so why can't the same be done for AMD/ATi cards? (And there seem to be a lot more of them around lately.)

Cheers.

I think it was Eric who said on Beta that AMD/ATI doesn't make it easy to detect the driver version. What is (or was) used currently is the driver's CAL version, which is, or at least I think was, being used to block drivers that are too old.
However, with the current generation of cards dropping CAL support, BOINC doesn't report any driver version back to the server. So it is up to the BOINC dev team to implement a new driver detection scheme. The information that is provided by clinfo in "Driver version:" is really what we need to know.

Another issue is that the OpenCL component is separate from the driver, and it may not get updated when the driver does if something like DDU is not used. With a mismatched driver and runtime, bad things can happen. The CAL version, which BOINC reports and which the server can be set to allow or block, will show the most recent, but with an older runtime tasks often just spit out garbage.
I don't know if Nvidia has the same issue with their OpenCL support, but I haven't seen that kind of issue with CUDA support.
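
For what it's worth, pulling that field out of clinfo text is simple. Below is a minimal Python sketch, assuming the output has lines that begin with the label "Driver version:" as mentioned above. The exact layout of clinfo output varies between tools and is an assumption here, and `driver_versions` is a made-up helper name, not part of BOINC or clinfo.

```python
import re

# Matches lines such as "Driver version: <something>". Only the label
# comes from the post; the rest of the line format is an assumption.
_DRIVER_LINE = re.compile(
    r"^[ \t]*Driver version:[ \t]*(.+?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def driver_versions(clinfo_text: str) -> list[str]:
    """Return every value found after a "Driver version:" label."""
    return _DRIVER_LINE.findall(clinfo_text)


if __name__ == "__main__":
    # Placeholder text for illustration only, not real clinfo output.
    sample = "Name: EXAMPLE-DEVICE\n  Driver version: EXAMPLE-VERSION\n"
    print(driver_versions(sample))
```

On a machine where clinfo is installed, the text could come from `subprocess.run(["clinfo"], capture_output=True, text=True).stdout`. Getting the client to report that value to the server is the part that still needs the BOINC devs.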
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1686438 · Report as offensive
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1686477 - Posted: 1 Jun 2015, 6:11:47 UTC - in response to Message 1686369.  

Cosmic_Ocean wrote:
(...)
Long story short.. I don't particularly think this is specifically an issue with not letting two overflow results go without a CPU's opinion (because all that will end up happening is the CPU's result won't match the majority, and the CPU's result will be discarded as invalid--which is what already happens quite frequently anyway).
(...)

I would suggest (if it's technically possible on the server side) that
every WU with a '-9 overflow' result from a GPU should additionally be sent to a CPU (a CPU-only host, or the CPU of a host) to check whether it's really a '-9 overflow'.

If 2 GPUs each send a '-9 overflow' result, two CPUs check whether they are correct.
So for this WU: 2 GPUs with '-9 overflow' and 2 CPUs with good results.

This way the science is rescued.
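
As a tiny illustration of the rule suggested here (one extra CPU check per GPU overflow result), in Python. The function name and the `(device, overflowed)` pairs are hypothetical and do not reflect how the BOINC server stores results.

```python
def cpu_checks_needed(results):
    """Count extra CPU tasks to request for one workunit.

    results is a list of (device, overflowed) pairs, where device is
    "gpu" or "cpu" and overflowed says whether that result reported a
    '-9 overflow'. One CPU check is requested per GPU overflow result.
    """
    return sum(
        1 for device, overflowed in results
        if device == "gpu" and overflowed
    )


if __name__ == "__main__":
    # Two GPUs both reporting an overflow: two CPU checks requested.
    print(cpu_checks_needed([("gpu", True), ("gpu", True)]))
```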
ID: 1686477 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1686670 - Posted: 1 Jun 2015, 16:39:32 UTC - in response to Message 1686477.  

Cosmic_Ocean wrote:
(...)
Long story short.. I don't particularly think this is specifically an issue with not letting two overflow results go without a CPU's opinion (because all that will end up happening is the CPU's result won't match the majority, and the CPU's result will be discarded as invalid--which is what already happens quite frequently anyway).
(...)

I would suggest (if it's technically possible on the server side) that
every WU with a '-9 overflow' result from a GPU should additionally be sent to a CPU (a CPU-only host, or the CPU of a host) to check whether it's really a '-9 overflow'.

If 2 GPUs each send a '-9 overflow' result, two CPUs check whether they are correct.
So for this WU: 2 GPUs with '-9 overflow' and 2 CPUs with good results.

This way the science is rescued.

I think it was said before that messages like the "-9 overflow" are purely informational and only appear in the stderr_txt. So the validators never see them, as they only look at the result file.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1686670 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22202
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1686690 - Posted: 1 Jun 2015, 17:50:33 UTC

As a heavy GPU user I would suggest that the majority of issues with -9 overflows are from mismanaged or misconfigured systems. This problem is NOT the sole domain of GPUs; I recently saw some coming from a seriously overclocked CPU....
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1686690 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1686716 - Posted: 1 Jun 2015, 19:47:18 UTC - in response to Message 1686670.  

Cosmic_Ocean wrote:
(...)
Long story short.. I don't particularly think this is specifically an issue with not letting two overflow results go without a CPU's opinion (because all that will end up happening is the CPU's result won't match the majority, and the CPU's result will be discarded as invalid--which is what already happens quite frequently anyway).
(...)

I would suggest (if it's technically possible on the server side) that
every WU with a '-9 overflow' result from a GPU should additionally be sent to a CPU (a CPU-only host, or the CPU of a host) to check whether it's really a '-9 overflow'.

If 2 GPUs each send a '-9 overflow' result, two CPUs check whether they are correct.
So for this WU: 2 GPUs with '-9 overflow' and 2 CPUs with good results.

This way the science is rescued.

I think it was said before that messages like the "-9 overflow" are purely informational and only appear in the stderr_txt. So the validators never see them, as they only look at the result file.

The validator logic does check whether the stderr section for the canonical result contains "result_overflow". If so, a flag bit is set in what gets assimilated, and the runtime_outlier flag is set so the run time won't affect the averages used to estimate host speed.
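
In outline, that check amounts to something like the Python sketch below. This is not the actual validator source: `CanonicalResult`, `mark_overflow` and the field names are hypothetical stand-ins, and only the "result_overflow" string and the two flags come from the description above.

```python
from dataclasses import dataclass

OVERFLOW_MARKER = "result_overflow"  # string looked for in the stderr section


@dataclass
class CanonicalResult:               # hypothetical stand-in, not a BOINC type
    stderr_text: str
    overflow_flag: bool = False      # the flag bit that gets assimilated
    runtime_outlier: bool = False    # excludes run time from speed averages


def mark_overflow(result: CanonicalResult) -> CanonicalResult:
    """Set both flags when the stderr section mentions an overflow."""
    if OVERFLOW_MARKER in result.stderr_text:
        result.overflow_flag = True
        result.runtime_outlier = True
    return result


if __name__ == "__main__":
    print(mark_overflow(CanonicalResult("... result_overflow ...")))
```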

What's technically possible probably includes all of the ideas which have been proposed, but what's practically possible without huge changes to the BOINC server code is much less, and even if such changes were made they might be too much burden on the BOINC database to be practical here.

Getting newer MB7 OpenCL_ATi builds released here should reduce the problem to a very rare occurrence. The errors those builds will produce if the host has too old drivers will drive the host's app_version quota down to 1 very quickly, and it will remain there until newer drivers are installed. That's not as nice as somehow making the app_plan not send tasks to those hosts, but it will get the job done.
                                                                  Joe
ID: 1686716 · Report as offensive
Rasputin42
Volunteer tester

Joined: 25 Jul 08
Posts: 412
Credit: 5,834,661
RAC: 0
United States
Message 1686717 - Posted: 1 Jun 2015, 19:54:02 UTC

Whatever is implemented, the sooner the better, even if it is not perfect.
ID: 1686717 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1686856 - Posted: 2 Jun 2015, 4:10:12 UTC - in response to Message 1686716.  

(...)
What's technically possible probably includes all of the ideas which have been proposed, but what's practically possible without huge changes to the BOINC server code is much less, and even if such changes were made they might be too much burden on the BOINC database to be practical here.
(...)

Although it would be really nice if the apps could detect that something went awry and give the task a non-zero error status, it sounds like that would be more involved than to just change a few lines of code for the daily quotas. Having the daily quota bottom-out at 33 leaves a minimum of 33 tasks/day for a bad host to trash. If it were able to go all the way down to 1, that would definitely make a noticeable difference.

Okay, so that's not fixing the problem, but it is basically just damage control, and the quotas are already in-place, they just need some very minor adjustments (being able to go down to 1, /2 for every bad task if quota is below 200, /5 if above 200, and +2 for every good task).
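
To make the proposed adjustment concrete, here is a minimal Python sketch of the rule in the parentheses above. It is not the BOINC quota code. The assumptions are mine: integer division, a quota of exactly 200 is halved rather than divided by 5, and the optional `ceiling` is a hypothetical parameter since no upper limit is stated.

```python
def adjust_quota(quota, task_ok, floor=1, ceiling=None):
    """Return the new daily quota after one reported task."""
    if task_ok:
        new_quota = quota + 2        # +2 for every good task
    elif quota > 200:
        new_quota = quota // 5       # /5 for a bad task above 200
    else:
        new_quota = quota // 2       # /2 for a bad task at or below 200
    if ceiling is not None:
        new_quota = min(new_quota, ceiling)
    return max(new_quota, floor)     # never drop below the floor


def bad_tasks_until_floor(quota, floor=1):
    """Count consecutive bad tasks needed to reach the floor."""
    count = 0
    while quota > floor:
        quota = adjust_quota(quota, task_ok=False, floor=floor)
        count += 1
    return count


if __name__ == "__main__":
    print(bad_tasks_until_floor(33))          # floor of 1
    print(adjust_quota(33, False, floor=33))  # floor of 33: quota cannot drop
```

Under these assumptions a host starting at 33 would go 33, 16, 8, 4, 2, 1, so five bad tasks in a row would bring it down to a quota of 1, whereas with a floor of 33 it never drops at all.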

However.. I'm not even sure the quotas are enforced or applied. I remember not terribly long ago, shortly after APv7 was released, my quota was in the mid-40s, and I got almost 70 tasks in the course of about an hour within the same day UTC. I should have been limited to what my daily quota was, but I wasn't. That suggests to me that the quotas are calculated and displayed on the application details page, but they are not enforced.

So.. A) let's start enforcing them, and B) let them go back down to 1 instead of 33. These runaway machines will quickly become basically a non-issue.

If you want to add an extra layer of complexity to the code, add something automated so that when a host gets down to a quota of 1, the system sends the owner an email to let them know that something is amiss, that they should look into it, and that they can report here to Number Crunching if they have any questions.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1686856 · Report as offensive
Rasputin42
Volunteer tester

Joined: 25 Jul 08
Posts: 412
Credit: 5,834,661
RAC: 0
United States
Message 1687004 - Posted: 2 Jun 2015, 14:09:22 UTC

This all seems to be a fairly recent phenomenon.
So, what changed and when?
ID: 1687004 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1687117 - Posted: 2 Jun 2015, 23:41:08 UTC - in response to Message 1687004.  

This all seems to be a fairly recent phenomenon.
So, what changed and when?

Wider variety of GPUs in the past ~two years or so, which means more points of failure. Plus the fact that the minimum value for the daily quota per app per machine got raised to 33 (to help people with GPUs be able to get a reasonable number of tasks more quickly), but the catch-22 of that is that it allows runaway machines to continue running away. Plus the fact that I'm pretty sure the quota isn't even applied/enforced anyway, so it becomes moot.

These -9 overflow tasks did happen before, but they were rare, because there were fewer GPUs crunching a few years ago, so it was much less likely that two GPUs would be paired together on one WU. Since GPUs are becoming relatively mainstream, it is becoming much more frequent, increased further by the aforementioned daily quota mechanism being essentially useless.

Although, it would not surprise me at all if instead of the quota being per app per machine, it was per device using that app per machine (so if a CPU was limited to say.. 20/day, but it has 16 cores, then it would be able to get 320 tasks/day). It needs to just be "per app, per machine" regardless of number of devices/instances using that app.
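
The arithmetic behind that guess, as a short Python sketch. Whether the server really multiplies the quota by the number of instances is speculation, and `effective_daily_limit` is a hypothetical name.

```python
def effective_daily_limit(quota: int, instances: int, per_device: bool) -> int:
    """Tasks per day a host could fetch for one app.

    per_device=True models a quota applied to each device/instance
    separately; per_device=False models a flat per-app, per-machine quota.
    """
    return quota * instances if per_device else quota


if __name__ == "__main__":
    print(effective_daily_limit(20, 16, per_device=True))   # 20/day x 16 cores
    print(effective_daily_limit(20, 16, per_device=False))  # flat per machine
```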

That's about all I've got to say on the subject. I'll quit ranting about it now.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1687117 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1687253 - Posted: 3 Jun 2015, 8:55:42 UTC - in response to Message 1686411.  


Now we've seen certain classes of Nvidia GPUs being blocked from getting work when using bad/unsuitable drivers, so why can't the same be done for AMD/ATi cards? (And there seem to be a lot more of them around lately.)

Cheers.

It could be done once the selection criteria are formulated.
ID: 1687253 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1688725 - Posted: 6 Jun 2015, 23:14:04 UTC

ID: 1688725 · Report as offensive
Darth Beaver Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Joined: 20 Aug 99
Posts: 6728
Credit: 21,443,075
RAC: 3
Australia
Message 1688737 - Posted: 6 Jun 2015, 23:44:37 UTC

you can add this user

http://setiathome.berkeley.edu/results.php?hostid=6754859

and this one

http://setiathome.berkeley.edu/results.php?hostid=7337318

this one

http://setiathome.berkeley.edu/show_host_detail.php?hostid=5469669 (this guy is just trashing thousands and does need to be suspended)

http://setiathome.berkeley.edu/results.php?hostid=4026248

Most of these wingmen are trashing several gigs of data between them. I have sent them PMs but don't expect to hear from them; I just hope they stop.
ID: 1688737 · Report as offensive
Darth Beaver Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Joined: 20 Aug 99
Posts: 6728
Credit: 21,443,075
RAC: 3
Australia
Message 1688816 - Posted: 7 Jun 2015, 5:35:28 UTC

Another one trashing thousands: 1 gig of data trashed, and it popped up in my inconclusives today.

http://setiathome.berkeley.edu/results.php?hostid=7190564
ID: 1688816 · Report as offensive
Kathy
Joined: 5 Jan 03
Posts: 338
Credit: 27,877,436
RAC: 0
United States
Message 1689045 - Posted: 8 Jun 2015, 0:56:11 UTC - in response to Message 1688737.  

you can add this user

http://setiathome.berkeley.edu/results.php?hostid=6754859

and this one

http://setiathome.berkeley.edu/results.php?hostid=7337318

this one

http://setiathome.berkeley.edu/show_host_detail.php?hostid=5469669 (this guy is just trashing thousands and does need to be suspended)

http://setiathome.berkeley.edu/results.php?hostid=4026248

Most of these wingmen are trashing several gigs of data between them. I have sent them PMs but don't expect to hear from them; I just hope they stop.


I had host 7337318 as a wingman two days ago and took a screenshot. They had 4116 WUs, 99.9% of which were overflows. 12 were valid, and of the valid WUs, one was an overflow!
ID: 1689045 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22202
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1689133 - Posted: 8 Jun 2015, 10:07:43 UTC

...Makes the efforts of my errant cruncher look very tame in comparison. It wrecks one or two per day, and I'm looking forward to getting home tonight and sorting it out one way or the other. (It's got a sad GTX760, so I can pull the card and see if the dust bunnies are rampant, re-seat it and check the drivers. If the worst comes to the worst I'll just get another GTX980...)
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1689133 · Report as offensive
Profile Graham Middleton

Joined: 1 Sep 00
Posts: 1520
Credit: 86,815,638
RAC: 0
United Kingdom
Message 1689157 - Posted: 8 Jun 2015, 12:22:56 UTC - in response to Message 1689045.  
Last modified: 8 Jun 2015, 12:23:24 UTC

you can add this user

http://setiathome.berkeley.edu/results.php?hostid=6754859

and this one

http://setiathome.berkeley.edu/results.php?hostid=7337318

this one

http://setiathome.berkeley.edu/show_host_detail.php?hostid=5469669 (this guy is just trashing thousands and does need to be suspended)

http://setiathome.berkeley.edu/results.php?hostid=4026248

Most of these wingmen are trashing several gigs of data between them. I have sent them PMs but don't expect to hear from them; I just hope they stop.


I had host 7337318 as a wingman two days ago and took a screenshot. They had 4116 WUs, 99.9% of which were overflows. 12 were valid, and of the valid WUs, one was an overflow!



It looks like some of these errant guys might even be going out of their way to sabotage some science, building and configuring rigs that intentionally trash WUs as fast as possible!
Happy Crunching,

Graham

ID: 1689157 · Report as offensive
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.