Major problem with a user

Profile Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1689230 - Posted: 8 Jun 2015, 17:54:49 UTC

Since many of the owners of these rigs hide their computers, and/or do not respond to PMs, perhaps a Setizen who lives within a reasonable drive of the Berkeley campus could volunteer a few hours a week to look at these aberrant machines, and, with permission from the Admins, send the owners an email, saying something like:

"Greetings. Thank you for participating in the Seti@Home project. We have noticed that your computer (#1234567) is generating an inordinate number of invalid or error results. Please check your machine. That machine's work quota will be reduced to 10 tasks per day. If the situation is not corrected within 1 week, that machine will be suspended from the project until you notify us that it is fixed. When we restore your computer, its initial allotment will be 10 tasks per day, ramping up to full quota as your machine returns valid results."

Now, how difficult would it be to modify the Scheduler code to enable and enforce such a quota reduction/suspension?
Donald
Infernal Optimist / Submariner, retired
ID: 1689230 · Report as offensive
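
As a rough illustration of the policy described in the proposed email, here is a minimal sketch. It is not BOINC scheduler code: `HostState`, `FULL_QUOTA`, the function names, and the doubling ramp-up are all assumptions made for the example. Only the 10-tasks-per-day reduction, the one-week grace period, and the restart at 10 tasks come from the post above.

```python
from dataclasses import dataclass

REDUCED_QUOTA = 10   # tasks per day, from the proposed email
GRACE_DAYS = 7       # "within 1 week", from the proposed email
FULL_QUOTA = 100     # hypothetical full daily quota


@dataclass
class HostState:
    quota: int = FULL_QUOTA
    suspended: bool = False
    bad_days: int = 0


def flag_host(host: HostState) -> None:
    """The host was found returning too many invalid or error results."""
    host.quota = REDUCED_QUOTA
    host.bad_days = 0


def daily_check(host: HostState, still_bad: bool) -> None:
    """Run once per day for a flagged host; suspend after the grace period."""
    if host.suspended:
        return
    if not still_bad:
        host.bad_days = 0
        return
    host.bad_days += 1
    if host.bad_days >= GRACE_DAYS:
        host.suspended = True
        host.quota = 0


def restore_host(host: HostState) -> None:
    """The owner reported a fix: restart at the reduced quota."""
    host.suspended = False
    host.bad_days = 0
    host.quota = REDUCED_QUOTA


def on_valid_result(host: HostState) -> None:
    """Ramp back up; doubling per valid result is an assumption."""
    if not host.suspended:
        host.quota = min(FULL_QUOTA, host.quota * 2)
```

With these assumed numbers, a restored host would go from 10 to 20, 40, 80 and then back to the full 100 tasks per day after four valid results.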
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22220
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1689240 - Posted: 8 Jun 2015, 18:32:37 UTC

Nice idea Donald.
Given that the scheduler already makes reductions for high error counts it should be possible to add the reduction for high invalid counts in much the same way.
Indeed as I type I've just thought of a variation - why not send out similar mails to those whose PCs are producing loads of errors as well as those that are producing loads of invalids?

Of course we are assuming that the scheduler is "well written" and so amenable to such changes.......
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1689240 · Report as offensive
Cavalary

Joined: 15 Jul 99
Posts: 104
Credit: 7,507,548
RAC: 38
Romania
Message 1689344 - Posted: 9 Jun 2015, 2:17:38 UTC

Heh, and the first time I was actually hit by this issue, 1809665189, 2 ATIs, different types but same driver version, validating autocorr overflows against each other and invalidating mine.
ID: 1689344 · Report as offensive
Cavalary

Joined: 15 Jul 99
Posts: 104
Credit: 7,507,548
RAC: 38
Romania
Message 1689546 - Posted: 9 Jun 2015, 23:26:13 UTC

Another look at a (so far) inconclusive, and wow, 5025697: Validation pending (1651) · Validation inconclusive (1002) · Valid (4) · Invalid (544)
ID: 1689546 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1689555 - Posted: 9 Jun 2015, 23:47:57 UTC - in response to Message 1689549.  
Last modified: 9 Jun 2015, 23:48:23 UTC

(...)The traces of ET, may very well hide in one of those millions of trashed results....

That is true. I remember it was mentioned earlier-on that -9 overflow WUs get a special flag in the database so that some day, when nitpicker goes through and starts analyzing, it doesn't get excited about something that it sees from one of those -9 results.

That sounds like protection from false positives, but the underlying problem here is that in some cases, when there is a reliable, stable CPU wingmate, that result is discarded entirely because it doesn't match the two problematic GPUs. So a perfectly good result could have been there in that WU, but we will never know, because it has been discarded, and as far as the DB is concerned, that particular portion of the tape was too noisy to have anything useful in it.

So.. unless there is a future plan to continue holding onto all of these tapes and to re-re-process the overflow results again (but only on CPUs, or stable machines this time around), potentially good signals are effectively lost forever.

And it's not like this is a recent problem, either. I remember 5+ years ago, there were occasional gripes in the "panic mode" threads about -9 overflows due to bad drivers/overheating/overclocked GPUs, but they were pretty rare. Now they happen many times per day.

I don't think outright banning machines is the way to solve it, and I've already mentioned what I feel to be a simple "fix" (damage control) to bring this back to being a pretty rare problem.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1689555 · Report as offensive
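
The failure mode described in this post (a good CPU result discarded because two faulty GPU hosts agree with each other) can be shown with a toy majority check. This is only a sketch under simplifying assumptions: real validators compare signal lists with tolerances, and the `validate` function, the host names, and the string "signatures" here are hypothetical.

```python
from collections import Counter


def validate(results, quorum=2):
    """Toy quorum check.

    results maps a host name to a result signature.
    Returns (canonical_signature, valid_hosts, invalid_hosts),
    or (None, [], []) when no signature reaches the quorum.
    """
    signature, count = Counter(results.values()).most_common(1)[0]
    if count < quorum:
        return None, [], []  # inconclusive: another wingmate is needed
    valid = sorted(h for h, r in results.items() if r == signature)
    invalid = sorted(h for h, r in results.items() if r != signature)
    return signature, valid, invalid


# Two hosts return the same false overflow; the stable CPU host disagrees.
outcome = validate({"gpu_a": "overflow", "gpu_b": "overflow", "cpu": "clean"})
print(outcome)  # ('overflow', ['gpu_a', 'gpu_b'], ['cpu'])
```

In this toy model the false overflow becomes the canonical result and the clean CPU result is marked invalid, which is the data loss the post is worried about.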
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22220
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1689641 - Posted: 10 Jun 2015, 5:06:06 UTC

Stop whining about "problematic GPUs", yes GPUs do return invalids, but I've seen a fair number of CPUs that return spurious -9s and other invalid results. The REAL issue is users who do not respond to polite messages telling them there is a problem with their PC and just continue trashing data.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1689641 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1689655 - Posted: 10 Jun 2015, 5:36:04 UTC - in response to Message 1689555.  


I don't think outright banning machines is the way to solve it, and I've already mentioned what I feel to be a simple "fix" (damage control) to bring this back to being a pretty rare problem.


It has been in testing on beta for more than a month already. If nothing is wrong with the just-updated binaries, main will be updated quite soon.
ID: 1689655 · Report as offensive
Profile cliff
Joined: 16 Dec 07
Posts: 625
Credit: 3,590,440
RAC: 0
United Kingdom
Message 1689676 - Posted: 10 Jun 2015, 6:40:54 UTC - in response to Message 1689133.  
Last modified: 10 Jun 2015, 6:44:13 UTC

Hi Rob,
...Makes the efforts of my errant cruncher look very tame in comparison - wrecks one or two per day - I'm looking forward to getting home tonight and sorting it out one way or the other. (It's got a sad GTX760, so I can pull the card and see if the dust bunnies are rampant, re-seat it and check the drivers, worst come to the worst I'll just get another GTX980...)


Humm, I've got a GTX980 due to arrive today sometime, at which point I will have [touch wood] a redundant GTX970..

I'll let you know if the new GTX980 installs ok on my main rig, and if it does the 970 is yours if you want it.. it's been running ok here for some time, so it should be ok:-) Anyways it's a freebie:-) [I hate redundant kit going to scrap or just gathering dust]

Cheers,
Cliff,
Been there, Done that, Still no damm T shirt!
ID: 1689676 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1689859 - Posted: 10 Jun 2015, 18:08:08 UTC - in response to Message 1689735.  

Stop whining about "problematic GPUs", yes GPUs do return invalids, but I've seen a fair number of CPUs that return spurious -9s and other invalid results. The REAL issue is users who do not respond to polite messages telling them there is a problem with their PC and just continue trashing data.

Fair point.

Agreed. And of the hosts noted earlier in this thread, the 11 which are running NVIDIA GPUs don't fit my definition of a "Major" problem. The false overflows produced by those 11 generally do not match even when two such hosts are paired. So for those it's just a matter of good hosts having to wait a little longer before getting credit, and the false overflows make that delay minimal. Trashing tasks in a way that the validation logic is prepared to handle seems a minor issue to me. However, I am also disappointed that some users fail to recognize that they're not really helping the project.
                                                                   Joe
ID: 1689859 · Report as offensive
Rasputin42
Volunteer tester

Joined: 25 Jul 08
Posts: 412
Credit: 5,834,661
RAC: 0
United States
Message 1689890 - Posted: 10 Jun 2015, 19:22:08 UTC

How bad does a bad host need to be, before anything is done?


This guy has 29 VALID -9 results and over 600 invalid tasks.
So about 5% of trashed results are going through as good.


http://setiathome.berkeley.edu/show_host_detail.php?hostid=6780580
ID: 1689890 · Report as offensive
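
The "about 5%" figure can be checked with a quick calculation. This sketch treats "over 600" as exactly 600, so the result is an upper bound on the share, and it counts all 29 validated overflows as bad, which a later reply in this thread disputes. The function name is made up for the example.

```python
def validated_share(valid_count: int, invalid_count: int) -> float:
    """Fraction of a host's judged results that were accepted as valid."""
    return valid_count / (valid_count + invalid_count)


share = validated_share(29, 600)
print(f"{share:.1%}")  # prints 4.6%
```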
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22220
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1689907 - Posted: 10 Jun 2015, 20:46:55 UTC

It is not a case of "How bad" but "How much time do the project team have". Sadly with most of them being on very short hours there is very little of their time available for either a manual intervention or the development of an automated system to effect a "cure" to these users.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1689907 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1689956 - Posted: 11 Jun 2015, 0:18:03 UTC - in response to Message 1689890.  

How bad does a bad host need to be, before anything is done?


This guy has 29 VALID -9 results and over 600 invalid tasks.
So about 5% of trashed results are going through as good.


http://setiathome.berkeley.edu/show_host_detail.php?hostid=6780580

Read my earlier posts in this thread. Something IS being done to solve the problem of OpenCL_Ati hosts running too old drivers, it just has not yet been brought over from Beta. Should be soon.

In addition, of those 29 validated overflows 5 are correct and 1 more was given credit based on a weakly similar outcome. Those are cases where all/most of the signals in the overflow were found before doing any Autocorr processing, and the wingmate was not running OpenCL_Ati on inappropriate drivers.
                                                                   Joe
ID: 1689956 · Report as offensive
Profile Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1690025 - Posted: 11 Jun 2015, 6:13:26 UTC - in response to Message 1689907.  

It is not a case of "How bad" but "How much time do the project team have". Sadly with most of them being on very short hours there is very little of their time available for either a manual intervention or the development of an automated system to effect a "cure" to these users.

That's why I suggested a dedicated Volunteer who could come in a few hours a week and send emails to these guys that don't respond to our PMs. Would still take a little Staff oversight, but volunteers could handle some of these "administrative" tasks.....
Donald
Infernal Optimist / Submariner, retired
ID: 1690025 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1690042 - Posted: 11 Jun 2015, 8:01:25 UTC - in response to Message 1690025.  

One problem with sending out 'form' mails to users is that the mail server then blacklists you for spam, and then you can't communicate with any users on that mail domain.

I was an admin on a gaming site (that hosts my poker league) and it's a huge problem with certain mail servers when trying to contact members, because you're blacklisted.
ID: 1690042 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1690143 - Posted: 11 Jun 2015, 15:32:51 UTC - in response to Message 1690025.  

It is not a case of "How bad" but "How much time do the project team have". Sadly with most of them being on very short hours there is very little of their time available for either a manual intervention or the development of an automated system to effect a "cure" to these users.

That's why I suggested a dedicated Volunteer who could come in a few hours a week and send emails to these guys that don't respond to our PMs. Would still take a little Staff oversight, but volunteers could handle some of these "administrative" tasks.....

I think there are many people that operate on the premise of "set it and forget it". Which really should be able to be done. The BOINC servers should be able to handle machines that have very high error rates in a reasonable manner. Apparently the BOINC devs don't see the need, or want to spend the time on, such functions.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1690143 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1690147 - Posted: 11 Jun 2015, 15:44:01 UTC - in response to Message 1690143.  


I think there are many people that operate on the premise of "set it and forget it". Which really should be able to be done. The BOINC servers should be able to handle machines that have very high error rates in a reasonable manner. Apparently the BOINC devs don't see the need, or want to spend the time on, such functions.

Absolutely agree. Users provide computational resources.
Anything beyond that is only their additional kindness, not their responsibility.
ID: 1690147 · Report as offensive
Profile Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1690218 - Posted: 11 Jun 2015, 19:11:13 UTC - in response to Message 1690042.  

One problem with sending out 'form' mails to users is that the mail server then blacklists you for spam, and then you can't communicate with any users on that mail domain.

I was an admin on a gaming site (that hosts my poker league) and it's a huge problem with certain mail servers when trying to contact members, because you're blacklisted.

We're not talking about sending a bulk email to hundreds of addressees. Sending out single emails to individuals should not trigger any ISP's "Spam" filters.
Donald
Infernal Optimist / Submariner, retired
ID: 1690218 · Report as offensive
Darth Beaver Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Joined: 20 Aug 99
Posts: 6728
Credit: 21,443,075
RAC: 3
Australia
Message 1690631 - Posted: 13 Jun 2015, 0:44:25 UTC - in response to Message 1690218.  

We're not talking about sending a bulk email to hundreds of addressees. Sending out single emails to individuals should not trigger any ISP's "Spam" filters.


Actually, Donald, is that not what these guys are doing: SPAMMING the project with invalids and inconclusives, 1000+ per machine?

Maybe that's exactly what the project needs: a spam filter that looks for these machines and applies an automatic 24 hr ban, 48 hrs the second time, and so on until a 3-month suspension is active; a second such suspension within 1 year would be a permanent suspension.

The user has plenty of warnings by this time.
ID: 1690631 · Report as offensive
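
The escalating ban suggested above could look something like the sketch below. The post only gives the first two steps (24 hrs, 48 hrs) plus "and so on", so the doubling rule, the 90-day approximation of "3 months", and the name `next_ban_hours` are assumptions for illustration, not an existing BOINC feature.

```python
from typing import Optional

FIRST_BAN_HOURS = 24
LONG_SUSPENSION_HOURS = 90 * 24  # "3 months", approximated as 90 days


def next_ban_hours(offence_number: int,
                   long_suspensions_past_year: int) -> Optional[int]:
    """Return the ban length in hours, or None for a permanent suspension.

    offence_number counts from 1. Bans are assumed to double each time
    until they would reach the 3-month suspension.
    """
    if offence_number < 1:
        raise ValueError("offence_number must be at least 1")
    hours = FIRST_BAN_HOURS * 2 ** (offence_number - 1)
    if hours < LONG_SUSPENSION_HOURS:
        return hours
    if long_suspensions_past_year >= 1:
        return None  # second long suspension within a year: permanent
    return LONG_SUSPENSION_HOURS
```

Under these assumptions a host would get seven doubling bans (24 hrs up to 1536 hrs) before the eighth offence triggers the 3-month suspension.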
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1690675 - Posted: 13 Jun 2015, 2:51:38 UTC - in response to Message 1690631.  
Last modified: 13 Jun 2015, 3:11:03 UTC

We're not talking about sending a bulk email to hundreds of addressees. Sending out single emails to individuals should not trigger any ISP's "Spam" filters.


Actually, Donald, is that not what these guys are doing: SPAMMING the project with invalids and inconclusives, 1000+ per machine?

Maybe that's exactly what the project needs: a spam filter that looks for these machines and applies an automatic 24 hr ban, 48 hrs the second time, and so on until a 3-month suspension is active; a second such suspension within 1 year would be a permanent suspension.

The user has plenty of warnings by this time.

...or have the quota system limit them to one task per 24 hours when they trash that many WUs. *shrug* Oh well, that's not my decision.

edit: Although, I will say that addressing this issue is multiple-sided. Yes, the quota system can do some pretty strong damage control to limit the impact of runaway machines, but that's not the only solution. Having the apps detect the fact that it is an erroneous task and actually mark it with a computational error preserves the integrity of the science.

But neither of those gets to the real issue at hand. Either there are people that are deliberately producing invalid work (or they know about it and refuse to do anything to put a stop to it), or.. some of these machines actually used to produce good work, but because of the nature of "set and forget," they ended up getting now-incompatible drivers for the applications that are being used.

The latter group still thinks they are doing good work and a good contribution since they see that their machines are always doing something and have a lot of tasks in their cache, but they apparently don't come to their user page here and see that there are problems, and because of that.. they don't see that they have unread PMs from other users letting them know there is a problem.

Ergo.. for either group, if the quota system dropped them down to 1 task per 24 hours, the former group would be disappointed and move on to somewhere else, and the latter group would realize something is wrong and may feel inclined to come to this site to try to figure out what's going on.

Having an email go out isn't a bad idea, but as I suggested way earlier-on in this thread: if we were to re-adjust some of the values in the quota system and include some kind of logic to set a flag for some volunteer liaison to go through and decide if they need to get sent an email, or automatically send an email with some kind of generic message saying "it appears your machine has developed some kind of problem, please go to Number Crunching or Q&A for further assistance" when their quota reaches 1, then that would probably help the problem, as well.

The point is.. something should be done about runaway machines, and it isn't a simple solution, nor is there only one or two steps that are needed.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1690675 · Report as offensive
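
The idea of letting the quota fall to 1 task per day and sending a notice at that moment can be sketched as follows. The halving and doubling rules, the `full_quota` default, and the function name `apply_result` are assumptions chosen for the example; the post itself only specifies the floor of 1 and the trigger "when their quota reaches 1".

```python
def apply_result(quota: int, is_valid: bool, full_quota: int = 100):
    """Return (new_quota, notify) after one returned task.

    notify is True only on the transition down to a quota of 1, so a
    runaway host would trigger a single email or liaison flag rather
    than one per bad task.
    """
    if is_valid:
        return min(full_quota, quota * 2), False
    new_quota = max(1, quota // 2)
    notify = new_quota == 1 and quota > 1
    return new_quota, notify


# A host that returns nothing but invalid results:
quota, notices = 100, 0
for _ in range(10):
    quota, notify = apply_result(quota, is_valid=False)
    notices += notify
print(quota, notices)  # prints 1 1
```

Sending the notice only on the transition keeps the volume of outgoing mail low, which also speaks to the blacklisting concern raised earlier in the thread.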
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1690715 - Posted: 13 Jun 2015, 6:17:19 UTC - in response to Message 1690709.  

Don't overclock.
ID: 1690715 · Report as offensive
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.