Another Machine (5049618) Leaving a Massive Mess. (It's been fixed, atm anyway).


Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1401590 - Posted: 10 Aug 2013, 21:10:45 UTC

I'm getting very annoyed with this Anonymous rig just lately, computer 5049618.

All (14451) · In progress (142) · Validation pending (6294) · Validation inconclusive (5170) · Valid (285) · Invalid (2531) · Error (29)

So to whoever owns it, please get it fixed.

Cheers.
ID: 1401590
_
Joined: 15 Nov 12
Posts: 299
Credit: 9,037,618
RAC: 0
United States
Message 1401598 - Posted: 10 Aug 2013, 21:26:59 UTC - in response to Message 1401590.  

Every couple of days I get a small handful of "Validation inconclusive" WUs (about 10 in the last month), but I've never worried much about it. On this rig, though, it's obviously a problem. Is this a direct result of something going wrong on your rig?
ID: 1401598
Profile Gatekeeper
Joined: 14 Jul 04
Posts: 887
Credit: 176,479,616
RAC: 0
United States
Message 1401625 - Posted: 10 Aug 2013, 23:31:44 UTC

Looks like a good portion of his(?) results are overflows. With (getting old) 295 cards, it's probably a heat-related issue. Those old 295s did really funky things when their heat sinks started wearing out.
ID: 1401625
Profile Cliff Harding
Volunteer tester
Joined: 18 Aug 99
Posts: 1432
Credit: 110,967,840
RAC: 67
United States
Message 1401655 - Posted: 11 Aug 2013, 2:42:37 UTC

Whoever the owner is, they'd have a right smart machine if they replaced those 4 GTX 295s with just 2 current-model cards and ran with Cuda_50.


I don't buy computers, I build them!!
ID: 1401655
Profile Jeff Buck
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1401664 - Posted: 11 Aug 2013, 3:27:36 UTC - in response to Message 1401590.  

Looks like this guy has 2 cards that are working okay (Device 1 and Device 2), one that rapidly produces all the overflow invalids (Device 3), and one that slowly gets computation errors by exceeding the time limit (Device 4).

This isn't new for this machine, either. I looked in my own database and found 14 WUs dating back to May 24th (and v6) that I've shared with him and that he's had marked invalid. I've still got one inconclusive waiting since June 27th and another from just a couple days ago. (That's against only 6 WUs that have validated, and one computation error.)

From his Application details:

SETI@home v7 7.00 windows_intelx86 (cuda32)
Number of tasks completed 4315
Max tasks per day 33
Number of tasks today 552

Consecutive valid tasks 0

Just another example of how meaningless that "Max tasks per day" quota is.
ID: 1401664
Profile Gatekeeper
Joined: 14 Jul 04
Posts: 887
Credit: 176,479,616
RAC: 0
United States
Message 1401677 - Posted: 11 Aug 2013, 4:13:44 UTC - in response to Message 1401664.  

Looks like this guy has 2 cards that are working okay (Device 1 and Device 2), one that rapidly produces all the overflow invalids (Device 3), and one that slowly gets computation errors by exceeding the time limit (Device 4).



Remember that the 295s were the first of the "dual processor" cards, so 2 physical cards = 4 GPUs. I had a pair of them a couple of years ago, and when 1 GPU went south, the other in the same case followed quickly. Some of the EVGA cards did well and lasted a long time; in fact, I think there are a couple of rigs in the top rigs list that are still using them. But beyond that, they had a relatively short life expectancy.
ID: 1401677
Profile Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1401741 - Posted: 11 Aug 2013, 9:25:00 UTC - in response to Message 1401664.  

From his Application details:

SETI@home v7 7.00 windows_intelx86 (cuda32)
Number of tasks completed 4315
Max tasks per day 33
Number of tasks today 552

Consecutive valid tasks 0

Just another example of how meaningless that "Max tasks per day" quota is.

Well, here it works as it should: the quota is per core, and for GPUs it's multiplied by 8, so for his 4 GPUs it is 4 × 33 × 8 = 1056.

The real issue is that invalids are not counted as errors, and of course the fact that each valid result doubles the quota while it's below the starting value, so his one working GPU can easily keep the limit at 33.
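
To make that concrete, here is a minimal sketch of the quota behaviour Link describes. This is illustrative only, not the project's actual scheduler code; the constants, the exact halving on errors, and the function names are my own placeholders:

BASE_QUOTA = 33        # assumed starting quota per app version
MIN_QUOTA = 1          # assumed floor after repeated errors
GPU_MULTIPLIER = 8     # quota is per core, times 8 per GPU

def update_quota(quota, outcome):
    """Apply one reported result to the quota, per the rules described above."""
    if outcome == "valid":
        return min(BASE_QUOTA, quota * 2)   # valids double it, back up to the start value
    if outcome == "error":
        return max(MIN_QUOTA, quota // 2)   # errors cut it
    return quota                            # invalids don't reduce it at all

def daily_limit(quota, cpu_cores, gpus):
    return quota * (cpu_cores + GPU_MULTIPLIER * gpus)

# The host in question: quota pinned at 33 by its one good GPU, 4 GPUs total,
# so it can still fetch 33 x 8 x 4 = 1056 GPU tasks per day despite the invalids.
print(daily_limit(33, cpu_cores=0, gpus=4))   # 1056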
ID: 1401741
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1409151 - Posted: 29 Aug 2013, 3:45:39 UTC

It would be nice if that quota of 33 were cut to 11 for invalids, but this rig looks to be well under the control of the servers now, as shown by the latest task figures:

All (2979) · In progress (142) · Validation pending (1246) · Validation inconclusive (945) · Valid (294) · Invalid (268) · Error (84)

Cheers.
ID: 1409151
Profile Jeff Buck
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1409168 - Posted: 29 Aug 2013, 4:30:04 UTC - in response to Message 1409151.  

On the other hand, here's one that's still running wild:

http://setiathome.berkeley.edu/results.php?hostid=6972846

Everything on his GPU is going into overflow, even a bunch of AP tasks today. I've been paired with this machine for 40 WUs, going back to May 26th, and every one has ended with his result getting marked as Invalid.

Sad thing is, he's not Anonymous and has a couple of other machines that are working fine, so I sent him a PM a couple of days ago, but it doesn't seem to have had any effect.
ID: 1409168
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1409215 - Posted: 29 Aug 2013, 7:21:49 UTC - in response to Message 1409168.  

On the other hand, here's one that's still running wild:

http://setiathome.berkeley.edu/results.php?hostid=6972846

Everything on his GPU is going into overflow, even a bunch of AP tasks today. I've been paired with this machine for 40 WUs, going back to May 26th, and every one has ended with his result getting marked as Invalid.

Sad thing is, he's not Anonymous and has a couple of other machines that are working fine, so I sent him a PM a couple of days ago, but it doesn't seem to have had any effect.

Even sadder is the fact that he's another one who thinks that there's nothing wrong with his rig, but his total number of tasks has also dropped a lot since the introduction of SETI v7.

Cheers.
ID: 1409215
Profile Tazz
Volunteer tester
Joined: 5 Oct 99
Posts: 137
Credit: 34,342,390
RAC: 0
Canada
Message 1411224 - Posted: 4 Sep 2013, 2:34:00 UTC

Not entirely OT, but two of my crunchers have been paired with quite a few of these machines leaving a mess.

This one has 18 Validation inconclusive; I think 5 or 6 of them show the same results. The rest of my wingmen have 1,000s of invalids and/or errors.

This one has 22. I checked some of them and there were a few WUs showing the same results, but most were paired with hosts giving invalids or errors.

We need to come up with a name for these machines giving junk results.
</Tazz>
ID: 1411224
Profile Jeff Buck
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1411240 - Posted: 4 Sep 2013, 4:03:52 UTC - in response to Message 1411224.  

We need to come up with a name for these machines giving junk results.

I'd be less interested in a name for them than in having some way for the project to notify these users when their machines start going haywire, even the Anonymous ones. Some of the ones I've been paired with have been blowing chunks for months.

I looked at the first one on your second list (http://setiathome.berkeley.edu/results.php?hostid=6253478) and it looks like all his MB GPU tasks are getting -9 overflows, but his AP tasks are mostly giving good results. According to my own database, I've been paired with him 12 times in the last 3+ months, with him getting 11 Invalids and only 1 Valid (which just happened to be a legitimate -9 overflow). Since he's Anonymous, there's no way for you, me, or any of his wingmen to give him a heads-up.

It seems to me that the project should have some way for us to report these machines to a central location (other than these scattered grumblings in the forums), which could then send the user an email notice directly (even to the Anonymous ones), so they could do something about the problem. I'm sure these situations are not intentional, and the potential benefit the project could be receiving from having the problems fixed could really add up. (The aforementioned machine has a GTX 560 Ti, which could be making a decent contribution with a little behavior modification.)

I could probably come up with a list of a couple dozen of these runaway machines in just an hour or so, and I'd be happy to take the time to do that, if only I felt that my time would result in these machines becoming valuable contributors to the project again, and both they and S@H could stop this senseless, ongoing waste of resources.
ID: 1411240
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1412140 - Posted: 6 Sep 2013, 13:55:14 UTC

I just randomly decided to look at some of my inconclusives and discovered http://setiathome.berkeley.edu/show_host_detail.php?hostid=6013647 had returned a task in 5 minutes. I thought that seemed rather quick, so I looked a little deeper and found that the host has huge numbers of inconclusives, invalids, and errors. It also has a very high pending count; I didn't look at any of them in detail, but I bet they're mostly fast turnarounds that will ultimately end up as invalid.

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1412140
Profile Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1412176 - Posted: 6 Sep 2013, 15:27:18 UTC

Every user has to provide a valid email address to join the project. Besides Project Management, I believe Team Leaders/Founders have access to the addys of their teammates. So there IS a way.

The next questions are: how large a problem are these runaway crunchers, and is it worth the extra time and effort for management to contact these folks and then monitor for changes in the crunchers' behavior? Or is it such a minor problem that we just have to live with the frustration of waiting 6-12 weeks for our shared tasks to validate, and the database bloat these runaways create?
Donald
Infernal Optimist / Submariner, retired
ID: 1412176
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1412191 - Posted: 6 Sep 2013, 15:54:20 UTC - in response to Message 1412176.  
Last modified: 6 Sep 2013, 15:58:03 UTC

Every user has to provide a valid email address to join the project. Besides Project Management, I believe Team Leaders/Founders have access to the addys of their teammates. So there IS a way.

The next questions are: how large a problem are these runaway crunchers, and is it worth the extra time and effort for management to contact these folks and then monitor for changes in the crunchers' behavior? Or is it such a minor problem that we just have to live with the frustration of waiting 6-12 weeks for our shared tasks to validate, and the database bloat these runaways create?


IIRC, a while back it was Joe Segur who pointed out one possible metric as far as overall 'project health' goes.

At the moment:
Workunits waiting for db purging: 616,772 (MB) / 12,647 (AP)
Results waiting for db purging: 1,288,541 (MB) / 43,962 (AP)


so that's ~2.09 tasks sent out per multibeam workunit, and ~3.48 per Astropulse.
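
(Those figures are just the ratio of results to workunits in the purge queues quoted above; a quick back-of-the-envelope check, assuming the two columns are the MB and AP counts:)

mb_results, mb_workunits = 1_288_541, 616_772
ap_results, ap_workunits = 43_962, 12_647
print(round(mb_results / mb_workunits, 2))   # ~2.09 tasks per multibeam WU
print(round(ap_results / ap_workunits, 2))   # ~3.48 tasks per Astropulse WU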

A detailed understanding of why APs come back inconclusive or get abandoned, and how this relates to the applications, etc., might be something for future development to explore.

In the multibeam case, some reasons for result divergence are known, such as the possibility of mismatched overflow results between CPU & GPU. For 'regular' results, some of my own work was directed toward cross-platform matching between different platforms and devices (algorithmic precision maintenance). AP, to my knowledge, hasn't received this kind of attention yet, so that is very likely at least some small portion of the disparity.

I think under v6 the same figures were typically around ~2.5 tasks per WU whenever I looked. On my more or less healthy machines (excluding the flaky Linux host) the inconclusive-to-pending ratios are around ~2%, where before they were more like 8-10%. That 4-5x improvement could be partially due to bad hosts dropping off with the v7 transition, the application refinements mentioned (partially reliability and fault-tolerance related too, compared to stock 6.08/6.09/6.10 at least), and in some (perhaps fewer) cases users taking positive corrective action.

For Cuda multibeam there's certainly more that can be done to handle exceptional cases. How far to push these before looking at the AP situation might be a good question for me to think about, along with just how reliable is reliable enough?

BOINC's redundancy mechanisms are designed to handle these 'bad machines', along with cross-platform variations, to some (great?) extent. For the most part, though, I think that does sometimes put user expectations of reliability in the back seat, which might amount to efficiency concerns overall.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1412191
Profile Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 34255
Credit: 79,922,639
RAC: 80
Germany
Message 1412208 - Posted: 6 Sep 2013, 16:33:24 UTC
Last modified: 6 Sep 2013, 16:37:11 UTC

On APs there are a few other things to consider.
The AP apps for GPU have parameters to optimize computation.
I check a lot of units every week.

Some users set up the params while running just one instance, until it's running fine.
After some time, a few decide to increase the number of instances without readjusting those parameters, especially unroll.
Params that run fine with 1 instance don't necessarily work with 2 or 3 instances.

Even so, a lot run outdated drivers and never read the read-mes.
Running multiple instances needs a lot of fine tuning, depending on blanking and how many cores to free for it.
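
(As a rough illustration of why single-instance settings stop fitting when more instances are added: this is just a rule-of-thumb sketch of the scaling, not official tuning guidance, and the function name is my own placeholder.)

def rescale_for_instances(single_instance_value, instances):
    """Rule of thumb: a per-instance value (e.g. unroll) tuned to keep the GPU
    busy with ONE instance gets multiplied by N when N instances run at once,
    so the per-instance setting usually needs to shrink by about that factor."""
    return max(1, single_instance_value // instances)

print(rescale_for_instances(12, 1))   # 12 - fine as tuned for one instance
print(rescale_for_instances(12, 3))   # 4  - roughly a third with three instances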


With each crime and every kindness we birth our future.
ID: 1412208
Profile Jeff Buck
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1412218 - Posted: 6 Sep 2013, 17:14:10 UTC - in response to Message 1412140.  

I just randomly decided to look at some of my inconclusives and discovered http://setiathome.berkeley.edu/show_host_detail.php?hostid=6013647 had returned a task in 5 minutes. I thought that seemed rather quick, so I looked a little deeper and found that the host has huge numbers of inconclusives, invalids, and errors. It also has a very high pending count; I didn't look at any of them in detail, but I bet they're mostly fast turnarounds that will ultimately end up as invalid.

It looks like his GPU only went off the rails a couple weeks ago (I had a validated WU with him on August 18, but 9 invalids and 2 inconclusives since August 21), so I went ahead and sent him a PM to give him a heads-up. Perhaps he'll appreciate it, perhaps not! The user I sent a similar PM to about a week ago never responded, but a couple days later he stopped downloading tasks for his GPU, and hasn't reported any since then, either. So, no current contribution from his GPU, but at least it's not wasting the resources that it had been for several months.
ID: 1412218
Profile David Anderson (not *that* DA)
Joined: 5 Dec 09
Posts: 215
Credit: 74,008,558
RAC: 74
United States
Message 1412223 - Posted: 6 Sep 2013, 17:35:30 UTC

jason_gee writes:

"Boinc's redundancy mechanisms are designed to handle these 'Bad machines' along with cross platform variations to some (great?) extent. For the most part though I think that does sometimes put user expectations of reliability in the back seat, which might amount to efficiency concerns overall."

I wish it were just 'back seat', but users like Cellar Dweller (38588) with host 6766751 think as follows. Maybe others think this way too.

Part of a PM from Cellar Dweller in August 2013:
"The errors are for debugging and used so that they can perfect the software.

If I were to produce one good unit return out of a thousand - that's a good return. This is why I would only get one credit for the one return.

By killing the machine and not producing ANY results is what hurts.
BOINC works on the premise the more the merrier.
ANY good result is a + to the project."
ID: 1412223
Jim1348

Joined: 13 Dec 01
Posts: 212
Credit: 520,150
RAC: 0
United States
Message 1412292 - Posted: 6 Sep 2013, 19:48:34 UTC - in response to Message 1412223.  
Last modified: 6 Sep 2013, 19:49:17 UTC

Part of a PM from Cellar Dweller in August 2013:
"The errors are for debugging and used so that they can perfect the software.

If I were to produce one good unit return out of a thousand - that's a good return. This is why I would only get one credit for the one return.

By killing the machine and not producing ANY results is what hurts.
BOINC works on the premise the more the merrier.
ANY good result is a + to the project."

When deciding which projects to support, one consideration has to be how many people think that way. If the project administrators won't filter them out, then I do it by working on other projects. It is a highly effective filter.
ID: 1412292
Profile Jeff Buck
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1412383 - Posted: 6 Sep 2013, 22:52:39 UTC - in response to Message 1412191.  

IIRC, a while back it was Joe Segur who pointed out one possible metric as far as overall 'project health' goes.

At the moment:
Workunits waiting for db purging: 616,772 (MB) / 12,647 (AP)
Results waiting for db purging: 1,288,541 (MB) / 43,962 (AP)


so that's ~2.09 tasks sent out per multibeam workunit, and ~3.48 per Astropulse.


At first glance, I thought that ~2.09 tasks per MB WU seemed low, but then I realized that it represents about one extra task having to be created, resent, and processed for every 11 WUs, or about 4.5% additional project overhead from that standpoint. So, just out of curiosity...

I looked only at the MB tasks my two busiest machines have received since September 1 that have either been Completed and Validated or are still in an Inconclusive state, to see how many of those WUs have required resends. One machine has completed 1,010 such tasks, 103 of which required at least one resend (111 resends in total). The other has completed 876 such tasks, 90 of which required at least one resend (95 resends in total).

Certainly the sample is small, but it more or less confirms your metric, with my result coming out to ~2.11 tasks per WU (1,886 WUs × a minimum of 2 tasks each, plus 206 total resends, gives 3,978 tasks for 1,886 WUs).
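
(Recomputing that figure from the two machines' counts above, just to show the arithmetic:)

wus     = 1_010 + 876      # completed or inconclusive MB WUs since September 1
resends = 111 + 95         # extra tasks that had to be sent for those WUs
print(wus, resends, round((2 * wus + resends) / wus, 2))   # 1886 206 2.11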

Now, I didn't try to figure out how many of those resends were due to computation errors, abandonments, aborted tasks, -9 overflows, or legitimate inconclusives that were or will be resolved. Nor did I try to identify which were due to machines in the chronic-problem category as opposed to those that just hiccup occasionally. My gut feeling, though, is that further research would find that a significant proportion of those resends are caused by machines that are blowing through large volumes of tasks very quickly, simply because they're the ones that get those computation errors or overflows within a few seconds of the start of each and every task. They're voracious eaters, and the way the system currently operates, it just keeps feeding them!
ID: 1412383