Message boards :
Number crunching :
Another Machine (5049618) Leaving a Massive Mess. (It's been fixed, atm anyway).
Author | Message |
---|---|
Wiggo Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
I'm getting very annoyed with this Anonymous rig just lately, computer 5049618. Its task counts currently stand at: All (14451) · In progress (142) · Validation pending (6294) · Validation inconclusive (5170) · Valid (285) · Invalid (2531) · Error (29). So to whoever owns it, please get it fixed. Cheers. |
_ Joined: 15 Nov 12 Posts: 299 Credit: 9,037,618 RAC: 0 |
Every couple of days I get a small handful (about 10 in the last month) of "Validation Inconclusive" WUs, but I've never worried much about it. On this rig, though, it is obviously a problem. Is this a direct result of something going wrong on your rig? |
Gatekeeper Joined: 14 Jul 04 Posts: 887 Credit: 176,479,616 RAC: 0 |
Looks like a good portion of his(?) results are overflows. With (getting old) 295 cards, it's probably a heat-related issue. Those old 295s did really funky things when the heat sinks started wearing out. |
Cliff Harding Joined: 18 Aug 99 Posts: 1432 Credit: 110,967,840 RAC: 67 |
Whoever the owner is, they would have a right smart machine if they replaced those 4 GTX 295s with just 2 current-model cards and ran with Cuda_50. I don't buy computers, I build them!! |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Looks like this guy has 2 cards that are working okay (Device 1 and Device 2), one that rapidly produces all the overflow invalids (Device 3), and one that slowly gets computation errors by exceeding the time limit (Device 4). This isn't new for this machine, either. I looked in my own database and found 14 WUs dating back to May 24th (and v6) that I've shared with him and that he's had marked invalid. I've still got one inconclusive waiting since June 27th and another from just a couple days ago. (That's against only 6 WUs that have validated, and one computation error.) From his Application details: SETI@home v7 7.00 windows_intelx86 (cuda32) · Number of tasks completed: 4315 · Max tasks per day: 33 · Number of tasks today: 552 · Consecutive valid tasks: 0. Just another example of how meaningless that "Max tasks per day" quota is. |
Gatekeeper Joined: 14 Jul 04 Posts: 887 Credit: 176,479,616 RAC: 0 |
Looks like this guy has 2 cards that are working okay (Device 1 and Device 2), one that rapidly produces all the overflow invalids (Device 3), and one that slowly gets computation errors by exceeding the time limit (Device 4). Remember that the 295s were the first of the "dual processor" cards, so 2 physical cards = 4 GPUs. I had a pair of them a couple of years ago, and when 1 GPU went south, the other in the same case followed quickly. Some of the EVGA cards did well and lasted a long time; in fact, I think there are a couple of rigs in the top rigs list that are still using them. But beyond that, they had a relatively short life expectancy. |
Link Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0 |
From his Application details: Well, here it works as it should: the quota is per core, and for GPUs it's multiplied by 8, so for his 4 GPUs it is 4 x 33 x 8 = 1056. The real issue is that invalids are not counted as errors, and of course that each valid result doubles the quota if it's below the start value, so his one working GPU can easily keep the limit at 33. |
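As a rough illustration of the quota bookkeeping Link describes, here is a minimal Python sketch. It assumes a per-processor quota that starts at a project-defined cap, shrinks on errored results, doubles (up to the cap) on valid results, and is untouched by invalids, with the per-host daily GPU limit being that quota times 8 per GPU. The real BOINC scheduler differs in detail; the class name and the exact error penalty below are assumptions, not the project's actual code.

```python
# Minimal sketch of the daily-quota behaviour described above.
# Assumption: an error halves the per-processor quota (the real scheduler's
# penalty differs in detail); a valid result doubles it back up to the cap;
# an invalid result is simply not counted, which is the loophole being discussed.

QUOTA_CAP = 33        # "Max tasks per day" start value seen on the host
GPU_MULTIPLIER = 8    # GPU tasks count against quota x 8

class HostQuota:
    def __init__(self, n_gpus, cap=QUOTA_CAP):
        self.cap = cap
        self.quota = cap              # per-processor quota
        self.n_gpus = n_gpus

    def report_valid(self):
        self.quota = min(self.cap, self.quota * 2)

    def report_error(self):
        self.quota = max(1, self.quota // 2)   # assumed penalty

    def report_invalid(self):
        pass                           # invalids never touch the quota

    def daily_gpu_limit(self):
        return self.quota * GPU_MULTIPLIER * self.n_gpus

host = HostQuota(n_gpus=4)
print(host.daily_gpu_limit())          # 4 x 33 x 8 = 1056 tasks/day

# One healthy GPU returning the occasional valid keeps the quota pinned at the
# cap, no matter how many invalids the other GPUs churn out:
for _ in range(1000):
    host.report_invalid()
host.report_valid()
print(host.daily_gpu_limit())          # still 1056
```

Under these assumptions, a host like 5049618 never gets throttled: the invalid stream costs it nothing, and a single working device keeps resetting the quota to its maximum.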
Wiggo Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
It would be nice if that 33 quota were cut to 11 for invalids, but this rig looks to be well under the control of the servers now, as shown by the latest task figures: All (2979) · In progress (142) · Validation pending (1246) · Validation inconclusive (945) · Valid (294) · Invalid (268) · Error (84). Cheers. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
On the other hand, here's one that's still running wild: http://setiathome.berkeley.edu/results.php?hostid=6972846 Everything on his GPU is going into overflow, even a bunch of AP tasks today. I've been paired with this machine for 40 WUs, going back to May 26th, and every one has ended with his result getting marked as Invalid. Sad thing is, he's not Anonymous and has a couple of other machines that are working fine, so I sent him a PM a couple of days ago, but it doesn't seem to have had any effect. |
Wiggo Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
On the other hand, here's one that's still running wild: Even sadder is the fact that he's another one who thinks there's nothing wrong with his rig, but his total number of tasks has also dropped a lot since the introduction of SETI v7. Cheers. |
Tazz Joined: 5 Oct 99 Posts: 137 Credit: 34,342,390 RAC: 0 |
Not entirely OT, but two of my crunchers have been paired with quite a few of these machines leaving a mess. One has 18 Validation inconclusive tasks; I think 5 or 6 of them show the same results, while the rest of my wingmen have thousands of invalids and/or errors. The other has 22; I checked some of them and there were a few WUs showing the same results, but most were paired with hosts giving invalids or errors. We need to come up with a name for these machines giving junk results. </Tazz> |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
We need to come up with a name for these machines giving junk results. I'd be less interested in a name for them than in having some way for the project to notify these users when their machines start going haywire, even the Anonymous ones. Some of the ones I've been paired with have been blowing chunks for months. I looked at the first one on your second list (http://setiathome.berkeley.edu/results.php?hostid=6253478) and it looks like all his MB GPU tasks are getting -9 overflows, but his AP tasks are mostly giving good results. According to my own database, I've been paired with him 12 times in the last 3+ months, with him getting 11 Invalids and only 1 Valid (which just happened to be a legitimate -9 overflow). Being Anonymous, there's no way for you, me or any of his wingmen to give him a heads-up. It seems to me that the project should have some way for us to report these machines to a central location (other than these scattered grumblings in the forums), which could then send the user an email notice directly (even to the Anonymous ones), so they could do something about the problem. I'm sure these situations are not intentional, and the potential benefit the project could be receiving from having the problems fixed could really add up. (The aforementioned machine has a GTX 560 Ti, which could be making a decent contribution with a little behavior modification.) I could probably come up with a list of a couple dozen of these runaway machines in just an hour or so, and I'd be happy to take the time to do that, if only I felt that my time would result in these machines becoming valuable contributors to the project again, and both they and S@H could stop this senseless, ongoing waste of resources. |
David S Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12 |
I just randomly decided to look at some of my inconclusives and discovered http://setiathome.berkeley.edu/show_host_detail.php?hostid=6013647 had returned a task in 5 minutes. I thought that seemed rather quick, so I looked a little deeper and found that the host has huge numbers of inconclusives, invalids, and errors. It also has a very high pending count; I didn't look at any of them in detail, but I bet they're mostly fast turnarounds that will ultimately end up as invalid. David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. |
Donald L. Johnson Joined: 5 Aug 02 Posts: 8240 Credit: 14,654,533 RAC: 20 |
Every user has to provide a valid email address to join the project. Besides Project Management, I believe Team Leaders/Founders have access to the addresses of their teammates. So there IS a way. The next questions are: how large a problem are these runaway crunchers, and is it worth the extra time and effort for management to contact these folks and then monitor for changes in the cruncher's behavior? Or is it such a minor problem that we just have to live with the frustration of waiting 6-12 weeks for our shared tasks to validate, and the database bloat these runaways create? Donald Infernal Optimist / Submariner, retired |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Every user has to provide a valid email address to join the project. Besides Project Management, I believe Team Leaders/Founders have access to the addresses of their teammates. So there IS a way. IIRC a while back it was Joe Segur who pointed out one possible metric as far as overall 'project health' goes. At the moment the server status page shows 616,772 multibeam and 12,647 Astropulse workunits waiting for db purging, and setting those against the corresponding results waiting for purging works out to ~2.09 tasks sent out per multibeam workunit and ~3.48 per Astropulse. A detailed understanding of why APs either come back inconclusive or are abandoned, and how this relates to the applications etc., might be for future development to explore. In the multibeam case, some reasons for result divergence are known, such as the possibility of mismatched overflow results between CPU & GPU. For 'regular' results, some of my own work was directed toward cross-platform match between different platforms and devices (algorithmic precision maintenance related). AP, to my knowledge, hasn't received this kind of attention yet, so that is very likely at least some small portion of the disparity. I think under v6 the same figures were typically around ~2.5 tasks per WU whenever I looked. On my more or less healthy machines (excluding the flaky Linux host) the inconclusive-to-pending ratios are around ~2%, where before they were more like 8-10%. That 4-5x improvement could be partially due to bad hosts dropping off with the v7 transition, the mentioned application refinements (partially reliability & fault tolerance related too, compared to stock 6.08/6.09/6.10 at least), and in some (perhaps fewer) cases users taking positive corrective action. For Cuda multibeam there's certainly more that can be done to handle exceptional cases. How far to push these before looking at the AP situation might be a good question for me to think about, along with just how reliable is reliable enough? Boinc's redundancy mechanisms are designed to handle these 'bad machines', along with cross-platform variations, to some (great?) extent. For the most part, though, I think that does sometimes put user expectations of reliability in the back seat, which might amount to efficiency concerns overall. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions. |
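To make the arithmetic behind that "tasks per workunit" metric explicit, here is a small Python sketch. The workunit counts are the ones quoted above; the result counts are hypothetical placeholders chosen only to reproduce the quoted ~2.09 and ~3.48 ratios, since the actual result figures from the status page are not repeated in the post, and the function and variable names are mine.

```python
# "Project health" metric: average number of tasks (results) that had to be
# sent out per workunit now awaiting db purge. Workunit counts are from the
# post above; the result counts below are HYPOTHETICAL placeholders picked
# only to reproduce the quoted ~2.09 and ~3.48 ratios.

def tasks_per_workunit(results_waiting: int, workunits_waiting: int) -> float:
    """Average tasks sent out per workunit waiting for db purging."""
    return results_waiting / workunits_waiting

mb_workunits = 616_772        # multibeam workunits waiting for db purging
ap_workunits = 12_647         # Astropulse workunits waiting for db purging
mb_results = 1_289_000        # hypothetical placeholder
ap_results = 44_000           # hypothetical placeholder

print(f"MB: {tasks_per_workunit(mb_results, mb_workunits):.2f} tasks/WU")  # ~2.09
print(f"AP: {tasks_per_workunit(ap_results, ap_workunits):.2f} tasks/WU")  # ~3.48
```

Since the minimum replication is two tasks per workunit, anything above 2.0 in the multibeam figure is extra work caused by resends; that is the overhead Jeff Buck quantifies a few posts further down.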
Mike Joined: 17 Feb 01 Posts: 34257 Credit: 79,922,639 RAC: 80 |
On APs there are a few other things to consider. The AP apps for GPU have parameters to optimize computation, and I check a lot of units every week. Some users set up the parameters while running just one instance, until it's running fine, then after some time decide to increase the number of instances without readjusting those parameters, especially unroll. Parameters that run fine with 1 instance don't necessarily work with 2 or 3 instances. On top of that, a lot of people run outdated drivers and never read the read-mes. Running multiple instances needs a lot of fine-tuning, depending on blanking and how many cores you free for it. With each crime and every kindness we birth our future. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
I just randomly decided to look at some of my inconclusives and discovered http://setiathome.berkeley.edu/show_host_detail.php?hostid=6013647 had returned a task in 5 minutes. I thought that seemed rather quick, so I looked a little deeper and found that the host has huge numbers of inconclusives, invalids, and errors. It also has a very high pending count; I didn't look at any of them in detail, but I bet they're mostly fast turnarounds that will ultimately end up as invalid. It looks like his GPU only went off the rails a couple weeks ago (I had a validated WU with him on August 18, but 9 invalids and 2 inconclusives since August 21), so I went ahead and sent him a PM to give him a heads-up. Perhaps he'll appreciate it, perhaps not! The user I sent a similar PM to about a week ago never responded, but a couple days later he stopped downloading tasks for his GPU, and hasn't reported any since then, either. So, no current contribution from his GPU, but at least it's not wasting the resources that it had been for several months. |
David Anderson (not *that* DA) Joined: 5 Dec 09 Posts: 215 Credit: 74,008,558 RAC: 74 |
jason_gee writes: "Boinc's redundancy mechanisms are designed to handle these 'Bad machines' along with cross platform variations to some (great?) extent. For the most part though I think that does sometimes put user expectations of reliability in the back seat, which might amount to efficiency concerns overall." I wish it was just 'back seat', but users like Cellar Dweller (38588) with host 6766751 think as follows. Maybe others think this way too. Part of a PM from Cellar Dweller in August 2013: "The errors are for debugging and used so that they can perfect the software. If I were to produce one good unit return out of a thousand - that's a good return. This is why I would only get one credit for the one return. By killing the machine and not producing ANY results is what hurts. BOINC works on the premise the more the merrier. ANY good result is a + to the project." |
Jim1348 Joined: 13 Dec 01 Posts: 212 Credit: 520,150 RAC: 0 |
Part of a PM from Cellar Dweller in August 2013: When deciding on what projects to support, one consideration has to be how many people think that way. If the project administrators won't filter them out, then I do it by working on other projects. It is a highly effective filter. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
IIRC a while back it was Joe Segur who pointed out one possible metric as far as overall 'project health' goes. At first glance, I thought that ~2.09 tasks per MB WU seemed low, but then I realized that it represents about one extra task having to be created, resent and processed for every 11 WUs, about 4.5% additional project overhead from that standpoint. So, just out of curiosity... I looked at only the MB tasks my two busiest machines have received since September 1, and which have either been Completed and Validated, or are still in an Inconclusive state, in order to see how many of those WUs have required resends. One machine has completed 1,010 such tasks, of which 103 had to be resent at least once (for a total of 111 resends). The other has completed 876 such tasks, of which 90 had to be resent at least once (for a total of 95 resends). Certainly the sample is small, but it more or less confirms your metric, my result coming out to ~2.11 tasks per WU ((1886 WUs x 2 tasks minimum + 206 total resends) / 1886 WUs). Now, I didn't try to figure out how many of those resends were due to computation errors, abandonments, aborts, -9 overflows, or legitimate inconclusives that were or will be resolved. Nor did I try to identify which were due to machines that are in the chronic problem category as opposed to those that just hiccup occasionally. My gut feeling, though, is that further research would find that a significant proportion of those resends are caused by machines that are blowing through large volumes of tasks very quickly, simply because they're the ones that get those computation errors or overflows within a few seconds of the start of each and every task. They're voracious eaters, and the way the system currently operates, it just keeps feeding them! |
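As a quick sanity check of those figures, here is the same arithmetic in a few lines of Python, using only the counts quoted in the post (the variable names are mine).

```python
# Resend overhead from Jeff's two machines, using only the counts given above.
wus_1, resends_1 = 1010, 111     # machine 1: WUs observed, total resends
wus_2, resends_2 = 876, 95       # machine 2: WUs observed, total resends

total_wus = wus_1 + wus_2                      # 1886 WUs
total_resends = resends_1 + resends_2          # 206 tasks beyond the 2-task minimum

tasks_per_wu = (total_wus * 2 + total_resends) / total_wus
print(f"{tasks_per_wu:.2f} tasks per WU")                              # ~2.11

# Overhead on top of the minimum replication of 2:
print(f"~{(tasks_per_wu - 2) / 2:.1%} extra work")                     # ~5.5%
print(f"about one resend for every {1 / (tasks_per_wu - 2):.0f} WUs")  # ~9
```

In this sample the overhead works out to roughly one resend for every nine WUs, slightly worse than the one-in-eleven implied by the project-wide ~2.09 figure discussed earlier.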