Anonymous host throwing only errors, 3223 right now

Khangollo Joined: 1 Aug 00 Posts: 245 Credit: 36,410,524 RAC: 0	Message 1247245 - Posted: 17 Jun 2012, 2:28:28 UTC - in response to Message 1247242. Last modified: 17 Jun 2012, 2:42:08 UTC "When will this slop stop?" From what I have seen while checking through my ever-growing pending list, in a lot of cases the cause is the faulty Windows NVIDIA driver (the "sleep bug"). Unless the admins update the scheduler so that it does not issue CUDA tasks to hosts with certain driver versions, as was done at Einstein@Home, this is probably not going to stop on its own until everyone updates to 301.x, which may take months or may never happen. ID: 1247245 ·

Wembley Volunteer tester Joined: 16 Sep 09 Posts: 429 Credit: 1,844,293 RAC: 0	Message 1247249 - Posted: 17 Jun 2012, 3:01:02 UTC Since Microsoft decided to automatically update everyone's NVIDIA drivers, there are a lot of systems out there generating errors that nobody is even watching. Until Microsoft does this again with an NVIDIA driver version that does not have the sleep bug, we are in for a long and bumpy ride. ID: 1247249 ·

Fred J. Verster Volunteer tester Send message Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0	Message 1247250 - Posted: 17 Jun 2012, 3:01:35 UTC - in response to Message 1247242. Computer 1334363 Computer 2699082 Computer 3378825 Computer 3440890 Computer 5007852 Computer 5236541 Computer 5292089 Computer 5345364 Computer 5461131 Computer 5485240 Computer 5542882 Computer 5762247 Computer 5874840 Computer 5889577 Computer 5927348 Computer 5935169 Computer 5967897 Computer 6126443 Computer 6148459 Computer 6229518 Computer 6247292 Computer 6249179 Computer 6253461 Computer 6256705 Computer 6271384 Computer 6283429 Computer 6318737 Computer 6401198 Computer 6441935 Computer 6469701 Computer 6568123 Computer 6586696 Computer 6589662 Computer 6633670 Computer 6640112 Computer 6643594 Computer 6643620 Computer 6650172 Computer 6650230 Computer 6651362 Computer 1901895 Computer 5218485 Computer 5348349 Computer 5389162 Computer 5877728 Computer 5932466 Computer 6204067 Computer 6236663 Computer 6249533 Computer 6462813 Computer 6598140 Now this is just getting a bit far past the funny side now as I checked in this morning to find that my pendings had taken another leap upwards again so I checked them all again only to find out that I've been hit up again by a new lot of problematic hosts. :( Computer 4799081 Computer 5821256 Computer 5935372 Computer 6028874 Computer 6128026 Computer 6175649 Computer 6189897 Computer 6247549 Computer 6387513 Computer 6585255 Computer 6641972 Computer 6649287 Now that makes 63 that I've teamed, double teamed, triple teamed and quadrupled teamed with this month alone. When will this slop stop? Cheers. From what I was checking a bit, in most cases it is the faulty windows nvidia driver ("sleep bug"). Unless admins update the scheduler to not issue cuda tasks to hosts with certain driver versions - like they did at einstein@home - this is probably not going to stop on its own until everyone updates to 301.x... 
which may take months or even never. ____________ This is getting big: I have a similar number of hosts as wingmen with a 10-to-1 failure rate, that is, 90% of their results are invalid or inconclusive! ID: 1247250 ·

Wiggo Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489	Message 1247271 - Posted: 17 Jun 2012, 4:29:49 UTC - in response to Message 1247250. In that list there are 17 hosts with the sleepy driver bug, and there are also 10 out-of-control GTX 560 Ti cards running with other drivers; together these two types account for around 45% of the total. A number of the others probably just need a good clean-out, but there must be some way to alert these people to their problems and to cut off these hosts until the problems are sorted. Out of all the PMs I sent out last week I have received a grand total of 5 replies, and a couple of other hosts have not contacted the servers since, but it's a start I suppose (maybe even this thread is helping too). Cheers. ID: 1247271 ·

Lionel Joined: 25 Mar 00 Posts: 680 Credit: 563,640,304 RAC: 597	Message 1247355 - Posted: 17 Jun 2012, 9:54:15 UTC - in response to Message 1247242. "Now this is just getting a bit far past the funny side now as I checked in this morning to find that my pendings had taken another leap upwards again so I checked them all again only to find out that I've been hit up again by a new lot of problematic hosts. :( [...] Now that makes 63 that I've teamed, double teamed, triple teamed and quadrupled teamed with this month alone. When will this slop stop? Cheers." I agree ... something needs to be done and it needs to go towards the top of their activity list ... L. ID: 1247355 ·

Fred J. Verster Volunteer tester Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0	Message 1247357 - Posted: 17 Jun 2012, 10:02:58 UTC - in response to Message 1247271. Last modified: 17 Jun 2012, 10:06:01 UTC "In that list there are 17 hosts with the sleepy driver bug, and there are also 10 out-of-control GTX 560 Ti cards running with other drivers; together these two types account for around 45% of the total. A number of the others probably just need a good clean-out, but there must be some way to alert these people to their problems and to cut off these hosts until the problems are sorted. Out of all the PMs I sent out last week I have received a grand total of 5 replies, and a couple of other hosts have not contacted the servers since, but it's a start I suppose (maybe even this thread is helping too). Cheers." I sure hope so, because I still see too many hosts producing errors: host 6201358. Host 6623322 has hopefully spotted the errors. Host 6146694 has 2823 errors. Host 6643839. Host 5932466. Host 6469701. They even give problems when loading multiple results! ID: 1247357 ·

Wiggo Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489	Message 1247361 - Posted: 17 Jun 2012, 10:14:00 UTC - in response to Message 1247355. "Now this is just getting a bit far past the funny side now as I checked in this morning to find that my pendings had taken another leap upwards again so I checked them all again only to find out that I've been hit up again by a new lot of problematic hosts. :( [...] Now that makes 63 that I've teamed, double teamed, triple teamed and quadrupled teamed with this month alone. When will this slop stop? Cheers." "I agree ... something needs to be done and it needs to go towards the top of their activity list ... L."
Well this is getting beyond a joke now as it seems that I have been turned into a magnet to attract these machines for some reason :( Computer 2987620 Computer 4768798 Computer 5820931 Computer 5925627 Computer 6109377 Computer 6110059 Computer 6453095 Computer 6551396 Computer 6621248 Computer 6692048 My pendings are 50% higher than they usually are. Cheers. ID: 1247361 ·

shizaru Volunteer tester Send message Joined: 14 Jun 04 Posts: 1130 Credit: 1,967,904 RAC: 0	Message 1247379 - Posted: 17 Jun 2012, 11:49:20 UTC One more (couldn't find it with ctrl+f in this thread, so here ya go) Computer 6057058 ID: 1247379 ·

Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489	Message 1248142 - Posted: 18 Jun 2012, 23:42:58 UTC - in response to Message 1247361. Computer 1334363 Computer 2699082 Computer 3378825 Computer 3440890 Computer 5007852 Computer 5236541 Computer 5292089 Computer 5345364 Computer 5461131 Computer 5485240 Computer 5542882 Computer 5762247 Computer 5874840 Computer 5889577 Computer 5927348 Computer 5935169 Computer 5967897 Computer 6126443 Computer 6148459 Computer 6229518 Computer 6247292 Computer 6249179 Computer 6253461 Computer 6256705 Computer 6271384 Computer 6283429 Computer 6318737 Computer 6401198 Computer 6441935 Computer 6469701 Computer 6568123 Computer 6586696 Computer 6589662 Computer 6633670 Computer 6640112 Computer 6643594 Computer 6643620 Computer 6650172 Computer 6650230 Computer 6651362 Computer 1901895 Computer 5218485 Computer 5348349 Computer 5389162 Computer 5877728 Computer 5932466 Computer 6204067 Computer 6236663 Computer 6249533 Computer 6462813 Computer 6598140 Computer 4799081 Computer 5821256 Computer 5935372 Computer 6028874 Computer 6128026 Computer 6175649 Computer 6189897 Computer 6247549 Computer 6387513 Computer 6585255 Computer 6641972 Computer 6649287 Computer 2987620 Computer 4768798 Computer 5820931 Computer 5925627 Computer 6109377 Computer 6110059 Computer 6453095 Computer 6551396 Computer 6621248 Computer 6692048 Well here's the next bunch that have decided to hit me up. :( Computer 4891408 Computer 4910942 Computer 5049618 Computer 5186014 Computer 5352388 Computer 5472976 Computer 5514958 Computer 5545579 Computer 5738515 Computer 5851682 Computer 6137511 Computer 6140929 Computer 6199282 Computer 6201358 Computer 6244264 Computer 6331276 Computer 6598366 Computer 6619286 Computer 6636907 Computer 6637324 Computer 6646583 Computer 6687852 Computer 6689321 Computer 6703621 Cheers. ID: 1248142 ·

Ex: "Socialist" Volunteer tester Joined: 12 Mar 12 Posts: 3433 Credit: 2,616,158 RAC: 2	Message 1248267 - Posted: 19 Jun 2012, 6:45:57 UTC I would love it if someone from the project end could chime in regarding this. Is this an exponentially growing problem, or is this just a steady percentage of rogue hosts? If this is indeed a growing problem, perhaps the S@H team should start denying WU handouts based on quota or on the percentage of errors/invalids. If it's just a steady percentage and nothing abnormal, perhaps they should chime in and shut us up? ;-) One way or another I'm sure we would all appreciate some enlightenment. #resist ID: 1248267 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1248281 - Posted: 19 Jun 2012, 7:47:46 UTC - in response to Message 1248267. Last modified: 19 Jun 2012, 8:04:51 UTC One way or another I'm sure we would all appreciate some enlightenment. While I'm not core project staff, I can certainly relay my interpretation of how 'the project' views errors/invalids. Project Staff tend to look at project health from a more holistic standpoint, that is 'overall' health rather than chasing individual problem hosts & such. They put the existing quota mechanisms in play, by my understanding, to mostly limit the amount of storage space, and possibly regulate the amount of work that reaches 'too many errors', so missing out on scientific analysis. How effective those measures are at regulating those issues, I suggest we can't really know, as individual users, apart from general indications from 'results waiting for db purging' versus 'Workunits waiting for db Purging'. Really in many cases 'invalids' and 'errors' themselves are largely out of direct 'control' by the project, though regulating them at least to some degree is practical. As users and developers we can influence some kinds of these invalids & errors, and I recall a few occasions applications being pulled where invalids without error occurred en masse. With the V7 transition to come, a lot of work by various people has gone into attempting to reduce some of the potential sources of invalids, but 'rogue' hosts are also often out of scope for application development. I think the easiest way to look at it, to avoid insanity, is that regulated numbers of hard errors are more or less 'fine' since they don't pollute the science unless going to 'Too many Errors - may have bug'. 
Being outvoted by older applications with known accuracy limitations, or notable divergence cross-platform (such as CPU vs GPU), so becoming invalid, is a far more minor problem though seriously annoying, that is within 'our scope'. That is something when I queried about priorities moving forward, Eric responded that it would be a great thing to refine, if we can do it from our end... so V7 multibeam will be improved from our angle at least, and hopefully AP eventually too. Jason "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1248281 ·

Cheopis Joined: 17 Sep 00 Posts: 156 Credit: 18,451,329 RAC: 0	Message 1248332 - Posted: 19 Jun 2012, 12:40:21 UTC When these machines are requesting hundreds if not thousands of new work units per day because they are finishing work units with errors 50+ times faster than they should, isn't that a significant impact on the health of the project? One of the problems the project has had for quite some time is limited bandwidth. Every one of these machines getting 50x its normal allotment of work units is taking up the bandwidth of 50 error-free machines of the same class. It would seem to me that even if the percentage of machines pulling huge numbers of work units and erroring out on them is low, the absolute number of machines, based on the lists upthread, should be enough to have a measurable, and perhaps significant, impact on the lab's pipe to the internet? ID: 1248332 ·

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1248334 - Posted: 19 Jun 2012, 12:51:31 UTC - in response to Message 1248332. "When these machines are requesting hundreds if not thousands of new work units per day because they are finishing work units with errors 50+ times faster than they should, isn't that a significant impact on the health of the project? One of the problems the project has had for quite some time is limited bandwidth. Every one of these machines getting 50x its normal allotment of work units is taking up the bandwidth of 50 error-free machines of the same class. It would seem to me that even if the percentage of machines pulling huge numbers of work units and erroring out on them is low, the absolute number of machines, based on the lists upthread, should be enough to have a measurable, and perhaps significant, impact on the lab's pipe to the internet?" As an 'end user' I would expect all of that to be exactly true, though I have no idea of the scale of the problems relative to 'normal' operation, other than the 5-10% or so of resends indicated. One indication is that with Astropulse splitting off and improved uptime, download pipe usage actually drops to manageable levels when combined with the fixes that Matt and others have been making. I believe one of the worst scenarios in recent times was the severe storage space issues and failing servers; I'm proud to be a member of the GPU Users Group, which is pushing to get the project what it needs. Right now, with so much going on, following the drives with GPUUG is probably the closest means we all have to supporting the ongoing improvement. Jason "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1248334 ·

Horacio Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0	Message 1248362 - Posted: 19 Jun 2012, 14:55:10 UTC I know that there is a quota system already implemented in BOINC, but the question is: is it hard coded or is it configurable? If it is configurable, then it should be easy for the project to minimize the effects of the "rogue" hosts... just making the quota rise more slowly than it falls would help a lot to limit the number of errors from failing hosts... Or is it that "successful but invalid" results don't affect the quota? ID: 1248362 ·

Josef W. Segur Volunteer developer Volunteer tester Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 1248745 - Posted: 20 Jun 2012, 14:36:38 UTC - in response to Message 1248362. "I know that there is a quota system already implemented in BOINC, but the question is: is it hard coded or is it configurable? If it is configurable, then it should be easy for the project to minimize the effects of the 'rogue' hosts... just making the quota rise more slowly than it falls would help a lot to limit the number of errors from failing hosts... Or is it that 'successful but invalid' results don't affect the quota?" See http://boinc.berkeley.edu/trac/wiki/ProjectOptions#Joblimits for the parts which are not hard coded. The hard coded stuff currently is primarily based on whether the host reports an error or not, though validations do have some effect. Given this project's setting of 100 for daily_result_quota:
- If an error is reported, "Max tasks per day" is reduced to less than the basic quota of 100: to 99 if the host was previously OK, or by one more if it was already below.
- If a "success" is reported and the host was below the basic quota, "Max tasks per day" is doubled but capped at 100.
- A task judged valid increases "Max tasks per day" by one.
- A task judged invalid reduces "Max tasks per day" by one, but only if it was above the basic quota.
Joe ID: 1248745 ·
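The rules Joe lists can be sketched as a small Python model. This is only an illustration of the post's description, not BOINC's scheduler code: the function names, the `replay` helper, and the floor of 1 on the quota are assumptions made for the example.

```python
BASE_QUOTA = 100  # daily_result_quota for this project, as stated in the post


def on_error(quota):
    # A previously OK host drops to 99; a host already below loses one more.
    # The floor of 1 is an assumption, not something the post states.
    if quota >= BASE_QUOTA:
        return BASE_QUOTA - 1
    return max(1, quota - 1)


def on_success(quota):
    # A reported "success" doubles a below-base quota, capped at the base.
    if quota < BASE_QUOTA:
        return min(BASE_QUOTA, quota * 2)
    return quota


def on_valid(quota):
    # A task judged valid adds one.
    return quota + 1


def on_invalid(quota):
    # A task judged invalid subtracts one, but only above the base quota.
    return quota - 1 if quota > BASE_QUOTA else quota


HANDLERS = {
    "error": on_error,
    "success": on_success,
    "valid": on_valid,
    "invalid": on_invalid,
}


def replay(events, quota=BASE_QUOTA):
    """Apply a sequence of event names and return the resulting quota."""
    for event in events:
        quota = HANDLERS[event](quota)
    return quota
```

With these rules, `replay(["error"] * 50 + ["success"])` ends at 100: the fifty errors walk the quota down from 100 to 50, and the single reported success doubles it back to the cap.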

Link Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0	Message 1248748 - Posted: 20 Jun 2012, 14:53:34 UTC - in response to Message 1248745. "- If a 'success' is reported and the host was below the basic quota, 'Max tasks per day' is doubled but capped at 100. (...) - A task judged invalid reduces 'Max tasks per day' by one, but only if it was above the basic quota." So basically the system is completely useless for hosts that generate invalid results and almost useless for hosts that generate errors, since it's enough for a host to send 1 out of 50 results back as a success (which does not even need to be valid) to stay at 100 tasks per day. I wouldn't consider that a working system. ID: 1248748 ·

Horacio Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0	Message 1248772 - Posted: 20 Jun 2012, 15:41:06 UTC - in response to Message 1248745. "See http://boinc.berkeley.edu/trac/wiki/ProjectOptions#Joblimits for the parts which are not hard coded." LOL... reading that link is like watching Star Trek: it seems like I understand what they are talking about... "The hard coded stuff currently is primarily based on whether the host reports an error or not, though validations do have some effect. Given this project's setting of 100 for daily_result_quota: - If an error is reported, 'Max tasks per day' is reduced to less than the basic quota of 100: to 99 if the host was previously OK, or by one more if it was already below. - If a 'success' is reported and the host was below the basic quota, 'Max tasks per day' is doubled but capped at 100. - A task judged valid increases 'Max tasks per day' by one. - A task judged invalid reduces 'Max tasks per day' by one, but only if it was above the basic quota. Joe" That's it... If a host throws 50 errors, plus as many invalids as it wants, and then just 1 valid, it gets its quota set back to 100... Errors, unlike invalids, should be something really rare. If each error decreased the quota by 9 and each success increased it by 1, then a host would need 9 successes to compensate for one error, which means that any host with an error rate above 10% would not be able to raise its quota. (Of course, choosing the "right" ratio is beyond me.) As invalids are more common and do not necessarily mean that the host is failing, the 1 down/1 up rule seems more or less good, as it requires at least 50% valid tasks to raise the quota, but I think it could be a bit less permissive without affecting the quota of good hosts: 2 or 3 down/1 up would require 66% or 75% valid tasks, which should not be hard to achieve... Of course, I'm just thinking out loud (and I'm not original with the ideas).
I guess, like Jason said, that as long as the invalids/errors are discarded by the replication validation, it may not be a priority for the project to invest time in implementing this kind of change, especially if the resources wasted by those errors are not significant to the load on the servers... ID: 1248772 ·
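The break-even arithmetic in Horacio's post can be checked with a few lines of Python. This is a sketch under the post's own assumptions (a fixed step down for each bad result, a fixed step up for each good one); `min_good_fraction` is a hypothetical helper, not part of BOINC.

```python
from fractions import Fraction


def min_good_fraction(down, up=1):
    """Fraction of good results at which the quota neither rises nor falls.

    With g the fraction of good results, the average change per result is
    g * up - (1 - g) * down, which is zero when g = down / (down + up).
    """
    return Fraction(down, down + up)


# Print the break-even point for the ratios mentioned in the post.
for down in (1, 2, 3, 9):
    g = min_good_fraction(down)
    print(f"{down} down / 1 up: quota rises only above {float(g):.1%} good results")
```

For 9 down / 1 up the threshold is 90% good results, i.e. an error rate of at most 10%, which matches the figure in the post; 1, 2 and 3 down give 50%, two thirds (about 66.7%) and 75%.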

Link Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0	Message 1248777 - Posted: 20 Jun 2012, 15:53:07 UTC - in response to Message 1248772. "That's it... If a host throws 50 errors, plus as many invalids as it wants, and then just 1 valid, it gets its quota set back to 100..." No, it does not need a valid result to get back to 100; it needs to report a task as a success. If the task turns out invalid after that, it doesn't matter. At least that's how I understand "- If a 'success' is reported and the host was below the basic quota, 'Max tasks per day' is doubled but capped at 100." Success != valid result. "As invalids are more common and do not necessarily mean that the host is failing, the 1 down/1 up rule seems more or less good, as it requires at least 50% valid tasks to raise the quota, but I think it could be a bit less permissive without affecting the quota of good hosts: 2 or 3 down/1 up would require 66% or 75% valid tasks, which should not be hard to achieve..." I'd start with 50%, i.e. 1 down for each error/invalid and 1 up for each valid (not "success"), with a reset to 100 on the first error/invalid if above. ID: 1248777 ·

Horacio Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0	Message 1248789 - Posted: 20 Jun 2012, 16:20:50 UTC - in response to Message 1248777. "That's it... If a host throws 50 errors, plus as many invalids as it wants, and then just 1 valid, it gets its quota set back to 100..." "No, it does not need a valid result to get back to 100; it needs to report a task as a success. If the task turns out invalid after that, it doesn't matter. At least that's how I understand" You're right... which is even worse... As I said, choosing the ratios is beyond me, but anyway I'm not sure about resetting the limits; any big step (up or down) will probably mess things up... (I don't want to hurt faster hosts' quotas just for one invalid/error... which would be like killing mosquitoes with a nuke: effective? Maybe... efficient? Surely not... :D ) ID: 1248789 ·

Link Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0	Message 1248806 - Posted: 20 Jun 2012, 17:02:51 UTC - in response to Message 1248789. Last modified: 20 Jun 2012, 17:05:43 UTC "I don't want to hurt faster hosts' quotas just for one invalid/error... which would be like killing mosquitoes with a nuke: effective? Maybe... efficient? Surely not... :D )" To avoid hurting faster hosts, the reset to 100 (or below) tasks could happen after 3-5 (or whatever) consecutive errors/invalids, or after that many within a defined period of time. Although I think that 100 per CPU core or 800 per GPU should still be enough even for the fastest CPUs or GPUs, at least for now. But resetting to 100 because of a single error/invalid is indeed just annoying and in most cases useless. To help hosts that have been fixed build up a cache, there could also be an "I fixed it!" button on the application page, which could raise the quota by 50, for example. The use of this button should of course be limited to once a day or something like that. ID: 1248806 ·
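Link's variant (1 down for each error or invalid, 1 up for each valid result, and a reset to the base quota only after a run of consecutive bad results) could look roughly like the following. Everything here is hypothetical: the class name, the threshold of 3 and the floor of 1 are assumptions chosen for illustration, and the "I fixed it!" button is left out.

```python
BASE_QUOTA = 100   # basic daily quota discussed in the thread
RESET_AFTER = 3    # assumed threshold; the post suggests "3-5 (or whatever)"


class HostQuota:
    """Toy model of the proposed per-host daily quota."""

    def __init__(self, quota=BASE_QUOTA):
        self.quota = quota
        self.bad_streak = 0  # consecutive errors/invalids seen so far

    def report(self, valid):
        """Record one result (valid or not) and return the new quota."""
        if valid:
            self.bad_streak = 0
            self.quota += 1
        else:
            self.bad_streak += 1
            if self.quota > BASE_QUOTA and self.bad_streak >= RESET_AFTER:
                # Enough bad results in a row: drop back to the base quota.
                self.quota = BASE_QUOTA
            else:
                # Otherwise lose one task per day (floor of 1 is assumed).
                self.quota = max(1, self.quota - 1)
        return self.quota
```

Starting a fast host at 400, two invalid results lower the quota to 398 and a third in a row resets it to 100, while a single isolated invalid costs only one task per day.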
