Anonymous host throwing only errors, 3223 right now |
| Author | Message |
|---|---|
When will this slop stop? From what I have seen while checking through my ever-growing pending list, in a lot of cases the cause is the faulty Windows NVIDIA driver (the "sleep bug"). Unless the admins update the scheduler so that it does not issue CUDA tasks to hosts with certain driver versions, as they did at Einstein@Home, this is probably not going to stop on its own until everyone updates to 301.x, which may take months or never happen at all. ____________ | |
| ID: 1247245 · | |
|
Since Microsoft decided to automatically update everyone's NVIDIA drivers, there are a lot of systems out there producing errors that nobody is even watching. Until Microsoft does this again with an NVIDIA driver version that does not have the sleep bug, we are in for a long and bumpy ride. | |
| ID: 1247249 · | |
Computer 1334363 This is getting big. I have a similar number of hosts among my wingmen with a 10-to-1 failure rate, or 90% invalid/inconclusive! ____________ Knight Who Says Ni N!, OUT numbered................. | |
| ID: 1247250 · | |
|
In that list there are 17 with the sleepy driver bug, but there are also 10 out-of-control GTX 560 Tis running with other drivers, though together these two types do account for around 45% of the total. | |
| ID: 1247271 · | |
Computer 1334363 I agree ... something needs to be done and it needs to go towards the top of their activity list ... L. ____________ | |
| ID: 1247355 · | |
In that list there are 17 with the sleepy driver bug, but there are also 10 out-of-control GTX 560 Tis running with other drivers, though together these two types do account for around 45% of the total. I sure hope so, because I still see too many hosts producing errors: host 6201358. Host 6623322 has hopefully spotted the errors. Host 6146694 has 2823 errors. Host 6643839. Host 5932466. Host 6469701. Some even have problems loading multiple results! ____________ Knight Who Says Ni N!, OUT numbered................. | |
| ID: 1247357 · | |
Computer 1334363 Well this is getting beyond a joke now as it seems that I have been turned into a magnet to attract these machines for some reason :( Computer 2987620 Computer 4768798 Computer 5820931 Computer 5925627 Computer 6109377 Computer 6110059 Computer 6453095 Computer 6551396 Computer 6621248 Computer 6692048 My pendings are 50% higher than they usually are. Cheers. ____________ | |
| ID: 1247361 · | |
|
One more (couldn't find it with ctrl+f in this thread, so here ya go) | |
| ID: 1247379 · | |
Computer 1334363 Well here's the next bunch that have decided to hit me up. :( Computer 4891408 Computer 4910942 Computer 5049618 Computer 5186014 Computer 5352388 Computer 5472976 Computer 5514958 Computer 5545579 Computer 5738515 Computer 5851682 Computer 6137511 Computer 6140929 Computer 6199282 Computer 6201358 Computer 6244264 Computer 6331276 Computer 6598366 Computer 6619286 Computer 6636907 Computer 6637324 Computer 6646583 Computer 6687852 Computer 6689321 Computer 6703621 Cheers. ____________ | |
| ID: 1248142 · | |
|
I would love it if someone from the project end could chime in regarding this. Is this an exponentially growing problem, or is it just a steady percentage of rogue hosts? If this is indeed a growing problem, perhaps the S@H team should start denying WU handouts based on quota or on the percentage of errors/invalids. If it's just a steady percentage and nothing abnormal, perhaps they should chime in and shut us up? ;-) | |
| ID: 1248267 · | |
One way or another I'm sure we would all appreciate some enlightenment. While I'm not core project staff, I can certainly relay my interpretation of how 'the project' views errors/invalids. Project staff tend to look at project health from a more holistic standpoint, that is, 'overall' health rather than chasing individual problem hosts and such. By my understanding, they put the existing quota mechanisms in place mostly to limit the amount of storage space used, and possibly to regulate the amount of work that reaches 'too many errors' and so misses out on scientific analysis. How effective those measures are at regulating those issues, I suggest we can't really know as individual users, apart from general indications from 'Results waiting for db purging' versus 'Workunits waiting for db purging'. Really, in many cases 'invalids' and 'errors' themselves are largely outside the project's direct 'control', though regulating them at least to some degree is practical. As users and developers we can influence some kinds of these invalids and errors, and I recall a few occasions when applications were pulled because invalids without error occurred en masse. With the V7 transition to come, a lot of work by various people has gone into attempting to reduce some of the potential sources of invalids, but 'rogue' hosts are also often out of scope for application development. I think the easiest way to look at it, to avoid insanity, is that regulated numbers of hard errors are more or less 'fine', since they don't pollute the science unless the workunit goes to 'Too many Errors - may have bug'. Being outvoted by older applications with known accuracy limitations, or by notable cross-platform divergence (such as CPU vs GPU), and so becoming invalid, is a far more minor problem, though seriously annoying, and one that is within 'our scope'. When I queried about priorities moving forward, Eric responded that it would be a great thing to refine, if we can do it from our end... 
so V7 multibeam will be improved from our angle at least, and hopefully AP eventually too. Jason ____________ "It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change." Charles Darwin | |
| ID: 1248281 · | |
|
When these machines are requesting hundreds if not thousands of new work units per day because they are finishing the work units with errors 50+ times faster than they should be finishing them, isn't that a significant impact on the health of the project? | |
| ID: 1248332 · | |
When these machines are requesting hundreds if not thousands of new work units per day because they are finishing the work units with errors 50+ times faster than they should be finishing them, isn't that a significant impact on the health of the project? As an 'end-user' I would expect all of that to be exactly true, though I have no idea of the scale of the problems relative to 'normal' operation, other than the 5-10% or so of resends indicated. One indication is that with Astropulse splitting off and improved uptime, download pipe usage actually drops to manageable levels when combined with the fixes that Matt and others have been doing. I believe one of the worst scenarios in recent times was the severe storage space issues and failing servers, and I'm proud to be a member of the GPU Users Group, which is pushing to get the project what it needs. Right now, with so much going on, following the drives with GPUUG is probably the closest means we all have of supporting the ongoing improvement. Jason ____________ "It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change." Charles Darwin | |
| ID: 1248334 · | |
|
I know that there is a quota system already implemented in BOINC, but the question is: is it hard coded or is it configurable? | |
| ID: 1248362 · | |
I know that there is a quota system already implemented in BOINC, but the question is: is it hard coded or is it configurable? See http://boinc.berkeley.edu/trac/wiki/ProjectOptions#Joblimits for the parts which are not hard coded. The hard coded stuff currently is primarily based on whether the host reports an error or not, though validations do have some effect. Given this project's setting of 100 for daily_result_quota: - If an error is reported, the "Max tasks per day" is reduced to less than the basic 100 quota: to 99 if the host was previously OK, or by one if it was already below. - If a "success" is reported and the host was below the basic quota, "Max tasks per day" is doubled but capped at 100. - A task judged valid increases "Max tasks per day" by one. - A task judged invalid reduces "Max tasks per day" by one, but only if it was above the basic quota. Joe | |
| ID: 1248745 · | |
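The four rules Joe lists can be replayed as a small state update. This is only a sketch of the rules as described in the post, not the actual BOINC scheduler code: the function names `update_quota` and `replay`, the event labels, and the floor of 1 on the quota are assumptions made for illustration.

```python
def update_quota(quota, event, base=100):
    """Apply one reported event to a host's "Max tasks per day".

    `base` stands in for daily_result_quota (100 in the post).
    The floor of 1 on errors is an assumption, not stated in the post.
    """
    if event == "error":
        # 99 if the host was previously OK, otherwise subtract one.
        return base - 1 if quota >= base else max(quota - 1, 1)
    if event == "success":
        # Doubling only applies while the host is below the basic quota.
        return min(quota * 2, base) if quota < base else quota
    if event == "valid":
        return quota + 1
    if event == "invalid":
        # Invalids only cost quota earned above the basic level.
        return quota - 1 if quota > base else quota
    raise ValueError(f"unknown event: {event}")


def replay(events, quota=100, base=100):
    """Run a sequence of events through update_quota and return the result."""
    for event in events:
        quota = update_quota(quota, event, base)
    return quota
```

Replaying 50 errors followed by a single "success" shows how quickly a failing host can recover its full allowance under these rules.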
- If a "success" is reported and the host was below the basic quota, "Max tasks per day" is doubled but capped at 100. So basically the system is completely useless for hosts which generate invalid results, and almost useless for hosts which generate errors, since it is enough for a host to send back 1 out of 50 results as a success (which does not even need to be valid) to stay at 100 tasks per day. I wouldn't consider that a working system. ____________ . | |
| ID: 1248748 · | |
See http://boinc.berkeley.edu/trac/wiki/ProjectOptions#Joblimits for the parts which are not hard coded. LOL... reading that link is like watching Star Trek: it seems like I understand what they are talking about... The hard coded stuff currently is primarily based on whether the host reports an error or not, though validations do have some effect. Given this project's setting of 100 for daily_result_quota: That's it... If a host throws 50 errors, plus as many invalids as it wants, and then just 1 valid, it gets its quota set back to 100... Errors, unlike invalids, should be something really rare. If each error decreased the quota by 9 and each success increased it by 1, you would need 9 successes to compensate for one error, which means that any host with an error rate above 10% would not be able to raise its quota. (Of course, choosing the "right" ratio is beyond me.) As invalids are more common and do not necessarily mean that the host is failing, the 1 down/1 up rule seems more or less good, as it requires at least 50% valid tasks to raise the quota, but I think it could be a bit less permissive without affecting the quota of good hosts: 2 or 3 down/1 up would require 66% or 75% valids, which should not be hard to achieve... Of course, I'm just thinking out loud (and I'm not original with the ideas). I guess, like Jason said, that as long as the invalids/errors are discarded by the replication validation, it may not be a priority for the project to invest time in implementing this kind of change, especially if the resources wasted on those errors are not significant for the load on the servers... ____________ | |
| ID: 1248772 · | |
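The percentages in the post above follow from a simple break-even calculation: with a penalty of `down` per bad result and a reward of `up` per good one, the quota holds steady when the good fraction equals `down / (down + up)`. The helper names below are hypothetical and only illustrate that arithmetic.

```python
from fractions import Fraction


def break_even_good_fraction(down, up=1):
    """Fraction of good results at which the quota neither rises nor falls."""
    return Fraction(down, down + up)


def drift_per_task(good_fraction, down, up=1):
    """Average quota change per reported task for a given good fraction."""
    return good_fraction * up - (1 - good_fraction) * down


# Penalties discussed in the thread: 1, 2, 3 (invalids) and 9 (errors).
for down in (1, 2, 3, 9):
    print(down, break_even_good_fraction(down))
```

With `down=9` the break-even point is 9/10 good results, which matches the 10% error-rate limit mentioned in the post; any host below that fraction has a negative drift and cannot raise its quota.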
That's it... If a host throws 50 errors, plus as many invalids as it wants, and then just 1 valid, it gets its quota set back to 100... No, it does not need a valid result to get back to 100; it only needs to report a task as a success. If that task is judged invalid afterwards, it doesn't matter. At least that's how I understand it: - If a "success" is reported and the host was below the basic quota, "Max tasks per day" is doubled but capped at 100. Success != valid result. As invalids are more common and do not necessarily mean that the host is failing, the 1 down/1 up rule seems more or less good, as it requires at least 50% valid tasks to raise the quota, but I think it could be a bit less permissive without affecting the quota of good hosts: 2 or 3 down/1 up would require 66% or 75% valids, which should not be hard to achieve... I'd start with 50%, i.e. 1 down for each error/invalid and 1 up for each valid (not "success"), with a reset to 100 on the first error/invalid if the quota is above that. ____________ . | |
| ID: 1248777 · | |
That's it... If a host throws 50 errors, plus as many invalids as it wants, and then just 1 valid, it gets its quota set back to 100... You're right... which is even worse... As I said, choosing the ratios is beyond me, but anyway I'm not sure about resetting the limits; any big step (up or down) will probably mess things up... (I don't want to hurt faster hosts' quotas just for one invalid/error... which would be like killing mosquitoes with a nuke. Effective? Maybe... Efficient? Surely not... :D ) ____________ | |
| ID: 1248789 · | |
I don't want to hurt faster hosts' quotas just for one invalid/error... which would be like killing mosquitoes with a nuke. Effective? Maybe... Efficient? Surely not... :D ) To avoid hurting faster hosts, the reset to 100 (or below) tasks could happen only after 3-5 (or whatever) errors/invalids, either consecutive or within a defined period of time. Although I think that 100 per CPU core or 800 per GPU should still be enough even for the fastest CPUs or GPUs, at least for now. But resetting to 100 because of a single error/invalid is indeed just annoying and in most cases useless. To help hosts that have been fixed build up a cache, there could also be an "I fixed it!" button on the application page, which could raise the quota by 50, for example. The use of this button should of course be limited to once a day or something like that. ____________ . | |
| ID: 1248806 · | |
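The proposal from the last few posts (1 down per error/invalid, 1 up per valid, a reset to 100 only after several consecutive bad results, and a once-a-day "I fixed it!" button) could look roughly like this. Everything here is hypothetical: the class `HostQuota`, the constants, the floor of 1, and the choice to clear the bad streak when the button is used are assumptions for illustration, not anything the project implements.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

BASE_QUOTA = 100   # basic daily quota from the thread
RESET_AFTER = 3    # consecutive bad results before a reset (thread suggests 3-5)
FIXED_BONUS = 50   # quota added by the proposed "I fixed it!" button


@dataclass
class HostQuota:
    quota: int = BASE_QUOTA
    bad_streak: int = 0
    last_fixed_click: Optional[date] = None

    def record(self, outcome: str) -> None:
        """Apply one validated outcome: 'valid', 'invalid' or 'error'."""
        if outcome == "valid":
            self.bad_streak = 0
            self.quota += 1
        elif outcome in ("error", "invalid"):
            self.bad_streak += 1
            self.quota = max(self.quota - 1, 1)  # floor of 1 is an assumption
            # Only a run of bad results knocks a fast host back to the base quota.
            if self.bad_streak >= RESET_AFTER and self.quota > BASE_QUOTA:
                self.quota = BASE_QUOTA
        else:
            raise ValueError(f"unknown outcome: {outcome}")

    def click_fixed(self, today: date) -> bool:
        """Raise the quota by FIXED_BONUS, at most once per calendar day."""
        if self.last_fixed_click == today:
            return False
        self.last_fixed_click = today
        self.quota += FIXED_BONUS
        self.bad_streak = 0  # assumption: the owner's fix clears the streak
        return True
```

In this sketch a single error costs a fast host only one task of quota, while three bad results in a row drop it back to 100, which is the trade-off the posters were arguing about.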
| Copyright © 2013 University of California |