Anonymous host throwing only errors, 3223 right now
Author | Message |
---|---|
Khangollo Send message Joined: 1 Aug 00 Posts: 245 Credit: 36,410,524 RAC: 0 |
When will this slop stop? From what I've seen checking through my ever-growing pending list, in a lot of cases it is the faulty Windows NVIDIA driver (the "sleep bug"). Unless the admins update the scheduler to not issue CUDA tasks to hosts with certain driver versions, like they did at Einstein@Home, this is probably not going to stop on its own until everyone updates to 301.x, which may take months or may never happen. |
Wembley Send message Joined: 16 Sep 09 Posts: 429 Credit: 1,844,293 RAC: 0 |
Since Microsoft decided to automatically update everyone's NVIDIA drivers, there are a lot of systems out there creating errors that nobody is even watching. Until Microsoft does this again with an NVIDIA driver version without the sleep bug, we are in for a long and bumpy ride. |
Fred J. Verster Send message Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
Computer 1334363 This is getting big. I have a similar number of hosts among my wingmen with a 10 to 1 failure rate, i.e. about 90% invalid/inconclusive! |
Wiggo Send message Joined: 24 Jan 00 Posts: 36635 Credit: 261,360,520 RAC: 489 |
In that list there are 17 with the sleepy driver bug, but there are also 10 out-of-control GTX 560 Tis running with other drivers, and these two types together account for around 45% of the total. A number of others probably just need a good clean-out, but there must be some way to alert these people to their problems and somehow cut off these hosts until the problems are sorted. Out of all the PMs I sent out last week I have now received a grand total of 5 replies, and a couple of others have not had their hosts contact the servers since, but it's a start I suppose (maybe even this thread is helping too). Cheers. |
Lionel Send message Joined: 25 Mar 00 Posts: 680 Credit: 563,640,304 RAC: 597 |
Computer 1334363 I agree ... something needs to be done and it needs to go towards the top of their activity list ... L. |
Fred J. Verster Send message Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
In that list there are 17 with the sleepy driver bug, but there are also 10 out-of-control GTX 560 Tis running with other drivers, and these two types together account for around 45% of the total. I sure hope so, because I still see too many hosts making errors, e.g. host 6201358. Host 6623322 has hopefully spotted the errors. Host 6146694 has 2823 errors. Host 6643839. Host 5932466. Host 6469701. Even giving problems loading multiple results! |
Wiggo Send message Joined: 24 Jan 00 Posts: 36635 Credit: 261,360,520 RAC: 489 |
Computer 1334363 Well this is getting beyond a joke now as it seems that I have been turned into a magnet to attract these machines for some reason :( Computer 2987620 Computer 4768798 Computer 5820931 Computer 5925627 Computer 6109377 Computer 6110059 Computer 6453095 Computer 6551396 Computer 6621248 Computer 6692048 My pendings are 50% higher than they usually are. Cheers. |
shizaru Send message Joined: 14 Jun 04 Posts: 1130 Credit: 1,967,904 RAC: 0 |
|
Wiggo Send message Joined: 24 Jan 00 Posts: 36635 Credit: 261,360,520 RAC: 489 |
Computer 1334363 Well here's the next bunch that have decided to hit me up. :( Computer 4891408 Computer 4910942 Computer 5049618 Computer 5186014 Computer 5352388 Computer 5472976 Computer 5514958 Computer 5545579 Computer 5738515 Computer 5851682 Computer 6137511 Computer 6140929 Computer 6199282 Computer 6201358 Computer 6244264 Computer 6331276 Computer 6598366 Computer 6619286 Computer 6636907 Computer 6637324 Computer 6646583 Computer 6687852 Computer 6689321 Computer 6703621 Cheers. |
Ex: "Socialist" Send message Joined: 12 Mar 12 Posts: 3433 Credit: 2,616,158 RAC: 2 |
I would love it if someone from the project end could chime in regarding this. Is this an exponentially growing problem, or is this just a steady percentage of rogue hosts? If this is indeed a growing problem, perhaps the S@H team should put into action WU handout denials based on quota or percentage of errors/invalids. If it's just a steady percentage and nothing abnormal, perhaps they should chime in and shut us up? ;-) One way or another I'm sure we would all appreciate some enlightenment. #resist |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
One way or another I'm sure we would all appreciate some enlightenment. While I'm not core project staff, I can certainly relay my interpretation of how 'the project' views errors/invalids. Project Staff tend to look at project health from a more holistic standpoint, that is 'overall' health rather than chasing individual problem hosts & such. They put the existing quota mechanisms in play, by my understanding, to mostly limit the amount of storage space, and possibly regulate the amount of work that reaches 'too many errors', so missing out on scientific analysis. How effective those measures are at regulating those issues, I suggest we can't really know, as individual users, apart from general indications from 'results waiting for db purging' versus 'Workunits waiting for db Purging'. Really in many cases 'invalids' and 'errors' themselves are largely out of direct 'control' by the project, though regulating them at least to some degree is practical. As users and developers we can influence some kinds of these invalids & errors, and I recall a few occasions applications being pulled where invalids without error occurred en masse. With the V7 transition to come, a lot of work by various people has gone into attempting to reduce some of the potential sources of invalids, but 'rogue' hosts are also often out of scope for application development. I think the easiest way to look at it, to avoid insanity, is that regulated numbers of hard errors are more or less 'fine' since they don't pollute the science unless going to 'Too many Errors - may have bug'. Being outvoted by older applications with known accuracy limitations, or notable divergence cross-platform (such as CPU vs GPU), so becoming invalid, is a far more minor problem though seriously annoying, that is within 'our scope'. That is something when I queried about priorities moving forward, Eric responded that it would be a great thing to refine, if we can do it from our end... so V7 multibeam will be improved from our angle at least, and hopefully AP eventually too. Jason "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Cheopis Send message Joined: 17 Sep 00 Posts: 156 Credit: 18,451,329 RAC: 0 |
When these machines are requesting hundreds if not thousands of new work units per day because they are finishing the work units with errors 50+ times faster than they should be finishing them, isn't that a significant impact on the health of the project? One of the problems that the project has had for quite some time is limited bandwidth. Every one of these machines getting 50x its normal allotment of work units is taking up the bandwidth of 50 machines of the same class without errors. It would seem to me that even if the percentage of machines pulling huge numbers of work units and erroring out on them is low, the absolute number of machines, based on the lists upthread, should be enough to have a measurable, and perhaps significant, impact on the lab's pipe to the internet? |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
When these machines are requesting hundreds if not thousands of new work units per day because they are finishing the work units with errors 50+ times faster than they should be finishing them, isn't that a significant impact on the health of the project? As an 'end-user' I would expect all that to be true exactly, though I have no idea of the scale of the problems relative to 'normal' operation, other than the 5-10% or so resends indicated. One indication is that with Astropulse splitting off & improved uptime, download pipe usage actually drops to manageable levels, when combined with fixes that Matt & others have been doing. I believe one of the worst scenarios in recent times was severe storage space issues & failing servers, and I'm proud to be a member of the GPU Users Group, which is pushing to get the project what it needs. Right now, with so much going on, following the drives with GPUUG is probably the closest means we all have to supporting the ongoing improvement. Jason "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
I know that there is a quota system already implemented in BOINC, but the question is: is it hard coded or is it configurable? If it were configurable, then it should be easy for the project to minimize the effects of the "rogue" hosts... just making the rise rate of the quota slower than the fall rate would help a lot to limit the amount of errors from failing hosts... Or is it that the "successful but invalid" results don't affect the quota? |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
I know that there is a quota system already implemented in BOINC, but the question is: is it hard coded or is it configurable? See http://boinc.berkeley.edu/trac/wiki/ProjectOptions#Joblimits for the parts which are not hard coded. The hard coded stuff currently is primarily based on whether the host reports an error or not, though validations do have some effect. Given this project's setting of 100 for daily_result_quota: - If an error is reported the "Max tasks per day" is reduced to less than the basic 100 quota, 99 if the host was previously OK or subtract one if it was already below. - If a "success" is reported and the host was below the basic quota, "Max tasks per day" is doubled but capped at 100. - A task judged valid increases "Max tasks per day" by one. - A task judged invalid reduces "Max tasks per day" by one, but only if it was above the basic quota. Joe |
Link Send message Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0 |
- If a "success" is reported and the host was below the basic quota, "Max tasks per day" is doubled but capped at 100. So basically the system is completely useless for hosts that generate invalid results, and almost useless for hosts that generate errors, since it's enough for a host to send 1 out of 50 results back as a success (which does not even need to be valid) to stay at 100 tasks per day. I wouldn't consider that a working system. |
Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
See http://boinc.berkeley.edu/trac/wiki/ProjectOptions#Joblimits for the parts which are not hard coded. LOL... reading that link is like watching Star Trek: it seems like I understand what they are talking about... The hard coded stuff currently is primarily based on whether the host reports an error or not, though validations do have some effect. Given this project's setting of 100 for daily_result_quota: That's it... If a host throws 50 errors, plus as many invalids as it wants, and then just 1 valid, it gets the quota set back to 100... Errors, unlike invalids, should be something really rare. If each error decreases the quota by 9 and each success increases it by 1, then you need 9 successes to compensate for one error, which means that any host with an error rate above 10% will not be able to raise its quota. (Of course, choosing the "right" ratio is beyond me.) As invalids are more common and do not necessarily mean that the host is failing, the 1 down/1 up seems more or less good, as it requires at least 50% valid tasks to raise the quota, but I think it could be a bit less permissive without affecting the quota of good hosts: 2 or 3 down/1 up would require 66% or 75% valids, which should not be hard to achieve... Of course, I'm just thinking out loud (and I'm not original with the ideas). I guess, like Jason said, that as long as the invalids/errors are discarded by the replication and validation, it may not be a priority for the project to invest time in implementing this kind of change, especially if the resources wasted on those errors are not significant to the load on the servers... |
Link Send message Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0 |
That's it... If a host throws 50 errors, plus as many invalids as it wants, and then just 1 valid, it gets the quota set back to 100... No, it does not need a valid result to get back to 100; it needs to report a task as a success. If it turns out invalid after that, it doesn't matter. At least that's how I understand: - If a "success" is reported and the host was below the basic quota, "Max tasks per day" is doubled but capped at 100. Success != valid result. As invalids are more common and do not necessarily mean that the host is failing, the 1 down/1 up seems more or less good, as it requires at least 50% valid tasks to raise the quota, but I think it could be a bit less permissive without affecting the quota of good hosts: 2 or 3 down/1 up would require 66% or 75% valids, which should not be hard to achieve... I'd start with 50%, i.e. for each error/invalid 1 down, for each valid (not "success") 1 up, with a reset to 100 on the first error/invalid if above. |
Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
That's it... If a host throws 50 errors, plus as many invalids as it wants, and then just 1 valid, it gets the quota set back to 100... You're right... which is even worse... As I said, choosing the ratios is beyond me, but anyway I'm not sure about resetting the limits; any big step (up or down) will probably mess things up... (I don't want to hurt faster hosts' quotas just for one invalid/error... which would be like killing mosquitoes with a nuke: effective? Maybe... Efficient? Surely not... :D ) |
Link Send message Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0 |
I don't want to hurt faster hosts' quotas just for one invalid/error... which would be like killing mosquitoes with a nuke: effective? Maybe... Efficient? Surely not... :D ) To avoid hurting faster hosts, the reset to 100 (or below) tasks could happen after 3-5 (or whatever) consecutive errors/invalids (or that many within a defined period of time). Although I think that 100/CPU-core or 800/GPU should still be enough even for the fastest CPUs or GPUs, at least for now. But resetting to 100 because of a single error/invalid is indeed just annoying and in most cases useless. To help hosts that have been fixed build up a cache, there could also be an "I fixed it!" button on the application page, which could raise the quota by 50, for example. Use of this button should of course be limited to once a day or something like that. |
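
The quota rules Joe lists, and the down/up ratios Horacio and Link go back and forth on, are easy to check with a few lines of Python. This is only a rough sketch of the rules as described in the thread, not BOINC's actual scheduler code: the names `apply_event`, `replay` and `break_even_good_fraction` are made up for illustration, and the floor of 1 task per day is an assumption the thread does not state.

```python
from fractions import Fraction

BASE_QUOTA = 100  # daily_result_quota value quoted in the thread


def apply_event(quota: int, event: str) -> int:
    """Return "Max tasks per day" after one event, following Joe's summary."""
    if event == "error":
        # 99 if the host was previously OK, otherwise subtract one.
        lowered = BASE_QUOTA - 1 if quota >= BASE_QUOTA else quota - 1
        return max(lowered, 1)  # ASSUMPTION: floor of 1; the thread gives none
    if event == "success":
        # Doubling applies only below the base quota and is capped at it.
        return min(quota * 2, BASE_QUOTA) if quota < BASE_QUOTA else quota
    if event == "valid":
        return quota + 1
    if event == "invalid":
        # Invalids only reduce the quota when the host is above the base quota.
        return quota - 1 if quota > BASE_QUOTA else quota
    raise ValueError(f"unknown event: {event!r}")


def replay(events, quota: int = BASE_QUOTA) -> int:
    """Apply a sequence of events in order and return the final quota."""
    for event in events:
        quota = apply_event(quota, event)
    return quota


def break_even_good_fraction(down: int, up: int) -> Fraction:
    """Share of good results needed for a '-down / +up' rule to hold steady."""
    return Fraction(down, down + up)


if __name__ == "__main__":
    print(replay(["error"] * 50))                # 50
    print(replay(["error"] * 50 + ["success"]))  # 100
    print(replay(["invalid"] * 20))              # 100
    print(break_even_good_fraction(9, 1))        # 9/10
    print(break_even_good_fraction(2, 1))        # 2/3
```

Under these assumptions, 50 errors followed by a single reported success puts the host straight back at 100, which is Link's point, and invalids alone never pull a host below the base quota. The last two lines reproduce Horacio's arithmetic: 9 down/1 up tolerates at most a 10% error rate, and 2 down/1 up needs about two thirds of results to be valid.
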
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.