Anonymous host throwing only errors, 3223 right now


Khangollo
Joined: 1 Aug 00
Posts: 245
Credit: 36,410,524
RAC: 0
Slovenia
Message 1247245 - Posted: 17 Jun 2012, 2:28:28 UTC - in response to Message 1247242.
Last modified: 17 Jun 2012, 2:42:08 UTC

When will this slop stop?


From what I was checking a bit through my ever growing pending list, in a lot of cases it is the faulty Windows NVIDIA driver ("sleep bug"). Unless the admins update the scheduler to not issue CUDA tasks to hosts with certain driver versions, like they did at Einstein@Home, this is probably not going to stop on its own until everyone updates to 301.x... which may take months or may never happen.
____________

Wembley
Volunteer tester
Joined: 16 Sep 09
Posts: 415
Credit: 888,257
RAC: 0
United States
Message 1247249 - Posted: 17 Jun 2012, 3:01:02 UTC

Since Microsoft decided to automatically update everyone's Nvidia drivers, there are a lot of systems out there creating errors that nobody is even watching. Until Microsoft does this again with an Nvidia driver version without the sleep bug, we are in for a long and bumpy ride.
____________


Donate with your searches and online buys:
http://www.goodsearch.com/toolbar/university-of-california-setihome

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3232
Credit: 31,585,541
RAC: 0
Netherlands
Message 1247250 - Posted: 17 Jun 2012, 3:01:35 UTC - in response to Message 1247242.

Computer 1334363
Computer 2699082
Computer 3378825
Computer 3440890
Computer 5007852
Computer 5236541
Computer 5292089
Computer 5345364
Computer 5461131
Computer 5485240
Computer 5542882
Computer 5762247
Computer 5874840
Computer 5889577
Computer 5927348
Computer 5935169
Computer 5967897
Computer 6126443
Computer 6148459
Computer 6229518
Computer 6247292
Computer 6249179
Computer 6253461
Computer 6256705
Computer 6271384
Computer 6283429
Computer 6318737
Computer 6401198
Computer 6441935
Computer 6469701
Computer 6568123
Computer 6586696
Computer 6589662
Computer 6633670
Computer 6640112
Computer 6643594
Computer 6643620
Computer 6650172
Computer 6650230
Computer 6651362

Computer 1901895
Computer 5218485
Computer 5348349
Computer 5389162
Computer 5877728
Computer 5932466
Computer 6204067
Computer 6236663
Computer 6249533
Computer 6462813
Computer 6598140

Now this is getting well past the funny side. I checked in this morning to find that my pendings had taken another leap upwards, so I went through them all again, only to find that I've been hit by a new lot of problematic hosts. :(

Computer 4799081
Computer 5821256
Computer 5935372
Computer 6028874
Computer 6128026
Computer 6175649
Computer 6189897
Computer 6247549
Computer 6387513
Computer 6585255
Computer 6641972
Computer 6649287

Now that makes 63 that I've teamed, double teamed, triple teamed and quadruple teamed with this month alone.

When will this slop stop?

Cheers.


From what I was checking a bit, in most cases it is the faulty Windows NVIDIA driver ("sleep bug"). Unless the admins update the scheduler to not issue CUDA tasks to hosts with certain driver versions, like they did at Einstein@Home, this is probably not going to stop on its own until everyone updates to 301.x... which may take months or may never happen.
____________



This is getting big. I have a similar number of hosts as wingmen with a 10-to-1 failure rate, meaning 90% of their results are invalid or inconclusive!
____________


Knight Who Says Ni N!, OUT numbered.................

Wiggo
Joined: 24 Jan 00
Posts: 5190
Credit: 83,060,740
RAC: 71,829
Australia
Message 1247271 - Posted: 17 Jun 2012, 4:29:49 UTC - in response to Message 1247250.

In that list there are 17 with the sleepy driver bug, but there are also 10 out-of-control GTX 560 Tis running with other drivers. Together these two types account for around 45% of the total.

A number of others probably just need a good clean out, but there must be some way to alert these people to their problems and somehow try to cut off these hosts until the problems are sorted.

Out of all the PMs I sent out last week, I have now received a grand total of 5 replies, and a couple of other hosts have not contacted the servers since. It's a start, I s'pose (maybe even this thread is helping too).

Cheers.
____________

Lionel
Joined: 25 Mar 00
Posts: 543
Credit: 198,065,652
RAC: 164,910
Australia
Message 1247355 - Posted: 17 Jun 2012, 9:54:15 UTC - in response to Message 1247242.

Computer 1334363
Computer 2699082
Computer 3378825
Computer 3440890
Computer 5007852
Computer 5236541
Computer 5292089
Computer 5345364
Computer 5461131
Computer 5485240
Computer 5542882
Computer 5762247
Computer 5874840
Computer 5889577
Computer 5927348
Computer 5935169
Computer 5967897
Computer 6126443
Computer 6148459
Computer 6229518
Computer 6247292
Computer 6249179
Computer 6253461
Computer 6256705
Computer 6271384
Computer 6283429
Computer 6318737
Computer 6401198
Computer 6441935
Computer 6469701
Computer 6568123
Computer 6586696
Computer 6589662
Computer 6633670
Computer 6640112
Computer 6643594
Computer 6643620
Computer 6650172
Computer 6650230
Computer 6651362

Computer 1901895
Computer 5218485
Computer 5348349
Computer 5389162
Computer 5877728
Computer 5932466
Computer 6204067
Computer 6236663
Computer 6249533
Computer 6462813
Computer 6598140

Now this is getting well past the funny side. I checked in this morning to find that my pendings had taken another leap upwards, so I went through them all again, only to find that I've been hit by a new lot of problematic hosts. :(

Computer 4799081
Computer 5821256
Computer 5935372
Computer 6028874
Computer 6128026
Computer 6175649
Computer 6189897
Computer 6247549
Computer 6387513
Computer 6585255
Computer 6641972
Computer 6649287

Now that makes 63 that I've teamed, double teamed, triple teamed and quadruple teamed with this month alone.

When will this slop stop?

Cheers.


I agree ... something needs to be done and it needs to go towards the top of their activity list ...

L.

____________

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3232
Credit: 31,585,541
RAC: 0
Netherlands
Message 1247357 - Posted: 17 Jun 2012, 10:02:58 UTC - in response to Message 1247271.
Last modified: 17 Jun 2012, 10:06:01 UTC

In that list there are 17 with the sleepy driver bug, but there are also 10 out-of-control GTX 560 Tis running with other drivers. Together these two types account for around 45% of the total.

A number of others probably just need a good clean out, but there must be some way to alert these people to their problems and somehow try to cut off these hosts until the problems are sorted.

Out of all the PMs I sent out last week, I have now received a grand total of 5 replies, and a couple of other hosts have not contacted the servers since. It's a start, I s'pose (maybe even this thread is helping too).

Cheers.


I sure hope so, because I still see too many hosts making errors:

Host 6201358.
Host 6623322, which has hopefully spotted the errors.
Host 6146694, which has 2823 errors.
Host 6643839.
Host 5932466.
Host 6469701.

Some are even giving problems when loading multiple results!
____________


Knight Who Says Ni N!, OUT numbered.................

Wiggo
Joined: 24 Jan 00
Posts: 5190
Credit: 83,060,740
RAC: 71,829
Australia
Message 1247361 - Posted: 17 Jun 2012, 10:14:00 UTC - in response to Message 1247355.

Computer 1334363
Computer 2699082
Computer 3378825
Computer 3440890
Computer 5007852
Computer 5236541
Computer 5292089
Computer 5345364
Computer 5461131
Computer 5485240
Computer 5542882
Computer 5762247
Computer 5874840
Computer 5889577
Computer 5927348
Computer 5935169
Computer 5967897
Computer 6126443
Computer 6148459
Computer 6229518
Computer 6247292
Computer 6249179
Computer 6253461
Computer 6256705
Computer 6271384
Computer 6283429
Computer 6318737
Computer 6401198
Computer 6441935
Computer 6469701
Computer 6568123
Computer 6586696
Computer 6589662
Computer 6633670
Computer 6640112
Computer 6643594
Computer 6643620
Computer 6650172
Computer 6650230
Computer 6651362

Computer 1901895
Computer 5218485
Computer 5348349
Computer 5389162
Computer 5877728
Computer 5932466
Computer 6204067
Computer 6236663
Computer 6249533
Computer 6462813
Computer 6598140

Now this is getting well past the funny side. I checked in this morning to find that my pendings had taken another leap upwards, so I went through them all again, only to find that I've been hit by a new lot of problematic hosts. :(

Computer 4799081
Computer 5821256
Computer 5935372
Computer 6028874
Computer 6128026
Computer 6175649
Computer 6189897
Computer 6247549
Computer 6387513
Computer 6585255
Computer 6641972
Computer 6649287

Now that makes 63 that I've teamed, double teamed, triple teamed and quadruple teamed with this month alone.

When will this slop stop?

Cheers.


I agree ... something needs to be done and it needs to go towards the top of their activity list ...

L.

Well this is getting beyond a joke now as it seems that I have been turned into a magnet to attract these machines for some reason :(

Computer 2987620
Computer 4768798
Computer 5820931
Computer 5925627
Computer 6109377
Computer 6110059
Computer 6453095
Computer 6551396
Computer 6621248
Computer 6692048

My pendings are 50% higher than they usually are.

Cheers.
____________

Alex Storey
Volunteer tester
Joined: 14 Jun 04
Posts: 533
Credit: 1,575,675
RAC: 482
Greece
Message 1247379 - Posted: 17 Jun 2012, 11:49:20 UTC

One more (couldn't find it with ctrl+f in this thread, so here ya go)

Computer 6057058

Wiggo
Joined: 24 Jan 00
Posts: 5190
Credit: 83,060,740
RAC: 71,829
Australia
Message 1248142 - Posted: 18 Jun 2012, 23:42:58 UTC - in response to Message 1247361.

Computer 1334363
Computer 2699082
Computer 3378825
Computer 3440890
Computer 5007852
Computer 5236541
Computer 5292089
Computer 5345364
Computer 5461131
Computer 5485240
Computer 5542882
Computer 5762247
Computer 5874840
Computer 5889577
Computer 5927348
Computer 5935169
Computer 5967897
Computer 6126443
Computer 6148459
Computer 6229518
Computer 6247292
Computer 6249179
Computer 6253461
Computer 6256705
Computer 6271384
Computer 6283429
Computer 6318737
Computer 6401198
Computer 6441935
Computer 6469701
Computer 6568123
Computer 6586696
Computer 6589662
Computer 6633670
Computer 6640112
Computer 6643594
Computer 6643620
Computer 6650172
Computer 6650230
Computer 6651362

Computer 1901895
Computer 5218485
Computer 5348349
Computer 5389162
Computer 5877728
Computer 5932466
Computer 6204067
Computer 6236663
Computer 6249533
Computer 6462813
Computer 6598140


Computer 4799081
Computer 5821256
Computer 5935372
Computer 6028874
Computer 6128026
Computer 6175649
Computer 6189897
Computer 6247549
Computer 6387513
Computer 6585255
Computer 6641972
Computer 6649287


Computer 2987620
Computer 4768798
Computer 5820931
Computer 5925627
Computer 6109377
Computer 6110059
Computer 6453095
Computer 6551396
Computer 6621248
Computer 6692048

Well here's the next bunch that have decided to hit me up. :(

Computer 4891408
Computer 4910942
Computer 5049618
Computer 5186014
Computer 5352388
Computer 5472976
Computer 5514958
Computer 5545579
Computer 5738515
Computer 5851682
Computer 6137511
Computer 6140929
Computer 6199282
Computer 6201358
Computer 6244264
Computer 6331276
Computer 6598366
Computer 6619286
Computer 6636907
Computer 6637324
Computer 6646583
Computer 6687852
Computer 6689321
Computer 6703621

Cheers.
____________

Ex
Volunteer moderator
Volunteer tester
Joined: 12 Mar 12
Posts: 2892
Credit: 1,563,333
RAC: 1,300
United States
Message 1248267 - Posted: 19 Jun 2012, 6:45:57 UTC

I would love it if someone from the project end could chime in regarding this. Is this an exponentially growing problem, or is this just a steady percentage of rogue hosts? If this is indeed a growing problem, perhaps the S@H team should put into action WU handout denials based on quota or percentage of errors/invalids. If it's just a steady percentage and nothing abnormal, perhaps they should chime in and shut us up? ;-)

One way or another I'm sure we would all appreciate some enlightenment.
____________
-Dave #2


3.2.0-33

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 4813
Credit: 71,601,082
RAC: 9,076
Australia
Message 1248281 - Posted: 19 Jun 2012, 7:47:46 UTC - in response to Message 1248267.
Last modified: 19 Jun 2012, 8:04:51 UTC

One way or another I'm sure we would all appreciate some enlightenment.


While I'm not core project staff, I can certainly relay my interpretation of how 'the project' views errors/invalids.

Project Staff tend to look at project health from a more holistic standpoint, that is 'overall' health rather than chasing individual problem hosts & such. They put the existing quota mechanisms in play, by my understanding, to mostly limit the amount of storage space, and possibly regulate the amount of work that reaches 'too many errors', so missing out on scientific analysis.

How effective those measures are at regulating those issues, I suggest we can't really know, as individual users, apart from general indications from 'results waiting for db purging' versus 'Workunits waiting for db Purging'.

Really in many cases 'invalids' and 'errors' themselves are largely out of direct 'control' by the project, though regulating them at least to some degree is practical.

As users and developers we can influence some kinds of these invalids & errors, and I recall a few occasions applications being pulled where invalids without error occurred en masse. With the V7 transition to come, a lot of work by various people has gone into attempting to reduce some of the potential sources of invalids, but 'rogue' hosts are also often out of scope for application development.

I think the easiest way to look at it, to avoid insanity, is that regulated numbers of hard errors are more or less 'fine' since they don't pollute the science unless going to 'Too many Errors - may have bug'.

Being outvoted by older applications with known accuracy limitations, or by notable cross-platform divergence (such as CPU vs GPU), and so becoming invalid, is a far more minor problem, though seriously annoying, and one that is within 'our scope'. When I queried about priorities moving forward, Eric responded that it would be a great thing to refine, if we can do it from our end... so V7 multibeam will be improved from our angle at least, and hopefully AP eventually too.

Jason
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Cheopis
Joined: 17 Sep 00
Posts: 139
Credit: 9,591,283
RAC: 8,969
United States
Message 1248332 - Posted: 19 Jun 2012, 12:40:21 UTC

When these machines are requesting hundreds if not thousands of new work units per day because they are finishing the work units with errors 50+ times faster than they should be finishing them, isn't that a significant impact on the health of the project?

One of the problems that the project has had for quite some time is limited bandwidth. Every one of these machines getting 50x their normal allotment of work units is taking up the bandwidth of 50 machines of their same class without errors.

It would seem to me that even if the percentage of machines pulling huge numbers of work units and erroring out on them is low, the absolute number of machines based on the lists upthread should be enough to have a measurable, and perhaps significant, impact on the lab's pipe to the internet?

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 4813
Credit: 71,601,082
RAC: 9,076
Australia
Message 1248334 - Posted: 19 Jun 2012, 12:51:31 UTC - in response to Message 1248332.

When these machines are requesting hundreds if not thousands of new work units per day because they are finishing the work units with errors 50+ times faster than they should be finishing them, isn't that a significant impact on the health of the project?

One of the problems that the project has had for quite some time is limited bandwidth. Every one of these machines getting 50x their normal allotment of work units is taking up the bandwidth of 50 machines of their same class without errors.

It would seem to me that even if the percentage of machines pulling huge numbers of work units and erroring out on them is low, the absolute number of machines based on the lists upthread should be enough to have a measurable, and perhaps significant, impact on the lab's pipe to the internet?


As an 'end-user' I would expect all that to be true exactly, though have no idea of the scale of the problems relative to 'normal' operation, other than the 5-10% or so resends indicated. One indication is that with Astropulse splitting off & improved uptime, download pipe usage actually drops to manageable levels, when combined with fixes that Matt & others have been doing.

I believe one of the worst scenarios in recent times was the severe storage space issues & failing servers, and I'm proud to be a member of the GPU Users Group, which has been pushing to get the project what it needs.

Right now, with so much going on, following the drives with GPUUG is probably the closest means we all have to supporting the ongoing improvement.

Jason
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 60,123,784
RAC: 88,082
Argentina
Message 1248362 - Posted: 19 Jun 2012, 14:55:10 UTC

I know that there is a quota system already implemented in BOINC, but the question is: is it hard-coded or configurable?

If it were configurable, then it should be easy for the project to minimize the effects of the "rogue" hosts... just making the quota rise more slowly than it falls would help a lot to limit the number of errors from failing hosts...

Or is it that "successful but invalid" results don't affect the quota?
____________

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4134
Credit: 1,004,216
RAC: 254
United States
Message 1248745 - Posted: 20 Jun 2012, 14:36:38 UTC - in response to Message 1248362.

I know that there is a quota system already implemented in BOINC, but the question is: is it hard-coded or configurable?

If it were configurable, then it should be easy for the project to minimize the effects of the "rogue" hosts... just making the quota rise more slowly than it falls would help a lot to limit the number of errors from failing hosts...

Or is it that "successful but invalid" results don't affect the quota?

See http://boinc.berkeley.edu/trac/wiki/ProjectOptions#Joblimits for the parts which are not hard coded.

The hard coded stuff currently is primarily based on whether the host reports an error or not, though validations do have some effect. Given this project's setting of 100 for daily_result_quota:

- If an error is reported the "Max tasks per day" is reduced to less than the basic 100 quota, 99 if the host was previously OK or subtract one if it was already below.

- If a "success" is reported and the host was below the basic quota, "Max tasks per day" is doubled but capped at 100.

- A task judged valid increases "Max tasks per day" by one.

- A task judged invalid reduces "Max tasks per day" by one, but only if it was above the basic quota.
Joe
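
The four rules listed above can be sketched as a small set of update functions. This is only an illustration of the behaviour as described in this post, not the actual BOINC scheduler code: the function names, the `replay` helper and the floor of 1 task per day are assumptions made for the example.

```python
BASIC_QUOTA = 100  # this project's daily_result_quota, per the post above


def on_error(quota: int) -> int:
    """Error reported: drop to 99 if the host was OK, else subtract one."""
    if quota >= BASIC_QUOTA:
        return BASIC_QUOTA - 1
    return max(1, quota - 1)  # the floor of 1 is an assumption


def on_success(quota: int) -> int:
    """'Success' reported: double a below-basic quota, capped at the basic quota."""
    if quota < BASIC_QUOTA:
        return min(BASIC_QUOTA, quota * 2)
    return quota


def on_valid(quota: int) -> int:
    """Task judged valid: add one."""
    return quota + 1


def on_invalid(quota: int) -> int:
    """Task judged invalid: subtract one, but only above the basic quota."""
    return quota - 1 if quota > BASIC_QUOTA else quota


HANDLERS = {
    "error": on_error,
    "success": on_success,
    "valid": on_valid,
    "invalid": on_invalid,
}


def replay(events, quota=BASIC_QUOTA):
    """Apply a sequence of event names to a starting quota."""
    for event in events:
        quota = HANDLERS[event](quota)
    return quota


# A host that reports 50 errors and then a single "success".
after_errors = replay(["error"] * 50)
recovered = replay(["success"], after_errors)
print(after_errors, recovered)
```

Under these assumptions, 50 consecutive errors only bring the quota down to 50, and one reported "success" doubles it straight back to 100.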

Link
Joined: 18 Sep 03
Posts: 813
Credit: 1,501,229
RAC: 416
Germany
Message 1248748 - Posted: 20 Jun 2012, 14:53:34 UTC - in response to Message 1248745.

- If a "success" is reported and the host was below the basic quota, "Max tasks per day" is doubled but capped at 100.

(...)

- A task judged invalid reduces "Max tasks per day" by one, but only if it was above the basic quota.

So basically the system is completely useless for hosts that generate invalid results and almost useless for hosts that generate errors, since it's enough for a host to send back 1 out of 50 results as a success (which does not even need to be valid) to stay at 100 tasks per day. I wouldn't consider that a working system.
____________
.

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 60,123,784
RAC: 88,082
Argentina
Message 1248772 - Posted: 20 Jun 2012, 15:41:06 UTC - in response to Message 1248745.

See http://boinc.berkeley.edu/trac/wiki/ProjectOptions#Joblimits for the parts which are not hard coded.

LOL.. reading that link is like watching Star Trek: it seems like I understand what they are talking about...


The hard coded stuff currently is primarily based on whether the host reports an error or not, though validations do have some effect. Given this project's setting of 100 for daily_result_quota:

- If an error is reported the "Max tasks per day" is reduced to less than the basic 100 quota, 99 if the host was previously OK or subtract one if it was already below.

- If a "success" is reported and the host was below the basic quota, "Max tasks per day" is doubled but capped at 100.

- A task judged valid increases "Max tasks per day" by one.

- A task judged invalid reduces "Max tasks per day" by one, but only if it was above the basic quota.
Joe

That's it... If a host throws 50 errors, plus as many invalids as it wants, and then just 1 valid, it gets the quota set again at 100...

Errors, unlike invalids, should be something really rare. If each error decreases the quota by 9 and each success increases it by 1, then you will need 9 successes to compensate for an error, which means that any host with an error rate above 10% will not be able to raise its quota. (Of course, choosing the "right" ratio is beyond me.)

As invalids are more common and do not necessarily mean that the host is failing, the 1 down/1 up seems more or less good, as it requires at least 50% valid tasks to raise the quota. But I think it could be a bit less permissive without affecting the quota of good hosts: 2 or 3 down/1 up would require 66% or 75% valids, which should not be hard to achieve...
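
The percentages above follow from a simple break-even argument: if a fraction p of results is good, the quota moves on average by p × up − (1 − p) × down per result, which is positive only when p exceeds down / (up + down). A minimal sketch of that arithmetic, ignoring any caps or floors on the quota (the function name is made up for this example):

```python
from fractions import Fraction


def break_even_good_fraction(down: int, up: int = 1) -> Fraction:
    """Fraction of good results above which the quota rises on average.

    Each bad result lowers the quota by `down`; each good one raises it by `up`.
    """
    return Fraction(down, down + up)


for down in (1, 2, 3, 9):
    needed = break_even_good_fraction(down)
    print(f"{down} down / 1 up: more than {float(needed):.1%} good results needed")
```

With 9 down/1 up the threshold is 90% good results, which is the same statement as "an error rate above 10% cannot raise the quota".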

Of course, I'm just thinking out loud (and I'm not original with the ideas). I guess, like Jason said, that as long as the invalids/errors are discarded by the replication validation, it may not be a priority for the project to invest time in implementing this kind of change, especially if the resources wasted on those errors are not significant to the load on the servers...
____________

Link
Joined: 18 Sep 03
Posts: 813
Credit: 1,501,229
RAC: 416
Germany
Message 1248777 - Posted: 20 Jun 2012, 15:53:07 UTC - in response to Message 1248772.

That's it... If a host throws 50 errors, plus as many invalids as it wants, and then just 1 valid, it gets the quota set again at 100...

No, it does not need a valid result to get back to 100; it needs to report a task as a success. If it gets judged invalid after that, it doesn't matter. At least that's how I understand:

- If a "success" is reported and the host was below the basic quota, "Max tasks per day" is doubled but capped at 100.

Success != valid result



As invalids are more common and do not necessarily mean that the host is failing, the 1 down/1 up seems more or less good, as it requires at least 50% valid tasks to raise the quota. But I think it could be a bit less permissive without affecting the quota of good hosts: 2 or 3 down/1 up would require 66% or 75% valids, which should not be hard to achieve...

I'd start with 50%, i.e. for each error/invalid 1 down, for each valid (not "success") 1 up, with a reset to 100 on the first error/invalid if the quota is above that.
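
A rough sketch of the rule suggested above, to show how it differs from the current behaviour. The function name, the outcome labels and the floor of 1 are assumptions, and "reset to 100 if above" is read here as replacing the one-step decrement whenever the quota is above 100.

```python
BASIC_QUOTA = 100


def proposed_update(quota: int, outcome: str) -> int:
    """Hypothetical quota update for the suggested 1 down / 1 up rule."""
    if outcome == "valid":
        return quota + 1
    if outcome in ("error", "invalid"):
        if quota > BASIC_QUOTA:
            return BASIC_QUOTA  # reset on the first error/invalid when above
        return max(1, quota - 1)  # the floor of 1 is an assumption
    return quota  # a bare "success" awaiting validation changes nothing


# Same scenario as discussed in the thread: 50 errors, then one "success".
quota = BASIC_QUOTA
for outcome in ["error"] * 50 + ["success"]:
    quota = proposed_update(quota, outcome)
print(quota)
```

Under this reading the host stays at 50 tasks per day after the lone "success", and can only climb back one validated task at a time.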
____________
.

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 60,123,784
RAC: 88,082
Argentina
Message 1248789 - Posted: 20 Jun 2012, 16:20:50 UTC - in response to Message 1248777.

That's it... If a host throws 50 errors, plus as many invalids as it wants, and then just 1 valid, it gets the quota set again at 100...

No, it does not need a valid result to get back to 100; it needs to report a task as a success. If it gets judged invalid after that, it doesn't matter. At least that's how I understand

You're right... which is even worse...

As I said, choosing the ratios is beyond me, but anyway I'm not sure about resetting the limits; any big step (up or down) will probably mess things up...
(I don't want to hurt faster hosts' quotas just for one invalid/error... which would be like killing mosquitoes with a nuke. Effective? Maybe... Efficient? Surely not... :D )
____________

Link
Joined: 18 Sep 03
Posts: 813
Credit: 1,501,229
RAC: 416
Germany
Message 1248806 - Posted: 20 Jun 2012, 17:02:51 UTC - in response to Message 1248789.
Last modified: 20 Jun 2012, 17:05:43 UTC

I don't want to hurt faster hosts' quotas just for one invalid/error... which would be like killing mosquitoes with a nuke. Effective? Maybe... Efficient? Surely not... :D

To avoid hurting faster hosts, the reset to 100 (or below) tasks could happen after 3-5 (or whatever) consecutive errors/invalids (or that many within a defined period of time). Although I think that 100/CPU-core or 800/GPU should still be enough even for the fastest CPUs or GPUs, at least for now. But resetting to 100 because of a single error/invalid is indeed just annoying and in most cases useless.
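
A minimal sketch of the "reset only after several consecutive failures" idea, assuming a per-host counter that a valid result clears. The class, its field names and the threshold of 3 are hypothetical and chosen only for illustration; the sketch models the reset alone, not any further reduction below 100.

```python
from dataclasses import dataclass

BASIC_QUOTA = 100
RESET_AFTER = 3  # "3-5 (or whatever)" in the post; 3 is picked for illustration


@dataclass
class HostQuota:
    quota: int = BASIC_QUOTA
    consecutive_bad: int = 0

    def report(self, outcome: str) -> None:
        """Update the quota for one reported outcome."""
        if outcome == "valid":
            self.consecutive_bad = 0
            self.quota += 1
        elif outcome in ("error", "invalid"):
            self.consecutive_bad += 1
            # Reset only once the streak is long enough, so a single
            # failure does not cost a fast host its built-up quota.
            if self.consecutive_bad >= RESET_AFTER and self.quota > BASIC_QUOTA:
                self.quota = BASIC_QUOTA


fast = HostQuota(quota=400)
fast.report("error")
after_single_error = fast.quota  # unchanged by one isolated error
fast.report("valid")
for _ in range(3):
    fast.report("error")
print(after_single_error, fast.quota)
```

The design choice is that an isolated error leaves a fast host alone, while a streak of failures brings it back to the basic quota.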

To help hosts that have been fixed build up a cache, there could also be an "I fixed it!" button on the application page, which could raise the quota by 50, for example. The use of this button should of course be limited to once a day or something like that.
____________
.


Copyright © 2014 University of California