Message boards :
Number crunching :
The Server Issues / Outages Thread - Panic Mode On! (118)
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
> Hmmm, looks like good tasks are being marked as invalid and bad ones as valid ...
It did just that. Twice! But the initial hosts were both bad hosts and returned bad results that matched each other better than the two good results matched each other, convincing the validator that the bad results were the more reliable ones. |
rob smith Send message Joined: 7 Mar 03 Posts: 22221 Credit: 416,307,556 RAC: 380 |
I've said this before, but I'll say it again. It is about time "invalid" tasks were treated in much the same way as "error" tasks. Ignore the odd one, but if a computer is returning loads then its allowance gets progressively cut until the cycle is broken. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
> I've said this before, but I'll say it again.
There's even more reason to do that with invalids than errors! Errors can never result in bad data going into the science database. Results that should be invalid could end up as false positives and pollute the science data. I also think that validators should trust results from hosts that produce a high percentage of invalids less than results from hosts that produce almost no invalids. The results should be considered valid only when at least one of the pair of matching results is from a 'good' host. If such a match is not found, the task should be resent until such a match can be found. Even better would be if the scheduler could filter what it sends to each host and make sure no more than one 'bad' host is ever included in the replication of one workunit. Also, when a host has produced so many invalids that it gets classified as a 'bad' one, a message should appear in the 'messages' tab of boincmgr stating this fact and asking the user to fix his host. This good/bad status should be considered separately for each application. If the host is not an anonymous platform with just one app for the particular processing unit, then the server could also reduce the amount of work it sends to that particular app on that host and use other apps instead. But the amount should not be reduced to zero, because then the host could never clear the bad status. |
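The pairing rule proposed here is small enough to sketch. Nothing like this exists in the actual BOINC validator; the function and the `bad_hosts` set are hypothetical, purely to illustrate the idea:

```python
# Hypothetical sketch of the proposed rule: a matching pair of results
# is accepted only if at least one of them comes from a host that is
# not currently flagged as 'bad' (i.e. producing too many invalids).
def accept_match(host_a: int, host_b: int, bad_hosts: set[int]) -> bool:
    """True if this matching pair may be declared valid."""
    return host_a not in bad_hosts or host_b not in bad_hosts
```

If `accept_match` returned False, the workunit would be re-replicated until a match involving at least one 'good' host turns up.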
Freewill Send message Joined: 19 May 99 Posts: 766 Credit: 354,398,348 RAC: 11,693 |
|
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
+1 +1 . . Zero invalids should be the target ... Stephen . . |
W-K 666 Send message Joined: 18 May 99 Posts: 19078 Credit: 40,757,560 RAC: 67 |
> Hmmm, looks like good tasks are being marked as invalid and bad ones as valid ...
I warned of that in https://setiathome.berkeley.edu/forum_thread.php?id=84983&postid=2027128#2027128, after I got invalids against two bad ATI hosts which I had observed in https://setiathome.berkeley.edu/forum_thread.php?id=84508&postid=2026843#2026843 |
Speedy Send message Joined: 26 Jun 04 Posts: 1643 Credit: 12,921,799 RAC: 89 |
> I just did a quick add-up of the big numbers on the service status page. It seems the database can handle over 20 million comfortably. When I added up the numbers this is what I got: 22,986,785
> The highest number the ssp has had within the last day or so was 20,012,235, and it spends most of its time below 20 million, only making brief excursions above it. I guess you are mixing some non-result fields into your count, getting a weird hybrid number that doesn't match the size of any table.
Thanks for the information |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Your answer is in your quoted message. You get that by summing up all the result fields: 'Results ready to send', 'Results out in the field', 'Results returned and awaiting validation' and 'Results waiting for db purging'. Seti@Home classic workunits: 20,676 CPU time: 74,226 hours A proud member of the OFA (Old Farts Association) |
Speedy Send message Joined: 26 Jun 04 Posts: 1643 Credit: 12,921,799 RAC: 89 |
> Your answer is in your quoted message.
So it is, thanks Keith. I will change my original post. |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
> Hmmm, looks like good tasks are being marked as invalid and bad ones as valid ...
. . The problem with the NAVI AMD cards has been an issue for a couple of months now and has its own thread. Stephen <shrug> |
rob smith Send message Joined: 7 Mar 03 Posts: 22221 Credit: 416,307,556 RAC: 380 |
Actually a purely random invalid can be caused by an "event" on a computer that has a very good record. So while 0% is the goal, there will always be the odd event that trips one up. Systematic invalids (which are the ones we are talking about here), where a computer, for whatever reason, is just chucking out garbage by the truckload, are certainly a big no-no. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13746 Credit: 208,696,464 RAC: 304 |
> I've said this before, but I'll say it again.
Invalids as a percentage of Pendings? 0.5% or higher gets you sin binned. Grant Darwin NT |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
> Actually the pure random invalid can be caused by an "event" on a computer that has a very good record. So while 0% is the goal there will always be the odd event that trips one up.
That 'event' is something that really should happen less than once in the lifetime of a computer. A randomly flipping bit can cause a computer to crash. If a computer crashes spontaneously without a software bug, that could be tolerated once, but if it happens again, there is clearly something wrong with the hardware. Probably the most common cause of those 'events' is that the cpu or gpu has been overclocked too far. For cpus this rarely happens without the user being at fault, but graphics card manufacturers sometimes go too far when competing with other manufacturers producing cards with the same gpu chip, so that your graphics card is unstable out of the box. |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
> Invalids as a percentage of Pendings?
That's a bad metric, because invalids spend 24 hours in the database but pendings spend a quite variable time there, so the ratio can vary a lot without the actual percentage of invalids returned varying. A good metric would be the recent average ratio of invalids to valids. Choose a constant x that is a small positive number (a lot smaller than 1), then for each validated task add x to a variable if it was invalid but don't add anything if it was valid, and 'decay' the variable between tasks by multiplying it by 1-x. The variable will approach the ratio of invalids to all tasks over time. If the host produces 1% invalids, the value will stabilize around 0.01. The smaller x is, the slower the value changes and the more past tasks affect the current value. The weight of each task decreases exponentially with the 'age' of the task. |
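The decaying average described above fits in a few lines. This is a minimal sketch, with an illustrative x of 0.01 and made-up class and variable names (the real BOINC server keeps no such per-host tracker):

```python
class InvalidRatioTracker:
    """Exponentially decaying average of a host's invalid ratio.

    Only one number is stored per host, and each validated task
    costs a single multiply-add, as described above.
    """

    def __init__(self, x: float = 0.01):
        self.x = x          # decay constant, a small positive number << 1
        self.ratio = 0.0    # current estimate of the invalid fraction

    def record(self, is_invalid: bool) -> float:
        # Decay the old value, then add x only for an invalid result.
        self.ratio *= 1.0 - self.x
        if is_invalid:
            self.ratio += self.x
        return self.ratio

# A host returning 1 invalid per 100 tasks stabilizes around 0.01,
# oscillating between roughly 0.006 and 0.016 with x = 0.01.
tracker = InvalidRatioTracker(x=0.01)
for i in range(100_000):
    tracker.record(i % 100 == 0)
```

A smaller x widens the effective window: with x = 0.001 the same host's value would hug 0.01 much more tightly, at the cost of reacting more slowly when the host goes bad.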
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13746 Credit: 208,696,464 RAC: 304 |
> Invalids as a percentage of Pendings?
> That's a bad metric because invalids spend 24 hours in the database but the pendings spend quite variable time so the ratio can vary a lot without the actual percentage of invalids returned varying.
Actually the Pendings number is generally less variable than the Valids number, and it's a good indicator of the amount of work the system is actually processing. It's also what is used when developing applications, with the goal of Inconclusives being 5% or less of the Pending value. Having some sort of weighting/time factor may be of use, but it would add to the complexity. I'd see how a basic percentage goes at first, and tweak it from there if necessary. Grant Darwin NT |
rob smith Send message Joined: 7 Mar 03 Posts: 22221 Credit: 416,307,556 RAC: 380 |
RAC is not a good metric to use for any purpose, it is far too variable. Far better to keep to the very simple technique that is used for error tasks - let the first couple pass (in a defined period - 24 hours I think), then reduce the number of tasks permitted for every error task returned until the computer is down to a very low number of tasks allowed (1 per day from memory). Recover slowly, at something like half the decay rate. This is very simple to add to the server code as there are already a couple of error types counted, so just add invalid to the list. Not having the server code to hand just now I can't recall what the decrementor is, but I think it is something like two or three per error over the allowance. I can check in the morning. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
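The scheme rob describes might look something like this. All the constants here are guesses for illustration, not the real BOINC server values (which rob says he'd have to check):

```python
# Illustrative sketch of the quota scheme described above: let the
# first couple of errors per day pass, cut the daily allowance for
# each error after that, and recover at about half the decay rate.
MAX_DAILY_QUOTA = 100    # normal per-day task allowance (assumed)
MIN_DAILY_QUOTA = 1      # floor for a misbehaving host
FREE_ERRORS_PER_DAY = 2  # errors ignored before throttling kicks in
CUT_PER_ERROR = 2        # decrement per error over the allowance (a guess)
GAIN_PER_VALID = 1       # slow recovery: half the decay rate

def update_quota(quota: int, errors_today: int, result_ok: bool):
    """Return the new (quota, errors_today) after one returned task."""
    if result_ok:
        return min(MAX_DAILY_QUOTA, quota + GAIN_PER_VALID), errors_today
    errors_today += 1
    if errors_today > FREE_ERRORS_PER_DAY:
        quota = max(MIN_DAILY_QUOTA, quota - CUT_PER_ERROR)
    return quota, errors_today
```

A host erroring out everything bottoms out at one task per day, so it can still clear its status once fixed; a healthy host climbs back toward the full allowance one task at a time.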
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
> RAC is not a good metric to use for any purpose, it is far too variable.
RAC is variable because of CreditScrew. This recent average invalid ratio would vary only if the actual ratio of invalids varies, because each invalid would have exactly the same score. An exponentially decaying average is a good way to calculate stuff like this because you need only one stored number and only one multiply-add per operation. A regular moving average of the most recent n tasks would need an array of size n to keep track of the tasks falling out of the window, and this would fatten the database a lot.
> Far better to keep to the very simple technique that is used for error tasks - let the first couple (in a defined period - 24 hours I think), then reduce the number of tasks permitted for every error task returned until the computer is down to a very low number of tasks allowed (1 per day from memory).
This wouldn't work for invalids. Error throttling is intended to limit the server load caused by a broken host that errors out every task. A few - or even a few hundred - errors are not an issue, but a host that immediately errors out every task would ask for a full cache of tasks on every scheduler contact and return them all in the next contact, causing a very high server load. But even those few invalids per day that this system would allow without any consequences could be a significant percentage of all tasks for a slow host. The cpu of my slower host crunches about three AstroPulses per day. One invalid per day would be a 33% invalid ratio! And we don't want to throttle a host returning a lot of invalids but flag it as an unreliable host that should not be validated against another flagged host. An exponentially decaying average would be a more server-friendly way to do the error throttling too, because then you wouldn't need the daily database sweep to reset the error counts. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13746 Credit: 208,696,464 RAC: 304 |
> RAC is variable because of CreditScrew.
Even without CreditScrew it is variable due to the different WUs - MB & AP & GBT & Arecibo - along with the different angle ranges resulting in differing processing times. Even with the excellent Credit system prior to Credit New (actual FLOP counting), RAC still varied due to this, even with the aid of some tweaking that accounted for the differing processing times of some similar-AR WUs. But of course Credit New takes the variability to a whole new, somewhat extreme, level. Grant Darwin NT |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
> Actually the Pendings number is generally less variable than the Valids number, and it's a good indicator of the amount of work the system is actually processing.
. . Sorry Grant but you are looking at the wrong set of numbers. Pendings have NOT been through the validation process and are irrelevant. The set of validation-processed numbers are 'valids', 'invalids' and 'inconclusives'. The only significant ratio is of one of those subsets to the overall set: 100*Xx/(valids+invalids+inconclusives), where Xx is one of the subsets. Also, time is a factor, because when things are running right the valids are only shown on the system for approx 24 hours, so when dealing with the inconclusives only those that occurred in the same 24 hour period as the valids can be treated as significant. So for invalids it is 100*invalids"last 24 hours"/(valids+inconclusives"24 hours"+invalids"24 hours"). Stephen :( |
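Stephen's ratio is simple arithmetic once the three 24-hour counts are in hand. The counts in the example are made up for illustration:

```python
def invalid_percentage(valids: int, invalids: int, inconclusives: int) -> float:
    """Invalids as a percentage of all results validated in the same
    24-hour window: 100 * invalids / (valids + invalids + inconclusives)."""
    total = valids + invalids + inconclusives
    return 100.0 * invalids / total if total else 0.0

# e.g. 3 invalids against 950 valids and 47 inconclusives in 24 hours:
print(invalid_percentage(950, 3, 47))  # → 0.3
```

Note the denominator is the whole validated set, not Pendings, which is the distinction Stephen is drawing against Grant's suggestion.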
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
> Even without CreditScrew it is variable due to the different WUs - MB & AP & GBT & Arecibo - along with the different angle ranges resulting in differing processing times. Even with the excellent Credit system prior to Credit New (actual FLOP counting), RAC still varied due to this.
Ideal FLOP counting would give you very similar credit per crunching time for different tasks, because most of the difference in the time needed to crunch them is caused by the different amount of FLOPs needed. But FLOP counting is a very imprecise art when you can't rely on every CPU and GPU used having hardware support for counting them. FLOP guessing would be a more appropriate term than FLOP counting. Also, better optimized clients could use fewer FLOPs for the same task, so actual FLOP counting would penalize them unfairly. But invalid task counting can be done exactly. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.