The Server Issues / Outages Thread - Panic Mode On! (118)

Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2031276 - Posted: 7 Feb 2020, 21:12:15 UTC - in response to Message 2031275.  

Your answer is in your quoted message.
You get that by summing up all the result fields: 'Results ready to send', 'Results out in the field', 'Results returned and awaiting validation' and 'Results waiting for db purging'.

Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2031276 · Report as offensive
Speedy
Volunteer tester
Avatar

Send message
Joined: 26 Jun 04
Posts: 1646
Credit: 12,921,799
RAC: 89
New Zealand
Message 2031278 - Posted: 7 Feb 2020, 21:15:22 UTC - in response to Message 2031276.  

Your answer is in your quoted message.
You get that by summing up all the result fields: 'Results ready to send', 'Results out in the field', 'Results returned and awaiting validation' and 'Results waiting for db purging'.

So it is, thanks Keith. I will change my original post.
ID: 2031278 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2031287 - Posted: 7 Feb 2020, 21:41:35 UTC - in response to Message 2031272.  

Hmmm, looks like good tasks are being marked as invalid and bad ones as valid ...

https://setiathome.berkeley.edu/workunit.php?wuid=3871356807

Both computers that have this task marked as valid returned an overflow (and both these hosts return lots of invalids).
Both computers that have this task marked as invalid did NOT return an overflow (and both these hosts have no other invalids).

Shouldn't there be some kind of mechanism to prevent this (when at least one host did not return an overflow, try more hosts)?

Tom

I warned of that in https://setiathome.berkeley.edu/forum_thread.php?id=84983&postid=2027128#2027128, after one of my results was marked invalid against two bad ATI hosts, which I had observed in https://setiathome.berkeley.edu/forum_thread.php?id=84508&postid=2026843#2026843


. . The problem with the NAVI AMD cards has been around for a couple of months now and has its own thread.

Stephen

<shrug>
ID: 2031287 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22739
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2031289 - Posted: 7 Feb 2020, 21:49:21 UTC - in response to Message 2031270.  

Actually a purely random invalid can be caused by an "event" on a computer that has a very good record. So while 0% is the goal, there will always be the odd event that trips one up.
Systematic invalids (which are the ones we are talking about here), where a computer, for whatever reason, is just chucking out garbage by the truckload, are certainly a big no-no.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2031289 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 2031292 - Posted: 7 Feb 2020, 22:02:42 UTC - in response to Message 2031250.  

I've said this before, but I'll say it again.
It is about time "invalid" tasks were treated in much the same way as "error" tasks.
Ignore the odd one, but if a computer is returning loads then it gets its allowance progressively cut until the cycle is broken.
Invalids as a percentage of Pendings?
0.5% or higher gets you sin binned.
Grant
Darwin NT
ID: 2031292 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031296 - Posted: 7 Feb 2020, 22:18:53 UTC - in response to Message 2031289.  

Actually the pure random invalid can be caused by an "event" on a computer that has a very good record. So while 0% is the goal there will always be the odd event that trips one up.
That 'event' is something that really should happen less than once in the lifetime of a computer. A randomly flipping bit can cause a computer to crash. If a computer crashes spontaneously without a software bug, that could be tolerated once, but if it happens again there is clearly something wrong with the hardware.

Probably the most common cause of those 'events' is a CPU or GPU that has been overclocked too far. For CPUs this rarely happens without the user being at fault, but graphics card manufacturers sometimes go too far when competing with other makers of cards built around the same GPU chip, so a card can be unstable out of the box.
ID: 2031296 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031303 - Posted: 7 Feb 2020, 23:01:09 UTC - in response to Message 2031292.  

Invalids as a percentage of Pendings?
That's a bad metric because invalids spend about 24 hours in the database while pendings spend a highly variable time there, so the ratio can vary a lot without the actual percentage of invalids returned changing.

A good metric would be the recent average ratio of invalids to valids. Choose a constant x that is a small positive number (a lot smaller than 1), then for each validated task add x to a variable if it was invalid but add nothing if it was valid, and 'decay' the variable between tasks by multiplying it by 1-x. The variable will approach the host's ratio of invalids to all tasks over time. If the host produces 1% invalids, the value will stabilize at 0.01. The smaller x is, the slower the value changes and the more past tasks contribute to the current value. The weight of each task decreases exponentially with the 'age' of the task.
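A minimal sketch of that update rule, in Python rather than anything from the actual server code (the decay constant, names and loop below are purely illustrative):

# Exponentially decaying average of a host's invalid ratio.
# DECAY is the small constant 'x' described above; the value is illustrative.
DECAY = 0.001

def update_invalid_ratio(ratio: float, was_invalid: bool) -> float:
    """Update the running estimate after one task has been validated."""
    return ratio * (1.0 - DECAY) + (DECAY if was_invalid else 0.0)

# A host returning one invalid per 100 validated tasks:
ratio = 0.0
for task_number in range(100_000):
    ratio = update_invalid_ratio(ratio, was_invalid=(task_number % 100 == 0))
print(f"estimated invalid ratio: {ratio:.4f}")  # hovers around the true 1% rate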
ID: 2031303 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 2031305 - Posted: 7 Feb 2020, 23:09:56 UTC - in response to Message 2031303.  
Last modified: 7 Feb 2020, 23:10:51 UTC

Invalids as a percentage of Pendings?
That's a bad metric because invalids spend 24 hours in the database but the pendings spend quite variable time so the ratio can vary a lot without the actual percentage of invalids returned varying.

Good metric would be the recent average ratio of invalids to valids. Choose a constant x that is a small positive number (a lot smaller than 1), then for each validated task add x to a variable if it was invalid but don't add anything if it was valid and 'decay' the variable between tasks by multiplying it with 1-x. The variable will approach the ratio of invalids to all tasks over time. If the host produces 1% invalids, the value will stabilize at 0.01. Smaller the x, the slower the value changes and more recent tasks affect the current value. The weight of each task affecting it decreases exponentially by the 'age' of the task.
Actually the Pendings number is generally less variable than the Valids number, and it's a good indicator of the amount of work the system is actually processing.
It's also what is used when developing applications with the goal of Inconclusives being 5% or less of the Pending value.
Having some sort of weighting/time factor may be of use, but it would add to the complexity. I'd see how a basic percentage goes at first, and tweak it from there if necessary.
Grant
Darwin NT
ID: 2031305 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22739
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2031306 - Posted: 7 Feb 2020, 23:10:06 UTC

RAC is not a good metric to use for any purpose; it is far too variable.

Far better to keep to the very simple technique that is used for error tasks - let the first couple through (in a defined period - 24 hours I think), then reduce the number of tasks permitted for every error task returned until the computer is down to a very low number of tasks allowed (1 per day from memory). Recover slowly, at something like half the decay rate. This would be very simple to add to the server code, as there are already a couple of error types counted, so just add invalid to the list. Not having the server code to hand just now I can't recall what the decrement is, but I think it is something like two or three per error over the allowance. I can check in the morning.
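In rough Python, the kind of quota scheme being described looks something like the sketch below; every constant and name here is an illustrative guess, not a value from the BOINC scheduler source:

# Rough sketch of daily-quota throttling as described above.
MAX_QUOTA = 100      # tasks per day for a healthy host (hypothetical)
MIN_QUOTA = 1        # floor once a host keeps returning errors
FREE_ERRORS = 2      # errors tolerated per day before throttling starts
ERROR_PENALTY = 2    # quota lost per error beyond the allowance
RECOVERY_STEP = 1    # quota regained per valid result (slower recovery)

class HostQuota:
    def __init__(self) -> None:
        self.quota = MAX_QUOTA
        self.errors_today = 0

    def on_error(self) -> None:
        self.errors_today += 1
        if self.errors_today > FREE_ERRORS:
            self.quota = max(MIN_QUOTA, self.quota - ERROR_PENALTY)

    def on_valid(self) -> None:
        self.quota = min(MAX_QUOTA, self.quota + RECOVERY_STEP)

    def daily_reset(self) -> None:
        self.errors_today = 0  # the daily sweep that resets error counts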
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2031306 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031309 - Posted: 7 Feb 2020, 23:41:12 UTC - in response to Message 2031306.  
Last modified: 8 Feb 2020, 0:02:59 UTC

RAC is not a good metric to use for any purpose, it is far too variable.
RAC is variable because of CreditScrew. This recent average invalid ratio would vary only if the actual ratio of invalids varies because each invalid would have exactly the same score.

Exponentially decaying average is a good way to calculate stuff like this because you need only one stored number and only one multiply-add per operation. A regular moving average of the most recent n tasks would need an array of size n to keep track of the tasks falling out of the window, and that would fatten the database a lot.

Far better to keep to the very simple technique that is used for error tasks - let the first couple through (in a defined period - 24 hours I think), then reduce the number of tasks permitted for every error task returned until the computer is down to a very low number of tasks allowed (1 per day from memory). [...]
This wouldn't work for invalids. Error throttling is intended to limit the server load caused by a broken host that errors out every task. A few - or even a few hundred - errors are not an issue, but a host that immediately errors out every task would ask for a full cache of tasks at every scheduler contact and return them all at the next one, causing a very high server load. But even those few invalids per day that this system would allow without any consequences could be a significant percentage of all tasks for a slow host. The CPU of my slower host crunches about three AstroPulses per day; one invalid per day would be a 33% invalid ratio!

And we don't want to throttle a host returning a lot of invalids; we want to flag it as an unreliable host that should not be validated against another flagged host.

An exponentially decaying average would be a more server-friendly way to do the error throttling too, because then you wouldn't need the daily database sweep to reset the error counts.
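As a hypothetical sketch of how such a flag could feed into validation (the 2% threshold and every name below are invented for illustration, not taken from the real validator):

# Use the decaying invalid ratio as a reliability flag rather than a throttle.
UNRELIABLE_THRESHOLD = 0.02   # flag hosts above ~2% recent invalids (hypothetical)

def is_unreliable(invalid_ratio: float) -> bool:
    return invalid_ratio >= UNRELIABLE_THRESHOLD

def quorum_can_validate(host_ratios: list[float]) -> bool:
    # A two-result quorum made up entirely of flagged hosts proves nothing;
    # the workunit should go to another wingman instead.
    flagged = sum(1 for r in host_ratios if is_unreliable(r))
    return flagged < len(host_ratios)

print(quorum_can_validate([0.005, 0.031]))  # True  -> at most one flagged host
print(quorum_can_validate([0.025, 0.031]))  # False -> both flagged, need a third result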
ID: 2031309 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 2031310 - Posted: 7 Feb 2020, 23:47:52 UTC - in response to Message 2031309.  

RAC is variable because of CreditScrew.
Even without Credit Screw it is variable due to the different WUs - MB & AP, GBT & Arecibo - along with the different angle ranges, resulting in differing processing times. Even with the excellent Credit system prior to Credit New (actual FLOP counting), RAC still varied due to this, even with the aid of some tweaking that accounted for the differing processing times of some similar-AR WUs.
But of course Credit New does take the variability to a whole new, somewhat extreme, level.
Grant
Darwin NT
ID: 2031310 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2031314 - Posted: 8 Feb 2020, 0:17:18 UTC - in response to Message 2031305.  

Invalids as a percentage of Pendings?
That's a bad metric because invalids spend 24 hours in the database but the pendings spend quite variable time so the ratio can vary a lot without the actual percentage of invalids returned varying.

Good metric would be the recent average ratio of invalids to valids. Choose a constant x that is a small positive number (a lot smaller than 1), then for each validated task add x to a variable if it was invalid but don't add anything if it was valid and 'decay' the variable between tasks by multiplying it with 1-x. The variable will approach the ratio of invalids to all tasks over time. If the host produces 1% invalids, the value will stabilize at 0.01. Smaller the x, the slower the value changes and more recent tasks affect the current value. The weight of each task affecting it decreases exponentially by the 'age' of the task.
Actually the Pendings number is generally less variable than the Valids number, and it's a good indicator of the amount of work the system is actually processing.
It's also what is used when developing applications with the goal of Inconclusives being 5% or less of the Pending value.
Having some sort of weighting/ time factor may be of use, but would add to the complexity. I'd see how a basic percentage goes at first, and tweak it from there if necessary.


. . Sorry Grant but you are looking at the wrong set of numbers. Pendings have NOT been through the validation process and are irrelevant. The set of validation-processed numbers comprises 'valids', 'invalids' and 'inconclusives'. The only significant ratio is of one of those subsets to the overall set: 100*Xx/(valids+invalids+inconclusives), where Xx is one of the subsets. Time is also a factor, because when things are running right the valids are only shown on the system for approx 24 hours, so when dealing with the inconclusives only those that occurred in the same 24-hour period as the valids can be treated as significant. So for invalids it is 100*invalids(last 24 hours)/(valids(24 hours)+inconclusives(24 hours)+invalids(24 hours)).
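For example, with hypothetical counts of 3 invalids, 2 inconclusives and 95 valids over the same 24 hours, that works out to 100*3/(95+2+3) = 3% invalid.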

Stephen

:(
ID: 2031314 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031315 - Posted: 8 Feb 2020, 0:23:19 UTC - in response to Message 2031310.  

Even without Credit Screw it is variable due to the different WUs- MB & AP & GBT & Arecibo, along with the different angle ranges resulting in differing processing times. Even with the excellent Credit system prior to Credit New (actual FLOP counting), RAC still varied due to this,
Ideal FLOP counting would give you very similar credit per crunching time for different tasks, because most of the difference in the time needed to crunch them is caused by the different amount of FLOPs needed. But FLOP counting is a very imprecise art when you can't rely on every CPU and GPU used having hardware support for counting them; FLOP guessing would be a more appropriate term than FLOP counting. Also, better optimized clients could use fewer FLOPs for the same task, so actual FLOP counting would penalize them unfairly.

But invalid task counting can be done exactly.
ID: 2031315 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 2031317 - Posted: 8 Feb 2020, 0:32:47 UTC - in response to Message 2031314.  
Last modified: 8 Feb 2020, 0:33:20 UTC

. . Sorry Grant but you are looking at the wrong set of numbers. Pendings have NOT been through the validation process and are irrelevant.
Actually that is exactly what makes them relevant.
What matters is what percentage of the WUs being processed are Invalid/Inconclusive/Errors. It's not about absolute numbers, but the percentage of crud out of all the work done.


The only significant ratio is of one of those subsets to the overall set. 100*Xx/valids+invalids+inconclusives where Xx is one of the subsets.
Why make something more complicated than it needs to be?
As I pointed out above, what matters is how many of the WUs a system processes are Errors, how many are Inconclusive, or how many are Invalid out of its total output. Credit New is an example of a system that is way more complicated than it needs to be. There is no need to make this an even bigger mess when a simple system will get a good result.


So for invalids it is 100*'invalids in the last 24 hours"/(valids+inconclusives"24 hours"+invalids"24 hours").
Making the simple unnecessarily complicated.
Validation Inconclusive / Validation Pending * 100 = Inconclusives as a percentage.
Both simple, and accurate.

You can sample those numbers every 10 minutes and work out the result over a 1-day, 5-day, 7-day or one-month period, but the single value will be pretty close to all of those other results if things are relatively steady. All the other samples will just let you see whether things are getting worse or better, which you can do anyway by comparing the result at any time you choose to do the calculation.
No need to add complication to something when the result is no better than the simpler method.
Grant
Darwin NT
ID: 2031317 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031318 - Posted: 8 Feb 2020, 0:33:58 UTC - in response to Message 2031314.  

The only significant ratio is of one of those subsets to the overall set. 100*Xx/valids+invalids+inconclusives where Xx is one of the subsets.
You shouldn't count inconclusives because then you would count them twice as they will eventually become valids or invalids. Inconclusives are equivalent to pendings and should stay out of this.

But I was suggesting counting the actual validated tasks. Each time a task is transitioned to valid or invalid state, the exponentially decaying average would be updated.
ID: 2031318 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 2031319 - Posted: 8 Feb 2020, 0:44:09 UTC - in response to Message 2031315.  

But FLOP counting is very imprecise art when you cant rely on every CPU and GPU used having hardware support for counting them.
Which is one of the arguments for why Credit New was introduced.
The FLOP counting gave consistent results.


FLOP guessing would be more appropriate term than FLOP counting.
That was the system before the FLOP counting, and it was even messier than Credit New. Then when Credit New came along it went back to FLOP guessing, with all sorts of additional massaging of the numbers, hence the mess we have now.


Also better optimized clients could use less FLOPs for the same task so actual FLOP counting would penalize them unfairly.
That was the problem with the original FLOPS guessing system (and one of the problems with the present Credit New system). But it didn't occur with the FLOPS counting system, because the FLOP counter was independent of the application that processed the WU - so a highly optimised application that didn't have to do as many operations as a poorly optimised application still claimed similar Credit, even though their processing times differed hugely.
Grant
Darwin NT
ID: 2031319 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 2031320 - Posted: 8 Feb 2020, 0:45:38 UTC

And everyone seems to have gotten way, way, way off topic yet again.
Grant
Darwin NT
ID: 2031320 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031321 - Posted: 8 Feb 2020, 1:07:43 UTC - in response to Message 2031317.  

. . Sorry Grant but you are looking at the wrong set of numbers. Pendings have NOT been through the validation process and are irrelevant.
Actually it is what makes them relevant. What matters is what percentage of the WUs being processed are Invalid/Inconclusive/Errors. It's not about absolute numbers, but the percentage of crud out of all the work done.
You are demonstrating that you don't understand what those different states mean.

Pendings and Inconclusives spend different times in the database than valids and invalids, and they both will eventually become valids or invalids. When you compare counts of currently existing results in two sets that spend different times in the database, you get meaningless garbage. And when you count tasks on both sides of the validation process, you'll include the same task twice in your counts.

The only meaningful ratio you can derive from the data displayed on the web site is the ratio between invalids and valids. They are the only two states that are mutually exclusive (the same task can't be in both) and spend the same time span in the database and are thus comparable.

But the ratio of the counts of the existing results is way too coarse to determine the invalid ratios of the hosts with sufficient accuracy to flag them. If my slower host's CPU produced 1% invalids, you would see an invalid on the web page during about one day each month, as it is doing AstroPulses exclusively and poops out about three results per day.
ID: 2031321 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2031323 - Posted: 8 Feb 2020, 1:12:32 UTC

I want to comment on the fact that the percentage of Inconclusives has shot WAY up since all the issues with the validators and bad hosts.

Now even if two hosts get the EXACT same results for an early overflow, it needs to go out to a third wingman.

That alone has inflated the percentage of Inconclusives. I would normally be sitting at around 2.9% Inconclusives on every host of mine. Now I'm looking at 19.6% Inconclusives.

Case in point: https://setiathome.berkeley.edu/workunit.php?wuid=3873231309

Host 8030022

SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected equals the storage space allocated.

Best spike: peak=26.3049, time=57.04, d_freq=1419564949.49, chirp=10.1, fft_len=64k
Best autocorr: peak=18.55554, time=6.711, delay=4.3744, d_freq=1419560425.62, chirp=-18.068, fft_len=128k
Best gaussian: peak=6.091602, mean=0.6524873, ChiSq=1.386699, time=46.14, d_freq=1419558524.42,
score=-1.352335, null_hyp=2.124649, chirp=-35.051, fft_len=16k
Best pulse: peak=6.526803, time=9.463, period=1.346, d_freq=1419563648.09, score=0.9578, chirp=59.644, fft_len=512
Best triplet: peak=0, time=-2.124e+11, period=0, d_freq=0, chirp=0, fft_len=0

Spike count: 28
Autocorr count: 2
Pulse count: 0
Triplet count: 0
Gaussian count: 0

Host 8826748

SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected equals the storage space allocated.

Best spike: peak=26.3049, time=57.04, d_freq=1419564949.49, chirp=10.1, fft_len=64k
Best autocorr: peak=18.55554, time=6.711, delay=4.3744, d_freq=1419560425.62, chirp=-18.068, fft_len=128k
Best gaussian: peak=6.091602, mean=0.6524873, ChiSq=1.386699, time=46.14, d_freq=1419558524.42,
score=-1.352335, null_hyp=2.124649, chirp=-35.051, fft_len=16k
Best pulse: peak=6.526803, time=9.463, period=1.346, d_freq=1419563648.09, score=0.9578, chirp=59.644, fft_len=512
Best triplet: peak=0, time=-2.124e+11, period=0, d_freq=0, chirp=0, fft_len=0

Spike count: 28
Autocorr count: 2
Pulse count: 0
Triplet count: 0
Gaussian count: 0

Identical result down to the umpteenth decimal place for all spikes, autocorrs, gaussians, pulses and triplets. This task should have been validated on the first two results and not gone out for a third wingman.

I know it's a necessary evil now, with the bad Windows and bad AMD hosts, but even with the current mechanism in place bad results are still going into the database and invalidating good results from good hosts.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2031323 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031324 - Posted: 8 Feb 2020, 1:12:32 UTC - in response to Message 2031320.  

And everyone seems to have gotten way, way, way off topic yet again.
I think this is quite an appropriate thread for speculating about how to fix server issues.
ID: 2031324 · Report as offensive