The Server Issues / Outages Thread - Panic Mode On! (118)

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13904
Credit: 208,696,464
RAC: 304
Australia
Message 2031326 - Posted: 8 Feb 2020, 1:17:02 UTC - in response to Message 2031324.  

And everyone seems to have gotten way, way, way off topic yet again.
I think this is quite appropriate thread for speculating how to fix server issues.
So let's leave RAC out of any such discussions then.
Grant
Darwin NT
ID: 2031326 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031328 - Posted: 8 Feb 2020, 1:18:47 UTC - in response to Message 2031323.  

I know necessary evil now with the bad Windows and bad AMD hosts but even with the current mechanism in place, bad results are still going into the database and invalidating good results from good hosts.
And this is why a bad host flagging mechanism is needed to prevent them from ganging up against a good host.

But SETI@home staff can't do this by configuring their servers. This needs new code in BOINC.
ID: 2031328 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13904
Credit: 208,696,464
RAC: 304
Australia
Message 2031329 - Posted: 8 Feb 2020, 1:19:46 UTC - in response to Message 2031323.  

That alone has inflated the percentage of Inconclusives. I normally would be sitting on around 2.9% Inconclusives on every host of mine. Now I'm looking at 19.6% of Inconclusives.
That's a huge improvement over what it was.
I'm down to around 16%; it was up to 50% for a while there, and I saw some systems in the low 60% region.
Grant
Darwin NT
ID: 2031329 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031334 - Posted: 8 Feb 2020, 1:32:10 UTC - in response to Message 2031329.  

I'm down to around 16%; it was up to 50% for a while there, and I saw some systems in the low 60% region.
These percentages are again meaningless numbers. There is no way to see from the web page data what the total number of tasks has been in the same timespan those inconclusives cover, so it is impossible to calculate the real percentage.
ID: 2031334 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13904
Credit: 208,696,464
RAC: 304
Australia
Message 2031336 - Posted: 8 Feb 2020, 1:48:06 UTC - in response to Message 2031334.  

I'm down to around 16%; it was up to 50% for a while there, and I saw some systems in the low 60% region.
These percentages are again meaningless numbers. There is no way to see from the web page data what the total number of tasks has been in the same timespan those inconclusives cover, so it is impossible to calculate the real percentage.
They are very meaningful and give a valid indication of how things stand at that time.
Grant
Darwin NT
ID: 2031336 · Report as offensive
bluestar

Joined: 5 Sep 12
Posts: 7358
Credit: 2,084,789
RAC: 3
Message 2031348 - Posted: 8 Feb 2020, 2:58:21 UTC
Last modified: 8 Feb 2020, 3:01:20 UTC

It could be still for only a narrowband signal perhaps meant to be here, and such a thing could be among the results, for only scores.

Here noticing the precisely identical results for only such a thing, except not any developer thinking about intended purpose, for also means of detection.

For just a -9 result_overflow, the spike count only got to 28 here, for not any 30, while the other counts were much lower, including also Autocorr as well.
ID: 2031348 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19550
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2031365 - Posted: 8 Feb 2020, 8:00:10 UTC - in response to Message 2031328.  

I know necessary evil now with the bad Windows and bad AMD hosts but even with the current mechanism in place, bad results are still going into the database and invalidating good results from good hosts.
And this is why a bad host flagging mechanism is needed to prevent them from ganging up against a good host.

But SETI@home staff can't do this by configuring their servers. This needs new code in BOINC.

When there are problems with hardware or drivers is it possible for the project to look at two lines in the "Computer Information" page,
CPU type 	GenuineIntel Intel(R) Core(TM) i5-9400F CPU @ 2.90GHz [Family 6 Model 158 Stepping 10]
and
Coprocessors 	NVIDIA GeForce RTX 2060 (4095MB) driver: 442.19 OpenCL: 1.2


and if the information in either matches a list of known problems, stop sending these hosts any tasks or, in the present case of the driver problem, tasks of a specific Angle Range.
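
A minimal sketch of the check described above, assuming the scheduler can read the two "Computer Information" strings for a host. The field names `cpu_type` and `coprocessors`, the function `host_is_flagged`, and the blocklist patterns are all hypothetical and made up for illustration; this is not actual BOINC scheduler code.

```python
import re

# Hypothetical list of known problems: (field name, pattern) pairs.
# The patterns are invented examples, not real problem hardware or drivers.
KNOWN_PROBLEMS = [
    ("coprocessors", re.compile(r"ExampleGPU 9000.*driver: 1\.2\.3")),
    ("cpu_type", re.compile(r"ExampleCPU X1")),
]


def host_is_flagged(host):
    """Return True if either information line matches a known problem."""
    return any(
        pattern.search(host.get(field, ""))
        for field, pattern in KNOWN_PROBLEMS
    )


# The host quoted in the post, which matches none of the invented patterns.
quoted_host = {
    "cpu_type": "GenuineIntel Intel(R) Core(TM) i5-9400F CPU @ 2.90GHz "
                "[Family 6 Model 158 Stepping 10]",
    "coprocessors": "NVIDIA GeForce RTX 2060 (4095MB) driver: 442.19 OpenCL: 1.2",
}

# A made-up host that matches the invented GPU/driver pattern.
problem_host = {
    "cpu_type": "SomeVendor SomeCPU",
    "coprocessors": "ExampleGPU 9000 (8192MB) driver: 1.2.3 OpenCL: 2.0",
}
```

A scheduler using something like this would skip flagged hosts entirely, or only for the affected class of tasks, as the post suggests.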
ID: 2031365 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22721
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2031367 - Posted: 8 Feb 2020, 8:41:44 UTC - in response to Message 2031309.  

You just do not understand that that would work just as well as your suggestion of doing a calculation to see if the invalid rate has exceeded the permitted level. Indeed, it would work better in that it would catch a new computer with a very low RAC almost as soon as it started.
How would your scheme work for a computer with 0 RAC and returning invalids from the start - answer IT WOULD FAIL. A computer that never generated a RAC, but only returned invalids using your scheme would NEVER be trapped, whereas using the real count it would very soon, after the first couple of returned tasks be against the wall.

While BOINC on the computer requests work, the server responds with the "Not sending any, too many errors". We do want to stop sending work to that class of computers and get it out of the system.
Your closing comments show that you have a very low understanding of the vagaries of RAC generation.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2031367 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031369 - Posted: 8 Feb 2020, 9:10:32 UTC - in response to Message 2031365.  


When there are problems with hardware or drivers is it possible for the project to look at two lines in the "Computer Information" page,
...
and if the information in either matches a list of known problems, stop sending these hosts any tasks or, in the present case of the driver problem, tasks of a specific Angle Range.
Overflows can have any angle range, and stopping AMD hosts from getting any work would mean we lose all the good results from them too. What they really need is the ability to prevent a bad host from getting a specific task if there already is another bad host among the other hosts that same task has been sent to.

Then if an AMD host returned a bad result, its wingman wouldn't be an AMD host and the results would mismatch. Also, the tie-breaker host the task is then resent to can't be another AMD host, so they can't gang up and make the good result a minority.
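
The rule proposed above can be sketched as a single check at the point where a task is about to be handed to a host. This is only an illustration under assumptions: the function name `may_send`, the host identifiers, and the idea of a ready-made set of flagged hosts are all hypothetical, and nothing here reflects how the BOINC scheduler is actually structured.

```python
def may_send(host_id, hosts_already_assigned, flagged_hosts):
    """Decide whether a task may go to host_id.

    A host that is not flagged can always receive the task. A flagged
    host can receive it only if no other flagged host already holds a
    copy, so two flagged hosts can never validate each other.
    """
    if host_id not in flagged_hosts:
        return True
    return not any(h in flagged_hosts for h in hosts_already_assigned)


# Hypothetical host identifiers for illustration.
flagged = {"amd_host_1", "amd_host_2"}
```

With this rule, a flagged host can still be paired with any good host, so its good results are not lost, but the wingman and any tie-breaker for its tasks are always drawn from hosts that are not flagged.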

What they did was make any overflow result automatically get sent to a third host. This produces a lot of extra server load in a situation where some file is producing a lot of overflows, and because the scheduler doesn't look at which host the task is sent to, nothing prevents two of the three results from coming from bad hosts. This actually made the situation worse:

If we assume 10% of the hosts are bad ones, then before they did anything an affected workunit had a 1% chance for both of the initial results to be bad, producing a false positive, and a 1.8% chance for one of the initial hosts and the first resend host to be bad, producing both a false positive and a false negative. So a 2.8% chance for bad data to enter the database and a 1.8% chance for a good host to receive an unfair invalid.

This change to triple validation of overflows changed nothing for the latter case because they would be resent anyway, but made the first case worse. The bad result will still go into the science database because the two initial bad results have more votes than the third result. If the third result is good, it'll automatically become a false negative. Still 2.8% of the affected workunits enter bad data into the science database, but now 2.7% instead of 1.8% produce an unfair invalid. The change didn't help the original problem at all but made the collateral damage worse, in addition to making all the real overflow results cause 50% more server load.
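
The percentages above can be reproduced with a few lines of arithmetic. The sketch below assumes, as the post does, that 10% of hosts are bad; it further assumes that hosts are picked independently and that two bad results always agree with each other. The variable names are illustrative only.

```python
p_bad = 0.10          # assumed fraction of bad hosts
p_good = 1 - p_bad

# Before the change: two initial results, a third is sent only on mismatch.
both_initial_bad = p_bad * p_bad                    # false positive only
one_bad_then_bad_resend = 2 * p_bad * p_good * p_bad  # false positive + unfair invalid
before_bad_data = both_initial_bad + one_bad_then_bad_resend
before_unfair_invalid = one_bad_then_bad_resend

# After the change: every overflow goes to three hosts, majority wins.
exactly_two_bad = 3 * p_bad * p_bad * p_good        # the one good host is outvoted
all_three_bad = p_bad ** 3
after_bad_data = exactly_two_bad + all_three_bad
after_unfair_invalid = exactly_two_bad

print(f"before: bad data {before_bad_data:.1%}, unfair invalid {before_unfair_invalid:.1%}")
print(f"after:  bad data {after_bad_data:.1%}, unfair invalid {after_unfair_invalid:.1%}")
```

Under these assumptions the chance of bad data entering the database stays at 2.8% either way, while the chance of a good host receiving an unfair invalid rises from 1.8% to 2.7%.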
ID: 2031369 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031370 - Posted: 8 Feb 2020, 9:23:03 UTC - in response to Message 2031367.  
Last modified: 8 Feb 2020, 9:31:26 UTC

work better in that it would catch a new computer with a very low RAC almost as soon as it started.
Another alternative would be to count the valid percentage instead of invalid percentage. Just add to the variable for valids instead of invalids. Then a new host/app with zero value would be assumed bad from the start and would earn its good status over time.

How would your scheme work for a computer with 0 RAC and returning invalids from the start - answer IT WOULD FAIL. A computer that never generated a RAC, but only returned invalids using your scheme would NEVER be trapped
RAC has nothing to do with this. If the host produced only invalids, then its invalid score would rise very fast!
ID: 2031370 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22721
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2031376 - Posted: 8 Feb 2020, 10:11:41 UTC

Now you are heading toward my suggestion, but working on a percentage at very low numbers is not as easy - the key is to keep things simple and on the server. This is for four reasons. First, if you do it on the host, every client, on every operating system and hardware, must have the current version of the BOINC client. Not everyone likes to, or wants to, have the latest version, and in some cases there is nobody left in the BOINC community to develop a new BOINC client. Second, as you know, being open-source, people are at liberty to alter the client code, and one would have to have a mechanism to ensure that every single repository had all the appropriate code in it, and that it was not possible to remove or block that part of the code. Third, it is actually much faster to do an addition to a value stored in a field than to do a multiplication or division - I know many these days don't get into the "joys" of clock-tick counting, but in an application like the one on the SETI server the number of clock ticks is fairly important. Finally, for now, the "problem" may only apply to a limited number of projects, of which SETI is one, but the BOINC client has to support them all. The server has a common core of functions, then there are a number of customisable routines around it - error management is one such routine; dangerous as it sounds, there are projects out there that have not implemented any error management!
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2031376 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22721
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2031378 - Posted: 8 Feb 2020, 10:24:35 UTC - in response to Message 2031370.  

RAC has nothing to do with this. If the host produced only invalids, then its invalid score would rise very fast!


Why then did YOU mention RAC? - It is YOU who proposed a system based on the RAC of a host.

If the host produced only invalids, then its invalid score would rise very fast!


Yes, its invalid COUNT would rise, use that as the control, not some derived variable - far simpler, and far more able to catch the event early. Think about a computer with a 100k RAC suddenly starting to throw invalids, set the trigger at 1% - it needs to throw 1000 before it is trapped - now consider a computer with a RAC of 1M (and they do exist) - that figure now becomes 10k - both of which are far more than "just a few" - Using a simple count based scheme both of these would be caught very early and have their daily allowance reduced to getting very few tasks per day until the problem was resolved.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2031378 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19550
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2031381 - Posted: 8 Feb 2020, 10:38:57 UTC

We need to cut the number of tasks sent to devices that make "noise bombs" out of every task they receive to zero asap.
One of the main reasons is that they are only taking ~10 sec/task, and even if you compare that to a similar-spec GPU, the device is probably requesting 20 times more tasks than normal. But it's not just GPUs that need to be considered; it's also the CPUs, which can take hours to complete a task.
So it is not inconceivable that a faulty GPU could be requesting 100 times more tasks than normal.
The damage done to the reliability of the Science database is such that maybe, without a lot of work, the whole period from before this started until it is absolutely sure the problems have been cleared will have to be deleted. Or we will be getting into an East Anglia situation.
ID: 2031381 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031383 - Posted: 8 Feb 2020, 10:53:40 UTC - in response to Message 2031378.  

Think about a computer with a 100k RAC suddenly starting to throw invalids, set the trigger at 1% - it needs to throw 1000 before it is trapped.
Now you're the one muddling the units!

A RAC of 100K - measured in credits, that's what the 'C' stands for - is probably a count of ~1K. So a 1% trigger is only ~10 tasks.

Or so a pedant might say. The underlying principle of keeping the maths as simple as possible for a server which has to process 150,000 tasks per hour is still valid.
ID: 2031383 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031389 - Posted: 8 Feb 2020, 11:37:12 UTC - in response to Message 2031378.  

Why then did YOU mention RAC? - It is YOU who proposed a system based on the RAC of a host.
Where did I mention that? The first time in this discussion 'RAC' appears in any of my posts is when I was replying to YOU bringing that up.

If the host produced only invalids, then its invalid score would rise very fast!
Yes, its invalid COUNT would rise, use that as the control, not some derived variable - far simpler, and far more able to catch the event early.
And the count would stay high forever after the host stops producing invalids. That's why a decay mechanism is needed.

Think about a computer with a 100k RAC suddenly starting to throw invalids, set the trigger at 1% - it needs to throw 1000 before it is trapped - now consider a computer with a RAC of 1M (and they do exist) - that figure now becomes 10k - both of which are far more than "just a few"
RAC would have absolutely nothing to do with it. Higher RAC host would trigger it faster because it chews through the tasks faster but any host would trigger it after the exact same number of tasks.

If we assume the decay multiplier is 0.999, the limit is 1%, the host produces only invalids and the value starts from zero, then the value would evolve like this:
0.001, 0.001999, 0.002997, 0.003994, 0.004990, 0.005985, 0.006979, 0.007972, 0.008964, 0.009955, 0.010945. So the 11th consecutive invalid result would trigger the trap.
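
The sequence above can be reproduced with a short sketch. The update rule used here (multiply the score by the decay factor, then add 0.001 for an invalid result) is an assumption inferred from the listed values, and the names `DECAY`, `LIMIT`, `update` and `results_until_trapped` are hypothetical rather than taken from BOINC.

```python
DECAY = 0.999   # decay multiplier from the post
LIMIT = 0.01    # 1% trigger level from the post


def update(score, invalid):
    """Decay the score, then add (1 - DECAY) if the result was invalid."""
    return score * DECAY + ((1 - DECAY) if invalid else 0.0)


def results_until_trapped():
    """Count consecutive invalid results until the score exceeds LIMIT."""
    score = 0.0
    count = 0
    while score <= LIMIT:
        score = update(score, invalid=True)
        count += 1
    return count, score


count, final_score = results_until_trapped()
print(count, round(final_score, 6))
```

Because valid results only multiply the score by the decay factor, the score drifts back toward zero once the host stops producing invalids, which is the decay behaviour a plain count lacks. The per-result cost is one multiplication and one addition.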
ID: 2031389 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19550
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2031399 - Posted: 8 Feb 2020, 12:52:15 UTC - in response to Message 2031389.  

But a host will probably be producing good results on the CPU and, if fitted, on other GPUs.
ID: 2031399 · Report as offensive
bluestar

Joined: 5 Sep 12
Posts: 7358
Credit: 2,084,789
RAC: 3
Message 2031400 - Posted: 8 Feb 2020, 12:56:26 UTC - in response to Message 2031381.  
Last modified: 8 Feb 2020, 12:59:09 UTC

Sorry, but noise is a thing only coming our way, for being a radio telescope being pointed at the sky, for detecting radio signals which also could be intelligent in nature.

Therefore the blc35_2bit_guppi tasks among these, for only a bit of noise, except also the possible transmission that also could be meant to be.

Again just sorry for only the discussion of that of RAC for also credit it should be here, for also the scheduled maintenance happening at times.
ID: 2031400 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031404 - Posted: 8 Feb 2020, 13:02:55 UTC - in response to Message 2031399.  

But a host will probably be producing good results on the CPU and, if fitted, on other GPUs.
Even those buggy GPUs/drivers only fail on some tasks. Not all of them.
ID: 2031404 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22721
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2031413 - Posted: 8 Feb 2020, 14:42:14 UTC - in response to Message 2031383.  

Thanks Richard - I had a look at that post and thought "there's something wrong there, but what...."
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2031413 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031424 - Posted: 8 Feb 2020, 15:33:00 UTC - in response to Message 2031413.  

Thanks Richard - I had a look at that post and thought "there's something wrong there, but what...."
I was trained - a long, long, time ago - at the Cavendish lab, Cambridge university. Two pieces of teaching stuck:

* No number has a meaning unless the units are stated.
* Do every calculation twice. Once, using the most accurate mechanical/electronic device available - and then again, on the back of an envelope or a chalk board. The second only has to be done using order-of-magnitude approximations, but it proves that the decimal point hasn't slipped in the first one.
ID: 2031424 · Report as offensive


 
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.