The Server Issues / Outages Thread - Panic Mode On! (118)

Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030897 - Posted: 5 Feb 2020, 17:44:58 UTC - in response to Message 2030894.  

I haven't noticed anyone comment on this yet, but the reason for the growing assimilation number is quite a simple one... they have fewer spindles on the storage drive. Fewer spindles means lower read and write rates.
Not really, because the new spindles read or write many times more bytes per rotation. It does affect the performance of multiple simultaneous reads or writes, though: with fewer spindles there is a lower chance that the simultaneous operations hit different spindles.
ID: 2030897
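
A rough illustration of the spindle-count point above: with N spindles and a handful of simultaneous random I/O requests spread evenly across them, the chance that the requests all land on different spindles drops as N shrinks. A minimal Python sketch with hypothetical spindle counts, not the real array's configuration:

    from math import prod

    def p_all_different(spindles: int, requests: int) -> float:
        """Chance that `requests` simultaneous random I/Os all land on
        different spindles, assuming an even spread across the array."""
        if requests > spindles:
            return 0.0
        return prod((spindles - i) / spindles for i in range(requests))

    # Hypothetical comparison: an older 24-spindle shelf vs a denser 8-spindle one.
    for n in (24, 8):
        print(f"{n} spindles: {p_all_different(n, 4):.2f}")
    # -> 24 spindles: 0.77, 8 spindles: 0.41 -- fewer spindles, more contention.
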
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2030902 - Posted: 5 Feb 2020, 18:31:36 UTC - in response to Message 2030897.  

we don't even have confirmation that the new database system has been bought/built/implemented yet.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2030902
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2030980 - Posted: 6 Feb 2020, 5:07:55 UTC - in response to Message 2030902.  

we don't even have confirmation that the new database system has been bought/built/implemented yet.
I would expect the system to be down for a day or more when it comes time to get the new NAS going. First the normal weekly outage to compact & tidy up the database, then the time it takes to transfer it all across, then getting the new hardware and the transferred database recognised by the rest of the system.
I seem to recall a full database transfer taking much longer than was expected once upon a time in the distant past.
Grant
Darwin NT
ID: 2030980
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031001 - Posted: 6 Feb 2020, 8:17:27 UTC - in response to Message 2030980.  
Last modified: 6 Feb 2020, 8:17:50 UTC

I would expect the system to be down for a day or more when it comes time to get the new NAS going. First the normal weekly outage to compact & tidy up the database, then the time it takes to transfer it all across, then getting the new hardware and the transferred database recognised by the rest of the system.
They have the replica db, which they can copy to the new NAS without impacting the running system. Then they can make the db on the new NAS the replica db and let the replication process bring it up to date. After that, the only thing they need to do during the downtime is swap the roles of the databases, so it won't necessarily have any impact on the length of the downtime.

We had a period of time a week ago or so where the replica db was offline and the web site was using the master db directly. Perhaps they were doing just this.
ID: 2031001
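
For readers unfamiliar with the swap described above: the usual pattern on a master/replica database pair is to seed the new machine from the replica, let replication catch it up, and then switch roles during a short outage. A minimal sketch of the cutover step, assuming a MySQL master/replica setup like a stock BOINC install and hypothetical host names; it illustrates the general pattern, not the project's actual procedure:

    import mysql.connector  # pip install mysql-connector-python

    # Hypothetical hosts and credentials.
    OLD_MASTER = dict(host="old-master.example", user="admin", password="...")
    NEW_NAS = dict(host="new-nas.example", user="admin", password="...")

    def run(server, *statements):
        conn = mysql.connector.connect(**server)
        cur = conn.cursor()
        for stmt in statements:
            cur.execute(stmt)
        conn.close()

    # 1. Freeze writes on the old master so the copy on the NAS fully catches up.
    run(OLD_MASTER, "SET GLOBAL read_only = ON")

    # 2. Promote the copy on the new NAS: stop replicating, accept writes.
    run(NEW_NAS, "STOP SLAVE", "RESET SLAVE ALL", "SET GLOBAL read_only = OFF")

    # 3. Optionally point the old master at the new one so it becomes the replica.
    run(OLD_MASTER, "CHANGE MASTER TO MASTER_HOST='new-nas.example'", "START SLAVE")
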
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22203
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2031002 - Posted: 6 Feb 2020, 8:19:46 UTC

And then of course there is getting the purchasing done (even for fully pre-funded equipment) within a university - that can be a very fraught and time-consuming activity :-(
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2031002
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031006 - Posted: 6 Feb 2020, 8:33:05 UTC

Looks like the splitter throttling is much more effective now that the overflow storm is over.

The result table has now grown to 20 million and the splitters are being throttled, but when they stop, the table drops back under 20 million almost immediately, so the splitters spend only short periods stopped, making this almost unnoticeable. During the overflow storm the validators kept adding a lot of resends to the result table, so the table kept growing fast despite the splitters not splitting anything.
ID: 2031006
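
What is described above is a simple high-water-mark throttle: the splitters pause while the result table sits above a limit and resume once purging brings it back under. A minimal sketch of such a control loop, using the 20-million figure from the discussion and hypothetical helpers for the row count and the splitter switch:

    import time

    RESULT_LIMIT = 20_000_000  # high-water mark from the discussion above
    CHECK_PERIOD = 60          # seconds between checks (hypothetical)

    def result_table_rows() -> int:
        """Hypothetical helper: current row count of the result table."""
        raise NotImplementedError

    def set_splitters_enabled(enabled: bool) -> None:
        """Hypothetical helper: start or stop the splitter processes."""
        raise NotImplementedError

    def throttle_loop() -> None:
        enabled = True
        while True:
            rows = result_table_rows()
            if enabled and rows >= RESULT_LIMIT:
                set_splitters_enabled(False)  # pause splitting above the limit
                enabled = False
            elif not enabled and rows < RESULT_LIMIT:
                set_splitters_enabled(True)   # resume once purging catches up
                enabled = True
            time.sleep(CHECK_PERIOD)
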
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2031126 - Posted: 7 Feb 2020, 2:08:55 UTC
Last modified: 7 Feb 2020, 2:10:30 UTC

I just did a quick add-up of the big numbers on the server status page. It seems the database can handle over 20 million comfortably; when I added up the numbers, this is what I got: 22,986,785. Splitter rate is over 67 a second.
ID: 2031126
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031192 - Posted: 7 Feb 2020, 11:37:27 UTC - in response to Message 2031126.  
Last modified: 7 Feb 2020, 11:37:59 UTC

I just did a quick add-up of the big numbers on the server status page. It seems the database can handle over 20 million comfortably; when I added up the numbers, this is what I got: 22,986,785
The highest number the ssp has had within the last day or so was 20,012,235, and it spends most of its time below 20 million, with only brief excursions above it. I guess you are mixing some non-result fields into your count, getting a weird hybrid number that doesn't match the size of any table.

That 20 million is the size of the result table. You get that by summing up all the result fields: 'Results ready to send', 'Results out in the field', 'Results returned and awaiting validation' and 'Results waiting for db purging'. If you add the workunit and file fields, then you will count some results up to four times. And you can't really count the size of the workunit table because the ssp only shows a subset of them.
ID: 2031192
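
As a worked example of the sum described above, with made-up numbers in the same shape as the server status page (the real figures change minute to minute):

    # Hypothetical snapshot of the result-related fields on the server status page.
    ssp_results = {
        "Results ready to send": 450_000,
        "Results out in the field": 11_200_000,
        "Results returned and awaiting validation": 6_800_000,
        "Results waiting for db purging": 1_550_000,
    }

    result_table_size = sum(ssp_results.values())
    print(f"{result_table_size:,}")  # 20,000,000 in this made-up snapshot
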
BetelgeuseFive Project Donor
Volunteer tester

Joined: 6 Jul 99
Posts: 158
Credit: 17,117,787
RAC: 19
Netherlands
Message 2031205 - Posted: 7 Feb 2020, 14:15:42 UTC

Hmmm, looks like good tasks are being marked as invalid and bad ones as valid ...

https://setiathome.berkeley.edu/workunit.php?wuid=3871356807

Both computers that have this task marked as valid returned an overflow (and both these hosts return lots of invalids).
Both computers that have this task marked as invalid did NOT return an overflow (and both these hosts have no other invalids).

Shouldn't there be some kind of mechanism to prevent this (when at least one host did not return an overflow try more hosts) ?

Tom
ID: 2031205
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2031209 - Posted: 7 Feb 2020, 14:39:36 UTC - in response to Message 2031205.  

Hmmm, looks like good tasks are being marked as invalid and bad ones as valid ...

https://setiathome.berkeley.edu/workunit.php?wuid=3871356807

Both computers that have this task marked as valid returned an overflow (and both these hosts return lots of invalids).
Both computers that have this task marked as invalid did NOT return an overflow (and both these hosts have no other invalids).

Shouldn't there be some kind of mechanism to prevent this (when at least one host did not return an overflow try more hosts) ?

Tom

. . The two hosts with lots of invalids have NAVI 5700 GPUs, so there are still some out there that haven't upgraded their drivers to fix this problem.

Stephen

:(
ID: 2031209
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031212 - Posted: 7 Feb 2020, 14:58:49 UTC - in response to Message 2031205.  

Hmmm, looks like good tasks are being marked as invalid and bad ones as valid ...
Shouldn't there be some kind of mechanism to prevent this (when at least one host did not return an overflow try more hosts) ?
It did just that. Twice!
But the initial hosts were both bad hosts and returned bad results that matched each other better than the two good results matched each other, convincing the validator that the bad results were the more reliable ones.
ID: 2031212
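
A rough illustration of how a pairwise quorum check can be fooled this way: if the validator simply looks for a pair of results that agree with each other, two hosts making the same systematic mistake can out-vote two correct but slightly differing results. A minimal sketch, not the actual BOINC validator; results_agree is a hypothetical stand-in for the project's science comparison:

    from itertools import combinations

    def results_agree(a: dict, b: dict) -> bool:
        """Hypothetical stand-in for the project's result comparison."""
        return abs(a["signal_count"] - b["signal_count"]) <= 1

    def pick_matching_pair(results: list):
        """Return the first pair of results that agree with each other, else None.
        Two 'bad' hosts sharing the same systematic error will match each other
        here just as convincingly as two good hosts do."""
        for a, b in combinations(results, 2):
            if results_agree(a, b):
                return a, b
        return None
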
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22203
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2031250 - Posted: 7 Feb 2020, 19:16:44 UTC

I've said this before, but I'll say it again.
It is about time "invalid" tasks were treated in much the same way as "error" tasks.
Ignore the odd one, but if a computer is returning loads then it gets its allowance progressively cut until the cycle is broken.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2031250
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031256 - Posted: 7 Feb 2020, 20:05:33 UTC - in response to Message 2031250.  
Last modified: 7 Feb 2020, 20:12:24 UTC

I've said this before, but I'll say it again.
It is about time "invalid" tasks were treated in much the same way as "error" tasks.
Ignore the odd one, but if a computer is returning loads then it gets its allowance progressively cut until the cycle is broken.
There's even more reason to do that with invalids than with errors! Errors can never result in bad data going into the science database, but results that should have been invalid could end up as false positives and pollute the science data.

I also think that validators should trust results from hosts that produce a high percentage of invalids less than results from hosts that produce almost no invalids. The results should be considered valid only when at least one of the pair of matching results is from a 'good' host. If such a match is not found, the task should be resent until such a match can be found. Even better would be if the scheduler could filter what it sends to each host and make sure no more than one 'bad' host is ever included in the replication of one workunit.

Also, when a host has produced so many invalids that it gets classified as a 'bad' one, a message should appear in the 'messages' tab of boincmgr that states this fact and requests the user to fix their host.

This good/bad status should be considered separately for each application. If the host is not an anonymous platform with just one app for the particular processing unit, then the server could also reduce the amount of work it sends to that particular app on that host and use other apps instead. But the amount should not be reduced to zero, because then the host could never clear the bad status.
ID: 2031256
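
A sketch of the kind of per-host, per-application reliability tracking proposed above. None of this exists in the current server code; the threshold, minimum sample and names are made up for illustration:

    from dataclasses import dataclass

    BAD_INVALID_RATE = 0.05  # assumed threshold: >5% invalid results marks a host/app 'bad'
    MIN_SAMPLE = 50          # don't judge a host/app on a handful of results

    @dataclass
    class HostAppStats:
        valid: int = 0
        invalid: int = 0

        @property
        def is_bad(self) -> bool:
            total = self.valid + self.invalid
            return total >= MIN_SAMPLE and self.invalid / total > BAD_INVALID_RATE

    def quorum_acceptable(pair_stats: list) -> bool:
        """Accept a matching pair only if at least one result comes from a
        'good' host, as proposed above; otherwise the task would be resent."""
        return any(not s.is_bad for s in pair_stats)
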
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2031259 - Posted: 7 Feb 2020, 20:17:09 UTC - in response to Message 2031256.  

+1
May not be easy to implement, but makes sense! I agree with Ville Saari. Errors can happen for many reasons, including me making a bad edit in an xml file :) but Invalids need to be driven to zero.
ID: 2031259
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2031270 - Posted: 7 Feb 2020, 20:59:54 UTC - in response to Message 2031259.  

+1
May not be easy to implement, but makes sense! I agree with Ville Saari. Errors can happen for many reasons, including me making a bad edit in an xml file :) but Invalids need to be driven to zero.


+1

. . Zero invalids should be the target ...

Stephen

. .
ID: 2031270
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19062
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2031272 - Posted: 7 Feb 2020, 21:01:56 UTC - in response to Message 2031205.  

Hmmm, looks like good tasks are being marked as invalid and bad ones as valid ...

https://setiathome.berkeley.edu/workunit.php?wuid=3871356807

Both computers that have this task marked as valid returned an overflow (and both these hosts return lots of invalids).
Both computers that have this task marked as invalid did NOT return an overflow (and both these hosts have no other invalids).

Shouldn't there be some kind of mechanism to prevent this (when at least one host did not return an overflow try more hosts) ?

Tom

I warned of that in https://setiathome.berkeley.edu/forum_thread.php?id=84983&postid=2027128#2027128, after I got marked invalid against two bad ATI hosts, which I had observed in https://setiathome.berkeley.edu/forum_thread.php?id=84508&postid=2026843#2026843
ID: 2031272
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2031275 - Posted: 7 Feb 2020, 21:08:41 UTC - in response to Message 2031192.  
Last modified: 7 Feb 2020, 21:15:59 UTC

I just did a quick add-up of the big numbers on the server status page. It seems the database can handle over 20 million comfortably; when I added up the numbers, this is what I got: 22,986,785
The highest number the ssp has had within the last day or so was 20,012,235, and it spends most of its time below 20 million, with only brief excursions above it. I guess you are mixing some non-result fields into your count, getting a weird hybrid number that doesn't match the size of any table.

That 20 million is the size of the result table. You get that by summing up all the result fields: 'Results ready to send', 'Results out in the field', 'Results returned and awaiting validation' and 'Results waiting for db purging'. If you add the workunit and file fields, then you will count some results up to four times. And you can't really count the size of the workunit table because the ssp only shows a subset of them.

Thanks for the information
ID: 2031275
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2031276 - Posted: 7 Feb 2020, 21:12:15 UTC - in response to Message 2031275.  

Your answer is in your quoted message.
You get that by summing up all the result fields: 'Results ready to send', 'Results out in the field', 'Results returned and awaiting validation' and 'Results waiting for db purging'.

Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2031276
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2031278 - Posted: 7 Feb 2020, 21:15:22 UTC - in response to Message 2031276.  

Your answer is in your quoted message.
You get that by summing up all the result fields: 'Results ready to send', 'Results out in the field', 'Results returned and awaiting validation' and 'Results waiting for db purging'.

So it is, thanks Keith. I will change my original post.
ID: 2031278
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2031287 - Posted: 7 Feb 2020, 21:41:35 UTC - in response to Message 2031272.  

Hmmm, looks like good tasks are being marked as invalid and bad ones as valid ...

https://setiathome.berkeley.edu/workunit.php?wuid=3871356807

Both computers that have this task marked as valid returned an overflow (and both these hosts return lots of invalids).
Both computers that have this task marked as invalid did NOT return an overflow (and both these hosts have no other invalids).

Shouldn't there be some kind of mechanism to prevent this (when at least one host did not return an overflow try more hosts) ?

Tom

I warned of that in https://setiathome.berkeley.edu/forum_thread.php?id=84983&postid=2027128#2027128, after I got marked invalid against two bad ATI hosts, which I had observed in https://setiathome.berkeley.edu/forum_thread.php?id=84508&postid=2026843#2026843


. . The problem with the NAVI AMD cards has been an issue for a couple of months now and has its own thread.

Stephen

<shrug>
ID: 2031287