The Server Issues / Outages Thread - Panic Mode On! (118)



Kevin Olley

Joined: 3 Aug 99
Posts: 906
Credit: 261,085,289
RAC: 572
United Kingdom
Message 2030025 - Posted: 31 Jan 2020, 3:51:40 UTC - in response to Message 2030023.  

Switched to AP only until this is sorted.

Using E@H for heating:-)

They have "Tasks in progress suppressed pending completion" set on AP too, so there might be a problem there as well.
https://setiathome.berkeley.edu/workunit.php?wuid=3861006207


NNT it is then:-(
Kevin


Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13755
Credit: 208,696,464
RAC: 304
Australia
Message 2030033 - Posted: 31 Jan 2020, 4:47:48 UTC
Last modified: 31 Jan 2020, 4:50:41 UTC

And even with this self-validation, the Validation & Assimilation backlogs continue to grow.

And my Inconclusives look to be heading for an all-time record, and there appears to have been a return of the BLC35 noise bombs.
Grant
Darwin NT
Speedy
Volunteer tester

Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2030034 - Posted: 31 Jan 2020, 4:48:41 UTC

If we all set NNT, no work will be processed and no resends will get processed. That is just my opinion.
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13755
Credit: 208,696,464
RAC: 304
Australia
Message 2030036 - Posted: 31 Jan 2020, 4:57:52 UTC - in response to Message 2029934.  

We could cut our caches to 0.0 + 0.0 - return every task after just 5 minutes and get the 'first back' reward? No, I didn't think so either.
My vote was for shutting down the splitters for a week (or two, or however long it takes), and just have people process resends until such time as the Validation & Assimilation backlogs have cleared.
Not started to clear, but fully cleared.

Pull all BLC35 files and then restart the splitters with 100 + 100 serverside limits again.
Once we have the new NAS device up and running, bump up the limits & reintroduce the BLC35 files, and then use them to stress test the system. If it fails again, then it's fundraising time for new database servers that are capable of handling the load (that really needs to be done anyway in order to meet the project's goal of many more crunchers returning much more work).
Grant
Darwin NT
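For context on the "100 + 100 serverside limits" mentioned above: they cap how many tasks a single host may hold in progress for each resource type (CPU and GPU), so no one machine can hoard work or bloat the in-progress tables. A minimal sketch of that gating logic, with illustrative names and numbers (the real scheduler is C++ in the BOINC server source, not this Python):

    # Sketch of a per-host "tasks in progress" cap as described above.
    # Names and numbers are illustrative, not actual BOINC scheduler code.
    MAX_IN_PROGRESS_CPU = 100   # the "100 + 100" server-side limits
    MAX_IN_PROGRESS_GPU = 100

    def tasks_to_send(cpu_in_progress, gpu_in_progress, cpu_req, gpu_req):
        """How many new CPU/GPU tasks the scheduler may assign to a host."""
        cpu_room = max(0, MAX_IN_PROGRESS_CPU - cpu_in_progress)
        gpu_room = max(0, MAX_IN_PROGRESS_GPU - gpu_in_progress)
        return min(cpu_req, cpu_room), min(gpu_req, gpu_room)

    # A host already holding 100 GPU tasks gets no more GPU work:
    print(tasks_to_send(40, 100, 10, 10))  # -> (10, 0)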
Speedy
Volunteer tester

Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2030037 - Posted: 31 Jan 2020, 5:00:50 UTC - in response to Message 2030036.  

I agree, Grant, with regard to pulling the data tapes; however, I think you will find that they don't have the manpower to sift through the data and pull selected tapes.
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13755
Credit: 208,696,464
RAC: 304
Australia
Message 2030039 - Posted: 31 Jan 2020, 5:04:07 UTC - in response to Message 2030037.  

I agree, Grant, with regard to pulling the data tapes; however, I think you will find that they don't have the manpower to sift through the data and pull selected tapes.
Hence just pull all files named BLC35 and hold them over until such time as the servers can handle the load they generate.
There are plenty of other files to be processed, so there's no need to do these ones now.
Grant
Darwin NT
Gene Project Donor

Joined: 26 Apr 99
Posts: 150
Credit: 48,393,279
RAC: 118
United States
Message 2030043 - Posted: 31 Jan 2020, 5:18:28 UTC

I got 3 invalids on the 30th. In all three cases the wingman, who got "valid" credit, returned a StdErr file that was empty - just one line:
<core_client_version>7.14.2</core_client_version>
It is not sensible that a result which returned no info was marked "valid". I'm going NNT as others have.
Ville Saari

Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030048 - Posted: 31 Jan 2020, 5:29:24 UTC - in response to Message 2030036.  

My vote was for shutting down the splitters for a week (or two, or however long it takes), and just have people process resends until such time as the Validation & Assimilation backlogs have cleared.
Not started to clear, but fully cleared.
This would alienate all those users who are not following these forums, making them quit or switch to other projects permanently. A loss of users would help the server congestion but hurt the science progress.

I think letting the backlogs clear at the start of a Tuesday downtime would make a big difference, especially if they also trigger the validation of all those results that have missed validation for various reasons over the last weeks and are now waiting for the deadlines. The resend cycle wouldn't clear, but resends are a small percentage of all the tasks. The huge 'Workunits waiting for assimilation' backlog, now 3.5 million and still rising, would clear.

Those workunits waiting for assimilation must have corresponding result rows still in the database - at least for the canonical result, but probably for all the results, because I have never seen any workunit on the website show part of its results deleted while the workunit still exists. The number of results waiting for assimilation is not shown on the SSP, so I guess those results may still be counted in the validation queue. If this is the case, then they may explain over 7 million of the current 12 million results in the validation queue!
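Ville's estimate checks out with simple arithmetic: under the normal initial replication of 2, each workunit stuck in the assimilation queue should still hold about two result rows. A quick back-of-envelope sketch using the figures quoted above:

    # Back-of-envelope check: ~3.5 M workunits awaiting assimilation,
    # at ~2 result rows each, could account for ~7 M of the ~12 M
    # results shown as awaiting validation on the server status page.
    wus_waiting_assimilation = 3_500_000   # figure quoted above
    results_per_wu = 2                     # normal initial replication
    validation_queue = 12_000_000          # figure quoted above

    held = wus_waiting_assimilation * results_per_wu
    print(f"{held:,} results held")                       # 7,000,000
    print(f"{held / validation_queue:.0%} of the queue")  # ~58%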

Once we have the new NAS device up and running, bump up the limits &...
When the problem is the database not fitting in RAM, a disk performance increase won't fix the problem. It only reduces the magnitude of the consequences a bit.
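The reason faster disks only soften the blow: once the working set no longer fits in the database server's RAM, every cache miss pays the storage latency, which is orders of magnitude worse than memory however fast the array is. Rough order-of-magnitude numbers (illustrative, not measurements of the project's hardware):

    # Typical order-of-magnitude random-access latencies (illustrative).
    latency_ns = {"RAM": 100, "NVMe flash": 100_000, "spinning disk": 10_000_000}

    for medium, ns in latency_ns.items():
        print(f"{medium:>13}: ~{1e9 / ns:,.0f} random lookups/s")
    # RAM: ~10,000,000/s, flash: ~10,000/s, disk: ~100/s.
    # A faster array narrows the gap but never closes it.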
Unixchick Project Donor

Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2030050 - Posted: 31 Jan 2020, 5:58:42 UTC - in response to Message 2030036.  
Last modified: 31 Jan 2020, 6:03:23 UTC

My vote was for shutting down the splitters for a week (or two, or however long it takes), and just have people process resends until such time as the Validation & Assimilation backlogs have cleared.
Not started to clear, but fully cleared.

Pull all BLC35 files and then restart the splitters with 100 + 100 serverside limits again.
Once we have the new NAS device up and running, bump up the limits & reintroduce the BLC35 files, and then use them to stress test the system. If it fails again, then it's fundraising time for new database servers that are capable of handling the load (that really needs to be done anyway in order to meet the project's goal of many more crunchers returning much more work).


+1
I think this is a great idea. We will all still get work... just make it the resends, until the db reaches a good size.
P.S. The idea of processing data without a wingman, or of having a bad result put in over my good result, is BS and worthless. I love Seti, but I don't want feel-good theater, I want SCIENCE! So I'm NNT until it gets better.
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030051 - Posted: 31 Jan 2020, 6:01:53 UTC
Last modified: 31 Jan 2020, 6:08:23 UTC

The Splitters have fallen off again; most requests are receiving 'Project has no tasks...' once more. Caches are falling... one is down by 50% already, and so it continues.

Oh, and the problem with failed uploads has also returned. It probably has something to do with returning around 60 to 70 completed tasks every 5 minutes.
Ville Saari

Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030053 - Posted: 31 Jan 2020, 6:07:14 UTC
Last modified: 31 Jan 2020, 6:07:45 UTC

Now they have apparently switched to 'initial replication 1': 3861450832

So there is no more risk of bad results returned first making good results returned later fail - but there is also no chance whatsoever of catching the bad results.
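The trade-off can be made concrete: with initial replication 2, a bad result is flagged whenever its wingman disagrees; with initial replication 1 there is no second result to compare against, so nothing can ever be flagged. A toy model of the quorum check (illustrative only, not the project's validator code):

    # Toy model of quorum validation versus "initial replication 1".
    def validate(results, quorum=2):
        """Mark each result valid only if `quorum` results agree."""
        if len(results) == 1:
            return [True]   # replication 1: accepted with no cross-check
        return [results.count(r) >= quorum for r in results]

    print(validate(["noise bomb"]))          # [True] - sails through unchecked
    print(validate(["good", "noise bomb"]))  # [False, False] - mismatch caught,
                                             # triggering a resend in real BOINC
    print(validate(["good", "good"]))        # [True, True]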
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13755
Credit: 208,696,464
RAC: 304
Australia
Message 2030057 - Posted: 31 Jan 2020, 6:28:29 UTC - in response to Message 2030048.  

Once we have the new NAS device up and running, bump up the limits &...
When the problem is the database not fitting in RAM, a disk performance increase won't fix the problem. It only reduces the magnitude of the consequences a bit.
Depending on how much better they perform, the need for all of it to fit in RAM may not arise (although that is rather wishful thinking - I am expecting the new storage to be significantly faster than the existing storage, but I don't expect the improvement to be significant enough).
Or they could be replaced by an AFA (All Flash Array), negating the need for the entire thing to fit in RAM. Or a new server with more RAM. Or, better yet, both.
Grant
Darwin NT
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030058 - Posted: 31 Jan 2020, 6:34:02 UTC

Results received in last hour = 197,095
It's just a matter of time now, probably not long.
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13755
Credit: 208,696,464
RAC: 304
Australia
Message 2030059 - Posted: 31 Jan 2020, 6:39:20 UTC - in response to Message 2030058.  

Results received in last hour = 197,095
It's just a matter of time now, probably not long.
Already getting "Project has no tasks available" messages - I think TBar posted similarly in another thread. Caches are running down.
Not surprising, considering the return rate & the increasing Validation & Assimilation backlogs - both have reached new record highs.
Grant
Darwin NT
Ville Saari

Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030066 - Posted: 31 Jan 2020, 7:55:22 UTC - in response to Message 2030057.  

Or they could be replaced by an AFA (All Flash Array), negating the need for the entire thing to fit in RAM.
The SETI@home BOINC database running from flash would burn out the flash in a short time!
Unixchick Project Donor

Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2030068 - Posted: 31 Jan 2020, 7:59:22 UTC
Last modified: 31 Jan 2020, 8:00:49 UTC

If we are no longer validating the WUs properly... some of mine don't even have a wingman... why is the "Results returned and awaiting validation" number growing in the status?

edit: Ville - I love your Pluto pic.
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13755
Credit: 208,696,464
RAC: 304
Australia
Message 2030069 - Posted: 31 Jan 2020, 8:05:43 UTC - in response to Message 2030066.  
Last modified: 31 Jan 2020, 8:11:50 UTC

Or they could be replaced by an AFA (All Flash Array), negating the need for the entire thing to fit in RAM.
The SETI@home BOINC database running from flash would burn out the flash in a short time!
After several decades.
Yes, if you were to use consumer/client SSDs they would die rather quickly; however, SSDs designed for enterprise use will last an extremely long time, even under much heavier use than Seti provides.

For example, take DWPD (Drive Writes Per Day, where the entire capacity of the drive is written in a 24-hour period): consumer drives are rated at around 0.1 to 0.5 DWPD, enterprise drives are rated as high as 3 DWPD, and some specialised write-intensive drives even higher.
And of course, with multiple drives in an array or pool, even under the heaviest of loads they will never come close to their rated maximum DWPD limit.
Grant
Darwin NT
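Worked through with numbers, the endurance argument above looks like this; the array size and daily database write volume are assumptions for illustration, not project figures:

    # DWPD endurance arithmetic; array size and write load are assumed.
    drives = 8
    capacity_tb = 3.84        # per enterprise SSD
    dwpd = 3                  # rated drive writes per day over the warranty

    rated_tb_per_day = drives * capacity_tb * dwpd      # ~92 TB/day
    assumed_db_writes_tb_per_day = 2.0                  # generous assumption

    print(f"rated endurance: {rated_tb_per_day:.0f} TB/day written")
    print(f"headroom: {rated_tb_per_day / assumed_db_writes_tb_per_day:.0f}x")
    # Even at several times this assumed write load, the array stays far
    # below its rated endurance for the whole warranty period.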
Ville Saari

Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030072 - Posted: 31 Jan 2020, 8:24:42 UTC - in response to Message 2030068.  

Ville - I love your Pluto pic.
I got the idea of using a Pluto pic from you ;)
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030073 - Posted: 31 Jan 2020, 8:26:32 UTC - in response to Message 2030068.  

If we are no longer validating the WUs properly... some of mine don't even have a wingman... why is the "Results returned and awaiting validation" number growing in the status?
My theory is that the Transitioner isn't marking (or hasn't marked) all those returns as 'ready to validate' - I think the bulk of them have been sitting there untouched since the December troubles.

Eric replied - very late on Thursday night, his time -

I will see if I can figure out a transitioner trick tomorrow, in which case I will revert to standard replication.
(I suggested that Matt might have a script for that - I think we've done it before)
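For readers wondering what a "transitioner trick" typically means: the stock BOINC transitioner only re-examines a workunit once its transition_time has passed, so the usual fix for stuck work is to reset that timestamp and force everything back through the state machine. A hedged sketch (workunit.transition_time and canonical_resultid are stock BOINC schema fields; the exact filter used would be up to the admins):

    import time

    # Sketch of the usual "transitioner trick": reset transition_time so
    # the transitioner revisits stuck workunits. The WHERE clause is
    # illustrative; canonical_resultid = 0 means no canonical result has
    # been chosen yet, i.e. the workunit has not validated.
    sql = ("UPDATE workunit SET transition_time = %d "
           "WHERE canonical_resultid = 0" % int(time.time()))
    print(sql)  # to be run against the project database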
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030075 - Posted: 31 Jan 2020, 8:35:39 UTC

The first machine is now out of work: https://setiathome.berkeley.edu/results.php?hostid=6796479
The next machine's cache is down by 60%; it will be out soon: https://setiathome.berkeley.edu/results.php?hostid=6813106
The load is still above 190k, and the Splitters can't keep up: https://setiathome.berkeley.edu/show_server_status.php