The Server Issues / Outages Thread - Panic Mode On! (118)

Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030048 - Posted: 31 Jan 2020, 5:29:24 UTC - in response to Message 2030036.  

My vote was for shutting down the splitters for a week (or 2, or however long it takes), and just have people process resends until such time as the Validation & Assimilation backlogs have cleared.
Not started to clear, but fully cleared.
This would alienate all those users who are not following these forums making them quit or switch to other projects permanently. Loss of users would help the server congestion but hurt the science progress.

I think letting the backlogs clear at the start of a Tuesday downtime would make a big difference. Especially if they also trigger the validation of all those results that have missed validation for various reasons over the last weeks and are now waiting for the deadlines. The resend cycle wouldn't clear but they are a small percentage of all the tasks. The huge 'Workunits waiting for assimilation' backlog that is now 3.5 million and still rising would clear.

Those workunits waiting for assimilation must have corresponding result rows still in the database, at least for the canonical result but probably for all the results, because I have never seen any workunit on the website show part of its results deleted while the workunit still exists. The number of results waiting for assimilation is not shown on the SSP, so I guess those results may still be counted in the validation queue. If this is the case, then those may explain over 7 million of the current 12 million result validation queue!

Once we have the new NAS device up and running, bump up the limits &...
When the problem is the database not fitting in RAM, the disk performance increase won't fix the problem. It only reduces the magnitude of the consequences a bit.
ID: 2030048
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2030050 - Posted: 31 Jan 2020, 5:58:42 UTC - in response to Message 2030036.  
Last modified: 31 Jan 2020, 6:03:23 UTC

My vote was for shutting down the splitters for a week (or 2, or however long it takes), and just have people process resends until such time as the Validation & Assimilation backlogs have cleared.
Not started to clear, but fully cleared.

Pull all BLC35 files and then restart the splitters with 100 + 100 serverside limits again.
Once we have the new NAS device up and running, bump up the limits & reintroduce the BLC35 files and then use them to stress test the system. If it fails again then it's fundraising time for new database servers that are capable of handling the load (that really needs to be done anyway in order to meet the projects goals of many more crunchers returning much more work).


+1
I think this is a great idea. We will all still get work... just make it the resends, until db reaches a good size.
p.s. The idea of processing data without a wingman, or having a bad result put in over my good result, is BS and worthless. I love Seti, but I don't want feel-good theater, I want SCIENCE! So I'm NNT until it gets better.
ID: 2030050
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030051 - Posted: 31 Jan 2020, 6:01:53 UTC
Last modified: 31 Jan 2020, 6:08:23 UTC

The Splitters have fallen off again, most requests are receiving 'Project has No tasks...' again. Caches are falling.... one is down by 50% already, and so it continues.

Oh, the problem with failed Uploads has also returned. Probably has something to do with returning around 60 to 70 completed tasks every 5 minutes.
ID: 2030051
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030053 - Posted: 31 Jan 2020, 6:07:14 UTC
Last modified: 31 Jan 2020, 6:07:45 UTC

Now they have apparently switched to 'initial replication 1': 3861450832

So no more risk of bad results returned first making good results returned later fail, but also no chance whatsoever of catching the bad results.
ID: 2030053
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 14041
Credit: 208,696,464
RAC: 304
Australia
Message 2030057 - Posted: 31 Jan 2020, 6:28:29 UTC - in response to Message 2030048.  

Once we have the new NAS device up and running, bump up the limits &...
When the problem is the database not fitting in RAM, the disk performance increase won't fix the problem. It only reduces the magnitude of the consequences a bit.
Depending on how much better they perform, the need for all of it to fit in RAM may not arise (although that is rather wishful thinking: I am expecting the new storage to be significantly faster than the existing storage, however I don't expect it to be significant enough).
Or they could be replaced by an AFA (All Flash Array), negating the need for the entire thing to fit in RAM. Or a new server with more RAM. Or better yet, both.
Grant
Darwin NT
ID: 2030057
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030058 - Posted: 31 Jan 2020, 6:34:02 UTC

Results received in last hour = 197,095
just a matter of time now, probably not long.
ID: 2030058
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 14041
Credit: 208,696,464
RAC: 304
Australia
Message 2030059 - Posted: 31 Jan 2020, 6:39:20 UTC - in response to Message 2030058.  

Results received in last hour = 197,095
just a matter of time now, probably not long.
Already getting "Project has no tasks available" messages; I think TBar posted similarly in another thread. Caches running down.
Not surprising considering the return rate & the increasing Validation & Assimilation backlogs; both have reached new record highs.
Grant
Darwin NT
ID: 2030059
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030066 - Posted: 31 Jan 2020, 7:55:22 UTC - in response to Message 2030057.  

Or they could be replaced by an AFA (All Flash Array), negating the need for the entire thing to fit in RAM.
Setiathome Boinc database running from flash would burn out the flash in a short time!
ID: 2030066
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2030068 - Posted: 31 Jan 2020, 7:59:22 UTC
Last modified: 31 Jan 2020, 8:00:49 UTC

If we are no longer validating the WUs properly... some of mine don't even have a wingman... why is Results returned and awaiting validation number growing in the status??

edit: Ville - I love your Pluto pic.
ID: 2030068
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 14041
Credit: 208,696,464
RAC: 304
Australia
Message 2030069 - Posted: 31 Jan 2020, 8:05:43 UTC - in response to Message 2030066.  
Last modified: 31 Jan 2020, 8:11:50 UTC

Or they could be replaced by an AFA (All Flash Array), negating the need for the entire thing to fit in RAM.
Setiathome Boinc database running from flash would burn out the flash in a short time!
After several decades.
Yes, if you were to use consumer/client SSDs they would die rather quickly, however SSDs designed for enterprise use will last an extremely long time, under much heavier use than Seti provides.

For example, DWPD (Drive Writes Per Day, where the entire capacity of the drive is written to in a 24hr period). Consumer drives are rated at around 0.1 to 0.5, Enterprise drives are rated as high as 3 DWPD, some specialised write drives even higher.
And of course with multiple drives in an array or pool, even under the heaviest of loads, they will never come close to their rated maximum DWPD limit.
Grant
Darwin NT
ID: 2030069
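
To put rough numbers on the DWPD argument above, here is a minimal Python sketch. It assumes the common reading of a DWPD rating as holding over a warranty period (5 years here, which is an assumption), and the drive capacity, drive count and daily write volume in the usage example are hypothetical, since the thread gives no write figures for the SETI@home database. It also ignores write amplification and parity overhead and assumes writes spread evenly across the drives, so treat the result as an order-of-magnitude estimate only.

```python
DAYS_PER_YEAR = 365


def rated_writes_tb(capacity_tb, dwpd, warranty_years=5.0):
    """Total terabytes one drive is rated to absorb over its warranty period."""
    return capacity_tb * dwpd * DAYS_PER_YEAR * warranty_years


def years_until_rated_limit(capacity_tb, dwpd, daily_writes_tb,
                            drives_in_array=1, warranty_years=5.0):
    """Years until a pool of identical drives reaches its rated write total.

    Assumes the daily write volume is spread evenly over all drives.
    """
    if daily_writes_tb <= 0:
        raise ValueError("daily_writes_tb must be positive")
    pool_total_tb = rated_writes_tb(capacity_tb, dwpd, warranty_years) * drives_in_array
    return pool_total_tb / daily_writes_tb / DAYS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical pool: 8 drives of 1.92 TB each, 2 TB written per day in total.
    for label, dwpd in (("consumer", 0.3), ("enterprise", 3.0)):
        years = years_until_rated_limit(capacity_tb=1.92, dwpd=dwpd,
                                        daily_writes_tb=2.0, drives_in_array=8)
        print(f"{label}: about {years:.1f} years to reach the rated write limit")
```

The point of the comparison is only the ratio: for the same pool and the same load, a drive rated at 3 DWPD reaches its rated limit ten times later than one rated at 0.3 DWPD.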
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030072 - Posted: 31 Jan 2020, 8:24:42 UTC - in response to Message 2030068.  

Ville - I love your Pluto pic.
I got the idea of using a Pluto pic from you ;)
ID: 2030072
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030073 - Posted: 31 Jan 2020, 8:26:32 UTC - in response to Message 2030068.  

If we are no longer validating the WUs properly... some of mine don't even have a wingman... why is Results returned and awaiting validation number growing in the status??
My theory is that the Transitioner isn't marking (hasn't marked) all those returns as 'ready to validate' - I think the bulk of them have been sitting there untouched since the December troubles.

Eric replied - very late on Thursday night, his time -

I will see if I can figure out a transitioner trick tomorrow, in which case I will revert to standard replication.
(I suggested that Matt might have a script for that - I think we've done it before)
ID: 2030073
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030075 - Posted: 31 Jan 2020, 8:35:39 UTC

The first machine is now out of work, https://setiathome.berkeley.edu/results.php?hostid=6796479
The Next machine's cache is down by 60%, it will be out soon, https://setiathome.berkeley.edu/results.php?hostid=6813106
The load is still above 190k, and the Splitters can't keep up, https://setiathome.berkeley.edu/show_server_status.php
ID: 2030075
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030077 - Posted: 31 Jan 2020, 8:43:25 UTC - in response to Message 2030068.  
Last modified: 31 Jan 2020, 8:48:04 UTC

If we are no longer validating the WUs properly... some of mine don't even have a wingman... why is Results returned and awaiting validation number growing in the status??
If my theory about results belonging to workunits waiting for assimilation being shown as waiting for validation is correct, then we could have about 7.5 million of the 12.2 million results there being ones that have been validated but not assimilated yet. And that is growing fast.

The 'Workunits waiting for assimilation' number is supposed to be close to zero in a normal situation, because workunits get assimilated immediately after they have been validated. But for more than a week now that number has been steadily growing, recently by about 30000 per hour. The Astropulse number has also been growing over the last couple of hours. There is some serious performance problem in assimilation.
ID: 2030077
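
The arithmetic behind this theory can be sketched in a few lines of Python. The inputs are the figures quoted in the thread (about 3.5 million workunits waiting for assimilation, a validation queue of about 12 million results, growth of about 30000 workunits per hour). The assumption that each such workunit still holds exactly 2 result rows is mine, based on the replication in use before the switch to 'initial replication 1'; workunits with resends would hold more, which is consistent with the "over 7 million" estimate. The function names are hypothetical.

```python
def results_held_by_assimilation(workunits_waiting, results_per_workunit=2):
    """Result rows tied up by workunits that are validated but not yet assimilated.

    results_per_workunit=2 is an assumption, not a figure from the server status page.
    """
    return workunits_waiting * results_per_workunit


def share_of_validation_queue(held_results, validation_queue):
    """Fraction of the displayed validation queue explained by held results."""
    return held_results / validation_queue


def projected_workunits_waiting(current, growth_per_hour, hours):
    """Assimilation backlog after `hours`, if the growth rate stays constant."""
    return current + growth_per_hour * hours


if __name__ == "__main__":
    held = results_held_by_assimilation(3_500_000)
    share = share_of_validation_queue(held, 12_000_000)
    print(f"results held by assimilation backlog: {held:,}")
    print(f"share of validation queue: {share:.0%}")
    print(f"backlog after 24 h: {projected_workunits_waiting(3_500_000, 30_000, 24):,}")
```

If the theory is wrong and those results are not counted in the validation queue, the share computed here means nothing; the projection of the assimilation backlog itself does not depend on the theory.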
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030078 - Posted: 31 Jan 2020, 8:45:18 UTC - in response to Message 2030075.  
Last modified: 31 Jan 2020, 8:49:03 UTC

The first machine is now out of work
You are crunching too fast. My caches are nearly full in both machines.
ID: 2030078
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 23016
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2030079 - Posted: 31 Jan 2020, 8:47:19 UTC

Because I'm nowhere near the bulk of my computers I've had to resort to using the web options page to set don't do any SETI work - the first time I've had to resort to this sort of thing due to actions by SETI :-(
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2030079
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030080 - Posted: 31 Jan 2020, 8:54:49 UTC

There is a noise bombing window in blc35 at around 58692_07 and _08. Those are probably causing the current high return rate.
ID: 2030080
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030085 - Posted: 31 Jan 2020, 9:24:58 UTC - in response to Message 2030073.  

Eric replied - very late on Thursday night, his time -
Is he aware of the assimilation problem?
ID: 2030085
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030086 - Posted: 31 Jan 2020, 9:54:27 UTC - in response to Message 2030085.  

Eric replied - very late on Thursday night, his time -
Is he aware of the assimilation problem?
He didn't mention it, but I would expect so, yes: that's an unambiguous figure on the face of the SSP (and the more complete figures which, I presume, they have access to via internal monitoring).

The 'results awaiting validation' and 'workunits awaiting validation' figures are also unambiguous, but they are unusual - why are they so different? The first usually hovers around 4 million, but recently it's been 12 million. Why?

The rise started when the 'in progress' limit was raised - an obvious direct connection, no alarm bells. But why is it still so high? That needs explanation, and I've suggested a possible way of finding out the answer. Let's hope it works, else someone is going to have to come up with another suggestion.
ID: 2030086
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 23016
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2030087 - Posted: 31 Jan 2020, 9:58:34 UTC

If this is an attempt to reduce the amount of work sitting around waiting to be validated, it's not working, as the number has increased from about 11,500,000 last night to about 12,250,000 this morning.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2030087
 
©2026 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.