The Server Issues / Outages Thread - Panic Mode On! (118)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118)

Previous · 1 . . . 50 · 51 · 52 · 53 · 54 · 55 · 56 . . . 94 · Next

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 14013
Credit: 208,696,464
RAC: 304
Australia
Message 2029138 - Posted: 25 Jan 2020, 8:58:52 UTC - in response to Message 2029136.  

But it says 'not running' which means they are out of work (obviously not the case) or failed. So apparently the splitters try to run but soon crash. And then stay in the 'not running' state until someone manually restarts them. Just to crash again after a few minutes.
They automatically stop when the Ready-to-send buffer is full- and their status becomes "Not running".
Grant
Darwin NT
ID: 2029138
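The stop-when-full behaviour Grant describes is typical of a high/low water-mark throttle: the splitters run until the ready-to-send buffer hits a ceiling, then show "Not running" until it drains. A minimal sketch, with hypothetical names and threshold values (the actual server code isn't shown in this thread):

```python
# Hypothetical high/low water-mark throttle for the splitters.
# HIGH_WATER and LOW_WATER are assumed values, not the real limits.
HIGH_WATER = 600_000   # stop splitting when ready-to-send reaches this
LOW_WATER = 400_000    # resume only once the buffer drains below this

def splitter_status(ready_to_send: int, currently_running: bool) -> str:
    """Return the status the server status page would display."""
    if currently_running:
        # Running splitters keep going until the buffer is full.
        return "Running" if ready_to_send < HIGH_WATER else "Not running"
    # Stopped splitters wait for the buffer to drain before restarting.
    return "Running" if ready_to_send < LOW_WATER else "Not running"
```

The gap between the two thresholds is what normally prevents rapid on/off cycling.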
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 14013
Credit: 208,696,464
RAC: 304
Australia
Message 2029140 - Posted: 25 Jan 2020, 9:02:01 UTC - in response to Message 2029137.  

And the splitters are running again.
I wonder if there's been a booboo and they've been set to stop when the Ready-to-send buffer reaches 600 instead of 600k?

Ha ha LOL. I've been wondering the same. The behavior looks like a throttling script running.
Yeah, but instead of slowing down their output it's stopping it dead. And then it takes a while for them to decide to start up again, as the Ready-to-send buffer is a long way from full. Then they finally start cranking up the output, and then stop again - well before the Ready-to-send buffer even makes triple digits.
Grant
Darwin NT
ID: 2029140
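If the speculated "600 instead of 600k" booboo were real, the rapid stop/start cycle would follow directly from arithmetic. A toy illustration - the ~90 results/s rate echoes figures quoted later in the thread, and the function and limits are assumptions for illustration only:

```python
# Toy model: time until the ready-to-send buffer hits the stop limit,
# assuming a constant production rate. All numbers are illustrative.
def seconds_until_stop(results_per_second: float, limit: int,
                       backlog: int = 0) -> float:
    """Seconds of splitting before the buffer reaches the stop limit."""
    return (limit - backlog) / results_per_second

# At ~90 results/s:
print(seconds_until_stop(90, 600))      # ~6.7 s with a mis-set 600 limit
print(seconds_until_stop(90, 600_000))  # ~6667 s (~1.9 h) with 600k
```

A 600 cap would explain splitters dying "after a few minutes" of the status page's update granularity, while 600k would let them run for hours.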
Ville Saari

Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2029141 - Posted: 25 Jan 2020, 9:05:29 UTC
Last modified: 25 Jan 2020, 9:08:52 UTC

The situation seems to be a bit better than yesterday. Yesterday it was very rare to catch them in running state and when that happened, it never lasted more than one ssp update cycle. Now they have been running for several cycles in a row. A few dropped out in the last cycle but the rest keep running.
ID: 2029141
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 14013
Credit: 208,696,464
RAC: 304
Australia
Message 2029142 - Posted: 25 Jan 2020, 9:09:13 UTC - in response to Message 2029141.  
Last modified: 25 Jan 2020, 9:09:31 UTC

The situation seems to be a bit better than yesterday. Yesterday it was very rare to catch them in running state and when that happened, it never lasted more than one ssp update cycle. Now they have been running for several cycles in a row. A few dropped out in the last cycle but the rest keep running.
I think it's just a case of the status not being updated again.
Many of the numbers haven't changed in over half an hour.

Things are very much borked.
Grant
Darwin NT
ID: 2029142
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2029144 - Posted: 25 Jan 2020, 9:19:56 UTC - in response to Message 2029136.  

If they are intentionally stopped, the status says 'disabled'. But it says 'not running' which means they are out of work (obviously not the case) or failed. So apparently the splitters try to run but soon crash. And then stay in the 'not running' state until someone manually restarts them. Just to crash again after a few minutes.
There's been a fairly recent change (a few months ago), where the splitters show 'not running' when the automatic limiter kicks in. That was when the 'ready to send' limit was around 600K, and the limiters were regularly kicking in and out in the course of an average day.

I've not worked out exactly what Eric did to implement "To that end we are throttling work generation to a rate at which the table size is shrinking." - it seems like that change is still having repercussions, perhaps because of the excessive numbers of overflow tasks recently.
ID: 2029144
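One way Eric's "throttling work generation to a rate at which the table size is shrinking" could be implemented is to gate the splitters on successive result-table size samples. This is purely a guess at the logic, not the actual server code - names and structure are invented for illustration:

```python
# Hypothetical gate: only allow new work while the result table is
# trending downward between consecutive size samples.
def allow_splitting(table_sizes: list[int]) -> bool:
    """Permit splitting only if the table shrank over the last interval."""
    if len(table_sizes) < 2:
        return True  # no history yet; don't block
    return table_sizes[-1] < table_sizes[-2]
```

A trigger like this would behave exactly as observed: it stops work generation dead rather than slowing it, because the table can't shrink while splitters are adding to it.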
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 14013
Credit: 208,696,464
RAC: 304
Australia
Message 2029145 - Posted: 25 Jan 2020, 9:34:58 UTC - in response to Message 2029142.  

Many of the numbers haven't changed in over half an hour.
And now they have- no work ready to go, no work being produced- all splitters not running.
Grant
Darwin NT
ID: 2029145
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 14013
Credit: 208,696,464
RAC: 304
Australia
Message 2029146 - Posted: 25 Jan 2020, 9:37:39 UTC - in response to Message 2029144.  
Last modified: 25 Jan 2020, 9:38:09 UTC

I've not worked out exactly what Eric did to implement "To that end we are throttling work generation to a rate at which the table size is shrinking." - it seems like that change is still having repercussions, perhaps because of the excessive numbers of overflow tasks recently.
Could be; although the In progress numbers are way down the Validation & Assimilation backlogs haven't really improved much at all (any reduction in Validation numbers just results in an increase in Assimilation numbers).
Grant
Darwin NT
ID: 2029146
Ville Saari

Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2029147 - Posted: 25 Jan 2020, 9:39:42 UTC - in response to Message 2029142.  

I think it's just a case of the status not being updated again.
Many of the numbers haven't changed in over half an hour.
All the numbers that were supposed to change did change during those five consecutive ssp updates that showed running splitters. The result generation rate for example was 6.3545, 61.8328, 77.8624, 78.4933, 91.5302 on those five updates.
ID: 2029147
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2029148 - Posted: 25 Jan 2020, 9:40:12 UTC - in response to Message 2029145.  

Many of the numbers haven't changed in over half an hour.
And now they have- no work ready to go, no work being produced- all splitters not running.
But an awful lot of 'Results returned and awaiting validation'. That's the table Eric was trying to get down to manageable size - maybe he's put an extra term in the throttle trigger? If we're still running an extra check on overflow tasks, all those BLC35s will go straight into that category, and stay there a while.
ID: 2029148
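Richard's guess about "an extra term in the throttle trigger" could be as simple as AND-ing a second condition into the existing check. A hypothetical sketch - every name and threshold here is invented for illustration, not taken from the server code:

```python
# Hypothetical two-term throttle trigger: splitters run only while BOTH
# the send buffer and the awaiting-validation table are under their caps.
READY_LIMIT = 600_000          # assumed ready-to-send cap
VALIDATION_LIMIT = 10_000_000  # assumed cap on awaiting-validation rows

def splitters_allowed(ready_to_send: int, awaiting_validation: int) -> bool:
    """Throttle on either a full send buffer or a bloated validation table."""
    return ready_to_send < READY_LIMIT and awaiting_validation < VALIDATION_LIMIT
```

With a term like the second one, a flood of quick-returning overflow tasks would keep the splitters off even when the ready-to-send buffer is nearly empty - matching what the thread is seeing.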
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 14013
Credit: 208,696,464
RAC: 304
Australia
Message 2029149 - Posted: 25 Jan 2020, 9:48:23 UTC - in response to Message 2029147.  

The result generation rate for example was 6.3545, 61.8328, 77.8624, 78.4933, 91.5302 on those five updates.
Not when I was refreshing the Server status page; it just showed as 91 or so, with 6 ready to send, over that 30 min or so. The time stamp on the Server status page changed with each refresh, but the 'As of' time & the status numbers didn't.
*shrug*

It's broken, and I don't see it getting sorted till Monday - I don't see it being a quick fix. Let them have the weekend to relax and get stuck into it next week.
A chance for me to further reduce my power bill.
Grant
Darwin NT
ID: 2029149
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 14013
Credit: 208,696,464
RAC: 304
Australia
Message 2029150 - Posted: 25 Jan 2020, 9:53:52 UTC - in response to Message 2029148.  

If we're still running an extra check on overflow tasks, all those BLC35s will go straight into that category, and stay there a while.
Just picked up a few more WUs on one system- 80%+ BLC35s.
Grant
Darwin NT
ID: 2029150
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2029152 - Posted: 25 Jan 2020, 10:14:26 UTC
Last modified: 25 Jan 2020, 10:14:46 UTC

I'm trying to throw my high-replication tasks like WU 3835497267 back as quickly as possible, so they can start their purdah in the 24-hour purge queue as soon as possible. Perhaps Eric could lower that delay for the time being?
ID: 2029152
rob smith Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22941
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2029155 - Posted: 25 Jan 2020, 10:20:37 UTC

OK, it's my fault - I was planning on putting a couple more RPi on stream....
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2029155
Ville Saari

Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2029156 - Posted: 25 Jan 2020, 10:21:59 UTC - in response to Message 2029149.  

The result generation rate for example was 6.3545, 61.8328, 77.8624, 78.4933, 91.5302 on those five updates.
Not when i was refreshing the Server status page, it just showed as 91 or so with 6 ready to send over that 30min or so.
Ready to send was 13, 10, 0, 41, 41. Only on the last two of the five updates did it stay the same.
ID: 2029156
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 14013
Credit: 208,696,464
RAC: 304
Australia
Message 2029159 - Posted: 25 Jan 2020, 10:32:37 UTC - in response to Message 2029152.  

I'm trying to throw my high-replication tasks like WU 3835497267 back as quickly as possible, so they can start their purdah in the 24-hour purge queue as soon as possible. Perhaps Eric could lower that delay for the time being?
But first they have to be validated (and that backlog has barely shrunk), then assimilated (a backlog that is still growing as things do eventually get validated); only then can they be deleted, and then purged.
It's the initial "Results returned and awaiting validation" & then the "Workunits waiting for assimilation" queues that just aren't making any sort of dent in their backlogs at present. Once they clear, we'll see how well the Purger is or isn't coping (and it probably won't cope, as it's on Bruno, the upload server, which has been having issues for months now).
Grant
Darwin NT
ID: 2029159
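The lifecycle Grant walks through is strictly linear - a result cannot advance until the previous stage's backlog processes it, so a jam anywhere holds everything behind it. A minimal sketch (stage names follow the server status page; the code is illustrative, not BOINC's):

```python
# The result lifecycle as a simple linear pipeline. A result advances
# one stage at a time; "purged" is terminal.
PIPELINE = ["awaiting validation", "awaiting assimilation",
            "awaiting deletion", "awaiting purge", "purged"]

def next_state(state: str) -> str:
    """Advance a result one stage down the pipeline."""
    i = PIPELINE.index(state)
    return PIPELINE[min(i + 1, len(PIPELINE) - 1)]
```

This is why the validation and assimilation numbers trade off against each other in the thread: clearing one queue just feeds the next.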
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2029164 - Posted: 25 Jan 2020, 10:53:17 UTC - in response to Message 2029159.  

Well, at least that tie-breaker validated immediately, and knocked another five off the 'waiting for validation' list. Sure, there are more stages to complete - but it's on its way again now.
ID: 2029164
Ville Saari

Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2029168 - Posted: 25 Jan 2020, 11:53:36 UTC
Last modified: 25 Jan 2020, 12:11:25 UTC

I wonder how much the database would shrink if setiathome reduced its ridiculously long deadlines. My oldest task still waiting for validation was returned in October.

Astropulse was added a couple of years later, so its deadline is more reasonable, but the deadline of normal tasks is a relic from the nineties. Computers (even Raspberry Pis) are orders of magnitude faster now.

When tasks linger in the database for months, we eventually reach the point where maintaining those database rows has consumed more computing resources than the servers would have needed to crunch the tasks themselves.
ID: 2029168
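Ville's trade-off can be roughed out with Little's law: the average number of rows resident in a table equals the arrival rate times the average residence time. A sketch with illustrative numbers - the ~90 results/s figure echoes the generation rates quoted earlier in the thread, and the residence times are assumptions:

```python
# Little's law (L = lambda * W) applied to the pending-results table.
# All inputs below are illustrative assumptions, not measured values.
def pending_rows(results_per_second: float, residence_days: float) -> float:
    """Average table size implied by arrival rate x residence time."""
    return results_per_second * residence_days * 86_400  # seconds per day

# At ~90 results/s, halving average residence halves the table:
print(pending_rows(90, 56))  # ~435 million rows at 8 weeks residence
print(pending_rows(90, 28))  # ~218 million rows at 4 weeks residence
```

The exact numbers don't matter much; the point is that table size scales linearly with how long results are allowed to sit, which is what shorter deadlines would attack.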
Schatten

Joined: 12 Oct 02
Posts: 18
Credit: 14,047,388
RAC: 9
Germany
Message 2029171 - Posted: 25 Jan 2020, 12:08:24 UTC
Last modified: 25 Jan 2020, 12:13:02 UTC

Getting some workunits, but many of the VLARs are very short (I hope they really are, or I have a problem). That's a bit sad.

Disclaimer: I am using the new driver 20.1.3 and the updated apps (since the 21st of January 2020). I know that some invalid tasks will show up sooner or later from the time before I used the new apps. I am sorry for that. :-/
ID: 2029171
rob smith Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22941
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2029172 - Posted: 25 Jan 2020, 12:10:58 UTC - in response to Message 2029168.  

The size reduction wouldn't be that much.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2029172
Ville Saari

Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2029175 - Posted: 25 Jan 2020, 12:18:27 UTC - in response to Message 2029172.  

The size reduction wouldn't be that much.
56% of the tasks in my 'Validation pending' list are ones I returned over 1 week ago.
ID: 2029175


 
©2026 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.