Panic Mode On (109) Server Problems?

Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 1912674 - Posted: 13 Jan 2018, 0:16:27 UTC

Splitters still struggling. There was a brief boost, but not enough to top up the ready-to-send buffer, or even stop the decline - just slow it down for a bit.
About 3-3.5 hrs of work left at the current rate of consumption.
Grant
Darwin NT
ID: 1912674
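
The "3-3.5 hrs of work left" estimate above is just the ready-to-send buffer level divided by the net rate at which it is draining. A minimal sketch of that arithmetic in Python, with placeholder numbers standing in for the actual server-status readings:

```python
# Rough time-to-empty estimate for the ready-to-send buffer.
# All figures below are illustrative placeholders, not readings from the status page.

buffer_level = 120_000        # items currently in the ready-to-send buffer (assumed)
fill_rate_per_s = 30          # splitter output per second (assumed)
drain_rate_per_s = 40         # work leaving the buffer per second (assumed)

net_drain_per_s = drain_rate_per_s - fill_rate_per_s
if net_drain_per_s <= 0:
    print("Buffer is holding steady or growing at these rates.")
else:
    hours_left = buffer_level / net_drain_per_s / 3600
    print(f"~{hours_left:.1f} hours of work left at the current rates")
```
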
Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1912708 - Posted: 13 Jan 2018, 2:49:09 UTC

Here's a theory for y'all. Do you suppose Meltdown and Spectre patches have been applied to that server, possibly degrading performance?
ID: 1912708
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1912712 - Posted: 13 Jan 2018, 2:58:10 UTC - in response to Message 1912708.  

Here's a theory for y'all. Do you suppose Meltdown and Spectre patches have been applied to that server, possibly degrading performance?

At the back of my mind also .... great minds thinking alike and all :-}
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1912712
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 1912734 - Posted: 13 Jan 2018, 5:33:46 UTC
Last modified: 13 Jan 2018, 5:34:50 UTC

I notice that at the same time the Deleters cleared a huge backlog, the Splitters picked up their pace. They've since dropped their output again, but at least they're still producing enough to slowly build up the Ready-to-send buffer.
Deleter I/O affecting splitter I/O?
Grant
Darwin NT
ID: 1912734
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1646
Credit: 12,921,799
RAC: 89
New Zealand
Message 1912743 - Posted: 13 Jan 2018, 6:06:59 UTC - in response to Message 1912734.  

I notice that at the same time the Deleters cleared a huge backlog, the Splitters picked up their pace. They've since dropped their output again, but at least they're still producing enough to slowly build up the Ready-to-send buffer.
Deleter I/O affecting splitter I/O?

You could be onto something there Grant. When you say a "huge backlog" was cleared are you talking about work unit files/result files waiting to be deleted or were you referring to the DB purge? Currently sitting at 3.463 million results
ID: 1912743
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 1912749 - Posted: 13 Jan 2018, 6:29:29 UTC - in response to Message 1912743.  
Last modified: 13 Jan 2018, 6:29:58 UTC

You could be onto something there Grant. When you say a "huge backlog" was cleared are you talking about work unit files/result files waiting to be deleted or were you referring to the DB purge? Currently sitting at 3.463 million results

MB WU-awaiting-deletion went from 398,000 to 100,000 in 30min or less (hard to tell due to the scale of the graphs). At roughly that point in time, the splitters went from 35/s to over 60/s. WU-awaiting-deletion dropped slightly further, but since then has started climbing again. And as they have started climbing again, the splitter output has declined again (60/s, down to 50/s down to 30/s).
Hence my wild speculation that some of the splitter issues are related to I/O contention in the database/file storage.

Received-last-hour is still around 135,000. Used to be 90k or over was a shorty storm. Then 90k-100k became the new norm. Now 135k. Used to be the Replica could keep up after the outages, not any more. Often it's only a few minutes behind, now there are more frequent periods of 30min or more.
I/O bottleneck is my personal theory, be it security patch related, or just coming up on the limits of the present HDD based storage.
Grant
Darwin NT
ID: 1912749
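
One way to firm up the eyeballed link between the deletion backlog and splitter output would be to sample the server-status page over a day or two and compute a correlation coefficient. A minimal sketch, assuming the two series have already been logged to a CSV; the file name and column names here are hypothetical:

```python
# Quick check of the eyeballed "deletion backlog up, splitter rate down" relationship.
# Assumes you have been sampling the server-status page yourself and logging the two
# series to a CSV with columns: timestamp, wu_awaiting_deletion, splitter_rate_per_s.
# The file name and column names are hypothetical.

import csv
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

backlog, split_rate = [], []
with open("server_status_samples.csv", newline="") as f:
    for row in csv.DictReader(f):
        backlog.append(float(row["wu_awaiting_deletion"]))
        split_rate.append(float(row["splitter_rate_per_s"]))

r = pearson(backlog, split_rate)
print(f"correlation between deletion backlog and splitter rate: r = {r:+.2f}")
# A strongly negative r would support the I/O-contention hunch, though as noted
# later in the thread, correlation is not causation.
```
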
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 1912767 - Posted: 13 Jan 2018, 8:49:53 UTC

MB WU-awaiting-deletion on the rise, splitter output on the decline (below 30/s now).
About 5 hrs of work left in the Ready-to-send buffer at the present rate of its decline.
Grant
Darwin NT
ID: 1912767
Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1912773 - Posted: 13 Jan 2018, 9:48:15 UTC - in response to Message 1912712.  

Here's a theory for y'all. Do you suppose Meltdown and Spectre patches have been applied to that server, possibly degrading performance?
At the back of my mind also .... great minds thinking alike and all :-}
I heard Kevin Reed say that the World Community Grid servers had slowed by between 20% and 30% when they applied the patches. Fortunately, WCG had recently upgraded the hardware, so they had enough headroom - but they would have been struggling with the previous hardware.

Servers are different beasts from consumer PCs, and they do a different job.
ID: 1912773
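
The "enough headroom" point can be made concrete with a little arithmetic: losing 20-30% of throughput means the same workload needs roughly 1.25-1.4 times the capacity it needed before. A minimal sketch, with the pre-patch utilisation figure purely assumed:

```python
# How much headroom a 20-30% patch slowdown consumes, for an assumed starting point.
pre_patch_utilisation = 0.70   # assumed: server busy 70% of the time before patching

for slowdown in (0.20, 0.30):
    # Losing a fraction X of throughput means each unit of work now costs
    # 1 / (1 - X) as much capacity as it did before.
    post_patch_utilisation = pre_patch_utilisation / (1 - slowdown)
    print(f"{slowdown:.0%} slowdown -> ~{post_patch_utilisation:.0%} utilisation")
# 20% -> ~88%, 30% -> 100%: a box that was already near its limits tips over,
# while one with spare capacity just runs warmer.
```
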
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1912838 - Posted: 13 Jan 2018, 17:28:27 UTC - in response to Message 1912773.  

Here's a theory for y'all. Do you suppose Meltdown and Spectre patches have been applied to that server, possibly degrading performance?
At the back of my mind also .... great minds thinking alike and all :-}
I heard Kevin Reed say that the World Community Grid servers had slowed by between 20% and 30% when they applied the patches. Fortunately, WCG had recently upgraded the hardware, so they had enough headroom - but they would have been struggling with the previous hardware.

Servers are different beasts from consumer PCs, and they do a different job.

I have my suspicions too. After all, servers do LOTS of I/O transactions. In all the online tests I read, the most deleterious effect the patch had on server software was on apps that do a lot of I/O transactions.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1912838
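
The reason I/O-heavy server software shows the biggest hit is that the Meltdown mitigations add a roughly fixed cost to every kernel entry and exit, so the penalty scales with how many system calls the workload makes. A back-of-the-envelope sketch of that effect; every number here is an assumption for illustration, not a measurement from these servers:

```python
# Back-of-the-envelope: relative cost of a fixed per-syscall overhead added by
# KPTI-style mitigations. All figures are assumptions for illustration only.

added_cost_us = 1.0            # extra microseconds per kernel crossing (assumed)
workloads = {
    "CPU-bound app (few syscalls)":       100,      # syscalls per second (assumed)
    "database / file server (I/O heavy)": 200_000,  # syscalls per second (assumed)
}

for name, syscalls_per_s in workloads.items():
    extra_us_per_second = syscalls_per_s * added_cost_us
    overhead = extra_us_per_second / 1_000_000      # fraction of each second lost
    print(f"{name}: ~{overhead:.1%} of the time goes to mitigation overhead")
# The I/O-heavy case loses ~20% here while the CPU-bound one loses a rounding
# error, which is the same pattern the published benchmarks showed.
```
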
Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1912845 - Posted: 13 Jan 2018, 18:31:26 UTC

Back about a week ago, Eric wrote:
If we don't start building a queue I'll add more GBT splitters.
I wonder if that's still an option.
ID: 1912845
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1912847 - Posted: 13 Jan 2018, 18:42:23 UTC - in response to Message 1912845.  

Change the unused pfb splitters over to gbt splitters.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1912847
Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1912849 - Posted: 13 Jan 2018, 18:45:56 UTC - in response to Message 1912845.  

Back about a week ago, Eric wrote:
If we don't start building a queue I'll add more GBT splitters.
I wonder if that's still an option.
Although Eric attended the same teleconference with Kevin Reed, he joined us a few minutes late: Kevin told us about the slowdown during the general chit-chat before the start of the formal business (which was about something completely different), so Eric didn't hear the actual statement. But I expect he's found out about it by now.
ID: 1912849
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1912853 - Posted: 13 Jan 2018, 19:17:31 UTC

Thanks so much for the update, Richard. Sounds like a real concern that we hope Eric addresses soon.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1912853
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 1912875 - Posted: 13 Jan 2018, 21:29:27 UTC - in response to Message 1912845.  
Last modified: 13 Jan 2018, 21:40:54 UTC

Back about a week ago, Eric wrote:
If we don't start building a queue I'll add more GBT splitters.
I wonder if that's still an option.

I'm sure it is, but IMHO it would be better to sort out what is causing the slowdowns.

The current splitters are capable of sustaining 50+/s. Is the issue I/O contention? Looking at the graphs, there's a pretty strong correlation between deletion & splitting - but correlation isn't causation. Have the exploit patches even been applied yet? (i.e. if they haven't, then it sounds like things will be even worse than they are now.) Will more RAM in the servers involved help with more caching? Or will it require a move to flash-based storage? And if we make that move, will the current hardware running the queries be good enough to take advantage of that storage for some time to come, or will it quickly become the next bottleneck?

The Ready-to-send buffer seems to have settled around 100k for now. The splitters crank up, then fall over, crank up, fall over. Along with the deleters clearing the backlog, then losing ground, then clearing it, then losing ground.
Cause & effect or just another symptom?
*shrug*
Results & WU awaiting purge are also on the climb.

Received-last-hour is sitting at 142k (after being over 145k for some time). The servers really are working hard at present.
And it looks like we've just about finished off all those BLC05 files that were loaded in one batch.
Grant
Darwin NT
ID: 1912875
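
Putting the figures in the post above side by side suggests why the buffer hovers rather than refilling: treating the splitter rate and the received-last-hour figure as the same unit (a simplification; the status page's exact workunit/result accounting may differ), break-even lands right in the middle of the splitters' observed swing. A rough sketch:

```python
# Comparing consumption with splitter output, treating both as the same unit.
# This is a simplification: it ignores resends and any workunit/result accounting
# differences on the status page.

results_received_per_hour = 142_000                # figure quoted in the post above
splitter_low_per_s, splitter_high_per_s = 30, 60   # observed swing in splitter output

consumption_per_s = results_received_per_hour / 3600
print(f"~{consumption_per_s:.0f}/s being returned by hosts")
print(f"splitters swinging between {splitter_low_per_s}/s and {splitter_high_per_s}/s")
# Break-even (~39/s) sits inside that swing, so the buffer drifts up when the
# splitters peak and back down when they sag, instead of steadily rebuilding.
```
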
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1912885 - Posted: 13 Jan 2018, 23:12:06 UTC

I thought for sure I saw mention of them applying the security patch, and that Jeff Cobb was involved. But I can't find the post now and I might be imagining it.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1912885
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1646
Credit: 12,921,799
RAC: 89
New Zealand
Message 1912898 - Posted: 14 Jan 2018, 0:34:52 UTC - in response to Message 1912749.  

You could be onto something there Grant. When you say a "huge backlog" was cleared are you talking about work unit files/result files waiting to be deleted or were you referring to the DB purge? Currently sitting at 3.463 million results

MB WU-awaiting-deletion went from 398,000 to 100,000 in 30min or less (hard to tell due to the scale of the graphs). At roughly that point in time, the splitters went from 35/s to over 60/s. WU-awaiting-deletion dropped slightly further, but since then has started climbing again. And as they have started climbing again, the splitter output has declined again (60/s, down to 50/s down to 30/s).
Hence my wild speculation that some of the splitter issues are related to I/O contention in the database/file storage.

Received-last-hour is still around 135,000. Used to be 90k or over was a shorty storm. Then 90k-100k became the new norm. Now 135k. Used to be the Replica could keep up after the outages, not any more. Often it's only a few minutes behind, now there are more frequent periods of 30min or more.
I/O bottleneck is my personal theory, be it security patch related, or just coming up on the limits of the present HDD based storage.

There could be hope when they load some more tapes with the longer units on them. Until that happens, I guess 130-odd k will be the new average return per hour.
ID: 1912898
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 1912902 - Posted: 14 Jan 2018, 0:58:09 UTC - in response to Message 1912898.  

There could be hope when they load some more tapes with the longer units on them. Until that happens, I guess 130-odd k will be the new average return per hour.

If we get a batch of the longest-running WUs, I suspect it could be 90k or less.
My GPUs take 5 min 10 sec / 44 min to process these present WUs. The longer-running WUs take 8 min / 1 hr 15 min+ to process.
Grant
Darwin NT
ID: 1912902
Bill G (Special Project $75 donor)
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1912997 - Posted: 14 Jan 2018, 15:43:46 UTC - in response to Message 1912902.  

Looks like the Replica is falling further and further behind. What is coming next.........

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours
ID: 1912997
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51527
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1913016 - Posted: 14 Jan 2018, 17:55:27 UTC

Just got word from Eric that he's gonna try to add a couple more GBT splitters.
Meow!
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1913016
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51527
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1913019 - Posted: 14 Jan 2018, 18:02:52 UTC - in response to Message 1913017.  

Just got word from Eric that he's gonna try to add a couple more GBT splitters.
Meow!

Well, that's good, but won't help if they don't add more files to split soon......
It doesn't bother me much though, as I'm doing 100% Beta for a while.

It would not surprise me if Eric took care of adding more splitter cache at the same time.
We all know there is tons of it to work on.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1913019