Panic Mode On (109) Server Problems?

Message boards : Number crunching : Panic Mode On (109) Server Problems?

Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1912610 - Posted: 12 Jan 2018, 20:17:22 UTC

That's fine and works if you only have one main project at a time. But if you run multiple projects at the same time, it does not.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1912610 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1912619 - Posted: 12 Jan 2018, 21:07:52 UTC - in response to Message 1912599.  

oh, and if you forget to change 0.1 back to 10 when you move from Einstein to Seti, you figure it out quickly as you don't get the full allotment of 100 work units per GPU/CPU.....
... And if you forget to change 4.0+0.01 when changing to Einstein (with RS <> 0), you find out E@H doesn't have a 100-task limit ... the last time I did that I turned my back for a few minutes and had, IIRC, 736 tasks ... way, way over-committed!
ID: 1912619 · Report as offensive
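
For reference, the "0.1 / 10" and "4.0 + 0.01" values being juggled above are presumably BOINC's work-buffer preferences ("store at least N days of work" plus "store up to an additional N days"). A minimal sketch of why a roughly 4-day buffer over-commits a host on a project with no per-host task limit; the throughput figure and the resulting task counts below are illustrative assumptions, not numbers taken from the posts:

    # Rough sketch of why a 4.0 + 0.01 day work buffer over-commits a host on a
    # project with no per-host task limit. All numbers are illustrative.

    def tasks_requested(buf_min_days, buf_extra_days, tasks_per_day, server_limit=None):
        """Approximate cache size the client will try to hold."""
        wanted = (buf_min_days + buf_extra_days) * tasks_per_day
        if server_limit is not None:
            wanted = min(wanted, server_limit)
        return round(wanted)

    # SETI@home-style project: the server caps the cache at 100 tasks per device,
    # so even a 10-day buffer setting only ever fills to the limit.
    print(tasks_requested(0.1, 10, tasks_per_day=180, server_limit=100))  # -> 100

    # Einstein@Home-style project: no 100-task cap, so ~4 days of work at the
    # same (assumed) throughput is several hundred tasks.
    print(tasks_requested(4.0, 0.01, tasks_per_day=180))                  # -> 722

With a per-device server limit the buffer setting barely matters; without one, a few days of buffer on a fast host balloons into hundreds of tasks, which is the trap described above.
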
Profile betreger Project Donor
Avatar

Send message
Joined: 29 Jun 99
Posts: 11358
Credit: 29,581,041
RAC: 66
United States
Message 1912621 - Posted: 12 Jan 2018, 21:11:48 UTC - in response to Message 1912619.  

oh, and if you forget to change 0.1 back to 10 when you move from Einstein to Seti, you figure it out quickly as you don't get the full allotment of 100 work units per GPU/CPU.....
... And if you forget to change 4.0+0.01 when changing to Einstein (with RS <> 0), you find out E@H doesn't have a 100-task limit ... the last time I did that I turned my back for a few minutes and had, IIRC, 736 tasks ... way, way over-committed!

Yep you gotta be careful, very dangerous after the cocktail hour.
ID: 1912621 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1912629 - Posted: 12 Jan 2018, 21:55:06 UTC - in response to Message 1912571.  

I kept threatening, mainly toggling preferences and the Triple Update. Haven't resorted to kicking the server. Cache is full right now. Not going to do anything. Will have to see where I am at in the morning. Calling it a night.


. . I hope you got a good night's sleep :)

Stephen

:)

Down about 80 tasks in the caches. Doesn't help that overnight all the machines eventually sync up their work-request timings. Never figured out why; something to do with BoincTasks monitoring and server-set GPU backoffs or whatever. A triple update on all machines, staggered by a minute, got everyone full.

I've been seeing the same on my Linux machines today. Similar to a rolling Blackout. The Server will stop sending the tasks requested by the Client and just send a few tasks at random. Once the Host is down by around 100 tasks the Server will recover and fill the cache. A while later the same will happen on a different machine. The current victim is down by about 70 tasks and just received 5 new tasks instead of the 70 or so the client is requesting. The 3 update routine hasn't had any effect so far. The cache should be around 220 on this machine, https://setiathome.berkeley.edu/results.php?hostid=6906726&offset=140
ID: 1912629 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1912630 - Posted: 12 Jan 2018, 21:55:48 UTC - in response to Message 1912619.  

oh, and if you forget to change 0.1 back to 10 when you move from Einstein to Seti, you figure it out quickly as you don't get the full allotment of 100 work units per GPU/CPU.....
... And if you forget to change 4.0+0.01 when changing to Einstein (with RS <> 0), you find out E@H doesn't have a 100-task limit ... the last time I did that I turned my back for a few minutes and had, IIRC, 736 tasks ... way, way over-committed!


Ha! LOL. Been there ..... done that. I have you beat. I forgot to switch to NNT for an hour once. Accumulated over 5000 tasks. Couldn't even abort them all in one shot and had to take whacks at a couple of hundred at a time.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1912630 · Report as offensive
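
Aborting thousands of tasks a couple of hundred at a time is the sort of job a script handles better than the Manager. A rough sketch using boinccmd: it assumes boinccmd is on the PATH and that --get_tasks prints its usual "name:" / "project URL:" lines, and the "einstein" filter at the bottom is a purely hypothetical example; check what it selects before aborting anything for real.

    # Sketch: abort queued tasks in batches via boinccmd instead of clicking
    # through the Manager. The parsing assumes the typical --get_tasks output
    # format; verify (and filter!) the task list before running this in anger.

    import subprocess

    def list_tasks():
        out = subprocess.run(["boinccmd", "--get_tasks"],
                             capture_output=True, text=True, check=True).stdout
        tasks, name = [], None
        for line in out.splitlines():
            line = line.strip()
            if line.startswith("name:"):
                name = line.split(":", 1)[1].strip()
            elif line.startswith("project URL:") and name:
                tasks.append((line.split(":", 1)[1].strip(), name))
                name = None
        return tasks

    def abort_in_batches(tasks, batch_size=200):
        for i in range(0, len(tasks), batch_size):
            for url, name in tasks[i:i + batch_size]:
                subprocess.run(["boinccmd", "--task", url, name, "abort"], check=True)
            print(f"aborted {min(i + batch_size, len(tasks))}/{len(tasks)}")

    if __name__ == "__main__":
        todo = [t for t in list_tasks() if "einstein" in t[0]]  # hypothetical filter
        abort_in_batches(todo)
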
Profile Zalster Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1912635 - Posted: 12 Jan 2018, 22:03:52 UTC - in response to Message 1912621.  

oh, and if you forget to change 0.1 back to 10 when you move from Einstein to Seti, you figure it out quickly as you don't get the full allotment of 100 work units per GPU/CPU.....
... And if you forget to change 4.0+0.01 when changing to Einstein (with RS <> 0), you find out E@H doesn't have a 100-task limit ... the last time I did that I turned my back for a few minutes and had, IIRC, 736 tasks ... way, way over-committed!

Yep you gotta be careful, very dangerous after the cocktail hour.


Dang, I hate when that happens...Wait..Hold this....
ID: 1912635 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1912637 - Posted: 12 Jan 2018, 22:07:43 UTC

The splitter output has fallen even further.
They were good for 50+/s, then output dropped to around 42/s, and now they're struggling to provide 30/s.
That's about 108,000 per hour. Unfortunately, current demand is 130,000/hr minimum, averaging around 135,000. We need at least 39/s to meet peak demand (~140,000/hr) and keep a ready-to-send buffer with the present load.
In a few hours there will be no work left in the ready-to-send buffer & caches will start to run down (more than they normally do) and not get refilled till the splitter output recovers.

I think Eric might need to do some further splitter troubleshooting. Or it could be related to the general server system malaise: the Replica keeps dropping behind, and the WU deleters likewise can't keep up.
Grant
Darwin NT
ID: 1912637 · Report as offensive
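
A quick sanity check of the arithmetic above, as a sketch: converting the splitter rate to an hourly output, the minimum rate needed for peak demand, and roughly how long a ready-to-send buffer lasts at a given deficit. The rates come from the post; the buffer size is a hypothetical figure for illustration only.

    # Splitter-rate arithmetic. Rates per the post above; the 100,000-task
    # ready-to-send buffer is a hypothetical figure, not a server-status value.

    def per_hour(rate_per_sec):
        return rate_per_sec * 3600

    def min_rate_for_demand(demand_per_hour):
        return demand_per_hour / 3600

    def hours_until_empty(buffer_tasks, production_per_hour, demand_per_hour):
        deficit = demand_per_hour - production_per_hour
        return float("inf") if deficit <= 0 else buffer_tasks / deficit

    print(per_hour(30))                                             # 108000 tasks/hr at 30/s
    print(round(min_rate_for_demand(140_000), 1))                   # ~38.9/s to cover peak demand
    print(round(hours_until_empty(100_000, 108_000, 135_000), 1))   # ~3.7 hrs until the buffer empties
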
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1912639 - Posted: 12 Jan 2018, 22:11:43 UTC - in response to Message 1912637.  

Why do these things always happen on a Friday? TGIF cocktail hours? Oops, 5:10 PM, I'm late for the first one.
ID: 1912639 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1912674 - Posted: 13 Jan 2018, 0:16:27 UTC

Splitters still struggling. There was a brief boost, but not enough to top up the ready-to-send buffer, or even stop the decline, just slow it down for a bit.
About 3-3.5hrs work left at the current rate of consumption.
Grant
Darwin NT
ID: 1912674 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1912708 - Posted: 13 Jan 2018, 2:49:09 UTC

Here's a theory for y'all. Do you suppose Meltdown and Spectre patches have been applied to that server, possibly degrading performance?
ID: 1912708 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1912712 - Posted: 13 Jan 2018, 2:58:10 UTC - in response to Message 1912708.  

Here's a theory for y'all. Do you suppose Meltdown and Spectre patches have been applied to that server, possibly degrading performance?

At the back of my mind also .... great minds thinking alike and all :-}
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1912712 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1912734 - Posted: 13 Jan 2018, 5:33:46 UTC
Last modified: 13 Jan 2018, 5:34:50 UTC

I notice that at the same time the Deleters cleared a huge backlog, the Splitters picked up their pace. They've since dropped their output again, but at least they're still producing enough to slowly build up the Ready-to-send buffer.
Deleter I/O affecting splitter I/O?
Grant
Darwin NT
ID: 1912734 · Report as offensive
Speedy
Volunteer tester
Avatar

Send message
Joined: 26 Jun 04
Posts: 1639
Credit: 12,921,799
RAC: 89
New Zealand
Message 1912743 - Posted: 13 Jan 2018, 6:06:59 UTC - in response to Message 1912734.  

I notice that at the same time the Deleters cleared a huge backlog, the Splitters picked up their pace. They've since dropped their output again, but at least they're still producing enough to slowly build up the Ready-to-send buffer.
Deleter I/O affecting splitter I/O?

You could be onto something there Grant. When you say a "huge backlog" was cleared are you talking about work unit files/result files waiting to be deleted or were you referring to the DB purge? Currently sitting at 3.463 million results
ID: 1912743 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1912749 - Posted: 13 Jan 2018, 6:29:29 UTC - in response to Message 1912743.  
Last modified: 13 Jan 2018, 6:29:58 UTC

You could be onto something there Grant. When you say a "huge backlog" was cleared are you talking about work unit files/result files waiting to be deleted or were you referring to the DB purge? Currently sitting at 3.463 million results

MB WU-awaiting-deletion went from 398,000 to 100,000 in 30 min or less (hard to tell due to the scale of the graphs). At roughly that point in time, the splitters went from 35/s to over 60/s. WU-awaiting-deletion dropped slightly further, but has since started climbing again, and as it has climbed, the splitter output has declined again (60/s, down to 50/s, down to 30/s).
Hence my wild speculation that some of the splitter issues are related to I/O contention in the database/file storage.

Received-last-hour is still around 135,000. It used to be that 90k or over meant a shorty storm; then 90k-100k became the new norm; now it's 135k. The Replica used to be able to keep up after the outages, but not any more. It's often only a few minutes behind, but there are now more frequent periods of 30 min or more.
An I/O bottleneck is my personal theory, be it security-patch related or just coming up against the limits of the present HDD-based storage.
Grant
Darwin NT
ID: 1912749 · Report as offensive
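
If the deleter backlog and the splitters really are fighting over the same I/O, the two series should move in opposite directions. A minimal sketch of how one could test that from periodically sampled server-status figures; the sample values below are made up purely for illustration:

    # Quick check of the I/O-contention hypothesis: if WU-awaiting-deletion and
    # the splitter rate compete for the same disk I/O, their samples should be
    # negatively correlated. Sample data below are invented for illustration.

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    # Hypothetical samples, one every 30 minutes:
    deletion_backlog = [398_000, 250_000, 100_000, 120_000, 180_000, 260_000]
    splitter_rate    = [35, 48, 62, 58, 45, 32]   # WUs split per second

    print(f"correlation: {pearson(deletion_backlog, splitter_rate):.2f}")
    # A strongly negative value would be consistent with the I/O-contention theory.
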
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1912767 - Posted: 13 Jan 2018, 8:49:53 UTC

MB WU-awaiting-deletion on the rise, splitter output on the decline (below 30/s now).
About 5 hrs of work left in the Ready-to-send buffer at the present rate of its decline.
Grant
Darwin NT
ID: 1912767 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1912773 - Posted: 13 Jan 2018, 9:48:15 UTC - in response to Message 1912712.  

Here's a theory for y'all. Do you suppose Meltdown and Spectre patches have been applied to that server, possibly degrading performance?
At the back of my mind also .... great minds thinking alike and all :-}
I heard Kevin Reed say that the World Community Grid servers had slowed by between 20% and 30% when they applied the patches. Fortunately, WCG had recently upgraded the hardware, so they had enough headroom - but they would have been struggling with the previous hardware.

Servers are different beasts from consumer PCs, and they do a different job.
ID: 1912773 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1912838 - Posted: 13 Jan 2018, 17:28:27 UTC - in response to Message 1912773.  

Here's a theory for y'all. Do you suppose Meltdown and Spectre patches have been applied to that server, possibly degrading performance?
At the back of my mind also .... great minds thinking alike and all :-}
I heard Kevin Reed say that the World Community Grid servers had slowed by between 20% and 30% when they applied the patches. Fortunately, WCG had recently upgraded the hardware, so they had enough headroom - but they would have been struggling with the previous hardware.

Servers are different beasts from consumer PCs, and they do a different job.

I have my suspicions too. After all, servers do LOTS of I/O transactions. In all the online tests I read, the most deleterious effect the patch had on server software was on any app that used a lot of I/O transactions.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1912838 · Report as offensive
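
The reason I/O-heavy server software takes the biggest hit is that the Meltdown mitigation (KPTI) makes every user/kernel transition more expensive, and small-transaction I/O is nothing but syscalls. A crude sketch of how one might gauge that overhead on a test box: time a syscall-heavy loop with the mitigation enabled and then disabled (for example by booting a test machine with the pti=off kernel parameter) and compare. Illustrative only, not a substitute for a real load test.

    # Crude micro-benchmark of syscall overhead: lots of tiny reads and writes
    # (POSIX os.pread/os.pwrite), roughly the access pattern of a busy database
    # or file server. Run it with the Meltdown mitigation on and off and compare.

    import os
    import tempfile
    import time

    def syscall_heavy(iterations=200_000, chunk=64):
        fd, path = tempfile.mkstemp()
        try:
            buf = b"x" * chunk
            start = time.perf_counter()
            for _ in range(iterations):
                os.pwrite(fd, buf, 0)    # one write syscall ...
                os.pread(fd, chunk, 0)   # ... and one read syscall per iteration
            return time.perf_counter() - start
        finally:
            os.close(fd)
            os.unlink(path)

    if __name__ == "__main__":
        print(f"{syscall_heavy():.2f} s for 400k small I/O syscalls")
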
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1912845 - Posted: 13 Jan 2018, 18:31:26 UTC

Back about a week ago, Eric wrote:
If we don't start building a queue I'll add more GBT splitters.
I wonder if that's still an option.
ID: 1912845 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1912847 - Posted: 13 Jan 2018, 18:42:23 UTC - in response to Message 1912845.  

Change the unused pfb splitters over to gbt splitters.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1912847 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1912849 - Posted: 13 Jan 2018, 18:45:56 UTC - in response to Message 1912845.  

Back about a week ago, Eric wrote:
If we don't start building a queue I'll add more GBT splitters.
I wonder if that's still an option.
Although Eric attended the same teleconference with Kevin Reed, Eric joined us a few minutes late: Kevin told us about the slowdown in the general chit-chat before the start of the formal business (which was about something completely different), so Eric didn't hear the actual statement. But I expect he's found out about it by now.
ID: 1912849 · Report as offensive