Panic Mode On (109) Server Problems?

Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 1912674 - Posted: 13 Jan 2018, 0:16:27 UTC

Splitters still struggling. There was a brief boost, but not enough to top up the ready-to-send buffer, or even stop the decline - just slow it down for a bit.
About 3-3.5 hrs of work left at the current rate of consumption.
Grant
Darwin NT
ID: 1912674
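
The "3-3.5 hrs of work left" estimate above is just the ready-to-send buffer level divided by the net rate at which it is draining. A minimal sketch of that arithmetic in Python, with placeholder numbers standing in for the actual server-status readings:

```python
# Rough time-to-empty estimate for the ready-to-send buffer.
# All figures below are illustrative placeholders, not readings from the status page.

buffer_level = 120_000        # items currently in the ready-to-send buffer (assumed)
fill_rate_per_s = 30          # splitter output per second (assumed)
drain_rate_per_s = 40         # work leaving the buffer per second (assumed)

net_drain_per_s = drain_rate_per_s - fill_rate_per_s
if net_drain_per_s <= 0:
    print("Buffer is holding steady or growing at these rates.")
else:
    hours_left = buffer_level / net_drain_per_s / 3600
    print(f"~{hours_left:.1f} hours of work left at the current rates")
```
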
Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1912708 - Posted: 13 Jan 2018, 2:49:09 UTC

Here's a theory for y'all. Do you suppose Meltdown and Spectre patches have been applied to that server, possibly degrading performance?
ID: 1912708
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1912712 - Posted: 13 Jan 2018, 2:58:10 UTC - in response to Message 1912708.  

Here's a theory for y'all. Do you suppose Meltdown and Spectre patches have been applied to that server, possibly degrading performance?

At the back of my mind also .... great minds thinking alike and all :-}
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1912712
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 1912734 - Posted: 13 Jan 2018, 5:33:46 UTC
Last modified: 13 Jan 2018, 5:34:50 UTC

I notice that at the same time the Deleters cleared a huge backlog, the Splitters picked up their pace. They've since dropped their output again, but at least they're still producing enough to slowly build up the Ready-to-send buffer.
Deleter I/O affecting splitter I/O?
Grant
Darwin NT
ID: 1912734
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1646
Credit: 12,921,799
RAC: 89
New Zealand
Message 1912743 - Posted: 13 Jan 2018, 6:06:59 UTC - in response to Message 1912734.  

I notice that at the same time the Deleters cleared a huge backlog, the Splitters picked up their pace. They've since dropped their output again, but at least they're still producing enough to slowly build up the Ready-to-send buffer.
Deleter I/O affecting splitter I/O?

You could be onto something there Grant. When you say a "huge backlog" was cleared are you talking about work unit files/result files waiting to be deleted or were you referring to the DB purge? Currently sitting at 3.463 million results
ID: 1912743
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 1912749 - Posted: 13 Jan 2018, 6:29:29 UTC - in response to Message 1912743.  
Last modified: 13 Jan 2018, 6:29:58 UTC

You could be onto something there Grant. When you say a "huge backlog" was cleared are you talking about work unit files/result files waiting to be deleted or were you referring to the DB purge? Currently sitting at 3.463 million results

MB WU-awaiting-deletion went from 398,000 to 100,000 in 30min or less (hard to tell due to the scale of the graphs). At roughly that point in time, the splitters went from 35/s to over 60/s. WU-awaiting-deletion dropped slightly further, but since then has started climbing again. And as they have started climbing again, the splitter output has declined again (60/s, down to 50/s down to 30/s).
Hence my wild speculation that some of the splitter issues are related to I/O contention in the database/file storage.

Received-last-hour is still around 135,000. Used to be 90k or over was a shorty storm. Then 90k-100k became the new norm. Now 135k. Used to be the Replica could keep up after the outages, not any more. Often it's only a few minutes behind, now there are more frequent periods of 30min or more.
I/O bottleneck is my personal theory, be it security patch related, or just coming up on the limits of the present HDD based storage.
Grant
Darwin NT
ID: 1912749
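
One way to firm up the eyeballed link between the deletion backlog and splitter output would be to sample the server-status page over a day or two and compute a correlation coefficient. A minimal sketch, assuming the two series have already been logged to a CSV; the file name and column names here are hypothetical:

```python
# Quick check of the eyeballed "deletion backlog up, splitter rate down" relationship.
# Assumes you have been sampling the server-status page yourself and logging the two
# series to a CSV with columns: timestamp, wu_awaiting_deletion, splitter_rate_per_s.
# The file name and column names are hypothetical.

import csv
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

backlog, split_rate = [], []
with open("server_status_samples.csv", newline="") as f:
    for row in csv.DictReader(f):
        backlog.append(float(row["wu_awaiting_deletion"]))
        split_rate.append(float(row["splitter_rate_per_s"]))

r = pearson(backlog, split_rate)
print(f"correlation between deletion backlog and splitter rate: r = {r:+.2f}")
# A strongly negative r would support the I/O-contention hunch, though as noted
# later in the thread, correlation is not causation.
```
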
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 1912767 - Posted: 13 Jan 2018, 8:49:53 UTC

MB WU-awaiting-deletion on the rise, splitter output on the decline (below 30/s now).
About 5 hrs of work left in the Ready-to-send buffer at the present rate of its decline.
Grant
Darwin NT
ID: 1912767
Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1912773 - Posted: 13 Jan 2018, 9:48:15 UTC - in response to Message 1912712.  

Here's a theory for y'all. Do you suppose Meltdown and Spectre patches have been applied to that server, possibly degrading performance?
At the back of my mind also .... great minds thinking alike and all :-}
I heard Kevin Reed say that the World Community Grid servers had slowed by between 20% and 30% when they applied the patches. Fortunately, WCG had recently upgraded the hardware, so they had enough headroom - but they would have been struggling with the previous hardware.

Servers are different beasts from consumer PCs, and they do a different job.
ID: 1912773
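
The "enough headroom" point can be made concrete with a little arithmetic: losing 20-30% of throughput means the same workload needs roughly 1.25-1.4 times the capacity it needed before. A minimal sketch, with the pre-patch utilisation figure purely assumed:

```python
# How much headroom a 20-30% patch slowdown consumes, for an assumed starting point.
pre_patch_utilisation = 0.70   # assumed: server busy 70% of the time before patching

for slowdown in (0.20, 0.30):
    # Losing a fraction X of throughput means each unit of work now costs
    # 1 / (1 - X) as much capacity as it did before.
    post_patch_utilisation = pre_patch_utilisation / (1 - slowdown)
    print(f"{slowdown:.0%} slowdown -> ~{post_patch_utilisation:.0%} utilisation")
# 20% -> ~88%, 30% -> 100%: a box that was already near its limits tips over,
# while one with spare capacity just runs warmer.
```
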
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1912838 - Posted: 13 Jan 2018, 17:28:27 UTC - in response to Message 1912773.  

Here's a theory for y'all. Do you suppose Meltdown and Spectre patches have been applied to that server, possibly degrading performance?
At the back of my mind also .... great minds thinking alike and all :-}
I heard Kevin Reed say that the World Community Grid servers had slowed by between 20% and 30% when they applied the patches. Fortunately, WCG had recently upgraded the hardware, so they had enough headroom - but they would have been struggling with the previous hardware.

Servers are different beasts from consumer PCs, and they do a different job.

I have my suspicions too. After all, servers do LOTS of I/O transactions. In all the online tests I read, the most deleterious effect the patch had on server software was on apps that do a lot of I/O transactions.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1912838
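
The reason I/O-heavy server software shows the biggest hit is that the Meltdown mitigations add a roughly fixed cost to every kernel entry and exit, so the penalty scales with how many system calls the workload makes. A back-of-the-envelope sketch of that effect; every number here is an assumption for illustration, not a measurement from these servers:

```python
# Back-of-the-envelope: relative cost of a fixed per-syscall overhead added by
# KPTI-style mitigations. All figures are assumptions for illustration only.

added_cost_us = 1.0            # extra microseconds per kernel crossing (assumed)
workloads = {
    "CPU-bound app (few syscalls)":       100,      # syscalls per second (assumed)
    "database / file server (I/O heavy)": 200_000,  # syscalls per second (assumed)
}

for name, syscalls_per_s in workloads.items():
    extra_us_per_second = syscalls_per_s * added_cost_us
    overhead = extra_us_per_second / 1_000_000      # fraction of each second lost
    print(f"{name}: ~{overhead:.1%} of the time goes to mitigation overhead")
# The I/O-heavy case loses ~20% here while the CPU-bound one loses a rounding
# error, which is the same pattern the published benchmarks showed.
```
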
Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1912845 - Posted: 13 Jan 2018, 18:31:26 UTC

Back about a week ago, Eric wrote:
If we don't start building a queue I'll add more GBT splitters.
I wonder if that's still an option.
ID: 1912845
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1912847 - Posted: 13 Jan 2018, 18:42:23 UTC - in response to Message 1912845.  

Change the unused pfb splitters over to gbt splitters.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1912847
Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1912849 - Posted: 13 Jan 2018, 18:45:56 UTC - in response to Message 1912845.  

Back about a week ago, Eric wrote:
If we don't start building a queue I'll add more GBT splitters.
I wonder if that's still an option.
Although Eric attended the same teleconference with Kevin Reed, he joined us a few minutes late: Kevin told us about the slowdown during the general chit-chat before the start of the formal business (which was about something completely different), so Eric didn't hear the actual statement. But I expect he's found out about it by now.
ID: 1912849
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1912853 - Posted: 13 Jan 2018, 19:17:31 UTC

Thanks so much for the update, Richard. Sounds like a real concern that we hope Eric addresses soon.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1912853
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 1912875 - Posted: 13 Jan 2018, 21:29:27 UTC - in response to Message 1912845.  
Last modified: 13 Jan 2018, 21:40:54 UTC

Back about a week ago, Eric wrote:
If we don't start building a queue I'll add more GBT splitters.
I wonder if that's still an option.

I'm sure it is, but IMHO it would be better to sort out what is causing the slowdowns.

The current splitters are capable of sustaining 50+/s. Is the issue I/O contention? Looking at the graphs, there's a pretty strong correlation between deletion & splitting - but correlation isn't causation. Have the exploit patches even been applied yet? (i.e. if they haven't, then it sounds like things will be even worse than they are now.) Will more RAM in the servers involved help with more caching? Or will it require a move to flash-based storage? And if we make that move, will the current hardware running the queries be good enough to take advantage of that storage for some time to come, or will it quickly become the next bottleneck?

The Ready-to-send buffer seems to have settled around 100k for now. The splitters crank up, then fall over, crank up, fall over. Along with the deleters clearing the backlog, then losing ground, then clearing it, then losing ground.
Cause & effect or just another symptom?
*shrug*
Results & WU awaiting purge are also on the climb.

Received-last-hour is sitting at 142k (after being over 145k for some time). The servers really are working hard at present.
And it looks like we've just about finished off all those BLC05 files that were loaded in one batch.
Grant
Darwin NT
ID: 1912875
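
Putting the figures in the post above side by side suggests why the buffer hovers rather than refilling: treating the splitter rate and the received-last-hour figure as the same unit (a simplification; the status page's exact workunit/result accounting may differ), break-even lands right in the middle of the splitters' observed swing. A rough sketch:

```python
# Comparing consumption with splitter output, treating both as the same unit.
# This is a simplification: it ignores resends and any workunit/result accounting
# differences on the status page.

results_received_per_hour = 142_000                # figure quoted in the post above
splitter_low_per_s, splitter_high_per_s = 30, 60   # observed swing in splitter output

consumption_per_s = results_received_per_hour / 3600
print(f"~{consumption_per_s:.0f}/s being returned by hosts")
print(f"splitters swinging between {splitter_low_per_s}/s and {splitter_high_per_s}/s")
# Break-even (~39/s) sits inside that swing, so the buffer drifts up when the
# splitters peak and back down when they sag, instead of steadily rebuilding.
```
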
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1912885 - Posted: 13 Jan 2018, 23:12:06 UTC

I thought for sure I saw mention of them applying the security patch, and that Jeff Cobb was involved. But I can't find the post now and I might be imagining it.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1912885
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1646
Credit: 12,921,799
RAC: 89
New Zealand
Message 1912898 - Posted: 14 Jan 2018, 0:34:52 UTC - in response to Message 1912749.  

You could be onto something there Grant. When you say a "huge backlog" was cleared are you talking about work unit files/result files waiting to be deleted or were you referring to the DB purge? Currently sitting at 3.463 million results

MB WU-awaiting-deletion went from 398,000 to 100,000 in 30min or less (hard to tell due to the scale of the graphs). At roughly that point in time, the splitters went from 35/s to over 60/s. WU-awaiting-deletion dropped slightly further, but since then has started climbing again. And as they have started climbing again, the splitter output has declined again (60/s, down to 50/s down to 30/s).
Hence my wild speculation that some of the splitter issues are related to I/O contention in the database/file storage.

Received-last-hour is still around 135,000. Used to be 90k or over was a shorty storm. Then 90k-100k became the new norm. Now 135k. Used to be the Replica could keep up after the outages, not any more. Often it's only a few minutes behind, now there are more frequent periods of 30min or more.
I/O bottleneck is my personal theory, be it security patch related, or just coming up on the limits of the present HDD based storage.

There could be hope when they load some more tapes with the longer units on them. Until that happens, I guess 130-odd k will be the new average return per hour.
ID: 1912898
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 1912902 - Posted: 14 Jan 2018, 0:58:09 UTC - in response to Message 1912898.  

There could be hope when they load some more tapes with the longer units on them. Until that happens, I guess 130-odd k will be the new average return per hour.

If we get a batch of the longest-running WUs, I suspect it could be 90k or less.
My GPUs take 5 min 10 sec / 44 min to process these present WUs. The longer-running WUs take 8 min / 1 hr 15 min+ to process.
Grant
Darwin NT
ID: 1912902
Bill G (Special Project $75 donor)
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1912997 - Posted: 14 Jan 2018, 15:43:46 UTC - in response to Message 1912902.  

Looks like the Replica is falling further and further behind. What is coming next.........

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours
ID: 1912997
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51527
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1913016 - Posted: 14 Jan 2018, 17:55:27 UTC

Just got word from Eric that he's gonna try to add a couple more GBT splitters.
Meow!
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1913016
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51527
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1913019 - Posted: 14 Jan 2018, 18:02:52 UTC - in response to Message 1913017.  

Just got word from Eric that he's gonna try to add a couple more GBT splitters.
Meow!

Well, that's good, but won't help if they don't add more files to split soon......
It doesn't bother me much though, as I'm doing 100% Beta for a while.

It would not surprise me if Eric took care of adding more splitter cache at the same time.
We all know there is tons of it to work on.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1913019