Panic Mode On (114) Server Problems?

Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 1979786 - Posted: 10 Feb 2019, 22:31:05 UTC - in response to Message 1979774.  

They started to reduce the backlog of WU and result deletions again when they fixed Georgem. Anytime the deleters crank up, you can't get any work. Down to less than a dozen GPU tasks now on one host.


Results out in the field have dropped to 4.7 million; usually it is closer to 5 million. I think it is better than the system crashing, but it is bothersome when machine caches run dry.
ID: 1979786 · Report as offensive
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1979796 - Posted: 10 Feb 2019, 23:28:27 UTC

Oh well. Machines are running out of work. I just shifted one to Beta. it's nice that BETA doesn't seem to have this problem.....oh wait.
Maybe they should look at Why BETA doesn't have this problem? It might be helpful determining Why BETA doesn't have this problem...
ID: 1979796 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1979802 - Posted: 10 Feb 2019, 23:56:51 UTC - in response to Message 1979796.  

Oh well. Machines are running out of work. I just shifted one to Beta. it's nice that BETA doesn't seem to have this problem.....oh wait.
Maybe they should look at Why BETA doesn't have this problem? It might be helpful determining Why BETA doesn't have this problem...


. . May I suggest because Beta handles a mere fraction of the traffic through main?

Stephen

ID: 1979802 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1979803 - Posted: 10 Feb 2019, 23:58:39 UTC

Panic ... Warning ... INCOMING !!!!
ID: 1979803 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1979804 - Posted: 11 Feb 2019, 0:03:25 UTC - in response to Message 1979803.  

Splitters are offline now and RTS buffer at 760K.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1979804 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1849
Credit: 268,616,081
RAC: 1,349
United States
Message 1979806 - Posted: 11 Feb 2019, 0:07:40 UTC - in response to Message 1979802.  

Oh well. Machines are running out of work. I just shifted one to Beta. it's nice that BETA doesn't seem to have this problem.....oh wait.
Maybe they should look at Why BETA doesn't have this problem? It might be helpful determining Why BETA doesn't have this problem...


. . May I suggest because Beta handles a mere fraction of the traffic through main?

Stephen


Math. What a concept! :)
ID: 1979806 · Report as offensive
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1979808 - Posted: 11 Feb 2019, 0:08:10 UTC - in response to Message 1979802.  
Last modified: 11 Feb 2019, 0:11:49 UTC

Oh well. Machines are running out of work. I just shifted one to Beta. it's nice that BETA doesn't seem to have this problem.....oh wait.
Maybe they should look at Why BETA doesn't have this problem? It might be helpful determining Why BETA doesn't have this problem...


. . May I suggest because Beta handles a mere fraction of the traffic through main?

Stephen


If you had thought about it first, you probably wouldn't have posted that. Think about it a little. It just started working again; did the traffic slow down any to speak of?
It's a problem with computers communicating with each other, like one saying 'got any work', the other saying, 'sure'. That is obviously not working, and traffic has nothing to do with it.
ID: 1979808 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1979809 - Posted: 11 Feb 2019, 0:14:09 UTC

Work requests are being recognized and filled again. Panic over.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1979809 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1979811 - Posted: 11 Feb 2019, 0:27:15 UTC - in response to Message 1979808.  

Oh well. Machines are running out of work. I just shifted one to Beta. it's nice that BETA doesn't seem to have this problem.....oh wait.
Maybe they should look at Why BETA doesn't have this problem? It might be helpful determining Why BETA doesn't have this problem...


. . May I suggest because Beta handles a mere fraction of the traffic through main?

Stephen


If you had thought about it first, you probably wouldn't have posted that. Think about it a little. It just started working again; did the traffic slow down any to speak of?
It's a problem with computers communicating with each other, like one saying 'got any work', the other saying, 'sure'. That is obviously not working, and traffic has nothing to do with it.


. . It isn't just about the number of tasks but about moving that data around, both in the local network and across the global network. Handling a tiny fraction of the data that has to be shifted in main makes Beta far less prone to congestion issues. Yes, glitches in data transfer between servers are probably at the core of the problem, but comparing main to Beta is not really a balanced comparison.

Stephen

<shrug>
ID: 1979811 · Report as offensive
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1979813 - Posted: 11 Feb 2019, 0:35:16 UTC - in response to Message 1979811.  

The thing hadn't sent work for around 2 HOURS, yet there was too much traffic for the machines to talk to each other?
Please...
ID: 1979813 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1979815 - Posted: 11 Feb 2019, 0:40:51 UTC - in response to Message 1979813.  

Yes, I think that was the case. When the file deleters kick in, they create huge I/O contention on the servers that starves other processes.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1979815 · Report as offensive
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1979817 - Posted: 11 Feb 2019, 0:52:42 UTC - in response to Message 1979815.  

Right.... it stops a simple query from one machine to the other, but, other things keep going. Including queries between the same machines from BETA.
If you say so....
ID: 1979817 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1979819 - Posted: 11 Feb 2019, 1:07:58 UTC - in response to Message 1979817.  

Right.... it stops a simple query from one machine to the other, but, other things keep going. Including queries between the same machines from BETA.
If you say so....


. . OK, have you heard of a DoS attack? Do you understand how that works? Same principle here, except that there is no hostile intent and the traffic is real, not fake. But the outcome is the same.

Stephen

ID: 1979819 · Report as offensive
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1979821 - Posted: 11 Feb 2019, 1:29:14 UTC - in response to Message 1979819.  
Last modified: 11 Feb 2019, 1:31:39 UTC

Pretty targeted attack wouldn't you say? It Only affects the query from the scheduler and the RTS machine, everything else is unaffected.
Ever heard of a typo in the code? It's much more believable, and has happened before. Again, the traffic Stopped for Two Hours and during those Two Hours the machines never communicated on that One query. Other queries were unaffected. Why didn't other events stop if there was too much traffic to communicate? Ever think of that? Oh, and BETA still worked fine during all this alleged traffic, it uses the Same machines BTW.
ID: 1979821 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1979823 - Posted: 11 Feb 2019, 1:36:17 UTC

I never really paid any attention to the hardware on Beta. Same servers as Main EXCEPT for Oscar, Carolyn, Paddym, Georgem, Marvin, Lando and Centurion. So yes the projects do share some of the same servers, but there is double the amount of I/O going on at Main compared to Beta just in the number of interconnections to databases. Not even accounting for the 10X number of users.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1979823 · Report as offensive
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1979825 - Posted: 11 Feb 2019, 1:53:25 UTC - in response to Message 1979823.  
Last modified: 11 Feb 2019, 1:54:12 UTC

Main;
scheduling server synergy
scheduler process synergy
feeder synergy
db purge bruno

BETA;
Scheduler bruno
feeder.el6.x86_64 synergy

But wait... I thought synergy & bruno were incommunicado. Yet BETA communicated with them just fine.

Two Hours on main and the same machines weren't talking because of traffic? Rightttttt
ID: 1979825 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1979826 - Posted: 11 Feb 2019, 1:55:32 UTC - in response to Message 1979821.  

Pretty targeted attack wouldn't you say? It Only affects the query from the scheduler and the RTS machine, everything else is unaffected.
Ever heard of a typo in the code? It's much more believable, and has happened before. Again, the traffic Stopped for Two Hours and during those Two Hours the machines never communicated on that One query. Other queries were unaffected. Why didn't other events stop if there was too much traffic to communicate? Ever think of that? Oh, and BETA still worked fine during all this alleged traffic, it uses the Same machines BTW.


. . I am not convinced on the MORE believable part but it is certainly a possibility. Again it would have to be in a part of the code where it only manifests under some conditions and very specific activity. Either way we are not in a position to even investigate it properly much less do anything about it.

Stephen

<shrug>
ID: 1979826 · Report as offensive
Kevin Olley
Joined: 3 Aug 99
Posts: 906
Credit: 261,085,289
RAC: 572
United Kingdom
Message 1979888 - Posted: 11 Feb 2019, 16:16:05 UTC

Project has no tasks available.

Out of GPU WU's, Einstein is keeping them warm:-)
Kevin


ID: 1979888 · Report as offensive
JohnDK Crowdfunding Project Donor*Special Project $250 donor
Volunteer tester
Joined: 28 May 00
Posts: 1222
Credit: 451,243,443
RAC: 1,127
Denmark
Message 1979889 - Posted: 11 Feb 2019, 16:16:53 UTC

OK I'll do it: PANIC again again...
ID: 1979889 · Report as offensive
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 1979891 - Posted: 11 Feb 2019, 16:31:53 UTC

Tasks don't seem to be validating either. I'm still sending the work back, but RAC keeps dropping and pendings keep increasing.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 1979891 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.