Panic Mode On (114) Server Problems?

TBar
Volunteer tester

Joined: 22 May 99
Posts: 4869
Credit: 595,738,429
RAC: 1,406,225
United States
Message 1979768 - Posted: 10 Feb 2019, 21:34:02 UTC

Back to Project has NO Tasks again...
Hosts down by hundreds of tasks...
Another day, another Panic.
ID: 1979768 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 9909
Credit: 936,643,663
RAC: 1,510,885
United States
Message 1979774 - Posted: 10 Feb 2019, 21:50:56 UTC
Last modified: 10 Feb 2019, 21:58:14 UTC

They started to reduce the backlog of WU and result deletions again when they fixed Georgem. Any time the deleters crank up, you can't get any work. Down to less than a dozen GPU tasks now on one host.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1979774 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 572
Credit: 1,954,114
RAC: 860
United States
Message 1979786 - Posted: 10 Feb 2019, 22:31:05 UTC - in response to Message 1979774.  

Results out in the field have dropped to 4.7 million; the figure is usually closer to 5 million. I think that is better than the system crashing, but it is bothersome when machine caches run dry.
ID: 1979786 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4869
Credit: 595,738,429
RAC: 1,406,225
United States
Message 1979796 - Posted: 10 Feb 2019, 23:28:27 UTC

Oh well. Machines are running out of work. I just shifted one to Beta. It's nice that BETA doesn't seem to have this problem.....oh wait.
Maybe they should look at Why BETA doesn't have this problem? It might be helpful in determining Why BETA doesn't have this problem...
ID: 1979796 · Report as offensive
Stephen "Heretic" Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 4708
Credit: 152,120,480
RAC: 244,738
Australia
Message 1979802 - Posted: 10 Feb 2019, 23:56:51 UTC - in response to Message 1979796.  

. . May I suggest it is because Beta handles a mere fraction of the traffic that goes through Main?

Stephen

? ?
ID: 1979802 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2770
Credit: 576,964,478
RAC: 920,968
Canada
Message 1979803 - Posted: 10 Feb 2019, 23:58:39 UTC

Panic ... Warning ... INCOMING !!!!
ID: 1979803 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 9909
Credit: 936,643,663
RAC: 1,510,885
United States
Message 1979804 - Posted: 11 Feb 2019, 0:03:25 UTC - in response to Message 1979803.  

Splitters are offline now and RTS buffer at 760K.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1979804 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1444
Credit: 171,853,506
RAC: 358,850
United States
Message 1979806 - Posted: 11 Feb 2019, 0:07:40 UTC - in response to Message 1979802.  

Math. What a concept! :)
ID: 1979806 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4869
Credit: 595,738,429
RAC: 1,406,225
United States
Message 1979808 - Posted: 11 Feb 2019, 0:08:10 UTC - in response to Message 1979802.  
Last modified: 11 Feb 2019, 0:11:49 UTC

If you had thought about it first, you probably wouldn't have posted that. Think about it a little. It just started working again; did the traffic slow down any to speak of?
It's a problem with computers communicating with each other, like one saying 'got any work?' and the other saying 'sure'. That is obviously not working, and traffic has nothing to do with it.
ID: 1979808 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 9909
Credit: 936,643,663
RAC: 1,510,885
United States
Message 1979809 - Posted: 11 Feb 2019, 0:14:09 UTC

Work requests are being recognized and filled again. Panic over.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1979809 · Report as offensive
Stephen "Heretic" Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 4708
Credit: 152,120,480
RAC: 244,738
Australia
Message 1979811 - Posted: 11 Feb 2019, 0:27:15 UTC - in response to Message 1979808.  

. . It isn't just about the number of tasks but about moving that data around, both on the local network and across the global network. Handling a tiny fraction of the data that has to be shifted in Main makes Beta far less prone to congestion issues. Yes, glitches in data transfer between servers are probably at the core of the problem, but comparing Main to Beta is not really a balanced comparison.

Stephen

<shrug>
ID: 1979811 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4869
Credit: 595,738,429
RAC: 1,406,225
United States
Message 1979813 - Posted: 11 Feb 2019, 0:35:16 UTC - in response to Message 1979811.  

The thing hadn't sent work for around 2 HOURS, yet there was too much traffic for the machines to talk to each other?
Please...
ID: 1979813 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 9909
Credit: 936,643,663
RAC: 1,510,885
United States
Message 1979815 - Posted: 11 Feb 2019, 0:40:51 UTC - in response to Message 1979813.  

Yes, I think that was the case. When the file deleters kick in, they create huge I/O contention on the servers, which starves other processes.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1979815 · Report as offensive
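
The I/O contention idea in the post above can be pictured with a toy model. This is only a sketch under assumed numbers: one disk that services operations strictly in arrival order, with made-up costs per operation. The function name `scheduler_wait_ms` and all of its parameters are hypothetical; nothing here is taken from the actual SETI@home or BOINC server code, and it does not settle which explanation in this thread is correct.

```python
from collections import deque


def scheduler_wait_ms(pending_deletes, delete_cost_ms=5, query_cost_ms=1):
    """Return the time (ms) until one scheduler query completes on a
    first-come-first-served disk queue that already holds
    `pending_deletes` file-delete operations.

    All costs are illustrative assumptions, not measured values.
    """
    # The deleter's backlog is queued first; the scheduler query arrives last.
    queue = deque(["delete"] * pending_deletes)
    queue.append("query")

    clock_ms = 0
    while queue:
        op = queue.popleft()
        clock_ms += delete_cost_ms if op == "delete" else query_cost_ms
        if op == "query":
            break
    return clock_ms


if __name__ == "__main__":
    # Idle disk versus a disk with a large (hypothetical) delete backlog.
    print(scheduler_wait_ms(0))
    print(scheduler_wait_ms(1000))
```

In this model the query itself stays cheap; its wait grows only because of what is queued ahead of it on the same device. A real server has several disks, caches, and request priorities, so treat this as an illustration of the argument rather than a diagnosis.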
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4869
Credit: 595,738,429
RAC: 1,406,225
United States
Message 1979817 - Posted: 11 Feb 2019, 0:52:42 UTC - in response to Message 1979815.  

Right.... it stops a simple query from one machine to the other, but other things keep going. Including queries between the same machines from BETA.
If you say so....
ID: 1979817 · Report as offensive
Stephen "Heretic" Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 4708
Credit: 152,120,480
RAC: 244,738
Australia
Message 1979819 - Posted: 11 Feb 2019, 1:07:58 UTC - in response to Message 1979817.  

. . OK, have you heard of a DoS attack? Do you understand how that works? Same principle here, except that there is no hostile intent and the traffic is real, not fake. But the outcome is the same.

Stephen

. .
ID: 1979819 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4869
Credit: 595,738,429
RAC: 1,406,225
United States
Message 1979821 - Posted: 11 Feb 2019, 1:29:14 UTC - in response to Message 1979819.  
Last modified: 11 Feb 2019, 1:31:39 UTC

Pretty targeted attack, wouldn't you say? It Only affects the query between the scheduler and the RTS machine; everything else is unaffected.
Ever heard of a typo in the code? It's much more believable, and has happened before. Again, the traffic Stopped for Two Hours, and during those Two Hours the machines never communicated on that One query. Other queries were unaffected. Why didn't other events stop if there was too much traffic to communicate? Ever think of that? Oh, and BETA still worked fine during all this alleged traffic; it uses the Same machines BTW.
ID: 1979821 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 9909
Credit: 936,643,663
RAC: 1,510,885
United States
Message 1979823 - Posted: 11 Feb 2019, 1:36:17 UTC

I never really paid any attention to the hardware on Beta. Same servers as Main EXCEPT for Oscar, Carolyn, Paddym, Georgem, Marvin, Lando and Centurion. So yes, the projects do share some of the same servers, but there is double the amount of I/O going on at Main compared to Beta just in the number of interconnections to databases. That is not even accounting for the 10X the number of users.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1979823 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4869
Credit: 595,738,429
RAC: 1,406,225
United States
Message 1979825 - Posted: 11 Feb 2019, 1:53:25 UTC - in response to Message 1979823.  
Last modified: 11 Feb 2019, 1:54:12 UTC

Main;
scheduling server synergy
scheduler process synergy
feeder synergy
db purge bruno

BETA;
Scheduler bruno
feeder.el6.x86_64 synergy

But wait... I thought synergy & bruno were incommunicado. Yet BETA communicated with them just fine.

Two Hours on main and the same machines weren't talking because of traffic? Rightttttt
ID: 1979825 · Report as offensive
Stephen "Heretic" Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 4708
Credit: 152,120,480
RAC: 244,738
Australia
Message 1979826 - Posted: 11 Feb 2019, 1:55:32 UTC - in response to Message 1979821.  

. . I am not convinced about the MORE believable part, but it is certainly a possibility. Again, it would have to be in a part of the code where it only manifests under some conditions and very specific activity. Either way, we are not in a position to even investigate it properly, much less do anything about it.

Stephen

<shrug>
ID: 1979826 · Report as offensive
Kevin Olley

Joined: 3 Aug 99
Posts: 787
Credit: 220,732,928
RAC: 204,531
United Kingdom
Message 1979888 - Posted: 11 Feb 2019, 16:16:05 UTC

Project has no tasks available.

Out of GPU WUs; Einstein is keeping them warm :-)
Kevin


ID: 1979888 · Report as offensive
©2019 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.