Panic Mode On (83) Server Problems?


ExchangeMan
Volunteer tester
Joined: 9 Jan 00
Posts: 113
Credit: 142,315,520
RAC: 196,892
United States
Message 1363073 - Posted: 1 May 2013, 3:30:20 UTC - in response to Message 1363062.

What screams at me looking at the Munin graphs is that only when we ran out of AstroPulse units did the MB units start to plummet from 300K to 0. This suggests to me that a lot of GPU AstroPulse is being done and when that ran out the MB reserve was quickly devoured by hungry hungry GPUs.

It's interesting that 30 units/s creation rate isn't enough, at least during a shorty storm.

I noticed this too before the shutdown. When MB and AP units were both being split, things kind of behaved themselves and I got all the MB I could crunch. I sure hope this is fixed tomorrow.

The current server status page shows 6 MB splitters active. 8 were active earlier and couldn't keep up. This is very frustrating for dedicated Seti crunchers like me.

____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5861
Credit: 60,374,501
RAC: 48,787
Australia
Message 1363106 - Posted: 1 May 2013, 6:51:32 UTC - in response to Message 1363062.
Last modified: 1 May 2013, 7:00:02 UTC

It's interesting that 30 units/s creation rate isn't enough, at least during a shorty storm.

During a shorty storm 55/s is the minimum needed to meet demand.
The storm is over, but still the splitters aren't able to crank out enough work. For a while they were doing about 30/s (barely enough when there's a lot of VLARs in the mix). Now they've dropped down to less than 20.

Someone in the lab needs to take a look at what is going on - the splitters used to be able to sustain 70/s with no problems at all, now they can't even reach it as a peak.
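
As a rough sanity check on those rates, here is a minimal Python sketch (not anything the project runs). It assumes demand is a constant 55/s, the shorty-storm figure above, and that the ready-to-send buffer starts at roughly 300K units, the MB level mentioned earlier in the thread. The function name `hours_to_drain` is made up for illustration.

```python
def hours_to_drain(buffer_units, demand_per_s, supply_per_s):
    """Hours until the ready-to-send buffer is empty, or None if it never drains.

    Assumes constant rates, which real demand and splitter output are not.
    """
    net_drain = demand_per_s - supply_per_s  # units leaving the buffer per second
    if net_drain <= 0:
        return None  # splitters keep up, so the buffer holds or grows
    return buffer_units / net_drain / 3600


# Splitter rates quoted in this thread: the old 70/s, then about 30/s, then under 20/s.
for supply in (70, 30, 20):
    print(supply, hours_to_drain(300_000, 55, supply))
```

With those assumptions, 30/s empties the buffer in a little over 3 hours and 20/s in under 2.5, while the old 70/s would never drain it.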


EDIT - at least the shorty storm was over for a while. The work my systems were able to get after the outage while I was at work didn't have many shorties in it, but one of the systems was just able to get some more work (still nowhere near enough...) & it was almost all shorties.
____________
Grant
Darwin NT.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5861
Credit: 60,374,501
RAC: 48,787
Australia
Message 1363114 - Posted: 1 May 2013, 7:22:33 UTC - in response to Message 1363109.

Might be something with the new pfb splitting thingy, although I think it was Richard that expressed the opinion he thought they might be a little faster than the old splitters.....hmmmmmm.
Might depend on the work they are splitting.

I saw that mentioned, but so far they've proved to be about half the speed of the previous ones. 3/4 speed if they're really flying.
They're definitely borked.
____________
Grant
Darwin NT.

Keith White
Joined: 29 May 99
Posts: 370
Credit: 2,885,157
RAC: 2,465
United States
Message 1363145 - Posted: 1 May 2013, 9:22:24 UTC - in response to Message 1363106.

It's interesting that 30 units/s creation rate isn't enough, at least during a shorty storm.

During a shorty storm 55/s is the minimum needed to meet demand.
The storm is over, but still the splitters aren't able to crank out enough work. For a while they were doing about 30/s (barely enough when there's a lot of VLARs in the mix). Now they've dropped down to less than 20.

Someone in the lab needs to take a look at what is going on - the splitters used to be able to sustain 70/s with no problems at all, now they can't even reach it as a peak.


EDIT - at least the shorty storm was over for a while. The work my systems were able to get after the outage while I was at work didn't have many shorties in it, but one of the systems was just able to get some more work (still nowhere near enough...) & it was almost all shorties.


The reason I was talking about a shorty storm is that I had drained my queues to update BOINC and move from r390 to r1817 (my 3 cores pretty well match my low-end GPU in performance, so they were fairly close to running out at the same time). When I turned off NNT I only got shorties for my CPU, and the few GPU units I got were a mix but leaned toward shorties as well.

Edit: and once again without fail, the lack of GPU units gets resolved as I'm writing about it. Woo Hoo. I finally have full queues again.

____________
"Life is just nature's way of keeping meat fresh." - The Doctor

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5861
Credit: 60,374,501
RAC: 48,787
Australia
Message 1363287 - Posted: 1 May 2013, 17:55:37 UTC - in response to Message 1363145.


Splitters still not keeping up.
____________
Grant
Darwin NT.

Gatekeeper
Joined: 14 Jul 04
Posts: 887
Credit: 176,479,616
RAC: 0
United States
Message 1363298 - Posted: 1 May 2013, 18:38:42 UTC

Just within the last hour have I finally been able to top off my allocation on all three rigs. Looks like the shortie storm is at least temporarily over.
____________

Kevin Benfield
Joined: 29 Dec 03
Posts: 39
Credit: 16,327,677
RAC: 11,659
United Kingdom
Message 1363311 - Posted: 1 May 2013, 19:50:19 UTC

For some reason, I have not been able to get any new units for most of the day, and I'm completely out of units to crunch as well. Tried the usual kick to try and get things going, but nothing :(
____________

Gatekeeper
Joined: 14 Jul 04
Posts: 887
Credit: 176,479,616
RAC: 0
United States
Message 1363352 - Posted: 1 May 2013, 22:20:36 UTC - in response to Message 1363311.

For some reason, I have not been able to get any new units for most of the day, and I'm completely out of units to crunch as well. Tried the usual kick to try and get things going, but nothing :(


And, as is usually the case, 25 minutes after you posted, one of your rigs got work.
____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5861
Credit: 60,374,501
RAC: 48,787
Australia
Message 1363408 - Posted: 2 May 2013, 3:34:45 UTC - in response to Message 1363352.


Splitters still unable to build a ready-to-send buffer.
____________
Grant
Darwin NT.

Donald L. Johnson
Project donor
Joined: 5 Aug 02
Posts: 6252
Credit: 734,067
RAC: 1,195
United States
Message 1363490 - Posted: 2 May 2013, 8:34:19 UTC
Last modified: 2 May 2013, 8:35:25 UTC

On 8 April 2013, as we came back up after the move to the Colocation Facility, Matt wrote this (emphasis mine):


Jeff and I predicted based on previous demand that we'd see, once things settled down, a bandwidth usage average of 150Mbits/second (as long as both multibeam and astropulse workunits were available). And in fact this is what we're seeing, though we are still tuning some throttle mechanisms to make sure we don't go much higher than that.

Why not go higher? At least three reasons for now. First, we don't really have the data or the ability to split workunits faster than that. Second, we eventually hope to move off Hurricane and get on the campus network (and wantonly grabbing all the bits we can for no clear scientific reason wouldn't be setting a good example that we are in control of our needs/traffic). Third, and perhaps most importantly, it seems that our result storage server can't handle much higher a load. Yes, that seems to be our big bottleneck at this point - the ability of that server to write results to disk much faster than current demand. We expected as much. We'll look into improving the disk i/o on that system soon.

I wonder if this is one of the throttles he's mentioned, to slow down the MB/pfb splitters so we don't overwhelm the result storage server and/or run out of data....
____________
Donald
Infernal Optimist / Submariner, retired

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5861
Credit: 60,374,501
RAC: 48,787
Australia
Message 1363492 - Posted: 2 May 2013, 8:41:29 UTC - in response to Message 1363490.

On 8 April 2013, as we came back up after the move to the Colocation Facility, Matt wrote this (emphasis mine):

Jeff and I predicted based on previous demand that we'd see, once things settled down, a bandwidth usage average of 150Mbits/second (as long as both multibeam and astropulse workunits were available). And in fact this is what we're seeing, though we are still tuning some throttle mechanisms to make sure we don't go much higher than that.

Why not go higher? At least three reasons for now. First, we don't really have the data or the ability to split workunits faster than that. Second, we eventually hope to move off Hurricane and get on the campus network (and wantonly grabbing all the bits we can for no clear scientific reason wouldn't be setting a good example that we are in control of our needs/traffic). Third, and perhaps most importantly, it seems that our result storage server can't handle much higher a load. Yes, that seems to be our big bottleneck at this point - the ability of that server to write results to disk much faster than current demand. We expected as much. We'll look into improving the disk i/o on that system soon.

I wonder if this is one of the throttles he's mentioned, to slow down the MB/pfb splitters so we don't overwhelm the result storage server and/or run out of data....


Possibly, although the problem continues even when there aren't tonnes of shorties going through the system. Prior to the move & the change of splitter type it wasn't an issue.
____________
Grant
Darwin NT.

N9JFE David S
Project donor
Volunteer tester
Joined: 4 Oct 99
Posts: 11919
Credit: 14,593,761
RAC: 12,009
United States
Message 1363563 - Posted: 2 May 2013, 13:33:18 UTC

The blue line on the cricket graph is certainly interesting for the last 18 or 19 hours.

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


N9JFE David S
Project donor
Volunteer tester
Joined: 4 Oct 99
Posts: 11919
Credit: 14,593,761
RAC: 12,009
United States
Message 1363572 - Posted: 2 May 2013, 13:55:50 UTC - in response to Message 1363570.

The blue line on the cricket graph is certainly interesting for the last 18 or 19 hours.

More datasets being transferred from the lab to the colo in preparation for splitting, I suspect.

I hope so, with only about 300 MBs ready to send.

God bless the new bandwidth, eh?

Yup.

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


Gary Charpentier
Project donor
Volunteer tester
Joined: 25 Dec 00
Posts: 12695
Credit: 7,172,751
RAC: 14,959
United States
Message 1363620 - Posted: 2 May 2013, 17:09:09 UTC - in response to Message 1363581.

The blue line on the cricket graph is certainly interesting for the last 18 or 19 hours.

More datasets being transferred from the lab to the colo in preparation for splitting, I suspect.

I hope so, with only about 300 MBs ready to send.

That's a problem with the splitting rate, not the amount of data waiting to be split. There is plenty of data awaiting splitting right now.

We need another pfb splitter online.

Unless the powers that be are deliberately holding back to spare the bandwidth or database a bit.

With the comments about not wanting to be too big a data hog, the limits on total units, the comment that we are going through the data faster than they are collecting it, I would think the conclusion is they are holding back on purpose.

Face it, there are more crunchers than there is Seti data. This is a good thing. Now how do we convince more of them to, say, join the 84+ cents a month club so that ntpckr development can continue?

____________

ExchangeMan
Volunteer tester
Joined: 9 Jan 00
Posts: 113
Credit: 142,315,520
RAC: 196,892
United States
Message 1363687 - Posted: 2 May 2013, 19:39:48 UTC - in response to Message 1363581.

The blue line on the cricket graph is certainly interesting for the last 18 or 19 hours.

More datasets being transferred from the lab to the colo in preparation for splitting, I suspect.

I hope so, with only about 300 MBs ready to send.

That's a problem with the splitting rate, not the amount of data waiting to be split. There is plenty of data awaiting splitting right now.

We need another pfb splitter online.

Unless the powers that be are deliberately holding back to spare the bandwidth or database a bit.

If they are holding back data, then they need to let us know this explicitly. It's a lot harder to get work units now compared to last week, but I'm making the assumption that this is temporary and we will return to a somewhat normal condition in a little while. If this is not going to be the case, I just want to know.

____________


Copyright © 2014 University of California