Panic Mode On (63) Server problems?


AuthorMessage
AndyJ
Joined: 17 Aug 02
Posts: 248
Credit: 27,380,797
RAC: 0
United Kingdom
Message 1179112 - Posted: 18 Dec 2011, 19:10:26 UTC - in response to Message 1179076.  

Well, it's rather refreshing to have a break in the shorty storm. Most of my rigs have their cache limits hit, the only fly in the ointment.
But at least the tasks that are cached have some run time for the GPUs, not 80-90% 2 minute drills.

Now, about them there limits..........


I don't know, but I have a very bad feeling that limits will not be raised soon, if ever. Servers are still maxed out with limits in place. Bear in mind the disconnect between Seti staff, who are still looking for "spare" computing power, and top crunchers with farms.
Please prove me wrong, but on the technical board, as far as I know, there has been no positive reply to a request to raise the limits.

Regards,

A


ID: 1179112 · Report as offensive
j tramer

Joined: 6 Oct 03
Posts: 242
Credit: 5,412,368
RAC: 0
Canada
Message 1179148 - Posted: 18 Dec 2011, 23:07:56 UTC

omg i have work
ID: 1179148 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 36798
Credit: 261,360,520
RAC: 489
Australia
Message 1179200 - Posted: 19 Dec 2011, 6:24:46 UTC - in response to Message 1179148.  

I just had a long weekend away and on return I only had 1 rig to give a nudge to its uploads but nothing ran out of work so I have no reason at all to panic. :D

Cheers.
ID: 1179200 · Report as offensive
DMMD
Joined: 14 Feb 00
Posts: 118
Credit: 71,564,960
RAC: 0
Message 1179202 - Posted: 19 Dec 2011, 6:51:59 UTC

It's hotter than a monkey's bum in here
ID: 1179202 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 36798
Credit: 261,360,520
RAC: 489
Australia
Message 1179212 - Posted: 19 Dec 2011, 9:25:15 UTC - in response to Message 1179202.  
Last modified: 19 Dec 2011, 9:34:36 UTC

Well it's been 10-15C cooler here than usual for this time of year, so no worry here with heat yet, but with the help of a proxy my Q6600 is now hitting the limit again after not connecting for 3.5 days so it seems that I'm getting 5-6 days worth of work out of my 10 day cache setting on that rig (joining my other 2 rigs bouncing off the limits). :)

Cheers.
ID: 1179212 · Report as offensive
tbret
Volunteer tester
Joined: 28 May 99
Posts: 3380
Credit: 296,162,071
RAC: 40
United States
Message 1179215 - Posted: 19 Dec 2011, 10:15:42 UTC - in response to Message 1179202.  
Last modified: 19 Dec 2011, 11:03:44 UTC

It's hotter than a monkey's bum in here


Isn't that supposed to be, "It's hot enough to boil a monkey's bum in here, Your Majesty..."


"...and she smiled quietly to herself."

EDIT - Yes, it is.

Went to my office Sunday night about 9:00pm. Some kind soul had turned the furnace completely off. No heat and no air circulation. I walked in the front door, it was freezing. I climbed the 39 Steps to my office and was cold. Grabbed the doorknob to my office, inserted the key and noticed that my hand was warm on the imitation brass.

I threw open the door and was met by a blast of hot air. There were fans blowing like crazy in four computers. I'll bet it was 100+F in there.

I quickly crossed the room hardly noticing the body of the secretary lying lifeless on the floor (I checked for a pulse...I didn't have one, so she must have been dead), and flung open the windows on two corners of my corner office. I wish I had raised them, but I flung them so I had to go down to the street and pick them up and replace the glass - again.

After nearly two hours of replacing window panes and running the central heating's fan (but no heat) with two windows broken, it got cool enough to breathe in my office, so I went to check the body I had seen on my floor... It was gone. The body had simply vanished when the temperature dropped. I put both my brain cells to work and was able to deduce that I must have been fooled by a full body mirage. (like, in a desert, where it's like, warm and stuff)

Of course! A mirage! I don't even HAVE a secretary!

But if I did have, and she'd been locked in that office, then, look-out because she'd have been passed-out or dead or something because it was HOT in there. I know hot when I feel it and it was hot. I felt it. (and so, incidentally was the non-existent secretary, although she would have been hotter had she not been all icky and dead; hot I mean, the dead secretary was "hot", not felt).

If I could just pipe that into the return air ducts at my office, the rest of the building wouldn't need to be heated... unless I flung a lot more windows.

(the computer exhaust, folks, I'm saying pipe the heat into the return air ducts not pipe the hot, dead, non-existent secretary -- couldn't feel her even if you stuff her in a return air duct; and how much fun would it be to feel a dead secretary in an air duct anyway?)

EDIT EDIT EDIT - Why do I have to explain my sentences to you people so much?
ID: 1179215 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1179231 - Posted: 19 Dec 2011, 12:40:40 UTC

Well, thanks to the weather here being unseasonably warm for this time of year and the crunchers all having Seti to do, the kitties are all toasty warm here too. Haven't had to fire up the furnace yet this year.

Nothing is melting down here though.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1179231 · Report as offensive
Profile Belthazor
Volunteer tester
Joined: 6 Apr 00
Posts: 219
Credit: 10,373,795
RAC: 13
Russia
Message 1179299 - Posted: 19 Dec 2011, 16:53:25 UTC - in response to Message 1179297.  

While again switching over to AP only, I get so depressed that I can hardly work. I only have 85 AP's in progress, and if I don't get more in the next week or so, my computers will run dry.

The pain, the pain.


A lot of APs were split today. Why didn't you get some of them?

Merry Christmas, and happy new WUs folks :-)


Yeah, the same to you. As a present for these holidays, I'm giving you the AP WUs that I could have got for myself; I'm leaving them to you :)
ID: 1179299 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1179356 - Posted: 19 Dec 2011, 19:22:12 UTC - in response to Message 1179112.  

Well, it's rather refreshing to have a break in the shorty storm. Most of my rigs have their cache limits hit, the only fly in the ointment.
But at least the tasks that are cached have some run time for the GPUs, not 80-90% 2 minute drills.

Now, about them there limits..........


I don't know, but I have a very bad feeling that limits will not be raised soon, if ever. Servers are still maxed out with limits in place. Bear in mind the disconnect between Seti staff, who are still looking for "spare" computing power, and top crunchers with farms.
Please prove me wrong, but on the technical board, as far as I know, there has been no positive reply to a request to raise the limits.

Regards,

A


I think now, with the major shorty storm passed for the moment, would be a good time.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1179356 · Report as offensive
AndyJ
Joined: 17 Aug 02
Posts: 248
Credit: 27,380,797
RAC: 0
United Kingdom
Message 1179370 - Posted: 19 Dec 2011, 19:59:27 UTC - in response to Message 1179356.  
Last modified: 19 Dec 2011, 20:19:05 UTC

I think now, with the major shorty storm passed for the moment, would be a good time.


Yes, it would be a good time, but I really believe nobody at Seti has the slightest intention of raising the limits.
Just my opinion.

Regards,

A

Edit: As in my post above this one, could somebody prove this rather miserable theory wrong?
Has there been anything on the technical board from Seti staff in reply to requests from crunchers to raise the limits (there are a lot)?
ID: 1179370 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1179427 - Posted: 20 Dec 2011, 0:04:15 UTC

The limits are in place to protect against massive overfetching when GPU work is again properly estimated server side, IOW when changeset [trac]changeset:24217[/trac] is revised or removed. Getting the AP validator revised to exclude early exit times from averaging is needed first to ensure the APR for Astropulse applications is reasonable, unless someone decides the ~1200 new hosts per day don't deserve protection (new is either really new or simply a new hostID).

The project staff could decide to throw caution to the winds, but I don't expect that. Similarly, I don't expect BOINC to have a fair limiting mechanism anytime soon.
                                                                   Joe
ID: 1179427 · Report as offensive
Profile SciManStev Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Jun 99
Posts: 6658
Credit: 121,090,076
RAC: 0
United States
Message 1179429 - Posted: 20 Dec 2011, 0:05:41 UTC - in response to Message 1179427.  

The limits are in place to protect against massive overfetching when GPU work is again properly estimated server side, IOW when changeset [trac]changeset:24217[/trac] is revised or removed. Getting the AP validator revised to exclude early exit times from averaging is needed first to ensure the APR for Astropulse applications is reasonable, unless someone decides the ~1200 new hosts per day don't deserve protection (new is either really new or simply a new hostID).

The project staff could decide to throw caution to the winds, but I don't expect that. Similarly, I don't expect BOINC to have a fair limiting mechanism anytime soon.
                                                                   Joe

Thank you Joe! It is clear that continued patience is what is called for.

Steve
Warning, addicted to SETI crunching!
Crunching as a member of GPU Users Group.
GPUUG Website
ID: 1179429 · Report as offensive
BWX

Joined: 31 May 03
Posts: 36
Credit: 156,754,993
RAC: 24
United States
Message 1179443 - Posted: 20 Dec 2011, 1:55:43 UTC

Only in the past several hours does it look like everyone has finally reached their limits (see Cricket graphs). My 2 big rigs only today started getting work to keep both the CPUs and GPUs busy, and only reached their limits in the last hour or so.

Now would be the time to raise the limits in small increments (with an expected rush at the beginning of each), waiting for everyone to reach the new limit before raising again. Eventually we may get back to as it was before.
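The staged approach described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the function name `staged_raise`, the per-host fill rates, and the numbers used are all hypothetical, hosts are assumed never to consume work while filling, and nothing here reflects how the BOINC scheduler actually enforces limits.

```python
def staged_raise(start_limit, target_limit, step, host_fill_rates):
    """Raise a per-host task limit in small steps (hypothetical model).

    The limit only goes up once every host has filled its cache to the
    current limit, i.e. "wait for everyone, then raise again".
    Returns a list of (tick, new_limit) pairs, one per raise.
    """
    if any(rate <= 0 for rate in host_fill_rates):
        raise ValueError("every host needs a positive fill rate")

    limit = start_limit
    held = [start_limit] * len(host_fill_rates)  # assume everyone starts full
    raises = []
    tick = 0
    while limit < target_limit:
        # Raise only when every host has reached the current limit.
        if all(h >= limit for h in held):
            limit = min(limit + step, target_limit)
            raises.append((tick, limit))
        # Each host fetches up to its rate per tick, never past the limit.
        held = [min(limit, h + r) for h, r in zip(held, host_fill_rates)]
        tick += 1
    return raises


# One fast host (25 tasks/tick) and one slow host (5 tasks/tick).
print(staged_raise(50, 100, 25, [25, 5]))
```

In this toy model the slowest host sets the pace: the second raise has to wait until it catches up, which is the "expected rush, then settle" pattern in miniature.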
ID: 1179443 · Report as offensive
Terror Australis
Volunteer tester

Joined: 14 Feb 04
Posts: 1817
Credit: 262,693,308
RAC: 44
Australia
Message 1179449 - Posted: 20 Dec 2011, 2:57:21 UTC

I will play Devil's Advocate here.

Why raise the limits and stress an already overstressed system even further?

The current limits equate to about a 1 1/2 to 2 day cache on my fastest rigs, more than enough to get them through the weekly planned outage. Keeping the current limits helps avoid unplanned outages. As long as the limit can be maintained I see no problem.

As long as the rigs don't run dry we should all be happy.

(BTW I normally run a 4 day cache)

T.A.
ID: 1179449 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1179451 - Posted: 20 Dec 2011, 3:11:23 UTC - in response to Message 1179449.  
Last modified: 20 Dec 2011, 3:22:45 UTC

I will play Devil's Advocate here.

Why raise the limits and stress an already overstressed system even further?

The current limits equate to about a 1 1/2 to 2 day cache on my fastest rigs, more than enough to get them through the weekly planned outage. Keeping the current limits helps avoid unplanned outages. As long as the limit can be maintained I see no problem.

As long as the rigs don't run dry we should all be happy.

(BTW I normally run a 4 day cache)

T.A.


In its most basic form it's about what caches (in general) are 'really for', as opposed to just having 'lots of tasks on the hosts'.

In computer science, as with transport logistics & warehousing, operating with no (or insufficient) local caches results in supply interruptions whenever there is a stall at the slowest link in the chain. Having sufficient caches places an initial startup load demand, but smooths things out & allows for service interruptions without bringing the whole chain to a stall if some link in the chain drops out temporarily or periodically. It achieves that through techniques called 'latency hiding', which if perfectly balanced for transport might be something like 'Just in time delivery', which theorises a perfectly sized cache at each stage to keep steady flow at the consuming end with minimal storage or interruptions.

In those respects, obviously there are finite limits on the total work that can be supplied, though reasonable local caches allow for periodic interruption or maintenance to occur, improving the overall efficiency of the system by effectively smoothing the peak demands. Caches that are too big obviously also have their own problems with the finite workunit storage setup being centralised, though by rights total in/out turnover should be no different, just more constant & spread out, if balanced caches are in play. Less peak server stress & end system downtime without much cost.
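To make the 'latency hiding' point concrete, here is a toy hour-by-hour model of one host riding through a server outage. Everything in it is an assumption for illustration: the function name `simulate_host`, the refill and burn rates, and the outage schedule are made up, and work is measured simply in hours of crunching held locally.

```python
def simulate_host(cache_limit, server_up, refill_per_hour=3, burn_per_hour=1):
    """Count idle hours for one host over a server availability schedule.

    cache_limit: most work (in hours of crunching) the host may hold
    server_up:   iterable of booleans, one entry per hour
    """
    cached = cache_limit  # assume the host starts with a full cache
    idle = 0
    for up in server_up:
        if up:
            # The host can only top up while the server is reachable.
            cached = min(cache_limit, cached + refill_per_hour)
        if cached >= burn_per_hour:
            cached -= burn_per_hour  # an hour of work gets crunched
        else:
            idle += 1  # nothing left to crunch this hour
    return idle


# One day up, a 36 hour outage, then one day up again.
schedule = [True] * 24 + [False] * 36 + [True] * 24
for limit in (12, 48):
    print(limit, "hour cache ->", simulate_host(limit, schedule), "idle hours")
```

With these made-up numbers the 12 hour cache runs dry partway through the outage while the 48 hour cache never stalls, which is the whole argument for a cache sized to cover the interruptions you expect.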

Unfortunately the bigger part of that cost to build 'our' caches would likely be that workunit storage, so a very real physical limit at this time. Fingers crossed the efforts of the GPU Users group will be a big help along those lines.

Jason
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1179451 · Report as offensive
Terror Australis
Volunteer tester

Joined: 14 Feb 04
Posts: 1817
Credit: 262,693,308
RAC: 44
Australia
Message 1179456 - Posted: 20 Dec 2011, 3:46:12 UTC - in response to Message 1179451.  
Last modified: 20 Dec 2011, 3:47:42 UTC

In its most basic form it's about what caches (in general) are 'really for', as opposed to just having 'lots of tasks on the hosts'.

In computer science, as with transport logistics & warehousing, operating with no (or insufficient) local caches results in supply interruptions whenever there is a stall at the slowest link in the chain. Having sufficient caches places an initial startup load demand, but smooths things out & allows for service interruptions without bringing the whole chain to a stall if some link in the chain drops out temporarily or periodically. It achieves that through techniques called 'latency hiding', which if perfectly balanced for transport might be something like 'Just in time delivery', which theorises a perfectly sized cache at each stage to keep steady flow at the consuming end with minimal storage or interruptions.

In those respects, obviously there are finite limits on the total work that can be supplied, though reasonable local caches allow for periodic interruption or maintenance to occur, improving the overall efficiency of the system by effectively smoothing the peak demands. Caches that are too big obviously also have their own problems with the finite workunit storage setup being centralised, though by rights total in/out turnover should be no different, just more constant & spread out, if balanced caches are in play. Less peak server stress & end system downtime without much cost.

Unfortunately the bigger part of that cost to build 'our' caches would likely be that workunit storage, so a very real physical limit at this time.

Jason

Agreed Jason
But I think you overlook the fact that after an outage there will always be a rush as those that are totally out of work seek to restart and those who still have work but whose caches have run down seek to build back up to the preset level.

On a restart cache size makes no difference to the demand for work/server load. For what you say to be effective the servers would have to look at who still has plenty of work, who is running low and who is empty and prioritise the allocation of work units accordingly. This does not happen and we get the usual "log jam" which can take a week to clear. The recovery from a server outage would be a lot smoother if the system could prioritise work unit allocation effectively, in this order:
1) Everybody has SOME work
2) Everybody has (eg) 2 days work
3) Top up caches to the set limit

That is what happens in the real world examples you quote above, where after an "outage" in a "just in time" supply situation the supplier gives priority to those with the least amount of product on hand and then catches up with the rest over the next few days.
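The three-stage priority above could look roughly like the sketch below. It is hypothetical throughout: `allocate_in_tiers`, the host fields (`held`, `per_day`, `limit`) and the example numbers are invented for illustration, and the real scheduler does not work this way (that is the complaint).

```python
def allocate_in_tiers(hosts, available, some=1, days=2):
    """Share out `available` tasks in three passes (hypothetical model).

    hosts: dict of name -> {"held": tasks on hand,
                            "per_day": tasks crunched per day,
                            "limit": per-host task limit}
    Returns a dict of name -> tasks granted.
    """
    grants = {name: 0 for name in hosts}

    def top_up(target_for):
        nonlocal available
        # Serve the hosts with the least work on hand first.
        order = sorted(hosts, key=lambda n: hosts[n]["held"] + grants[n])
        for name in order:
            host = hosts[name]
            target = min(target_for(host), host["limit"])
            need = target - (host["held"] + grants[name])
            give = max(0, min(need, available))
            grants[name] += give
            available -= give

    top_up(lambda h: some)                 # 1) everybody has SOME work
    top_up(lambda h: days * h["per_day"])  # 2) everybody has ~2 days of work
    top_up(lambda h: h["limit"])           # 3) top up to the set limit
    return grants


hosts = {
    "dry":  {"held": 0,  "per_day": 10, "limit": 100},
    "low":  {"held": 5,  "per_day": 10, "limit": 100},
    "full": {"held": 90, "per_day": 10, "limit": 100},
}
print(allocate_in_tiers(hosts, available=40))
```

With only 40 tasks to hand out, the empty and nearly empty hosts are brought up to two days of work before the well-stocked host gets anything, which is the ordering the list asks for.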

T.A.
ID: 1179456 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1179459 - Posted: 20 Dec 2011, 4:08:56 UTC - in response to Message 1179456.  

Agreed Jason
But I think you overlook the fact that after an outage there will always be a rush as those that are totally out of work seek to restart and those who still have work but whose caches have run down seek to build back up to the preset level.

On a restart cache size makes no difference to the demand for work/server load. For what you say to be effective the servers would have to look at who still has plenty of work, who is running low and who is empty and prioritise the allocation of work units accordingly. This does not happen and we get the usual "log jam" which can take a week to clear. The recovery from a server outage would be a lot smoother if the system could prioritise work unit allocation effectively, in this order:
1) Everybody has SOME work
2) Everybody has (eg) 2 days work
3) Top up caches to the set limit

That is what happens in the real world examples you quote above, where after an "outage" in a "just in time" supply situation the supplier gives priority to those with the least amount of product on hand and then catches up with the rest over the next few days.

T.A.


That's right, we're talking about a theoretical optimum concept versus realities. One thing that would happen in such an idealised mechanism when things come online is perhaps a staged onramp and prioritising the biggest consumers first as described. That's still a logistics perspective, and somewhat at odds with the way project backoffs and manual retry buttons work for us. In that 'ideal' system you would never have run completely dry in the first place, therefore the peak is briefer and controllable.

We know the time estimates are at best marginal, so perhaps a better quantity for onramp rationing is the current # of tasks, but not allowing that to ramp up will pretty much guarantee sustained peak, or in CPU cache analog terms 'thrashing', or in manufacturer terms perhaps plant shutdowns & layoffs due to supply shortages.

Even if we take # of tasks as currently implemented & ramp that up gradually until the demand levels out (hopefully at around 70% of max bandwidth), there's still the server workunit storage issue, not solved, as a hard limit. For what it's worth, under smooth running I also find my systems run continuously with 3 days cache, tolerant of normal short maintenance outages AND backoffs. What they can't seem to tolerate is the continuous heavy contention on the comms line, during which most requests are at best met with 'Project has no tasks available'. Might as well have saved the bandwidth & not asked for work.
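As a rough picture of ramping the task limit until demand levels out near 70% of bandwidth, here is a very small feedback rule. It is a sketch only: `next_limit`, the step size, the dead band and the floor are all assumed values, and it says nothing about the workunit storage limit mentioned above.

```python
def next_limit(limit, utilisation, target=0.70, band=0.05, step=10, floor=10):
    """Pick the next per-host task limit from measured link utilisation.

    utilisation is the fraction of max bandwidth in use (0.0 to 1.0).
    Below the target band the limit ramps up, above it the limit backs
    off, and inside the band it holds steady.
    """
    if utilisation < target - band:
        return limit + step              # spare bandwidth: ramp up
    if utilisation > target + band:
        return max(floor, limit - step)  # link saturating: back off
    return limit                         # close enough to target: hold


limit = 50
for measured in (0.40, 0.55, 0.72, 0.95):
    limit = next_limit(limit, measured)
    print(f"utilisation {measured:.2f} -> limit {limit}")
```

The dead band is there so the limit does not flap up and down when utilisation hovers around the target.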

Jason
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1179459 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1179474 - Posted: 20 Dec 2011, 7:17:44 UTC

Yeah, I'll be happy with the dang limits until the next time the servers trip up for more than a day or so, or there is trouble coming back up from an outage.

I am NOT happy having to choose Einstein or any other Boinc project when Seti is my passion, and action on the remedial steps needed to get the cache limits lifted has not been very forthcoming. The foot dragging on this is getting a bit irksome.

Up until the Boinc botch, my caches have allowed me to weather many a Seti storm of one kind or another.

You can't blame me for desiring a return of my ability to cache enough work to cover an extended outage.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1179474 · Report as offensive
Kevin Olley

Joined: 3 Aug 99
Posts: 906
Credit: 261,085,289
RAC: 572
United Kingdom
Message 1179478 - Posted: 20 Dec 2011, 7:50:09 UTC

The limits are ok when everything is running as it should be, but unfortunately whenever we have a hiccup the server seems to stoke the fire with masses of APs and shorties.

It would help if there could be a little bit of pre-sorting of the tapes so that we don't get flooded at peak demand times.

Or is it that the servers getting too many shorties in one go causes the hiccup in the first place?



Kevin


ID: 1179478 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22534
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1179504 - Posted: 20 Dec 2011, 9:39:43 UTC

Each tape is split twice, once for the MB, and a second time for the AP.
The MB tasks come in two runtime "sizes", normal and shorties; there is no difference in the splitting process between the two. The size is determined by the way the telescope was moving, or not, during the data collection. S@H has no control over this movement as it is piggybacking on other people's data collection.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1179504 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.