Panic Mode On (63) Server problems?

Author	Message
AndyJ Send message Joined: 17 Aug 02 Posts: 248 Credit: 27,380,797 RAC: 0	Message 1179112 - Posted: 18 Dec 2011, 19:10:26 UTC - in response to Message 1179076. Well, it's rather refreshing to have a break in the shorty storm. Most of my rigs have their cache limits hit, the only fly in the ointment. But at least the tasks that are cached have some run time for the GPUs, not 80-90% 2 minute drills. Now, about them there limits.......... I dont know, but I have a very bad feeling that limits will not be raised soon, if ever. Servers are still maxed out with limits in place. Bearing in mind the disconnect between Seti staff who are still looking for "spare" computing power, and top crunchers with farms. Please prove me wrong, but on the technical board, as far as I know, there has been no positive reply to a request to raise the limits. Regards, A ID: 1179112 ·

j tramer Send message Joined: 6 Oct 03 Posts: 242 Credit: 5,412,368 RAC: 0	Message 1179148 - Posted: 18 Dec 2011, 23:07:56 UTC omg i have work ID: 1179148 ·

Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489	Message 1179200 - Posted: 19 Dec 2011, 6:24:46 UTC - in response to Message 1179148. I just had a long weekend away and on return I only had 1 rig to give a nudge to its uploads but nothing ran out of work so I have no reason at all to panic. :D Cheers. ID: 1179200 ·

DMMD Send message Joined: 14 Feb 00 Posts: 118 Credit: 71,564,960 RAC: 0	Message 1179202 - Posted: 19 Dec 2011, 6:51:59 UTC Its hotter than a monky's bum in here ID: 1179202 ·

Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489	Message 1179212 - Posted: 19 Dec 2011, 9:25:15 UTC - in response to Message 1179202. Last modified: 19 Dec 2011, 9:34:36 UTC Well it's been 10-15C cooler here than usual for this time of year, so no worry here with heat yet, but with the help of a proxy my Q6600 is now hitting the limit again after not connecting for 3.5 days so it seems that I'm getting 5-6 days worth of work out of my 10 day cache setting on that rig (joining my other 2 rigs bouncing off the limits). :) Cheers. ID: 1179212 ·

tbret Volunteer tester Send message Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40	Message 1179215 - Posted: 19 Dec 2011, 10:15:42 UTC - in response to Message 1179202. Last modified: 19 Dec 2011, 11:03:44 UTC Its hotter than a monky's bum in here Isn't that supposed to be, "It's hot enough to boil a monkey's bum in here, Your Majesty..." "...and she smiled quietly to herself." EDIT - Yes, it is. Went to my office Sunday night about 9:00pm. Some kind soul had turned the furnace completely off. No heat and no air circulation. I walked in the front door, it was freezing. I climbed the 39 Steps to my office and was cold. Grabbed the doorknob to my office, inserted the key and noticed that my hand was warm on the imitation brass. I threw open the door and was met by a blast of hot air. There were fans blowing like crazy in four computers. I'll bet it was 100+F in there. I quickly crossed the room hardly noticing the body of the secretary laying lifeless on the floor (I checked for a pulse...I didn't have one, so she must have been dead), and flung open the windows on two corners of my corner office. I wish I had raised them, but I flung them so I had to go down to the street and pick them up and replace the glass - again. After nearly two hours of replacing window panes and running the central heating's fan (but no heat) with two windows broken, it got cool enough to breath in my office, so I went to check the body I had seen on my floor... It was gone. The body had simply vanished when the temperature dropped. I put both my brain cells to work and was able to deduce that I must have been fooled by a full body mirage. (like, in a desert, where it's like, warm and stuff) Of course! A mirage! I don't even HAVE a secretary! But if I did have, and she'd been locked in that office, then, look-out because she'd have been passed-out or dead or something because it was HOT in there. I know hot when I feel it and it was hot. I felt it. 
(and so, incidentally was the non-existent secretary, although she would have been hotter had she not been all icky and dead; hot I mean, the dead secretary was "hot", not felt). If I could just pipe that into the return air ducts at my office, the rest of the building wouldn't need to be heated... unless I flung a lot more windows. (the computer exhaust, folks, I'm saying pipe the heat into the return air ducts not pipe the hot, dead, non-existent secretary -- couldn't feel her even if you stuff her in a return air duct; and how much fun would it be to feel a dead secretary in an air duct anyway?) EDIT EDIT EDIT - Why do I have to explain my sentences to you people so much? ID: 1179215 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 1179231 - Posted: 19 Dec 2011, 12:40:40 UTC Well, thanks to the weather here being unseasonably warm for this time of year and the crunchers all having Seti to do, the kitties are all toasty warm here too. Haven't had to fire up the furnace yet this year. Nothing is melting down here though. "Freedom is just Chaos, with better lighting." Alan Dean Foster ID: 1179231 ·

Belthazor Volunteer tester Send message Joined: 6 Apr 00 Posts: 219 Credit: 10,373,795 RAC: 13	Message 1179299 - Posted: 19 Dec 2011, 16:53:25 UTC - in response to Message 1179297. While again switching over to AP only, I get so depressed that I can hardly work. I only have 85 AP's in progress, and if I don't get more in the next week or so, my computers will run dry. The pain, the pain. A lot of APs were split today. Why didn't you get some of them? Merry Christmas, and happy new WUs folks :-) Yeah, the same to you. As a present for these holidays, I give you the AP WUs which I could have got for myself, but I leave them to you :) ID: 1179299 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 1179356 - Posted: 19 Dec 2011, 19:22:12 UTC - in response to Message 1179112. Well, it's rather refreshing to have a break in the shorty storm. Most of my rigs have their cache limits hit, the only fly in the ointment. But at least the tasks that are cached have some run time for the GPUs, not 80-90% 2 minute drills. Now, about them there limits.......... I dont know, but I have a very bad feeling that limits will not be raised soon, if ever. Servers are still maxed out with limits in place. Bearing in mind the disconnect between Seti staff who are still looking for "spare" computing power, and top crunchers with farms. Please prove me wrong, but on the technical board, as far as I know, there has been no positive reply to a request to raise the limits. Regards, A I think now, with the major shorty storm passed for the moment, would be a good time. "Freedom is just Chaos, with better lighting." Alan Dean Foster ID: 1179356 ·

AndyJ Send message Joined: 17 Aug 02 Posts: 248 Credit: 27,380,797 RAC: 0	Message 1179370 - Posted: 19 Dec 2011, 19:59:27 UTC - in response to Message 1179356. Last modified: 19 Dec 2011, 20:19:05 UTC I think now, with the major shorty storm passed for the moment, would be a good time. Yes, it would be a good time, but I really believe nobody at Seti has the slightest intention of raising the limits. Just my opinion. Regards, A Edit: As in my post above this one, could somebody prove this rather miserable theory wrong? Anything from the technical board regarding replies to requests from crunchers to raise the limits (there are a lot,) by Seti staff? ID: 1179370 ·

Josef W. Segur Volunteer developer Volunteer tester Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 1179427 - Posted: 20 Dec 2011, 0:04:15 UTC The limits are in place to protect against massive overfetching when GPU work is again properly estimated server side, IOW when changeset [trac]changeset:24217[/trac] is revised or removed. Getting the AP validator revised to exclude early exit times from averaging is needed first to ensure the APR for Astropulse applications is reasonable, unless someone decides the ~1200 new hosts per day don't deserve protection (new is either really new or simply a new hostID). The project staff could decide to throw caution to the winds, but I don't expect that. Similarly, I don't expect BOINC to have a fair limiting mechanism anytime soon. Joe ID: 1179427 ·
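A minimal sketch of why early exits distort the averaging Joe describes. It assumes APR means an average processing rate computed as nominal work divided by elapsed time, and that a plain mean is used; the function name, the `early_exit` flag and the record layout are hypothetical and are not taken from the BOINC or Astropulse validator code.

```python
def average_rate(results, nominal_flops):
    """Mean processing rate over completed tasks, skipping early exits.

    Assumption: every task is credited with the same nominal_flops, so a task
    that quits after a few seconds looks absurdly fast if it is averaged in.
    """
    rates = [
        nominal_flops / r["elapsed"]
        for r in results
        if not r["early_exit"]  # hypothetical flag set by the validator
    ]
    return sum(rates) / len(rates) if rates else None


results = [
    {"elapsed": 100.0, "early_exit": False},
    {"elapsed": 1.0, "early_exit": True},   # quit almost immediately
    {"elapsed": 100.0, "early_exit": False},
]
filtered = average_rate(results, nominal_flops=1000.0)
naive = sum(1000.0 / r["elapsed"] for r in results) / len(results)
```

In this toy data the naive average comes out far higher than the filtered one, and a server that trusts the naive figure would expect the host to finish work much sooner than it can, and would send far too much of it.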

SciManStev Volunteer tester Send message Joined: 20 Jun 99 Posts: 6652 Credit: 121,090,076 RAC: 0	Message 1179429 - Posted: 20 Dec 2011, 0:05:41 UTC - in response to Message 1179427. The limits are in place to protect against massive overfetching when GPU work is again properly estimated server side, IOW when changeset [trac]changeset:24217[/trac] is revised or removed. Getting the AP validator revised to exclude early exit times from averaging is needed first to ensure the APR for Astropulse applications is reasonable, unless someone decides the ~1200 new hosts per day don't deserve protection (new is either really new or simply a new hostID). The project staff could decide to throw caution to the winds, but I don't expect that. Similarly, I don't expect BOINC to have a fair limiting mechanism anytime soon. Joe Thank you Joe! It is clear that continued patience is what is called for. Steve Warning, addicted to SETI crunching! Crunching as a member of GPU Users Group. GPUUG Website ID: 1179429 ·

BWX Send message Joined: 31 May 03 Posts: 36 Credit: 156,754,993 RAC: 24	Message 1179443 - Posted: 20 Dec 2011, 1:55:43 UTC Only in the past several hours does it look like everyone has finally reached their limits (see Cricket graphs). My 2 big rigs only today started getting work to keep both the CPU's and GPU's busy, and only reached their limits in the last hour or so. Now would be the time to raise the limits in small increments (with an expected rush at the beginning of each), waiting for everyone to reach the new limit before raising again. Eventually we may get back to as it was before. ID: 1179443 ·

Terror Australis Volunteer tester Send message Joined: 14 Feb 04 Posts: 1817 Credit: 262,693,308 RAC: 44	Message 1179449 - Posted: 20 Dec 2011, 2:57:21 UTC I will play Devil's Advocate here. Why raise the limits and stress an already overstressed system even further ? The current limits equate to about a 1 1/2 to 2 day cache on my fastest rigs, more than enough to get them through the weekly planned outage. By keeping the current limits it helps avoid any unplanned outages. As long as the limit can be maintained I see no problem. As long as the rigs don't run dry we should all be happy. (BTW I normally run a 4 day cache) T.A. ID: 1179449 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1179451 - Posted: 20 Dec 2011, 3:11:23 UTC - in response to Message 1179449. Last modified: 20 Dec 2011, 3:22:45 UTC I will play Devil's Advocate here. Why raise the limits and stress an already overstressed system even further ? The current limits equate to about a 1 1/2 to 2 day cache on my fastest rigs, more than enough to get them through the weekly planned outage. By keeping the current limits it helps avoid any unplanned outages. As long as the limit can be maintained I see no problem. As long as the rigs don't run dry we should all be happy. (BTW I normally run a 4 day cache) T.A. In its most basic form it's about what caches (in general) are 'really for', as opposed to just having 'lots of tasks on the hosts'. In computer science, as with transport logistics & warehousing, operating with no (or insufficient) local caches results in supply interruptions whenever there is a stall at the slowest link in the chain. Having sufficient caches places an initial startup load demand, but smooths out & allows for service interruptions without bringing the whole chain to a stall if some link in the chain drops out temporarily or periodically. It achieves that through techniques called 'latency hiding', which if perfectly balanced for transport might be something like 'Just in time delivery', which theorises a perfectly sized cache at each stage to keep steady flow at the consuming end with minimal storage or interruptions. In those respects, obviously there are finite limits on the total work that can be supplied, though reasonable local caches allow periodic interruption or maintenance to occur, improving the overall efficiency of the system by effectively smoothing the peak demands. 
Caches that are too big obviously also have their own problems, with the finite workunit storage setup being centralised, though by rights total in/out turnover should be no different, just more constant & spread out, if balanced caches are in play. Less peak server stress & end system downtime without much cost. Unfortunately the bigger part of that cost to build 'our' caches would likely be that workunit storage, so a very real physical limit at this time. Fingers crossed the efforts of the GPU Users group will be a big help along those lines. Jason "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1179451 ·
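The point about caches riding out interruptions can be put in numbers with a toy model. This is only a sketch under stated assumptions: the host burns one hour of cached work per hour, and whenever the server is reachable it works normally and tops its cache back up instantly. The function name and the figures are illustrative, not measurements from the project.

```python
def idle_hours(cache_hours, outage_start, outage_length, total_hours):
    """Count the hours a host sits idle for a given local cache size.

    Hypothetical model: while the server is up the host crunches and refills
    its cache to cache_hours; during an outage it lives off the cache alone.
    """
    cached = cache_hours
    idle = 0
    for hour in range(total_hours):
        server_up = not (outage_start <= hour < outage_start + outage_length)
        if server_up:
            cached = cache_hours  # work this hour and top the cache back up
        elif cached > 0:
            cached -= 1           # outage: burn one hour of cached work
        else:
            idle += 1             # outage and the cache is empty
    return idle


# A 24-hour outage starting at hour 10 of a 48-hour window.
for size in (0, 6, 24):
    print(size, idle_hours(size, outage_start=10, outage_length=24, total_hours=48))
```

In this model a cache at least as long as the outage hides it completely, while anything larger buys nothing extra for that outage and only adds to the workunit storage the servers have to hold.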

Terror Australis Volunteer tester Send message Joined: 14 Feb 04 Posts: 1817 Credit: 262,693,308 RAC: 44	Message 1179456 - Posted: 20 Dec 2011, 3:46:12 UTC - in response to Message 1179451. Last modified: 20 Dec 2011, 3:47:42 UTC In it's most basic form it's about what caches (in general) are 'really for', as opposed to just having 'lots of tasks on the hosts'. In computer science, as with transport logistics & warehousing, operating with no (or insufficient) local caches results in supply interruptions whenever there is a stall at the slowest link in the chain. Having sufficient caches places initial startup load demand, but smooths out & allows for service interruptions without bringing the whole chain to a stall if some link in the chain drops out temporarily or periodically. It acheives that through techniques called 'latency hiding' which if perfectly balanced for transport might be something like 'Just in time delivery', which theorises a perfect sized cache at each stage to keep steady flow at the consuming end with minimal storage or interruptions. In those respects, obviously there are finite limits on the total work that can be supplied, though reasonable local caches allows for periodic interruption or maintenance to occur to improve overall efficiency of the system effectively smoothing the peak demands. Too big caches obviously also has its own problems with the finite workunit storage setup being centralised, though by rights total in/out turnover should be no different but more constant & spread out if balanced caches are in play. Less peak server stress & end system downtime without much cost. Unfortunately the bigger part of that cost to build 'our' caches would likely be that workunit storage, so a very real physical limit at this time. 
Jason Agreed, Jason. But I think you overlook the fact that after an outage there will always be a rush, as those that are totally out of work seek to restart and those who still have work but whose caches have run down seek to build back up to the preset level. On a restart, cache size makes no difference to the demand for work or the server load. For what you say to be effective, the servers would have to look at who still has plenty of work, who is running low and who is empty, and prioritise the allocation of work units accordingly. This does not happen, and we get the usual "log jam", which can take a week to clear. The recovery from a server outage would be a lot smoother if the system could prioritise work unit allocation effectively, so that 1) everybody has SOME work, 2) everybody has (e.g.) 2 days' work, 3) caches are topped up to the set limit. That is what happens in the real world examples you quote above, where after an "outage" in a "just in time" supply situation the supplier gives priority to those with the least amount of product on hand and then catches up with the rest over the next few days. T.A. ID: 1179456 ·
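The three-stage allocation proposed above could look something like the sketch below. It is not how the BOINC scheduler works (the post itself says this does not happen); the `Host` fields, the `allocate` function and the idea that the server knows how many tasks make up "two days" for each host are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    have: int   # tasks currently cached on the host
    floor: int  # tasks covering roughly two days (assumed to be known)
    limit: int  # the host's cache limit, in tasks


def allocate(hosts, available, some=1):
    """Hand out `available` tasks in three passes, emptiest hosts first."""
    grants = {h.name: 0 for h in hosts}

    def fill_to(target_of):
        nonlocal available
        # Within each pass, serve the hosts holding the least work first.
        for h in sorted(hosts, key=lambda h: h.have + grants[h.name]):
            want = target_of(h) - (h.have + grants[h.name])
            give = max(0, min(want, available))
            grants[h.name] += give
            available -= give

    fill_to(lambda h: min(some, h.limit))     # 1) everybody has SOME work
    fill_to(lambda h: min(h.floor, h.limit))  # 2) everybody has ~2 days' work
    fill_to(lambda h: h.limit)                # 3) top up to the set limit
    return grants


hosts = [
    Host("dry", have=0, floor=10, limit=40),
    Host("low", have=5, floor=10, limit=40),
    Host("full", have=30, floor=10, limit=40),
]
scarce = allocate(hosts, available=12)
plenty = allocate(hosts, available=100)
```

With only 12 tasks to hand out, the dry host is brought up to its two-day floor first, the low host gets what is left, and the well-stocked host waits. With 100 tasks available, all three end up at their limit.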

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1179459 - Posted: 20 Dec 2011, 4:08:56 UTC - in response to Message 1179456. Agreed Jason But I think you overlook the fact that after an outage there will always be a rush as those that are totally out of work seek to restart and those who still have work but whose caches have run down seek to build back up to the preset level. On a restart cache size makes no difference to the demand for work/server load. For what you say to be effective the servers would have to look at who still has plenty of work, who is running low and who is empty and prioritise the allocation of work units accordingly. This does not happen and we get the usual "log jam" which can take a week to clear. If the system could prioritise work unit allocation effectively so that 1) Everybody has SOME work 2) Everybody has (eg) 2 days work 3) Top up caches to the set limit As happens in the real world examples you quote above, where after an "outage" in a "just in time" supply situation the supplier gives priority to those with the least amount of product on hand and then catches up with the rest over the next few days, the recovery from a server outage would be a lot smoother. T.A. That's right, we're talking about a theoretical optimum concept versus realities. One thing that would happen in such an idealised mechanism when things come online is perhaps a staged onramp and prioritising the biggest consumers first as described. That's still a logistics perspective, and somewhat at odds with the way project backoffs and manual retry buttons work for us. In that 'ideal' system you would never have run completely dry in the first place, therefore the peak is briefer and controllable. 
We know the time estimates are at best marginal, so perhaps a better quantity for onramp rationing is the current # of tasks, but not allowing that to ramp up will pretty much guarantee sustained peak, or in CPU cache analog terms 'thrashing', or in manufacturer terms perhaps plant shutdowns & layoffs due to supply shortages. Even if we take # of tasks as currently implemented & ramp that up gradually until the demand levels out (hopefully at around 70% of max bandwidth), there's still the server workunit storage issue, which remains unsolved as a hard limit. For what it's worth, under smooth running I also find my systems run continuously with 3 days cache, tolerant of normal short maintenance outages AND backoffs. What they can't seem to tolerate is the continuous heavy contention on the comms line, during which most requests are at best met with 'Project has no tasks available'. Might as well have saved the bandwidth & not asked for work. Jason "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1179459 ·
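A rough sketch of the gradual ramp described above: raise the per-host task limit in small steps while link utilisation stays under the roughly 70% mark, and never beyond a ceiling standing in for the workunit storage limit. The function, the step size and the `storage_cap` parameter are hypothetical; nothing here reflects an actual server setting.

```python
def next_limit(current_limit, bandwidth_used, step=10, target=0.70, storage_cap=None):
    """Return the task limit to use for the next period.

    bandwidth_used is the fraction of the link in use (0.0 to 1.0).
    The limit only goes up while demand sits below the target, and it is
    clamped to storage_cap when one is given.
    """
    if bandwidth_used >= target:
        return current_limit  # demand has not levelled out yet, so hold
    raised = current_limit + step
    if storage_cap is not None:
        raised = min(raised, storage_cap)
    return raised


# Replaying a few made-up utilisation readings, one per period.
limit = 50
history = []
for used in (0.95, 0.80, 0.65, 0.60, 0.90, 0.55):
    limit = next_limit(limit, used, storage_cap=75)
    history.append(limit)
```

The limit holds at 50 while the link is saturated, steps up to 60 and 70 once utilisation drops, holds again during the next rush, and finally stops at the assumed storage ceiling of 75.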

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 1179474 - Posted: 20 Dec 2011, 7:17:44 UTC Yeah, I'll be happy with the dang limits until the next time the servers trip up for more than a day or so, or there is trouble coming back up from an outage. I am NOT happy having to choose Einstein or any other Boinc project when Seti is my passion and having some steps taken on the remedial action necessary to get the cache limits lifted has not been very forthcoming. The foot dragging on this is getting a bit irksome. Up until the Boinc botch, my caches have allowed me to weather many a Seti storm of one kind or another. You can't blame me for desiring a return of my ability to cache enough work to cover an extended outage. "Freedom is just Chaos, with better lighting." Alan Dean Foster ID: 1179474 ·

Kevin Olley Send message Joined: 3 Aug 99 Posts: 906 Credit: 261,085,289 RAC: 572	Message 1179478 - Posted: 20 Dec 2011, 7:50:09 UTC The limits are ok when everything is running as it should be, but unfortunately whenever we seem to have a hic-cup the server seems to stoke the fire with masses of AP's and shorties. If there could be a little bit of pre-sorting of the tapes so that we don't get flooded at peak demand times. Or is it because the servers get too many shorties in one go it causes the hic-cup in the first place? Kevin ID: 1179478 ·

rob smith Volunteer moderator Volunteer tester Send message Joined: 7 Mar 03 Posts: 22189 Credit: 416,307,556 RAC: 380	Message 1179504 - Posted: 20 Dec 2011, 9:39:43 UTC Each tape is split twice, once for the MB, and a second time for the AP. The MB come in two runtime "sizes" normal, and shorties; there is no difference in the splitting process between the two, it is determined by the way the telescope was moving, or not, during the data collection. S@H has no control over this movement as it is piggybacking on other people's data collection. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? ID: 1179504 ·
