recent woes

Author	Message
Dudo Joined: 25 Dec 99 Posts: 2 Credit: 6,648,547 RAC: 0	Message 1039269 - Posted: 7 Oct 2010, 23:18:40 UTC In my experience with overheating and a heavily loaded main DB server, the hardware will probably be totally dead within a year (first the RAM, then the disks, and finally the motherboard). At the same time the replica server (same hardware, in the same rack) stayed alive, and alive, ... and made a good spare for the main server :-) ID: 1039269 ·

Widouxmaker Joined: 7 May 02 Posts: 12 Credit: 457,920 RAC: 0	Message 1039284 - Posted: 8 Oct 2010, 0:06:54 UTC I'd like more units but I'm grateful for the few I'm getting. You guys keep up the good work and I'll send another case...heh...Thanks! You talk'n to me? ID: 1039284 ·

Jim_S Joined: 23 Feb 00 Posts: 4705 Credit: 64,560,357 RAC: 31	Message 1039299 - Posted: 8 Oct 2010, 0:28:11 UTC Thanks for the news Jeff...I'm here for the Science...I'll still be here when needed. You guys work Hard...I for one appreciate it. I Desire Peace and Justice, Jim Scott (Mod-Ret.) ID: 1039299 ·

rebest Volunteer tester Joined: 16 Apr 00 Posts: 1296 Credit: 45,357,093 RAC: 0	Message 1039301 - Posted: 8 Oct 2010, 0:30:26 UTC Thanks for the update! Join the PACK! ID: 1039301 ·

Odysseus Volunteer tester Joined: 26 Jul 99 Posts: 1808 Credit: 6,701,347 RAC: 6	Message 1039303 - Posted: 8 Oct 2010, 0:33:27 UTC - in response to Message 1039162. "I have tasks ready to upload that will expire in 6 days!" I expect the admins will try and clear the backlog of uploading & reporting tasks before turning on the validator(s). As long as a given result is found when the parent WU goes up for validation, having been reported late makes no difference. To put it another way, the “deadline” can be understood, for all practical purposes, as the earliest moment that a task will be liable to validation, rather than automatic rejection. It’s even possible for a result to be accepted after missing a validator pass: if the validation is unsuccessful (whether due to errors or other missing results) or was inconclusive, resulting in a ‘resend’, it effectively gets a deadline extension to match the replacement tasks. Anyway, the short version is that I wouldn’t give up on any work until I saw that the corresponding WUs had been validated without it. ID: 1039303 ·
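
The rule described in the post above can be sketched in a few lines of Python. This is only an illustration of the poster's reasoning, not BOINC's actual validator code: the names `WuState`, `Task`, `is_late`, and `can_still_count` are hypothetical, and the three states are a simplification assumed for the example.

```python
from dataclasses import dataclass
from enum import Enum


class WuState(Enum):
    """Simplified, assumed states of a workunit (WU)."""
    PENDING = "pending"            # no validator pass has succeeded yet
    INCONCLUSIVE = "inconclusive"  # a pass failed or was inconclusive; tasks were resent
    VALIDATED = "validated"        # the WU was validated without this result


@dataclass
class Task:
    deadline: float     # nominal deadline, seconds since epoch
    reported_at: float  # when the result actually reached the server


def is_late(task: Task) -> bool:
    """True if the result was reported after its nominal deadline."""
    return task.reported_at > task.deadline


def can_still_count(task: Task, wu_state: WuState) -> bool:
    """Model of the post's claim: lateness alone does not disqualify a result.

    The result is lost only once the parent WU has been validated without it.
    """
    return wu_state is not WuState.VALIDATED
```

Under these assumptions, a result reported well past its deadline is still worth uploading as long as the WU is pending or was resent after an inconclusive pass.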

ScarabDrowner Volunteer tester Joined: 13 Sep 03 Posts: 90 Credit: 456,378 RAC: 0	Message 1039370 - Posted: 8 Oct 2010, 2:44:40 UTC - in response to Message 1039303. The server status page is mostly green again, yay! ID: 1039370 ·

H Elzinga Volunteer tester Joined: 20 Aug 99 Posts: 125 Credit: 8,277,116 RAC: 0	Message 1039454 - Posted: 8 Oct 2010, 7:11:10 UTC - in response to Message 1039269. "In my experience with overheating and a heavily loaded main DB server, the hardware will probably be totally dead within a year (first the RAM, then the disks, and finally the motherboard). At the same time the replica server (same hardware, in the same rack) stayed alive, and alive, ... and made a good spare for the main server :-)" BS. We had a failure of the AC in our main data center over a year ago (May 2009) while everything was in full use during business hours. The temperature inside the cabinets exceeded 48 degrees (we measure in Celsius) and many servers shut down on their overheat protection. We discovered that we had only one casualty when we brought everything up again. In the months following, no other machines failed. If the event had any noticeable impact on the whole server park, it could be a slight increase in drive failures. On the other hand, this could just as well be due to the age of the machines, so there is no hard evidence. Across 300 pieces of hardware there has not been a single RAM or MB failure. ID: 1039454 ·

kittyman Volunteer tester Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 1039456 - Posted: 8 Oct 2010, 7:17:16 UTC - in response to Message 1039454. "BS. We had a failure of the AC in our main data center over a year ago (May 2009) while everything was in full use during business hours. The temperature inside the cabinets exceeded 48 degrees (we measure in Celsius) and many servers shut down on their overheat protection. We discovered that we had only one casualty when we brought everything up again. In the months following, no other machines failed. If the event had any noticeable impact on the whole server park, it could be a slight increase in drive failures. On the other hand, this could just as well be due to the age of the machines, so there is no hard evidence. Across 300 pieces of hardware there has not been a single RAM or MB failure." Don't call 'BS' in this forum. Whatever the source of the hardware problems the project is experiencing, they are real. Caused by the AC failure or not. I personally have had rigs die from overheating. It does happen. So your personal 'claimed' experience does not mean that current Seti problems might not have been caused in part by the AC failure. It certainly did not enhance the reliability of any of the servers in the closet. "Freedom is just Chaos, with better lighting." Alan Dean Foster ID: 1039456 ·

ScarabDrowner Volunteer tester Joined: 13 Sep 03 Posts: 90 Credit: 456,378 RAC: 0	Message 1039457 - Posted: 8 Oct 2010, 7:19:51 UTC - in response to Message 1039454. 48C? I WISH my laptop would run at 48C... It's currently crunching at 68-70C. ID: 1039457 ·

kittyman Volunteer tester Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 1039458 - Posted: 8 Oct 2010, 7:22:10 UTC - in response to Message 1039457. "48C? I WISH my laptop would run at 48C... It's currently crunching at 68-70C." Laptops are, well, laptops. They have always been constrained by cooling problems. "Freedom is just Chaos, with better lighting." Alan Dean Foster ID: 1039458 ·

ScarabDrowner Volunteer tester Joined: 13 Sep 03 Posts: 90 Credit: 456,378 RAC: 0	Message 1039462 - Posted: 8 Oct 2010, 7:28:34 UTC - in response to Message 1039458. "Laptops are, well, laptops. They have always been constrained by cooling problems." I understand that, but to hear 48C described as an overheat experience... That's near chilly for a computer in my experience. ID: 1039462 ·

Ianab Volunteer tester Joined: 11 Jun 08 Posts: 732 Credit: 20,635,586 RAC: 5	Message 1039466 - Posted: 8 Oct 2010, 7:36:01 UTC - in response to Message 1039457. "48C? I WISH my laptop would run at 48C... It's currently crunching at 68-70C." Yeah, but stick your laptop in a cupboard that's at 48C and see what happens. Raise the ambient temp by 20C, and you pretty much raise the component temp by 20C. Hopefully things shut down, or else something breaks down, often in unpredictable ways. The system board or power supply capacitors are a good example: high temps speed up their ageing, and they may not just die. They can simply lose capacitance and make the machine hang at random. It's certainly possible that the heat treatment has prematurely aged a motherboard in one of the servers. Before the cookup it was just within spec, now it's just outside and weird things happen. Ian ID: 1039466 ·

RoosStar Joined: 16 Oct 99 Posts: 51 Credit: 12,900,339 RAC: 20	Message 1039482 - Posted: 8 Oct 2010, 9:28:24 UTC - in response to Message 1039454. "The temperature inside the cabinets exceeded 48 degrees (we measure in Celsius) and many servers shut down on their overheat protection." If I understand him correctly, he did not say that the servers were running at 48 degrees, but that it was the "room temperature" in the cabinets. :D ID: 1039482 ·

Richard Gardner Joined: 9 Jul 03 Posts: 1 Credit: 736,823 RAC: 0	Message 1039484 - Posted: 8 Oct 2010, 9:52:48 UTC I have to say that for those of us who only have computers at work, where there is a shutdown of everything over the weekend, the fact that the weekly outage always falls in the middle of the week means a very reduced window to upload/download. I have 3 big Astropulses that my machine hammered in a great time and finished on Monday evening, but they may not report until next Monday. I know this is a bad time, but the 3-day outage hits me like this every week. I appreciate that work can only be done when people are available, but I thought I should point out that my long-term results (7 years) are slowly dwindling away. ID: 1039484 ·

Eewec Joined: 28 Nov 05 Posts: 19 Credit: 190,633 RAC: 0	Message 1039510 - Posted: 8 Oct 2010, 12:29:28 UTC .... I have a question. I've just looked at the server status page and in the list is db_purge.x86_64. Now, I've also noticed that during the downtimes when the upload / download servers are offline, unlike the other stat lines, "Workunits waiting for db purging" and "Results waiting for db purging" never seem to zero out. I'm curious as to why this is. Surely it would make sense that, if the db_purge server is up, the purging queues would zero out with no new records being added to them. Like I said, just curious. ID: 1039510 ·

Earendil's Star Joined: 1 Jun 03 Posts: 13 Credit: 6,542,706 RAC: 0	Message 1039512 - Posted: 8 Oct 2010, 12:30:45 UTC This entry on the Seti home page is making me smile: The weekly Seti outage allows for a "focus on science processing." Hmmm. I feel an urge to giggle hysterically when I read, "On Friday, you may experience connectivity issues as the servers catch up with demand." On the other hand the tasks of Jeff and Matt et al are no doubt arduous and often thankless and I am grateful for their efforts. There may be merit in what some have suggested in terms of shutting it all down until ways can be found to make Seti run reliably. ID: 1039512 ·

Keith T. Volunteer tester Joined: 23 Aug 99 Posts: 962 Credit: 537,293 RAC: 9	Message 1039538 - Posted: 8 Oct 2010, 13:32:30 UTC - in response to Message 1038958. "I'm sure you already know this, and the equipment you're using is very likely much better than what I've used, but overheating problems followed by 'hangs' might be caused by bad motherboard capacitors. Check the tops of the caps to see if they're expanded or open and leaking--if they are the motherboard is a goner. Sorry if this is obvious to you folks, but figured I would throw this in since I've run across it in the past. Good luck." Hi Richard, welcome to the forums. I expressed a similar theory in 2 posts in the thread http://setiathome.berkeley.edu/forum_thread.php?id=56160&nowrap=true#959975 in November last year, and again on 1 January this year. I have personally seen this problem in recent years, on a friend's Athlon XP system. It happened first when plugging or unplugging USB devices caused a system hang, with Windows BSOD. A few weeks later the Blue Screens got more frequent, and an examination of the motherboard revealed about 50% of the capacitors had bulging or brown stained tops. While I am sure that server grade hardware should be built to higher quality standards, and better tolerances than ordinary consumer equipment, I think some of the kit that SETI@home uses is pre-production or prototype. Add to that the fact that the ageing of the capacitors may have been accelerated by recent overheating. I hope Matt, Jeff and Eric may find this post helpful; I would certainly put flaky capacitors near the top of my suspect list. I have just done a Google search on "bulging capacitors" that produced lots of results, including many images. Keith. ID: 1039538 ·

George E. Lass Joined: 18 May 99 Posts: 2 Credit: 2,323,131 RAC: 1	Message 1039578 - Posted: 8 Oct 2010, 15:02:02 UTC - in response to Message 1039303. "I have tasks ready to upload that will expire in 6 days!" "I expect the admins will try and clear the backlog of uploading & reporting tasks before turning on the validator(s). As long as a given result is found when the parent WU goes up for validation, having been reported late makes no difference. To put it another way, the “deadline” can be understood, for all practical purposes, as the earliest moment that a task will be liable to validation, rather than automatic rejection. It’s even possible for a result to be accepted after missing a validator pass: if the validation is unsuccessful (whether due to errors or other missing results) or was inconclusive, resulting in a ‘resend’, it effectively gets a deadline extension to match the replacement tasks. Anyway, the short version is that I wouldn’t give up on any work until I saw that the corresponding WUs had been validated without it." Thanks for that info. George ID: 1039578 ·

5subslr5 Joined: 4 Nov 02 Posts: 9 Credit: 11,434 RAC: 0	Message 1039664 - Posted: 8 Oct 2010, 16:50:38 UTC - in response to Message 1038922. "We know it will be fixed when it gets fixed." (Steve) Deeeeply insightful !! heheheheheheeheh ID: 1039664 ·

Pascal Meeuws Joined: 25 Nov 09 Posts: 5 Credit: 1,380,836 RAC: 0	Message 1039670 - Posted: 8 Oct 2010, 16:58:18 UTC I've read the comments, and would like to give people some advice. Being a Software Engineer with database experience, I can understand the amount of work the Seti people have to do to keep these systems up. The best way to help them now is to throttle back on the number of tasks being processed so that they can work on the issues at hand. After they are solved, we can happily continue crunching. Regards, Pascal ID: 1039670 ·

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.