recent woes

Message boards : Technical News : recent woes
Dudo

Joined: 25 Dec 99
Posts: 2
Credit: 6,648,547
RAC: 0
Croatia
Message 1039269 - Posted: 7 Oct 2010, 23:18:40 UTC

In my experience with an overheated, intensively used main DB server, the hardware will probably be totally dead within a year (first the RAM, then a disk, and finally the motherboard). Meanwhile the replica server (same hardware, in the same rack) stayed alive, and alive, ... and made a good spare for the main server :-)
ID: 1039269
Widouxmaker
Joined: 7 May 02
Posts: 12
Credit: 457,920
RAC: 0
United States
Message 1039284 - Posted: 8 Oct 2010, 0:06:54 UTC

I'd like more units but I'm grateful for the few I'm getting. You guys keep up the good work and I'll send another case...heh...Thanks!
You talk'n to me?
ID: 1039284
Jim_S
Joined: 23 Feb 00
Posts: 4705
Credit: 64,560,357
RAC: 31
United States
Message 1039299 - Posted: 8 Oct 2010, 0:28:11 UTC

Thanks for the news Jeff...I'm here for the Science...I'll still be here when needed.
You guys work Hard...I for one appreciate it.

I Desire Peace and Justice, Jim Scott (Mod-Ret.)
ID: 1039299
rebest
Volunteer tester
Joined: 16 Apr 00
Posts: 1296
Credit: 45,357,093
RAC: 0
United States
Message 1039301 - Posted: 8 Oct 2010, 0:30:26 UTC

Thanks for the update!

Join the PACK!
ID: 1039301
Odysseus
Volunteer tester
Joined: 26 Jul 99
Posts: 1808
Credit: 6,701,347
RAC: 6
Canada
Message 1039303 - Posted: 8 Oct 2010, 0:33:27 UTC - in response to Message 1039162.  

I have tasks ready to upload that will expire in 6 days!

I expect the admins will try and clear the backlog of uploading & reporting tasks before turning on the validator(s). As long as a given result is found when the parent WU goes up for validation, having been reported late makes no difference. To put it another way, the “deadline” can be understood, for all practical purposes, as the earliest moment that a task will be liable to validation—rather than automatic rejection. It’s even possible for a result to be accepted after missing a validator pass: if the validation is unsuccessful (whether due to errors or other missing results) or was inconclusive, resulting in a ‘resend’, it effectively gets a deadline extension to match the replacement tasks.

Anyway, the short version is that I wouldn’t give up on any work until I saw that the corresponding WUs had been validated without it.
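
The rule described above can be put as a small sketch. This is only an illustration of the reasoning in this post, not the project's actual validator code; the `Task` class, the `counts_for_validation` function, and the dates are all made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Task:
    # Hypothetical model of one task; not a real BOINC data structure.
    deadline: datetime
    reported_at: Optional[datetime] = None  # None means not reported yet


def counts_for_validation(task: Task, validation_time: datetime) -> bool:
    """A result counts if it has been reported by the time the validator runs.

    The deadline is deliberately not consulted here: as described above, it
    only marks the earliest moment validation may go ahead without the result.
    """
    return task.reported_at is not None and task.reported_at <= validation_time


deadline = datetime(2010, 10, 14)
late_task = Task(deadline=deadline, reported_at=datetime(2010, 10, 15))

# The validator runs on 16 Oct: the late result is already there, so it counts.
print(counts_for_validation(late_task, datetime(2010, 10, 16)))

# The validator ran at noon on 14 Oct, before the report: that pass is missed.
print(counts_for_validation(late_task, datetime(2010, 10, 14, 12)))
```

The first call prints True and the second False. A missed pass is not necessarily the end either: if that validation is inconclusive and a resend goes out, the late result gets another chance at the next pass.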
ID: 1039303
ScarabDrowner
Volunteer tester
Joined: 13 Sep 03
Posts: 90
Credit: 456,378
RAC: 0
United States
Message 1039370 - Posted: 8 Oct 2010, 2:44:40 UTC - in response to Message 1039303.  

The server status page is mostly green again, yay!
ID: 1039370
H Elzinga
Volunteer tester
Joined: 20 Aug 99
Posts: 125
Credit: 8,277,116
RAC: 0
Netherlands
Message 1039454 - Posted: 8 Oct 2010, 7:11:10 UTC - in response to Message 1039269.  

In my experience with an overheated, intensively used main DB server, the hardware will probably be totally dead within a year (first the RAM, then a disk, and finally the motherboard). Meanwhile the replica server (same hardware, in the same rack) stayed alive, and alive, ... and made a good spare for the main server :-)


BS

We had a failure of the AC in our main data center over a year ago (May 2009) while everything was in full use during business hours.
The temperature inside the cabinets exceeded 48 degrees (we measure in Celsius), and many servers shut down on overheat protection.

We discovered only one casualty when we brought everything up again.
In the following months no other machines failed.
If the event had any noticeable impact on the whole server park, it may have been a slight increase in drive failures.
On the other hand, that could just as well be due to the age of the machines, so there is no hard evidence.

Across 300 pieces of hardware there has not been a single RAM or MB failure.
ID: 1039454
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1039456 - Posted: 8 Oct 2010, 7:17:16 UTC - in response to Message 1039454.  


BS

We had a failure of the AC in our main data center over a year ago (May 2009) while everything was in full use during business hours.
The temperature inside the cabinets exceeded 48 degrees (we measure in Celsius), and many servers shut down on overheat protection.

We discovered only one casualty when we brought everything up again.
In the following months no other machines failed.
If the event had any noticeable impact on the whole server park, it may have been a slight increase in drive failures.
On the other hand, that could just as well be due to the age of the machines, so there is no hard evidence.

Across 300 pieces of hardware there has not been a single RAM or MB failure.


Don't call 'BS' in this forum. Whatever the source of the hardware problems the project is experiencing, they are real. Caused by the AC failure or not.

I personally have had rigs die from overheating.

It does happen. So your personal 'claimed' experience does not mean that the current Seti problems could not have been caused in part by the AC failure.
It certainly did not enhance the reliability of any of the servers in the closet.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1039456
ScarabDrowner
Volunteer tester
Joined: 13 Sep 03
Posts: 90
Credit: 456,378
RAC: 0
United States
Message 1039457 - Posted: 8 Oct 2010, 7:19:51 UTC - in response to Message 1039454.  

48C? I *WISH* my laptop would run at 48C... It's currently crunching at 68-70C.
ID: 1039457
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1039458 - Posted: 8 Oct 2010, 7:22:10 UTC - in response to Message 1039457.  

48C? I *WISH* my laptop would run at 48C... It's currently crunching at 68-70C.

Laptops are, well, laptops.
They have always been constrained by cooling problems.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1039458
ScarabDrowner
Volunteer tester
Joined: 13 Sep 03
Posts: 90
Credit: 456,378
RAC: 0
United States
Message 1039462 - Posted: 8 Oct 2010, 7:28:34 UTC - in response to Message 1039458.  

Laptops are, well, laptops.
They have always been constrained by cooling problems.


I understand that, but to hear 48C described as an overheat experience... That's near chilly for a computer in my experience.
ID: 1039462
Ianab
Volunteer tester
Joined: 11 Jun 08
Posts: 732
Credit: 20,635,586
RAC: 5
New Zealand
Message 1039466 - Posted: 8 Oct 2010, 7:36:01 UTC - in response to Message 1039457.  

48C? I *WISH* my laptop would run at 48C... It's currently crunching at 68-70C.


Yeah, but stick your laptop in a cupboard that's at 48C and see what happens. Raise the ambient temp by 20C, and you pretty much raise the component temp by 20C. Hopefully things shut down; otherwise something breaks, often in unpredictable ways. The system board or power supply capacitors are a good example: high temps speed up their ageing, and they may not simply die. They can just lose capacitance and make the machine hang at random.

It's certainly possible that the heat treatment has prematurely aged a motherboard in one of the servers. Before the cookup it was just within spec, now it's just outside and weird things happen.
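
To put rough numbers on the ageing point, here is a minimal sketch. It assumes the common rule of thumb for aluminium electrolytic capacitors that expected life roughly halves for every 10C rise in temperature. The rule, the function name, and the 2000-hour / 105C rating are illustrative assumptions, not figures from this thread or from the SETI servers.

```python
def estimated_life_hours(rated_life_hours: float,
                         rated_temp_c: float,
                         actual_temp_c: float) -> float:
    """Rule-of-thumb estimate: life doubles for every 10C below the rated temp."""
    return rated_life_hours * 2 ** ((rated_temp_c - actual_temp_c) / 10)


# Assumed part: rated for 2000 hours at 105C.
normal = estimated_life_hours(2000, 105, 65)  # capacitor sitting at 65C
hot = estimated_life_hours(2000, 105, 85)     # same capacitor, ambient up 20C

print(normal, hot, normal / hot)
```

Under those assumptions a 20C rise cuts the estimated life to a quarter (32000 hours down to 8000). A short heat event uses up only a small part of that, but a part that was already marginal could be pushed out of spec.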

Ian
ID: 1039466
RoosStar
Joined: 16 Oct 99
Posts: 51
Credit: 12,900,339
RAC: 20
Netherlands
Message 1039482 - Posted: 8 Oct 2010, 9:28:24 UTC - in response to Message 1039454.  

The temperature inside the cabinets exceeded 48 degrees (we measure in Celsius), and many servers shut down on overheat protection.

If I understand him correctly, he did not say that the servers were running at 48 degrees, but that this was the "room temperature" inside the cabinets. :D
ID: 1039482
Richard Gardner
Joined: 9 Jul 03
Posts: 1
Credit: 736,823
RAC: 0
United Kingdom
Message 1039484 - Posted: 8 Oct 2010, 9:52:48 UTC

I have to say that for those of us who only have computers at work, where there is a shut down of everything over the weekend, the fact that the weekly outage always falls in the middle of the week means a very reduced window to upload/download. I have 3 big Astropulses that my machine hammered in a great time and finished on Monday evening but they may not report until next Monday.

I know this is a bad time but the 3 day outage hits me like this every week.
I appreciate that work can only be done when people are available but I thought I should point out that my long term results (7 years) are slowly dwindling away.
ID: 1039484
Eewec
Joined: 28 Nov 05
Posts: 19
Credit: 190,633
RAC: 0
United Kingdom
Message 1039510 - Posted: 8 Oct 2010, 12:29:28 UTC

.... I have a question. I've just looked at the server status page, and db_purge.x86_64 is in the list. I've also noticed that during the downtimes, when the upload / download servers are offline, the "Workunits waiting for db purging" and "Results waiting for db purging" lines, unlike the other stat lines, never seem to zero out. I'm curious as to why this is. Surely, if the db_purge server is up and no new records are being added to the queue, the purge counts should drop to zero.

Like I said, just curious.
ID: 1039510
Earendil's Star
Joined: 1 Jun 03
Posts: 13
Credit: 6,542,706
RAC: 0
Zimbabwe
Message 1039512 - Posted: 8 Oct 2010, 12:30:45 UTC

This entry on the Seti home page is making me smile:

The weekly Seti outage allows for a "focus on science processing." Hmmm.

I feel an urge to giggle hysterically when I read, "On Friday, you may experience connectivity issues as the servers catch up with demand."

On the other hand the tasks of Jeff and Matt et al are no doubt arduous and often thankless and I am grateful for their efforts.

There may be merit in what some have suggested in terms of shutting it all down until ways can be found to make Seti run reliably.


ID: 1039512
Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 962
Credit: 537,293
RAC: 9
United Kingdom
Message 1039538 - Posted: 8 Oct 2010, 13:32:30 UTC - in response to Message 1038958.  

I'm sure you already know this, and the equipment you're using is very likely much better than what I've used, but overheating problems followed by "hangs" might be caused by bad motherboard capacitors. Check the tops of the caps to see if they're expanded or open and leaking--if they are the motherboard is a goner. Sorry if this is obvious to you folks, but figured I would throw this in since I've run across it in the past. Good luck.


Hi Richard, welcome to the forums.

I expressed a similar theory in 2 posts in the thread http://setiathome.berkeley.edu/forum_thread.php?id=56160&nowrap=true#959975 in November last year, and again on 1 January this year.

I have personally seen this problem in recent years, on a friend's Athlon XP system. It happened first when plugging or unplugging USB devices caused a system hang, with Windows BSOD. A few weeks later the Blue Screens got more frequent, and an examination of the motherboard revealed about 50% of the capacitors had bulging or brown stained tops.

While I am sure that server grade hardware should be built to higher quality standards, and better tolerances than ordinary consumer equipment, I think some of the kit that SETI@home uses is pre-production or prototype.

Add to that the fact that the ageing of the capacitors may have been accelerated by recent overheating.

I hope Matt, Jeff and Eric may find this post helpful; I would certainly put flaky capacitors near the top of my suspect list.
I have just done a Google search on "bulging capacitors" that produced lots of results, including many images.

Keith.
ID: 1039538
George E. Lass
Joined: 18 May 99
Posts: 2
Credit: 2,323,131
RAC: 1
United States
Message 1039578 - Posted: 8 Oct 2010, 15:02:02 UTC - in response to Message 1039303.  

I have tasks ready to upload that will expire in 6 days!

I expect the admins will try and clear the backlog of uploading & reporting tasks before turning on the validator(s). As long as a given result is found when the parent WU goes up for validation, having been reported late makes no difference. To put it another way, the “deadline” can be understood, for all practical purposes, as the earliest moment that a task will be liable to validation—rather than automatic rejection. It’s even possible for a result to be accepted after missing a validator pass: if the validation is unsuccessful (whether due to errors or other missing results) or was inconclusive, resulting in a ‘resend’, it effectively gets a deadline extension to match the replacement tasks.

Anyway, the short version is that I wouldn’t give up on any work until I saw that the corresponding WUs had been validated without it.


Thanks for that info

George
ID: 1039578
5subslr5
Joined: 4 Nov 02
Posts: 9
Credit: 11,434
RAC: 0
United States
Message 1039664 - Posted: 8 Oct 2010, 16:50:38 UTC - in response to Message 1038922.  

"We know it will be fixed when it gets fixed." (Steve)

Deeeeply insightful !! heheheheheheeheh
ID: 1039664
Pascal Meeuws
Joined: 25 Nov 09
Posts: 5
Credit: 1,380,836
RAC: 0
Netherlands
Message 1039670 - Posted: 8 Oct 2010, 16:58:18 UTC

I've read the comments, and would like to give the people some advice.
Being a Software Engineer with database experience, I can understand the amount of work the Seti people have to do to keep these systems up.

The best way to help them now is to throttle back the number of tasks being processed so that they can work on the issues at hand.

After they are solved, we can happily continue crunching.

Regards,

Pascal
ID: 1039670


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.