Sun Dies (Feb 22 2012)

Message boards : Technical News : Sun Dies (Feb 22 2012)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1198597 - Posted: 22 Feb 2012, 21:34:44 UTC

So... another week another minor server crisis. This one was brewing for a while - we've been getting memory errors/upsets on our main internal file server (which hosts, among other things, all the files that make up the SETI@home web site). We got replacement memory, and were hoping for a quiescent moment to swap it out, but after two crashes in one day (on Tuesday) I just went ahead and did the swap.

So far so good (i.e. no further crashes), except we're still getting memory upsets in the server log. I only replaced 2 of the faulty DIMMs (which were noted as faulty by the motherboard), but maybe others need replacing as well.

In the meantime I found that project recovery today was significantly slowed by the result web pages on our site, so those are turned off at the moment (as I'm writing this).

Meanwhile other tasks this week included cleaning up the lab (the fire marshal is visiting today) and resurrecting SERENDIP code I haven't touched in over a decade. I got it to compile; now I'm just removing the non-fatal compiler warnings one by one. We'll use this code to help process Kepler data (which happens to be in a similar format to our old SERENDIP data). Maybe I'll even get back to analyzing the SERENDIP IV data set (also over a decade old and it may be worth taking another look at it with this code).

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1198597
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22149
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1198599 - Posted: 22 Feb 2012, 21:39:48 UTC

Thanks for the update Matt.


I have a "cure" for fire marshals - pm me for info ;-)
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1198599
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1198604 - Posted: 22 Feb 2012, 21:50:29 UTC - in response to Message 1198597.  

Thanks for the update Matt,

Claggy
ID: 1198604
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1198609 - Posted: 22 Feb 2012, 21:59:29 UTC

@ Matt - I've been getting random memory errors on my home server for the last month, and bluescreen lockups, even with ECC memory.

Turned out to be a voltage regulator failure on the motherboard (affecting the memory termination voltage only) - the same memory is working fine in a replacement motherboard.

Might be worth checking voltages, if that motherboard has the right degree of instrumentation.
ID: 1198609
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1198642 - Posted: 23 Feb 2012, 0:05:23 UTC - in response to Message 1198609.  

It also might be worth checking what voltages the power supply is putting out. As was discovered/documented in my lengthy capacitor replacement thread, whilst bulged caps were part of the problem, the root of the problem was the power supply. After opening it up, I found that every single capacitor in there was bulged and nearly at the burst/ooze stage.

While it isn't very feasible to open the power supply up, you should either be able to see what the board reports for the voltages, or just back-probe connectors with a voltmeter and see what the rails are putting out.
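
For anyone who wants to put numbers on those readings, here is a rough Python sketch that compares measured rail voltages against a tolerance band. It assumes a standard ATX-style supply with the commonly cited limits of +/-5% on the positive rails and +/-10% on -12 V; a server with a proprietary supply may use different figures, so check its documentation. The sample readings are made up.

```python
# Nominal ATX rails with the commonly cited tolerance for each
# (+/-5% on the positive rails, +/-10% on -12 V). Assumed values.
RAIL_TOLERANCE = {
    "+3.3V": (3.3, 0.05),
    "+5V": (5.0, 0.05),
    "+12V": (12.0, 0.05),
    "-12V": (-12.0, 0.10),
}


def check_rails(readings):
    """Return {rail: (measured, low, high, ok)} for each measured rail."""
    report = {}
    for rail, measured in readings.items():
        nominal, tol = RAIL_TOLERANCE[rail]
        # sorted() keeps low <= high for the negative rail as well
        low, high = sorted((nominal * (1 - tol), nominal * (1 + tol)))
        report[rail] = (measured, low, high, low <= measured <= high)
    return report


if __name__ == "__main__":
    # Hypothetical voltmeter readings, not taken from any real machine.
    sample = {"+3.3V": 3.31, "+5V": 4.70, "+12V": 12.2, "-12V": -11.5}
    for rail, (volts, low, high, ok) in check_rails(sample).items():
        status = "ok" if ok else "OUT OF RANGE"
        print(f"{rail}: {volts:.2f} V (allowed {low:.2f} to {high:.2f}) {status}")
```

A rail that sits inside the band at idle can still sag under load, so it is worth taking the readings while the machine is busy.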
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1198642
Ronald R CODNEY
Joined: 19 Nov 11
Posts: 87
Credit: 420,920
RAC: 0
United States
Message 1198998 - Posted: 23 Feb 2012, 21:37:55 UTC

Matt: Ur the MAN. Thanks for the communication(s)..
ID: 1198998
Mooncalf
Joined: 5 Jan 11
Posts: 19
Credit: 20,196,239
RAC: 0
United States
Message 1199014 - Posted: 23 Feb 2012, 21:59:02 UTC - in response to Message 1198597.  

For days I have languished over my falling RAC; I am finally reading here that all is finally well in the land of Oz. Mr. Wizard: why, oh why does the server status say all is well in Oz, but in reality no project server can be found, either via proxy or direct?
ID: 1199014
davd
Joined: 20 May 03
Posts: 1
Credit: 1,551,912
RAC: 0
United States
Message 1199242 - Posted: 24 Feb 2012, 14:23:37 UTC - in response to Message 1198597.  

Is this why you have such a BIG problem keeping me full of SETI work? I've gone from 4,000+ units per day in late Jan (when I joined) thru mid Feb down to 2,400 and have often run completely out of work lately.
ID: 1199242
KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 1199275 - Posted: 24 Feb 2012, 17:02:08 UTC

"results ready to send" = 1, and the creation rate is below 1 per second, NTM that only four "tapes" are "hung"...
.

Hello, from Albany, CA!...
ID: 1199275
Wembley
Volunteer tester
Joined: 16 Sep 09
Posts: 429
Credit: 1,844,293
RAC: 0
United States
Message 1199361 - Posted: 24 Feb 2012, 19:43:17 UTC - in response to Message 1198597.  

You getting a visit from this fire marshal?

http://www.youtube.com/watch?v=PlLPogmB8M8
ID: 1199361
Mooncalf
Joined: 5 Jan 11
Posts: 19
Credit: 20,196,239
RAC: 0
United States
Message 1199465 - Posted: 24 Feb 2012, 23:32:29 UTC - in response to Message 1199275.  

That is exactly my point!! I went from 75K+RAC/day to ZERO. "Project has no tasks available." This does not keep my 5 systems happy at all. Now it is Friday; all personnel have left for the weekend; and SETI is sitting idle (it seems).
ID: 1199465
Bill Walker
Joined: 4 Sep 99
Posts: 3868
Credit: 2,697,267
RAC: 0
Canada
Message 1199482 - Posted: 25 Feb 2012, 0:12:39 UTC - in response to Message 1199465.  

Of course they left for the weekend. It is the weekend.

Downloading several WU right now.

ID: 1199482
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1199484 - Posted: 25 Feb 2012, 0:15:09 UTC - in response to Message 1199465.  

Your main problem could be the BOINC version that you're using, check out the Top Hosts table and see what most of the main setups are using (work is flowing here at a good rate). ;)

Cheers.
ID: 1199484
KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 1199718 - Posted: 25 Feb 2012, 17:52:15 UTC - in response to Message 1199484.  
Last modified: 25 Feb 2012, 17:53:26 UTC

He's running 6.12.34, the most recent recommended version of the BOINC client... and well represented in the "top hosts" table
.

Hello, from Albany, CA!...
ID: 1199718
cliff
Joined: 16 Dec 07
Posts: 625
Credit: 3,590,440
RAC: 0
United Kingdom
Message 1199729 - Posted: 25 Feb 2012, 19:16:00 UTC

Hi Folks,
I've re-installed the Lunatics apps :-) And once again I'm back to having ALL GPU tasks running in high priority mode.

Not only S@H but also E@H.

This doesn't apparently occur with CPU tasks.

Is this 'normal' behaviour?

Also, since BOINC is configured to use 100% resources in both S@H and E@H, it 'normally' works at 50% for each; now, however, it's persistently running E@H GPU tasks non-stop. Unless I suspend E@H, let S@H utilise both CPU & GPU, then resume E@H.. If I then suspend S@H to let E@H CPU tasks get a look in [they take a long time and expire faster than S@H ones], then I'm back to square one with E@H hogging the GPU..

So how does one adjust the priority of the GPU tasks? I don't know if this HP mode is detrimental to my GPU cards or not.. But I'm anyway wondering why it's gone into hyperdrive as soon as Lunatics is installed.

Another thing, it doesn't pick up dumped tasks and restart them, it grabs a new task and does that..
I see several tasks with various % done waiting to be restarted..
I guess they will eventually get to the top of the queue, but I suspect only after all other WU are done..

Regards,

Cliff,
Been there, Done that, Still no damm T shirt!
ID: 1199729
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1199780 - Posted: 25 Feb 2012, 21:18:33 UTC - in response to Message 1199718.  

Funny that, I see more of the later 6.10.xx versions myself there.

Cheers.
ID: 1199780
KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 1200026 - Posted: 26 Feb 2012, 16:54:39 UTC - in response to Message 1199780.  

yes, but 6.12.34 is in there - so it isn't a problem with the client version... There's just more 6.10.60 clients out there...

.

Hello, from Albany, CA!...
ID: 1200026
Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1200048 - Posted: 26 Feb 2012, 18:00:30 UTC - in response to Message 1200026.  
Last modified: 26 Feb 2012, 18:02:26 UTC

The issue with the 6.12.xx BOINC versions is the long back-off times compared to 6.10.58 or .60. Many of the larger/faster rigs that have stepped back to 6.10.60 are experiencing significantly fewer problems with work fetch since they "downgraded". There are several threads in Number Crunching that address that issue.
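
To show why the length of the back-off matters, here is a generic capped exponential back-off in Python. This is not the BOINC client's actual algorithm, and the base delay and both caps are made-up numbers; it only illustrates how a larger ceiling on the retry delay adds up to far more idle time after a run of failed work requests.

```python
def backoff_schedule(failures, base=60, cap=4 * 3600):
    """Delay in seconds before each retry: doubles per failure, limited by cap."""
    return [min(base * 2 ** n, cap) for n in range(failures)]


def total_wait_hours(failures, cap):
    """Total time spent waiting across all retries, in hours."""
    return sum(backoff_schedule(failures, cap=cap)) / 3600


if __name__ == "__main__":
    # Both caps are hypothetical values chosen only to show the effect.
    for label, cap in (("1 hour cap", 3600), ("24 hour cap", 24 * 3600)):
        hours = total_wait_hours(12, cap)
        print(f"{label}: {hours:.1f} hours waiting across 12 failed requests")
```

A fast rig that empties its cache quickly feels this more than a slow one, because it needs a successful request sooner.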
Donald
Infernal Optimist / Submariner, retired
ID: 1200048
Cherokee150
Joined: 11 Nov 99
Posts: 192
Credit: 58,513,758
RAC: 74
United States
Message 1200705 - Posted: 28 Feb 2012, 16:35:58 UTC - in response to Message 1199729.  

Hi Cliff,
The problem with E@H seems to mainly involve two things they are doing.

First is the way E@H sets their deadlines. They attach -extremely- short deadlines to all their tasks. This causes BOINC to place their tasks' execution priority far ahead of all the other applications' tasks so that the E@H tasks don't expire. This is great for E@H, but it gives what one might almost call an "unfair" advantage to their project and makes it difficult for the other projects to either get equal run time or to get the resource distribution you desire.

Second is the rather large number of tasks E@H sends you every time BOINC requests new tasks. The sudden addition of so many longer-running tasks overloads the system and, combined with the short deadlines, gives their tasks priority over the other applications.

I have found that adjusting the Einstein resource allocations to a lower number for E@H helps, but it's -very- difficult to get the right mix of allocations to get E@H to request new tasks on a low allocation versus grabbing too much execution time on slightly higher allocations. I have for some time been experimenting with various ways to get the project mix back to what I prefer on my machines, but E@H's rather unusual combination of deadlines and download limits still interferes with both BOINC's processing distribution and new task request programming. Despite my best efforts, I have to periodically manually intervene to get the mix back to what I want.
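
As a quick way to see what a given set of allocations should work out to, here is a small Python sketch. It assumes resource shares are relative weights, so each project's long-term target is its share divided by the sum of all shares. The share values are examples only, and the target is something BOINC aims for over time, not something it guarantees hour by hour (deadlines can override it, as described above).

```python
def share_fractions(shares):
    """Map each project to its target fraction of processing time."""
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("at least one project needs a positive resource share")
    return {name: value / total for name, value in shares.items()}


if __name__ == "__main__":
    # Equal shares: each project targets half the machine.
    print(share_fractions({"SETI@home": 100, "Einstein@Home": 100}))
    # Lowering the Einstein share shifts the long-term target toward SETI.
    print(share_fractions({"SETI@home": 100, "Einstein@Home": 25}))
```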

The easiest manual intervention method I have found is the following:
1. Select the "Projects" tab.
2. Select (highlight) all projects.
3. Select "No new tasks".
4. Select (highlight) all projects I do not want to change.
5. Select "Suspend".
6. Select (highlight) the project that is now short on new tasks.
7. Select "Allow new tasks"
8. If the low-task project doesn't immediately request new tasks, select "Update".

Watch the project closely so you don't get too many tasks at once. As soon as enough tasks to get things back on track begin to download, reverse the process via the "Projects" tab as follows:
1. Select (highlight) the project that is downloading the new tasks.
2. Select "No new tasks"
3. Select (highlight) all suspended projects.
4. Select "Resume"
5. Select (highlight) all projects.
6. Select "Allow new tasks".

This is a rough way to get around it, but so far it's the only work-around I've found.
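
For those who prefer the command line, the same two sequences can be sketched with `boinccmd`, whose `--project URL <operation>` form accepts `nomorework`, `allowmorework`, `suspend`, `resume` and `update` as far as I know. The Python below only builds and prints the commands unless `dry_run` is turned off. The project URLs are examples and should be replaced with the ones your client reports (for instance from `boinccmd --get_project_status`); the function names are my own.

```python
import subprocess


def rebalance_commands(all_projects, starved):
    """First half: stop fetching everywhere, suspend the projects that are
    fine, then let only the starved project fetch work."""
    cmds = [["boinccmd", "--project", url, "nomorework"] for url in all_projects]
    cmds += [["boinccmd", "--project", url, "suspend"]
             for url in all_projects if url != starved]
    cmds.append(["boinccmd", "--project", starved, "allowmorework"])
    cmds.append(["boinccmd", "--project", starved, "update"])
    return cmds


def restore_commands(all_projects, starved):
    """Second half: undo the changes once enough tasks have downloaded."""
    cmds = [["boinccmd", "--project", starved, "nomorework"]]
    cmds += [["boinccmd", "--project", url, "resume"]
             for url in all_projects if url != starved]
    cmds += [["boinccmd", "--project", url, "allowmorework"] for url in all_projects]
    return cmds


def run(cmds, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in cmds:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Example URLs only; use the master URLs shown by your own client.
    projects = ["http://setiathome.berkeley.edu/", "http://einstein.phys.uwm.edu/"]
    run(rebalance_commands(projects, starved=projects[0]))
```

You still have to watch the downloads yourself and decide when to run the restore half, just as with the manual method.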

Perhaps the only true solution is for BOINC to issue some new "guidelines" to the various Project Administrators on deadlines and quotas and then modify BOINC's programming to handle any accidental variance from these guidelines.

As to your question about the "high priority" status on a running task, it appears to be related more to the deadline date. It seems that a task is placed in "high priority" status if the remaining time versus the deadline places it in jeopardy of not completing in time. I have not found evidence that this has anything to do with physical load on the GPU, but I may be wrong. Perhaps someone else knows the answer to this.

Finally, Cliff, the "dumped" task problem is again related to deadlines. If one or more tasks either have not yet started, or get dangerously close to not completing by their deadlines based on remaining processing time, BOINC will stop running one or more tasks that are not so time-critical and begin running the "at risk" tasks. Don't worry, as long as there are not too many tasks on your machine to complete in time, BOINC will resume computation on the "dumped" task as soon as the "at risk" tasks complete.
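
The deadline idea can be illustrated with a simplified Python sketch. It is only a model of the behaviour described here, not BOINC's real scheduler (which is more involved): it runs the queue one task at a time in earliest-deadline-first order and reports which tasks would finish late. The task names and hours are invented.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    remaining_hours: float  # estimated compute time still needed
    deadline_hours: float   # hours from now until the report deadline


def at_risk(tasks):
    """Names of tasks that would miss their deadline if the queue were run
    one at a time in earliest-deadline-first order."""
    elapsed = 0.0
    late = []
    for task in sorted(tasks, key=lambda t: t.deadline_hours):
        elapsed += task.remaining_hours
        if elapsed > task.deadline_hours:
            late.append(task.name)
    return late


if __name__ == "__main__":
    # Hypothetical queue: two short-deadline tasks and one with lots of slack.
    queue = [
        Task("S@H-1", remaining_hours=2, deadline_hours=500),
        Task("E@H-1", remaining_hours=10, deadline_hours=12),
        Task("E@H-2", remaining_hours=10, deadline_hours=14),
    ]
    print("at risk:", at_risk(queue))
```

In this toy queue the long-deadline task always ends up last, which matches what Cliff is seeing with his partly finished work units.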

I hope this helps, Cliff, and I invite additional insight from anyone who may have found an easier way to deal with this problem. Just like Cliff and the rest of us experiencing this problem, I sure would like any help on how to get BOINC back to running smoothly without so much manual intervention.
ID: 1200705

©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.