Gravity (Jun 09 2011)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1115236 - Posted: 9 Jun 2011, 22:33:54 UTC
Last modified: 9 Jun 2011, 22:35:43 UTC

So bruno (the upload server) has been having fits. Basically an arbitrary CPU locks up. I'm hoping this is more of a kernel/software issue than hardware, and will clear up on its own. In the meantime, we did get it on a remote power strip so we can kick it from home without having to come to the lab.

As for thumper we replaced the correct DIMMs this time around on Tuesday. But then it crashed last night! So there was some cleanup this morning, then re-replacing the DIMMs with the originals, and then coming to terms with the fact that the most likely scenario is that those replacement DIMMs were actually DOA. So we're back to square one on that front, hoping for no uncorrectable memory errors until the next step.
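For anyone following along at home, uncorrectable errors of this kind can be watched for in software. Here is a minimal Python sketch, assuming a Linux host with an EDAC memory-controller driver loaded. The post doesn't say what thumper runs, so that is purely an assumption. `read_edac_counts` and `has_uncorrectable_errors` are made-up helper names; `ce_count` and `ue_count` are the per-controller corrected/uncorrected counters that EDAC exposes in sysfs.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters.

    Assumes the Linux EDAC layout: one mcN directory per memory controller,
    each holding plain-text ce_count and ue_count files.
    """
    root = Path(edac_root)
    counts = {}
    if not root.is_dir():
        # No EDAC driver loaded (or not Linux): nothing to report.
        return counts
    for mc_dir in sorted(root.glob("mc*")):
        ce_file = mc_dir / "ce_count"
        ue_file = mc_dir / "ue_count"
        if not (ce_file.is_file() and ue_file.is_file()):
            continue
        corrected = int(ce_file.read_text().strip())
        uncorrected = int(ue_file.read_text().strip())
        counts[mc_dir.name] = (corrected, uncorrected)
    return counts


def has_uncorrectable_errors(counts):
    """True if any controller has logged at least one uncorrectable error."""
    return any(uncorrected > 0 for _, uncorrected in counts.values())


if __name__ == "__main__":
    current = read_edac_counts()
    for name, (ce, ue) in current.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
    if has_uncorrectable_errors(current):
        print("At least one uncorrectable memory error has been logged.")
```

On a box without EDAC support the directory is simply absent and the function returns an empty dict, so a check like this could run unattended.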

In better news, we moved some assimilator processes to synergy and were pleasantly surprised by how much faster they ran. In fact, we are now running the scientific analysis code that had been causing the assimilators to back up, and they aren't backing up. That's nice. Really nice, actually. [EDIT: I might have spoken too soon on this front - not so nice.]

Still trying to hash out the next phase for the NTPCkr and how to present all this to the public. We're doing a bunch of in-house analysis ourselves just to get a feel for the data and clean up junk, and as expected most of the "interesting" stuff is turning out to be RFI. We want to get it to a point where the candidates we present to people contain signals that aren't always obvious RFI. Showing nothing but obvious RFI would be boring and useless.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1115236
Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1115237 - Posted: 9 Jun 2011, 22:39:54 UTC - in response to Message 1115236.  

Matt, thanks for the news!


- Best regards! - Sutaru Tsureku, team seti.international founder. - Optimize your PC for higher RAC. - SETI@home needs your help. -
ID: 1115237
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1115239 - Posted: 9 Jun 2011, 22:47:20 UTC - in response to Message 1115236.  
Last modified: 9 Jun 2011, 22:54:54 UTC

Thanks for the update Matt,

Any chance of Seti Beta being brought online?

Claggy
ID: 1115239
Akio
Joined: 18 May 11
Posts: 375
Credit: 32,129,242
RAC: 0
United States
Message 1115243 - Posted: 9 Jun 2011, 23:04:46 UTC - in response to Message 1115239.  

Thank you for the update, Matt! Much appreciated.
ID: 1115243
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1115246 - Posted: 9 Jun 2011, 23:09:01 UTC - in response to Message 1115243.  

Thanks too Matt,

Claggy
ID: 1115246
Joel Lynn
Joined: 8 Sep 05
Posts: 14
Credit: 26,446
RAC: 0
United States
Message 1115260 - Posted: 10 Jun 2011, 0:13:16 UTC

Thanks Matt. Just keep the Louisville handy if it all goes south.
ID: 1115260
Jack Zhang
Volunteer tester
Joined: 2 Jul 06
Posts: 206
Credit: 6,142,449
RAC: 0
Canada
Message 1115264 - Posted: 10 Jun 2011, 0:32:23 UTC

Memory incompatibility is rampant these days. Did you check the sticks with Memtest86+ yet?

Usually, a BIOS upgrade fixes incompatibility issues, but not all the time.
What if Fiction was Fact and Fact was Fiction and vice versa?
ID: 1115264
Gary Charpentier
Crowdfunding Project Donor
Special Project $75 donor
Special Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 31012
Credit: 53,134,872
RAC: 32
United States
Message 1115276 - Posted: 10 Jun 2011, 1:31:27 UTC

Thanks for the update Matt. Good luck on the DOA sticks.

ID: 1115276
Cherokee150
Joined: 11 Nov 99
Posts: 192
Credit: 58,513,758
RAC: 74
United States
Message 1115383 - Posted: 10 Jun 2011, 12:25:58 UTC

Wow, Matt!
Am I reading you correctly that some signals you are finding have -not- been attributable to man-made noise? I understand how much must still be done to rule out natural phenomena and spurious anomalies. However, if you are actually getting some signals that are not ours, that alone is the first big step.

Still much to do, of course, but this -is- exciting and feeds the imagination.

I, for one, look forward with great anticipation to the day you are ready to present any such candidates for further analysis!!!

p.s. I wish you at least one week this month without -any- technical difficulties. You -certainly- deserve one...or two! :)
ID: 1115383
Kibble (KB7TIB)
Joined: 6 Dec 99
Posts: 27
Credit: 10,121,469
RAC: 2
United States
Message 1115385 - Posted: 10 Jun 2011, 12:32:17 UTC

Thank you for the info, Matt. However, one issue here is that the newest BOINC version has a prominent place for notices, which apparently mirrors the home page. It might have been better to post your news there, as the last notice was dated May 28th and doesn't cover this sequence of events. Without current notices on the home page, the BOINC development just becomes redundant and a mockery. Users still have to go to the boards for information on anything that concerns them. Perhaps a brief comment on the current problem with a link to the appropriate board would work.
ID: 1115385
justsomeguy
Joined: 27 May 99
Posts: 84
Credit: 6,084,595
RAC: 11
United States
Message 1115415 - Posted: 10 Jun 2011, 14:29:51 UTC - in response to Message 1115236.  


As for thumper we replaced the correct DIMMs this time around on Tuesday. But then it crashed last night! So there was some cleanup this morning, then re-replacing the DIMMs with the originals, and then coming to terms with the fact that the most likely scenario is that those replacement DIMMs were actually DOA. So we're back to square one on that front, hoping for no uncorrectable memory errors until the next step.

- Matt


For the Sunfire X4x00 series of boxes, Micron is what they started with, then they moved to Samsung; both had issues. The best memory I put in any of the 60+ 4000 series boxes I was working on was Hynix. It performed well, with 0 DOAs in 3 years, unlike Micron, which was at one point 25% DOA, and Samsung, which had much better stats but still had to burn in for 3 days to verify non-DOA before going into a production box.
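To put a rough number on why a bad replacement batch is plausible: if each module independently has some chance of being DOA, the chance that a batch contains at least one dud grows quickly with batch size. A quick Python sketch follows. It assumes failures are independent and reuses the 25% worst-case Micron figure above purely for illustration. The batch sizes are hypothetical, since nobody has said how many DIMMs were swapped, and `prob_at_least_one_doa` is a made-up name.

```python
def prob_at_least_one_doa(doa_rate, n_modules):
    """Probability that at least one of n_modules is dead on arrival.

    Assumes each module fails independently with probability doa_rate,
    so P(at least one DOA) = 1 - (1 - doa_rate) ** n_modules.
    """
    if not 0.0 <= doa_rate <= 1.0:
        raise ValueError("doa_rate must be between 0 and 1")
    if n_modules < 0:
        raise ValueError("n_modules must be non-negative")
    return 1.0 - (1.0 - doa_rate) ** n_modules


if __name__ == "__main__":
    # Hypothetical batch sizes; the thread does not give the real one.
    for n in (1, 2, 4, 8):
        print(f"{n} module(s): {prob_at_least_one_doa(0.25, n):.1%}")
```

With a 25% per-module rate, a batch of four already has roughly a two-in-three chance of including at least one DOA stick, which is why a multi-day burn-in before production makes sense.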


Memory incompatibility is rampant these days. Did you check the sticks with Memtest86+ yet?

Usually, a BIOS upgrade fixes incompatibility issues, but not all the time.


Memory incompatibility wouldn't be the issue here...Sun is quirky this way. Incompatible memory and the box won't even fire. Since it was running for a couple of days, I would more likely suspect DOA depending on the brand. Also, while I personally would recommend memtest be run, with the amount of memory in this box (unsure but would suspect 16+ gig) it'll be down for a couple hours to get a good test.

As far as the firmware goes, if Matt would post the BIOS AND firmware levels of Thumper, I can check for the latest and be sure it's up to snuff.

BIOS and firmware need to be compatible; you can mix and match, but you get flaky results. It was a nightmare to flash and sometimes required up to three retries to get it to go...

Kevin

"Two things are infinite: The universe and human stupidity; and I'm not sure about the universe." - Albert Einstein

ID: 1115415
Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1115467 - Posted: 10 Jun 2011, 15:50:40 UTC - in response to Message 1115415.  

Thanks for taking the time to keep us updated Matt :)


Executive Director GPU Users Group Inc. -
brad@gpuug.org
ID: 1115467
Jeff Mercer
Joined: 14 Aug 08
Posts: 90
Credit: 162,139
RAC: 0
United States
Message 1115552 - Posted: 10 Jun 2011, 19:43:19 UTC

Hi Matt, and thanks for the update. Sounds like some of your systems are throwing fits, and I sure am sorry to hear that. One of these days, I hope that things settle down and you guys get a MUCH DESERVED break. In the meantime though, I want you to know that all of us appreciate all of your hard work, and your dedication to the project.

LOOKING FORWARD TO GREENBANK WORK UNITS !!!!
ID: 1115552
CryptokiD
Joined: 2 Dec 00
Posts: 150
Credit: 3,216,632
RAC: 0
United States
Message 1115746 - Posted: 11 Jun 2011, 3:07:39 UTC

why does it seem there are so many problems with server memory? i'm not just talking about the seti servers, but servers in general seem very picky and there seems to be a high failure rate for server memory.

i can't remember the last time i had a desktop memory issue. and every computer i build gets a 24 hour memtest+ burn in to verify that part of the computer is running ok.

is there something inherent with servers or their memory which makes them have troubles? i have very little server experience apart from small to medium offices of 25 PCs or fewer.
ID: 1115746
Gary Charpentier
Crowdfunding Project Donor
Special Project $75 donor
Special Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 31012
Credit: 53,134,872
RAC: 32
United States
Message 1115781 - Posted: 11 Jun 2011, 5:03:02 UTC - in response to Message 1115746.  

why does it seem there are so many problems with server memory? i'm not just talking about the seti servers, but servers in general seem very picky and there seems to be a high failure rate for server memory.

i can't remember the last time i had a desktop memory issue. and every computer i build gets a 24 hour memtest+ burn in to verify that part of the computer is running ok.

is there something inherent with servers or their memory which makes them have troubles? i have very little server experience apart from small to medium offices of 25 PCs or fewer.

Three things come to mind.

One, the specs on a server are likely to be correct and not have large margins of error built into them, i.e. their memory bus speed really is what they say, and slightly slow chips won't cut it.

Two, their memory is much more exercised than a desktop's. This is because they are running so many more jobs that they spend much more time using the main memory and much less the cache memory. Obviously good server design (software) is supposed to minimize that.

Three, they likely run hot and in hot places. Heat is the enemy.

ID: 1115781
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 1115789 - Posted: 11 Jun 2011, 5:40:21 UTC - in response to Message 1115746.  

is there something inherent with servers or their memory which makes them have troubles?

The amount of memory.
Most desktop systems these days have 4GB, some 8GB, very few 12GB. A server system can have up to 2TB of memory. That's 64 memory slots.
In order for memory to work, the timing has to be spot on: the margin for error is almost nonexistent. The more memory modules you have in a system, the greater the electrical load, and the greater the effect of capacitance & inductance, and they really screw timing up.
While a server system might work with 12 modules of one type of memory, it may not work with 14. In order for a system to be stable it needs memory that meets the design specs exactly. The slightest variance will introduce timing errors, and you end up with memory faults, even though the memory may not actually be faulty. It's just not suitable for that system with that many memory modules populated.
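The slot arithmetic above is easy to check. Below is a small Python sketch; the function name is made up for illustration, and it assumes binary units (1 TB = 1024 GB) and identical modules in every slot.

```python
GB_PER_TB = 1024  # assumption: binary units


def module_size_gb(total_tb, slots):
    """Capacity each module needs if `slots` identical modules total `total_tb`."""
    if slots <= 0:
        raise ValueError("slots must be positive")
    return total_tb * GB_PER_TB / slots


if __name__ == "__main__":
    # 2 TB spread over 64 slots, the figures quoted above.
    print(module_size_gb(2, 64), "GB per module")
```

So a fully populated 2TB, 64-slot box means 32GB modules in every slot, each adding its own load to the memory bus.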
Grant
Darwin NT
ID: 1115789
kittyman
Crowdfunding Project Donor
Special Project $75 donor
Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1115790 - Posted: 11 Jun 2011, 5:46:26 UTC - in response to Message 1115789.  
Last modified: 11 Jun 2011, 5:46:53 UTC

is there something inherent with servers or their memory which makes them have troubles?

The amount of memory.
Most desktop systems these days have 4GB, some 8GB, very few 12GB. A server system can have up to 2TB of memory. That's 64 memory slots.
In order for memory to work, the timing has to be spot on: the margin for error is almost nonexistent. The more memory modules you have in a system, the greater the electrical load, and the greater the effect of capacitance & inductance, and they really screw timing up.
While a server system might work with 12 modules of one type of memory, it may not work with 14. In order for a system to be stable it needs memory that meets the design specs exactly. The slightest variance will introduce timing errors, and you end up with memory faults, even though the memory may not actually be faulty. It's just not suitable for that system with that many memory modules populated.

The overhead involved with mapping and keeping track of such vast amounts of RAM must be incredible indeed.
I am sure the fault limits allowed by a server must be minuscule.....once you toss something out there into that vast memory bank, you have to trust that it is safe to stay there for a bit...(no pun intended).
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1115790
Igogo
Project Donor
Volunteer tester
Joined: 18 Dec 04
Posts: 125
Credit: 65,303,299
RAC: 44
Thailand
Message 1115801 - Posted: 11 Jun 2011, 7:21:54 UTC - in response to Message 1115790.  

Thanks Matt
ID: 1115801
Richard Haselgrove
Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1116083 - Posted: 11 Jun 2011, 23:20:23 UTC - in response to Message 1115236.  

So bruno (the upload server) has been having fits. Basically an arbitrary CPU locks up. I'm hoping this is more of a kernel/software issue than hardware, and will clear up on its own. In the meantime, we did get it on a remote power strip so we can kick it from home without having to come to the lab.

Any chance of doing the same with whatever it is that periodically freezes the Server Status Page, and apparently blocks new work production at the same time?
ID: 1116083
KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 1116338 - Posted: 12 Jun 2011, 17:19:29 UTC - in response to Message 1116083.  

So bruno (the upload server) has been having fits. Basically an arbitrary CPU locks up. I'm hoping this is more of a kernel/software issue than hardware, and will clear up on its own. In the meantime, we did get it on a remote power strip so we can kick it from home without having to come to the lab.

Any chance of doing the same with whatever it is that periodically freezes the Server Status Page, and apparently blocks new work production at the same time?


Probably whatever does that (the freezes) is dependent on a process in Bruno... (or is a process actually running on Bruno...)

.

Hello, from Albany, CA!...
ID: 1116338

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.