Not sure what's going on... New cruncher saga

rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22149
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1329202 - Posted: 19 Jan 2013, 20:46:05 UTC

OK,
I've got my new cruncher working, sort of...
It's an AMD FX-8350 (8 core) with 16GB of RAM sitting in an Asus Crosshair V with an Asus GTX 690 - so it should be "no slouch", but not necessarily the fastest kid on the block.

First, after a lot of faffing around I got it to load Windoze7 - issues with the motherboard's BIOS needing to be updated.

I let it run for a few hours just on its own, and all was sweetness and light, but the wrong GPU drivers loaded (that is if you count "none" as wrong). So I loaded the drivers off the disk, and let it run for a few hours more, again all was OK. So I loaded BOINC (version 7.0.28), and tried to download the current apps - they took so long I was trashing work because the apps weren't there, but there was a queue of work downloading (great when BOINC gets its bits in a twist....)
Eventually the CPU app downloaded and a couple of tasks ran...
So I grabbed a copy of Lunatics that I have lying around and loaded that - and now things start to resemble a pear...
After a few minutes the whole lot stops. No excuses, no messing, it STOPS: display frozen and no response to the keyboard or mouse. So I think, hmm, let's update the GPU driver - so I download the latest from the Nvidia site (310.90) (I had to suspend all S@H processing to get it to download).
Latest drivers, and away we go, for about 10 minutes and (later inspection showed) a number of "computation errors". Frozen again....
Reboot, read the notes that are posted here about what to do about errant 6xx GPUs. OK, got that, let's try setting the environment variable.
Restart, suspend S@H again - big dump of updates from MS (again...), so let them through, reboot, and check the environment variable is set. Resume S@H, and a few minutes later nothing is happening, more WUs end in errors...
Restart, clear up the mess from the last crash, and now download an older "good" version of the drivers.

And, by now you should see the pattern - after a few minutes the machine stops responding. And I'm getting confused and frustrated....

(Hmm, just had a look at the errors, most appear to have come from the CPU not the GPU...)
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1329202 · Report as offensive
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1329206 - Posted: 19 Jan 2013, 21:04:02 UTC - in response to Message 1329202.  

The only words of wisdom I can suggest are:

Monitor the CPU temps, and try CPU-only crunching with GPU usage suspended. If that's all O.K., perhaps your PSU isn't up to supplying the CPU and GPU at the same time.
If it still seems unstable, perhaps your CPU cooler isn't up to it.

Make sure you installed the AMD Optimised app, the Intel ones don't work on AMDs any more.

Claggy
ID: 1329206 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22149
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1329209 - Posted: 19 Jan 2013, 21:21:11 UTC - in response to Message 1329206.  

Not thought about running GPU only.

But I've had another look at the errors, a couple from the GPU, but most are from the CPU

Here's the first bit of a typical stderr list:
Stderr output

<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
Windows optimized S@H Enhanced application by Alex Kan
Version info: v8b2-SSE3x (AMD/Intel, Core 2-optimized v8-nographics) V5.13 by Alex Kan
SSE3x Win32 Build 386 , Ported by : Jason G, Raistmer, JDWhale

CPUID: AMD FX(tm)-8350 Eight-Core Processor
Speed: 0 x 4207 MHz
Cache: L1=64K L2=2048K
Features: MMX SSE SSE2 SSE3

Work Unit Info:
...............
Credit multiplier is : 2.85
WU true angle range is : 1.016984


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004448B0 read attempt to address 0x439D6D20

Engaging BOINC Windows Runtime Debugger...



Both CPU and GPU temps have been low (about 50-60C) so not a problem, but I've just noticed that the CPU is somewhat overclocked - I'll reset that and see what happens.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1329209 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22149
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1329216 - Posted: 19 Jan 2013, 21:44:03 UTC

After a coffee break - a quick update

Claggy, to confirm I am running the "correct" AMD version of the Lunatics app, albeit a 32bit not 64bit one - I must have a dig and see if I can find a 64bit copy somewhere for both my main crunchers....
Having removed the overclocking (not sure where that came from, but probably a result of the "fun" getting Windoze to install...), the beast looks to be behaving less badly (I won't say "well" until I've seen it rumble through a few more tasks without errors).


Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1329216 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1329228 - Posted: 19 Jan 2013, 22:13:43 UTC

What PSU are you using?



With each crime and every kindness we birth our future.
ID: 1329228 · Report as offensive
cdemers
Volunteer tester

Joined: 18 May 99
Posts: 30
Credit: 17,235,002
RAC: 0
Canada
Message 1329235 - Posted: 19 Jan 2013, 22:41:54 UTC - in response to Message 1329202.  

My system ran 'funny' too until I found the AMD scheduler patches for Windows 7. I have an AMD 8150, which after those patches and the latest drivers has been running smoothly. Another good idea is to run memtest86 (or whatever your favorite memory test program is) if you're seeing things unstable. I had to send back one of the 4GB modules of my 16 gig kit because it was bad.

ID: 1329235 · Report as offensive
Profile Tazz
Volunteer tester
Joined: 5 Oct 99
Posts: 137
Credit: 34,342,390
RAC: 0
Canada
Message 1329236 - Posted: 19 Jan 2013, 22:45:35 UTC - in response to Message 1329209.  

Maybe check to see if the Turbo "feature" is enabled in the BIOS, and while you're there check on the power saving settings too. I had to manually set the clock multipliers and turn some other settings off on my 8150 because the CPU speed was jumping up and down all by itself - even under the full load of S@H.

I couldn't find any concrete numbers for the max temp for the 8150 either, but 60 deg was the popular number. When crunching it was running around 61-63. I put a water cooler on it and now I start getting a little concerned when it gets up to 35 degrees. ;)
</Tazz>
ID: 1329236 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1329241 - Posted: 19 Jan 2013, 22:56:36 UTC

Depends on the motherboard you are using.
Disable C1 and C3 as well as turbo.
Check for CPU load line calibration.



With each crime and every kindness we birth our future.
ID: 1329241 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1329251 - Posted: 19 Jan 2013, 23:59:55 UTC - in response to Message 1329209.  


- exit code -1073741819 (0xc0000005)
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004448B0 read attempt to address 0x439D6D20

I know this error. Here's what BOINC says about it:
This error can be the result of a programming error in the code for:

The BOINC Client Software
The Science Applications
Any other program running on the computer
The Operating System

On the other hand, it could be because it is Tuesday.

Seriously, this is a serious error within the running application program and it is one that computer hardware and Operating System manage when the program attempts to do something that is a no-no.
There was an exception, in this case, an attempt to read from a memory location beyond those actually allocated to the program. 
    The remaining lines are data of interest to the Developers so that they can isolate the problem within the program that failed.

Since it's a new machine, running the MemTest wouldn't be a bad idea. However, sometimes even memory tests won't find bad memory. You have to run the machine a while and see if it's a general problem, not associated with a single program. Try not to do it on a Tuesday...
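
In case the two numbers in that error look unrelated: they are the same value written two ways. A quick Python sketch, assuming the exit code is the Windows status reported as a signed 32-bit integer (the function name is made up for illustration):

```python
def exit_code_to_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    # Masking with 0xFFFFFFFF maps a negative value onto its 32-bit
    # two's-complement bit pattern, i.e. the same bits read as unsigned.
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


# The exit code from the stderr output quoted above.
print(exit_code_to_hex(-1073741819))  # prints 0xc0000005
```

0xc0000005 is the Access Violation status named in the exception record, so the exit code and the "Reason:" line are reporting the same crash.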
ID: 1329251 · Report as offensive
Profile Ex: "Socialist"
Volunteer tester
Joined: 12 Mar 12
Posts: 3433
Credit: 2,616,158
RAC: 2
United States
Message 1329329 - Posted: 20 Jan 2013, 4:54:22 UTC

I also want to say it looks like a RAM issue.

But lots of different problems present themselves with errors like that.
#resist
ID: 1329329 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22149
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1329370 - Posted: 20 Jan 2013, 9:19:07 UTC

Thanks for all the advice and tips.
It would appear to be a "speed related" issue with the CPU. Having removed the overclocking (which the kind man who built the machine put on for me - unrequested) it has run quite happily overnight.

Next time it's down I'll run memtest again (always one of the first things I do on a new PC).

Answering the question about the PSU - it's rated at 1500W so should be well within its limits with one GPU on board.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1329370 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1329376 - Posted: 20 Jan 2013, 10:32:23 UTC

Since you are using an Asus board, I suggest adjusting CPU Load Line Calibration to 40%.
Also set CPU phase control to extreme.
I guess you have a proggy called AI Suite.
You can have a look there for those settings.




With each crime and every kindness we birth our future.
ID: 1329376 · Report as offensive
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1329381 - Posted: 20 Jan 2013, 11:19:32 UTC - in response to Message 1329380.  

That's not unusual, on my i7-2600K @4.7GHz it only shows the Stock speed:

Windows optimized S@H Enhanced application by Alex Kan
Version info: v8b2-SSE3 (AMD/Intel, Core 2-optimized v8-nographics) V5.13 by Alex Kan
SSE3 Win64 Build 386 , Ported by : Jason G, Raistmer, JDWhale

CPUID: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
Speed: 4 x 3411 MHz
Cache: L1=64K L2=256K
Features: MMX SSE SSE2 SSE3


Claggy
ID: 1329381 · Report as offensive
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1329384 - Posted: 20 Jan 2013, 11:58:55 UTC - in response to Message 1329370.  
Last modified: 20 Jan 2013, 12:01:32 UTC

Thanks for all the advice and tips.
It would appear to be a "speed related" issue with the CPU. Having removed the overclocking (which the kind man who built the machine put on for me - unrequested) it has run quite happily overnight.

Next time it's down I'll run memtest again (always one of the first things I do on a new PC).

Answering the question about the PSU - it's rated at 1500W so should be well within its limits with one GPU on board.


I'm sure a 1500W PSU is fine. Looking at your tasks, you'll want to upgrade to x41zc_Cuda5 as soon as possible, as x41g_Cuda32 predates Keplers by some time;
you won't get optimum speed from either x41g or from a Cuda32 app. Grab the files from jgopt.org, unpack them into your project directory,
then with BOINC shut down, run aimerge.cmd, check your app_info looks O.K., then restart BOINC.
(But you'll also need Cuda5 drivers installed for that; otherwise try the x41zc_Cuda42 version instead.)
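
For the "check your app_info looks O.K." step, here is a rough Python sketch of one way to sanity-check the file. It assumes the usual anonymous-platform app_info.xml layout (file_info entries plus app_version blocks with file_ref children). The function name, the app name and the file names in the sample are made up for illustration and are not what aimerge.cmd actually writes:

```python
import xml.etree.ElementTree as ET


def summarize_app_info(xml_text: str) -> list:
    """List each app_version and any file it references without a file_info entry."""
    root = ET.fromstring(xml_text)  # raises ParseError if the XML is malformed
    declared = {fi.findtext("name", "").strip() for fi in root.iter("file_info")}
    versions = []
    for av in root.iter("app_version"):
        refs = [fr.findtext("file_name", "").strip() for fr in av.iter("file_ref")]
        versions.append({
            "app_name": av.findtext("app_name", "").strip(),
            "version_num": av.findtext("version_num", "").strip(),
            "plan_class": av.findtext("plan_class", "").strip(),
            "missing_files": [name for name in refs if name not in declared],
        })
    return versions


# Hypothetical sample: the second file_ref has no matching file_info entry.
SAMPLE = """
<app_info>
  <app><name>example_app</name></app>
  <file_info><name>example_cuda_app.exe</name><executable/></file_info>
  <app_version>
    <app_name>example_app</name>
    <version_num>100</version_num>
    <plan_class>example_plan</plan_class>
    <file_ref><file_name>example_cuda_app.exe</file_name><main_program/></file_ref>
    <file_ref><file_name>example_runtime.dll</file_name></file_ref>
  </app_version>
</app_info>
""".replace("<app_name>example_app</name>", "<app_name>example_app</app_name>")

for entry in summarize_app_info(SAMPLE):
    print(entry)
```

To use it on a real install you would read app_info.xml from the project directory and pass its text in; an empty missing_files list for every entry is what you want to see before restarting BOINC.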

Claggy
ID: 1329384 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22149
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1329440 - Posted: 20 Jan 2013, 15:23:17 UTC

I'm letting it "bed in" with what I've got.
I was about to download x41zc when they got pulled from the Lunatics site :-(



(Has anyone got a prognosis on when they will be restarting distributing their wares again??)
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1329440 · Report as offensive
Profile Tim
Volunteer tester
Joined: 19 May 99
Posts: 211
Credit: 278,575,259
RAC: 0
Greece
Message 1329447 - Posted: 20 Jan 2013, 15:53:37 UTC - in response to Message 1329440.  
Last modified: 20 Jan 2013, 15:54:11 UTC

I'm letting it "bed in" with what I've got.
I was about to download x41zc when they got pulled from the Lunatics site :-(



(Has anyone got a prognosis on when they will be restarting distributing their wares again??)


Don’t go to Lunatics.

Claggy gave you the link to Jason's site. Go to downloads and… there it is.

Tim
ID: 1329447 · Report as offensive



 