To Many ERRORS

.clair.

Joined: 4 Nov 04
Posts: 1300
Credit: 55,390,408
RAC: 69
United Kingdom
Message 1263495 - Posted: 22 Jul 2012, 21:25:54 UTC

Every system has its own best setting.
Allocate an extra thread every day or two until the errors stop,
then work around that point.
Are you using the -hp (-high_priority) switch in the app_info command line?
ID: 1263495
skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1263552 - Posted: 23 Jul 2012, 1:53:07 UTC - in response to Message 1263495.  

No, I don't have -hp set.

Would that solve some of the problem?


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1263552
Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1263562 - Posted: 23 Jul 2012, 3:38:47 UTC

No.



With each crime and every kindness we birth our future.
ID: 1263562
Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1263613 - Posted: 23 Jul 2012, 9:10:51 UTC - in response to Message 1263463.  
Last modified: 23 Jul 2012, 10:09:48 UTC

First of all, you quoted my reply to skildude,
so I got confused.

Anyways.

I fear you need to free 1 physical core per GPU.
Not one thread.

Try it please to see if this helps.
It certainly should.


Sorry for the misunderstanding. I tried different values for period_iterations
(2, 4, 5, 8, 10, 12, 15, 16, 18, 20, 30, 40, 50), which doesn't help, although GPU load is ~90%.

So I'll free up 2 CPU cores for the GPUs and see how this behaves.
period_iterations was at 50; I changed it back to 20 and will let it run
until I see any change for the better. GPU use is ~95%.

I also switched off all unnecessary programs, like CoreTemp and Clock.
ID: 1263613
.clair.
Joined: 4 Nov 04
Posts: 1300
Credit: 55,390,408
RAC: 69
United Kingdom
Message 1263628 - Posted: 23 Jul 2012, 10:55:56 UTC

@ Mike,
I thought that the -hp switch was to help the gpu grab some more cpu time,
not as a cure for the errors but a little bit of help.
ID: 1263628
Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1263648 - Posted: 23 Jul 2012, 12:10:26 UTC - in response to Message 1263628.  

@ Mike,
I thought that the -hp switch was to help the gpu grab some more cpu time,
not as a cure for the errors but a little bit of help.

Yes, that's true.
But with the new syncing method inside the drivers, it doesn't help with the low GPU usage bug.

Let's say it doesn't hurt if you set it.
I don't use it anymore on my card.
With multiple cards it might help a little.



With each crime and every kindness we birth our future.
ID: 1263648
Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1263723 - Posted: 23 Jul 2012, 15:24:29 UTC - in response to Message 1263648.  
Last modified: 23 Jul 2012, 15:32:17 UTC

@ Mike,
I thought that the -hp switch was to help the gpu grab some more cpu time,
not as a cure for the errors but a little bit of help.

Yes, that's true.
But with the new syncing method inside the drivers, it doesn't help with the low GPU usage bug.

Let's say it doesn't hurt if you set it.
I don't use it anymore on my card.
With multiple cards it might help a little.


I'm still using the Cat. 12.4 driver and am now using 2 cores for 2 GPUs,
so it might help to use the -hp switch. GPU usage is ~95%.
VLAR and ~0.4 AR WUs still give about 40% errors and take ~3500 seconds
of runtime, and 150 to 200 seconds on the CPU.

By the way, the other 2 cores (4 threads) are doing 4 SETI MB WUs.
ID: 1263723
Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1263752 - Posted: 23 Jul 2012, 16:43:33 UTC
Last modified: 23 Jul 2012, 16:44:36 UTC

It's getting a bit complicated now.
The times are nearly normal for your cards.
The question is at what percentage the units error out.

You can free up more cores until the errors stop.

Also, modifying DCF in client_state.xml would help.
But be careful: have you ever tried that?
Stop BOINC first!

How many PCI-E lanes does the second slot have?
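
For anyone who has not edited DCF before, here is a minimal Python sketch of the kind of change described above. It assumes the value sits in a <duration_correction_factor> element inside the matching <project> block of client_state.xml; the trimmed sample document and the set_dcf helper are made up for illustration, so check your own file first. Stop BOINC before touching the real file and keep a backup copy.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for client_state.xml.
# The real file holds far more; only the parts needed here are shown.
SAMPLE_STATE = """<client_state>
  <project>
    <master_url>http://setiathome.berkeley.edu/</master_url>
    <duration_correction_factor>9.011675</duration_correction_factor>
  </project>
</client_state>"""


def set_dcf(state_xml: str, master_url: str, new_dcf: float) -> str:
    """Return state_xml with the DCF of the matching project replaced."""
    root = ET.fromstring(state_xml)
    for project in root.iter("project"):
        if project.findtext("master_url", "").strip() == master_url:
            dcf = project.find("duration_correction_factor")
            if dcf is None:
                raise ValueError("project has no duration_correction_factor")
            dcf.text = f"{new_dcf:.6f}"
            return ET.tostring(root, encoding="unicode")
    raise ValueError(f"no project with master_url {master_url!r}")


updated = set_dcf(SAMPLE_STATE, "http://setiathome.berkeley.edu/", 1.0)
print(updated)
```

The sketch works on a string so it can be tried safely; writing the result back to the real file is left out on purpose.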


With each crime and every kindness we birth our future.
ID: 1263752
skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1263775 - Posted: 23 Jul 2012, 17:24:19 UTC
Last modified: 23 Jul 2012, 17:26:20 UTC

BTW, my 7970 is completing non-blanked AP WUs in about 30-45 minutes, 3 at a time. My wingman's i7 980 took 100,000 seconds.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1263775
KneeDeep
Joined: 27 Sep 99
Posts: 131
Credit: 4,887,778
RAC: 0
United States
Message 1263803 - Posted: 23 Jul 2012, 18:29:31 UTC - in response to Message 1262218.  

LadyL asked ...
The card may be intermittently downclocking for some reason - any chance you can monitor that host to see if tasks are actually progressing and check the system for anomalies once a task goes past normal runtimes?


I don't see where this was answered and it seems the most likely reason to me.

ID: 1263803
Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1263827 - Posted: 23 Jul 2012, 20:04:00 UTC

Downclocking is not an issue in that case.



With each crime and every kindness we birth our future.
ID: 1263827
Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1263840 - Posted: 23 Jul 2012, 20:56:05 UTC - in response to Message 1263723.  

Gonna give it one more try with 1 instance_per_device and 1 core, out of 4, for
the GPUs?!
The error rate is still climbing, which is a waste of resources
until I find what's going terribly wrong.

Another thing: all 3 rigs are running in High Priority again and are also
making errors, except the GTX470, which is running at 800MHz instead of 1400MHz.


ID: 1263840
Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1263844 - Posted: 23 Jul 2012, 21:06:21 UTC - in response to Message 1263840.  

Gonna give it one more try with 1 instance_per_device and 1 core, out of 4, for
the GPUs?!
The error rate is still climbing, which is a waste of resources
until I find what's going terribly wrong.

Another thing: all 3 rigs are running in High Priority again and are also
making errors, except the GTX470, which is running at 800MHz instead of 1400MHz.



Did you read my earlier comment ?



With each crime and every kindness we birth our future.
ID: 1263844
Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1264058 - Posted: 24 Jul 2012, 12:30:28 UTC - in response to Message 1263844.  
Last modified: 24 Jul 2012, 13:05:48 UTC

Gonna give it one more try with 1 instance_per_device and 1 core, out of 4, for
the GPUs?!
The error rate is still climbing, which is a waste of resources
until I find what's going terribly wrong.

Another thing: all 3 rigs are running in High Priority again and are also
making errors, except the GTX470, which is running at 800MHz instead of 1400MHz.



Did you read my earlier comment ?


@ Mike, about modifying DCF (to 1, for instance): yes, it's in the client_state.xml file.
It looks like I found what caused these "time exceeded" errors: I have to use
period_iterations 2(!), which also (?) produces almost no lag.

Doing 2 instances_per_device gives a load of 98% on device 0 and a load
swinging between 97% and 47% on device 1.

Estimates were also shorter than the runtime, DCF=9.011675!
CPU estimates are way off; a VHAR is estimated at 9000 seconds.

Estimates on GPUs and runtime are more or less equal.
1 core (2 threads) for 2 GPUs appears to be enough: during the first 20 seconds this core is at 100%, then GPU load rises to 98%.
I'll let it run for now; I have seen the first task validated since the change to p_i 2.
If more failures occur, shall I change DCF to 1?
ID: 1264058
skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1264082 - Posted: 24 Jul 2012, 14:01:59 UTC - in response to Message 1264058.  

Mike is correct: you need the 2 cores if you wish to run multiple WUs on the ATI GPU. I found this out the hard way. It works, just do it.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1264082
Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1264084 - Posted: 24 Jul 2012, 14:05:24 UTC

@ Fred

Please answer a few questions first.

What motherboard are you using?

How many PCI-E slots does it have?

I will look into the details, but don't change so much.

Period_iterations_num is fine when you don't have any lag.
Don't change it anymore.

At what percentage are the units erroring (failing)?




With each crime and every kindness we birth our future.
ID: 1264084
Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1266158 - Posted: 31 Jul 2012, 11:06:08 UTC - in response to Message 1264084.  
Last modified: 31 Jul 2012, 11:11:25 UTC

@ Fred

Please answer a few questions first.

What motherboard are you using?

How many PCI-E slots does it have?

I will look into the details, but don't change so much.

Period_iterations_num is fine when you don't have any lag.
Don't change it anymore.

At what percentage are the units erroring (failing)?



Mike, sorry for my late reply. I'm using an Intel DP67BG mobo with 2 PCIe 2.0 x16 slots, which run at x8 if both are used.
I'm using period_iterations 32, which gives the least lag.
Task results: Valid (213) · Invalid (0) · Error (166)
(I even raised the base clock from 100 to 102MHz, giving higher FLOPS from the CPU; maybe I
should OC the GPUs?!)

All errors are: Exit status 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED
I'm also leaving 1 core (2 threads) free for the GPUs.
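
As a quick check of those counts, the share of tasks ending in error can be worked out as below. This is only a small Python sketch; the error_rate helper is made up for illustration and uses the Valid/Invalid/Error numbers quoted in this post.

```python
def error_rate(valid: int, invalid: int, error: int) -> float:
    """Fraction of finished tasks that ended in error."""
    total = valid + invalid + error
    if total == 0:
        raise ValueError("no finished tasks to evaluate")
    return error / total


# Counts taken from the task list summary in the post above.
rate = error_rate(valid=213, invalid=0, error=166)
print(f"{rate:.1%} of finished tasks ended in error")  # prints 43.8% ...
```

That is 166 out of 379 tasks, in line with the "about 40% errors" reported earlier in the thread.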
ID: 1266158
Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1267499 - Posted: 4 Aug 2012, 12:29:39 UTC
Last modified: 4 Aug 2012, 12:32:25 UTC

@Fred

I have calculated something for you.
Please add this line to your app_info.

<flops>509408724.212160</flops>

Put it below your command line.

This should increase estimates, reduce your DCF and limit the errors.

Beware, it's a very low value on purpose.
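
To show where such a line would end up, here is a rough Python sketch that places a <flops> element directly after <cmdline> in each <app_version> block. The sample document is a trimmed, hypothetical stand-in for app_info.xml (the app name and version number are placeholders, and -hp is only used because it came up earlier in the thread); the add_flops helper is made up for illustration, and a real file may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for app_info.xml; names are placeholders.
SAMPLE_APP_INFO = """<app_info>
  <app_version>
    <app_name>example_app</app_name>
    <version_num>1</version_num>
    <cmdline>-hp</cmdline>
  </app_version>
</app_info>"""


def add_flops(app_info_xml: str, flops: str) -> str:
    """Set <flops> in every <app_version>, placing it after <cmdline>."""
    root = ET.fromstring(app_info_xml)
    for app_version in root.iter("app_version"):
        flops_el = app_version.find("flops")
        if flops_el is None:
            flops_el = ET.Element("flops")
            children = list(app_version)
            cmdline = app_version.find("cmdline")
            # Insert right after <cmdline>, or at the end if there is none.
            index = children.index(cmdline) + 1 if cmdline is not None else len(children)
            app_version.insert(index, flops_el)
        flops_el.text = flops
    return ET.tostring(root, encoding="unicode")


patched = add_flops(SAMPLE_APP_INFO, "509408724.212160")
print(patched)
```

The value is passed as a string so it is written exactly as posted, without any floating-point rounding.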


With each crime and every kindness we birth our future.
ID: 1267499
Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1268349 - Posted: 6 Aug 2012, 12:28:31 UTC - in response to Message 1267499.  
Last modified: 6 Aug 2012, 12:37:27 UTC

@Fred

I have calculated something for you.
Please add this line to your app_info.

<flops>509408724.212160</flops>

Put it below your command line.

This should increase estimates, reduce your DCF and limit the errors.

Beware, it's a very low value on purpose.


I'll try it. The value is 6x lower than the FLOPS (Whetstone) value
of 1 CPU core, but since it's the CPU that has ~6 times
higher estimates, it could/should work. Thanks for figuring
this out.
GPU estimates are within +/- 10% of actual runtime!

Since I've set 1 instance_per_device and no_cpu_lock
together with 2 free cores, each feeding a GPU, errors have stopped
and runtimes have decreased to 50% compared to doing 2 instances_per_device.

But APR doesn't change, so this might help.
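
Why a lower <flops> value could stop the EXIT_TIME_LIMIT_EXCEEDED errors can be sketched with a bit of arithmetic. The Python below assumes the client derives the runtime estimate as rsc_fpops_est / flops (scaled by DCF) and the elapsed-time limit as rsc_fpops_bound / flops; that is my understanding of BOINC, not something stated in this thread. The workunit numbers are hypothetical, and the "one CPU core" figure is simply 6x the suggested value, following the remark above.

```python
def estimated_runtime(fpops_est: float, flops: float, dcf: float = 1.0) -> float:
    """Estimated runtime in seconds (assumed formula: est / flops * DCF)."""
    return fpops_est / flops * dcf


def time_limit(fpops_bound: float, flops: float) -> float:
    """Elapsed-time limit in seconds (assumed formula: bound / flops)."""
    return fpops_bound / flops


FLOPS_SUGGESTED = 509408724.212160    # the <flops> value posted in the thread
FLOPS_ONE_CORE = 6 * FLOPS_SUGGESTED  # "6x lower" per the post above; illustrative
FPOPS_EST = 3.0e13                    # hypothetical workunit estimate
FPOPS_BOUND = 10 * FPOPS_EST          # hypothetical workunit bound

for label, flops in (("one CPU core", FLOPS_ONE_CORE), ("suggested", FLOPS_SUGGESTED)):
    est = estimated_runtime(FPOPS_EST, flops)
    limit = time_limit(FPOPS_BOUND, flops)
    print(f"{label:>12}: estimate {est:8.0f} s, limit {limit:8.0f} s")
```

Under these assumptions, a 6x lower flops value makes both the estimate and the time limit 6x longer, which gives slow tasks more room before they are aborted.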
ID: 1268349
skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1268387 - Posted: 6 Aug 2012, 14:55:29 UTC - in response to Message 1268349.  

Here's how well my 7970 is working so far.

It and the 6 cores of the FX-8150 are currently doubling the production of my second-best machine (AMD 630 w/ ATI 5850 GPU). I also play a lot of video games on the 7970 rig, so it ends up having less running time than my other rigs.

Still a smashing success, and again a great big thanks to Mike.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1268387