To Many ERRORS


.clair.
Volunteer moderator
Joined: 4 Nov 04
Posts: 1300
Credit: 23,647,299
RAC: 31,908
United Kingdom
Message 1263495 - Posted: 22 Jul 2012, 21:25:54 UTC

Every system has its own best setting.
Allocate an extra thread every day or two until the errors stop,
then work around that point.
Are you using the -hp (-high_priority) switch in the app_info command line?

ignorance is no excuse
Joined: 4 Oct 00
Posts: 9529
Credit: 44,433,321
RAC: 0
Korea, North
Message 1263552 - Posted: 23 Jul 2012, 1:53:07 UTC - in response to Message 1263495.

No, I don't have -hp set.

Would that solve some of the problem?
____________
In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

End terrorism by building a school

Mike
Project donor
Volunteer tester
Joined: 17 Feb 01
Posts: 25194
Credit: 34,792,078
RAC: 20,986
Germany
Message 1263562 - Posted: 23 Jul 2012, 3:38:47 UTC

No.

____________

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,520
RAC: 119
Netherlands
Message 1263613 - Posted: 23 Jul 2012, 9:10:51 UTC - in response to Message 1263463.
Last modified: 23 Jul 2012, 10:09:48 UTC

First of all you quoted my reply to skildude.
So I got confused.

Anyways.

I fear you need to free 1 physical core per GPU.
Not one thread.

Try it please to see if this helps.
It certainly should.


Sorry for the misunderstanding. I tried different values for period_iterations
(2, 4, 5, 8, 10, 12, 15, 16, 18, 20, 30, 40, 50), none of which helped, although GPU load is ~90%.

So I'll free up 2 CPU cores for the GPUs and see how this behaves.
It was running with period_iterations 50; I changed that back to 20 and will let it run
until I see any change for the better. GPU use is ~95%.

I also switched off all unnecessary programs, like CoreTemp and Clock.
____________

.clair.
Volunteer moderator
Joined: 4 Nov 04
Posts: 1300
Credit: 23,647,299
RAC: 31,908
United Kingdom
Message 1263628 - Posted: 23 Jul 2012, 10:55:56 UTC

@ Mike,
I thought that the -hp switch was to help the GPU grab some more CPU time,
not as a cure for the errors but a little bit of help.

Mike
Project donor
Volunteer tester
Joined: 17 Feb 01
Posts: 25194
Credit: 34,792,078
RAC: 20,986
Germany
Message 1263648 - Posted: 23 Jul 2012, 12:10:26 UTC - in response to Message 1263628.

@ Mike,
I thought that the -hp switch was to help the GPU grab some more CPU time,
not as a cure for the errors but a little bit of help.

Yes, that's true.
But with the new syncing method inside the drivers it doesn't help with the low GPU usage bug.

Let's say it doesn't hurt if you set it.
I don't use it anymore on my card.
With multiple cards it might help a little.

____________

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,520
RAC: 119
Netherlands
Message 1263723 - Posted: 23 Jul 2012, 15:24:29 UTC - in response to Message 1263648.
Last modified: 23 Jul 2012, 15:32:17 UTC

@ Mike,
I thought that the -hp switch was to help the GPU grab some more CPU time,
not as a cure for the errors but a little bit of help.

Yes, that's true.
But with the new syncing method inside the drivers it doesn't help with the low GPU usage bug.

Let's say it doesn't hurt if you set it.
I don't use it anymore on my card.
With multiple cards it might help a little.


I'm still using the Cat. 12.4 driver and now use 2 cores for 2 GPUs,
so it might help to use the -hp switch. GPU usage is ~95%.
VLAR and ~0.4 AR WUs still give about 40% errors and take ~3500 seconds
of runtime, plus 150 to 200 seconds on the CPU.

Btw, the other 2 cores (4 threads) are doing 4 SETI MB WUs.
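
The core split described above can be worked out like this. A small sketch, assuming Hyper-Threading with 2 threads per physical core and that the result is what one would enter in BOINC's "use at most N% of the processors" preference; the function name is only illustrative.

```python
def cpu_share_for_boinc(physical_cores: int, threads_per_core: int,
                        gpus: int, cores_per_gpu: int = 1) -> tuple[int, float]:
    """Return (threads left for CPU tasks, percent of all threads).

    Whole physical cores are reserved for the GPUs, not single threads.
    """
    reserved_cores = gpus * cores_per_gpu
    if reserved_cores > physical_cores:
        raise ValueError("cannot reserve more cores than the CPU has")
    total_threads = physical_cores * threads_per_core
    remaining_threads = (physical_cores - reserved_cores) * threads_per_core
    return remaining_threads, 100.0 * remaining_threads / total_threads


# 4 physical cores with 2 threads each, 2 GPUs, 1 core reserved per GPU
threads, percent = cpu_share_for_boinc(physical_cores=4, threads_per_core=2, gpus=2)
print(f"{threads} threads left for CPU tasks ({percent:.0f}% of all threads)")
```

With these numbers, 4 threads remain for CPU work, which matches the 4 SETI MB WUs mentioned above.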
____________

Mike
Project donor
Volunteer tester
Joined: 17 Feb 01
Posts: 25194
Credit: 34,792,078
RAC: 20,986
Germany
Message 1263752 - Posted: 23 Jul 2012, 16:43:33 UTC
Last modified: 23 Jul 2012, 16:44:36 UTC

It's getting a bit complicated now.
The times are nearly normal for your cards.
The question is at what percentage the units error out.

You can free up more cores until the errors stop.

Modifying DCF in client_state.xml would also help.
But be careful: have you ever tried that?
Stop BOINC before you edit the file!

How many PCIe lanes does the second slot have?
____________

ignorance is no excuse
Joined: 4 Oct 00
Posts: 9529
Credit: 44,433,321
RAC: 0
Korea, North
Message 1263775 - Posted: 23 Jul 2012, 17:24:19 UTC
Last modified: 23 Jul 2012, 17:26:20 UTC

BTW, my 7970 is completing non-blanked AP WUs in about 30-45 minutes, running 3 at a time. My wingman, running an i7 980, took 100,000 seconds.
____________

KneeDeep
Joined: 27 Sep 99
Posts: 131
Credit: 4,887,676
RAC: 214
United States
Message 1263803 - Posted: 23 Jul 2012, 18:29:31 UTC - in response to Message 1262218.

LadyL asked ...
The card may be intermittently downclocking for some reason - any chance you can monitor that host to see if tasks are actually progressing and check the system for anomalies once a task goes past normal runtimes?


I don't see where this was answered and it seems the most likely reason to me.

Mike
Project donor
Volunteer tester
Joined: 17 Feb 01
Posts: 25194
Credit: 34,792,078
RAC: 20,986
Germany
Message 1263827 - Posted: 23 Jul 2012, 20:04:00 UTC

Downclocking is not an issue in that case.

____________

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,520
RAC: 119
Netherlands
Message 1263840 - Posted: 23 Jul 2012, 20:56:05 UTC - in response to Message 1263723.

I'm going to give it one more try with 1 instance_per_device and 1 core, out of 4, for
the GPUs.
The error rate is still climbing, which is a waste of resources,
until I find what's going terribly wrong.

Another thing: all 3 rigs are running in High Priority again and are also
making errors, except the GTX470 running at 800MHz instead of 1400MHz.


____________

Mike
Project donor
Volunteer tester
Joined: 17 Feb 01
Posts: 25194
Credit: 34,792,078
RAC: 20,986
Germany
Message 1263844 - Posted: 23 Jul 2012, 21:06:21 UTC - in response to Message 1263840.

I'm going to give it one more try with 1 instance_per_device and 1 core, out of 4, for
the GPUs.
The error rate is still climbing, which is a waste of resources,
until I find what's going terribly wrong.

Another thing: all 3 rigs are running in High Priority again and are also
making errors, except the GTX470 running at 800MHz instead of 1400MHz.



Did you read my earlier comment?

____________

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,520
RAC: 119
Netherlands
Message 1264058 - Posted: 24 Jul 2012, 12:30:28 UTC - in response to Message 1263844.
Last modified: 24 Jul 2012, 13:05:48 UTC

I'm going to give it one more try with 1 instance_per_device and 1 core, out of 4, for
the GPUs.
The error rate is still climbing, which is a waste of resources,
until I find what's going terribly wrong.

Another thing: all 3 rigs are running in High Priority again and are also
making errors, except the GTX470 running at 800MHz instead of 1400MHz.



Did you read my earlier comment?


@ Mike, about modifying DCF (to 1, for instance): yes, it's in the client_state.xml file.
It looks like I've found what caused these Time limit exceeded errors: I have to use
period_iterations 2(!), which also (?) produces almost no lag.

Doing 2 instances_per_device gives a load of 98% (device 0) and a load swinging
between 97% and 47% for device 1.

Estimates were also shorter than the runtime, DCF=9.011675!
CPU estimates are way off; a VHAR is estimated at 9000 seconds.

Estimates on GPUs and runtime are roughly equal.
1 core (2 threads) for 2 GPUs appears to be enough: during the first 20 seconds this core is at 100%, then GPU load rises to 98%.
I'll let it run for now; I have seen the first task validate after the change to period_iterations 2.
If more failures occur, shall I change DCF to 1?
____________

ignorance is no excuse
Joined: 4 Oct 00
Posts: 9529
Credit: 44,433,321
RAC: 0
Korea, North
Message 1264082 - Posted: 24 Jul 2012, 14:01:59 UTC - in response to Message 1264058.

Mike is correct: you need the 2 cores if you wish to run multiple WUs on the ATI GPU. I found this out the hard way. It works, just do it.
____________

Mike
Project donor
Volunteer tester
Joined: 17 Feb 01
Posts: 25194
Credit: 34,792,078
RAC: 20,986
Germany
Message 1264084 - Posted: 24 Jul 2012, 14:05:24 UTC

@ Fred

Please answer a few questions first.

What motherboard are you using?

How many PCI-E slots?

I will look into the details, but don't change so much.

Period_iterations_num is fine when you don't have any lags.
Don't change it anymore.

At what percentage are the units erring (failing)?


____________

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,520
RAC: 119
Netherlands
Message 1266158 - Posted: 31 Jul 2012, 11:06:08 UTC - in response to Message 1264084.
Last modified: 31 Jul 2012, 11:11:25 UTC

@ Fred

Please answer a few questions first.

What motherboard are you using?

How many PCI-E slots?

I will look into the details, but don't change so much.

Period_iterations_num is fine when you don't have any lags.
Don't change it anymore.

At what percentage are the units erring (failing)?



Mike, sorry for my late reply. I'm using an Intel DP67BG mobo with 2 PCIe 2.0 x16 slots, which run at x8 if both are used.
I'm using period_iterations 32, which gives the least lag.
Results: Valid (213) · Invalid (0) · Error (166)
(I even raised the base clock from 100 to 102MHz, giving higher FLOPS from the CPU; maybe I
should OC the GPUs?!)

The errors all have: Exit status 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED
I'm also leaving 1 core (2 threads) free for the GPUs.
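
For reference, the failure percentage asked about follows directly from these counts. A minimal Python sketch; the function name is only illustrative.

```python
def error_rate(valid: int, invalid: int, errors: int) -> float:
    """Return the share of finished tasks that ended in error, in percent."""
    total = valid + invalid + errors
    if total == 0:
        raise ValueError("no finished tasks to evaluate")
    return 100.0 * errors / total


# Counts from the task list above: Valid (213), Invalid (0), Error (166)
rate = error_rate(valid=213, invalid=0, errors=166)
print(f"{rate:.1f}% of finished tasks errored out")
```

That works out to roughly 44%, in line with the "about 40% errors" reported earlier in the thread.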
____________

Mike
Project donor
Volunteer tester
Joined: 17 Feb 01
Posts: 25194
Credit: 34,792,078
RAC: 20,986
Germany
Message 1267499 - Posted: 4 Aug 2012, 12:29:39 UTC
Last modified: 4 Aug 2012, 12:32:25 UTC

@Fred

I have calculated something for you.
Please add this line to your app_info:

<flops>509408724.212160</flops>

Put it below your command line.

This should increase estimates, reduce your DCF and limit the errors.

Beware: it's a very low value on purpose.
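
A rough sketch of why a deliberately low <flops> value should help. It assumes (this is my understanding of the BOINC client, not something confirmed in this thread) that the runtime estimate is rsc_fpops_est / flops scaled by DCF, and that a task is aborted with EXIT_TIME_LIMIT_EXCEEDED once its elapsed time passes rsc_fpops_bound / flops. The workunit numbers and the "before" flops value are made up for illustration; only the 509408724.212160 figure comes from the line above.

```python
def estimated_runtime(rsc_fpops_est: float, flops: float, dcf: float = 1.0) -> float:
    """Estimated runtime in seconds (assumed formula: est / flops * DCF)."""
    return rsc_fpops_est / flops * dcf


def time_limit(rsc_fpops_bound: float, flops: float) -> float:
    """Elapsed-time limit in seconds (assumed formula: bound / flops)."""
    return rsc_fpops_bound / flops


# Hypothetical workunit values, chosen only to make the effect visible.
RSC_FPOPS_EST = 2.0e13
RSC_FPOPS_BOUND = 10 * RSC_FPOPS_EST  # assumption: bound is 10x the estimate

HYPOTHETICAL_OLD_FLOPS = 5.0e10       # made-up "before" value
SUGGESTED_FLOPS = 509408724.212160    # value from the <flops> line

for label, flops in (("before", HYPOTHETICAL_OLD_FLOPS), ("after", SUGGESTED_FLOPS)):
    est = estimated_runtime(RSC_FPOPS_EST, flops)
    limit = time_limit(RSC_FPOPS_BOUND, flops)
    print(f"{label}: estimate {est:,.0f} s, time limit {limit:,.0f} s")
```

Under these assumptions a lower flops value lengthens both the estimate and the cutoff, so a slow task gets more time before it is killed.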
____________

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,520
RAC: 119
Netherlands
Message 1268349 - Posted: 6 Aug 2012, 12:28:31 UTC - in response to Message 1267499.
Last modified: 6 Aug 2012, 12:37:27 UTC

@Fred

I have calculated something for you.
Please add this line to your app_info:

<flops>509408724.212160</flops>

Put it below your command line.

This should increase estimates, reduce your DCF and limit the errors.

Beware: it's a very low value on purpose.


I'll try it. The value is 6x lower than the FLOPS (Whetstone) value of 1 CPU core,
but since it's the CPU that has ~6 times
higher estimates, it could/should work. Thanks for figuring
this out.
GPU estimates are within +/- 10% of actual runtime!

Since I've set 1 instance_per_device and no_cpu_lock,
together with 2 free cores, each feeding a GPU, errors have stopped
and runtimes have decreased to 50% compared to doing 2 instances_per_device.

But APR doesn't change, so this might help.
____________

ignorance is no excuse
Joined: 4 Oct 00
Posts: 9529
Credit: 44,433,321
RAC: 0
Korea, North
Message 1268387 - Posted: 6 Aug 2012, 14:55:29 UTC - in response to Message 1268349.

Here's how well my 7970 is working so far.

It and 6 cores of the FX-8150 are currently doubling the production of my second-best machine (AMD 630 w/ ATI 5850 GPU). I also play a lot of video games on the 7970 rig, so it ends up having less running time than my other rigs.

Still a smashing success, and again a great big thanks to Mike.
____________


Copyright © 2014 University of California