To Many ERRORS



Message boards : Number crunching : To Many ERRORS

Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3238
Credit: 31,694,209
RAC: 5,761
Netherlands
Message 1262545 - Posted: 20 Jul 2012, 14:42:30 UTC - in response to Message 1262529.
Last modified: 20 Jul 2012, 14:47:24 UTC


Do you think that these slow running tasks may be the same issue I reported in PM to you ('Task_Hang_on_ATI', 'Addendum__Task_Hang_on_ATI') and you "copied to the r390 thread at Lunatics"?
("task 'hang' on ATI and auto-continued after a long time (but validated OK)")

Do you think that what I noted in the second letter ('Addendum__Task_Hang_on_ATI') has any merit?:
"
About the pause in GPU processing during BOINC usage:

This may be some driver glitch -
it appears to me that if I start some GPU monitoring tool it kicks the ATI driver and the GPU processing continues.
(I may be wrong, this may be just a coincidence; I wasn't watching for this behaviour specifically,
but I can say that the GPU processing continued within about +- 2 minutes of the start of the GPU monitoring program.
)

It is probably not the GPU monitoring as such that does the 'kick' (as SIV and TThrottle run all the time).
I suppose it has to be the start (initialization phase) of the GPU monitoring tool.

It was GPU-Z in the case of 21fe12ad.19149.9502.8.10.160.vlar
It was ATI MemoryViewer in the case of 25ap12aa.26506.476.13.10.96
"



I was thinking the same, but could not find any 'real evidence' while the errors
(Exit status 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED) kept happening,
although to a lesser extent since I started using 2 threads to feed the GPUs.

I changed, a few minutes ago, period_iterations from 20 to 10, which
decreased runtime and increased GPU load; it also decreased lag.
I will let it run with this setting. By the way, I'm running 2 instances_per_device
for MB work.

(Win 7 64-bit, BOINC 7.0.28 64-bit, Lunatics rev. 390 app for MB, CPU = i7-2600,
GPUs: 2x AMD/ATI EAH5870.) All stock settings.
____________

Profile MikeProject donor
Volunteer tester
Joined: 17 Feb 01
Posts: 23818
Credit: 32,636,591
RAC: 23,237
Germany
Message 1262584 - Posted: 20 Jul 2012, 16:07:01 UTC - in response to Message 1262505.

That's what I expected.

Your DCF is too high; it should be around 1.
This, combined with low GPU usage, is causing these errors.

Have you set flops in app_info?

Try changing the DCF in client_state.xml.


Actually, a high DCF is good in his particular case and could prevent further -197 errors, even though it messes with the cache.

The long-running task I linked will have pulled DCF up, so the estimates AND, more crucially, the 10x-estimate abort limit are high (provided I'm getting my logic right). If further tasks run long but inside the new margin, they will keep DCF up even though the majority of tasks pull DCF back towards 1.

That at least might allow error-free processing until the underlying cause of the slow-running WUs is found.
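
The reasoning quoted above can be sketched in a few lines of Python. This only models what the post describes (estimate scaled by DCF, abort at 10x the estimate); it is not taken from the BOINC source, and whether the real client scales the abort limit by DCF is exactly the part the poster was unsure about. The function names and the 600-second raw estimate are made up for illustration.

```python
# Sketch of the model described in the post; NOT taken from the BOINC source.
ABORT_FACTOR = 10  # the "10x estimate" abort limit mentioned in the post


def estimated_runtime(raw_estimate_s: float, dcf: float) -> float:
    """Estimated runtime in seconds: raw estimate scaled by DCF (assumed model)."""
    return raw_estimate_s * dcf


def abort_limit(raw_estimate_s: float, dcf: float) -> float:
    """Elapsed time after which a task would be aborted under this model."""
    return ABORT_FACTOR * estimated_runtime(raw_estimate_s, dcf)


def would_abort(elapsed_s: float, raw_estimate_s: float, dcf: float) -> bool:
    """True if a task that ran for elapsed_s seconds would exceed the limit."""
    return elapsed_s > abort_limit(raw_estimate_s, dcf)


# Hypothetical task: raw estimate of 600 s, but it hangs and takes 9000 s.
hung_with_dcf_1 = would_abort(9000, 600, 1.0)    # limit is 6000 s
hung_with_dcf_3_6 = would_abort(9000, 600, 3.6)  # limit is about 21600 s
```

Under these assumptions the hung task is aborted at DCF 1 but survives at DCF 3.6, which is the effect the post is describing.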


I had exactly the same issue last week on my 5850.
I already said how to fix it.


____________

LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1262585 - Posted: 20 Jul 2012, 16:07:21 UTC - in response to Message 1262543.

to digress:
-Windows 7 ultimate fully updated.
-BOINC 7.0.28
-ATI 12.3 drivers installed without errors (uninstalled drivers, used Driver sweep, rebooted, installed 12.3)
-allowed 1 CPU core to remain idle and changed from the HD5 to the standard GPU app, and still getting errors

Are there any other ATI 7970 users that have this problem or use a different driver?
Would a period_iterations change in the app_info change anything?


it's worth a try at least if it's something app related. If you still get errors with something like 100 you know it's not that...
____________
I'm not the Pope. I don't speak Ex Cathedra!

Profile MikeProject donor
Volunteer tester
Joined: 17 Feb 01
Posts: 23818
Credit: 32,636,591
RAC: 23,237
Germany
Message 1262586 - Posted: 20 Jul 2012, 16:08:34 UTC - in response to Message 1262585.

to digress:
-Windows 7 ultimate fully updated.
-BOINC 7.0.28
-ATI 12.3 drivers installed without errors (uninstalled drivers, used Driver sweep, rebooted, installed 12.3)
-allowed 1 CPU core to remain idle and changed from the HD5 to the standard GPU app, and still getting errors

Are there any other ATI 7970 users that have this problem or use a different driver?
Would a period_iterations change in the app_info change anything?


it's worth a try at least if it's something app related. If you still get errors with something like 100 you know it's not that...


No, it won't change anything.

____________

LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1262588 - Posted: 20 Jul 2012, 16:10:46 UTC - in response to Message 1262584.

That's what I expected.

Your DCF is too high; it should be around 1.
This, combined with low GPU usage, is causing these errors.

Have you set flops in app_info?

Try changing the DCF in client_state.xml.


Actually, a high DCF is good in his particular case and could prevent further -197 errors, even though it messes with the cache.

The long-running task I linked will have pulled DCF up, so the estimates AND, more crucially, the 10x-estimate abort limit are high (provided I'm getting my logic right). If further tasks run long but inside the new margin, they will keep DCF up even though the majority of tasks pull DCF back towards 1.

That at least might allow error-free processing until the underlying cause of the slow-running WUs is found.


I had exactly the same issue last week on my 5850.
I already said how to fix it.


yes, I was wrong - happens.
I'll leave this to your capable hands then. If your suggestions don't help, we can get back to the drawing board.
____________
I'm not the Pope. I don't speak Ex Cathedra!

LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1262589 - Posted: 20 Jul 2012, 16:13:49 UTC - in response to Message 1262584.

I had exactly the same issue last week on my 5850.
I already said how to fix it.


Probably best if you repeat how to fix it, Mike.
I find it's rather hidden and skildude may have missed it.
____________
I'm not the Pope. I don't speak Ex Cathedra!

Profile ignorance is no excuse
Joined: 4 Oct 00
Posts: 9529
Credit: 44,433,274
RAC: 0
Korea, North
Message 1262595 - Posted: 20 Jul 2012, 16:24:22 UTC
Last modified: 20 Jul 2012, 16:31:17 UTC

On an FX you need to free 2 cores to get full GPU utilisation.


is that the fix?
I'll free another CPU core and see what happens

reduced my usage to 6 cores
I'm now wondering if I could up my instances to 4 on the GPU if this actually works
____________
In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

End terrorism by building a school

Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 2644
Credit: 6,006,007
RAC: 4,448
Bulgaria
Message 1262666 - Posted: 20 Jul 2012, 18:17:15 UTC - in response to Message 1262545.
Last modified: 20 Jul 2012, 18:34:07 UTC

I changed, a few minutes ago, period_iterations from 20 to 10, which
decreased runtime and increased GPU load; it also decreased lag.

Are you sure about the lag??
You can feel the lag most with VLARs, if you now run non-VLARs you will feel less lag.

I run with -period_iterations_num 80 and even with this higher value I feel small lag (especially when scrolling) if VLAR is running.
(with -period_iterations_num 10 lag is very big)


This makes me ask Raistmer - Is it possible to have some option that sets -period_iterations_num at different values depending on AR?
e.g.:
-period_iterations_num 20 -period_iterations_num_VLAR 100
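
As a rough illustration of what such an option would do, here is a minimal Python sketch. It is not how the Lunatics app works: the `-period_iterations_num_VLAR` switch is only the suggestion above, the function names are hypothetical, and the 0.12 angle-range cutoff for "VLAR" is an assumption that may not match the app's own definition.

```python
# Hypothetical sketch: pick -period_iterations_num from the task's angle range.
VLAR_THRESHOLD = 0.12  # ASSUMED angle-range cutoff below which a task counts as VLAR


def period_iterations_for(angle_range: float, normal: int = 20, vlar: int = 100) -> int:
    """Return the larger value for VLAR tasks (less lag), the smaller one otherwise."""
    return vlar if angle_range < VLAR_THRESHOLD else normal


def build_cmdline(angle_range: float) -> str:
    """Build the command-line switch as it would be passed to the app."""
    return f"-period_iterations_num {period_iterations_for(angle_range)}"
```

With the defaults from the example above, `build_cmdline(0.01)` gives `-period_iterations_num 100` and `build_cmdline(0.42)` gives `-period_iterations_num 20`.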


____________



- ALF - "Find out what you don't do well ..... then don't do it!" :)

Profile MikeProject donor
Volunteer tester
Joined: 17 Feb 01
Posts: 23818
Credit: 32,636,591
RAC: 23,237
Germany
Message 1262710 - Posted: 20 Jul 2012, 21:25:49 UTC - in response to Message 1262595.
Last modified: 20 Jul 2012, 21:26:19 UTC

On an FX you need to free 2 cores to get full GPU utilisation.


is that the fix?
I'll free another CPU core and see what happens

reduced my usage to 6 cores
I'm now wondering if I could up my instances to 4 on the GPU if this actually works


Yes.
And watch your DCF.

Tell me your GPU usage please.
____________

Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3238
Credit: 31,694,209
RAC: 5,761
Netherlands
Message 1262721 - Posted: 20 Jul 2012, 22:13:09 UTC - in response to Message 1262666.

I changed, a few minutes ago, period_iterations from 20 to 10, which
decreased runtime and increased GPU load; it also decreased lag.

Are you sure about the lag??
You can feel the lag most with VLARs, if you now run non-VLARs you will feel less lag.

I run with -period_iterations_num 80 and even with this higher value I feel small lag (especially when scrolling) if VLAR is running.
(with -period_iterations_num 10 lag is very big)


This makes me ask Raistmer - Is it possible to have some option that sets -period_iterations_num at different values depending on AR?
e.g.:
-period_iterations_num 20 -period_iterations_num_VLAR 100



I'm a little confused too; I expected to see an increase in lag with a lower
period_iterations_for_pulsefind. Probably each card/GPU has its
'best' setting for period_iterations_for_pulsefind...
The difference in runtime is small compared to 20, but I'll keep 10.
The biggest difference came from freeing up 2 threads instead of 1; that's 1 i7-2600
core.
That's what Mike suggested in the first place, too.


____________

clive G1FYE
Volunteer moderator
Joined: 4 Nov 04
Posts: 1300
Credit: 23,054,144
RAC: 0
United Kingdom
Message 1262789 - Posted: 21 Jul 2012, 2:35:59 UTC
Last modified: 21 Jul 2012, 2:43:14 UTC

I seem to be a bit late getting to the party.
I have set 'period_iterations 2' and the lag is only a problem when a workunit is starting and being loaded into the GPU.
This thing is a crunch box, so I am willing to tolerate quite a bit of lag.
The CPU is only a P4 3.6 GHz (Prescott 660) with HT. The CPU only crunches one FreeHAL NCI task so as to keep its load down.
I find that the P4 is often overwhelmed by the demands of two 7970s; during a shorty storm it cannot cope with servicing the GPUs and stays at 100% load for several minutes at a time, which makes the computer unusable for me.
Though if it is 'busy', it is up to me to leave it alone to get on with it and go play with one of the other computers.
I did 'borrow' my Q6600 from another rig to see how it fared, and in that short test I found that I had to keep one core free to feed each GPU, though I was not using -pi2 or -hp at that time.
When crunching on all four CPU cores I was getting Maximum_Time_Exceeded errors; these stopped with two cores free for the GPUs to use.
I am only running two WUs per card because the PSU can't cope with any more. It is only a Corsair HX620W and this box is eating about 500 W; I have to get another PSU before adding the third card.

edit - OS win7home64, BM 7.0.28, ccc12.4

Profile ignorance is no excuse
Joined: 4 Oct 00
Posts: 9529
Credit: 44,433,274
RAC: 0
Korea, North
Message 1262829 - Posted: 21 Jul 2012, 5:22:20 UTC - in response to Message 1262710.

On an FX you need to free 2 cores to get full GPU utilisation.


is that the fix?
I'll free another CPU core and see what happens

reduced my usage to 6 cores
I'm now wondering if I could up my instances to 4 on the GPU if this actually works


Yes.
And watch your DCF.

Tell me your GPU usage please.

95-100%
____________
In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

End terrorism by building a school

Profile MikeProject donor
Volunteer tester
Joined: 17 Feb 01
Posts: 23818
Credit: 32,636,591
RAC: 23,237
Germany
Message 1262896 - Posted: 21 Jul 2012, 11:03:13 UTC

I see no more errors.

Your times have stabilized as well.
Nice card IMHO.

____________

Profile ignorance is no excuse
Joined: 4 Oct 00
Posts: 9529
Credit: 44,433,274
RAC: 0
Korea, North
Message 1263121 - Posted: 21 Jul 2012, 22:10:40 UTC - in response to Message 1262896.

yet I have a 5850 that isn't having this problem.
____________
In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

End terrorism by building a school

Profile MikeProject donor
Volunteer tester
Joined: 17 Feb 01
Posts: 23818
Credit: 32,636,591
RAC: 23,237
Germany
Message 1263144 - Posted: 21 Jul 2012, 23:03:17 UTC - in response to Message 1263121.

yet I have a 5850 that isn't having this problem.


I don't see a problem anymore on your 7970.

____________

Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3238
Credit: 31,694,209
RAC: 5,761
Netherlands
Message 1263377 - Posted: 22 Jul 2012, 16:15:06 UTC - in response to Message 1263144.
Last modified: 22 Jul 2012, 16:32:54 UTC

yet I have a 5850 that isn't having this problem.


I don't see a problem anymore on your 7970.


Well, the Exit status 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED errors have surfaced
again...

Tried doing 1 WU per GPU to see if that helps.

That gave a very low load, so I'm back to 2 per GPU, with period_iterations at 40
instead of 10.
Still using 1 i7-2600 thread (of 8) for each GPU (ATI HD5870).

CPU load during the first 10 seconds is 100% per (idle) thread.
Errors appear on both the 1st and 2nd GPU, which have about the same load, 85% on average.
____________

Profile MikeProject donor
Volunteer tester
Joined: 17 Feb 01
Posts: 23818
Credit: 32,636,591
RAC: 23,237
Germany
Message 1263393 - Posted: 22 Jul 2012, 17:01:16 UTC
Last modified: 22 Jul 2012, 17:03:01 UTC

Don't confuse me please, Fred.

What's your DCF?

Have you included flops in your app_info?

What are the estimated times on the GPUs?

How many CPU cores are in use?
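
For the DCF question, a small Python sketch of reading (and resetting) the value in client_state.xml might look like the following. Treat it as an illustration only: the element names `project`, `master_url` and `duration_correction_factor` are how the file is remembered to look, not something confirmed in this thread, the sample XML is made up, and a real client_state.xml should only be edited with the BOINC client stopped and a backup at hand.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed stand-in for a real client_state.xml (assumed layout).
SAMPLE_STATE = """<client_state>
  <project>
    <master_url>http://setiathome.berkeley.edu/</master_url>
    <duration_correction_factor>3.614240</duration_correction_factor>
  </project>
</client_state>"""


def read_dcf(state_xml: str) -> dict:
    """Return {project URL: DCF} for every project block that carries a DCF."""
    root = ET.fromstring(state_xml)
    result = {}
    for project in root.iter("project"):
        url = project.findtext("master_url", default="(unknown)")
        dcf = project.findtext("duration_correction_factor")
        if dcf is not None:
            result[url] = float(dcf)
    return result


def set_dcf(state_xml: str, new_dcf: float) -> str:
    """Return a copy of the XML text with every DCF element set to new_dcf."""
    root = ET.fromstring(state_xml)
    for node in root.iter("duration_correction_factor"):
        node.text = f"{new_dcf:.6f}"
    return ET.tostring(root, encoding="unicode")
```

Calling `read_dcf(SAMPLE_STATE)` reports the DCF per project, and `set_dcf(SAMPLE_STATE, 1.0)` returns the text with the factor reset, which is the kind of change suggested earlier in the thread.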
____________

Profile ignorance is no excuse
Joined: 4 Oct 00
Posts: 9529
Credit: 44,433,274
RAC: 0
Korea, North
Message 1263438 - Posted: 22 Jul 2012, 19:19:13 UTC - in response to Message 1263393.

I still don't get why it needs 2 cores to load
____________
In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

End terrorism by building a school

Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3238
Credit: 31,694,209
RAC: 5,761
Netherlands
Message 1263458 - Posted: 22 Jul 2012, 20:10:44 UTC - in response to Message 1263393.
Last modified: 22 Jul 2012, 20:16:18 UTC

Don't confuse me please, Fred.

What's your DCF?

Have you included flops in your app_info?

What are the estimated times on the GPUs?

How many CPU cores are in use?


Why should I confuse you?
Task duration correction factor: 3.61424

No FLOPS included (never had on this rig).

3 cores (6 threads) are in use; 1 core, or 2 threads (HT on), to feed the GPUs.

Estimated times are, of course, too high: 1.5x runtime.
____________

Profile MikeProject donor
Volunteer tester
Joined: 17 Feb 01
Posts: 23818
Credit: 32,636,591
RAC: 23,237
Germany
Message 1263463 - Posted: 22 Jul 2012, 20:22:37 UTC
Last modified: 22 Jul 2012, 21:15:18 UTC

First of all, you quoted my reply to skildude,
so I got confused.

Anyway.

I fear you need to free 1 physical core per GPU,
not one thread.

Try it please to see if this helps.
It certainly should.
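
To put numbers on "one physical core per GPU", here is a minimal Python sketch that works out what share of the logical CPUs is left for CPU tasks. The function name and parameters are hypothetical, and the resulting percentage is only what one might enter in a "use at most N% of the processors" style preference; the 8-thread FX figure in the usage note is an assumption, not something stated in the thread.

```python
def boinc_cpu_percent(logical_cpus: int, threads_per_core: int,
                      gpus: int, cores_per_gpu: int = 1) -> float:
    """Percentage of logical CPUs left for CPU tasks after reserving
    cores_per_gpu physical cores for each GPU."""
    reserved_threads = gpus * cores_per_gpu * threads_per_core
    if reserved_threads >= logical_cpus:
        raise ValueError("reservation leaves no CPU threads for CPU tasks")
    usable_threads = logical_cpus - reserved_threads
    return 100.0 * usable_threads / logical_cpus
```

For the i7-2600 discussed here (8 threads, 2 per core, 2 GPUs) this gives `boinc_cpu_percent(8, 2, 2)` = 50.0, i.e. 4 threads left for CPU work. For an FX assumed to expose 8 cores with 2 cores freed for a single GPU, `boinc_cpu_percent(8, 1, 1, cores_per_gpu=2)` = 75.0, matching the "6 cores" figure mentioned earlier.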
____________


Copyright © 2014 University of California