To Many ERRORS

Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1262545 - Posted: 20 Jul 2012, 14:42:30 UTC - in response to Message 1262529.  
Last modified: 20 Jul 2012, 14:47:24 UTC


Do you think that these slow running tasks may be the same issue I reported in PM to you ('Task_Hang_on_ATI', 'Addendum__Task_Hang_on_ATI') and you "copied to the r390 thread at Lunatics"?
("task 'hang' on ATI and auto-continued after a long time (but validated OK)")

Do you think that what I noted in the second letter ('Addendum__Task_Hang_on_ATI') has any merit?:
"
About the pause in GPU processing during BOINC usage:

This may be some driver glitch -
it appears to me that if I start some GPU monitoring tool it kicks the ATI driver and the GPU processing continues.
(I may be wrong, this may be just a coincidence; I wasn't watching for this behaviour specifically,
but I can say that the GPU processing continued within +- 2 minutes of the start of the GPU monitoring program.
)

It is probably not the GPU monitoring as such that does the 'kick' (as SIV and TThrottle run all the time).
I suppose it has to be the start (initialization phase) of the GPU monitoring tool.

It was GPU-Z in the case of 21fe12ad.19149.9502.8.10.160.vlar
It was ATI MemoryViewer in the case of 25ap12aa.26506.476.13.10.96
"



I was thinking the same, but could not find any 'real evidence' while the errors
Exit status 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED kept happening,
although to a lesser extent since I used 2 threads to feed the GPUs.

I changed, a few minutes ago, period_iterations from 20 to 10, which
decreased runtime and increased GPU load, and also decreased lag.
I will let it run with this setting. By the way, I'm doing 2 instances_per_device
for MB work.

(Win 7 64-bit, BOINC 7.0.28 64-bit, Lunatics rev. 390 app for MB, CPU = i7-2600,
GPUs 2x AMD/ATI EAH5870.) All stock settings.
ID: 1262545 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1262584 - Posted: 20 Jul 2012, 16:07:01 UTC - in response to Message 1262505.  

That's what I expected.

Your DCF is too high; it should be around 1.
This, combined with low GPU usage, is causing these errors.

Have you set flops in app_info?

Try changing DCF in client_state.xml.


Actually, a high DCF is good in his particular case and could prevent further -197 errors, even though it messes with the cache.

The long-running task I linked will have pulled DCF up, so the estimates AND, more crucially, the 10x estimate abort limit with them are high (provided I'm getting my logic right), and if further tasks run long but inside the new margin, they will keep DCF up even though the majority of tasks pull DCF back to 1.

That at least might allow error-free processing until the underlying cause of the slow-running WUs is found.


I had exactly the same issue last week on my 5850.
I already said how to fix it.




With each crime and every kindness we birth our future.
ID: 1262584 · Report as offensive
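
For anyone trying to follow the DCF argument above, here is a minimal Python sketch of how the runtime estimate and the -197 abort limit might relate. It assumes that the estimate is rsc_fpops_est / flops scaled by DCF, and that the abort limit is rsc_fpops_bound / flops with no DCF involved. Both formulas are an assumption about client behaviour, the function names are made up for this sketch, and the numbers are invented rather than taken from any host in this thread.

```python
# Sketch only: the formulas are an assumption about how the BOINC client
# behaves, and every number below is invented for illustration.

def estimated_runtime(rsc_fpops_est: float, flops: float, dcf: float) -> float:
    """Runtime estimate in seconds, scaled by the duration correction factor."""
    return rsc_fpops_est / flops * dcf


def elapsed_time_limit(rsc_fpops_bound: float, flops: float) -> float:
    """Elapsed seconds after which a task would end with exit status 197."""
    return rsc_fpops_bound / flops


def would_abort(elapsed_seconds: float, rsc_fpops_bound: float, flops: float) -> bool:
    """True if a task that has run for elapsed_seconds is past the limit."""
    return elapsed_seconds > elapsed_time_limit(rsc_fpops_bound, flops)


if __name__ == "__main__":
    est, bound = 3.0e13, 3.0e14  # hypothetical workunit, bound = 10x estimate
    flops = 1.0e11               # hypothetical speed assumed for the GPU app
    print("estimate at DCF 1.0:", estimated_runtime(est, flops, dcf=1.0))
    print("estimate at DCF 3.6:", estimated_runtime(est, flops, dcf=3.6))
    print("abort limit:", elapsed_time_limit(bound, flops))
    print("abort after one hour:", would_abort(3600.0, bound, flops))
```

If these assumptions hold, a task that crawls because the GPU is starved of CPU time hits the limit regardless of DCF, while a flops value changes both the estimate and the limit.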
LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1262585 - Posted: 20 Jul 2012, 16:07:21 UTC - in response to Message 1262543.  

to digress:
-Windows 7 ultimate fully updated.
-BOINC 7.0.28
-ATI 12.3 drivers installed without errors (uninstalled drivers, used Driver sweep, rebooted, installed 12.3)
-Allowed 1 CPU core to remain idle and changed from the HD5 to the standard GPU app, and I'm still getting errors

Are there any other ATI 7970 users that have this problem or use a different driver?
Would changing period_iterations in the app_info change anything?


It's worth a try at least, in case it's something app related. If you still get errors with something like 100, you know it's not that...
I'm not the Pope. I don't speak Ex Cathedra!
ID: 1262585 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1262586 - Posted: 20 Jul 2012, 16:08:34 UTC - in response to Message 1262585.  

to digress:
-Windows 7 ultimate fully updated.
-BOINC 7.0.28
-ATI 12.3 drivers installed without errors (uninstalled drivers, used Driver sweep, rebooted, installed 12.3)
-Allowed 1 CPU core to remain idle and changed from the HD5 to the standard GPU app, and I'm still getting errors

Are there any other ATI 7970 users that have this problem or use a different driver?
Would changing period_iterations in the app_info change anything?


It's worth a try at least, in case it's something app related. If you still get errors with something like 100, you know it's not that...


No, it won't change anything.



With each crime and every kindness we birth our future.
ID: 1262586 · Report as offensive
LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1262588 - Posted: 20 Jul 2012, 16:10:46 UTC - in response to Message 1262584.  

That's what I expected.

Your DCF is too high; it should be around 1.
This, combined with low GPU usage, is causing these errors.

Have you set flops in app_info?

Try changing DCF in client_state.xml.


Actually, a high DCF is good in his particular case and could prevent further -197 errors, even though it messes with the cache.

The long-running task I linked will have pulled DCF up, so the estimates AND, more crucially, the 10x estimate abort limit with them are high (provided I'm getting my logic right), and if further tasks run long but inside the new margin, they will keep DCF up even though the majority of tasks pull DCF back to 1.

That at least might allow error-free processing until the underlying cause of the slow-running WUs is found.


I had exactly the same issue last week on my 5850.
I already said how to fix it.


Yes, I was wrong; it happens.
I'll leave this in your capable hands then. If your suggestions don't help, we can go back to the drawing board.
I'm not the Pope. I don't speak Ex Cathedra!
ID: 1262588 · Report as offensive
LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1262589 - Posted: 20 Jul 2012, 16:13:49 UTC - in response to Message 1262584.  

I had exactly the same issue last week on my 5850.
I already said how to fix it.


Probably best if you repeat how to fix it, Mike.
I find it's rather hidden and skildude may have missed it.
I'm not the Pope. I don't speak Ex Cathedra!
ID: 1262589 · Report as offensive
Profile skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1262595 - Posted: 20 Jul 2012, 16:24:22 UTC
Last modified: 20 Jul 2012, 16:31:17 UTC

On an FX you need to free 2 cores to get full GPU utilisation.


Is that the fix?
I'll free another CPU core and see what happens.

Reduced my usage to 6 cores.
I'm now wondering if I could up my instances to 4 on the GPU if this actually works.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1262595 · Report as offensive
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1262666 - Posted: 20 Jul 2012, 18:17:15 UTC - in response to Message 1262545.  
Last modified: 20 Jul 2012, 18:34:07 UTC

I changed, a few minutes ago, period_iterations from 20 to 10, which
decreased runtime and increased GPU load, and also decreased lag.

Are you sure about the lag??
You can feel the lag most with VLARs, if you now run non-VLARs you will feel less lag.

I run with -period_iterations_num 80, and even with this higher value I feel a small lag (especially when scrolling) if a VLAR is running.
(With -period_iterations_num 10 the lag is very big.)


This makes me ask Raistmer - Is it possible to have some option that sets -period_iterations_num at different values depending on AR?
e.g.:
-period_iterations_num 20 -period_iterations_num_VLAR 100


 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1262666 · Report as offensive
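
The per-AR option requested above could be sketched roughly like this in Python. This is only an illustration of the request, not an existing feature: the -period_iterations_num_VLAR name is the poster's proposal, the helper name is hypothetical, and the 0.12 angle-range cutoff for VLAR is an assumption.

```python
# Hypothetical helper: neither a VLAR-specific option nor this threshold is
# known to exist in the application; they only illustrate the request.

VLAR_THRESHOLD = 0.12  # assumed angle-range cutoff below which a task is a VLAR


def pick_period_iterations(angle_range: float,
                           default_num: int = 20,
                           vlar_num: int = 100,
                           vlar_threshold: float = VLAR_THRESHOLD) -> int:
    """Return the period-iterations value to use for a task's angle range."""
    if angle_range < 0:
        raise ValueError("angle_range must be non-negative")
    # VLAR tasks get the larger value (more, shorter kernel calls, less lag).
    return vlar_num if angle_range < vlar_threshold else default_num


if __name__ == "__main__":
    for ar in (0.01, 0.42, 2.7):
        print(ar, pick_period_iterations(ar))
```

The defaults mirror the example values 20 and 100 from the post; anything else about how the app would read such an option is left open.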
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1262710 - Posted: 20 Jul 2012, 21:25:49 UTC - in response to Message 1262595.  
Last modified: 20 Jul 2012, 21:26:19 UTC

On an FX you need to free 2 cores to get full GPU utilisation.


Is that the fix?
I'll free another CPU core and see what happens.

Reduced my usage to 6 cores.
I'm now wondering if I could up my instances to 4 on the GPU if this actually works.


Yes.
And watch your DCF.

Tell me your GPU usage please.


With each crime and every kindness we birth our future.
ID: 1262710 · Report as offensive
Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1262721 - Posted: 20 Jul 2012, 22:13:09 UTC - in response to Message 1262666.  

I changed, a few minutes ago, period_iterations from 20 to 10, which
decreased runtime and increased GPU load, and also decreased lag.

Are you sure about the lag??
You can feel the lag most with VLARs, if you now run non-VLARs you will feel less lag.

I run with -period_iterations_num 80, and even with this higher value I feel a small lag (especially when scrolling) if a VLAR is running.
(With -period_iterations_num 10 the lag is very big.)


This makes me ask Raistmer - Is it possible to have some option that sets -period_iterations_num at different values depending on AR?
e.g.:
-period_iterations_num 20 -period_iterations_num_VLAR 100



I'm a little confused too; I expected to see an increase in lag with a lower
period_iterations_for_pulsefind. Probably each card/GPU has its
'best' setting for period_iterations_for_pulsefind....
The difference in runtime is small compared to 20, but I'll keep 10.
The biggest difference was freeing up 2 threads instead of 1; that's 1 i7-2600
core.
That's what Mike suggested in the first place, too.


ID: 1262721 · Report as offensive
.clair.

Joined: 4 Nov 04
Posts: 1300
Credit: 55,390,408
RAC: 69
United Kingdom
Message 1262789 - Posted: 21 Jul 2012, 2:35:59 UTC
Last modified: 21 Jul 2012, 2:43:14 UTC

I seem to be a bit late getting to the party.
I have set 'period_iterations 2' and the lag is only a problem when a workunit is starting and being loaded onto the GPU.
This thing is a crunch box, so I am willing to tolerate quite a bit of lag.
The CPU is only a P4 3.6 GHz (Prescott 660) with HT. The CPU only crunches one FreeHAL NCI task so as to keep its load down.
I find that the P4 is often overwhelmed by the demands of two 7970s, and during a shorty storm it cannot cope with servicing the GPUs and stays at 100% load for several minutes at a time, which makes the computer unusable for me.
Though if it is 'busy', it is up to me to leave it alone to get on with it and go play with one of the other computers.
I did 'borrow' my Q6600 from another rig to see how it fared, and in that short test I found that I had to keep one core free to feed each GPU, though I was not using -pi2 or -hp at that time.
When crunching on all four CPU cores I was getting Maximum_Time_Exceeded errors; these stopped with two cores free for the GPUs to use.
I am only running two WUs per card because the PSU can't cope with any more. It is only a Corsair HX620W and this box is eating about 500 W, so I have to get another PSU before the third card.

Edit - OS Win7 Home 64, BM 7.0.28, CCC 12.4.
ID: 1262789 · Report as offensive
Profile skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1262829 - Posted: 21 Jul 2012, 5:22:20 UTC - in response to Message 1262710.  

On an FX you need to free 2 cores to get full GPU utilisation.


Is that the fix?
I'll free another CPU core and see what happens.

Reduced my usage to 6 cores.
I'm now wondering if I could up my instances to 4 on the GPU if this actually works.


Yes.
And watch your DCF.

Tell me your GPU usage please.

95-100%


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1262829 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1262896 - Posted: 21 Jul 2012, 11:03:13 UTC

I see no errors anymore.

Your times have stabilized as well.
Nice card IMHO.



With each crime and every kindness we birth our future.
ID: 1262896 · Report as offensive
Profile skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1263121 - Posted: 21 Jul 2012, 22:10:40 UTC - in response to Message 1262896.  

yet I have a 5850 that isn't having this problem.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1263121 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1263144 - Posted: 21 Jul 2012, 23:03:17 UTC - in response to Message 1263121.  

yet I have a 5850 that isn't having this problem.


I don't see a problem anymore on your 7970.



With each crime and every kindness we birth our future.
ID: 1263144 · Report as offensive
Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1263377 - Posted: 22 Jul 2012, 16:15:06 UTC - in response to Message 1263144.  
Last modified: 22 Jul 2012, 16:32:54 UTC

yet I have a 5850 that isn't having this problem.


I don't see a problem anymore on your 7970.


Well, the Exit status 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED errors have surfaced
again.........

Trying to do 1 WU per GPU, to see if that helps.

That gives a very low load, so back to 2 per GPU and period_iterations 40
instead of 10.
Still using 1 i7-2600 thread (of 8) for each GPU (ATI HD5870).

CPU load during the first 10 seconds is 100% per (idle) thread.
Errors appear on both the 1st and 2nd GPU, which have about the same load, 85% on average.
ID: 1263377 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1263393 - Posted: 22 Jul 2012, 17:01:16 UTC
Last modified: 22 Jul 2012, 17:03:01 UTC

Don't confuse me please, Fred.

What's your DCF?

Do you have flops included in your app_info?

What are the estimated times on the GPUs?

How many CPU cores are in use?


With each crime and every kindness we birth our future.
ID: 1263393 · Report as offensive
Profile skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1263438 - Posted: 22 Jul 2012, 19:19:13 UTC - in response to Message 1263393.  

I still don't get why it needs 2 cores to load.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1263438 · Report as offensive
Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1263458 - Posted: 22 Jul 2012, 20:10:44 UTC - in response to Message 1263393.  
Last modified: 22 Jul 2012, 20:16:18 UTC

Don't confuse me please, Fred.

What's your DCF?

Do you have flops included in your app_info?

What are the estimated times on the GPUs?

How many CPU cores are in use?


Why should I confuse you?
Task duration correction factor: 3.61424

No FLOPS included. (Never had them on this rig.)

3 cores, 6 threads are in use. 1 core or 2 threads (HT = on) to feed the GPUs.

Estimated times are, of course, too high: 1.5 x runtime.
ID: 1263458 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1263463 - Posted: 22 Jul 2012, 20:22:37 UTC
Last modified: 22 Jul 2012, 21:15:18 UTC

First of all, you quoted my reply to skildude.
So I got confused.

Anyway.

I fear you need to free 1 physical core per GPU.
Not one thread.

Please try it to see if this helps.
It certainly should.


With each crime and every kindness we birth our future.
ID: 1263463 · Report as offensive
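
As a rough aid for the "one physical core per GPU, not one thread" advice, here is a small Python sketch that turns it into a value for the "use at most N% of the processors" computing preference. The helper name and its parameters are made up for illustration, and rounding up is a guess at what keeps the intended number of CPU tasks running.

```python
# Sketch under stated assumptions: the helper and its parameters are
# hypothetical; only the "free whole cores for the GPUs" rule is from the thread.

def max_cpu_percent(logical_cpus: int, gpus: int,
                    cores_per_gpu: int = 1, threads_per_core: int = 1) -> int:
    """Percentage of logical CPUs left for CPU tasks after reserving
    whole physical cores to feed the GPUs."""
    reserved = gpus * cores_per_gpu * threads_per_core
    if logical_cpus < 1 or reserved >= logical_cpus:
        raise ValueError("no logical CPUs left for CPU tasks")
    usable = logical_cpus - reserved
    # Ceiling division, so truncation on the client side cannot drop a CPU.
    return -(-usable * 100 // logical_cpus)


if __name__ == "__main__":
    # 8 threads with HT, two GPUs, one physical core (two threads) each.
    print(max_cpu_percent(8, gpus=2, threads_per_core=2))  # 50
    # 8 cores without HT, one GPU that needs two cores freed.
    print(max_cpu_percent(8, gpus=1, cores_per_gpu=2))     # 75
```

The second call matches "reduced my usage to 6 cores" on an 8-core CPU; the first shows why freeing one thread per GPU on a Hyper-Threaded CPU reserves only half of what is suggested here.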


 