To Many ERRORS

Message boards : Number crunching : To Many ERRORS

ignorance is no excuse
Joined: 4 Oct 00
Posts: 9529
Credit: 44,433,321
RAC: 0
Korea, North
Message 1262088 - Posted: 19 Jul 2012, 3:18:38 UTC

http://setiathome.berkeley.edu/result.php?resultid=2530015233
An example of many.
I've uninstalled the ATI 12.6 drivers and reinstalled the standard drivers that came with the card. I'm not sure why the card isn't getting work loaded properly.
____________
In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

End terrorism by building a school

ignorance is no excuse
Joined: 4 Oct 00
Posts: 9529
Credit: 44,433,321
RAC: 0
Korea, North
Message 1262202 - Posted: 19 Jul 2012, 13:57:13 UTC

98 views and not one suggestion or idea about what is happening here? I could use the help.
____________
In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

End terrorism by building a school

Richard Haselgrove (Project donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 8629
Credit: 51,430,366
RAC: 50,609
United Kingdom
Message 1262203 - Posted: 19 Jul 2012, 14:01:34 UTC - in response to Message 1262202.

None of the developers seem to be online at the moment. We're thinking about it.

LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1262218 - Posted: 19 Jul 2012, 15:21:51 UTC

Personally I hadn't commented because it's the ATI GPU app, with which I've no experience, so I usually leave that to Mike and Raistmer to sort out.

Superficially, getting tasks stuck and aborted by BOINC points to problems with the app or the host, e.g. a bad driver.

What's puzzling is that you have strings of 'good' tasks interspersed with long-running ones like http://setiathome.berkeley.edu/result.php?resultid=2530015217 which is a VHAR and should have taken some 200 sec like the other ones on the host. It managed to complete just inside the 10x cutoff. The others may actually be processing, but too slowly, and get aborted (as opposed to being stuck).

The card may be intermittently downclocking for some reason. Any chance you can monitor that host to see if tasks are actually progressing, and check the system for anomalies once a task goes past normal runtimes?

At this point it might be anything: app, driver, Windows updates, BOINC version, wacky tasks...
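
One way to spot the tasks worth watching is to compare each runtime against the host's typical runtime. The following is only a rough sketch: the function name `flag_slow_tasks`, the `factor` threshold, and the sample runtimes are all hypothetical and are not taken from the host's actual task list.

```python
from statistics import median

def flag_slow_tasks(runtimes_sec, factor=3.0):
    """Return (index, runtime) pairs whose runtime exceeds factor x the median.

    The factor of 3.0 is an arbitrary illustration, not a BOINC setting.
    """
    typical = median(runtimes_sec)
    return [(i, t) for i, t in enumerate(runtimes_sec) if t > factor * typical]

# Hypothetical runtimes in seconds: mostly around 200 s, with one long runner.
sample = [198.0, 205.0, 201.0, 1890.0, 199.0]
slow = flag_slow_tasks(sample)
for index, runtime in slow:
    print(f"task #{index} ran {runtime:.0f} s, well above the typical runtime")
```

A median is used rather than a mean so that a single very long task does not drag the "typical" value up with it.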
____________
I'm not the Pope. I don't speak Ex Cathedra!

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4424
Credit: 118,565,236
RAC: 136,537
United States
Message 1262240 - Posted: 19 Jul 2012, 16:37:40 UTC
Last modified: 19 Jul 2012, 16:40:33 UTC

From your tasks, it is still occurring with the 12.2 or 12.3 drivers that you downgraded to.

If you are running more than 1 task at a time, I would guess that could be the issue. If you were running 3 at a time and 2 VLARs hit at the same time as a VHAR, the slowdown caused by the VLARs could make the VHAR run too long.

If that is occurring, then assigning tasks a value for their load would be a good idea. Normal tasks would have a load of 1.0, and VLARs might have a rating greater than 1.0, such as 1.1-1.5. A max load value could then be assigned per processing device.
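
The load-rating idea above can be sketched in a few lines. This is a hypothetical illustration only: BOINC has no such feature as far as this thread shows, and the function name `pick_tasks`, the task names, and the chosen load values (1.0 for normal tasks, 1.5 for VLARs) are assumptions made for the example.

```python
def pick_tasks(queue, max_load):
    """Greedily start queued tasks, in order, while total load stays within max_load.

    queue is a list of (task_name, load) pairs. Tasks that would push the
    device over its budget are left waiting.
    """
    running, total = [], 0.0
    for name, load in queue:
        # Small tolerance so that sums of floats such as 1.1 + 1.1 behave.
        if total + load <= max_load + 1e-9:
            running.append(name)
            total += load
    return running, total

# Hypothetical queue: two VLARs rated 1.5 each, then a VHAR and a normal task at 1.0.
queue = [("vlar_a", 1.5), ("vlar_b", 1.5), ("vhar_c", 1.0), ("mid_d", 1.0)]
running, total = pick_tasks(queue, max_load=3.0)
print(running, total)
```

With a per-device budget of 3.0, three normal tasks would still run together, but two VLARs would use up the budget and keep the VHAR waiting instead of slowing it down.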
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3250
Credit: 31,880,865
RAC: 3,879
Netherlands
Message 1262246 - Posted: 19 Jul 2012, 17:03:26 UTC - in response to Message 1262240.
Last modified: 19 Jul 2012, 17:08:31 UTC

For the past 2 days I've been experiencing errors on my ATI 5870 GPUs,
only on MB work: exit status 197 (0xc5), EXIT_TIME_LIMIT_EXCEEDED.
I freed up 1 core of the i7-2600 (2 threads, since I'm using HT).
GPU load is higher in the first 15 seconds and I've seen the error rate going down,
but I'm still unsure what the exact reason is.

Vendor: Advanced Micro Devices, Inc. Driver version: CAL 1.4.1720 (VM) Version: OpenCL 1.2 AMD-APP (923.1)


Not on AstroPulse work.

Nothing was changed before these errors occurred?!
____________

ignorance is no excuse
Joined: 4 Oct 00
Posts: 9529
Credit: 44,433,321
RAC: 0
Korea, North
Message 1262251 - Posted: 19 Jul 2012, 17:30:08 UTC - in response to Message 1262218.

I did notice that the CPU load time is very short compared to work that actually completes. This makes me curious as to whether the WU is not loading properly or completely and it just sits there for 30 minutes and times out.

____________
In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

End terrorism by building a school

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3250
Credit: 31,880,865
RAC: 3,879
Netherlands
Message 1262261 - Posted: 19 Jul 2012, 18:13:30 UTC - in response to Message 1262251.
Last modified: 19 Jul 2012, 18:17:43 UTC

Since it's only MB (6.10), and the rev. 390 app for ATI, BOINC 7.0.28 (64-bit) and Win7 (64-bit) have been running for some time, that leaves only (?) wacky tasks and/or Win 7 updates?!

It's the first time I've seen this kind of error, and not only on VHAR but also on VLAR.
____________

Mike (Project donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 24481
Credit: 33,791,206
RAC: 24,226
Germany
Message 1262313 - Posted: 19 Jul 2012, 21:23:10 UTC

Have you set a core free, skildude?
It seems you are suffering from the low GPU usage bug.

What's your DCF?

____________

ignorance is no excuse
Joined: 4 Oct 00
Posts: 9529
Credit: 44,433,321
RAC: 0
Korea, North
Message 1262316 - Posted: 19 Jul 2012, 21:29:05 UTC - in response to Message 1262313.

I've never had a problem with leaving a CPU core open for GPU work and, as I recall, this has never been an issue anywhere.

The DCF for this machine is 3.354584. Is that good or bad? I don't know.

I can't remember which app I'm using, whether it's the HD5 or not. Whichever it is, I will try the alternate app and see if that stops the errors.
____________
In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

End terrorism by building a school

Mike (Project donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 24481
Credit: 33,791,206
RAC: 24,226
Germany
Message 1262323 - Posted: 19 Jul 2012, 21:43:24 UTC
Last modified: 19 Jul 2012, 21:43:47 UTC

That's what I expected.

Your DCF is too high; it should be around 1.
This, combined with low GPU usage, is causing these errors.

Have you set flops in app_info?

Try to change DCF in client_state.xml.
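
Before editing anything by hand, it can help to read the current value first. The sketch below assumes the DCF is stored per project in a `<duration_correction_factor>` element inside each `<project>` block of client_state.xml; the embedded sample is a hypothetical, heavily cut-down stand-in for the real file, and the helper name `read_dcf` is made up for this example. BOINC should be shut down before the real file is changed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal stand-in for client_state.xml (the real file is much larger).
SAMPLE = """<client_state>
  <project>
    <master_url>http://setiathome.berkeley.edu/</master_url>
    <duration_correction_factor>3.354584</duration_correction_factor>
  </project>
</client_state>"""

def read_dcf(xml_text):
    """Return {project URL: DCF} for every project that carries a DCF element."""
    root = ET.fromstring(xml_text)
    factors = {}
    for project in root.iter("project"):
        url = project.findtext("master_url", default="?")
        dcf = project.findtext("duration_correction_factor")
        if dcf is not None:
            factors[url] = float(dcf)
    return factors

print(read_dcf(SAMPLE))
```

For the real file, the text would be read from disk instead of from `SAMPLE`; the location of client_state.xml depends on the installation and is not given in this thread.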
____________

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3250
Credit: 31,880,865
RAC: 3,879
Netherlands
Message 1262325 - Posted: 19 Jul 2012, 21:49:08 UTC - in response to Message 1262261.
Last modified: 19 Jul 2012, 21:57:16 UTC

[ADDED]

Well, the errors have stopped. Within 2.5 days (60 hours) there was suddenly Time Limit Exceeded. I had already left half a core (1 thread) free; now I've freed 1 core.
Now none of the 8 threads is running at 100% constantly.
(Sandy Bridge is indeed quite another breed ;-)).
The CPU 'clock' doesn't have a fixed value either. This too can have an effect on
the GPU(s).

And you're running 3 instances_per_device but have 32 Compute Units
per GPU; mine (5870s) have 20 Compute Units and do 2 per GPU.
Most mobos run the PCIe slots in 8x mode when both are used; mine too.
Still fast enough if it's PCIe 2.0 or higher.
____________

Mike (Project donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 24481
Credit: 33,791,206
RAC: 24,226
Germany
Message 1262328 - Posted: 19 Jul 2012, 21:56:30 UTC

Fred, this has nothing to do with it.
My GPU only has 18 CUs and I'm running 3 instances without any problems.

Wrong estimates and low GPU usage are the only problem.
On an FX you need to free 2 cores to get full GPU utilisation.

You can't compare an i7 with an FX CPU.

____________

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3250
Credit: 31,880,865
RAC: 3,879
Netherlands
Message 1262330 - Posted: 19 Jul 2012, 22:03:09 UTC - in response to Message 1262328.
Last modified: 19 Jul 2012, 22:16:05 UTC

An i7 isn't an FX CPU, true.
But why would the estimates be incorrect 'out of the blue', since at least my host
has run for a few months with this setting?
Freeing 1 core is also advisable; I use 1 core, 2 threads.

Looks like it's over...
But the last received result, 24 hours ago, had the same error.

And APR:
SETI@home Enhanced (anonymous platform, ATI GPU)
Number of tasks completed: 11
Max tasks per day: 218
Number of tasks today: 0
Consecutive valid tasks: 36
Average processing rate: 681.28996334014
Average turnaround time: 0.21 days
____________

Alan
Joined: 16 Jun 11
Posts: 4
Credit: 844,808
RAC: 16
United States
Message 1262388 - Posted: 20 Jul 2012, 3:52:18 UTC - in response to Message 1262088.

Did you upgrade to BOINC 7.0.28?

I had problems when I upgraded: the estimated time dropped a lot and I started getting time exceeded errors. I went back to 7.0.25. The time estimates went up, and on new work I have not been having nearly as many error out.

LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1262505 - Posted: 20 Jul 2012, 12:23:45 UTC - in response to Message 1262323.

Actually, a high DCF is good in his particular case and could prevent further -197 errors, even though it messes with the cache.

The long-running task I linked will have pulled DCF up, so the estimates AND, more crucially, the 10x estimate abort limit with them are high (provided I'm getting my logic right). If further tasks run long but inside the new margin, they will keep DCF up even though the majority of tasks pull DCF back to 1.

That at least might allow error-free processing until the underlying cause of the slow-running WUs is found.
____________
I'm not the Pope. I don't speak Ex Cathedra!

Josef W. Segur (Project donor)
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4298
Credit: 1,067,168
RAC: 1,010
United States
Message 1262522 - Posted: 20 Jul 2012, 13:21:42 UTC - in response to Message 1262505.

Your logic isn't quite right; DCF is not used for the limit. The limit is strictly whatever rsc_fpops_bound the servers sent divided by whatever flops are in use for the application. The estimate is rsc_fpops_est divided by the same flops, but then multiplied by DCF.

That unfortunately means that if DCF is above 10, the estimate is longer than the limit. So it is possible for a task with what looks like a reasonable estimate and progress to get killed for maximum elapsed time exceeded.
Joe
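
The two formulas described above can be written out as a small worked example. The function names and all the numbers are hypothetical; in particular, setting rsc_fpops_bound to 10 times rsc_fpops_est is an assumption taken from the "10x cutoff" mentioned earlier in the thread, not a value read from a real workunit.

```python
def estimate_sec(rsc_fpops_est, flops, dcf):
    """Estimated runtime: rsc_fpops_est / flops, then scaled by DCF."""
    return rsc_fpops_est / flops * dcf

def limit_sec(rsc_fpops_bound, flops):
    """Abort limit: rsc_fpops_bound / flops. DCF plays no part in it."""
    return rsc_fpops_bound / flops

# Hypothetical workunit and device figures.
fpops_est = 2e13
fpops_bound = 10 * fpops_est  # assumed 10x ratio
flops = 1e11

limit = limit_sec(fpops_bound, flops)
for dcf in (1.0, 3.354584, 12.0):
    est = estimate_sec(fpops_est, flops, dcf)
    verdict = "estimate exceeds limit" if est > limit else "estimate within limit"
    print(f"DCF {dcf}: estimate {est:.0f} s, limit {limit:.0f} s -> {verdict}")
```

Under these assumptions a DCF of 3.354584 lengthens the estimate but leaves the limit untouched, and only a DCF above 10 pushes the estimate past the limit.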

BilBg
Volunteer tester
Joined: 27 May 07
Posts: 2789
Credit: 6,300,149
RAC: 7,492
Bulgaria
Message 1262529 - Posted: 20 Jul 2012, 14:04:45 UTC - in response to Message 1262522.


Do you think that these slow-running tasks may be the same issue I reported in PM to you ('Task_Hang_on_ATI', 'Addendum__Task_Hang_on_ATI') and you "copied to the r390 thread at Lunatics"?
("task 'hang' on ATI and auto-continued after a long time (but validated OK)")

Do you think that what I noted in the second letter ('Addendum__Task_Hang_on_ATI') has any merit?
"
About the pause in GPU processing during BOINC usage:

This may be some driver glitch -
it appears to me that if I start some GPU monitoring tool, it kicks the ATI driver and the GPU processing continues.
(I may be wrong, this may be just a coincidence; I wasn't watching for this behaviour specifically,
but I can say that the GPU processing continued within +- 2 minutes of the start of the GPU monitoring program.)

It is probably not the GPU monitoring as such that does the 'kick' (as SIV and TThrottle run all the time).
I suppose it has to be the start (initialization phase) of the GPU monitoring tool.

It was GPU-Z in the case of 21fe12ad.19149.9502.8.10.160.vlar
It was ATI MemoryViewer in the case of 25ap12aa.26506.476.13.10.96
"

____________



- ALF - "Find out what you don't do well ..... then don't do it!" :)

LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1262537 - Posted: 20 Jul 2012, 14:24:00 UTC - in response to Message 1262522.

Ta. I thought it would affect the limit as well - doesn't make much sense that way...

In that case you probably have to resort to Fred's rescheduler and use the expert -177 option.
____________
I'm not the Pope. I don't speak Ex Cathedra!

ignorance is no excuse
Joined: 4 Oct 00
Posts: 9529
Credit: 44,433,321
RAC: 0
Korea, North
Message 1262543 - Posted: 20 Jul 2012, 14:35:36 UTC - in response to Message 1262537.
Last modified: 20 Jul 2012, 14:36:35 UTC

To digress:
-Windows 7 Ultimate, fully updated.
-BOINC 7.0.28
-ATI 12.3 drivers installed without errors (uninstalled drivers, used Driver Sweeper, rebooted, installed 12.3)
-Allowed 1 CPU core to remain idle and changed from the HD5 to the standard GPU app, and I'm still getting errors.

Are there any other ATI 7970 users that have this problem or use a different driver?
Would a period_interation move in the app_info change anything?
____________
In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

End terrorism by building a school


Copyright © 2014 University of California