To Many ERRORS

skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1262088 - Posted: 19 Jul 2012, 3:18:38 UTC

http://setiathome.berkeley.edu/result.php?resultid=2530015233
One example of many.
I've uninstalled the ATI 12.6 drivers and reinstalled the standard drivers that came with the card. I'm not sure why the card isn't getting work loaded properly.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1262202 - Posted: 19 Jul 2012, 13:57:13 UTC

98 views and not one suggestion or idea about what is happening here? I could use the help.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1262203 - Posted: 19 Jul 2012, 14:01:34 UTC - in response to Message 1262202.  

None of the developers seem to be online at the moment. We're thinking about it.

LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1262218 - Posted: 19 Jul 2012, 15:21:51 UTC

Personally I hadn't commented because it's the ATI GPU app with which I've no experience, so I usually leave that to Mike and Raistmer to sort out.

Superficially, getting tasks stuck and aborted by BOINC points to problems with the app or the host, e.g. a bad driver.

What's puzzling is that you have strings of 'good' tasks interspersed with long-running ones like http://setiathome.berkeley.edu/result.php?resultid=2530015217. It's a VHAR, so it should have taken some 200 sec like the other ones on the host. It managed to complete just inside the 10x cutoff. The others may actually be processing, but too slowly, and get aborted (as opposed to being stuck).

The card may be intermittently downclocking for some reason. Any chance you can monitor that host to see if tasks are actually progressing, and check the system for anomalies once a task goes past normal runtimes?

At this point it might be anything: app, driver, Windows updates, BOINC version, wacky tasks...
I'm not the Pope. I don't speak Ex Cathedra!

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1262240 - Posted: 19 Jul 2012, 16:37:40 UTC
Last modified: 19 Jul 2012, 16:40:33 UTC

From your tasks, it is still occurring with the 12.2 or 12.3 drivers that you downgraded to.

If you are running more than 1 task at a time, I would guess that could be the issue. If you were running 3 at a time and 2 VLARs hit at the same time as a VHAR, the slowdown caused by the VLARs could make the VHAR run too long.

If that is occurring, then assigning each task a value for its load would be a good idea. Normal tasks would have a load of 1.0, while VLARs might be rated higher, such as 1.1-1.5. A maximum load value could then be assigned per processing device.
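The load-rating idea can be sketched roughly as follows. This is only an illustration of the proposal, not anything BOINC actually implements: the `Task` class, the `can_start` helper, the task names and the per-device budget of 3.2 are all hypothetical; only the 1.0 and 1.1-1.5 ratings come from the suggestion above.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str    # hypothetical label, for illustration only
    load: float  # 1.0 for a normal task; a VLAR might be rated 1.1-1.5


def can_start(running, candidate, max_load):
    """Return True if starting `candidate` keeps the summed load within max_load."""
    return sum(t.load for t in running) + candidate.load <= max_load


MAX_LOAD = 3.2  # hypothetical per-device load budget

# Two VLARs already on the GPU: a third task would push the load past the budget.
two_vlars = [Task("vlar_a", 1.5), Task("vlar_b", 1.5)]
print(can_start(two_vlars, Task("vhar", 1.0), MAX_LOAD))

# Two normal tasks on the GPU: a third task still fits.
two_normal = [Task("normal_a", 1.0), Task("normal_b", 1.0)]
print(can_start(two_normal, Task("vhar", 1.0), MAX_LOAD))
```

With a budget like this, 3 normal tasks could run together, but a VHAR would wait rather than share the GPU with 2 VLARs.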
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1262246 - Posted: 19 Jul 2012, 17:03:26 UTC - in response to Message 1262240.  
Last modified: 19 Jul 2012, 17:08:31 UTC

For the past 2 days I've been getting errors on my ATI 5870 GPUs,
only on MB work: exit status 197 (0xc5), EXIT_TIME_LIMIT_EXCEEDED.
I freed up 1 core of the i7-2600 (2 threads, since I'm using HT).
GPU load is higher in the first 15 seconds and I've seen the error rate going down,
but I'm still unsure what the exact reason is.

Vendor: Advanced Micro Devices, Inc.
Driver version: CAL 1.4.1720 (VM)
Version: OpenCL 1.2 AMD-APP (923.1)

Not on AstroPulse work.

Nothing was changed before these errors started occurring?!

skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1262251 - Posted: 19 Jul 2012, 17:30:08 UTC - in response to Message 1262218.  

I did notice that the CPU load time is very short compared to work that actually completes. This makes me curious as to whether the WU is not loading properly or completely and it just sits there for 30 minutes and times out.



In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1262261 - Posted: 19 Jul 2012, 18:13:30 UTC - in response to Message 1262251.  
Last modified: 19 Jul 2012, 18:17:43 UTC

Since it's only MB (6.10), and the rev. 390 app for ATI, BOINC 7.0.28 (64-bit) and Win7 (64-bit) have all been running for some time, that leaves only (?) wacky tasks and/or Win7 updates?!

It's the first time I've seen this kind of error, and not only on VHAR but also on VLAR.

Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1262313 - Posted: 19 Jul 2012, 21:23:10 UTC

Have you set a core free, skildude?
It seems you are suffering from the low GPU usage bug.

What's your DCF?



With each crime and every kindness we birth our future.

skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1262316 - Posted: 19 Jul 2012, 21:29:05 UTC - in response to Message 1262313.  

I've never had a problem with leaving a CPU core open for GPU work, and as I recall this has never been an issue anywhere.

The DCF for this machine is 3.354584. Is that good or bad? I don't know.

I can't remember which app I'm using, whether it's the HD5 or not. Whichever it is, I will try the alternate app and see if that stops the errors.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1262323 - Posted: 19 Jul 2012, 21:43:24 UTC
Last modified: 19 Jul 2012, 21:43:47 UTC

That's what I expected.

Your DCF is too high; it should be around 1.
This, combined with low GPU usage, is causing these errors.

Have you set flops in app_info?

Try changing the DCF in client_state.xml.
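Before editing anything, it can help to see which value is currently stored. The sketch below reads the DCF per project from a client_state.xml-style document. It assumes the value sits in a `<duration_correction_factor>` element inside each `<project>` block, so check that against your own file. The `SAMPLE` text is a hypothetical, heavily trimmed stand-in for a real file, and the 1.5 threshold for flagging a "high" DCF is an arbitrary choice for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for client_state.xml;
# a real file contains far more elements.
SAMPLE = """<client_state>
  <project>
    <master_url>http://setiathome.berkeley.edu/</master_url>
    <duration_correction_factor>3.354584</duration_correction_factor>
  </project>
</client_state>"""

HIGH_DCF = 1.5  # arbitrary illustration threshold, not a BOINC constant


def read_dcf(xml_text):
    """Map each project's master_url to its duration correction factor."""
    root = ET.fromstring(xml_text)
    result = {}
    for project in root.iter("project"):
        url = project.findtext("master_url", default="(unknown)")
        dcf = project.findtext("duration_correction_factor")
        if dcf is not None:
            result[url] = float(dcf)
    return result


for url, dcf in read_dcf(SAMPLE).items():
    flag = "high" if dcf > HIGH_DCF else "ok"
    print(f"{url}: DCF {dcf:.6f} ({flag})")
```

To inspect a real file, pass its contents to `read_dcf` instead of `SAMPLE`. If you do edit the file by hand, it is probably safest to shut BOINC down first, since the client rewrites it while running.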


With each crime and every kindness we birth our future.

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1262325 - Posted: 19 Jul 2012, 21:49:08 UTC - in response to Message 1262261.  
Last modified: 19 Jul 2012, 21:57:16 UTC

[ADDED]

Well, the errors have stopped. Over 2.5 days (60 hours) I suddenly got Time Limit Exceeded errors. I had already left half a core (1 thread) free; now I've freed 1 full core.
Now none of the 8 threads is running at 100% constantly.
(Sandy Bridge is indeed quite another breed ;-)).
The CPU 'clock' doesn't have a fixed value either. This too can have an effect on
the GPU(s).

And you're running 3 instances_per_device, but you have 32 Compute Units
per GPU; mine (5870s) have 20 Compute Units and do 2 per GPU.
Most mobos run their PCIe slots in 8x mode when both are used; mine does too.
That's still fast enough if it's PCIe 2.0 or higher.

Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1262328 - Posted: 19 Jul 2012, 21:56:30 UTC

Fred, this has nothing to do with it.
My GPU only has 18 CUs and I'm running 3 instances without any problems.

Wrong estimates and low GPU usage are the only problem.
On an FX you need to free 2 cores to get full GPU utilisation.

You can't compare an i7 with an FX CPU.



With each crime and every kindness we birth our future.

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1262330 - Posted: 19 Jul 2012, 22:03:09 UTC - in response to Message 1262328.  
Last modified: 19 Jul 2012, 22:16:05 UTC

An i7 isn't an FX CPU, true.
But if the estimates are incorrect, why 'out of the blue'? My host, at least,
has run for a few months with this setting.
Freeing 1 core is also advisable; I use 1 core (2 threads).

Looks like it's over...
But the last received result, 24 hours ago, had the same error.

And the APR:
SETI@home Enhanced (anonymous platform, ATI GPU)
Number of tasks completed: 11
Max tasks per day: 218
Number of tasks today: 0
Consecutive valid tasks: 36
Average processing rate: 681.28996334014
Average turnaround time: 0.21 days

Alan
Joined: 16 Jun 11
Posts: 4
Credit: 867,828
RAC: 0
United States
Message 1262388 - Posted: 20 Jul 2012, 3:52:18 UTC - in response to Message 1262088.  

Did you upgrade to BOINC 7.0.28?

I had problems when I upgraded: the estimated time dropped a lot and I started getting time exceeded errors. I went back to 7.0.25. The time estimates went up, and on new work I have not been having nearly as many tasks error out.

LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1262505 - Posted: 20 Jul 2012, 12:23:45 UTC - in response to Message 1262323.  

Actually, a high DCF is good in his particular case and could prevent further -197 errors, even though it messes with the cache.

The long-running task I linked will have pulled DCF up, so the estimates AND, more crucially, the 10x-estimate abort limit are high (provided I'm getting my logic right). If further tasks run long but inside the new margin, they will keep DCF up even though the majority of tasks pull DCF back towards 1.

That at least might allow error-free processing until the underlying cause of the slow-running WUs is found.
I'm not the Pope. I don't speak Ex Cathedra!

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1262522 - Posted: 20 Jul 2012, 13:21:42 UTC - in response to Message 1262505.  

Your logic isn't quite right; DCF is not used for the limit. It's strictly whatever rsc_fpops_bound the servers sent divided by whatever flops are in use for the application. The estimate is rsc_fpops_est divided by the same flops, but then multiplied by DCF.

That unfortunately means that if DCF is above 10, the estimate is longer than the limit. So it is possible to have what looks like a reasonable estimate and progress get killed for maximum elapsed time exceeded.
                                                                  Joe
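The two formulas described above can be worked through in a few lines. This is a sketch with made-up numbers: the `FLOPS` and `FPOPS_EST` values are hypothetical, and setting `FPOPS_BOUND` to 10 times the estimate is an assumption taken from the "10x cutoff" mentioned earlier in the thread. The function names are illustrative only, not BOINC's own.

```python
def time_limit_seconds(rsc_fpops_bound, flops):
    """Elapsed-time limit: server-sent bound divided by the app's flops (no DCF)."""
    return rsc_fpops_bound / flops


def estimate_seconds(rsc_fpops_est, flops, dcf):
    """Runtime estimate: server-sent estimate divided by flops, then scaled by DCF."""
    return rsc_fpops_est / flops * dcf


# Hypothetical values, chosen only so the arithmetic is easy to follow.
FLOPS = 1.0e11
FPOPS_EST = 2.0e13
FPOPS_BOUND = 10 * FPOPS_EST  # assumption: bound is 10x the estimate

limit = time_limit_seconds(FPOPS_BOUND, FLOPS)
for dcf in (1.0, 3.354584, 12.0):
    est = estimate_seconds(FPOPS_EST, FLOPS, dcf)
    print(f"DCF {dcf}: estimate {est:.1f} s, limit {limit:.1f} s, "
          f"estimate exceeds limit: {est > limit}")
```

The limit stays the same whatever the DCF is, so raising DCF only inflates the estimate. Once DCF passes 10 (under the 10x assumption), the displayed estimate is longer than the time at which the task would be killed.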

BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1262529 - Posted: 20 Jul 2012, 14:04:45 UTC - in response to Message 1262522.  


Do you think that these slow-running tasks may be the same issue I reported in a PM to you ('Task_Hang_on_ATI', 'Addendum__Task_Hang_on_ATI') and that you "copied to the r390 thread at Lunatics"?
("task 'hang' on ATI and auto-continued after a long time (but validated OK)")

Do you think that what I noted in the second letter ('Addendum__Task_Hang_on_ATI') has any merit?
"
About the pause in GPU processing during BOINC usage:

This may be some driver glitch -
it appears to me that if I start some GPU monitoring tool, it kicks the ATI driver and the GPU processing continues.
(I may be wrong, this may be just a coincidence; I wasn't watching for this behaviour specifically,
but I can say that the GPU processing continued within +- 2 minutes of the start of the GPU monitoring program.)

It is probably not the GPU monitoring as such that does the 'kick' (as SIV and TThrottle run all the time).
I suppose it has to be the start (initialization phase) of the GPU monitoring tool.

It was GPU-Z in the case of 21fe12ad.19149.9502.8.10.160.vlar
It was ATI MemoryViewer in the case of 25ap12aa.26506.476.13.10.96
"

 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 

LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1262537 - Posted: 20 Jul 2012, 14:24:00 UTC - in response to Message 1262522.  

Ta. I thought it would affect the limit as well - doesn't make much sense that way...

In that case you'll probably have to resort to Fred's rescheduler and use the expert -177 option.
I'm not the Pope. I don't speak Ex Cathedra!

skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1262543 - Posted: 20 Jul 2012, 14:35:36 UTC - in response to Message 1262537.  
Last modified: 20 Jul 2012, 14:36:35 UTC

To digress:
-Windows 7 Ultimate, fully updated.
-BOINC 7.0.28
-ATI 12.3 drivers installed without errors (uninstalled drivers, used Driver sweep, rebooted, installed 12.3)
-Allowed 1 CPU core to remain idle and changed from the HD5 to the standard GPU app; still getting errors.

Are there any other ATI 7970 users who have this problem or use a different driver?
Would a period_interation move in the app_info change anything?


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.