Is it hosts or work erring out?

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1266187 - Posted: 31 Jul 2012, 13:41:26 UTC
Last modified: 31 Jul 2012, 13:41:54 UTC

WUID 1037783606
WUID 1037925313
WUID 1037548421
WUID 1037925340


I could go on.....
I'm seeing more and more of these errors, including on my ATI rig.
Too many different (CUDA) versions? Optimized hosts left unattended?
ID: 1266187

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1266252 - Posted: 31 Jul 2012, 21:48:52 UTC - in response to Message 1266187.  

The two with "Maximum elapsed time exceeded" errors I score as BOINC server code runtime estimation problems; the one with "found a triplet twice" is WU related, though also a design flaw in the CUDA code. That leaves one where the GPU isn't operating stably, with the host producing more invalid than valid results.
                                                                   Joe
ID: 1266252

Mike
Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34577
Credit: 79,922,639
RAC: 80
Germany
Message 1266259 - Posted: 31 Jul 2012, 21:59:15 UTC

It's much more complicated, Joe.
I'm almost certain we will see this Maximum elapsed time exceeded error much more often now with this new GPU driver design.
Especially with weak motherboards not being able to use the CPU effectively to feed the GPU.

With each crime and every kindness we birth our future.
ID: 1266259

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1266268 - Posted: 31 Jul 2012, 22:24:41 UTC - in response to Message 1266259.  
Last modified: 31 Jul 2012, 22:26:10 UTC

On the other hand, when I run Milkyway the GPUs are pushed to the limit, and when I
also leave 1 core free they throttle to 500 MHz because of the 100 °C temperature limit.
With all the CPU cores used at 100% (SETI MB), they run close to their maximum!


Is parallelizing the analysis of the different signal types used at SETI
perhaps more difficult to 'smooth out' over the CUs of a GPU?
ID: 1266268

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1266293 - Posted: 31 Jul 2012, 23:31:28 UTC - in response to Message 1266259.  

The reason I score Maximum elapsed time exceeded errors as a BOINC problem is because that feature hasn't been updated to reflect some of the observed difficulties. It was intended to cut off processing when an application got stuck in a loop and would probably never have finished a task, but it's implemented in such a simple fashion that it often cuts off processing which is approaching completion. Dr. Anderson did set up the average underlying APR to keep track of variance as well, but the variance isn't used for anything yet. It might be possible for BOINC to increase the rsc_fpops_bound for an app which is exhibiting high variance on a host, or something similar.

Perhaps the splitters should just set rsc_fpops_bound to a higher multiple of rsc_fpops_est, something like 20 or 25 rather than the current 10 might be better overall. But that's a workaround rather than a fix.

You're definitely right that it's complex, but when a protective feature starts causing more problems than it prevents some redesign is sensible. Avoiding adding additional complexity and possibilities for problems might be difficult, though.
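
To put rough numbers on the cutoff discussed above, here is a minimal sketch in Python. It assumes the task is aborted once its elapsed time passes rsc_fpops_bound divided by the speed estimated for the app on that host; that is a simplified model of the limit, not code taken from BOINC. The function name and the example figures are invented for illustration.

```python
def elapsed_time_limit(rsc_fpops_est, flops_estimate, bound_multiple=10.0):
    """Seconds a task may run before being aborted (illustrative model only).

    rsc_fpops_est  -- splitter's estimate of floating-point operations in the task
    flops_estimate -- speed assumed for this app on this host, in FLOPS
    bound_multiple -- rsc_fpops_bound / rsc_fpops_est (10 at present, per this post)
    """
    rsc_fpops_bound = bound_multiple * rsc_fpops_est
    return rsc_fpops_bound / flops_estimate


# Hypothetical task: 3e13 operations on a host rated at 1e11 FLOPS.
est = 3.0e13
speed = 1.0e11
print(elapsed_time_limit(est, speed))        # limit with the current multiple of 10
print(elapsed_time_limit(est, speed, 25.0))  # limit if the splitters used 25
```

In this model a larger multiple only moves the cutoff further out; it does nothing about a speed estimate that is too optimistic, which is why it counts as a workaround rather than a fix.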
                                                                  Joe
ID: 1266293

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1266375 - Posted: 1 Aug 2012, 8:27:56 UTC - in response to Message 1266293.  
Last modified: 1 Aug 2012, 8:39:00 UTC

So, in fact they aren't real computation errors, but the result of a BOINC 'safety net'?

I upped the CPU base clock and GPU clock speed, with the expected zero result and
probably added instability!?
ID: 1266375

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1266386 - Posted: 1 Aug 2012, 10:13:42 UTC - in response to Message 1266375.  
Last modified: 1 Aug 2012, 10:25:57 UTC

And with an error rate greater than 60%, is it worth just going on like this?
If it's a server/splitter-side setting, would a workaround compromise the
results?

It's too HOT today anyway, but would trying some older drivers and OpenCL version 1.0 make sense? I doubt it, with a 'too tight' splitter setting?!
ID: 1266386

Mike
Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34577
Credit: 79,922,639
RAC: 80
Germany
Message 1266411 - Posted: 1 Aug 2012, 13:31:42 UTC

Older drivers could indeed help for affected hosts.
Especially on HD 5x and HD 6x cards, because of different syncing inside the drivers.
But not for HD 7x cards, of course.


With each crime and every kindness we birth our future.
ID: 1266411

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1266550 - Posted: 1 Aug 2012, 23:53:25 UTC - in response to Message 1266411.  

I did something I should/could have done before: OCing the CPU, base clock
100 MHz to 104 MHz, and OCing the GPU from 850 MHz to 896 MHz.
It appears perfectly stable: CPU @ 3540 MHz; 3560 MFLOPS per CPU; 6 used, 2 for GPUs.
GPU 5440 GFLOPS max. Temps, on air: 80 °C highest for the CPU and 82 °C for the GPUs.
If this doesn't help I'll try older Catalyst drivers like 11.2 or 11.9 and AMD SDK 1.0; I'm now using cat. 12.4 and AMD-APP-SDK 2.4, OpenCL ver. 1.2.



ID: 1266550

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1266744 - Posted: 2 Aug 2012, 16:56:20 UTC - in response to Message 1266550.  
Last modified: 2 Aug 2012, 17:38:43 UTC

There isn't anything against rev. 177 for MB work?
I did see another, newer version than rev. 331, but forgot on which host.

It did not help to raise the GPU core clock: still over 65% errored with Time Limit
Exceeded.

Now trying cat. 11.2 (was 12.4), with OpenCL support.
I hope this helps; if not, I'll quit using the GPUs because it's a waste of resources.
Maybe try to get a 79xx series card, but €€€........!?

The first difference when using cat. 11.2 was a crash, so I uninstalled 11.2 and installed cat. 11.9, which gives a much higher CPU load: 96% on average,
versus 77% with cat. 12.4!

GPU load is lower: ~73%, was ~92%. Well, I'll let it run overnight and count the
errors :-/ .

I'm running out of options.................?!
ID: 1266744

Alan
Joined: 16 Jun 11
Posts: 4
Credit: 867,828
RAC: 0
United States
Message 1266785 - Posted: 2 Aug 2012, 18:59:09 UTC

I am going on about 18 hrs without errors.
I rolled back to the Catalyst 12.1 driver, 8.930.0.0.
Also check to see if your AV sandboxes executables.
Avast! does. If yours does, try adding the exe files for SETI to the sandbox exception list.
ID: 1266785

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1266789 - Posted: 2 Aug 2012, 19:07:56 UTC - in response to Message 1266744.  
Last modified: 2 Aug 2012, 19:27:16 UTC

Reading Joe Segur's post: "It was intended to cut off processing when an application got stuck in a loop and would probably never have finished a task, but it's implemented in such a simple fashion that it often cuts off processing which is approaching completion. Dr. Anderson did set up the average underlying APR to keep track of variance as well, but the variance isn't used for anything yet. It might be possible for BOINC to increase the rsc_fpops_bound for an app which is exhibiting high variance on a host, or something similar."
[snipped]...........

I was watching the WUs being done by the ATI 5870 GPU when one got aborted at 57:27 min. with the message "Maximum elapsed time exceeded", and the WU got its "error" status, while in fact it's a SERVER-side quick fix which is going from bad to disastrous :-/.

I don't know when this implementation was put in place; as far as I can remember,
since the beginning of June 2012, maybe earlier. At least my (INTEL/AMD-ATI) rig started
making these
Exit status	197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED
'errors'.

<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
<stderr_txt>
Number of period iterations for PulseFind setted to:20
Number of app instances per device setted to:2
Running on device number: 0
Priority of worker thread raised successfully
Priority of process adjusted successfully, high priority class used
OpenCL platform detected: Advanced Micro Devices, Inc.
BOINC assigns 0 device, slots 0 to 1 (including) will be checked
Used slot is 0;	OpenCL-kernels filename : MultiBeam_Kernels_r390.cl 
Info : Building Program (clBuildProgram):main kernels: OK code 0

Windows optimized S@H Enhanced application by Alex Kan
Version info: SSE3x (AMD/Intel, Core 2-optimized v8-nographics) V5.13 by Alex Kan
SSE3x Win32 Build 390 , Ported by : Raistmer, JDWhale


SETI7 update by Raistmer

OpenCL version by Raistmer, r390


Build features: SETI7	Non-graphics	OpenCL	IPP	AMD specific	USE_SSE3	x86	
     CPUID:         Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz 

     Cache: L1=64K L2=256K

CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 
CPU type 0x46 
Number of OpenCL platforms:				 1


 OpenCL Platform Name:					 AMD Accelerated Parallel Processing
Number of devices:				 2
  Max compute units:				 20
  Max work group size:				 256
  Max clock frequency:				 898Mhz
  Max memory allocation:			 536870912
  Cache type:					 None
  Cache line size:				 0
  Cache size:					 0
  Global memory size:				 1073741824
  Constant buffer size:				 65536
  Max number of constant args:			 8
  Local memory type:				 Scratchpad
  Local memory size:				 32768
  Queue properties:				 
    Out-of-Order:				 No
  Name:						 Cypress
  Vendor:					 Advanced Micro Devices, Inc.
  Driver version:				 CAL 1.4.1720 (VM)
  Version:					 OpenCL 1.2 AMD-APP (938.1)
  Extensions:					 cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing 
  Max compute units:				 20
  Max work group size:				 256
  Max clock frequency:				 898Mhz
  Max memory allocation:			 536870912
  Cache type:					 None
  Cache line size:				 0
  Cache size:					 0
  Global memory size:				 1073741824
  Constant buffer size:				 65536
  Max number of constant args:			 8
  Local memory type:				 Scratchpad
  Local memory size:				 32768
  Queue properties:				 
    Out-of-Order:				 No
  Name:						 Cypress
  Vendor:					 Advanced Micro Devices, Inc.
  Driver version:				 CAL 1.4.1720 (VM)
  Version:					 OpenCL 1.2 AMD-APP (938.1)
  Extensions:					 cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing 


Work Unit Info:
...............
Credit multiplier is :  2.85
WU true angle range is :  0.013126
Pulse: peak=3.387251, time=53.74, period=7.17, d_freq=1420967520.48, score=1.044, chirp=-0.19964, fft_len=1024 
Triplet: peak=10.70621, time=44.83, period=14.48, d_freq=1420967589.97, chirp=-6.1352, fft_len=256 
Pulse: peak=1.727817, time=53.69, period=3.116, d_freq=1420969401.12, score=1.006, chirp=10.137, fft_len=128 
Pulse: peak=5.58868, time=53.74, period=14.94, d_freq=1420968058.23, score=1.004, chirp=21.874, fft_len=1024 
Triplet: peak=9.934995, time=2.936, period=2.7, d_freq=1420975331.94, chirp=-29.343, fft_len=512 
Pulse: peak=4.046307, time=53.74, period=8.965, d_freq=1420974465.85, score=1.085, chirp=43.682, fft_len=1024 
Triplet: peak=10.82791, time=55.58, period=35.64, d_freq=1420968425.85, chirp=-44.816, fft_len=128 
Pulse: peak=9.219539, time=54.11, period=31.04, d_freq=1420975424.16, score=1.004, chirp=47.283, fft_len=8k
Spike: peak=24.22475, time=25.17, d_freq=1420972100.06, chirp=-59.537, fft_len=32k
Spike: peak=24.50759, time=25.17, d_freq=1420972100.06, chirp=-59.596, fft_len=32k
Triplet: peak=10.60119, time=51.98, period=49.02, d_freq=1420969412.08, chirp=78.561, fft_len=512 
Pulse: peak=2.327311, time=53.71, period=4.551, d_freq=1420975124, score=1.019, chirp=-90.699, fft_len=512 
Pulse: peak=0.949061, time=53.7, period=1.313, d_freq=1420970923.87, score=1.027, chirp=-95.767, fft_len=256 

</stderr_txt>
]]> 


This unit has found some signals (7 pulses, 4 triplets and 2 spikes) and, if I'm
not mistaken, would [probably] have finished and validated.
Or only the strongest are counted?
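
Counting the signals by eye in a long stderr dump is easy to get wrong, so here is a small Python sketch that tallies them per type. It assumes each reported signal starts a line with its type followed by a colon, as in the output above; the "Gaussian" prefix is an assumption, since no Gaussian shows up in this particular log. The sample text is just a few lines copied from the output, not the whole thing.

```python
import re
from collections import Counter


def count_signals(stderr_text):
    """Count reported signals by type, based on the line prefix."""
    pattern = re.compile(r"^(Pulse|Triplet|Spike|Gaussian):", re.MULTILINE)
    return Counter(pattern.findall(stderr_text))


# A few lines copied from the stderr output above.
sample = """\
Pulse: peak=3.387251, time=53.74, period=7.17, d_freq=1420967520.48, score=1.044, chirp=-0.19964, fft_len=1024
Triplet: peak=10.70621, time=44.83, period=14.48, d_freq=1420967589.97, chirp=-6.1352, fft_len=256
Spike: peak=24.22475, time=25.17, d_freq=1420972100.06, chirp=-59.537, fft_len=32k
Spike: peak=24.50759, time=25.17, d_freq=1420972100.06, chirp=-59.596, fft_len=32k
"""

counts = count_signals(sample)
for kind in ("Pulse", "Triplet", "Spike", "Gaussian"):
    print(kind, counts[kind])
```

Pasting the full stderr text into the sample string gives the per-type totals for the whole task.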

Well, I'll stop the GPUs anyway. After installing cat. 11.9, CPU time went from
76% to 98% over the course of an MB WU, with 4 cores to 'feed' the GPUs.

GPU load has dropped from an average of ~96% to 76%, so I'm not expecting
improvements; it'll get worse and worse :-/ Throughput (or RAC) has dropped,
with work being cut off because of some invalid way of handling these WUs. I can't even remember
when it started, and maybe I never even saw this error message before.

OFF TOPIC
Why does the AstroPulse GPU app do so well compared to the ATI version for MB?
Is it the 4 signal types for MB versus 1 with AstroPulse? Or am I terribly
mistaken? AstroPulse detects pulses and repetitive pulses.
Isn't this better suited for parallelizing than 4 different signal types?

Back On Topic.
(Trying to read half a kilo of OpenCL ;-0 )
ID: 1266789

skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1266794 - Posted: 2 Aug 2012, 19:24:21 UTC - in response to Message 1266789.  

You appear to be having a similar problem to the one I had. Mike had me leave 2 CPU cores open to run the GPU. Surprise, surprise: my ATI worked fine after that.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1266794

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1266806 - Posted: 2 Aug 2012, 19:48:06 UTC - in response to Message 1266794.  

I now use 4 cores; it doesn't get better compared to 2.
Going back from cat. 12.4 to cat. 11.9 looks disastrous: CPU times UP,
GPU load down!

Will see tomorrow, because the errors all come from the GPU!

ID: 1266806

.clair.
Joined: 4 Nov 04
Posts: 1300
Credit: 55,390,408
RAC: 69
United Kingdom
Message 1266809 - Posted: 2 Aug 2012, 19:53:36 UTC
Last modified: 2 Aug 2012, 20:02:58 UTC

Fred,
Try something for me.
This is the command line from my app_info.xml;
swap it for the line in your app_info.xml just to see what happens.

<cmdline>-period_iterations_num 20 -instances_per_device 2 -hp -no_cpu_lock</cmdline>

Give it a day to see what happens,
unless it wrecks the job;
if it does, then put your own back in.
I am running two 7970s on a single P4 and don't get errors.
Give it a go, you haven't got much to lose . . .

edit - I use 12.4 on the 7970 because it uses a lot less CPU and more GPU.
---- - Use whatever -period_iterations_num value you want to.
ID: 1266809

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1266827 - Posted: 2 Aug 2012, 20:50:29 UTC - in response to Message 1266809.  
Last modified: 2 Aug 2012, 21:14:36 UTC

Except for cpu_lock, I'm using the same settings. I tried other drivers, cat. 11.2 and 11.9, which gave more CPU time and less GPU load, so I changed back to cat. 12.4, but didn't install the whole AMD-APP-SDK 2.4 which I used the last time.

I also left 2 of the 8 cores (i7-2600) free for the GPUs, and it runs slightly faster
at a 104 MHz base clock. (The multiplier is locked at 34 x 100 MHz.)
Raising the GPUs' core clock doesn't help. I did some benchmark tests with
SiSoft Sandra: an 8% overall gain, CPU/GPU/DRAM (DDR3-1634 MHz), OpenCL, etc.

I still find it odd, because it worked perfectly before June 2012!
And the errors only come from MB work; AstroPulse has by far the biggest
speed-up!

A server-side setting was put in place which handles the "runtime" and appears
to have some negative effect on some GPUs/drivers?!

But to answer your question: I already have no_cpu_lock put in app_info.xml,
hope it helps.
With the cat. 12.4 drivers, CPU use has gone down and GPU load has gone up
from 74% to 94% on average over 1 MB VLAR WU. (2 cores are free and feed the GPUs.)
ID: 1266827

Mike
Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34577
Credit: 79,922,639
RAC: 80
Germany
Message 1266838 - Posted: 2 Aug 2012, 21:17:20 UTC

Fred.

Freeing cores doesn't give any benefit without using the no_cpu_lock param.

With each crime and every kindness we birth our future.
ID: 1266838

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1266906 - Posted: 3 Aug 2012, 3:49:27 UTC

The "197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED" error isn't really new; it's just more specific than the previous -177 error, where you had to look down in the stderr text to see which limit had been exceeded.

The project has had the limit set at 10x the raw estimate probably since the beginning of seti_boinc, definitely it was so when I transitioned from Classic in June 2005. But in those days hosts typically ran at fractional DCF so the ratio between the indicated estimate and when BOINC would kill a task for taking too long was more like 40 or 50. It was when CreditNew and the associated per Application runtime estimation started being used that the ratio of 10 really started being used. Combined with some instability in the server calculations there have been many hosts which at one time or another have been afflicted with those errors, that's why Fred's Rescheduler has the option to increase the limit.

But if a host suddenly starts taking over 10x its former run time on similar tasks, figuring out the cause of that change is obviously important. Using the rescheduler to avoid the errors would be reasonable while investigating, that way the host continues being productive even though less so.
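
The arithmetic behind the "40 or 50" figure can be sketched in a few lines of Python. The model is an assumption based on the description above: the estimate the client shows is the raw estimate scaled by the host's DCF, while the abort limit stays at 10x the raw estimate. The function name is made up for illustration.

```python
def effective_limit_ratio(bound_multiple, dcf):
    """Ratio of the abort limit to the run-time estimate shown by the client.

    Model: shown estimate = raw estimate * DCF
           abort limit    = bound_multiple * raw estimate
    """
    return bound_multiple / dcf


# Fractional DCF values as in the old days, then DCF = 1 (no scaling).
for dcf in (0.2, 0.25, 1.0):
    print(dcf, effective_limit_ratio(10, dcf))
```

With a DCF of 0.2 to 0.25 the limit sits at 50 to 40 times the indicated estimate; once the DCF no longer scales the estimate down, only the bare factor of 10 is left.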
                                                                  Joe
ID: 1266906

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1266988 - Posted: 3 Aug 2012, 10:03:50 UTC - in response to Message 1266906.  
Last modified: 3 Aug 2012, 10:06:40 UTC

Fred's ATI/AMD host has an extremely high APR for ATI/AMD MultiBeam of 644.29. Normally this is around half of the ATI/AMD Astropulse APR value, which in this case is 635.47.
(My GTX460 only has an MB APR of 321.54 and is a lot faster at completing MB than Fred's HD5800s, while AP for the GTX460 is 672.17.)

I suggest that Fred tries running one task at a time (so they at least complete instead of erroring), and see if he can drive that APR down.

Claggy
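
Why an inflated APR matters can be sketched roughly as follows in Python. This assumes the time limit is derived from the APR (treated here as GFLOPS) with the usual factor of 10, and that running N tasks at once on a GPU splits its real throughput about evenly between them. Both are simplifying assumptions, and the "real" GPU speed below is invented for illustration; only the 644.29 comes from the host's application details.

```python
def exceeds_limit(apr_gflops, real_gflops_per_task, bound_multiple=10.0):
    """True if a task would be aborted for running too long (illustrative model).

    Run time = fpops_est / real speed
    Limit    = bound_multiple * fpops_est / APR
    so the task is aborted when real speed < APR / bound_multiple.
    """
    return real_gflops_per_task < apr_gflops / bound_multiple


apr = 644.29
real_gpu_gflops = 100.0  # hypothetical sustained speed of the whole GPU
for instances in (1, 2):
    per_task = real_gpu_gflops / instances
    print(instances, exceeds_limit(apr, per_task))
```

With these made-up numbers a single task stays under the limit while two at once do not, which is the reasoning behind trying one task at a time until the APR settles at a more realistic value.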
ID: 1266988

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1267003 - Posted: 3 Aug 2012, 11:27:51 UTC - in response to Message 1266988.  
Last modified: 3 Aug 2012, 11:49:30 UTC

Thanks all for your advice; stupid of me to forget the no_cpu_lock
with 2 free cores. I changed that yesterday evening, and the last errors are from
yesterday evening, 2 August 2012:

SETI@home Enhanced (anonymous platform, CPU)
Number of tasks completed	7906
Max tasks per day	8608
Number of tasks today	125
Consecutive valid tasks	7783
Average processing rate	31.941425242509
Average turnaround time	2.55 days

SETI@home Enhanced (anonymous platform, ATI GPU)
Number of tasks completed	15
Max tasks per day	190
Number of tasks today	147
Consecutive valid tasks	9
Average processing rate	644.29097472245
Average turnaround time	0.55 days


Valid (292) · Invalid (0) · Error (147)

20 Jul 2012 | 15:54:17 UTC	2 Aug 2012 | 20:41:47 UTC	Error while computing	3,373.15	2,634.14	---	SETI@home Enhanced Anonymous platform (ATI GPU)
2532590167	1031871149	20 Jul 2012 | 15:54:17 UTC	2 Aug 2012 | 20:41:47 UTC	Error while computing	30.19	8.36	---	SETI@home Enhanced Anonymous platform (ATI GPU)
2532590125	1032666587	20 Jul 2012 | 15:54:17 UTC	2 Aug 2012 | 20:41:47 UTC	Error while computing	3,378.73	3,009.09	---	SETI@home Enhanced Anonymous platform (ATI GPU)
2532590119	1032666580	20 Jul 2012 | 15:54:17 UTC	2 Aug 2012 | 20:41:47 UTC	Error while computing	3,379.29	2,728.10	---	SETI@home Enhanced Anonymous platform (ATI GPU)
2532565192	1032654775	20 Jul 2012 | 15:25:46 UTC	2 Aug 2012 | 12:18:12 UTC	Error while computing	23.52	8.58	---	SETI@home Enhanced Anonymous platform (ATI GPU)


These are the last errors; there may be some more, as there are still a lot pending.
I'll try -instances_per_device 1 and leave all other settings as they are now.
ID: 1267003