Is it hosts or work erring out?
Author | Message |
---|---|
Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
WUID 1037783606. WUID 1037925313. WUID 1037548421. WUID 1037925340. I could go on... I'm seeing more and more of these errors, including on my ATI rig. Too many different (CUDA) versions? Optimized hosts left unattended? |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
The two with "Maximum elapsed time exceeded" errors I score as BOINC server code runtime estimation problems. The one with "found a triplet twice" is WU related, though it also reflects a design flaw in the CUDA code. That leaves one where the GPU isn't operating stably, with the host producing more invalid than valid results. Joe |
Joined: 17 Feb 01 Posts: 34577 Credit: 79,922,639 RAC: 80 |
It's much more complicated, Joe. I'm almost certain we will see this "Maximum elapsed time exceeded" error much more often now with this new GPU driver design, especially with weak motherboards that can't use the CPU effectively to feed the GPU. With each crime and every kindness we birth our future. |
Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
"It's much more complicated, Joe." Is parallelizing the analysis of the different signal types used at SETI perhaps more difficult to 'smooth out' over the CUs of a GPU? |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
"It's much more complicated, Joe." The reason I score Maximum elapsed time exceeded errors as a BOINC problem is that the feature hasn't been updated to reflect some of the observed difficulties. It was intended to cut off processing when an application got stuck in a loop and would probably never have finished a task, but it's implemented in such a simple fashion that it often cuts off processing which is approaching completion. Dr. Anderson did set up the average underlying APR to keep track of variance as well, but the variance isn't used for anything yet. It might be possible for BOINC to increase the rsc_fpops_bound for an app which is exhibiting high variance on a host, or something similar. Perhaps the splitters should just set rsc_fpops_bound to a higher multiple of rsc_fpops_est; something like 20 or 25 rather than the current 10 might be better overall. But that's a workaround rather than a fix. You're definitely right that it's complex, but when a protective feature starts causing more problems than it prevents, some redesign is sensible. Avoiding additional complexity and new possibilities for problems might be difficult, though. Joe |
Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
"It's much more complicated, Joe." So in fact they aren't real computation errors, but the result of a BOINC 'safety net'? I upped the CPU base clock and GPU clock speed, with the expected zero result and probably added instability!? |
Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
"It's much more complicated, Joe." And with an error rate greater than 60%, is it worth just going on like this? If it's a server/splitter-side setting, would a workaround compromise results? It's too HOT today anyway, but would trying some older drivers and OpenCL version 1.0 make sense? I doubt it, with a 'too tight' splitter setting. |
Joined: 17 Feb 01 Posts: 34577 Credit: 79,922,639 RAC: 80 |
Older drivers could indeed help for affected hosts, especially on HD 5x and HD 6x cards, because of different syncing inside the drivers. But not for HD 7x cards, of course. With each crime and every kindness we birth our future. |
Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
"Older drivers could indeed help for affected hosts." I did something I should/could have done before: overclocking the CPU, base clock from 100MHz to 104MHz, and the GPU from 850MHz to 896MHz. It appears perfectly stable: CPU @ 3540MHz; 3560 MFLOPS per CPU; 6 used, 2 for GPUs. GPU 5440 GFLOPS max. Temps, on air: 80C highest for the CPU and 82C for the GPUs. If this doesn't help I'll try older Catalyst drivers like 11.2 or 11.9 and AMD SDK 1.0; I'm now using cat 12.4 and AMD-APP-SDK 2.4, OpenCL ver. 1.2. |
Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
"Older drivers could indeed help for affected hosts." Is there anything against using rev.177 for MB work? I did see another, newer version than rev.331, but forgot on which host. Raising the GPU core clock did not help; still over 65% errored with Time Limit Exceeded. Now trying cat.11.2 (was 12.4) with OpenCL support. I hope this helps; if not, I'll quit using the GPUs because it's a waste of resources. Maybe try to get a 79xx series card, but €€€...!? The first difference when using cat.11.2 was a crash, so I uninstalled 11.2 and installed cat.11.9, which gives a much higher CPU load: 96% on average, versus 77% with cat.12.4! GPU load is lower, ~73%, where it was ~92%. Well, I'll let it run overnight and count the errors :-/ . I'm running out of options...?! |
Alan Joined: 16 Jun 11 Posts: 4 Credit: 867,828 RAC: 0 |
I am going on about 18 hrs without errors. I rolled back to Catalyst 12.1, driver 8.930.0.0. Also check whether your AV sandboxes executables. Avast! does. If it does, try adding the exe files for SETI to the sandbox exception list. |
Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
"Older drivers could indeed help for affected hosts." [snipped] Reading a post of Joe Segur: "It was intended to cut off processing when an application got stuck in a loop and would probably never have finished a task, but it's implemented in such a simple fashion that it often cuts off processing which is approaching completion. Dr. Anderson did set up the average underlying APR to keep track of variance as well, but the variance isn't used for anything yet. It might be possible for BOINC to increase the rsc_fpops_bound for an app which is exhibiting high variance on a host, or something similar." [snipped] I was watching the WUs done by the ATI 5870 GPU when one got aborted with the message "Maximum elapsed time exceeded" at 57:27 min. and the WU got its "error" status, while in fact it's a server-side quick fix that is turning things from bad to disastrous :-/. I don't know when this implementation was put in place. As far as I can remember, it was at the beginning of June 2012, maybe earlier, that at least my Intel/AMD-ATI rig started making these Exit status 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED 'errors'. <core_client_version>7.0.28</core_client_version> <![CDATA[ <message> Maximum elapsed time exceeded </message> <stderr_txt> Number of period iterations for PulseFind setted to:20 Number of app instances per device setted to:2 Running on device number: 0 Priority of worker thread raised successfully Priority of process adjusted successfully, high priority class used OpenCL platform detected: Advanced Micro Devices, Inc. 
BOINC assigns 0 device, slots 0 to 1 (including) will be checked Used slot is 0; OpenCL-kernels filename : MultiBeam_Kernels_r390.cl Info : Building Program (clBuildProgram):main kernels: OK code 0 Windows optimized S@H Enhanced application by Alex Kan Version info: SSE3x (AMD/Intel, Core 2-optimized v8-nographics) V5.13 by Alex Kan SSE3x Win32 Build 390 , Ported by : Raistmer, JDWhale SETI7 update by Raistmer OpenCL version by Raistmer, r390 Build features: SETI7 Non-graphics OpenCL IPP AMD specific USE_SSE3 x86 CPUID: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz Cache: L1=64K L2=256K CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 CPU type 0x46 Number of OpenCL platforms: 1 OpenCL Platform Name: AMD Accelerated Parallel Processing Number of devices: 2 Max compute units: 20 Max work group size: 256 Max clock frequency: 898Mhz Max memory allocation: 536870912 Cache type: None Cache line size: 0 Cache size: 0 Global memory size: 1073741824 Constant buffer size: 65536 Max number of constant args: 8 Local memory type: Scratchpad Local memory size: 32768 Queue properties: Out-of-Order: No Name: Cypress Vendor: Advanced Micro Devices, Inc. 
Driver version: CAL 1.4.1720 (VM) Version: OpenCL 1.2 AMD-APP (938.1) Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing Max compute units: 20 Max work group size: 256 Max clock frequency: 898Mhz Max memory allocation: 536870912 Cache type: None Cache line size: 0 Cache size: 0 Global memory size: 1073741824 Constant buffer size: 65536 Max number of constant args: 8 Local memory type: Scratchpad Local memory size: 32768 Queue properties: Out-of-Order: No Name: Cypress Vendor: Advanced Micro Devices, Inc. Driver version: CAL 1.4.1720 (VM) Version: OpenCL 1.2 AMD-APP (938.1) Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing Work Unit Info: ............... 
Credit multiplier is : 2.85 WU true angle range is : 0.013126 Pulse: peak=3.387251, time=53.74, period=7.17, d_freq=1420967520.48, score=1.044, chirp=-0.19964, fft_len=1024 Triplet: peak=10.70621, time=44.83, period=14.48, d_freq=1420967589.97, chirp=-6.1352, fft_len=256 Pulse: peak=1.727817, time=53.69, period=3.116, d_freq=1420969401.12, score=1.006, chirp=10.137, fft_len=128 Pulse: peak=5.58868, time=53.74, period=14.94, d_freq=1420968058.23, score=1.004, chirp=21.874, fft_len=1024 Triplet: peak=9.934995, time=2.936, period=2.7, d_freq=1420975331.94, chirp=-29.343, fft_len=512 Pulse: peak=4.046307, time=53.74, period=8.965, d_freq=1420974465.85, score=1.085, chirp=43.682, fft_len=1024 Triplet: peak=10.82791, time=55.58, period=35.64, d_freq=1420968425.85, chirp=-44.816, fft_len=128 Pulse: peak=9.219539, time=54.11, period=31.04, d_freq=1420975424.16, score=1.004, chirp=47.283, fft_len=8k Spike: peak=24.22475, time=25.17, d_freq=1420972100.06, chirp=-59.537, fft_len=32k Spike: peak=24.50759, time=25.17, d_freq=1420972100.06, chirp=-59.596, fft_len=32k Triplet: peak=10.60119, time=51.98, period=49.02, d_freq=1420969412.08, chirp=78.561, fft_len=512 Pulse: peak=2.327311, time=53.71, period=4.551, d_freq=1420975124, score=1.019, chirp=-90.699, fft_len=512 Pulse: peak=0.949061, time=53.7, period=1.313, d_freq=1420970923.87, score=1.027, chirp=-95.767, fft_len=256 </stderr_txt> ]]> This unit found some signals (7 pulses, 4 triplets and 2 spikes) and, if I'm not mistaken, probably would have finished and validated. Or are only the strongest counted? Well, I'll stop the GPUs anyway. After installing cat.11.9, CPU time went from 76% to 98% over the course of an MB WU. 
With 4 cores to 'feed' the GPUs, GPU load has dropped from ~96% average to 76%, so I'm not expecting improvements; it'll get worse and worse :-/ Throughput, or RAC, has dropped, with work being cut off because of some invalid way of handling these WUs. I can't even remember when it started, and maybe I had never even seen this error message before. OFF TOPIC: Why does the AstroPulse GPU app do so well compared to the ATI version for MB? Is it the 4 signal types for MB versus 1 with AstroPulse? Or am I terribly mistaken? AstroPulse detects Pulses and Repetitive Pulses. Isn't that better suited to parallelizing than 4 different signal types? Back on topic. (Trying to read half a kilo of OpenCL ;-0 ) |
Joined: 4 Oct 00 Posts: 9541 Credit: 50,759,529 RAC: 60 |
You appear to be having a similar problem to the one I had. Mike had me leave 2 CPU cores free to run the GPU. Surprise, surprise: my ATI worked fine after that. In a rich man's house there is no place to spit but his face. Diogenes Of Sinope |
Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
|
.clair. Joined: 4 Nov 04 Posts: 1300 Credit: 55,390,408 RAC: 69 |
Fred, try something for me. This is the command line from my app_info.xml; swap it for the line in your app_info.xml just to see what happens. `<cmdline>-period_iterations_num 20 -instances_per_device 2 -hp -no_cpu_lock</cmdline>` Give it a day to see what happens, or less if it wrecks the job; if it does, put your own back in. I am running two 7970s on a single P4 and don't get errors. Give it a go, you haven't got much to lose . . . Edit: I use 12.4 on the 7970 because it uses a lot less CPU and more GPU. Use whatever period iterations you want to. |
Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
"Fred, ..." Except for no_cpu_lock, I'm using the same settings. I tried other drivers, cat.11.2 and 11.9, which gave more CPU time and less GPU load, so I changed back to cat 12.4, but didn't install the whole AMD-APP-SDK 2.4 which I used the last time. I also left 2 of the 8 cores (i7-2600) free for the GPUs, and it runs slightly faster at a base clock of 104MHz. (The multiplier is locked at 34x 100MHz.) Raising the GPUs' core clock doesn't help. I did some benchmark tests with SiSoft Sandra: an 8% overall gain across CPU/GPU/DRAM (DDR3-1634MHz), OpenCL, etc. I still find it odd, because it worked perfectly before June 2012! And the errors only come from MB work; AstroPulse has by far the biggest speed-up! A server-side setting was put in place to handle the "runtime", and it appears to have some negative effect on some GPUs/drivers?! But to answer your question, I have already put no_cpu_lock in app_info.xml; hope it helps. With the cat 12.4 drivers, CPU use has gone down and GPU load has gone up from 74% to 94% on average over 1 MB VLAR WU. (2 cores are free and feed the GPUs.) |
Joined: 17 Feb 01 Posts: 34577 Credit: 79,922,639 RAC: 80 |
Fred, freeing cores doesn't give any benefit without using the no_cpu_lock param. With each crime and every kindness we birth our future. |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
The "197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED" error isn't really new; it's just more specific than the previous -177 error, where you had to look down in the stderr text to see which limit had been exceeded. The project has had the limit set at 10x the raw estimate probably since the beginning of seti_boinc; it definitely was so when I transitioned from Classic in June 2005. But in those days hosts typically ran at fractional DCF, so the ratio between the indicated estimate and when BOINC would kill a task for taking too long was more like 40 or 50. It was when CreditNew and the associated per-application runtime estimation started being used that the ratio of 10 really came into effect. Combined with some instability in the server calculations, there have been many hosts which at one time or another have been afflicted with those errors; that's why Fred's Rescheduler has the option to increase the limit. But if a host suddenly starts taking over 10x its former run time on similar tasks, figuring out the cause of that change is obviously important. Using the rescheduler to avoid the errors would be reasonable while investigating; that way the host continues being productive, even though less so. Joe |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Fred's ATI/AMD host has an extremely high APR for ATI/AMD MultiBeam of 644.29. Normally this is around half of the ATI/AMD Astropulse APR value, which in this case is 635.47. (My GTX460 only has an MB APR of 321.54 and is a lot faster at completing MB than Fred's HD5800s, while AP for the GTX460 is 672.17.) I suggest that Fred try running one task at a time (so they at least complete instead of erroring) and see if he can drive that APR down. Claggy |
Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
"Fred's ATI/AMD host has an extremely high APR for ATI/AMD MultiBeam of 644.29. Normally this is around half of the ATI/AMD Astropulse APR value, which in this case is 635.47." Thanks all for your advice. Stupid of me to forget no_cpu_lock with 2 free cores. I changed that yesterday evening, and the last errors are from yesterday evening. August 02-2012: SETI@home Enhanced (anonymous platform, CPU): Number of tasks completed 7906; Max tasks per day 8608; Number of tasks today 125; Consecutive valid tasks 7783; Average processing rate 31.941425242509; Average turnaround time 2.55 days. SETI@home Enhanced (anonymous platform, ATI GPU): Number of tasks completed 15; Max tasks per day 190; Number of tasks today 147; Consecutive valid tasks 9; Average processing rate 644.29097472245; Average turnaround time 0.55 days. Valid (292) · Invalid (0) · Error (147). Task list (columns: sent, reported, status, run time in s, CPU time in s, credit, application): [IDs not shown] 20 Jul 2012 15:54:17 UTC; 2 Aug 2012 20:41:47 UTC; Error while computing; 3,373.15; 2,634.14; ---; SETI@home Enhanced Anonymous platform (ATI GPU). [Task 2532590167, WU 1031871149] 20 Jul 2012 15:54:17 UTC; 2 Aug 2012 20:41:47 UTC; Error while computing; 30.19; 8.36; ---; SETI@home Enhanced Anonymous platform (ATI GPU). [Task 2532590125, WU 1032666587] 20 Jul 2012 15:54:17 UTC; 2 Aug 2012 20:41:47 UTC; Error while computing; 3,378.73; 3,009.09; ---; SETI@home Enhanced Anonymous platform (ATI GPU). [Task 2532590119, WU 1032666580] 20 Jul 2012 15:54:17 UTC; 2 Aug 2012 20:41:47 UTC; Error while computing; 3,379.29; 2,728.10; ---; SETI@home Enhanced Anonymous platform (ATI GPU). [Task 2532565192, WU 1032654775] 20 Jul 2012 15:25:46 UTC; 2 Aug 2012 12:18:12 UTC; Error while computing; 23.52; 8.58; ---; SETI@home Enhanced Anonymous platform (ATI GPU). These are the last errors; there may be some more, as there are still a lot pending. I'll try 1 instance per device and leave all other settings as they are now. |
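
For anyone following the arithmetic in this thread, the time limit can be sketched in a few lines of Python. This is only a simplified model and not BOINC's actual code: it assumes the cutoff is rsc_fpops_bound divided by the speed implied by the host's APR (taken as GFLOPS), and that rsc_fpops_bound is 10x rsc_fpops_est as Joe describes. The function name `elapsed_time_limit`, its parameters, and the work-unit size of 2.0e14 operations are hypothetical and chosen purely for illustration; the APR values are the ones quoted in the posts above.

```python
def elapsed_time_limit(rsc_fpops_est, apr_gflops, bound_multiple=10.0):
    """Return (estimated_runtime_s, time_limit_s) under a simplified model.

    Assumption (not taken from BOINC source): both figures are obtained by
    dividing an operation count by the speed implied by the host's APR.
    """
    if apr_gflops <= 0:
        raise ValueError("APR must be positive")
    flops = apr_gflops * 1e9                      # GFLOPS -> FLOPS
    rsc_fpops_bound = bound_multiple * rsc_fpops_est
    estimated_runtime = rsc_fpops_est / flops     # what the host is told to expect
    time_limit = rsc_fpops_bound / flops          # task is killed past this point
    return estimated_runtime, time_limit


if __name__ == "__main__":
    HYPOTHETICAL_FPOPS_EST = 2.0e14  # illustrative only, not a real WU value

    # APR values quoted in the thread; multiples of 10 (current) and 25 (Joe's idea).
    for apr, multiple in [(644.29, 10), (321.54, 10), (644.29, 25)]:
        est, limit = elapsed_time_limit(HYPOTHETICAL_FPOPS_EST, apr, multiple)
        print(f"APR {apr:7.2f}, bound x{multiple}: "
              f"estimate {est:7.1f} s, limit {limit:7.1f} s")
```

Under this model, an APR that overstates the host's real speed shrinks both the estimate and the cutoff, which fits Claggy's suggestion to drive the APR down. Raising the multiple from 10 to 20 or 25 widens the margin without touching the estimate, which is the workaround Joe mentions.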
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.