Warning when using 7.3.14

Message boards : Number crunching : Warning when using 7.3.14
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1498000 - Posted: 31 Mar 2014, 23:01:23 UTC

BOINC 7.3.14, the latest alpha, will trash all work here at SETI, with messages similar to the one at the end of this post.

This is because this newest client checks the workunit.rsc_memory_bound value, which the project sets as the maximum amount of memory that a task can use. Clients before 7.3.14 did not check this value correctly; 7.3.14 does. And since all projects set this value too low, your work will be aborted after a bit of a run.

If this client is one that is going to be released to the public, all projects will have to redo their workunit.rsc_memory_bound values to more correct ones, e.g. 1024MB on x64, 512MB on x86.

The project would have to set rsc_memory_bound to 1 GB. But then these tasks wouldn't be sent to x86 hosts with 512MB RAM.
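The dilemma above can be put into a minimal sketch. This is not BOINC code: the function names are made up for illustration, and it only assumes what this post describes, namely that the server skips hosts with less RAM than rsc_memory_bound and that the 7.3.14 client aborts once the working set exceeds it.

```python
MB = 1024 * 1024


def server_would_send(rsc_memory_bound: int, host_ram: int) -> bool:
    """Server side (hypothetical helper): skip hosts whose RAM is below the bound."""
    return rsc_memory_bound <= host_ram


def client_7_3_14_would_abort(working_set: float, rsc_memory_bound: int) -> bool:
    """Client side as described for 7.3.14: abort once the working set exceeds the bound."""
    return working_set > rsc_memory_bound


wss = 36.82 * MB        # working set size taken from the log in this post
small_host = 512 * MB   # the x86 host with 512MB RAM mentioned above

for bound in (32 * MB, 1024 * MB):
    print(
        f"bound={bound // MB}MB "
        f"sent_to_512MB_host={server_would_send(bound, small_host)} "
        f"aborted={client_7_3_14_would_abort(wss, bound)}"
    )
```

With the 32MB bound the task reaches the 512MB host but is aborted; with a 1GB bound it would survive but never be sent to that host.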

<core_client_version>7.3.14</core_client_version>
<![CDATA[
<message>
working set size > workunit.rsc_memory_bound: 36.82MB > 32.00MB
</message>
<stderr_txt>
Running on device number: 0
Priority of worker thread raised successfully
Priority of process adjusted successfully, below normal priority class used
OpenCL platform detected: Intel(R) Corporation
OpenCL platform detected: Advanced Micro Devices, Inc.
BOINC assigns device 0
Info: BOINC provided device ID used

Build features: SETI7	Non-graphics	OpenCL	USE_OPENCL_HD5xxx	OCL_ZERO_COPY	OCL_CHIRP3	FFTW	AMD specific	USE_SSE	x86	
     CPUID:        Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz 

     Cache: L1=64K L2=256K

CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 SSSE3 SSE4.1 SSE4.2 AVX 
OpenCL-kernels filename : MultiBeam_Kernels_r1843.cl 
ar=0.443860 NumCfft=192145 NumGauss=          1064875014 NumPulse=        114535999999 NumTriplet=      15768471666688

Currently allocated 145 MB for GPU buffers
In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768
Windows optimized S@H Enhanced application by Alex Kan
Version info: SSEx (Intel, Core 2-optimized v8-nographics) V5.13 by Alex Kan
SSEx Win32 Build 1843 , Ported by : Raistmer, JDWhale

SETI7 update by Raistmer

OpenCL version by Raistmer, r1843

AMD HD5 version by Raistmer

Number of OpenCL platforms:				 2


 OpenCL Platform Name:					 Intel(R) OpenCL
Number of devices:				 0


 OpenCL Platform Name:					 AMD Accelerated Parallel Processing
Number of devices:				 1
  Max compute units:				 20
  Max work group size:				 256
  Max clock frequency:				 1000Mhz
  Max memory allocation:			 536870912
  Cache type:					 Read/Write
  Cache line size:				 64
  Cache size:					 16384
  Global memory size:				 2147483648
  Constant buffer size:				 65536
  Max number of constant args:			 8
  Local memory type:				 Scratchpad
  Local memory size:				 32768
  Queue properties:				 
    Out-of-Order:				 No
  Name:						 Pitcairn
  Vendor:					 Advanced Micro Devices, Inc.
  Driver version:				 1268.1 (VM)
  Version:					 OpenCL 1.2 AMD-APP (1268.1)
  Extensions:					 cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer 


Work Unit Info:
...............
Credit multiplier is :  2.85
WU true angle range is :  0.443860
Used GPU device parameters are:
	Number of compute units: 20
	Single buffer allocation size: 64MB
	max WG size: 256
period_iterations_num=20
GPU device synched
Termination request detected or computations are finished. GPU device synched,  exiting...

</stderr_txt>
]]>
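For anyone who wants to scan their own results for this error, here is a small sketch that pulls the two numbers out of the message line. The regular expression is written against the exact wording shown in the log above; whether other client versions word it the same way is an assumption, and the function name is hypothetical.

```python
import re

# Matches the abort message exactly as it appears in the log above.
MESSAGE_RE = re.compile(
    r"working set size > workunit\.rsc_memory_bound: "
    r"(?P<wss>\d+(?:\.\d+)?)MB > (?P<bound>\d+(?:\.\d+)?)MB"
)


def parse_memory_abort(text):
    """Return (working_set_mb, bound_mb) if the abort message is present, else None."""
    match = MESSAGE_RE.search(text)
    if match is None:
        return None
    return float(match.group("wss")), float(match.group("bound"))


sample = "working set size > workunit.rsc_memory_bound: 36.82MB > 32.00MB"
result = parse_memory_abort(sample)
if result is not None:
    wss_mb, bound_mb = result
    print(f"working set {wss_mb}MB exceeds bound {bound_mb}MB")
```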

ID: 1498000
Jacob Klein
Volunteer tester

Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 1498036 - Posted: 1 Apr 2014, 1:24:20 UTC

Must set rsc_memory_bound correctly

SETI Team:

You need to change your work unit parameters to set <rsc_memory_bound> correctly. BOINC 7.3.14 alpha (and potentially future versions) will read that value, compare it to the working set size, and auto-abort the work unit if it exceeds the bound.

As of right now, I am getting errors due to your incorrect settings.

For example:
http://setiathome.berkeley.edu/result.php?resultid=3467797216
Exit status 198 (0xc6) EXIT_MEM_LIMIT_EXCEEDED
<core_client_version>7.3.14</core_client_version>
<![CDATA[
<message>
working set size > workunit.rsc_memory_bound: 97.08MB > 32.00MB
</message>

Could you please promptly fix this?

Regards,
Jacob Klein
ID: 1498036
Jacob Klein
Volunteer tester

Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 1498043 - Posted: 1 Apr 2014, 2:10:33 UTC

It looks like this change is being reverted for now, per David's email below.
So, there is no longer an immediate need to correct the value...
But please consider setting it correctly at some point, in case it gets used by the client in the future.


> Date: Mon, 31 Mar 2014 18:53:33 -0700
> From: d..a@ssl.berkeley.edu
> To: b..c_alpha@ssl.berkeley.edu
> Subject: Re: [boinc_alpha] 7.3.14 - Heads up - Memory bound enforcement
>
> On further thought, I'm going to change things back to the way they were, namely
>
> 1) workunit.rsc_memory_bound is used only by the server;
> it won't send a job if rsc_memory_bound > host's available RAM
> 2) the client aborts a job if working set size > host's available RAM
> 3) the client will run a set of jobs only if the sum of their WSSs
> fits in available RAM
> (i.e. if a job's WSS is close to all available RAM,
> it would run that job and nothing else)
>
> The reason for not aborting jobs when WSS > rsc_memory_bound is that
> it requires projects to come up with very accurate estimates of RAM usage,
> which I don't think is feasible in general.
> Also, it will lead to lots of aborted jobs, which is bad for volunteer morale.
>
> -- David
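The three rules in David's email can be sketched roughly as follows. This is an illustration only, not the actual BOINC implementation: the function names are invented, sizes are plain numbers in MB, and the greedy in-order selection in rule 3 is an assumption, since the email does not say how the client picks the set of jobs.

```python
def server_may_send(rsc_memory_bound, host_available_ram):
    """Rule 1: the server won't send a job if rsc_memory_bound > host's available RAM."""
    return rsc_memory_bound <= host_available_ram


def client_must_abort(working_set, host_available_ram):
    """Rule 2: the client aborts a job if its working set size > host's available RAM."""
    return working_set > host_available_ram


def pick_runnable(jobs, host_available_ram):
    """Rule 3: run a set of jobs only if the sum of their working sets fits.

    jobs is a list of (name, working_set) pairs; selection is greedy in the
    given order, which is an assumption made for this sketch.
    """
    chosen, used = [], 0
    for name, working_set in jobs:
        if used + working_set <= host_available_ram:
            chosen.append(name)
            used += working_set
    return chosen


# A job close to all available RAM runs alone, as in the email's parenthetical.
print(pick_runnable([("big", 950), ("small", 100)], host_available_ram=1000))
```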
ID: 1498043
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1498088 - Posted: 1 Apr 2014, 5:05:54 UTC

I keep telling people that newer versions don't necessarily make them better.


> Also, it will lead to lots of aborted jobs, which is bad for volunteer morale.

Now that's strange hearing that from Dr. D.A., are you sure that he sent it?

Cheers.
ID: 1498088
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1498113 - Posted: 1 Apr 2014, 6:59:21 UTC - in response to Message 1498088.  

> I keep telling people that newer versions don't necessarily make them better.
>
> > Also, it will lead to lots of aborted jobs, which is bad for volunteer morale.
>
> Now that's strange hearing that from Dr. D.A., are you sure that he sent it?
>
> Cheers.


The impression I've been given from activity on the boincapi and CreditNew fronts is that Dr. A is quite aware there are problems, and needs development help fixing them due to low resources. I'm not at all surprised that a change to tighten a questionable failsafe might create more work than it alleviates, and so be retracted for a rethink.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1498113
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1498196 - Posted: 1 Apr 2014, 14:26:50 UTC - in response to Message 1498000.  
Last modified: 1 Apr 2014, 14:27:22 UTC


> The project would have to set rsc_memory_bound to 1 GB. But then these tasks wouldn't be sent to x86 hosts with 512MB RAM.


Does this imply that a host with, let's say, 256 MB will never receive this app?
I think that a host with 256MB of system memory can do OpenCL ATI MB tasks, provided the GPU allows it. Why should the project exclude such hosts just to please a new BOINC version? Or am I missing something?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1498196
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1498197 - Posted: 1 Apr 2014, 14:31:55 UTC
Last modified: 1 Apr 2014, 14:33:02 UTC

And another question: why the hell should the client abort ANYTHING already received???
It always just causes frustration and wasted resources. What is this maniacal will to abort??
It was so with "GPU missing", until the objections grew too strong and now the task is suspended instead of aborted... It is so with "too long running" tasks that get aborted without any real need, just because BOINC thinks they run too long due to distorted estimates... And now because the task's working set exceeds available memory? So what, we have lived in a world with paging for many years already. If such a task somehow happens to land on a PC, let it continue or die by itself.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1498197
Jacob Klein
Volunteer tester

Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 1498201 - Posted: 1 Apr 2014, 14:36:41 UTC - in response to Message 1498197.  
Last modified: 1 Apr 2014, 14:38:21 UTC

If the task's working set size exceeds the amount of RAM that the user has configured BOINC to use, then it does make sense to abort the task, because it cannot possibly be completed. That logic is not being changed.

The proposal was to additionally check the working set size against a work unit parameter, rsc_memory_bound, and abort if the bound was exceeded. That proposed change, which was put into 7.3.14 alpha, will be reverted.

Thus, even if the work unit exceeds rsc_memory_bound, it'll be allowed to continue, unless it happens to exceed the amount of RAM that the user has configured BOINC to use.
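To make the difference concrete, here is a rough sketch of the check that stays in place. It assumes the user's RAM setting can be expressed as a fraction of installed RAM; the parameter names and the 4096MB host are made up for illustration and are not taken from the BOINC source.

```python
def ram_limit_mb(total_ram_mb: float, max_used_fraction: float) -> float:
    """RAM the user lets BOINC use, as a fraction of installed RAM (hypothetical model)."""
    if not 0.0 <= max_used_fraction <= 1.0:
        raise ValueError("max_used_fraction must be between 0 and 1")
    return total_ram_mb * max_used_fraction


def should_abort(working_set_mb: float, rsc_memory_bound_mb: float, limit_mb: float) -> bool:
    """Rule after the revert: only the user's RAM limit matters on the client."""
    # rsc_memory_bound_mb is accepted but deliberately unused here: exceeding it
    # alone no longer aborts the task once the 7.3.14 change is reverted.
    return working_set_mb > limit_mb


# Hypothetical host: 4096MB installed, BOINC allowed to use half of it.
limit = ram_limit_mb(total_ram_mb=4096, max_used_fraction=0.5)

# The 97.08MB task from the earlier error report, with its 32MB bound.
print(should_abort(working_set_mb=97.08, rsc_memory_bound_mb=32.0, limit_mb=limit))
```

Under these assumptions the task keeps running even though it is well over its rsc_memory_bound.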

Take a breath . . .
ID: 1498201
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1498218 - Posted: 1 Apr 2014, 15:44:15 UTC
Last modified: 1 Apr 2014, 16:05:37 UTC

Part of the problem here is that under modern operating systems, limits on and with respect to physical RAM make no sense, as all the memory is virtualised. That makes 'working set size' and page commit behaviour technically more relevant, though pretty far removed from typical user perception (except perhaps for the few who completely disable paging, and understand that up to 50% of Windows memory is non-paged kernel memory).

As the implementation appears to have been intended, the limit setting, along with subsequent aborts, will appear as nonsense, simply because applications do not use ANY physical memory directly.

If Boinc wants 'safeties', it needs to use ones that make sense, like the obvious:
"received an out of memory exception with XYZ application"... don't send that anymore, and somehow provide info as to why (with a possible reset)... application details page perhaps ?
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1498218
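The kind of 'safety' jason_gee suggests in the post above could look something like this on the server side. It is purely a sketch of the idea (stop sending an application to a host that reported an out-of-memory failure, keep the reason, allow a reset); the class and method names are hypothetical and nothing like this is claimed to exist in BOINC.

```python
class OomTracker:
    """Remembers which applications ran out of memory on which hosts (hypothetical)."""

    def __init__(self):
        # host_id -> {app_name: detail text shown to the user}
        self._blocked = {}

    def report_out_of_memory(self, host_id, app_name, detail):
        """Record an out-of-memory failure reported by a host."""
        self._blocked.setdefault(host_id, {})[app_name] = detail

    def may_send(self, host_id, app_name):
        """Only send work for apps that have not run out of memory on this host."""
        return app_name not in self._blocked.get(host_id, {})

    def reason(self, host_id, app_name):
        """Explanation for an application details page, or None if not blocked."""
        return self._blocked.get(host_id, {}).get(app_name)

    def reset(self, host_id, app_name):
        """Let the user clear the block, e.g. after adding RAM."""
        self._blocked.get(host_id, {}).pop(app_name, None)


tracker = OomTracker()
tracker.report_out_of_memory(42, "XYZ", "received an out of memory exception")
print(tracker.may_send(42, "XYZ"), tracker.reason(42, "XYZ"))
```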
Juha
Volunteer tester

Joined: 7 Mar 04
Posts: 388
Credit: 1,857,738
RAC: 0
Finland
Message 1498266 - Posted: 1 Apr 2014, 21:11:33 UTC - in response to Message 1498197.  
Last modified: 1 Apr 2014, 21:13:12 UTC

> And now because the task's working set exceeds available memory? So what, we have lived in a world with paging for many years already. If such a task somehow happens to land on a PC, let it continue or die by itself.

Next time please consider what paging does to computer's usability.

If BOINC is allowed to do computations while the computer is in use, and the science apps use so much memory that the computer starts paging, it will severely affect the user's experience with whatever programs he wants to use.

And even if the paging occurs when the computer is not otherwise in use, it will slow down the science apps very badly. You can't do any computations with your data if the data is not in RAM, can you?

And if you combine this with Linux's rather optimistic memory allocation strategy and the typical fixed-size swap (or even no swap at all), you'll get, in extreme cases, a computer that is completely unusable. Once you have too much data in hand, Linux starts throwing program code out of RAM.

When all of the virtual memory is used, the computer will spend all of its time loading a page of code in and throwing another out. I've seen that happen with programs leaking memory, and when starting one program too many while the computer is already swapping. The only way out of that situation is the reset button.
ID: 1498266



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.