When can some better controls be put on runaway rigs?

Message boards : Number crunching : When can some better controls be put on runaway rigs?
Jeff Buck
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1448493 - Posted: 29 Nov 2013, 22:35:16 UTC - in response to Message 1448443.  
Last modified: 29 Nov 2013, 22:39:22 UTC

AFAIK, the "100% of the processors" can be reduced with no unexpected side effects.

Actually, I'm not convinced that the "no unexpected side effects" is necessarily true.

Referring back to an old thread, "AstroPulse has yet to finish on my system", and specifically to my message #1391036: I noted at the time (back in mid-July) that there might be a connection between running BOINC at less than 100% and having AP CPU tasks get "hung". I had started to look very carefully at the timing of the hangups and found that each one I looked at had halted at precisely the progress percent that was recorded in the last checkpoint taken. This led me to wonder,

if BOINC happens to suspend a CPU AP task while it's in the middle of some specific critical activity (such as taking a checkpoint), whether the AP program might have a problem resuming when BOINC tells it to.
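
As a rough illustration of the kind of race being speculated about here, the following Python sketch shows one common way a worker can avoid being suspended in the middle of a checkpoint: it records the suspend request and only acts on it once the critical activity has finished. This is a toy model only. The `Worker` class and all of its method names are hypothetical, and nothing in this thread establishes how AstroPulse or BOINC actually handle suspension internally.

```python
import threading


class Worker:
    """Toy model of a task that defers suspension while checkpointing.

    All names here are hypothetical and for illustration only.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._in_critical = False      # True while a checkpoint is being written
        self._suspend_pending = False  # a suspend arrived during the checkpoint
        self.suspended = False

    def request_suspend(self):
        # Called by the (hypothetical) client. If a checkpoint is in
        # progress, only remember the request instead of stopping now.
        with self._lock:
            if self._in_critical:
                self._suspend_pending = True
            else:
                self.suspended = True

    def begin_checkpoint(self):
        with self._lock:
            self._in_critical = True

    def end_checkpoint(self):
        # Leaving the critical activity: honour any deferred suspend.
        with self._lock:
            self._in_critical = False
            if self._suspend_pending:
                self._suspend_pending = False
                self.suspended = True

    def resume(self):
        with self._lock:
            self._suspend_pending = False
            self.suspended = False
```

In this model a suspend request that lands during a checkpoint never interrupts it; a program without such a guard could, in principle, be stopped halfway through writing its state, which is the scenario the speculation above describes.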

I experienced several more of these "hung" AP tasks up until early September, under exactly the same circumstances. Here are some notes I made on a September 5th hangup:
9/5/2013: AP was "stuck" from 8:26 PM until approx. 9:02 PM
Entered "chunk" 7296 at approx. 7:29 PM
BOINC Manager showed 45.858% complete
Running on jrbuck_PC at 75%

<active_task>
    <project_master_url>http://setiathome.berkeley.edu/</project_master_url>
    <result_name>ap_28au08ad_B6_P0_00027_20130902_26759.wu_1</result_name>
    <checkpoint_cpu_time>87224.510000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>97241.463882</checkpoint_elapsed_time>
    <fraction_done>0.458581</fraction_done>
</active_task>
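
The observation that the task stalled at exactly its last checkpoint can be checked mechanically. The sketch below, using only Python's standard library, parses the `<active_task>` record quoted above and compares its `fraction_done` with the percentage shown in BOINC Manager. The function names and the tolerance value are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The <active_task> record exactly as quoted in the notes above.
ACTIVE_TASK_XML = """<active_task>
    <project_master_url>http://setiathome.berkeley.edu/</project_master_url>
    <result_name>ap_28au08ad_B6_P0_00027_20130902_26759.wu_1</result_name>
    <checkpoint_cpu_time>87224.510000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>97241.463882</checkpoint_elapsed_time>
    <fraction_done>0.458581</fraction_done>
</active_task>"""


def checkpoint_percent(xml_text):
    """Return the progress recorded at the last checkpoint, in percent."""
    task = ET.fromstring(xml_text)
    return float(task.findtext("fraction_done")) * 100.0


def stuck_at_checkpoint(xml_text, displayed_percent, tolerance=0.001):
    """True if the displayed progress matches the checkpointed progress.

    The tolerance (hypothetical choice) allows for the three decimal
    places that BOINC Manager displays.
    """
    return abs(checkpoint_percent(xml_text) - displayed_percent) <= tolerance
```

Calling `stuck_at_checkpoint(ACTIVE_TASK_XML, 45.858)` compares the record with the 45.858% that BOINC Manager showed during this hangup.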


PROCESS EXPLORER:
Thread 1480:
ntkrnlpa.exe!KeWaitForMultipleObjects+0xabc
ntkrnlpa.exe!KeDelayExecutionThread+0x472
ntkrnlpa.exe!NtSetEvent+0xb4a
ntkrnlpa.exe!ZwQueryLicenseValue+0xbd6
ntdll.dll!KiFastSystemCallRet
kernel32.dll!Sleep+0xf
astropulse_6.01_windows_intelx86.exe+0x2e4fb
ntdll.dll!RtlInitializeExceptionChain+0x63
ntdll.dll!RtlInitializeExceptionChain+0x36

Thread 5128:
ntkrnlpa.exe!KeWaitForMultipleObjects+0xabc
ntkrnlpa.exe!KeWaitForMultipleObjects+0x540
ntkrnlpa.exe!NtSetEvent+0x8df
ntkrnlpa.exe!NtSetEvent+0x64e
ntkrnlpa.exe!ZwQueryLicenseValue+0xbd6
ntdll.dll!KiFastSystemCallRet
kernel32.dll!WaitForMultipleObjects+0x18
astropulse_6.01_windows_intelx86.exe+0x35db2

Thread 5836: (just viewing this thread's stack caused process to resume)
ntkrnlpa.exe!KeWaitForMultipleObjects+0xabc
ntkrnlpa.exe!KeWaitForMutexObject+0x492
ntkrnlpa.exe!KeTestAlertThread+0x78
hal.dll!KfRaiseIrql+0xd1
hal.dll!KeRaiseIrqlToSynchLevel+0x70
hal.dll!HalEndSystemInterrupt+0x73
hal.dll!HalInitializeProcessor+0xcc1
astropulse_6.01_windows_intelx86.exe+0x167a2
astropulse_6.01_windows_intelx86.exe+0x8954
astropulse_6.01_windows_intelx86.exe+0x9300
astropulse_6.01_windows_intelx86.exe+0x3f04b
astropulse_6.01_windows_intelx86.exe+0xd6648
astropulse_6.01_windows_intelx86.exe+0x6346c
 

Following this last incident, I read up on TThrottle and decided to give it a try, since it seemed to take a different approach to task suspension. On September 8, I installed TThrottle and set the BOINC processor percentage back to 100%, allowing TThrottle to manage the processor usage, instead of BOINC. I have not had a stuck AP task since then, close to 3 months now.

That may not be conclusive proof that there is, in fact, an unexpected side effect to setting the BOINC processor percentage to something less than 100%, but I certainly found that in my case, at least, there was.
Richard Haselgrove
Volunteer tester

Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1448500 - Posted: 29 Nov 2013, 23:11:30 UTC - in response to Message 1448493.  

AFAIK, the "100% of the processors" can be reduced with no unexpected side effects.

Actually, I'm not convinced that the "no unexpected side effects" is necessarily true.

<...snip...>

That may not be conclusive proof that there is, in fact, an unexpected side effect to setting the BOINC processor percentage to something less than 100%, but I certainly found that in my case, at least, there was.

Jeff, I think you've fallen into exactly the same trap that Joe tried to extricate me from.

There's a huge difference between these two lines in preferences:

On multiprocessors, use at most: 100% of the processors

Use at most: 100% of CPU time
(Can be used to reduce CPU heat)

The first line (relating to the number of processors) is safe to use - it doesn't make BOINC suspend a task while running.

The second line - relating to the percentage of time BOINC runs a task (stopping and restarting it) - is where the potential problems lie. Coincidentally, since I last posted, David Anderson has acknowledged the problem:

client: fix bugs with CPU throttling and GPU apps
Various bad things could happen when CPU throttling was used together w/ GPU apps.

Examples:

  • on a multi-GPU system, several GPU tasks are assigned to the same GPU
  • a suspended GPU task remains in memory (tying up its GPU resources) while other tasks try to use the GPU.


http://boinc.berkeley.edu/trac/changeset/d6da81b86284eb60e1d4e33155167e2f87a1b138/boinc-v2
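
The difference between the two preferences can be pictured with a small Python model. This is a simplified sketch under stated assumptions, not BOINC's actual scheduling code: the function names, the round-down rule, and the 10-second throttle period are all hypothetical choices made for illustration.

```python
import math


def usable_processors(total_cpus, percent_of_processors):
    """CPUs used under 'On multiprocessors, use at most N% of the processors'.

    Assumption: the result is rounded down and never drops below 1.
    No running task is ever suspended by this limit; fewer tasks are
    simply started.
    """
    return max(1, math.floor(total_cpus * percent_of_processors / 100))


def throttle_schedule(percent_cpu_time, period_seconds=10, periods=3):
    """Run/suspend pattern under 'Use at most N% of CPU time'.

    Assumption: each period is split into a run slice and a suspend
    slice in proportion to the percentage.
    """
    run = period_seconds * percent_cpu_time / 100
    pause = period_seconds - run
    schedule = []
    for _ in range(periods):
        schedule.append(("run", run))
        if pause > 0:
            schedule.append(("suspend", pause))
    return schedule


def suspend_count(schedule):
    """Number of times a task is suspended in the given schedule."""
    return sum(1 for state, _ in schedule if state == "suspend")
```

In this model, lowering the processor percentage only changes how many tasks run at once, while any CPU-time setting below 100% introduces repeated suspend and resume cycles, which is where a task could be caught at an awkward moment.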
Jeff Buck
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1448506 - Posted: 29 Nov 2013, 23:26:18 UTC - in response to Message 1448500.  

I think I might have been a bit careless in my use of the term "setting the BOINC processor percentage to something less than 100%".

I had forgotten that BOINC had those two similar preferences. It was, in fact, the "Use at most: 100% of CPU time (Can be used to reduce CPU heat)" option that I was setting to less than 100%, in order to control my CPU temperatures, thus causing the tasks to actually get suspended.

I wonder if what David Anderson is referring to, though, is actually what I (and others) seemed to be experiencing. Certainly in my case it was only AP CPU tasks that would hang. The GPU just kept chugging along.


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.