When can some better controls be put on runaway rigs?

Jeff Buck, Volunteer tester. Joined: 11 Feb 00, Posts: 1441, Credit: 148,764,870, RAC: 0
Message 1448493 - Posted: 29 Nov 2013, 22:35:16 UTC - in response to Message 1448443. Last modified: 29 Nov 2013, 22:39:22 UTC

> AFAIK, the "100% of the processors" can be reduced with no unexpected side effects.

Actually, I'm not convinced that the "no unexpected side effects" is necessarily true. Referring back to an old thread, "AstroPulse has yet to finish on my system", and specifically to my message #1391036, I noted at the time (back in mid-July) that there seemed to be a connection between running BOINC at less than 100% and having AP CPU tasks get "hung".

I had started to look very carefully at the timing of the hangups and found that each one I looked at had halted at precisely the progress percent that was recorded in the last checkpoint taken. This led me to speculate that if BOINC happens to suspend a CPU AP task while it's in the middle of some specific critical activity (such as taking a checkpoint), the AP program might have a problem resuming when BOINC tells it to.

I experienced several more of these "hung" AP tasks up until early September, under exactly the same circumstances. Here are some notes I made on a September 5th hangup:

9/5/2013: AP was "stuck" from 8:26 PM until approx. 9:02 PM
Entered "chunk" 7296 at approx. 7:29 PM
BOINC Manager showed 45.858% complete
Running on jrbuck_PC at 75%

<active_task>
    <project_master_url>http://setiathome.berkeley.edu/</project_master_url>
    <result_name>ap_28au08ad_B6_P0_00027_20130902_26759.wu_1</result_name>
    <checkpoint_cpu_time>87224.510000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>97241.463882</checkpoint_elapsed_time>
    <fraction_done>0.458581</fraction_done>
</active_task>

PROCESS EXPLORER:

Thread 1480:
    ntkrnlpa.exe!KeWaitForMultipleObjects+0xabc
    ntkrnlpa.exe!KeDelayExecutionThread+0x472
    ntkrnlpa.exe!NtSetEvent+0xb4a
    ntkrnlpa.exe!ZwQueryLicenseValue+0xbd6
    ntdll.dll!KiFastSystemCallRet
    kernel32.dll!Sleep+0xf
    astropulse_6.01_windows_intelx86.exe+0x2e4fb
    ntdll.dll!RtlInitializeExceptionChain+0x63
    ntdll.dll!RtlInitializeExceptionChain+0x36

Thread 5128:
    ntkrnlpa.exe!KeWaitForMultipleObjects+0xabc
    ntkrnlpa.exe!KeWaitForMultipleObjects+0x540
    ntkrnlpa.exe!NtSetEvent+0x8df
    ntkrnlpa.exe!NtSetEvent+0x64e
    ntkrnlpa.exe!ZwQueryLicenseValue+0xbd6
    ntdll.dll!KiFastSystemCallRet
    kernel32.dll!WaitForMultipleObjects+0x18
    astropulse_6.01_windows_intelx86.exe+0x35db2

Thread 5836: (just viewing this thread's stack caused process to resume)
    ntkrnlpa.exe!KeWaitForMultipleObjects+0xabc
    ntkrnlpa.exe!KeWaitForMutexObject+0x492
    ntkrnlpa.exe!KeTestAlertThread+0x78
    hal.dll!KfRaiseIrql+0xd1
    hal.dll!KeRaiseIrqlToSynchLevel+0x70
    hal.dll!HalEndSystemInterrupt+0x73
    hal.dll!HalInitializeProcessor+0xcc1
    astropulse_6.01_windows_intelx86.exe+0x167a2
    astropulse_6.01_windows_intelx86.exe+0x8954
    astropulse_6.01_windows_intelx86.exe+0x9300
    astropulse_6.01_windows_intelx86.exe+0x3f04b
    astropulse_6.01_windows_intelx86.exe+0xd6648
    astropulse_6.01_windows_intelx86.exe+0x6346c

Following this last incident, I read up on TThrottle and decided to give it a try, since it seemed to take a different approach to task suspension. On September 8, I installed TThrottle and set the BOINC processor percentage back to 100%, allowing TThrottle to manage the processor usage instead of BOINC. I have not had a stuck AP task since then, close to 3 months now.

That may not be conclusive proof that there is, in fact, an unexpected side effect to setting the BOINC processor percentage to something less than 100%, but I certainly found that in my case, at least, there was.
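
The check described in this post (a hung task sitting at exactly the progress recorded in its last checkpoint) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the XML is the excerpt from the notes above, while the helper names `checkpoint_percent` and `stalled_at_checkpoint` and the tolerance value are hypothetical and not part of BOINC.

```python
import xml.etree.ElementTree as ET

# Checkpoint record copied from the September 5th notes in this post.
ACTIVE_TASK = """
<active_task>
    <project_master_url>http://setiathome.berkeley.edu/</project_master_url>
    <result_name>ap_28au08ad_B6_P0_00027_20130902_26759.wu_1</result_name>
    <checkpoint_cpu_time>87224.510000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>97241.463882</checkpoint_elapsed_time>
    <fraction_done>0.458581</fraction_done>
</active_task>
"""


def checkpoint_percent(xml_text: str) -> float:
    """Return the progress stored in the checkpoint record, as a percentage."""
    task = ET.fromstring(xml_text)
    return float(task.findtext("fraction_done")) * 100.0


def stalled_at_checkpoint(xml_text: str, displayed_percent: float,
                          tolerance: float = 0.001) -> bool:
    """Hypothetical helper: True when the progress shown in BOINC Manager
    matches the last checkpoint to within `tolerance` percentage points."""
    return abs(checkpoint_percent(xml_text) - displayed_percent) <= tolerance


if __name__ == "__main__":
    # BOINC Manager showed 45.858% while the task was stuck.
    print(stalled_at_checkpoint(ACTIVE_TASK, 45.858))
```

A match like this only shows that no progress was made after the checkpoint; it does not by itself establish that the suspension caused the hang.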

Richard Haselgrove, Volunteer tester. Joined: 4 Jul 99, Posts: 14653, Credit: 200,643,578, RAC: 874
Message 1448500 - Posted: 29 Nov 2013, 23:11:30 UTC - in response to Message 1448493.

> > AFAIK, the "100% of the processors" can be reduced with no unexpected side effects.
>
> Actually, I'm not convinced that the "no unexpected side effects" is necessarily true.
> <...snip...>
> That may not be conclusive proof that there is, in fact, an unexpected side effect to setting the BOINC processor percentage to something less than 100%, but I certainly found that in my case, at least, there was.

Jeff, I think you've fallen into exactly the same trap that Joe tried to extricate me from. There's a huge difference between these two lines in preferences:

On multiprocessors, use at most: 100% of the processors
Use at most: 100% of CPU time (Can be used to reduce CPU heat)

The first line (relating to the number of processors) is safe to use: it doesn't make BOINC suspend a task while running. The second line, relating to the percentage of time BOINC runs a task (stopping and restarting it), is where the potential problems lie.

Coincidentally, since I last posted, David Anderson has acknowledged the problem:

> client: fix bugs with CPU throttling and GPU apps
>
> Various bad things could happen when CPU throttling was used together w/ GPU apps. Examples:
> - on a multi-GPU system, several GPU tasks are assigned to the same GPU
> - a suspended GPU task remains in memory (tying up its GPU resources) while other tasks try to use the GPU.

http://boinc.berkeley.edu/trac/changeset/d6da81b86284eb60e1d4e33155167e2f87a1b138/boinc-v2
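
The difference between the two preference lines can be illustrated with a toy model. This is a simplified sketch, not BOINC's actual scheduling code: the function and parameter names are hypothetical, and both the rounding rule and the one-second throttling granularity are assumptions made for illustration.

```python
import math


def usable_cpus(total_cpus: int, processors_pct: float) -> int:
    """Model of 'use at most N% of the processors': limits how many cores
    run tasks. Tasks on those cores are never paused by this setting.
    Assumption: round down, but always allow at least one core."""
    return max(1, math.floor(total_cpus * processors_pct / 100.0))


def throttle_schedule(cpu_time_pct: float, seconds: int) -> list:
    """Model of 'use at most N% of CPU time': a duty cycle in which every
    task is repeatedly suspended and resumed.
    Returns one flag per second: True = running, False = suspended."""
    schedule = []
    budget = 0.0
    for _ in range(seconds):
        budget += cpu_time_pct / 100.0  # earn run time each second
        if budget >= 1.0:
            schedule.append(True)       # enough budget: run this second
            budget -= 1.0
        else:
            schedule.append(False)      # not enough budget: stay suspended
    return schedule


if __name__ == "__main__":
    # Limiting processors: fewer cores are used, with no suspend/resume cycles.
    print(usable_cpus(4, 75))
    # Limiting CPU time: tasks alternate between running and suspended.
    print(throttle_schedule(75, 8))
```

In this model only the second setting ever produces a suspended interval, which is the situation in which a task could be paused partway through something like writing a checkpoint.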

Jeff Buck, Volunteer tester. Joined: 11 Feb 00, Posts: 1441, Credit: 148,764,870, RAC: 0
Message 1448506 - Posted: 29 Nov 2013, 23:26:18 UTC - in response to Message 1448500.

I think I might have been a bit careless in my use of the phrase "setting the BOINC processor percentage to something less than 100%". I had forgotten that BOINC had those two similar preferences. It was, in fact, the "Use at most: 100% of CPU time (Can be used to reduce CPU heat)" option that I was setting to less than 100% in order to control my CPU temperatures, thus causing the tasks to actually get suspended.

I wonder, though, whether what David Anderson is referring to is actually what I (and others) seemed to be experiencing. Certainly in my case it was only AP CPU tasks that would hang. The GPU just kept chugging along.

