Long-running work unit
Author | Message |
---|---|
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
On Windows 2000 and later, the timing values are based on the QueryPerformanceCounter function. If that isn't available, there's a fallback using GetSystemTimeAsFileTime, which is less precise. On non-Windows builds, the primary timing looks to use inline assembly with the rdtsc instruction, with the gettimeofday function as the fallback. There are other options for both, depending on build configuration. If you run the test BilBg suggested, we'll at least know for sure whether the symptoms on your Linux system really match. There may be other parts of the initialization which could hang. Joe |
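The primary-plus-fallback arrangement described in this post can be sketched roughly as below. This is not the application's code: Python's `time.perf_counter` merely stands in for the high-resolution source and `time.time` for the coarser fallback, and `pick_timer` and `time_call` are hypothetical helper names used only for illustration.

```python
import time


def pick_timer():
    """Return (name, clock) for the preferred clock, or a coarser fallback.

    Assumption: perf_counter plays the role of the precise primary timer,
    time.time the role of the less precise fallback.
    """
    try:
        info = time.get_clock_info("perf_counter")
        if info.monotonic:
            return "perf_counter", time.perf_counter
    except (ValueError, AttributeError):
        # Clock not available on this platform: fall through to the fallback.
        pass
    return "time", time.time


def time_call(func, *args):
    """Run func(*args) and return (result, elapsed_seconds, clock_name)."""
    name, clock = pick_timer()
    start = clock()
    result = func(*args)
    elapsed = clock() - start
    return result, elapsed, name
```

A call such as `time_call(sum, [1, 2, 3])` returns the result together with the elapsed time and the name of the clock that was actually used, which makes it visible when the fallback was taken.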
BilBg Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
I did a test today (15 runs). stderr.txt (K6-2+ / Windows 2000): http://pastebin.com/NLKpNxiJ
Some observations:
- It is known, but let's make it clear: the app never hangs (at least for me) in the middle of computing; it may hang only at startup, during 'Optimal function choice'.
- wisdom.sah is 'updated' first (a fake update; only the order of the lines changes). The file's timestamp is set to 'now' 13-15 s after app start. Since this happens before 'Optimal function choice', the file is updated even when the app hangs.
- The 'Test duration' is another wrong value, for example 'Test duration 0.36 seconds' and 'Test duration 380.57 seconds'. In reality I measured 120-140 s from app start to the appearance of the 'Test duration' line (obviously only for runs that did not hang; the time is not exactly the same every run, but it is in the range of ~2 minutes on this system).
- After the app hangs, most of the CPU time/load (~70%) is in kernel mode (according to SIV and Process Explorer).
- ALF - "Find out what you don't do well ..... then don't do it!" :) |
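The mismatch between the reported 'Test duration' and the stopwatch measurement can be checked mechanically. The sketch below is only an illustration: it assumes the line format quoted in the post ('Test duration N seconds'), and `reported_durations` and `implausible` are hypothetical helper names, not part of the application.

```python
import re

# Matches lines such as "Test duration 380.57 seconds" as quoted in the post.
DURATION_RE = re.compile(r"Test duration\s+([0-9]+(?:\.[0-9]+)?)\s+seconds")


def reported_durations(stderr_text):
    """Return every 'Test duration' value found in the text, in seconds."""
    return [float(value) for value in DURATION_RE.findall(stderr_text)]


def implausible(durations, low_s, high_s):
    """Return the durations that fall outside the measured wall-clock range."""
    return [d for d in durations if not (low_s <= d <= high_s)]
```

With the 120-140 s range measured by hand on this system, both quoted values (0.36 and 380.57) would be flagged by `implausible(durations, 120, 140)`.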
Graeme Hewson Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0 |
I ran some tests, but the results were unremarkable. I've also been running SETI@Home normally without any problems, until today. Workunits 1612718929 and 1612022476 both stalled with the same symptoms as before. Unfortunately, I'm unable to reproduce the problem.

For instance, stderr.txt for 1612022476 (27mr08ah.20785.20522.438086664204.12.8) is:

    setiathome_v7 7.00 Revision: 1772 g++ (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3)
    libboinc: BOINC 7.1.0
    Work Unit Info:
    ...............
    WU true angle range is : 0.432149
    Optimal function choices:
    --------------------------------------------------------
    name timing error
    --------------------------------------------------------
    v_BaseLineSmooth (no other)
    v_vGetPowerSpectrum2 0.000204 0.00000

But when I test the WU with -verbose, it looks normal:

    11:46:43 (12672): Can't open init data file - running in standalone mode
    11:46:43 (12672): Can't open init data file - running in standalone mode
    setiathome_v7 7.00 Revision: 1772 g++ (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3)
    libboinc: BOINC 7.1.0
    Work Unit Info:
    ...............
    WU true angle range is : 0.432149
    Optimal function choices:
    --------------------------------------------------------
    name timing error
    --------------------------------------------------------
    v_BaseLineSmooth (no other)
    v_GetPowerSpectrum 0.000253 0.00000 test
    v_vGetPowerSpectrum 0.000172 0.00000 test
    v_vGetPowerSpectrum2 0.000193 0.00000 test
    v_vGetPowerSpectrumUnrolled 0.000177 0.00000 test
    v_vGetPowerSpectrumUnrolled2 0.000198 0.00000 test
    v_avxGetPowerSpectrum faulted
    v_vGetPowerSpectrum 0.000172 0.00000 choice
    v_ChirpData 0.010563 0.00000 test
    fpu_ChirpData 0.017693 0.00000 test
    fpu_opt_ChirpData 0.010357 0.00000 test
    v_vChirpData_x86_64 0.073799 0.01993 test
    sse1_ChirpData_ak 0.009647 0.00000 test
    sse1_ChirpData_ak8e 0.007629 0.00000 test
    sse1_ChirpData_ak8h 0.007942 0.00000 test
    sse2_ChirpData_ak 0.010725 0.00000 test
    sse2_ChirpData_ak8 0.006080 0.00000 test
    sse3_ChirpData_ak 0.010540 0.00000 test
    sse3_ChirpData_ak8 0.006001 0.00000 test
    avx_ChirpData_a faulted
    avx_ChirpData_b faulted
    avx_ChirpData_c faulted
    avx_ChirpData_d faulted
    sse3_ChirpData_ak8 0.006001 0.00000 choice
    v_Transpose 0.013878 0.00000 test
    v_Transpose2 0.007828 0.00000 test
    v_Transpose4 0.006351 0.00000 test
    v_Transpose8 0.011334 0.00000 test
    v_pfTranspose2 0.008176 0.00000 test
    v_pfTranspose4 0.005297 0.00000 test
    v_pfTranspose8 0.009363 0.00000 test
    v_vTranspose4 0.004491 0.00000 test
    v_vTranspose4np 0.004173 0.00000 test
    v_vTranspose4ntw 0.004860 0.00000 test
    v_vTranspose4x8ntw 0.002933 0.00000 test
    v_vTranspose4x16ntw 0.002436 0.00000 test
    v_vpfTranspose8x4ntw 0.004869 0.00000 test
    v_avxTranspose4x8ntw faulted
    v_avxTranspose4x16ntw faulted
    v_avxTranspose8x4ntw faulted
    v_avxTranspose8x8ntw_a faulted
    v_avxTranspose8x8ntw_b faulted
    v_vTranspose4x16ntw 0.002436 0.00000 choice
    FPU opt folding 0.000967 0.00000 test
    ben SSE folding 0.000801 0.00000 test
    AK SSE folding 0.000651 0.00000 test
    BH SSE folding 0.000723 0.00000 test
    JS AVX_a folding faulted
    JS AVX_c folding faulted
    AK SSE folding 0.000651 0.00000 choice
    Test duration 4.93 seconds

When I ran the tests in September, it was co-incidentally after I'd rebooted the machine after a kernel upgrade. I ran for several days without sleeping the machine overnight, but the first time I do sleep the machine after a reboot, I get messages like the following in syslog when I wake it:

    [44821.501321] TSC synchronization [CPU#0 -> CPU#1]:
    [44821.501322] Measured 5056157788 cycles TSC warp between CPUs, turning off TSC clock.
    [ 0.008000] tsc: Marking TSC unstable due to check_tsc_sync_source failed
    [ 0.008000] process: Switch to broadcast mode on CPU1
    [44811.682905] CPU1 is up

The BIOS is up-to-date (or as up-to-date as it's ever going to be, given the age of the motherboard). /proc/cpuinfo is unchanged after these messages, and in particular the flags tsc, rdtscp, constant_tsc and nonstop_tsc are present before and after sleeping.

I have a suspicion this has something to do with the problem, but as I say, I'm unable to reproduce the problem. I've also tried testing with taskset(1) to set the CPU affinity.

I'm going to try restarting the WUs. I suspect they'll complete normally. |
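To see whether a stall lines up with the kernel giving up on the TSC, the syslog can be scanned for messages like the ones quoted in this post. The following is a minimal sketch under stated assumptions: the two patterns are taken from the quoted lines only (other kernels may word these messages differently), and `tsc_warnings` is a hypothetical helper name.

```python
import re

# Patterns taken from the syslog lines quoted in the post; this is an
# assumption, other kernel versions may phrase these messages differently.
TSC_PATTERNS = (
    re.compile(r"TSC warp between CPUs"),
    re.compile(r"Marking TSC unstable"),
)

# Kernel timestamp prefix such as "[44821.501322]" or "[ 0.008000]".
STAMP_RE = re.compile(r"^\[\s*([0-9]+\.[0-9]+)\]\s*(.*)$")


def tsc_warnings(lines):
    """Return (timestamp, message) pairs for lines that report TSC trouble."""
    hits = []
    for line in lines:
        match = STAMP_RE.match(line.strip())
        if not match:
            continue
        stamp, message = float(match.group(1)), match.group(2)
        if any(pattern.search(message) for pattern in TSC_PATTERNS):
            hits.append((stamp, message))
    return hits
```

Comparing the timestamps returned here with the time a work unit stalled would show whether the stalls only occur after the first sleep/wake cycle, which is the suspicion raised in the post.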
Graeme Hewson Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0 |
And indeed, both WUs have finished normally now. |
BilBg Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
"But when I test the WU with -verbose, it looks normal" Did you try 10-20 times, or only once? It is known that the hang does not happen every time. - ALF - "Find out what you don't do well ..... then don't do it!" :) |
Graeme Hewson Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0 |
I didn't count, but I tested both stalled WUs about 10 times each, and ran similar tests in September. |
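Since the hang is intermittent, repeated runs are easier to keep track of with a small harness that applies a time limit and counts the runs that exceed it. This is only a sketch: `count_hangs` is a hypothetical helper, the real application command line is not given in the thread and has to be supplied by the tester, and the time limit is an assumption that must be chosen well above a normal startup.

```python
import subprocess
import sys

# Trivial command for trying the helper out; replace it with the real
# application command line (not given in the thread) for actual testing.
QUICK_CMD = [sys.executable, "-c", "pass"]


def count_hangs(cmd, runs=10, timeout_s=600.0):
    """Run cmd `runs` times and count how often it exceeds timeout_s.

    subprocess.run kills the child process when the timeout expires, so a
    hung run does not linger in the background.
    """
    hangs = 0
    for _ in range(runs):
        try:
            subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
                check=False,
            )
        except subprocess.TimeoutExpired:
            hangs += 1
    return hangs
```

For example, `count_hangs(QUICK_CMD, runs=15, timeout_s=60)` exercises the harness itself; a run counted as a hang only means the time limit was exceeded, so the limit should be set with a slow machine's normal startup time in mind.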
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.