Long-running work unit

Josef W. Segur
Volunteer developer
Volunteer tester

Message 1577050 - Posted: 24 Sep 2014, 13:24:59 UTC - in response to Message 1576863.  

The timing values are based on the QueryPerformanceCounter function on Windows 2000 and later. If that isn't available, there's a fallback using GetSystemTimeAsFileTime which is less precise.

How about under Linux?

The primary timing appears to use inline assembly with the rdtsc instruction, with the gettimeofday function as the fallback. There are other options for both depending on build configuration.
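As a rough illustration of why a cycle-counter timer benefits from a sanity check, here is a minimal Python sketch. It is not the application's code: the function names, the unsigned 64-bit treatment of the counter, and the 2.0 GHz tick rate are all assumptions made for the example. It converts two raw counter readings into seconds and rejects intervals that are implausibly long, which is what an unsigned difference produces if the second reading is lower than the first (for example, if the two reads came from cores whose counters disagree).

```python
from typing import Optional

MASK64 = (1 << 64) - 1  # assumption: readings are unsigned 64-bit values


def ticks_to_seconds(start: int, end: int, ticks_per_second: float) -> float:
    """Convert two raw counter readings to seconds with unsigned 64-bit wraparound."""
    delta = (end - start) & MASK64
    return delta / ticks_per_second


def checked_interval(start: int, end: int, ticks_per_second: float,
                     max_plausible_seconds: float) -> Optional[float]:
    """Return the interval in seconds, or None if it exceeds the plausibility limit."""
    seconds = ticks_to_seconds(start, end, ticks_per_second)
    if seconds > max_plausible_seconds:
        return None  # caller could fall back to a different clock here
    return seconds


if __name__ == "__main__":
    hz = 2.0e9  # hypothetical tick rate, for illustration only
    # Counter moved forward: a small, believable interval.
    print(checked_interval(1_000_000, 1_400_000, hz, 10.0))
    # Counter appears to have gone backwards: the result is rejected.
    print(checked_interval(1_400_000, 1_000_000, hz, 10.0))
```

Returning None rather than a number leaves the choice of fallback to the caller, which keeps the check separate from whichever secondary clock is available.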

If you run the test BilBg suggested, we'll at least know for sure whether the symptoms on your Linux system really match. There may be other parts of the initialization which could hang.
                                                                   Joe

BilBg
Volunteer tester
Message 1577361 - Posted: 24 Sep 2014, 22:00:34 UTC

 
I did a test today (15 runs) - stderr.txt (K6-2+ / Windows 2000)
http://pastebin.com/NLKpNxiJ


Some observations:
- It is known, but let's make it clear: the app never hangs (at least for me) in the middle of computing; it may only hang at startup, during 'Optimal function choice'.

- wisdom.sah is 'updated' first (the fake update, where only the order of lines changes).
The file's timestamp is set to 'now' 13-15 s after app start. Since this happens before 'Optimal function choice', the file is updated even when the app hangs.

- the 'Test duration' is another wrong value:
Test duration 0.36 seconds
Test duration 380.57 seconds

In reality I measured 120-140 s from app start to the appearance of the 'Test duration' line
(obviously only for runs that did not hang; the time is not exactly the same every run, but it is around 2 minutes on this system)


- After the app hangs, most of the CPU time/load (~70%) is in kernel mode (as shown by SIV and Process Explorer).

- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
Graeme Hewson
Message 1585729 - Posted: 12 Oct 2014, 11:13:05 UTC - in response to Message 1577361.  

I ran some tests, but the results were unremarkable. I've also been running SETI@home normally without any problems, until today.

Workunits 1612718929 and 1612022476 both stalled with the same symptoms as before. Unfortunately, I'm unable to reproduce the problem. For instance, stderr.txt for 1612022476 (27mr08ah.20785.20522.438086664204.12.8) is:

setiathome_v7 7.00 Revision: 1772 g++ (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3)
libboinc: BOINC 7.1.0

Work Unit Info:
...............
WU true angle range is :  0.432149
Optimal function choices:
--------------------------------------------------------
                            name   timing   error
--------------------------------------------------------
                v_BaseLineSmooth (no other)
            v_vGetPowerSpectrum2 0.000204 0.00000


But when I test the WU with -verbose, it looks normal:

11:46:43 (12672): Can't open init data file - running in standalone mode
11:46:43 (12672): Can't open init data file - running in standalone mode
setiathome_v7 7.00 Revision: 1772 g++ (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3)
libboinc: BOINC 7.1.0

Work Unit Info:
...............
WU true angle range is :  0.432149
Optimal function choices:
--------------------------------------------------------
                            name   timing   error
--------------------------------------------------------
                v_BaseLineSmooth (no other)

              v_GetPowerSpectrum 0.000253 0.00000  test
             v_vGetPowerSpectrum 0.000172 0.00000  test
            v_vGetPowerSpectrum2 0.000193 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.000177 0.00000  test
    v_vGetPowerSpectrumUnrolled2 0.000198 0.00000  test
           v_avxGetPowerSpectrum faulted
             v_vGetPowerSpectrum 0.000172 0.00000  choice

                     v_ChirpData 0.010563 0.00000  test
                   fpu_ChirpData 0.017693 0.00000  test
               fpu_opt_ChirpData 0.010357 0.00000  test
             v_vChirpData_x86_64 0.073799 0.01993  test
               sse1_ChirpData_ak 0.009647 0.00000  test
             sse1_ChirpData_ak8e 0.007629 0.00000  test
             sse1_ChirpData_ak8h 0.007942 0.00000  test
               sse2_ChirpData_ak 0.010725 0.00000  test
              sse2_ChirpData_ak8 0.006080 0.00000  test
               sse3_ChirpData_ak 0.010540 0.00000  test
              sse3_ChirpData_ak8 0.006001 0.00000  test
                 avx_ChirpData_a faulted
                 avx_ChirpData_b faulted
                 avx_ChirpData_c faulted
                 avx_ChirpData_d faulted
              sse3_ChirpData_ak8 0.006001 0.00000  choice

                     v_Transpose 0.013878 0.00000  test
                    v_Transpose2 0.007828 0.00000  test
                    v_Transpose4 0.006351 0.00000  test
                    v_Transpose8 0.011334 0.00000  test
                  v_pfTranspose2 0.008176 0.00000  test
                  v_pfTranspose4 0.005297 0.00000  test
                  v_pfTranspose8 0.009363 0.00000  test
                   v_vTranspose4 0.004491 0.00000  test
                 v_vTranspose4np 0.004173 0.00000  test
                v_vTranspose4ntw 0.004860 0.00000  test
              v_vTranspose4x8ntw 0.002933 0.00000  test
             v_vTranspose4x16ntw 0.002436 0.00000  test
            v_vpfTranspose8x4ntw 0.004869 0.00000  test
            v_avxTranspose4x8ntw faulted
           v_avxTranspose4x16ntw faulted
            v_avxTranspose8x4ntw faulted
          v_avxTranspose8x8ntw_a faulted
          v_avxTranspose8x8ntw_b faulted
             v_vTranspose4x16ntw 0.002436 0.00000  choice

                 FPU opt folding 0.000967 0.00000  test
                 ben SSE folding 0.000801 0.00000  test
                  AK SSE folding 0.000651 0.00000  test
                  BH SSE folding 0.000723 0.00000  test
                JS AVX_a folding faulted
                JS AVX_c folding faulted
                  AK SSE folding 0.000651 0.00000  choice

                   Test duration     4.93 seconds
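
In the verbose listing above, each group's 'choice' line matches the fastest 'test' line in that group. A minimal Python sketch for checking this against a saved listing follows. The parsing rules (fields split on whitespace and read from the right, a 'choice' line closing each group, 'faulted' entries skipped) are assumptions inferred from this one listing, not a documented format, and the rule the application really uses may also take the error column into account.

```python
from typing import Dict, List, Tuple

# Excerpt copied from the listing in this post.
SAMPLE = """\
              v_GetPowerSpectrum 0.000253 0.00000  test
             v_vGetPowerSpectrum 0.000172 0.00000  test
            v_vGetPowerSpectrum2 0.000193 0.00000  test
           v_avxGetPowerSpectrum faulted
             v_vGetPowerSpectrum 0.000172 0.00000  choice

                 FPU opt folding 0.000967 0.00000  test
                  AK SSE folding 0.000651 0.00000  test
                JS AVX_a folding faulted
                  AK SSE folding 0.000651 0.00000  choice
"""

Entry = Tuple[str, float]


def parse_groups(text: str) -> List[Dict[str, object]]:
    """Collect 'test' entries into groups, each closed by its 'choice' line."""
    groups: List[Dict[str, object]] = []
    tests: List[Entry] = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 4 or parts[-1] not in ("test", "choice"):
            continue  # headers, 'faulted' entries, blank lines, etc.
        try:
            timing = float(parts[-3])
        except ValueError:
            continue  # not a timing row after all
        name = " ".join(parts[:-3])  # names such as 'AK SSE folding' contain spaces
        if parts[-1] == "test":
            tests.append((name, timing))
        else:
            groups.append({"choice": (name, timing), "tests": tests})
            tests = []
    return groups


def choice_is_fastest(group: Dict[str, object]) -> bool:
    """True if the group's choice equals its lowest-timing test entry."""
    tests = group["tests"]
    if not tests:
        return False
    return min(tests, key=lambda entry: entry[1]) == group["choice"]


if __name__ == "__main__":
    for group in parse_groups(SAMPLE):
        print(group["choice"][0], choice_is_fastest(group))
```

A listing that stops partway through, like the stalled stderr.txt earlier in this post, simply yields fewer groups, so the number of groups found is a rough indicator of how far the start-up tests got.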


When I ran the tests in September, it was coincidentally just after I'd rebooted the machine following a kernel upgrade. I then ran for several days without sleeping the machine overnight. However, the first time I sleep the machine after a reboot, I get messages like the following in syslog when I wake it:

[44821.501321] TSC synchronization [CPU#0 -> CPU#1]:
[44821.501322] Measured 5056157788 cycles TSC warp between CPUs, turning off TSC clock.
[    0.008000] tsc: Marking TSC unstable due to check_tsc_sync_source failed
[    0.008000] process: Switch to broadcast mode on CPU1
[44811.682905] CPU1 is up


The BIOS is up-to-date (or as up-to-date as it's ever going to be, given the age of the motherboard). /proc/cpuinfo is unchanged after these messages, and in particular the flags tsc, rdtscp, constant_tsc and nonstop_tsc are present before and after sleeping.
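
For anyone wanting to check a machine for the same condition, here is a minimal Python sketch that scans kernel log text for TSC messages like those quoted above and, on Linux, reads the active clocksource from sysfs. The regular expressions are assumptions based only on the lines in this post; other kernel versions may word the messages differently. The sysfs path is the usual Linux location, but it may be missing on some systems, in which case the function returns None.

```python
import re
from pathlib import Path
from typing import List, Optional

# Patterns inferred from the syslog lines quoted in this post (an assumption).
WARP_RE = re.compile(r"Measured (\d+) cycles TSC warp between CPUs")
UNSTABLE_RE = re.compile(r"Marking TSC unstable")

CLOCKSOURCE_PATH = "/sys/devices/system/clocksource/clocksource0/current_clocksource"


def find_tsc_problems(log_text: str) -> List[str]:
    """Return a short description for each TSC-related warning found in the text."""
    findings: List[str] = []
    for line in log_text.splitlines():
        match = WARP_RE.search(line)
        if match:
            findings.append(f"TSC warp of {int(match.group(1))} cycles")
        elif UNSTABLE_RE.search(line):
            findings.append("TSC marked unstable")
    return findings


def current_clocksource(path: str = CLOCKSOURCE_PATH) -> Optional[str]:
    """Return the kernel's active clocksource name, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    sample = (
        "[44821.501322] Measured 5056157788 cycles TSC warp between CPUs, "
        "turning off TSC clock.\n"
        "[    0.008000] tsc: Marking TSC unstable due to check_tsc_sync_source failed\n"
    )
    print(find_tsc_problems(sample))
    print(current_clocksource())
```

Comparing the clocksource value before and after a sleep/wake cycle would show whether the kernel has stopped using the TSC for its own timekeeping, though that by itself says nothing about what a user-space rdtsc read returns.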

I have a suspicion this has something to do with the problem, but as I say, I'm unable to reproduce the problem. I've also tried testing with taskset(1) to set the CPU affinity.

I'm going to try restarting the WUs. I suspect they'll complete normally.

Graeme Hewson
Message 1585837 - Posted: 12 Oct 2014, 16:53:05 UTC - in response to Message 1585729.  

And indeed, both WUs have finished normally now.

BilBg
Volunteer tester
Message 1586322 - Posted: 13 Oct 2014, 17:08:40 UTC - in response to Message 1585729.  

But when I test the WU with -verbose, it looks normal

Did you try 10-20 times, or only once?
It is known that the hang does not happen every time.
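
Because the hang is intermittent, one way to test repeatedly is to launch the application many times with a timeout and count the runs that do not finish. A minimal Python sketch follows. The command, run count, and timeout are placeholders (assumptions), and a timeout is only a proxy for a hang, so it should be set well above the normal start-up time on the machine in question.

```python
import subprocess
import sys
from typing import Sequence


def count_hangs(cmd: Sequence[str], runs: int, timeout_s: float) -> int:
    """Run cmd `runs` times; count the runs that exceed timeout_s seconds."""
    hangs = 0
    for _ in range(runs):
        try:
            # subprocess.run kills the child and raises if the timeout expires.
            subprocess.run(cmd, timeout=timeout_s,
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)
        except subprocess.TimeoutExpired:
            hangs += 1
    return hangs


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; replace it with the
    # real application command line when testing for the hang.
    demo_cmd = [sys.executable, "-c", "pass"]
    print(count_hangs(demo_cmd, runs=3, timeout_s=30.0))
```

Counting hangs over a fixed number of runs also gives a rough failure rate, which makes it easier to judge whether about 10 clean runs is enough to rule the problem out.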
 


Graeme Hewson
Message 1586370 - Posted: 13 Oct 2014, 18:31:26 UTC - in response to Message 1586322.  

I didn't count, but I tested both stalled WUs about 10 times each, and ran similar tests in September.