Crunching appears to stop

Profile Richard Jablonski

Joined: 16 Sep 10
Posts: 9
Credit: 2,322,803
RAC: 0
United States
Message 1674845 - Posted: 7 May 2015, 20:38:49 UTC

I am having a problem with work units that start and then, after days, still show the same elapsed time and percentage done. When I look at the graphics, it shows 'choosing optimal functions'. This has been going on for about a month now. Some work units run and some do not get past a certain point. After about a week of this, I normally just abort the task and get new work units. I would rather have a fix than abort the work units.
ID: 1674845 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1674860 - Posted: 7 May 2015, 21:16:52 UTC

I see you are running an FX 8350.
Did you check your temps?
Are you running on all 8 cores?

Free at least 2 CPU cores.


With each crime and every kindness we birth our future.
ID: 1674860 · Report as offensive
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1674980 - Posted: 8 May 2015, 1:15:47 UTC - in response to Message 1674845.  
Last modified: 8 May 2015, 1:20:06 UTC

I see you are running the stock project applications.

AFAIK, the stock CPU applications 'like' to 'sleep' here and there on AMD CPUs.

Maybe it would help to install the optimised applications with the Lunatics Installer ...

Message boards : Number crunching : Optimised Applications and Other Binaries (sticky thread at the top of NC forum)
ID: 1674980 · Report as offensive
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1675083 - Posted: 8 May 2015, 7:41:57 UTC - in response to Message 1674845.  

I am having problem with work units starting and then after days it is still at the same time and percentage done.
When I look at the graphics it shows choosing optimal functions

Read here:
http://setiathome.berkeley.edu/forum_thread.php?id=75617&postid=1571978#1571978

This is a long-standing problem with the 'Optimal function choices' test on AMD CPUs
- sometimes the test hangs, sometimes not. On some systems it hangs a lot, on others never or rarely
(long-standing = years = don't expect a fix of the stock apps soon, as "Someone have to figure out why it happens on some systems before a fix is proposed")


The only real cure is using apps which don't do 'Optimal function choices' (and are faster):
Optimised Applications - Installer v0.43a
http://setiathome.berkeley.edu/forum_thread.php?id=71867&postid=1596404#1596404


After about a week of this going ... I would rather have a fix than abort the work units.

No need to wait more than a few minutes - on your CPU 'choosing optimal functions' should finish in < 30 seconds.

And no need to abort -
just Suspend/Restart the task

(If you start using Optimised Applications there will be no need for Suspend/Restart as they will never hang)

But if you stay on stock/standard apps:
- You may need to do Suspend/Restart several times (it is unclear when the test passes and when it hangs).

The hang doesn't depend on the task; it depends on the (stock) app, so it may happen for any CPU task.
This test is done before the app even looks at the task data.
But once the app gets past 'Optimal function choice' it will no longer hang during computing
(unless the task is Restarted again - any Restart (e.g. from 77%, from any %) runs 'Optimal function choice' again)

Go for Optimised Applications and you are 'cured'
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1675083 · Report as offensive
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1675086 - Posted: 8 May 2015, 8:12:23 UTC - in response to Message 1675083.  

I am having problem with work units starting and then after days it is still at the same time and percentage done.
When I look at the graphics it shows choosing optimal functions

Read here:
http://setiathome.berkeley.edu/forum_thread.php?id=75617&postid=1571978#1571978

This is a long existing problem with 'Optimal function choices' test on AMD CPUs
- sometimes the test hangs, sometimes not. On some systems it hangs a lot, on some - never/rarely
(long existing = years = don't expect fix of the stock apps soon as "Someone have to figure out why it happens on some systems before a fix is proposed")

It's not just AMD CPUs. I see this on my C2D T5500 running Ubuntu 14.04. I've tried compiling my own apps, with no change. I reported it at Lunatics with some suggestions, but I don't think anyone was interested.

Check out some of the results at Beta:

All tasks for computer 73174

<core_client_version>7.5.0</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_v7 7.28 Revision: 2834 g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2
libboinc: BOINC 7.5.0

Work Unit Info:
...............
WU true angle range is :  0.011416
Optimal function choices:
--------------------------------------------------------
                            name   timing   error
--------------------------------------------------------
                v_BaseLineSmooth (no other)

              v_GetPowerSpectrum 0.000654 0.00000  test
             v_vGetPowerSpectrum 0.000554 0.00000  test
            v_vGetPowerSpectrum2 0.000812 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.000320 0.00000  test
    v_vGetPowerSpectrumUnrolled2 0.000765 0.00000  test
           v_avxGetPowerSpectrum faulted
     v_vGetPowerSpectrumUnrolled 0.000320 0.00000  choice

                     v_ChirpData 0.023827 0.00000  test
                   fpu_ChirpData 0.055913 0.00000  test
               fpu_opt_ChirpData 0.053241 0.00000  test
             v_vChirpData_x86_64 0.608812 0.01993  test
               sse1_ChirpData_ak 8215593.620413 0.00000  test
setiathome_v7 7.28 Revision: 2834 g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2
libboinc: BOINC 7.5.0

Work Unit Info:
...............
WU true angle range is :  0.011416
Optimal function choices:
--------------------------------------------------------
                            name   timing   error
--------------------------------------------------------
setiathome_v7 7.28 Revision: 2834 g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2
libboinc: BOINC 7.5.0

Work Unit Info:
...............
WU true angle range is :  0.011416
Optimal function choices:
--------------------------------------------------------
                            name   timing   error
--------------------------------------------------------
                v_BaseLineSmooth (no other)

              v_GetPowerSpectrum 0.000393 0.00000  test
             v_vGetPowerSpectrum 0.000295 0.00000  test
            v_vGetPowerSpectrum2 0.000411 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.000326 0.00000  test
    v_vGetPowerSpectrumUnrolled2 0.000316 0.00000  test
           v_avxGetPowerSpectrum faulted
             v_vGetPowerSpectrum 0.000295 0.00000  choice

                     v_ChirpData 0.026576 0.00000  test
                   fpu_ChirpData 0.025330 0.00000  test
               fpu_opt_ChirpData 0.032680 0.00000  test
             v_vChirpData_x86_64 0.582770 0.01993  test
               sse1_ChirpData_ak 0.018779 0.00000  test
             sse1_ChirpData_ak8e 0.016703 0.00000  test
             sse1_ChirpData_ak8h 79256438.325382 0.00000  test
setiathome_v7 7.28 Revision: 2834 g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2
libboinc: BOINC 7.5.0

Work Unit Info:
...............
WU true angle range is :  0.011416
Optimal function choices:
--------------------------------------------------------
                            name   timing   error
--------------------------------------------------------
                v_BaseLineSmooth (no other)

              v_GetPowerSpectrum 0.000330 0.00000  test
             v_vGetPowerSpectrum 0.000273 0.00000  test
            v_vGetPowerSpectrum2 0.000300 0.00000  test
     v_vGetPowerSpectrumUnrolled 0.000268 0.00000  test
    v_vGetPowerSpectrumUnrolled2 0.000323 0.00000  test
           v_avxGetPowerSpectrum faulted
     v_vGetPowerSpectrumUnrolled 0.000268 0.00000  choice

                     v_ChirpData 0.028666 0.00000  test
                   fpu_ChirpData 0.021731 0.00000  test
               fpu_opt_ChirpData 0.027636 0.00000  test
             v_vChirpData_x86_64 19813884.405080 0.01993  test
               sse1_ChirpData_ak 138697187.902061 0.00000  test
             sse1_ChirpData_ak8e 0.023322 0.00000  test
             sse1_ChirpData_ak8h 0.020502 0.00000  test
               sse2_ChirpData_ak 0.028234 0.00000  test
              sse2_ChirpData_ak8 0.263413 0.00000  test
               sse3_ChirpData_ak 0.047039 0.00000  test
              sse3_ChirpData_ak8 0.016548 0.00000  test
                 avx_ChirpData_a faulted
                 avx_ChirpData_b faulted
                 avx_ChirpData_c faulted
                 avx_ChirpData_d faulted
              sse3_ChirpData_ak8 0.016548 0.00000  choice

                     v_Transpose 0.049660 0.00000  test
                    v_Transpose2 0.025243 0.00000  test
                    v_Transpose4 0.014373 0.00000  test
                    v_Transpose8 0.019599 0.00000  test
                  v_pfTranspose2 0.028664 0.00000  test
                  v_pfTranspose4 0.015308 0.00000  test
                  v_pfTranspose8 0.037967 0.00000  test
                   v_vTranspose4 0.019065 0.00000  test
                 v_vTranspose4np 0.018374 0.00000  test
                v_vTranspose4ntw 0.014543 0.00000  test
              v_vTranspose4x8ntw 0.018542 0.00000  test
             v_vTranspose4x16ntw 0.011961 0.00000  test
            v_vpfTranspose8x4ntw 0.014342 0.00000  test
            v_avxTranspose4x8ntw faulted
           v_avxTranspose4x16ntw faulted
            v_avxTranspose8x4ntw faulted
          v_avxTranspose8x8ntw_a faulted
          v_avxTranspose8x8ntw_b faulted
             v_vTranspose4x16ntw 0.011961 0.00000  choice

                 FPU opt folding 0.005224 0.00000  test
                 ben SSE folding 0.005372 0.00000  test
                  AK SSE folding 0.004457 0.00000  test
                  BH SSE folding 0.004935 0.00000  test
                JS AVX_a folding faulted
                JS AVX_c folding faulted
                  AK SSE folding 0.004457 0.00000  choice

                   Test duration    23.30 seconds


Flopcounter: 44658297320699.539062

Spike count:    8
Autocorr count: 0
Pulse count:    2
Triplet count:  0
Gaussian count: 0
07:18:01 (30888): called boinc_finish(0)

</stderr_txt>
]]>
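
A log like the one above can be screened for the implausible timings automatically. The following is a minimal sketch, assuming the row layout seen in this stderr output (name, timing, error, then "test" or "choice"). The function name find_suspicious_timings and the 60-second threshold are hypothetical choices for illustration and are not part of the SETI@home apps.

```python
import re
from typing import List, Tuple

# One benchmark row: name, timing, error, then "test" or "choice".
# Names may contain spaces (e.g. "AK SSE folding"), hence the lazy match.
ROW = re.compile(
    r"^\s*(?P<name>.+?)\s+(?P<timing>-?\d+\.\d+)\s+(?P<error>-?\d+\.\d+)"
    r"\s+(?:test|choice)\s*$"
)

# Hypothetical limit: a single benchmark entry should take far less than this.
MAX_PLAUSIBLE_SECONDS = 60.0


def find_suspicious_timings(
    log_text: str, limit: float = MAX_PLAUSIBLE_SECONDS
) -> List[Tuple[str, float]]:
    """Return (name, timing) for rows whose timing is negative or above limit."""
    suspicious = []
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header, "faulted" row, or unrelated output
        timing = float(match.group("timing"))
        if timing < 0.0 or timing > limit:
            suspicious.append((match.group("name"), timing))
    return suspicious


# Rows copied from the results quoted in this thread.
SAMPLE_LOG = """\
                     v_ChirpData 0.023827 0.00000  test
               sse1_ChirpData_ak 8215593.620413 0.00000  test
             sse1_ChirpData_ak8e -0.000188 0.00000 test
                  AK SSE folding 0.004457 0.00000  choice
           v_avxGetPowerSpectrum faulted
"""

for name, timing in find_suspicious_timings(SAMPLE_LOG):
    print(name, timing)
```

Rows marked "faulted" are skipped on purpose, since they carry no timing to judge.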


Claggy
ID: 1675086 · Report as offensive
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1675096 - Posted: 8 May 2015, 8:44:01 UTC - in response to Message 1675086.  
Last modified: 8 May 2015, 9:29:57 UTC

               sse1_ChirpData_ak 8215593.620413 0.00000  test
...
             sse1_ChirpData_ak8h 79256438.325382 0.00000  test
...
             v_vChirpData_x86_64 19813884.405080 0.01993  test
               sse1_ChirpData_ak 138697187.902061 0.00000  test

You have strange big numbers like in my tests:
http://setiathome.berkeley.edu/forum_thread.php?id=75617&postid=1576490#1576490

Joe may be interested in seeing more tests (e.g. whether you also see strange big negative numbers):
http://setiathome.berkeley.edu/forum_thread.php?id=75617&postid=1576847#1576847

My guess is that the "big negative numbers" are in fact "too big positive numbers":
http://en.wikipedia.org/wiki/2147483647#In_computing

OK, here is an example of your "big negative numbers":
http://setiweb.ssl.berkeley.edu/beta/result.php?resultid=18517750

And it results in:
"fpu_opt_ChirpData -79751928.307712 0.00000 choice"

There is also a "very small negative number", which presumably also comes from the counter wrapping:
sse1_ChirpData_ak8e -0.000188 0.00000 test

I see some warnings about TSC/RDTSC in the 'Use' section here:
http://en.wikipedia.org/wiki/Time_Stamp_Counter
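
The wraparound guess earlier in this post can be checked with a small sketch. It assumes, hypothetically, that a tick count ends up in a signed 32-bit integer somewhere in the benchmark code; nothing in this thread confirms how the stock app actually stores it. The helper name to_int32 is made up for the example.

```python
INT32_MAX = 2147483647


def to_int32(value: int) -> int:
    """Interpret the low 32 bits of value as a signed two's-complement integer."""
    value &= 0xFFFFFFFF
    if value >= 0x80000000:
        return value - 0x100000000
    return value


# Values up to INT32_MAX are unchanged.
print(to_int32(INT32_MAX))

# One tick more wraps around to the most negative value.
print(to_int32(INT32_MAX + 1))

# A count just below 2**32 reads back as a small negative number.
print(to_int32(2**32 - 1))
```

Under that assumption, a count slightly above 2147483647 reads back as a very large negative number, while a small negative number corresponds to a count just below 4294967296 (or to an end reading that is slightly lower than the start reading).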

 
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1675096 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1675111 - Posted: 8 May 2015, 9:51:40 UTC - in response to Message 1675086.  
Last modified: 8 May 2015, 10:01:15 UTC

It's not just AMD CPUs, I see this on my C2D T5500 running Ubuntu 14.04, I've tried compiling my own apps, no change, I've reported about it at Lunatics with some suggestions, don't think anyone was interested.
...


A long-standing item on the ToDo list for my own builds (both CPU and GPU of various types) has been some C++ class-based inheritance for key processing functions.

When I mentioned that I was going to be enabling builds to use various CPU FFTs, Cuda and OpenCL, with internal dispatch, Eric did express interest in having a more 'pluggable' implementation for the FFTs at least (which currently are not benched), and said that he would appreciate it if I could put the same facilities into main.

As selection of those depends on hardware, libraries, accuracy and performance, dispatch there has to be a bit more flexible and generic than the existing mechanism, so I started on the class hierarchy to include the other processing functions as well.

Since shifting build systems, Cuda7, and various BOINC issues have taken precedence, for stock that work has sat as a bare/unpopulated file/class structure that I committed quite a while back, in a folder under stock v7.

That will probably recommence on my end as soon as I've mastered the basics of the Gradle build system, and by its nature it requires redoing the benchmark code.

[Edit:] While testing Gradle backstage a little while back, we tested some of the precision timers involved in small test pieces, and they appeared to work for a range of devices/purposes without issue, so the bench code will probably receive some of that work in the end. Yep, bits and pieces everywhere to tie together.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1675111 · Report as offensive
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1675117 - Posted: 8 May 2015, 10:39:17 UTC

My 'bad tests' were on Windows 2000 (and didn't happen on the same machine under Windows 98). This may be related:

"Programs that use the QueryPerformanceCounter function may perform poorly in Windows Server 2000, in Windows Server 2003, and in Windows XP"
https://support.microsoft.com/en-us/kb/895980
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1675117 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1675126 - Posted: 8 May 2015, 11:06:45 UTC - in response to Message 1675117.  
Last modified: 8 May 2015, 11:09:16 UTC

Yeah,
underneath, most bench code uses that function (for the Windows code), which is basically just a CPU timestamp counter call and an appropriate serialising instruction. Some motherboards and Windows versions have issues, and there are some caveats with hyperthreading and such.

If the stock bench code is using the RDTSC instruction directly, or that Windows API function (on Windows builds, obviously), there would be some alternative ways to use a lower resolution counter that is not prone to these issues.

When I get to that point, I'll probably test the reliability of the timer used on the host, and use a less accurate means instead of that timestamp [where necessary]. Another possibility is that the timers in the stock variant are overflowing somehow. Since the hardware counters involved should not overflow for ~100 years or so (64 bit IIRC), it's possible that only a portion of the value is used (e.g. if it only uses 32 bits, the counter might wrap after only a couple of seconds' worth of 'ticks' on some hosts, and some benches take longer than that).

So plenty of possibilities to check out. I'd be interested in reproducing the issues in a dedicated standalone test piece down the line, allowing alternatives/fixes to be proven, especially since what I'm working towards will be heavily dependent on timers.
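
The overflow arithmetic above can be sketched as follows. The 2 GHz tick rate is only an assumed example value, and both function names are hypothetical; this does not describe what the stock bench code actually does.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def seconds_until_wrap(counter_bits: int, ticks_per_second: float) -> float:
    """Time for a free-running counter of the given width to wrap once."""
    return (2 ** counter_bits) / ticks_per_second


def elapsed_ticks_32(start: int, end: int) -> int:
    """Elapsed ticks between two 32-bit readings, tolerating a single wrap.

    Only correct if the real interval is shorter than 2**32 ticks.
    """
    return (end - start) & 0xFFFFFFFF


ASSUMED_TICK_RATE = 2.0e9  # assumption: 2 GHz timestamp counter

# A full 64-bit counter lasts far longer than any benchmark run.
print(seconds_until_wrap(64, ASSUMED_TICK_RATE) / SECONDS_PER_YEAR, "years")

# Keeping only 32 bits leaves roughly two seconds before the value wraps.
print(seconds_until_wrap(32, ASSUMED_TICK_RATE), "seconds")

# A reading taken shortly after a wrap still yields the right interval.
print(elapsed_ticks_32(0xFFFFFFF0, 0x00000010))
```

The modular subtraction only helps for intervals shorter than one wrap period, so a bench that runs longer than that would still need the full 64-bit value or a coarser timer.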
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1675126 · Report as offensive
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1675165 - Posted: 8 May 2015, 13:48:35 UTC - in response to Message 1675126.  
Last modified: 8 May 2015, 14:22:22 UTC

 
I also read this:
"Acquiring high-resolution time stamps":
https://msdn.microsoft.com/en-us/library/windows/desktop/dn553408%28v=vs.85%29.aspx

The funniest part ;) for me was:
How do I determine and validate that QPC works on my machine?
     You don't need to perform such checks.

  Sounds like:
Stormtrooper: Let me see your identification.
Ben Obi-Wan Kenobi: [with a small wave of his hand] You don't need to see his identification.
Stormtrooper: We don't need to see his identification.
Ben Obi-Wan Kenobi: These aren't the droids you're looking for.
http://www.imdb.com/title/tt0076759/quotes


They also say:
Do I need to set the thread affinity to a single core to use QPC?
     No. ...

But the first user response at the end:
"
Is the performance counter monotonic (non-decreasing)?
Yes

Would be nice if the above was true, but it's not. I can't speak to the technical reasons why this happened in my application, but I found that in order to have monotonic behavior, the thread affinity had to be set. Specifically, the counter was monotonic on only a single CPU core, and affinity had to be set to deal with this. As I recall, this was in a Windows XP SP3 virtual machine running in VirtualBox. All four hyper-threaded cores on my Core i7 were available to the guest (i.e. 8 logical CPUs). Maybe the newer versions of Windows work around this; I don't know.

Many others have observed this behavior...

.NET Stopwatch class (based on QPC): http://stackoverflow.com/questions/1008345/system-diagnostics-stopwatch-returns-negative-numbers-in-elapsed-properties

Note it can be proven by decompiling/examining .NET 2.0 to 4.0 sources to see that MSFT themselves added a hack to protect against returning negative durations in their stopwatch class (see my answer in link above). So you can't say "yes" that the QPC function is monotonic when the NETFX programmers themselves coded to protect against that, and lots of devs observe that it's not.
"
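
The kind of guard described in that response (never returning a negative duration) might look roughly like this. It is a minimal sketch, not the actual .NET Stopwatch code; clamp_elapsed and time_call are hypothetical names, and time.monotonic is Python's standard monotonic clock.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def clamp_elapsed(start: float, end: float) -> float:
    """Return end - start, but never a negative duration."""
    return max(0.0, end - start)


def time_call(func: Callable[[], T]) -> Tuple[T, float]:
    """Run func once and return (result, elapsed seconds)."""
    start = time.monotonic()
    result = func()
    end = time.monotonic()
    return result, clamp_elapsed(start, end)


# A backwards step in the readings is reported as zero instead of negative.
print(clamp_elapsed(10.0, 9.5))

result, seconds = time_call(lambda: sum(range(100000)))
print(result, seconds)
```

Clamping hides the symptom rather than fixing the timer, so a benchmark that relies on it would still do well to treat a zero result as suspect.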
 
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1675165 · Report as offensive
Profile Richard Jablonski

Joined: 16 Sep 10
Posts: 9
Credit: 2,322,803
RAC: 0
United States
Message 1675180 - Posted: 8 May 2015, 14:59:28 UTC - in response to Message 1675083.  

Thank you, I installed the installer app.
ID: 1675180 · Report as offensive



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.