Message boards :
Number crunching :
Long-running work unit
Author | Message |
---|---|
Graeme Hewson Send message Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0 |
Yes, I'm showing all tasks; there are some running and some ready to start. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
"Yes, I'm showing all tasks; there are some running and some ready to start." Well, I can't explain it then. I don't know which version of BOINC you're using, because your computers are hidden at both projects, but I'm not aware of any reported BOINC bug which would falsely report that some task is suspended when it isn't. Note that it is possible to 'suspend' a task which has completed and is ready to report - that doesn't show visibly in BOINC Manager, but I use it sometimes to defer work fetch if I want some other project to fetch first. That state clears itself automatically when the suspended task is reported. At some time - probably in the next few weeks - the v7.4.xx development line will be promoted to 'recommended' status. That version of BOINC restores better logging of the various reasons why specific classes of work may or may not be requested during a scheduler contact. If your problem persists, you might consider trying an alpha test version to obtain that extra information. I haven't tested v7.4.22 yet, but v7.4.21 seems safe and clean for most purposes. |
BilBg Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
"Yes, I'm showing all tasks; there are some running and some ready to start." The link to his computer (the one with the initial problem of a hung CPU task) is in my post: http://setiathome.berkeley.edu/forum_thread.php?id=75617&postid=1571014#1571014 - ALF - "Find out what you don't do well ..... then don't do it!" :) |
Graeme Hewson Send message Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0 |
It's OK now after I restarted Boinc. |
BilBg Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
Did you check again, in the slot directories of the running tasks, that the files are updating (stderr and state; I'm not sure about the exact names on Linux), to be sure the app has not hung? - ALF - "Find out what you don't do well ..... then don't do it!" :) |
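The check suggested above can be sketched in Python as follows. This is only an illustration: the function name `stale_files`, the 300-second threshold and the idea of passing a slot directory path are assumptions, and only the file name stderr.txt is confirmed elsewhere in this thread. A file that has not changed recently is a hint, not proof, that the app has hung.

```python
import os
import time


def stale_files(slot_dir, names=("stderr.txt",), max_age_s=300, now=None):
    """Return the names that are missing from slot_dir or were last modified
    more than max_age_s seconds ago (hypothetical helper, threshold assumed)."""
    now = time.time() if now is None else now
    stale = []
    for name in names:
        path = os.path.join(slot_dir, name)
        try:
            age = now - os.path.getmtime(path)
        except OSError:
            # A missing or unreadable file is reported as stale too.
            stale.append(name)
            continue
        if age > max_age_s:
            stale.append(name)
    return stale
```

Calling it with the path of one slot directory of a running task returns an empty list when every listed file was touched within the threshold.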
Graeme Hewson Send message Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0 |
No, why? Tasks are going through, and I aborted the WU I had trouble with. |
Graeme Hewson Send message Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0 |
It's happening again with another task. I'm about to abort it. stderr.txt has: setiathome_v7 7.00 Revision: 1782 g++ (GCC) 4.4.1 20090725 (Red Hat 4.4.1-2) libboinc: BOINC 7.1.0 Work Unit Info: ............... WU true angle range is : 1.571478 Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- I notice it's running with the i686 executable: ps -fp 4512 UID PID PPID C STIME TTY TIME CMD boinc 4512 17291 62 Sep19 ? 11:02:14 ../../projects/setiathome.berkeley.edu/setiathome_7.01_i686-pc-linux-gnu while the normally-running task on the other core is using the X86_64 executable, as I would expect: ps -fp 11855 UID PID PPID C STIME TTY TIME CMD boinc 11855 17291 91 11:09 ? 00:34:19 ../../projects/setiathome.berkeley.edu/setiathome_7.01_x86_64-pc-linux-gnu |
BilBg Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
"It's happening again with another task" Of course it will happen: it doesn't depend on the task, it depends on the app. This is a long existing problem with the 'Optimal function choices' test on AMD CPUs (sometimes the test hangs, sometimes not; on some systems it hangs a lot, on others never or rarely). The way to avoid this is to use apps which don't do the 'Optimal function choices' test (and are faster): http://lunatics.kwsn.net/index.php?module=Downloads;catd=1 - ALF - "Find out what you don't do well ..... then don't do it!" :) |
Graeme Hewson Send message Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0 |
The way I'm avoiding it is by suspending work on SETI for now. I'll look back in a few months to see if this long existing problem has been fixed. |
BilBg Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
 long existing = years It will not be fixed magically Someone have to figure out why it happens on some systems before a fix is proposed Example: It happens in 30-40% at start/restart of tasks on my K6-2+ but only when it is booted in Windows 2000 When the same system runs in Windows 98 it never happens The BOINC and SETI@home files are the same (same BOINC Program and Data directories used, same tasks continue to run) PSAPI.DLL used in both cases is from Windows 2000 It is in SETI@home directory (<BOINC_Data>\projects\setiathome.berkeley.edu\) and in app_info.xml I put these lines: <file_info> <name>Psapi.dll</name> <executable/> </file_info> ... <file_ref> <file_name>Psapi.dll</file_name> <copy_file/> </file_ref> I'm willing to test why the hang happens if someone can provide a test scenario The currently running task (only at night, on Windows 2000) shows how some nights no progress was done (Restarts at the same progress/percent as previous day) setiathome_v7 7.00 DevC++/MinGW/g++ 4.5.2 libboinc: 7.1.0 Work Unit Info: ............... WU true angle range is : 0.014060 Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_GetPowerSpectrum 0.010141 0.00000 setiathome_v7 7.00 DevC++/MinGW/g++ 4.5.2 libboinc: 7.1.0 Work Unit Info: ............... WU true angle range is : 0.014060 Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_GetPowerSpectrum 0.009980 0.00000 fpu_opt_ChirpData 0.368909 0.00000 v_Transpose 0.564497 0.00000 FPU opt folding 0.125455 0.00000 Restarted at 3.66 percent. Restarted at 7.49 percent. Restarted at 11.46 percent. Restarted at 15.09 percent. Restarted at 15.09 percent. Restarted at 18.85 percent. Restarted at 22.83 percent. Restarted at 26.47 percent. 
Restarted at 30.44 percent. Restarted at 36.80 percent. Restarted at 36.80 percent. Restarted at 43.95 percent. Restarted at 43.95 percent. Restarted at 51.01 percent. Restarted at 51.01 percent. Restarted at 58.15 percent. Restarted at 58.15 percent. Restarted at 65.34 percent. Restarted at 72.66 percent.  - ALF - "Find out what you don't do well ..... then don't do it!" :)  |
Graeme Hewson Send message Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0 |
Is the source code available? How can I reset a WU so it appears never to have run, so I can watch it start up? |
BilBg Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
 Source code is available but I don't know the proper link You don't need to "reset a WU so it appears never to have run" so you "can watch it start up" 'Optimal function choice' is done on every restart (but only the first result is printed in stderr.txt) There is a cmdline switch which makes the app print in stderr.txt all the functions tested on every restart Do an Offline Test (outside of BOINC): - Make an empty directory - Copy in it all the app files, for Windows those are: setiathome_7.00_windows_intelx86.exe libfftw3f-3-3_upx.dll - Copy some WU file and rename it to work_unit.sah - Stop BOINC and run the app with -verbose switch: setiathome_7.00_windows_intelx86.exe -verbose Now look what is written in stderr.txt (new lines have to appear every few seconds) After a few minutes kill the app process (repeat run with -verbose as many times as you like) (Part of) The results for me (run on 01.08.2013) (first posted test show hang, second finish, but both have strange big/negative numbers) 16:23:58 (812): Can't set up shared mem: -1. Will run in standalone mode. Restarted at 50.94 percent. Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_GetPowerSpectrum 0.011967 0.00000 test v_vGetPowerSpectrum not supported on CPU v_vGetPowerSpectrum2 not supported on CPU v_vGetPowerSpectrumUnrolled not supported on CPU v_vGetPowerSpectrumUnrolled2 not supported on CPU v_avxGetPowerSpectrum not supported on CPU v_GetPowerSpectrum 0.011967 0.00000 choice v_ChirpData 0.640901 0.00000 test fpu_ChirpData 573481538.838350 0.00000 test 16:28:36 (1380): Can't set up shared mem: -1. Will run in standalone mode. Restarted at 50.94 percent. 
Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_GetPowerSpectrum 0.010331 0.00000 test v_vGetPowerSpectrum not supported on CPU v_vGetPowerSpectrum2 not supported on CPU v_vGetPowerSpectrumUnrolled not supported on CPU v_vGetPowerSpectrumUnrolled2 not supported on CPU v_avxGetPowerSpectrum not supported on CPU v_GetPowerSpectrum 0.010331 0.00000 choice v_ChirpData 67253406.327791 0.00000 test fpu_ChirpData -22044769989.102539 0.00000 test fpu_opt_ChirpData 0.175478 0.00000 test v_vChirpData_x86_64 not supported on CPU sse1_ChirpData_ak not supported on CPU sse1_ChirpData_ak8e not supported on CPU sse1_ChirpData_ak8h not supported on CPU sse2_ChirpData_ak not supported on CPU sse2_ChirpData_ak8 not supported on CPU sse3_ChirpData_ak not supported on CPU sse3_ChirpData_ak8 not supported on CPU avx_ChirpData_a not supported on CPU avx_ChirpData_b not supported on CPU avx_ChirpData_c not supported on CPU avx_ChirpData_d not supported on CPU fpu_ChirpData -22044769989.102539 0.00000 choice v_Transpose 3.278891 0.00000 test v_Transpose2 140111.578261 0.00000 test v_Transpose4 68867348302.981277 0.00000 test v_Transpose8 -2.147919 0.00000 test v_pfTranspose2 not supported on CPU v_pfTranspose4 not supported on CPU v_pfTranspose8 not supported on CPU v_vTranspose4 not supported on CPU v_vTranspose4np not supported on CPU v_vTranspose4ntw not supported on CPU v_vTranspose4x8ntw not supported on CPU v_vTranspose4x16ntw not supported on CPU v_vpfTranspose8x4ntw not supported on CPU v_avxTranspose4x8ntw not supported on CPU v_avxTranspose4x16ntw not supported on CPU v_avxTranspose8x4ntw not supported on CPU v_avxTranspose8x8ntw_a not supported on CPU v_avxTranspose8x8ntw_b not supported on CPU v_Transpose8 -2.147919 0.00000 choice FPU opt folding 0.075586 0.00000 test ben SSE folding not supported on CPU AK SSE folding not supported on CPU 
BH SSE folding not supported on CPU JS AVX_a folding not supported on CPU JS AVX_c folding not supported on CPU FPU opt folding 0.075586 0.00000 choice Test duration 75.50 seconds ***** Normal 'Optimal function choice' test - no big numbers: 20:11:09 (-1710395): Can't set up shared mem: -1. Will run in standalone mode. Restarted at 71.81 percent. Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_GetPowerSpectrum 0.010295 0.00000 test v_vGetPowerSpectrum not supported on CPU v_vGetPowerSpectrum2 not supported on CPU v_vGetPowerSpectrumUnrolled not supported on CPU v_vGetPowerSpectrumUnrolled2 not supported on CPU v_avxGetPowerSpectrum not supported on CPU v_GetPowerSpectrum 0.010295 0.00000 choice v_ChirpData 0.560746 0.00000 test fpu_ChirpData 0.594549 0.00000 test fpu_opt_ChirpData 0.506013 0.00000 test v_vChirpData_x86_64 not supported on CPU sse1_ChirpData_ak not supported on CPU sse1_ChirpData_ak8e not supported on CPU sse1_ChirpData_ak8h not supported on CPU sse2_ChirpData_ak not supported on CPU sse2_ChirpData_ak8 not supported on CPU sse3_ChirpData_ak not supported on CPU sse3_ChirpData_ak8 not supported on CPU avx_ChirpData_a not supported on CPU avx_ChirpData_b not supported on CPU avx_ChirpData_c not supported on CPU avx_ChirpData_d not supported on CPU fpu_opt_ChirpData 0.506013 0.00000 choice v_Transpose 0.798243 0.00000 test v_Transpose2 0.420133 0.00000 test v_Transpose4 0.214401 0.00000 test v_Transpose8 0.361900 0.00000 test v_pfTranspose2 not supported on CPU v_pfTranspose4 not supported on CPU v_pfTranspose8 not supported on CPU v_vTranspose4 not supported on CPU v_vTranspose4np not supported on CPU v_vTranspose4ntw not supported on CPU v_vTranspose4x8ntw not supported on CPU v_vTranspose4x16ntw not supported on CPU v_vpfTranspose8x4ntw not supported on CPU v_avxTranspose4x8ntw not supported on CPU 
v_avxTranspose4x16ntw not supported on CPU v_avxTranspose8x4ntw not supported on CPU v_avxTranspose8x8ntw_a not supported on CPU v_avxTranspose8x8ntw_b not supported on CPU v_Transpose4 0.214401 0.00000 choice FPU opt folding 0.084228 0.00000 test ben SSE folding not supported on CPU AK SSE folding not supported on CPU BH SSE folding not supported on CPU JS AVX_a folding not supported on CPU JS AVX_c folding not supported on CPU FPU opt folding 0.084228 0.00000 choice Test duration 50.49 seconds ***** 'Optimal function choice' test on another CPU which do not hang (AMD Athlon II X3 455 + Windows XP) 16:26:24 (476): Can't set up shared mem: -1. Will run in standalone mode. Restarted at 18.06 percent. Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_GetPowerSpectrum 0.000382 0.00000 test v_vGetPowerSpectrum 0.000320 0.00000 test v_vGetPowerSpectrum2 0.000358 0.00000 test v_vGetPowerSpectrumUnrolled 0.000324 0.00000 test v_vGetPowerSpectrumUnrolled2 0.000340 0.00000 test v_avxGetPowerSpectrum not supported on CPU v_vGetPowerSpectrum 0.000320 0.00000 choice v_ChirpData 0.012787 0.00000 test fpu_ChirpData 0.016894 0.00000 test fpu_opt_ChirpData 0.012891 0.00000 test v_vChirpData_x86_64 0.060894 0.00000 test sse1_ChirpData_ak 0.010138 0.00000 test sse1_ChirpData_ak8e 0.008654 0.00000 test sse1_ChirpData_ak8h 0.009152 0.00000 test sse2_ChirpData_ak 0.009978 0.00000 test sse2_ChirpData_ak8 0.006273 0.00000 test sse3_ChirpData_ak 0.009179 0.00000 test sse3_ChirpData_ak8 0.006062 0.00000 test avx_ChirpData_a not supported on CPU avx_ChirpData_b not supported on CPU avx_ChirpData_c not supported on CPU avx_ChirpData_d not supported on CPU sse3_ChirpData_ak8 0.006062 0.00000 choice v_Transpose 0.021419 0.00000 test v_Transpose2 0.011473 0.00000 test v_Transpose4 0.008364 0.00000 test v_Transpose8 0.014047 0.00000 test v_pfTranspose2 
0.011766 0.00000 test v_pfTranspose4 0.007041 0.00000 test v_pfTranspose8 0.011649 0.00000 test v_vTranspose4 0.006263 0.00000 test v_vTranspose4np 0.006060 0.00000 test v_vTranspose4ntw 0.005209 0.00000 test v_vTranspose4x8ntw 0.003133 0.00000 test v_vTranspose4x16ntw 0.002851 0.00000 test v_vpfTranspose8x4ntw 0.005198 0.00000 test v_avxTranspose4x8ntw not supported on CPU v_avxTranspose4x16ntw not supported on CPU v_avxTranspose8x4ntw not supported on CPU v_avxTranspose8x8ntw_a not supported on CPU v_avxTranspose8x8ntw_b not supported on CPU v_vTranspose4x16ntw 0.002851 0.00000 choice FPU opt folding 0.003475 0.00000 test ben SSE folding 0.001164 0.00000 test AK SSE folding 0.001046 0.00000 test BH SSE folding 0.001043 0.00000 test JS AVX_a folding not supported on CPU JS AVX_c folding not supported on CPU BH SSE folding 0.001043 0.00000 choice Test duration 4.77 seconds - ALF - "Find out what you don't do well ..... then don't do it!" :) |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
"Source code is available but I don't know the proper link" http://setiathome.berkeley.edu/sah_porting.php |
Claggy Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
"for Windows those are:" Hum: "Yes, I'm running Linux. I restarted Boinc and the work unit, and the progress dropped to 0%. :-(" Claggy |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
The principle is exactly the same for Linux, including the command line switch and the workunit rename. Only the executable name will be different, and it won't need a DLL file. |
Claggy Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Although it might be easier to do it in a Bench: I've set up a KWSN-Bench-Linux-MBv7_v2.01.08 Bench for that purpose, with the setiathome_7.01_x86_64-pc-linux-gnu and setiathome_7.01_i686-pc-linux-gnu apps and the Seti v7 wisgen WU. Just download and extract the Bench program and run the 'benchmark' file in a terminal; it should only take 30 seconds or so. Afterwards, navigate to the testData directory. The two text files you'll want to look at are ref-stderr.setiathome_7.01_x86_64-pc-linux-gnu._WisGenA.wu.txt and stderr.setiathome_7.01_i686-pc-linux-gnu._WisGenA.wu.txt. My OneDrive Claggy |
BilBg Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
"for Windows those are:" I know he is running Linux, and I know he knows the app filenames (I just didn't look them up while writing the post; I described how I did the test, expecting him to find the proper filenames). From his posts the app filenames have to be: setiathome_7.01_i686-pc-linux-gnu setiathome_7.01_x86_64-pc-linux-gnu - ALF - "Find out what you don't do well ..... then don't do it!" :) |
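For Linux, the offline test setup described earlier in the thread (empty directory, copy of the app, WU renamed to work_unit.sah, run with -verbose) might be scripted roughly like this. The helper name `prepare_offline_test` and its arguments are hypothetical; the work_unit.sah name and the -verbose switch come from the posts above. The sketch only prepares the directory and returns the command, it does not stop BOINC or start the app.

```python
import os
import shutil


def prepare_offline_test(app_path, wu_path, test_dir):
    """Copy the app and one workunit into an empty test directory and return
    the command to run there (hypothetical helper, not part of BOINC)."""
    os.makedirs(test_dir, exist_ok=True)
    if os.listdir(test_dir):
        raise ValueError("test directory must be empty")
    # copy2 keeps the executable permission bits of the app.
    app_copy = shutil.copy2(app_path, test_dir)
    # The app expects the workunit under this fixed name.
    shutil.copy2(wu_path, os.path.join(test_dir, "work_unit.sah"))
    return ["./" + os.path.basename(app_copy), "-verbose"]
```

After stopping BOINC, the returned command can be run with the test directory as the working directory, for example through `subprocess.run(cmd, cwd=test_dir)`, and stderr.txt in that directory can then be inspected.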
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
BilBg wrote: ... It's a good test method, IMO probably the best that can be done without building special code. I'll think about something better which might help, but won't have time to actually do anything until next week at the earliest. The timing values are based on the QueryPerformanceCounter function on Windows 2000 and later. If that isn't available, there's a fallback using GetSystemTimeAsFileTime which is less precise. Windows' implementation of the QueryPerformanceCounter function of course must be specific to the hardware, such details are worked out cooperatively between Microsoft and the CPU manufacturers. Whether the flaw is in the implementation for some chips or the S@H hires_timer.cpp usage of the function isn't clear. The big positive or negative values could obviously be handled better, wherever they come from. Some kind of sanity check could be added, perhaps retrying a test if it produces unbelievable times. Really figuring out the cause and eliminating it would be much better, of course. Joe |
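The sanity check with retry that Joe mentions could look something like the following. It is written in Python purely for illustration and is not the S@H hires_timer.cpp code; the function name `timed_run`, the 60-second plausibility limit and the retry count are all assumptions. It would only guard against unbelievable timing values, not against the hang itself.

```python
import time


def timed_run(fn, timer=time.perf_counter, max_plausible_s=60.0, retries=3):
    """Time fn() and retry when the measured duration is negative or larger
    than max_plausible_s. Returns seconds, or None if every attempt was
    implausible (limits are assumed values, not taken from the real app)."""
    for _ in range(retries):
        start = timer()
        fn()
        elapsed = timer() - start
        if 0.0 <= elapsed <= max_plausible_s:
            return elapsed
    return None
```

Passing the timer in as a parameter makes it possible to feed the function the kind of broken readings seen in the logs above and confirm that they are rejected.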
Graeme Hewson Send message Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0 |
"The timing values are based on the QueryPerformanceCounter function on Windows 2000 and later. If that isn't available, there's a fallback using GetSystemTimeAsFileTime which is less precise." How about under Linux? |
BilBg Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
"The timing values are based on the QueryPerformanceCounter function on Windows 2000 and later. If that isn't available, there's a fallback using GetSystemTimeAsFileTime which is less precise." Try the test and report, please, if you want to help find and fix the issue (Josef W. Segur is one of the main programmers). The test will show (for your system): - how often the hang happens - whether you also see big positive/negative numbers - whether the hang happens more often at some places (e.g. when testing a particular function, or after big positive/negative numbers) - whether the hang happens for both Linux apps (32 and 64 bit) Run the apps (32 and 64 bit) at least 10 times each to have enough statistics - ALF - "Find out what you don't do well ..... then don't do it!" :) |
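When collecting those statistics over many runs, a small script can pick the strange timing values out of each stderr.txt. The sketch below assumes that the -verbose output has one function per line in the form "name timing error test" or "name timing error choice", as the flattened logs in this thread suggest; the function name `implausible_timings` and the 60-second limit are assumptions.

```python
import re

# One result row: a name (may contain spaces), timing, error, then test/choice.
ROW = re.compile(
    r"^\s*(?P<name>\S.*?)\s+(?P<timing>-?\d+\.\d+)\s+(?P<error>-?\d+\.\d+)"
    r"\s+(?P<kind>test|choice)\s*$"
)


def implausible_timings(stderr_text, max_plausible_s=60.0):
    """Return (name, timing, kind) for rows whose timing is negative or above
    max_plausible_s. The line layout and the limit are assumptions."""
    flagged = []
    for line in stderr_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # headers, "not supported on CPU" rows, and so on
        timing = float(m.group("timing"))
        if timing < 0.0 or timing > max_plausible_s:
            flagged.append((m.group("name"), timing, m.group("kind")))
    return flagged
```

Counting the flagged rows per run, separately for the 32 and 64 bit apps, would give the kind of numbers asked for above.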
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.