Need help - Seti v8 MB units running forever
Author | Message |
---|---|
Eric B Joined: 9 Mar 00 Posts: 88 Credit: 168,875,085 RAC: 762 |
openSUSE Linux 13.1 x86_64, Intel i7-3960X 6 core/HT, 32 GB RAM, NVIDIA GTX 460. Since I started crunching v8 MB units I have had this problem on one PC: the majority of work units crunch forever (I've let them run up to 60 hours, and they stay at 100% and are still crunching with no expected completion time, just "---" Remaining). I tried going back to earlier versions of BOINC but had the same problem. I did a project reset and a fresh install - no dice. Sometimes shutting down BOINC and restarting will get them to complete, sometimes not. Restarting BOINC always resets the Elapsed Time back to zero. Is this a known v8 issue? Is there anything I can do to fix this? Curious that it only affects this one system; my other two systems are fine (all 3 run the same OS). |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
I have an 8 core Google Cloud compute instance running SETI@home with Ubuntu. The only issue I had was forgetting to give the optimized apps execute permission when I first copied them over. I was running the stock apps and recently switched to the optimized AVX versions. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group (http://tinyurl.com/8y46zvu) |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13746 Credit: 208,696,464 RAC: 304 |
Are these CPU or GPU WUs or both? I did have some issues with never-ending GPU WUs a while back - in my case they would start crunching, elapsed time would start counting, but after less than a couple of % had been done the % completed would stop, and the Estimated time to completion would start climbing. I think I only had 3 or 4 of these WUs and in the end all I could do was abort them. Haven't had any since. The other issue I had related to Windows Update hogging CPU resources and causing 1 GPU WU to stall due to lack of CPU time. Since you're running Linux that shouldn't be the case, but it might be worth checking CPU usage to see that all instances are getting their needed CPU time. Grant Darwin NT |
Eric B Joined: 9 Mar 00 Posts: 88 Credit: 168,875,085 RAC: 762 |
As near as I can tell they are all CPU units. Here is a sample image showing the progress of some WUs that were started yesterday and are still showing 0%. After another 24 or 48 hours they might show 100% - it varies with the particular WU. It's 4 AM here so forgive me if I'm mumbling. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
That looks like BOINC's placebo "pseudo progress" - designed to reassure people that something's happening even when it isn't (some badly-designed science applications, at some badly-designed projects, don't report their progress at all. But not here.) Check whether anything is really happening under the hood. 1) Select one of the 'running' tasks, and click the Properties command button. Compare 'CPU time' and 'CPU time at last checkpoint' - they shouldn't be more than a minute or two apart, and 'CPU time' should be no less than, say, ~95% of the Elapsed time. 2) Use whatever the Linux equivalent of Task Manager is - 'top'? - to see if there's any loading on the CPU cores. |
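The first check above can be written down as a rule of thumb. This is only a rough sketch: the function name is made up for illustration, and the 120-second gap and 0.95 ratio are assumptions taken from the "minute or two" and "~95%" figures in the post, not values defined by BOINC. The three times have to be read off the task's Properties dialog by hand and converted to seconds.

```python
def task_looks_healthy(cpu_time, cpu_time_at_checkpoint, elapsed,
                       max_gap=120.0, min_ratio=0.95):
    """Return True if a running CPU task appears to be making real progress.

    All times are in seconds. max_gap and min_ratio are assumed thresholds
    for illustration, not values defined by BOINC.
    """
    # A task that checkpoints regularly keeps these two values close together.
    checkpoint_ok = (cpu_time - cpu_time_at_checkpoint) <= max_gap
    # A CPU task that is really crunching uses nearly all of its wall-clock time.
    cpu_ok = elapsed > 0 and (cpu_time / elapsed) >= min_ratio
    return checkpoint_ok and cpu_ok


# Hypothetical readings: 24 hours elapsed, CPU busy, but no checkpoint written.
print(task_looks_healthy(cpu_time=86000, cpu_time_at_checkpoint=0, elapsed=86400))
```

A task that fails the checkpoint test while passing the CPU test is burning cycles without saving any work, which is the pattern worth reporting.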
Eric B Joined: 9 Mar 00 Posts: 88 Credit: 168,875,085 RAC: 762 |
The CPU is fully loaded according to top. Here is the info from BOINC: |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
No checkpoint after 24 hours is not a good sign, and neither is no memory usage. If you can, look inside the directory shown just above the Process ID - that's [boinc_data_directory] / slots / 8 - and try to find a file called stderr.txt. Look to see what time the file was last modified (it should be created when the task starts running, and updated at intervals while it runs). Open the file, and paste the contents here, please. |
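For anyone who prefers to script that check, here is a minimal sketch. The helper name and the /var/lib/boinc location are assumptions for illustration only; substitute your own BOINC data directory and the slot number shown in the task's Properties.

```python
import os
import time


def stderr_age_seconds(slot_dir, now=None):
    """Return seconds since stderr.txt in slot_dir was last modified,
    or None if the file does not exist."""
    path = os.path.join(slot_dir, "stderr.txt")
    if not os.path.exists(path):
        return None
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


if __name__ == "__main__":
    # Hypothetical location: replace with your own data directory and slot.
    slot = os.path.join("/var/lib/boinc", "slots", "8")
    age = stderr_age_seconds(slot)
    if age is None:
        print("no stderr.txt in", slot)
    else:
        print("stderr.txt last modified %.0f seconds ago" % age)
```

A large age on a task that claims to be running suggests the application has stopped writing to stderr.txt.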
Eric B Joined: 9 Mar 00 Posts: 88 Credit: 168,875,085 RAC: 762 |
The stderr.txt (in full): setiathome_v8 8.00 Revision: 3290 g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4) libboinc: BOINC 7.7.0 Work Unit Info: ............... WU true angle range is : 0.009480 Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_avxGetPowerSpectrum 0.000065 0.00000 setiathome_v8 8.00 Revision: 3290 g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4) libboinc: BOINC 7.7.0 Work Unit Info: ............... WU true angle range is : 0.009480 Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_avxGetPowerSpectrum 0.000111 0.00000 setiathome_v8 8.00 Revision: 3290 g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4) libboinc: BOINC 7.7.0 Work Unit Info: ............... WU true angle range is : 0.009480 Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_avxGetPowerSpectrum 0.000101 0.00000 |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
That looks like it's looping because of "name timing error". I think we need a Linux specialist here, someone who knows the next question to ask. |
Eric B Joined: 9 Mar 00 Posts: 88 Credit: 168,875,085 RAC: 762 |
Also, do these file sizes and perms look ok (my username has been replaced with x's)? # ls -l total 72 -rw-r--r-- 1 xxxxxxxx users 0 Mar 3 06:02 boinc_lockfile -rw-rw---- 1 xxxxxxxx users 266080 Mar 3 06:02 boinc_setiathome_8 -rw-r--r-- 1 xxxxxxxx users 100 Mar 2 22:39 graphics_app -rw-r--r-- 1 xxxxxxxx users 8111 Mar 4 12:14 init_data.xml -rw-r--r-- 1 xxxxxxxx users 104 Mar 2 22:39 result.sah -rw-r--r-- 1 xxxxxxxx users 86 Mar 2 22:39 setiathome-8.00_AUTHORS -rw-r--r-- 1 xxxxxxxx users 86 Mar 2 22:39 setiathome-8.00_COPYING -rw-r--r-- 1 xxxxxxxx users 88 Mar 2 22:39 setiathome-8.00_COPYRIGHT -rw-r--r-- 1 xxxxxxxx users 85 Mar 2 22:39 setiathome-8.00_README -rw-r--r-- 1 xxxxxxxx users 98 Mar 2 22:39 setiathome_8.00_x86_64-pc-linux-gnu -rw-r--r-- 1 xxxxxxxx users 75 Mar 2 22:39 seti_logo -rw-r--r-- 1 xxxxxxxx users 77 Mar 2 22:39 sponsor_bkg -rw-r--r-- 1 xxxxxxxx users 73 Mar 2 22:39 sponsor_logo -rw-r--r-- 1 xxxxxxxx users 1356 Mar 3 06:02 stderr.txt -rw-r--r-- 1 xxxxxxxx users 6590 Mar 3 06:02 wisdom.sah -rw-r--r-- 1 xxxxxxxx users 100 Mar 2 22:39 work_unit.sah |
Brent Norman Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
I forget my Linux, but are you running as admin? Admin/user rights seem different. Shouldn't x86_64 have execute rights? |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
setiathome_8.00_x86_64-pc-linux-gnu is the main program and needs to be executable, but at 98 bytes, that can't be the real file, just a symlink. Let's assume the real file is executing properly, else nothing at all would be written to stderr.txt. I don't recognise boinc_setiathome_8, and at 266080 bytes it should be significant. Anyone? |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
Richard Haselgrove wrote: "That looks like it's looping because of 'name timing error'. I think we need a Linux specialist here, someone who knows the next question to ask." Scrub that. "name timing error" are the column headers, and the message board software has mangled the spacing. A real one looks like Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_avxGetPowerSpectrum 0.000058 0.00000 avx_ChirpData_d 0.002103 0.00000 v_vTranspose4np 0.001111 0.00000 AK SSE folding 0.000302 0.00000 It looks like yours is crashing after the 'GetPowerSpectrum' test, and restarting the program before testing ChirpData. |
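The crash-and-restart reading can be checked mechanically. This sketch assumes that each program start writes the "setiathome_v8" banner exactly once, as in the stderr.txt posted above, and that the stage names "GetPowerSpectrum" and "ChirpData" appear in the benchmark lines. The function name is hypothetical, and none of this is an official log format.

```python
BANNER = "setiathome_v8"  # assumed to appear once per program start


def summarize_runs(stderr_text, expected=("GetPowerSpectrum", "ChirpData")):
    """Split stderr text into one chunk per program start and list, for each
    chunk, which of the expected benchmark stages were logged."""
    chunks = stderr_text.split(BANNER)[1:]  # drop any text before the first banner
    return [[stage for stage in expected if stage in chunk] for chunk in chunks]


# Abbreviated from the stderr.txt posted in this thread (three starts).
stuck = (
    "setiathome_v8 8.00 Revision: 3290 v_avxGetPowerSpectrum 0.000065 0.00000 "
    "setiathome_v8 8.00 Revision: 3290 v_avxGetPowerSpectrum 0.000111 0.00000 "
    "setiathome_v8 8.00 Revision: 3290 v_avxGetPowerSpectrum 0.000101 0.00000 "
)
for number, stages in enumerate(summarize_runs(stuck), start=1):
    print("start", number, "reached:", stages)
```

Several starts that all stop at the same stage point to a crash in whatever is benchmarked next, rather than to a problem with a particular work unit.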
Eric B Joined: 9 Mar 00 Posts: 88 Credit: 168,875,085 RAC: 762 |
Brent Norman: No, not running as admin (or root, as we say in the Linux world). Just an ordinary user. I'm not sure I understand your question about x86_64, but that's the CPU mode, not a filesystem thing. |
Eric B Joined: 9 Mar 00 Posts: 88 Credit: 168,875,085 RAC: 762 |
Richard Haselgrove wrote: "setiathome_8.00_x86_64-pc-linux-gnu is the main program and needs to be executable, but at 98 bytes, that can't be the real file, just a symlink. Let's assume the real file is executing properly, else nothing at all would be written to stderr.txt" It's a BOINC link, not a symlink. In other words, it's an ordinary file that contains a reference - only BOINC does this, it isn't a standard part of Linux at all. # cat setiathome_8.00_x86_64-pc-linux-gnu <soft_link>../../projects/setiathome.berkeley.edu/setiathome_8.00_x86_64-pc-linux-gnu</soft_link> |
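To settle the execute-permission question, the permission that matters is the one on the file the link points to, not on the 98-byte link file itself. The sketch below follows a link file of the form shown above and tests the target. The function names are hypothetical, and resolving the relative path from the directory that holds the link file is an assumption based on the ../../projects/ prefix, not something confirmed in this thread.

```python
import os
import re

SOFT_LINK_RE = re.compile(r"<soft_link>(.*?)</soft_link>", re.S)


def resolve_boinc_link(link_file):
    """Return the absolute path a BOINC link file points to, or None if the
    file does not contain a <soft_link> element."""
    with open(link_file, errors="replace") as f:
        match = SOFT_LINK_RE.search(f.read())
    if match is None:
        return None
    target = match.group(1).strip()
    # Assumption: the relative path starts from the link file's own directory.
    base = os.path.dirname(os.path.abspath(link_file))
    return os.path.normpath(os.path.join(base, target))


def is_executable(path):
    """True if path is a regular file the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)
```

Calling resolve_boinc_link on the file in the slot directory and passing the result to is_executable tells you whether the real binary can be launched by the user running BOINC.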
Richard Haselgrove Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
Brent Norman: I think he was pondering the same thing as me - whether the file with x86_64 in its name needed to have execute permission. But since it is obviously launching, even if subsequently crashing, I think we can move on beyond that question. |
Juha Joined: 7 Mar 04 Posts: 388 Credit: 1,857,738 RAC: 0 |
I think this looks just like the good old "stuck in benchmarks" bug. Eric, could you unhide your hosts or give a link to this one? It would be nice to see the stderr from more tasks. |
Eric B Joined: 9 Mar 00 Posts: 88 Credit: 168,875,085 RAC: 762 |
Juha wrote: "I think this looks just like the good old stuck in benchmarks bug." OK, maybe I'm dense, but for the life of me I can't find the place where you allow others to view your computers/tasks. Is it under account? I've seen it before (a long time ago) but can't find it now. |
woohoo Joined: 30 Oct 13 Posts: 972 Credit: 165,671,404 RAC: 5 |
SETI@home preferences: "Should SETI@home show your computers on its web site?" |
Eric B Joined: 9 Mar 00 Posts: 88 Credit: 168,875,085 RAC: 762 |
I tried an experiment this afternoon and installed the Lunatics V8 MB executable just to see if that was going to have the same issue - and so far it hasn't. Of course it's going to take several days to see if this is really working, but it looks promising after 6 hours or so on Lunatics. One thing I do notice with Lunatics though, and maybe it's nothing, is that the estimated total time per work unit seems high: most are 5 hours plus, some are 14, and a few are around 2 hours. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.