Need help - Seti v8 MB units running forever

Message boards : Number crunching : Need help - Seti v8 MB units running forever

Eric B
Joined: 9 Mar 00
Posts: 88
Credit: 168,875,085
RAC: 762
United States
Message 1769374 - Posted: 4 Mar 2016, 1:31:24 UTC

openSUSE 13.1 x86_64, Intel i7-3960X (6 cores, HT), 32 GB RAM, NVIDIA GTX 460

Since I started crunching v8 MB units I have had a problem on one PC where the majority of work units crunch forever. I've let them run for up to 60 hours: they stay at 100% and are still crunching with no expected completion time, just "---" under Remaining. I tried going back to earlier versions of BOINC but had the same problem. I did a project reset and a fresh install, no dice. Sometimes shutting down BOINC and restarting gets them to complete, sometimes not. Restarting BOINC always resets the Elapsed Time to zero.
Is this a known v8 issue? Is there anything I can do to fix it? Curiously, it only affects this one system. My other two systems are fine (all three run the same OS).
ID: 1769374

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1769382 - Posted: 4 Mar 2016, 1:55:04 UTC

I have an 8-core Google Cloud compute instance running SETI@home on Ubuntu. The only issue I had was forgetting to give the optimized apps execute permission when I first copied them over. I was running the stock apps and recently switched to the optimized AVX versions.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1769382

Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1769383 - Posted: 4 Mar 2016, 1:59:04 UTC - in response to Message 1769374.  

Are these CPU or GPU WUs or both?

I did have some issues with never-ending GPU WUs a while back. In my case they would start crunching and the elapsed time would start counting, but after less than a couple of percent had been done the % completed would stop and the estimated time to completion would start climbing.
I think I only had 3 or 4 of these WUs, and in the end all I could do was abort them. Haven't had any since.

The other issue I had was Windows Update hogging CPU resources and causing one GPU WU to stall for lack of CPU time. Since you're running Linux that shouldn't be the case, but it might be worth checking CPU usage to see that all instances are getting the CPU time they need.
Grant
Darwin NT
ID: 1769383

Eric B
Joined: 9 Mar 00
Posts: 88
Credit: 168,875,085
RAC: 762
United States
Message 1769467 - Posted: 4 Mar 2016, 12:18:53 UTC - in response to Message 1769383.  
Last modified: 4 Mar 2016, 12:19:46 UTC

As near as I can tell they are all CPU units.
Here is a sample image showing the progress of some WUs that were started yesterday and are still showing 0%. After another 24 or 48 hours they might show 100%; it varies with the particular WU. It's 4 AM here, so forgive me if I'm mumbling.

ID: 1769467

Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1769469 - Posted: 4 Mar 2016, 12:33:25 UTC - in response to Message 1769467.  

That looks like BOINC's placebo "pseudo progress", designed to reassure people that something's happening even when it isn't. (Some badly designed science applications, at some badly designed projects, don't report their progress at all. But not here.)

Check whether anything is really happening under the hood.

1) Select one of the 'running' tasks, and click the Properties command button. Compare 'CPU time' and 'CPU time at last checkpoint': they shouldn't be more than a minute or two apart, and CPU time should be no less than, say, ~95% of the Elapsed time.

2) Use whatever the Linux equivalent of Task Manager is - 'top'? - to see if there's any load on the CPU cores.
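
For anyone who wants check 1 written down as a rule, here is a minimal Python sketch. The function name `looks_stalled` and both thresholds are assumptions based on the rough figures above ("a minute or two", "~95%"); nothing here is a BOINC API, and the three values are simply copied by hand from the Properties dialog, in seconds.

```python
def looks_stalled(elapsed_s, cpu_s, cpu_at_checkpoint_s,
                  max_checkpoint_gap_s=120.0, min_cpu_fraction=0.95):
    """Return a list of warnings for one running task (empty list = looks healthy).

    All inputs are seconds, copied by hand from the BOINC Manager
    Properties dialog. The thresholds are rules of thumb, not BOINC settings.
    """
    warnings = []
    # A healthy task checkpoints regularly, so the two CPU times stay close.
    if cpu_s - cpu_at_checkpoint_s > max_checkpoint_gap_s:
        warnings.append("no recent checkpoint")
    # A CPU task should spend nearly all of its wall-clock time on the CPU.
    if elapsed_s > 0 and cpu_s / elapsed_s < min_cpu_fraction:
        warnings.append("CPU time lags elapsed time")
    return warnings


# Example: 24 hours elapsed on a fully loaded core, but never checkpointed.
print(looks_stalled(elapsed_s=86400, cpu_s=86000, cpu_at_checkpoint_s=0))
```

A task that is burning CPU but never checkpointing trips only the first warning, which is the pattern worth looking for here.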
ID: 1769469

Eric B
Joined: 9 Mar 00
Posts: 88
Credit: 168,875,085
RAC: 762
United States
Message 1769504 - Posted: 4 Mar 2016, 16:58:15 UTC - in response to Message 1769469.  

The CPU is fully loaded according to top. Here is the info from BOINC:
ID: 1769504

Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1769506 - Posted: 4 Mar 2016, 17:13:33 UTC - in response to Message 1769504.  

No checkpoint after 24 hours is not a good sign, and neither is no memory usage.

If you can, look inside the directory shown just above the Process ID - that's

[boinc_data_directory] / slots / 8

and try to find a file called stderr.txt

Look to see what time the file was last modified (it should be created when the task starts running, and updated at intervals while it runs). Open the file, and paste the contents here, please.
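
If it is easier than poking around in a file manager, the same check can be scripted. This is only a sketch: `describe_stderr` is a hypothetical helper name, and the slot directory has to be supplied by you (the `slots/8` directory mentioned above, under wherever your BOINC data directory lives).

```python
import os
import time


def describe_stderr(slot_dir):
    """Return (age_in_seconds, contents) for stderr.txt in a BOINC slot directory."""
    path = os.path.join(slot_dir, "stderr.txt")
    # Age = now minus the file's last-modified time.
    age_s = time.time() - os.path.getmtime(path)
    # errors="replace" keeps the read from failing on odd bytes in the log.
    with open(path, "r", errors="replace") as fh:
        return age_s, fh.read()
```

Call it with the slot directory path, print the age in minutes and the text, and paste the text here. A file that has not been touched for many hours while the task claims to be running points the same way as the missing checkpoint.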
ID: 1769506

Eric B
Joined: 9 Mar 00
Posts: 88
Credit: 168,875,085
RAC: 762
United States
Message 1769550 - Posted: 4 Mar 2016, 20:07:04 UTC - in response to Message 1769506.  

The stderr.txt (in full):

setiathome_v8 8.00 Revision: 3290 g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
libboinc: BOINC 7.7.0

Work Unit Info:
...............
WU true angle range is : 0.009480
Optimal function choices:
--------------------------------------------------------
name timing error
--------------------------------------------------------
v_BaseLineSmooth (no other)
v_avxGetPowerSpectrum 0.000065 0.00000
setiathome_v8 8.00 Revision: 3290 g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
libboinc: BOINC 7.7.0

Work Unit Info:
...............
WU true angle range is : 0.009480
Optimal function choices:
--------------------------------------------------------
name timing error
--------------------------------------------------------
v_BaseLineSmooth (no other)
v_avxGetPowerSpectrum 0.000111 0.00000
setiathome_v8 8.00 Revision: 3290 g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
libboinc: BOINC 7.7.0

Work Unit Info:
...............
WU true angle range is : 0.009480
Optimal function choices:
--------------------------------------------------------
name timing error
--------------------------------------------------------
v_BaseLineSmooth (no other)
v_avxGetPowerSpectrum 0.000101 0.00000
ID: 1769550

Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1769553 - Posted: 4 Mar 2016, 20:17:49 UTC - in response to Message 1769550.  

That looks like it's looping because of "name timing error". I think we need a Linux specialist here, someone who knows the next question to ask.
ID: 1769553

Eric B
Joined: 9 Mar 00
Posts: 88
Credit: 168,875,085
RAC: 762
United States
Message 1769554 - Posted: 4 Mar 2016, 20:21:20 UTC - in response to Message 1769550.  

Also, do these file sizes and perms look ok (my username has been replaced with x's)?
# ls -l
total 72
-rw-r--r-- 1 xxxxxxxx users      0 Mar  3 06:02 boinc_lockfile
-rw-rw---- 1 xxxxxxxx users 266080 Mar  3 06:02 boinc_setiathome_8
-rw-r--r-- 1 xxxxxxxx users    100 Mar  2 22:39 graphics_app
-rw-r--r-- 1 xxxxxxxx users   8111 Mar  4 12:14 init_data.xml
-rw-r--r-- 1 xxxxxxxx users    104 Mar  2 22:39 result.sah
-rw-r--r-- 1 xxxxxxxx users     86 Mar  2 22:39 setiathome-8.00_AUTHORS
-rw-r--r-- 1 xxxxxxxx users     86 Mar  2 22:39 setiathome-8.00_COPYING
-rw-r--r-- 1 xxxxxxxx users     88 Mar  2 22:39 setiathome-8.00_COPYRIGHT
-rw-r--r-- 1 xxxxxxxx users     85 Mar  2 22:39 setiathome-8.00_README
-rw-r--r-- 1 xxxxxxxx users     98 Mar  2 22:39 setiathome_8.00_x86_64-pc-linux-gnu
-rw-r--r-- 1 xxxxxxxx users     75 Mar  2 22:39 seti_logo
-rw-r--r-- 1 xxxxxxxx users     77 Mar  2 22:39 sponsor_bkg
-rw-r--r-- 1 xxxxxxxx users     73 Mar  2 22:39 sponsor_logo
-rw-r--r-- 1 xxxxxxxx users   1356 Mar  3 06:02 stderr.txt
-rw-r--r-- 1 xxxxxxxx users   6590 Mar  3 06:02 wisdom.sah
-rw-r--r-- 1 xxxxxxxx users    100 Mar  2 22:39 work_unit.sah
ID: 1769554

Brent Norman Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1769556 - Posted: 4 Mar 2016, 20:34:58 UTC - in response to Message 1769554.  

I forget my Linux, but?

Are you running as admin? Admin/user rights seem different.

Shouldn't x86_64 have execute rights?
ID: 1769556

Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1769557 - Posted: 4 Mar 2016, 20:35:07 UTC - in response to Message 1769554.  

setiathome_8.00_x86_64-pc-linux-gnu is the main program and needs to be executable, but at 98 bytes, that can't be the real file, just a symlink. Let's assume the real file is executing properly, else nothing at all would be written to stderr.txt

I don't recognise boinc_setiathome_8, and at 266080 bytes it should be significant. Anyone?
ID: 1769557

Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1769559 - Posted: 4 Mar 2016, 20:52:05 UTC - in response to Message 1769553.  

That looks like it's looping because of "name timing error". I think we need a Linux specialist here, someone who knows the next question to ask.

Scrub that. "name timing error" are the column headers, and the message board software has mangled the spacing.

A real one looks like

Optimal function choices:
--------------------------------------------------------
                            name   timing   error
--------------------------------------------------------
                v_BaseLineSmooth (no other)
           v_avxGetPowerSpectrum 0.000058 0.00000 
                 avx_ChirpData_d 0.002103 0.00000 
                 v_vTranspose4np 0.001111 0.00000 
                  AK SSE folding 0.000302 0.00000 

It looks like yours is crashing after the 'GetPowerSpectrum' test, and restarting the program before testing ChirpData.
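
To see how often this is happening across a longer stderr.txt, something like the Python sketch below could count the restarts. It is a rough illustration only: the helper names are hypothetical, the `setiathome_v8` banner prefix is taken from the log posted earlier in this thread, and the rule "a result row ends in two numbers" is an assumption about the layout, not a documented format.

```python
def _is_number(s):
    """True if s parses as a float."""
    try:
        float(s)
        return True
    except ValueError:
        return False


def summarize_restarts(stderr_text, banner_prefix="setiathome_v8"):
    """Split stderr.txt into one entry per application start.

    Returns a list with, for each start, the names of the functions whose
    benchmark result row was written before the next start.
    """
    runs = []
    for line in stderr_text.splitlines():
        fields = line.split()
        if line.startswith(banner_prefix):
            runs.append([])  # a new banner means the app started again
        elif (runs and len(fields) >= 3
              and _is_number(fields[-1]) and _is_number(fields[-2])):
            # Result rows end in two numbers (timing, error); the rest is the name.
            runs[-1].append(" ".join(fields[:-2]))
    return runs


sample = """\
setiathome_v8 8.00 Revision: 3290 g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
v_BaseLineSmooth (no other)
v_avxGetPowerSpectrum 0.000065 0.00000
setiathome_v8 8.00 Revision: 3290 g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
v_BaseLineSmooth (no other)
v_avxGetPowerSpectrum 0.000111 0.00000
"""
runs = summarize_restarts(sample)
print(len(runs), "starts; last test reached:", runs[-1][-1])
```

If every start stops after the same function, that narrows down where the application is dying.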
ID: 1769559

Eric B
Joined: 9 Mar 00
Posts: 88
Credit: 168,875,085
RAC: 762
United States
Message 1769561 - Posted: 4 Mar 2016, 20:55:50 UTC - in response to Message 1769556.  
Last modified: 4 Mar 2016, 20:56:58 UTC

Brent Norman:
No, not running as admin (or root, as we say in the Linux world), just an ordinary user. I'm not sure I understand your question about x86_64, but that's the CPU mode, not a filesystem thing.
ID: 1769561

Eric B
Joined: 9 Mar 00
Posts: 88
Credit: 168,875,085
RAC: 762
United States
Message 1769563 - Posted: 4 Mar 2016, 20:59:15 UTC - in response to Message 1769557.  

setiathome_8.00_x86_64-pc-linux-gnu is the main program and needs to be executable, but at 98 bytes, that can't be the real file, just a symlink. Let's assume the real file is executing properly, else nothing at all would be written to stderr.txt

I don't recognise boinc_setiathome_8, and at 266080 bytes it should be significant. Anyone?


It's a BOINC link, not a symlink. In other words, it's an ordinary file that contains a reference. Only BOINC does this; it isn't a standard part of Linux at all.

# cat setiathome_8.00_x86_64-pc-linux-gnu
<soft_link>../../projects/setiathome.berkeley.edu/setiathome_8.00_x86_64-pc-linux-gnu</soft_link>
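
Since the execute-permission question keeps coming up, here is a small Python sketch that follows a link file like the one above to the real binary and checks whether it is executable. The helper names are hypothetical, and treating the reference as relative to the slot directory is an assumption based on the `../../projects/...` path shown in the output above.

```python
import os
import re


def resolve_boinc_link(link_path):
    """Return the real path named by a BOINC <soft_link> file, or None if
    the file does not contain one (for example, if it is the real binary)."""
    with open(link_path, "rb") as fh:
        head = fh.read(4096).decode("utf-8", errors="replace")
    match = re.search(r"<soft_link>(.*?)</soft_link>", head)
    if match is None:
        return None
    # Assumption: the reference is relative to the slot directory.
    slot_dir = os.path.dirname(os.path.abspath(link_path))
    return os.path.normpath(os.path.join(slot_dir, match.group(1).strip()))


def is_executable(path):
    """True if the current user may execute the file at path."""
    return os.path.isfile(path) and os.access(path, os.X_OK)
```

Run `resolve_boinc_link` on the 98-byte file in the slot directory, then pass the result to `is_executable`. The link file itself does not need the execute bit; the file it points to does.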
ID: 1769563

Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1769564 - Posted: 4 Mar 2016, 21:00:13 UTC - in response to Message 1769561.  

Brent Norman:
No, not running as admin (or root, as we say in the Linux world), just an ordinary user. I'm not sure I understand your question about x86_64, but that's the CPU mode, not a filesystem thing.

I think he was pondering the same thing as me - whether the file with x86_64 in its name needed to have execute permission. But since it is obviously launching, even if subsequently crashing, I think we can move on beyond that question.
ID: 1769564

Juha
Volunteer tester
Joined: 7 Mar 04
Posts: 388
Credit: 1,857,738
RAC: 0
Finland
Message 1769579 - Posted: 4 Mar 2016, 21:30:24 UTC - in response to Message 1769564.  

I think this looks just like the good old stuck in benchmarks bug.

Eric, could you unhide your hosts or give a link to this one. It would be nice to see the stderr from more tasks.
ID: 1769579

Eric B
Joined: 9 Mar 00
Posts: 88
Credit: 168,875,085
RAC: 762
United States
Message 1769629 - Posted: 5 Mar 2016, 4:11:21 UTC - in response to Message 1769579.  

I think this looks just like the good old stuck in benchmarks bug.

Eric, could you unhide your hosts or give a link to this one. It would be nice to see the stderr from more tasks.

OK, maybe I'm dense, but for the life of me I can't find the place where you allow others to view your computers/tasks. Is it under account? I've seen it before (a long time ago) but can't find it now.
ID: 1769629

woohoo
Volunteer tester
Joined: 30 Oct 13
Posts: 972
Credit: 165,671,404
RAC: 5
United States
Message 1769631 - Posted: 5 Mar 2016, 4:16:16 UTC

SETI@home preferences | Should SETI@home show your computers on its web site?
ID: 1769631

Eric B
Joined: 9 Mar 00
Posts: 88
Credit: 168,875,085
RAC: 762
United States
Message 1769632 - Posted: 5 Mar 2016, 4:17:35 UTC - in response to Message 1769629.  

I tried an experiment this afternoon and installed the Lunatics V8 MB executable just to see if it would have the same issue, and so far it hasn't. Of course it's going to take several days to see whether this is really working, but it looks promising after 6 hours or so on Lunatics. One thing I do notice with Lunatics, though, and maybe it's nothing: the estimated total time per work unit seems high. Most are 5 hours plus, some are 14, and a few are around 2 hours.
ID: 1769632
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.