Long-running work unit

Graeme Hewson

Joined: 14 Jun 99
Posts: 19
Credit: 242,802
RAC: 0
United Kingdom
Message 1570837 - Posted: 12 Sep 2014, 5:20:06 UTC

I have a work unit where the estimated remaining time field is empty. It took about 40 hours to get to 99.999% complete, and was still running after 48 hours when I suspended it. The application is SETI@home v7 7.01. Most of my work units take about 2 hours to complete.

What should I do? Abort it?
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1570863 - Posted: 12 Sep 2014, 7:46:15 UTC - in response to Message 1570837.  
Last modified: 12 Sep 2014, 7:47:35 UTC

I would reboot the PC and see what the task does.

It looks like the task couldn't finish correctly.

If you open the Processes tab in Task Manager, how much CPU time is this task using?

Is it an AMD CPU? AFAIK, with stock apps it can happen that a task doesn't finish correctly.

After the reboot I would wait a while; if the task still doesn't finish correctly, I would abort it.

If you are an advanced user, you could install »opti apps«.
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1570875 - Posted: 12 Sep 2014, 8:47:57 UTC - in response to Message 1570837.  
Last modified: 12 Sep 2014, 8:49:08 UTC

Try restarting BOINC, or restarting the host. Your computers are hidden, but since the only 7.01 app is for Linux, I'm assuming you're running Linux and that you got BOINC from the repository.

Setiathome Applications

Claggy
Graeme Hewson

Joined: 14 Jun 99
Posts: 19
Credit: 242,802
RAC: 0
United Kingdom
Message 1570916 - Posted: 12 Sep 2014, 11:31:30 UTC - in response to Message 1570875.  

Yes, I'm running Linux. I restarted Boinc and the work unit, and the progress dropped to 0%. :-(

At least it now has a Remaining estimate (rather longer than usual, about 3.5 hours). Looks good so far.

It sounds as though this is a known problem, then?
Graeme Hewson

Joined: 14 Jun 99
Posts: 19
Credit: 242,802
RAC: 0
United Kingdom
Message 1570986 - Posted: 12 Sep 2014, 14:59:13 UTC - in response to Message 1570916.  

The original problem seems to be back. The elapsed time is about 3.5 hours now (which was the estimated remaining time at first), but the progress is only about 60%, while the estimated remaining time has gone to being shown as "---".

Abort it, I guess, unless there's anything I can do to debug it?
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1570987 - Posted: 12 Sep 2014, 15:04:17 UTC

If I have a troubled task, I check whether any of the wingmates that were also assigned the task had issues. If not, I'll give it one more go before calling it quits.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
Graeme Hewson

Joined: 14 Jun 99
Posts: 19
Credit: 242,802
RAC: 0
United Kingdom
Message 1571002 - Posted: 12 Sep 2014, 15:42:25 UTC - in response to Message 1570987.  

How would I find/alert the wingmates? The WU's name is 20fe09af.20377.5815.438086664197.12.252.vlar_0.

As far as I can see, it's doing now what it did the first time. What are the chances of it succeeding on the third try?
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1571012 - Posted: 12 Sep 2014, 16:21:09 UTC
Last modified: 12 Sep 2014, 16:21:32 UTC

Go to your list of computers
http://setiathome.berkeley.edu/hosts_user.php
Select Tasks for that machine.
In the Task column choose Show names.
Locate the task & click on the Work Unit ID next to it in the list.
Now you can look at the Task details for the wingmates that have completed it.
BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1571014 - Posted: 12 Sep 2014, 16:27:54 UTC - in response to Message 1571002.  

How would I find/alert the wingmates? The WU's name is 20fe09af.20377.5815.438086664197.12.252.vlar_0


It was Completed OK by another (Mac OS) computer:
http://setiathome.berkeley.edu/workunit.php?wuid=1589164773

And for reference of other readers - your computer is:
http://setiathome.berkeley.edu/show_host_detail.php?hostid=7306664

You didn't answer an earlier question:
Do you see CPU load by this process?

(and vlar tasks take more time)
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1571055 - Posted: 12 Sep 2014, 17:49:59 UTC - in response to Message 1570916.  

Yes, I'm running Linux. I restarted Boinc and the work unit, and the progress dropped to 0%. :-(

At least it now has a Remaining estimate (rather longer than usual, about 3.5 hours). Looks good so far.

It sounds as though this is a known problem, then?

The latest BOINC versions estimate progress if the app doesn't report its progress.
That can lead to progress being shown even if the app isn't making any actual progress.

Claggy
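
As an illustration of Claggy's point, here is a minimal Python sketch of how a purely time-based progress estimate can keep climbing while the app is stuck. The exponential formula and the function name `simulated_fraction_done` are assumptions made for this example; the thread does not say which formula BOINC actually uses.

```python
import math


def simulated_fraction_done(elapsed_hours: float, estimated_hours: float) -> float:
    """Hypothetical stand-in for a client-side progress estimate.

    The value rises toward 1.0 as elapsed time grows but never reaches it,
    whether or not the application is doing any useful work.
    """
    if estimated_hours <= 0:
        raise ValueError("estimated_hours must be positive")
    return 1.0 - math.exp(-elapsed_hours / estimated_hours)


if __name__ == "__main__":
    # Elapsed times mentioned in the thread; 3.5 h is the estimate shown after the restart.
    for elapsed in (3.5, 40.0):
        percent = 100.0 * simulated_fraction_done(elapsed, 3.5)
        print(f"{elapsed:5.1f} h elapsed -> {percent:.3f}% shown")
```

With the 3.5-hour estimate, this assumed formula shows roughly 63% after 3.5 hours and 99.999% after 40 hours, without the app having to report anything.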
Graeme Hewson

Joined: 14 Jun 99
Posts: 19
Credit: 242,802
RAC: 0
United Kingdom
Message 1571074 - Posted: 12 Sep 2014, 18:25:04 UTC - in response to Message 1571014.  

Do you see CPU load by this process?


Yes, the "top" command shows a normal CPU load for the task. It's an AMD CPU.
BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1571270 - Posted: 13 Sep 2014, 2:09:28 UTC - in response to Message 1571074.  

Then you can check whether it is really progressing (or has hung).

Go to <BOINC_Data>\slots\
and find the relevant slot #
(the WU name is in work_unit.sah).

Sort the file list by date and check whether the most recently updated files have a near-current time.


Look in the following files for rows like these:

boinc_task_state.xml (this file is written by BOINC, may not show the real progress)
<fraction_done>0.850431</fraction_done>

state.sah (this file is written by the app, this is the checkpoint file)
<prog>0.85040579</prog>

(the above shows ~85% done)
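
The checks described above could be scripted roughly as follows. This is a sketch in Python, not part of BOINC; the helper names are made up for illustration, and the slot directory has to be passed on the command line (on the Linux host in this thread it is presumably somewhere under /var/lib/boinc-client/slots/, but that location is an assumption).

```python
import re
import sys
import time
from pathlib import Path
from typing import Optional

# Plain decimal or exponent notation, e.g. 0.850431 or 8.5e-1.
NUMBER = r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?"


def read_tag(text: str, tag: str) -> Optional[float]:
    """Return the number inside the first <tag>...</tag> in text, or None."""
    pattern = rf"<{re.escape(tag)}>\s*({NUMBER})\s*</{re.escape(tag)}>"
    match = re.search(pattern, text)
    return float(match.group(1)) if match else None


def slot_progress(slot_dir: Path) -> dict:
    """Read BOINC's and the app's own progress values from one slot directory."""
    sources = (("boinc_task_state.xml", "fraction_done"), ("state.sah", "prog"))
    result = {}
    for filename, tag in sources:
        path = slot_dir / filename
        if path.is_file():
            result[filename] = read_tag(path.read_text(errors="replace"), tag)
        else:
            result[filename] = None  # e.g. no checkpoint has been written yet
    return result


def files_by_mtime(slot_dir: Path) -> list:
    """Return (name, mtime) pairs for the files in slot_dir, newest first."""
    entries = [(p.name, p.stat().st_mtime) for p in slot_dir.iterdir() if p.is_file()]
    return sorted(entries, key=lambda item: item[1], reverse=True)


def main(argv: list) -> None:
    if len(argv) != 2:
        print("usage: slot_check.py <path to slot directory>")
        return
    slot = Path(argv[1])
    now = time.time()
    for name, mtime in files_by_mtime(slot):
        print(f"{name:30s} last written {(now - mtime) / 60:8.1f} min ago")
    for filename, value in slot_progress(slot).items():
        shown = "missing" if value is None else f"{100 * value:.1f}% done"
        print(f"{filename}: {shown}")


if __name__ == "__main__":
    main(sys.argv)
```

If state.sah is missing, or its value stays put while the one in boinc_task_state.xml keeps rising, the progress shown by BOINC is probably not coming from the app.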
 


Graeme Hewson

Joined: 14 Jun 99
Posts: 19
Credit: 242,802
RAC: 0
United Kingdom
Message 1571799 - Posted: 14 Sep 2014, 7:49:24 UTC - in response to Message 1571270.  

There's no state file. The only file being updated is boinc_setiathome_2, and this is updated whether or not the WU is running. stderr.txt says:

setiathome_v7 7.00 Revision: 1772 g++ (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3)
libboinc: BOINC 7.1.0

Work Unit Info:
...............
WU true angle range is :  0.014299
Optimal function choices:
--------------------------------------------------------
                            name   timing   error
--------------------------------------------------------
setiathome_v7 7.00 Revision: 1772 g++ (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3)
libboinc: BOINC 7.1.0

Work Unit Info:
...............
WU true angle range is :  0.014299
Optimal function choices:
--------------------------------------------------------
                            name   timing   error
--------------------------------------------------------

BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1571978 - Posted: 14 Sep 2014, 21:15:00 UTC - in response to Message 1571799.  

The app has hung in the 'Optimal function choices' routine.

This routine does not depend on the task itself (so the hang may happen again with any task).
The test is done before the app even looks at the task data.


Suspend and restart the task (wait a few minutes after the restart).
Look in stderr.txt to see whether any functions are printed in the 'Optimal function choices' table
(suspend and restart a few times if needed).


A normal stderr.txt should start like this:
http://setiathome.berkeley.edu/result.php?resultid=3732593977
<stderr_txt>
setiathome_v7 7.00 Revision: 1772 g++ (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3)
libboinc: BOINC 7.1.0

Work Unit Info:
...............
WU true angle range is :  0.384554
Optimal function choices:
--------------------------------------------------------
                            name   timing   error
--------------------------------------------------------
                v_BaseLineSmooth (no other)
     v_vGetPowerSpectrumUnrolled 0.000039 0.00000 
                 avx_ChirpData_c 0.001676 0.00000 
           v_avxTranspose4x16ntw 0.000320 0.00000 
                  AK SSE folding 0.000227 0.00000 
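
The same check can be done mechanically. The following Python sketch looks at the last 'Optimal function choices' table in the text of stderr.txt and returns the rows printed under it. The function name `optimal_function_rows` and the two sample strings are introduced here for illustration; the samples are copied from the stderr.txt excerpts in this thread. An empty result a few minutes after a restart would suggest the app is stuck in that routine, but it is only a heuristic.

```python
HUNG_SAMPLE = """\
Optimal function choices:
--------------------------------------------------------
                            name   timing   error
--------------------------------------------------------
setiathome_v7 7.00 Revision: 1772 g++ (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3)
libboinc: BOINC 7.1.0

Work Unit Info:
...............
WU true angle range is :  0.014299
Optimal function choices:
--------------------------------------------------------
                            name   timing   error
--------------------------------------------------------
"""

NORMAL_SAMPLE = """\
Optimal function choices:
--------------------------------------------------------
                            name   timing   error
--------------------------------------------------------
                v_BaseLineSmooth (no other)
     v_vGetPowerSpectrumUnrolled 0.000039 0.00000
                 avx_ChirpData_c 0.001676 0.00000
           v_avxTranspose4x16ntw 0.000320 0.00000
                  AK SSE folding 0.000227 0.00000
"""


def optimal_function_rows(stderr_text: str) -> list:
    """Return the rows under the last 'Optimal function choices:' table."""
    lines = stderr_text.splitlines()
    starts = [i for i, line in enumerate(lines)
              if line.strip() == "Optimal function choices:"]
    if not starts:
        return []
    rows = []
    dashes_seen = 0
    for line in lines[starts[-1] + 1:]:
        stripped = line.strip()
        if stripped and set(stripped) == {"-"}:
            dashes_seen += 1  # rule line above or below the column names
            continue
        if dashes_seen < 2:
            continue  # still inside the table header
        if not stripped:
            break  # a blank line ends the table
        rows.append(stripped)
    return rows


if __name__ == "__main__":
    for label, sample in (("hung", HUNG_SAMPLE), ("normal", NORMAL_SAMPLE)):
        rows = optimal_function_rows(sample)
        print(f"{label}: {len(rows)} function row(s) printed")
```

Only the last table is inspected because stderr.txt accumulates one header per restart, as in the hung sample.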

 


BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1571983 - Posted: 14 Sep 2014, 21:28:45 UTC - in response to Message 1571978.  

Optimized apps do not have this problem (they do not do 'Optimal function choice') and are of course much faster.

If you want to go for them:
http://www.arkayn.us/forum/index.php?action=tpmod;dl=cat5
 


Graeme Hewson

Joined: 14 Jun 99
Posts: 19
Credit: 242,802
RAC: 0
United Kingdom
Message 1572334 - Posted: 15 Sep 2014, 19:03:40 UTC - in response to Message 1571978.  

I've suspended and restarted the task several times, but stderr.txt doesn't change; its last write time is three days ago. I tried running gdb, but it doesn't look promising:

Attaching to process 29675
Reading symbols from /var/lib/boinc-client/projects/setiathome.berkeley.edu/setiathome_7.01_x86_64-pc-linux-gnu...(no debugging symbols found)...done.
0x000000000040d708 in ?? ()
(gdb) bt
#0  0x000000000040d708 in ?? ()
#1  0x0000000000000000 in ?? ()
(gdb) n
Cannot find bounds of current function
(gdb) s
Cannot find bounds of current function
(gdb) ni
0x00000000004a557c in ?? ()
(gdb) ni
0x00000000004a557d in ?? ()
(gdb) 
0x00000000004a5580 in ?? ()
(gdb) 
0x00000000004a5584 in ?? ()
(gdb) 
0x00000000004a5587 in ?? ()
(gdb) 
0x00000000004a558c in ?? ()
(gdb) 
0x00000000004a5591 in ?? ()
(gdb) 
0x000000000073c6c0 in ?? ()
(gdb) 
0x000000000073c6c5 in ?? ()
(gdb) 
0x000000000073c6c7 in ?? ()
(gdb) 
0x000000000073c6cd in ?? ()
(gdb) 
0x000000000073c6d3 in ?? ()
(gdb) 
0x00000000004a5596 in ?? ()
(gdb) 
0x00000000004a559d in ?? ()
(gdb) 
0x00000000004a559f in ?? ()
(gdb) 
0x00000000004a55ae in ?? ()
(gdb) 
0x00000000004a55b4 in ?? ()
(gdb) 
0x00000000004a55b6 in ?? ()
(gdb) 
0x00000000004a55b8 in ?? ()
(gdb) 
0x00000000004a55c4 in ?? ()
(gdb) 
0x00000000004a55ca in ?? ()
(gdb) 
0x00000000004a55cc in ?? ()
(gdb) 
0x00000000004a55df in ?? ()
(gdb) 
0x00000000004a55e4 in ?? ()
(gdb) 
0x00000000004a55e6 in ?? ()
(gdb) 
0x00000000004a55e8 in ?? ()
(gdb) 
0x00000000004a55e9 in ?? ()
(gdb) 
<signal handler called>
(gdb) 
<signal handler called>
(gdb) det
Detaching from program: /var/lib/boinc-client/projects/setiathome.berkeley.edu/setiathome_7.01_x86_64-pc-linux-gnu, process 29675

Graeme Hewson

Joined: 14 Jun 99
Posts: 19
Credit: 242,802
RAC: 0
United Kingdom
Message 1573163 - Posted: 17 Sep 2014, 10:24:31 UTC - in response to Message 1572334.  

BOINC Manager wasn't downloading new WUs, saying: "Not requesting tasks: some task is suspended via Manager".

So I aborted the task. Unfortunately, BOINC Manager still isn't downloading new WUs, issuing the same message.

Any ideas?
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1573166 - Posted: 17 Sep 2014, 10:35:05 UTC - in response to Message 1573163.  

BOINC Manager wasn't downloading new WUs, saying: "Not requesting tasks: some task is suspended via Manager".

So I aborted the task. Unfortunately, BOINC Manager still isn't downloading new WUs, issuing the same message.

Any ideas?

What it says on the tin. BOINC will not fetch new work while any task from the project is suspended.

Note that 'suspended' (by you) is different from 'waiting to run' (BOINC's task management). Open BOINC Manager in Advanced view, ensure that all tasks are displayed (not just active tasks), and look down the 'status' column. Highlight any task(s) that say 'Task suspended by user', and click the third button on the left - which will have changed to say 'Resume'.

BTW, I answered a problem you posted at Einstein yesterday - your assumption was wrong, and your proposed course of action was inadvisable.
Graeme Hewson

Joined: 14 Jun 99
Posts: 19
Credit: 242,802
RAC: 0
United Kingdom
Message 1573167 - Posted: 17 Sep 2014, 10:44:33 UTC - in response to Message 1573166.  

There are no SETI@Home tasks shown in my manager. The one I aborted was the only one, and it's not there now.

Thanks, I saw your reply about Einstein, but I haven't got around yet to doing the reading you suggested.
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1573169 - Posted: 17 Sep 2014, 10:52:27 UTC - in response to Message 1573167.  

There are no SETI@Home tasks shown in my manager. The one I aborted was the only one, and it's not there now.

Ensure that all tasks are displayed. The top button on the left should have the caption "Show active tasks": that would be the outcome if you clicked it.


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.