Abandoned tasks - Ongoing issue

Richard Haselgrove (Project Donor)
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1353594 - Posted: 5 Apr 2013, 9:33:41 UTC - in response to Message 1353589.  

My main cruncher abandoned 98 tasks at 7.45 last night. This has not happened to me since Christmas and I was hoping the issue had been solved; I was down to my last 2 abandoned tasks from the Christmas problems. Nothing wrong with the PC: all temps were normal, and it continued crunching the abandoned tasks quite happily until I noticed and reset the project.

The most annoying thing is that I did not notice until 9 AM this morning, so that's over 12 hours of wasted crunching.

Still, one good thing: with the super-duper download speeds we get now, at least I was able to get more WUs quickly.

As I said to Rob, could you possibly dig out your log for 15 minutes either side of 4 Apr 2013, 19:44:22 UTC, which is when host 5951950 dumped its load? I've got Rob's, but having a couple more 'post move' examples to exhibit will help to persuade them to take it seriously.

The old messages can be found in 'stdoutdae.txt' in the BOINC data directory, if they're no longer accessible via the manager.
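For anyone pulling such a window out of a long log by hand, here is a minimal Python sketch. It assumes each line of 'stdoutdae.txt' starts with a stamp like "04-Apr-2013 19:44:22" followed by the project name in square brackets; that layout, the English month abbreviations and the helper name `lines_near` are assumptions for illustration, so check them against your own file first.

```python
from datetime import datetime, timedelta

# Assumed line layout: "04-Apr-2013 19:44:22 [SETI@home] message text".
STAMP_FORMAT = "%d-%b-%Y %H:%M:%S"
STAMP_LENGTH = 20  # number of characters in a stamp such as "04-Apr-2013 19:44:22"


def lines_near(lines, centre, minutes=15):
    """Return the log lines stamped within `minutes` either side of `centre`."""
    window = timedelta(minutes=minutes)
    selected = []
    for line in lines:
        try:
            stamp = datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if abs(stamp - centre) <= window:
            selected.append(line.rstrip("\n"))
    return selected
```

Passing an open file object as `lines` and `datetime(2013, 4, 4, 19, 44, 22)` as `centre` would select the half hour asked for above. If the log is stamped in local time rather than UTC, shift the centre time accordingly.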
ID: 1353594
Chuck

Joined: 23 Mar 12
Posts: 3
Credit: 28,836,441
RAC: 37
United States
Message 1364959 - Posted: 5 May 2013, 21:04:28 UTC

Since the last BOINC update, whenever any of the AstroPulse tasks try to run, they are aborted. Is there something going on with the latest BOINC update?
ID: 1364959
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22203
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1364966 - Posted: 5 May 2013, 21:22:36 UTC
Last modified: 5 May 2013, 21:25:11 UTC

Richard - the preceding post from Chuck looks to be another example for your collection.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1364966
Richard Haselgrove (Project Donor)
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1364968 - Posted: 5 May 2013, 21:29:39 UTC - in response to Message 1364966.  
Last modified: 5 May 2013, 21:33:45 UTC

Richard - the preceding post from Chuck looks to be another example for your collection.

No, I think

201 (0xc9) EXIT_MISSING_COPROC

comes from a different collection.

Edit - if anybody has experience of an AMD ATI Radeon HD 5700 series (Juniper) under Windows 8, feel free to contribute. Outside my range of knowledge, I'm afraid.
ID: 1364968
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22203
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1364973 - Posted: 5 May 2013, 21:31:49 UTC

A very different collection indeed!
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1364973
Wiggo

Joined: 24 Jan 00
Posts: 34748
Credit: 261,360,520
RAC: 489
Australia
Message 1364974 - Posted: 5 May 2013, 21:32:37 UTC - in response to Message 1364959.  

Since the last BOINC update, whenever any of the AstroPulse tasks try to run, they are aborted. Is there something going on with the latest BOINC update?

Seeing as all of them have returned "201 (0xc9) EXIT_MISSING_COPROC", it seems that you are having some sort of problem communicating with your video card, but as I've no experience with Radeons I can't help any further.

Cheers.
ID: 1364974
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1364979 - Posted: 5 May 2013, 22:14:37 UTC - in response to Message 1364968.  

Since there isn't any data online, the first step would be for Chuck to post the BOINC startup log that lists the AMD-APP version. The Stock SETI AstroPulse App has problems with any AMD-APP newer than 938.2. When I ran my AMD 6850 in Windows 8 with BOINC 7.0.64, I used Catalyst 12-8 with 938.2 and didn't have any problems. He could start with reinstalling Catalyst and then BOINC. I run the BOINC installer twice, just to be sure.
ID: 1364979
Chuck

Joined: 23 Mar 12
Posts: 3
Credit: 28,836,441
RAC: 37
United States
Message 1365295 - Posted: 6 May 2013, 23:41:12 UTC - in response to Message 1364959.  

I just watched as I was downloading a bunch of tasks. Several of them were AstroPulse. As soon as the downloads finished -- it was FAST! -- I went back to Tasks. All of the AstroPulse tasks had been aborted before they even ran. Humm! Any suggestions?
ID: 1365295
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1365297 - Posted: 6 May 2013, 23:46:11 UTC - in response to Message 1365295.  

I just watched as I was downloading a bunch of tasks. Several of them were AstroPulse. As soon as the downloads finished -- it was FAST! -- I went back to Tasks. All of the AstroPulse tasks had been aborted before they even ran. Humm! Any suggestions?

Can you post the BOINC startup messages from the Event Log, please? The first 20 to 30 lines will do.

Claggy
ID: 1365297
Oddbjornik (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester

Joined: 15 May 99
Posts: 220
Credit: 349,610,548
RAC: 1,728
Norway
Message 1370685 - Posted: 22 May 2013, 18:13:23 UTC
Last modified: 22 May 2013, 18:31:43 UTC

Twice today I've had ~200 tasks abandoned on one of my hosts. What is special about today is that I am on a rather shaky mobile connection. I guess this indicates that this problem is communications-related.

Here's the log from the two abandon-instances. They both state that the previous RPC is too recent, something that is not actually true.

There is a pattern, though:

1. First a scheduler call fails.
2. Then the next one succeeds.
3. But then the scheduler call after that fails, and all tasks have been abandoned.

Host id is 6801223

Any news on the debugging of this problem..?

22.05.2013 13:31:35 | SETI@home | Sending scheduler request: To fetch work.
22.05.2013 13:31:35 | SETI@home | Requesting new tasks for CPU and NVIDIA
22.05.2013 13:31:48 |  | Project communication failed: attempting access to reference site
22.05.2013 13:31:48 | SETI@home | Scheduler request failed: Couldn't resolve host name
22.05.2013 13:32:04 |  | BOINC can't access Internet - check network connection or proxy configuration.
22.05.2013 13:33:40 | SETI@home | Sending scheduler request: To fetch work.
22.05.2013 13:33:40 | SETI@home | Requesting new tasks for CPU and NVIDIA
22.05.2013 13:39:00 | SETI@home | Scheduler request failed: Timeout was reached
22.05.2013 13:40:41 | SETI@home | Sending scheduler request: To fetch work.
22.05.2013 13:40:41 | SETI@home | Requesting new tasks for CPU and NVIDIA
22.05.2013 13:40:45 | SETI@home | Scheduler request completed: got 0 new tasks
22.05.2013 13:40:45 | SETI@home | No tasks sent
22.05.2013 13:40:45 | SETI@home | No tasks are available for AstroPulse v6
22.05.2013 13:40:45 | SETI@home | This computer has reached a limit on tasks in progress
22.05.2013 13:40:45 | SETI@home | Project has no tasks available
22.05.2013 13:45:50 | SETI@home | Sending scheduler request: To fetch work.
22.05.2013 13:45:50 | SETI@home | Requesting new tasks for NVIDIA
22.05.2013 13:45:56 | SETI@home | Scheduler request completed: got 0 new tasks
22.05.2013 13:45:56 | SETI@home | Not sending work - last request too recent: 76 sec


... and

22.05.2013 19:39:10 | SETI@home | Sending scheduler request: To fetch work.
22.05.2013 19:39:10 | SETI@home | Requesting new tasks for CPU and NVIDIA
22.05.2013 19:39:17 | SETI@home | Finished download of ap_04my12ae_B1_P1_00238_20130522_30370.wu
22.05.2013 19:39:35 | SETI@home | Finished download of ap_02jn10aa_B2_P0_00148_20130522_08465.wu
22.05.2013 19:40:21 | SETI@home | Finished download of ap_02jn10aa_B3_P1_00233_20130522_28071.wu
22.05.2013 19:40:37 | SETI@home | Finished download of ap_02jn10aa_B1_P0_00276_20130522_19949.wu
22.05.2013 19:43:05 | SETI@home | Computation for task 26my10ab.14735.305323.10.11.67_0 finished
22.05.2013 19:43:05 | SETI@home | Starting task 02au12ag.30623.20517.7.11.2_0 using setiathome_enhanced version 610 (cuda_fermi) in slot 1
22.05.2013 19:43:07 | SETI@home | Started upload of 26my10ab.14735.305323.10.11.67_0_0
22.05.2013 19:43:10 | SETI@home | Computation for task 27my10aa.5881.320667.8.11.27_1 finished
22.05.2013 19:43:10 | SETI@home | Starting task 22my12aa.32191.9474.11.11.152_0 using setiathome_enhanced version 610 (cuda_fermi) in slot 0
22.05.2013 19:43:12 | SETI@home | Finished upload of 26my10ab.14735.305323.10.11.67_0_0
22.05.2013 19:43:12 | SETI@home | Started upload of 27my10aa.5881.320667.8.11.27_1_0
22.05.2013 19:43:19 | SETI@home | Finished upload of 27my10aa.5881.320667.8.11.27_1_0
22.05.2013 19:44:41 |  | Project communication failed: attempting access to reference site
22.05.2013 19:44:41 | SETI@home | Scheduler request failed: Timeout was reached
22.05.2013 19:44:45 |  | Internet access OK - project servers may be temporarily down.
22.05.2013 19:46:38 | SETI@home | Sending scheduler request: To fetch work.
22.05.2013 19:46:38 | SETI@home | Reporting 2 completed tasks
22.05.2013 19:46:38 | SETI@home | Requesting new tasks for CPU and NVIDIA
22.05.2013 19:46:42 | SETI@home | Scheduler request completed: got 4 new tasks
22.05.2013 19:46:44 | SETI@home | Started download of ap_01mr13ag_B4_P0_00313_20130427_21712.wu
22.05.2013 19:46:44 | SETI@home | Started download of ap_04my12ae_B1_P1_00376_20130522_30370.wu
22.05.2013 19:46:44 | SETI@home | Started download of ap_02jn10aa_B3_P1_00362_20130522_28071.wu
22.05.2013 19:46:44 | SETI@home | Started download of ap_07jn10aa_B4_P0_00051_20130522_10732.wu
22.05.2013 19:50:38 | SETI@home | Finished download of ap_01mr13ag_B4_P0_00313_20130427_21712.wu
22.05.2013 19:50:41 | SETI@home | Finished download of ap_07jn10aa_B4_P0_00051_20130522_10732.wu
22.05.2013 19:51:30 | SETI@home | Finished download of ap_02jn10aa_B3_P1_00362_20130522_28071.wu
22.05.2013 19:51:46 | SETI@home | Sending scheduler request: To fetch work.
22.05.2013 19:51:46 | SETI@home | Requesting new tasks for CPU and NVIDIA
22.05.2013 19:51:56 | SETI@home | Scheduler request completed: got 0 new tasks
22.05.2013 19:51:56 | SETI@home | Not sending work - last request too recent: 77 sec
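To check other logs for the same pattern, something like the Python sketch below could help. It assumes the "DD.MM.YYYY HH:MM:SS | project | message" layout of the excerpts above; the function names are made up for illustration. For every "last request too recent" reply it reports the interval the server claimed, the gap actually observed between the client's last two scheduler requests, and how many failed requests came before.

```python
import re
from datetime import datetime

TOO_RECENT = re.compile(r"last request too recent: (\d+) sec")


def parse(line):
    """Split 'DD.MM.YYYY HH:MM:SS | project | message' into (time, message)."""
    parts = [part.strip() for part in line.split("|", 2)]
    if len(parts) != 3:
        return None
    try:
        when = datetime.strptime(parts[0], "%d.%m.%Y %H:%M:%S")
    except ValueError:
        return None
    return when, parts[2]


def find_too_recent(lines):
    """Collect every 'last request too recent' reply with its context."""
    sends = []      # times of "Sending scheduler request" lines
    failures = 0    # failed scheduler requests seen so far
    findings = []
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        when, message = parsed
        if message.startswith("Sending scheduler request"):
            sends.append(when)
        elif message.startswith("Scheduler request failed"):
            failures += 1
        else:
            match = TOO_RECENT.search(message)
            if match and len(sends) >= 2:
                gap = int((sends[-1] - sends[-2]).total_seconds())
                findings.append({
                    "time": when,
                    "claimed_sec": int(match.group(1)),
                    "observed_gap_sec": gap,
                    "failures_before": failures,
                })
    return findings
```

A finding where `observed_gap_sec` is far larger than `claimed_sec` matches what the logs above show: the client waited about five minutes, yet the server answered as if it had heard from the host little more than a minute earlier.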

ID: 1370685
Tom*

Joined: 12 Aug 11
Posts: 127
Credit: 20,769,223
RAC: 9
United States
Message 1370696 - Posted: 22 May 2013, 19:07:47 UTC
Last modified: 22 May 2013, 19:28:36 UTC

WAG:

Could network instability cause RPC over UDP to fall back to TCP, causing a possible "RPC call too recent"?

In TCP/IP networks, the authors of RPC faced the problem of mapping program numbers to generic network services. They designed each server to provide both a TCP and a UDP port for each program and each version. Generally, RPC applications use UDP when sending data, and fall back to TCP only when the data to be transferred doesn't fit into a single UDP datagram.


The RPC architecture shows four possible protocols for connection-oriented RPC requests: TCP, SPX, Named Pipes and HTTP.

It also shows two protocols for datagram RPC: UDP and CDP.

So there may be more than one failover fallback here.

I wonder if Richard can find the proper magic in REGEDIT to disable failover for RPC as a test.
ID: 1370696
Horacio

Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1370711 - Posted: 22 May 2013, 20:10:37 UTC - in response to Message 1370696.  

AFAIK, all the RPCs from BOINC are made through the HTTP protocol, so I'm not sure if the UDP/TCP fallback is acting here...

But the mystery is how a network failure, or even a protocol fallback, could delay an RPC long enough (more than 5 mins) to produce an out-of-sequence error... BTW, I'm still convinced that no matter how mysterious this is, it's still what causes the abandoning events...

If it matters, it's still happening on my hosts, just not as constantly as it was, because I'm not using one of the ISPs that seems to be very flaky...
Anyway, if one of the other ISPs has some failure, it may lead to some of my hosts abandoning their tasks.
ID: 1370711
trader
Volunteer tester

Joined: 25 Jun 00
Posts: 126
Credit: 4,968,173
RAC: 0
United States
Message 1370714 - Posted: 22 May 2013, 20:20:34 UTC - in response to Message 1370711.  

Just a thought as a workaround: try changing network activity to something other than "always run", then try connecting manually and see if they abandon then.
ID: 1370714
Horacio

Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1370723 - Posted: 22 May 2013, 20:45:12 UTC - in response to Message 1370714.  

Just a thought as a workaround: try changing network activity to something other than "always run", then try connecting manually and see if they abandon then.

It may work on some hosts, but with the current limits it won't be easy on a host with a faster/multi GPU, because you may need to do a manual update once every hour if you want to keep it crunching (or more often if you get the "project has no tasks" answer)...

I have a script in place (using a little app that I've made) that detects the "abandoning events" by checking the tasks pages of the host and, if needed, resets the project to clean out the abandoned tasks...
(As the abandoned tasks are not deleted on the hosts, if you don't catch it as soon as it happens, the host will just be sucking electricity, because the results of those WUs will be discarded when reported.)
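The script itself is not posted here, so the following is only a rough Python sketch of the idea, not the actual app. The status wording "Abandoned", the project URL and all function names are assumptions; the reset goes through the standard `boinccmd --project URL reset` command. How the tasks page text is fetched is deliberately left out.

```python
import subprocess

# Assumed project URL; use whatever URL the local client lists for the project.
PROJECT_URL = "http://setiathome.berkeley.edu/"


def count_abandoned(page_text):
    """Count how often the (assumed) status text 'Abandoned' appears on the page."""
    return page_text.count("Abandoned")


def reset_command(project_url):
    """Build the boinccmd call that resets one project on the local client."""
    return ["boinccmd", "--project", project_url, "reset"]


def check_and_reset(page_text, threshold=1, run=subprocess.run):
    """Reset the project if the page shows at least `threshold` abandoned tasks.

    `run` can be swapped for a stand-in when trying the logic out without
    touching a real BOINC client.
    """
    if count_abandoned(page_text) < threshold:
        return False
    run(reset_command(PROJECT_URL), check=True)
    return True
```

Run on a timer against the host's tasks page, a check like this would limit the wasted crunching to one polling interval. A higher `threshold` avoids a reset over a single stray abandoned task.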

ID: 1370723
Bill Collins

Joined: 5 Nov 05
Posts: 25
Credit: 57,544,918
RAC: 0
United States
Message 1372632 - Posted: 28 May 2013, 4:58:12 UTC - in response to Message 1370685.  

I have been running SETI@home on my machines on and off since 1999 and steadily since 2008. Within the last week I have suddenly been hit with lots and lots of "abandoned" work units. My 2 machines gulp down as many work units as they can get, and I recently replaced their Nvidia GPUs (one month ago and two months ago) with better Nvidia GPUs and saw my RAC rise significantly. 3 days ago, the machine whose GPU was replaced more recently started a 1300 WU nosedive.

Is there something going on here?

-Bill
ID: 1372632
Oddbjornik (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester

Joined: 15 May 99
Posts: 220
Credit: 349,610,548
RAC: 1,728
Norway
Message 1372745 - Posted: 28 May 2013, 13:45:45 UTC - in response to Message 1372632.  


Is there something going on here?

-Bill


First of all, when this happens on a host, you should reset the project on that host as quickly as possible. The host never gets to know that the work has been abandoned by the project, so it keeps working on tasks that the project has already decided to ignore. Any valid work that has been assigned after the abandonment will be resent as "lost workunits" after the project has been reset.

This is a low-lurking problem that has been ongoing for at least half a year. IIRC, I have had work abandoned seven times since last November. Richard Haselgrove has been looking into it, but it's a hard one to figure out.

On my hosts, abandonment often seems to happen twice within a short timespan, and then two months will pass without any trouble. I suspect it may have something to do with communications trouble. The last time it happened (twice in one day), I was on a shaky mobile connection.


ID: 1372745
Bill Collins

Joined: 5 Nov 05
Posts: 25
Credit: 57,544,918
RAC: 0
United States
Message 1372904 - Posted: 29 May 2013, 2:08:44 UTC - in response to Message 1372745.  

Thanks. I'll try that out. I've never had to reset the project before.

-Bill
ID: 1372904


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.