Abandoned tasks - Ongoing issue

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
Message 1353594 - Posted: 5 Apr 2013, 9:33:41 UTC - in response to Message 1353589.
"My main cruncher abandoned 98 tasks at 7.45 last night. This has not happened to me since Christmas and I was hoping the issue had been solved; I was down to my last 2 abandoned from the Christmas problems. Nothing wrong with the PC, all temps normal, and it continued crunching the abandoned tasks quite happily until I noticed and reset the project. Most annoying thing is I did not notice until 9AM this morning, so that's over 12 hours of wasted crunching. Still, one good thing: with the super-duper download speeds we get now, at least I was able to get more WUs quickly."
As I said to Rob, could you possibly dig out your log for 15 minutes either side of 4 Apr 2013, 19:44:22 UTC, which is when host 5951950 dumped its load? I've got Rob's, but having a couple more 'post move' examples to exhibit will help to persuade them to take it seriously. The old messages can be found in 'stdoutdae.txt' in the BOINC data directory, if they're no longer accessible via the manager.
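One way to pull the requested window out of the client log is sketched below in Python. It assumes each line of stdoutdae.txt starts with a timestamp of the form `05-Apr-2013 09:33:41` followed by the project name in brackets. That layout, the function name `lines_near`, and the sample lines are assumptions made for illustration and do not come from this thread. The client log is probably stamped in the host's local time, so the UTC time quoted above may need converting first.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout at the start of each log line (not confirmed here).
STAMP_FORMAT = "%d-%b-%Y %H:%M:%S"
STAMP_LENGTH = 20  # length of "DD-Mon-YYYY HH:MM:SS"


def lines_near(lines, centre, minutes=15):
    """Return the log lines stamped within `minutes` either side of `centre`."""
    window = timedelta(minutes=minutes)
    selected = []
    for line in lines:
        try:
            stamp = datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if abs(stamp - centre) <= window:
            selected.append(line.rstrip("\n"))
    return selected


# Hypothetical sample lines; a real run would read stdoutdae.txt instead.
sample = [
    "04-Apr-2013 19:20:00 [SETI@home] hypothetical early line",
    "04-Apr-2013 19:40:10 [SETI@home] hypothetical line inside the window",
    "04-Apr-2013 20:05:00 [SETI@home] hypothetical late line",
]
for line in lines_near(sample, datetime(2013, 4, 4, 19, 44, 22)):
    print(line)
```

With a real log, the list passed in would come from something like `open(path).readlines()`, where `path` points at stdoutdae.txt in the BOINC data directory. Month abbreviations are parsed with `%b`, which depends on the locale, so an English locale is assumed.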

Chuck Joined: 23 Mar 12 Posts: 3 Credit: 28,836,441 RAC: 37
Message 1364959 - Posted: 5 May 2013, 21:04:28 UTC
Since the last BOINC update, whenever any of the AstroPulse projects try to run, they are aborted. Is there something going on with the latest BOINC update?

rob smith Volunteer moderator Volunteer tester Joined: 7 Mar 03 Posts: 22203 Credit: 416,307,556 RAC: 380
Message 1364966 - Posted: 5 May 2013, 21:22:36 UTC Last modified: 5 May 2013, 21:25:11 UTC
Richard - the preceding post from Chuck looks to be another example for your collection.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
Message 1364968 - Posted: 5 May 2013, 21:29:39 UTC - in response to Message 1364966. Last modified: 5 May 2013, 21:33:45 UTC
"Richard - the preceding post from Chuck looks to be another example for your collection."
No, I think 201 (0xc9) EXIT_MISSING_COPROC comes from a different collection.
Edit - if anybody has experience of an AMD ATI Radeon HD 5700 series (Juniper) under Windows 8, feel free to contribute. Outside my range of knowledge, I'm afraid.

rob smith Volunteer moderator Volunteer tester Joined: 7 Mar 03 Posts: 22203 Credit: 416,307,556 RAC: 380
Message 1364973 - Posted: 5 May 2013, 21:31:49 UTC
A very different collection indeed!
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Wiggo Joined: 24 Jan 00 Posts: 34748 Credit: 261,360,520 RAC: 489
Message 1364974 - Posted: 5 May 2013, 21:32:37 UTC - in response to Message 1364959.
"Since the last BOINC update, whenever any of the AstroPulse projects try to run, they are aborted. Is there something going on with the latest BOINC update?"
Seeing as all of them have returned "201 (0xc9) EXIT_MISSING_COPROC", it seems that you are having some sort of problem communicating with your video card, but as I've no experience with Radeons I can't help any further.
Cheers.

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
Message 1364979 - Posted: 5 May 2013, 22:14:37 UTC - in response to Message 1364968.
Since there isn't any data online, the first step would be for Chuck to post the BOINC startup log that lists the AMD-APP version. The Stock SETI AstroPulse App has problems with any AMD-APP newer than 938.2. When I ran my AMD 6850 in Windows 8 with BOINC 7.0.64, I used Catalyst 12-8 with 938.2 and didn't have any problems. He could start with reinstalling Catalyst and then BOINC. I run the BOINC installer twice, just to be sure.

Chuck Joined: 23 Mar 12 Posts: 3 Credit: 28,836,441 RAC: 37
Message 1365295 - Posted: 6 May 2013, 23:41:12 UTC - in response to Message 1364959.
I just watched as I was downloading a bunch of tasks. Several of them were AstroPulse. As soon as the downloads finished -- it was FAST! -- I went back to Tasks. All of the AstroPulse tasks had been aborted before they even ran. Hmm! Any suggestions?

Claggy Volunteer tester Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4
Message 1365297 - Posted: 6 May 2013, 23:46:11 UTC - in response to Message 1365295.
"I just watched as I was downloading a bunch of tasks. Several of them were AstroPulse. As soon as the downloads finished -- it was FAST! -- I went back to Tasks. All of the AstroPulse tasks had been aborted before they even ran. Hmm! Any suggestions?"
Can you post the BOINC startup messages from the Event log please? The first 20 to 30 lines will do.
Claggy

Oddbjornik Volunteer tester Joined: 15 May 99 Posts: 220 Credit: 349,610,548 RAC: 1,728
Message 1370685 - Posted: 22 May 2013, 18:13:23 UTC Last modified: 22 May 2013, 18:31:43 UTC
Twice today I've had ~200 tasks abandoned on one of my hosts. What is special about today is that I am on a rather shaky mobile connection. I guess this indicates that this problem is communications-related.
Here's the log from the two abandon-instances. They both state that the previous RPC is too recent, something that is not actually true. There is a pattern, though:
1. First a scheduler call fails.
2. Then the next one succeeds.
3. But then the scheduler call after that fails, and all tasks have been abandoned.
Host id is 6801223
Any news on the debugging of this problem..?
22.05.2013 13:31:35 | SETI@home | Sending scheduler request: To fetch work.
22.05.2013 13:31:35 | SETI@home | Requesting new tasks for CPU and NVIDIA
22.05.2013 13:31:48 | | Project communication failed: attempting access to reference site
22.05.2013 13:31:48 | SETI@home | Scheduler request failed: Couldn't resolve host name
22.05.2013 13:32:04 | | BOINC can't access Internet - check network connection or proxy configuration.
22.05.2013 13:33:40 | SETI@home | Sending scheduler request: To fetch work.
22.05.2013 13:33:40 | SETI@home | Requesting new tasks for CPU and NVIDIA
22.05.2013 13:39:00 | SETI@home | Scheduler request failed: Timeout was reached
22.05.2013 13:40:41 | SETI@home | Sending scheduler request: To fetch work.
22.05.2013 13:40:41 | SETI@home | Requesting new tasks for CPU and NVIDIA
22.05.2013 13:40:45 | SETI@home | Scheduler request completed: got 0 new tasks
22.05.2013 13:40:45 | SETI@home | No tasks sent
22.05.2013 13:40:45 | SETI@home | No tasks are available for AstroPulse v6
22.05.2013 13:40:45 | SETI@home | This computer has reached a limit on tasks in progress
22.05.2013 13:40:45 | SETI@home | Project has no tasks available
22.05.2013 13:45:50 | SETI@home | Sending scheduler request: To fetch work.
22.05.2013 13:45:50 | SETI@home | Requesting new tasks for NVIDIA
22.05.2013 13:45:56 | SETI@home | Scheduler request completed: got 0 new tasks
22.05.2013 13:45:56 | SETI@home | Not sending work - last request too recent: 76 sec
... and
22.05.2013 19:39:10 | SETI@home | Sending scheduler request: To fetch work.
22.05.2013 19:39:10 | SETI@home | Requesting new tasks for CPU and NVIDIA
22.05.2013 19:39:17 | SETI@home | Finished download of ap_04my12ae_B1_P1_00238_20130522_30370.wu
22.05.2013 19:39:35 | SETI@home | Finished download of ap_02jn10aa_B2_P0_00148_20130522_08465.wu
22.05.2013 19:40:21 | SETI@home | Finished download of ap_02jn10aa_B3_P1_00233_20130522_28071.wu
22.05.2013 19:40:37 | SETI@home | Finished download of ap_02jn10aa_B1_P0_00276_20130522_19949.wu
22.05.2013 19:43:05 | SETI@home | Computation for task 26my10ab.14735.305323.10.11.67_0 finished
22.05.2013 19:43:05 | SETI@home | Starting task 02au12ag.30623.20517.7.11.2_0 using setiathome_enhanced version 610 (cuda_fermi) in slot 1
22.05.2013 19:43:07 | SETI@home | Started upload of 26my10ab.14735.305323.10.11.67_0_0
22.05.2013 19:43:10 | SETI@home | Computation for task 27my10aa.5881.320667.8.11.27_1 finished
22.05.2013 19:43:10 | SETI@home | Starting task 22my12aa.32191.9474.11.11.152_0 using setiathome_enhanced version 610 (cuda_fermi) in slot 0
22.05.2013 19:43:12 | SETI@home | Finished upload of 26my10ab.14735.305323.10.11.67_0_0
22.05.2013 19:43:12 | SETI@home | Started upload of 27my10aa.5881.320667.8.11.27_1_0
22.05.2013 19:43:19 | SETI@home | Finished upload of 27my10aa.5881.320667.8.11.27_1_0
22.05.2013 19:44:41 | | Project communication failed: attempting access to reference site
22.05.2013 19:44:41 | SETI@home | Scheduler request failed: Timeout was reached
22.05.2013 19:44:45 | | Internet access OK - project servers may be temporarily down.
22.05.2013 19:46:38 | SETI@home | Sending scheduler request: To fetch work.
22.05.2013 19:46:38 | SETI@home | Reporting 2 completed tasks
22.05.2013 19:46:38 | SETI@home | Requesting new tasks for CPU and NVIDIA
22.05.2013 19:46:42 | SETI@home | Scheduler request completed: got 4 new tasks
22.05.2013 19:46:44 | SETI@home | Started download of ap_01mr13ag_B4_P0_00313_20130427_21712.wu
22.05.2013 19:46:44 | SETI@home | Started download of ap_04my12ae_B1_P1_00376_20130522_30370.wu
22.05.2013 19:46:44 | SETI@home | Started download of ap_02jn10aa_B3_P1_00362_20130522_28071.wu
22.05.2013 19:46:44 | SETI@home | Started download of ap_07jn10aa_B4_P0_00051_20130522_10732.wu
22.05.2013 19:50:38 | SETI@home | Finished download of ap_01mr13ag_B4_P0_00313_20130427_21712.wu
22.05.2013 19:50:41 | SETI@home | Finished download of ap_07jn10aa_B4_P0_00051_20130522_10732.wu
22.05.2013 19:51:30 | SETI@home | Finished download of ap_02jn10aa_B3_P1_00362_20130522_28071.wu
22.05.2013 19:51:46 | SETI@home | Sending scheduler request: To fetch work.
22.05.2013 19:51:46 | SETI@home | Requesting new tasks for CPU and NVIDIA
22.05.2013 19:51:56 | SETI@home | Scheduler request completed: got 0 new tasks
22.05.2013 19:51:56 | SETI@home | Not sending work - last request too recent: 77 sec
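The mismatch described in this post (the server reporting a gap of about 76 seconds when the client had waited roughly five minutes) can be checked mechanically. The Python sketch below assumes each log entry sits on its own line in the form `DD.MM.YYYY HH:MM:SS | project | message`. The function names `parse_event` and `compare_gaps` are made up for illustration, and the approach is only one possible reading of the log, not a diagnosis of the underlying fault.

```python
import re
from datetime import datetime

STAMP_FORMAT = "%d.%m.%Y %H:%M:%S"
TOO_RECENT = re.compile(r"last request too recent: (\d+) sec")


def parse_event(line):
    """Split 'stamp | project | message' into (datetime, message), or None."""
    # Tolerate pipes that were escaped as "\|" when the log was pasted.
    parts = [part.strip() for part in line.replace("\\|", "|").split("|", 2)]
    if len(parts) != 3:
        return None
    try:
        return datetime.strptime(parts[0], STAMP_FORMAT), parts[2]
    except ValueError:
        return None


def compare_gaps(lines):
    """For each 'too recent' refusal, pair the client-side gap between the
    last two scheduler requests with the gap the server reported (seconds)."""
    sends = []    # timestamps of "Sending scheduler request" entries
    results = []
    for line in lines:
        event = parse_event(line)
        if event is None:
            continue
        when, message = event
        if message.startswith("Sending scheduler request"):
            sends.append(when)
            continue
        match = TOO_RECENT.search(message)
        if match and len(sends) >= 2:
            client_gap = int((sends[-1] - sends[-2]).total_seconds())
            results.append((client_gap, int(match.group(1))))
    return results


# Four entries taken from the first excerpt above.
excerpt = [
    "22.05.2013 13:40:41 | SETI@home | Sending scheduler request: To fetch work.",
    "22.05.2013 13:40:45 | SETI@home | Scheduler request completed: got 0 new tasks",
    "22.05.2013 13:45:50 | SETI@home | Sending scheduler request: To fetch work.",
    "22.05.2013 13:45:56 | SETI@home | Not sending work - last request too recent: 76 sec",
]
print(compare_gaps(excerpt))  # [(309, 76)]
```

The first number is what the client saw between its own requests, the second is what the server claimed. Requests that later failed still count as sends here, since the sketch only looks at the client's side of the exchange.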

Tom* Joined: 12 Aug 11 Posts: 127 Credit: 20,769,223 RAC: 9
Message 1370696 - Posted: 22 May 2013, 19:07:47 UTC Last modified: 22 May 2013, 19:28:36 UTC
WAG: could network instability cause RPC over UDP to fall back to TCP, causing a possible "RPC call too recent"?
"In TCP/IP networks, the authors of RPC faced the problem of mapping program numbers to generic network services. They designed each server to provide both a TCP and a UDP port for each program and each version. Generally, RPC applications use UDP when sending data, and fall back to TCP only when the data to be transferred doesn't fit into a single UDP datagram."
The RPC architecture shows four possible protocols for connection-oriented RPC requests: TCP, SPX, Named Pipes and HTTP. It also shows two protocols for datagram RPC, UDP and CDP, so there may be more than one failover fallback here.
I wonder if Richard can find the proper magic in REGEDIT to disable failover for RPC as a test.

Horacio Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0
Message 1370711 - Posted: 22 May 2013, 20:10:37 UTC - in response to Message 1370696.
AFAIK, all the RPCs from BOINC are made through the HTTP protocol, so I'm not sure if the UDP/TCP fallback is acting here... But the mystery is how a failure in the network or even a protocol fallback could delay an RPC long enough (more than 5 mins) to produce an out of sequence error...
BTW, I'm still convinced that no matter how mysterious this is, it's still what causes the abandoning events...
If it matters, it's still happening on my hosts, just not as constantly as it was, because I'm not using one of the ISPs that seems to be very flaky... Anyway, if one of the other ISPs has some failure it may lead to some of my hosts abandoning their tasks.

trader Volunteer tester Joined: 25 Jun 00 Posts: 126 Credit: 4,968,173 RAC: 0
Message 1370714 - Posted: 22 May 2013, 20:20:34 UTC - in response to Message 1370711.
Just a thought as a workaround: try changing network activity to something other than "always run", then try connecting manually and see if they abandon then.

Horacio Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0
Message 1370723 - Posted: 22 May 2013, 20:45:12 UTC - in response to Message 1370714.
"Just a thought as a workaround: try changing network activity to something other than "always run", then try connecting manually and see if they abandon then."
It may work on some hosts, but with the current limits it won't be easy on a host with a faster/multi GPU, because you may need to do a manual update once every hour if you want to keep it crunching (or more often if you get the "project has no tasks" answer)...
I have a script in place (using a little app that I've made) that detects the "abandoning events" by checking the tasks pages of the host and, if needed, resets the project to clean the abandoned tasks... (As the abandoned tasks are not deleted on the hosts, if you don't catch it as soon as it happens, the host will just be sucking electricity, because the results of those WUs will be discarded when reported.)
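The script mentioned in this post is not shown in the thread. As a rough illustration of the idea only, the Python sketch below resets the project through `boinccmd --project URL reset` once some tasks look abandoned. How the task statuses are obtained is left out: the list of status strings, the function names, the threshold and the project URL are all assumptions for this example, not details of the actual script.

```python
import subprocess

# Assumed project URL; use whatever the BOINC client lists for the project.
PROJECT_URL = "http://setiathome.berkeley.edu/"


def count_abandoned(statuses):
    """Count status strings that mention 'abandoned' (case-insensitive)."""
    return sum("abandoned" in status.lower() for status in statuses)


def reset_command(project_url=PROJECT_URL):
    """Build the boinccmd invocation that resets one project."""
    return ["boinccmd", "--project", project_url, "reset"]


def reset_if_abandoned(statuses, threshold=1, dry_run=True):
    """Reset the project when at least `threshold` tasks look abandoned.

    Returns the command that was (or would be) run, or None if no reset
    is needed. With dry_run=True nothing is executed.
    """
    if count_abandoned(statuses) < threshold:
        return None
    command = reset_command()
    if not dry_run:
        subprocess.run(command, check=True)
    return command


# Hypothetical statuses, as they might be read from the host's task list.
print(reset_if_abandoned(["In progress", "Abandoned", "Abandoned"]))
print(reset_if_abandoned(["In progress"]))
```

The dry-run default is deliberate: a reset throws away every task on the host, so it seems safer to log the intended command first and only pass `dry_run=False` once the detection step has proved reliable.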

Bill Collins Joined: 5 Nov 05 Posts: 25 Credit: 57,544,918 RAC: 0
Message 1372632 - Posted: 28 May 2013, 4:58:12 UTC - in response to Message 1370685.
I have been running Seti@Home on my machines on and off since 1999 and steadily since 2008. Within the last week I have suddenly been hit with lots and lots of "abandoned" work units. My 2 machines gulp down as many work units as they get, and I have recently replaced their Nvidia GPUs (one month ago and two months ago) with better Nvidia GPUs and saw my RAC rise significantly. 3 days ago the machine whose GPU was replaced more recently started a 1300 WU nosedive.
Is there something going on here?
-Bill

Oddbjornik Volunteer tester Joined: 15 May 99 Posts: 220 Credit: 349,610,548 RAC: 1,728
Message 1372745 - Posted: 28 May 2013, 13:45:45 UTC - in response to Message 1372632.
"Is there something going on here? -Bill"
First of all, when this happens on a host, you should reset the project on that host as quickly as possible. The host never gets to know that the work has been abandoned by the project, so it keeps working on tasks that the project has already decided to ignore. Any valid work that has been assigned after the abandonment will be resent as "lost workunits" after the project has been reset.
This is a low-lurking problem that has been ongoing for at least half a year. I have had work abandoned IIRC seven times since last November. Richard Haselgrove has been looking into it, but it's a hard one to figure out.
On my hosts, abandonment often seems to happen twice within a short timespan, and then two months will pass without any trouble. I suspect it may have something to do with communications trouble. The last time it happened (twice in one day), I was on a shaky mobile connection.

Bill Collins Joined: 5 Nov 05 Posts: 25 Credit: 57,544,918 RAC: 0
Message 1372904 - Posted: 29 May 2013, 2:08:44 UTC - in response to Message 1372745.
Thanks. I'll try that out. I've never had to reset the project before.
-Bill
