Abandoned tasks - Ongoing issue


Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 74,972,850
RAC: 42,868
Argentina
Message 1341373 - Posted: 27 Feb 2013, 21:00:43 UTC

I don't know if I'm the only one suffering from this, but my 3 SETI hosts have been abandoning tasks constantly over the last 2 or 3 weeks...

I thought this issue was fixed after the repairs Matt did on the servers' RAID disks last week, but it's not... Yesterday it happened again on one of my hosts and today it happened on another...

Is it possible that this is due to something on my end? Can I do something about it?
As I've said in another thread, the "abandoning" event happens at almost the same time the host does an RPC (the mismatch is just a couple of seconds, probably due to the clocks not being fully synchronized). The only weird thing I've noticed around that time is that some of the server's answers do not seem right for that host, but not always. For example, I've seen an answer saying that it didn't send work because the last contact was too recent, when the last one was more than 5 minutes earlier. Also, after several successful contacts in which I received new tasks, the servers have sent me a batch of "lost" tasks, even though there had not been any failed RPCs... I know, this last one can be tricky to verify, so take it with a grain of salt.

And, by the way, what is that "abandoned" thing, and what triggers it?
If the servers think that the host doesn't have the tasks, then they should be resent (which would be less harmful, as the client will notice that it already has the files, so it won't waste bandwidth)...
And also, why don't the servers tell the client that the tasks in its cache were marked as abandoned, so it can abort them? I know that the servers check the list; if they didn't, they wouldn't be able to resend ghosts...

TBH, this thing is annoying me a lot... As I've said before, I can understand that the project may not have enough tasks/resources to feed all the hosts out there, and I have no issue if, because of that, my hosts have to crunch for the backup project or, in the worst case, sit idle. But due to this thing I'm wasting a lot of money on electricity working on tasks that SETI is going to ignore...
____________

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4422
Credit: 118,486,711
RAC: 136,181
United States
Message 1341639 - Posted: 28 Feb 2013, 15:25:32 UTC

I forget the triggers that cause tasks to be abandoned. You can check your stdoutdae.txt or stdoutdae.old to see if the reason was stated locally in your log.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 74,972,850
RAC: 42,868
Argentina
Message 1341645 - Posted: 28 Feb 2013, 15:44:17 UTC

Usually, nothing weird or related to this appears in the logs, except the things noted in my previous post, but those might be just coincidences, as they don't happen regularly...

In fact, on the last computer where it happened, the only thing in the log around the time the tasks were marked as abandoned is a lot of successful RPCs...
As this is the slowest cruncher I have, there are no downloads in progress from at least half an hour before the event until half an hour after it, and none of these RPCs was reporting any completed task. They were all just asking for more work and they all got the same answer, saying that the host had reached the limits...

Would it be worth enabling some debug options to be recorded in the logs? Which ones?
____________

Oddbjornik
Volunteer tester
Joined: 15 May 99
Posts: 74
Credit: 87,109,044
RAC: 58,384
Norway
Message 1341647 - Posted: 28 Feb 2013, 15:57:44 UTC - in response to Message 1341645.

One of my hosts abandoned all its tasks on February 15 at approximately 12:06:35 local time (11:06:35 UTC). Here is the somewhat contradictory log fragment surrounding the abandonment:

15-Feb-2013 11:56:38 [SETI@home] Sending scheduler request: To fetch work.
15-Feb-2013 11:56:38 [SETI@home] Requesting new tasks for CPU and NVIDIA
15-Feb-2013 11:56:43 [SETI@home] Scheduler request completed: got 0 new tasks
15-Feb-2013 11:56:43 [SETI@home] No tasks sent
15-Feb-2013 11:56:43 [SETI@home] This computer has reached a limit on tasks in progress
15-Feb-2013 11:56:43 [SETI@home] Project has no tasks available
15-Feb-2013 12:06:52 [SETI@home] Sending scheduler request: To fetch work.
15-Feb-2013 12:06:52 [SETI@home] Requesting new tasks for CPU
15-Feb-2013 12:06:58 [SETI@home] Scheduler request completed: got 0 new tasks
15-Feb-2013 12:06:58 [SETI@home] Not sending work - last request too recent: 23 sec


Does that make any sense? I can't find any event log messages indicating computer clock adjustment around that time, and even if the time had been adjusted it certainly would only have been by a few seconds, not ten minutes.

And also, a too-recent request should only be ignored, not punished like this :-)

It kind of points to the borderline paranoid assumption that someone else has called the scheduler, reporting that all tasks for this host have been abandoned, 23 seconds prior to my completely legitimate call at 12:06:52.
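
The contradiction in the log above can be checked mechanically. Below is a minimal Python sketch, assuming the event log line format shown in this post and English month abbreviations; the function name `find_phantom_contacts` and the 30-second slack are made up for illustration. It compares the "last request too recent: N sec" value with the gap to the client's own previous scheduler request and reports replies where the two disagree.

```python
import re
from datetime import datetime

# Line format assumed from the log fragment quoted in this post.
LINE_RE = re.compile(r"^(\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) \[(.+?)\] (.*)$")
TOO_RECENT_RE = re.compile(r"last request too recent: (\d+) sec")


def find_phantom_contacts(lines, slack_sec=30):
    """Return (reply_time, claimed_sec, actual_sec) for each suspicious reply.

    A reply is suspicious when the server claims the last request was far more
    recent than the client's own previous scheduler request.
    """
    suspicious = []
    previous_request = None  # request before the one being answered
    current_request = None   # request currently being answered
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%d-%b-%Y %H:%M:%S")
        message = match.group(3)
        if message.startswith("Sending scheduler request"):
            previous_request, current_request = current_request, stamp
            continue
        hit = TOO_RECENT_RE.search(message)
        if hit and previous_request is not None and current_request is not None:
            claimed = int(hit.group(1))
            actual = int((current_request - previous_request).total_seconds())
            if actual - claimed > slack_sec:
                suspicious.append((stamp, claimed, actual))
    return suspicious


SAMPLE_LOG = """\
15-Feb-2013 11:56:38 [SETI@home] Sending scheduler request: To fetch work.
15-Feb-2013 11:56:43 [SETI@home] Scheduler request completed: got 0 new tasks
15-Feb-2013 12:06:52 [SETI@home] Sending scheduler request: To fetch work.
15-Feb-2013 12:06:58 [SETI@home] Not sending work - last request too recent: 23 sec
""".splitlines()

for when, claimed, actual in find_phantom_contacts(SAMPLE_LOG):
    print(f"{when}: server says {claimed} s, client log says {actual} s")
```

Run against a full stdoutdae.txt, the same function would list every reply where the server's idea of the last contact does not match the client's own log, which might help correlate those moments with the abandonment times shown on the website.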
____________

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 74,972,850
RAC: 42,868
Argentina
Message 1341656 - Posted: 28 Feb 2013, 16:23:15 UTC

And it happened again just a couple of minutes ago on another host.
The relevant lines in the log are these:


28-Feb-2013 13:04:13 [SETI@home] Sending scheduler request: To fetch work.
28-Feb-2013 13:04:13 [SETI@home] Reporting 1 completed tasks, requesting new tasks for CPU and GPU
28-Feb-2013 13:04:49 [SETI@home] Scheduler request completed: got 1 new tasks
28-Feb-2013 13:09:52 [SETI@home] Sending scheduler request: To fetch work.
28-Feb-2013 13:09:52 [SETI@home] Reporting 5 completed tasks, requesting new tasks for CPU and GPU
28-Feb-2013 13:09:56 [SETI@home] Scheduler request completed: got 0 new tasks
28-Feb-2013 13:09:56 [SETI@home] Message from server: Not sending work - last request too recent: 66 sec


And the tasks were abandoned at 16:09:00 UTC. As I'm at UTC-3, that is around 13:09:00 local time, which is close to the time at which the servers think the host made its last contact...
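
The time arithmetic behind that comparison can be written out. This is a small Python sketch, assuming the log timestamps are local time (UTC-3) and that the 66 seconds are counted back from the moment the request was sent (the server really measures on receipt, a few seconds later).

```python
from datetime import datetime, timedelta, timezone

LOCAL_TZ = timezone(timedelta(hours=-3))  # UTC-3, as stated in the post

# Abandonment time as shown on the website (UTC).
abandoned_utc = datetime(2013, 2, 28, 16, 9, 0, tzinfo=timezone.utc)
abandoned_local = abandoned_utc.astimezone(LOCAL_TZ)

# Request that got the "too recent: 66 sec" answer (local time from the log).
request_local = datetime(2013, 2, 28, 13, 9, 52, tzinfo=LOCAL_TZ)
implied_contact_local = request_local - timedelta(seconds=66)

# Distance between the abandonment and the contact the server believes in.
gap = abs((abandoned_local - implied_contact_local).total_seconds())
print(abandoned_local.strftime("%H:%M:%S"))
print(implied_contact_local.strftime("%H:%M:%S"))
print(gap)
```

Under those assumptions the contact the server remembers falls within seconds of the abandonment, while the client's own previous request was about five minutes earlier.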

I'm not the only one. It seems like the servers are mixing up different hosts' RPCs or, maybe, there is a bug or error in the databases or in the queries...
____________

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8629
Credit: 51,401,677
RAC: 50,530
United Kingdom
Message 1341666 - Posted: 28 Feb 2013, 16:43:56 UTC - in response to Message 1341656.

The "last request too recent" messages are certainly suggestive of some other computer attempting to contact the scheduler with the same HostID number. For those of you afflicted with this problem (it doesn't seem to affect all of us), it just might be helpful to keep an eye on the IP addresses shown on the host details page for that host (available to logged-in users only): if there is a 'ghost host' contacting the scheduler, that should change, and the interloper's IP address might help the staff to track it down.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5861
Credit: 60,403,790
RAC: 49,114
Australia
Message 1341678 - Posted: 28 Feb 2013, 17:02:48 UTC - in response to Message 1341666.


Are all these systems contacting Seti directly or through Proxies?
____________
Grant
Darwin NT.

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 74,972,850
RAC: 42,868
Argentina
Message 1341681 - Posted: 28 Feb 2013, 17:06:09 UTC
Last modified: 28 Feb 2013, 17:35:23 UTC

On the host that failed recently, the IP address shown is the right one (the one the host uses inside the local LAN), and it says it was the same the last 31565 times...
I've checked the other hosts and they are all showing the right IP, and all of them report it as the same the last 30K times... which is as it should be, because I'm using fixed IPs inside...

The external IPs are in the expected range, but it's a bit tricky to be sure because I'm using a multi-WAN router that gets internet from 2 ISPs and from a direct link with my office, where there is another multi-WAN router getting internet from another 3 different connections...
(This whole multi-ISP thing is not related: last week I unplugged all but one of the ISP connections and it failed in the same way.)
____________

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 74,972,850
RAC: 42,868
Argentina
Message 1341684 - Posted: 28 Feb 2013, 17:07:38 UTC - in response to Message 1341678.
Last modified: 28 Feb 2013, 17:34:54 UTC


Are all these systems contacting Seti directly or through Proxies?

Last week it failed without using a proxy, today failed while using a proxy.
____________

Oddbjornik
Volunteer tester
Joined: 15 May 99
Posts: 74
Credit: 87,109,044
RAC: 58,384
Norway
Message 1341727 - Posted: 28 Feb 2013, 19:20:49 UTC - in response to Message 1341684.


Are all these systems contacting Seti directly or through Proxies?

Last week it failed without using a proxy, today failed while using a proxy.


I've seen the same, in opposite sequence; last November it failed twice while using a proxy, on February 14th and again on the 15th it failed without a proxy.
____________

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 74,972,850
RAC: 42,868
Argentina
Message 1342071 - Posted: 1 Mar 2013, 17:01:14 UTC
Last modified: 1 Mar 2013, 17:03:45 UTC

It happened again last night on one of the hosts...
I'm having at least one "abandoned" batch of tasks each day... sometimes more...

No more ideas?
Nobody knows why this happens? What are the theoretical triggers for this?

I can't expect this to end by itself... If I can't do anything and this is not going to be checked by the project staff, then what? Should I stop crunching for SETI? ... I guess I know the answer :(
____________

cov_route
Joined: 13 Sep 12
Posts: 295
Credit: 7,357,863
RAC: 11,475
Canada
Message 1342082 - Posted: 1 Mar 2013, 17:17:32 UTC - in response to Message 1342071.

I think nobody knows what is going on, Horacio; that's why you are getting no answer.

If it comes to the point where you really are about to quit SAH, how about instead making a completely new identity with new IDs for your computers? The only theory I've seen put forward for mass abandons is 3rd-party unauthorized intervention; at least a new ID might be immune from that.

If the problem continues it will point fairly conclusively at some issue with the infrastructure.

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 74,972,850
RAC: 42,868
Argentina
Message 1342105 - Posted: 1 Mar 2013, 17:54:53 UTC - in response to Message 1342082.

I think nobody knows what is going on, Horacio; that's why you are getting no answer.

If it comes to the point where you really are about to quit SAH, how about instead making a completely new identity with new IDs for your computers? The only theory I've seen put forward for mass abandons is 3rd-party unauthorized intervention; at least a new ID might be immune from that.

If the problem continues it will point fairly conclusively at some issue with the infrastructure.


I appreciate you taking the time to bear with me, and I'm thankful for that, but:

As stated, the IP logs show no other hosts contacting the servers as if they were mine. Indeed, if that were happening, my hosts should get a new computer ID due to the sequence number of the request being lower than expected, and this is not happening.

Anyway, if the issue is someone breaking into the servers and faking my ID in a way clever enough not to be noticed, then, first, this should be addressed ASAP by the staff, as it is a big security issue that is not going to be good for the credibility of the project and could be a serious risk for the accuracy of the data processed. But even if they don't care about that, I don't see how changing the IDs of my hosts would help in this case, as host IDs are public, unless I made a completely new account with the hosts hidden from the start... But I've been crunching for SETI for more than 13 years with my current account and I don't want to lose it.

If there is no solution to my issue right now, I'm sure that sooner or later this is going to explode in someone's face and they will fix it... Or, in the best case, they will upgrade the software for some other reason and that will fix this as a side effect...
So if the only options are to stop crunching for a while or to lose my account, I'll certainly choose to stop crunching.

Sorry if this sounds aggressive or wrong in any way; that's not the intention. It's just that I'm very frustrated and, as English is not my main language, I'm not sure whether I'm using the wrong words.
____________

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 74,972,850
RAC: 42,868
Argentina
Message 1342166 - Posted: 1 Mar 2013, 19:27:53 UTC
Last modified: 1 Mar 2013, 19:30:53 UTC

I have a (non-paranoid) conjecture about what is happening here, but it's based on certain things whose inner workings I don't know...

The point is how the scheduler works. My guess is that:
The scheduler checks the rpc_seq. If it is OK, everything goes on as expected and ends OK. But if the number is wrong, then
it marks all the tasks currently on the host as abandoned, updating the DB, and then it assigns a new ID and asks the client for an acknowledgement. If the client answers, it updates the DB; if there is no acknowledgement, it keeps the host on the original ID.

Now, if for some reason one RPC hits a faulty router and the packets carrying the RPC get queued for more than 5 minutes, then on the client side that RPC is going to fail (due to the timeout), and later the client will send another one with a higher rpc_seq...
Now suppose that the first RPC gets dequeued after the second RPC was successful, and reaches the scheduler with a wrong sequence number. This will trigger what I've described above: the host is not going to reply to that RPC, as it thinks it has failed, so my host will be left with its tasks marked as abandoned, but it will keep the original ID... And depending on the timing, it may get that weird answer about the last contact being too recent...

The hard thing here is that I'm pretty sure that the possibility of an RPC's packets being held queued in a router for so long should be zero...

But if anybody were able to confirm that the scheduler works as I think, then... the impossible would become a fact and the scheduler would need a fix...

The only thing that still eludes me is why, after this, the scheduler doesn't notice that the tasks reported as cached on the client are wrong...
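
The conjecture above can be walked through with a toy model. This Python sketch only encodes the guess made in this post; it is not the real scheduler, and `HostRecord`, `handle_rpc`, the task states and the ID bump are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HostRecord:
    """Hypothetical server-side view of one host (not real BOINC code)."""
    host_id: int
    rpc_seq: int = 0
    tasks: dict = field(default_factory=dict)  # task name -> state


def handle_rpc(host: HostRecord, rpc_seq: int, client_acks: bool) -> str:
    """Apply the conjectured scheduler logic to one incoming RPC."""
    if rpc_seq > host.rpc_seq:
        host.rpc_seq = rpc_seq
        return "ok"
    # Conjecture: a stale sequence number is treated as a copied host, so every
    # task in progress is marked abandoned before the ID change is confirmed.
    for name in list(host.tasks):
        if host.tasks[name] == "in progress":
            host.tasks[name] = "abandoned"
    if client_acks:
        host.host_id += 1_000_000  # stand-in for "assign a new ID"
        return "new id assigned"
    return "abandoned, id kept"


host = HostRecord(
    host_id=42,
    rpc_seq=9,
    tasks={"wu_1": "in progress", "wu_2": "in progress"},
)

# RPC 10 is stuck in a router; the client times out and sends RPC 11 instead.
first = handle_rpc(host, rpc_seq=11, client_acks=True)
# RPC 10 finally arrives; the client gave up on it and never acknowledges.
second = handle_rpc(host, rpc_seq=10, client_acks=False)

print(first, "|", second, "|", host.host_id, host.tasks)
```

In this model the late RPC leaves every task abandoned while the host keeps its original ID, which matches the symptoms described in the thread. Whether the real scheduler behaves this way is exactly the open question.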
____________

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4422
Credit: 118,486,711
RAC: 136,181
United States
Message 1342174 - Posted: 1 Mar 2013, 19:39:53 UTC

Once the tasks are flagged as abandoned are they not removed from your client on that or the next update?

This host shows ~ 200 "abandoned" today & 129 "In progress" at the moment. So if that host only has 129 tasks on hand I don't see what the issue is. If you have over 300 tasks on hand that is an issue.
____________

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 74,972,850
RAC: 42,868
Argentina
Message 1342181 - Posted: 1 Mar 2013, 19:51:50 UTC - in response to Message 1342174.

Once the tasks are flagged as abandoned are they not removed from your client on that or the next update?

This host shows ~ 200 "abandoned" today & 129 "In progress" at the moment. So if that host only has 129 tasks on hand I don't see what the issue is. If you have over 300 tasks on hand that is an issue.

Nope, they are kept on the hosts and they are crunched normally...

My hosts are more or less synchronized now because I check the errors tabs on the web looking for recent "abandonments", and if there is a new one I reset the project to resync the tasks and stop crunching the abandoned ones.
I do a reset because manually selecting the ones that were abandoned would take a lot of work and time :(

If I miss the abandonment event (after all, I have to sleep or work sometimes) and I don't do a reset, then all the crunching is going to be wasted, along with the electricity used, which I'll have to pay for anyway.
____________

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4422
Credit: 118,486,711
RAC: 136,181
United States
Message 1342188 - Posted: 1 Mar 2013, 20:07:57 UTC

One of my hosts did have an issue with several abandoned a day some time ago. However, the tasks were not retained on the system. I'm not sure if that was due to the nature of how those tasks were flagged as abandoned or if the newer BOINC 6.12 client takes care of cleaning out abandoned tasks.

If a reset works you could automate running it on a schedule every 6 or 12 hours. The command line syntax would be: boinccmd --project http://setiathome.berkeley.edu/ reset
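
For anyone who would rather script this than type it, here is a minimal Python sketch that wraps the same command. It assumes `boinccmd` is on the PATH and can reach the local client without extra options; the function names are made up for illustration.

```python
import shutil
import subprocess

PROJECT_URL = "http://setiathome.berkeley.edu/"


def build_reset_command(project_url: str = PROJECT_URL) -> list:
    """Build the argument list for the reset command quoted above."""
    return ["boinccmd", "--project", project_url, "reset"]


def reset_project(project_url: str = PROJECT_URL) -> bool:
    """Run the reset; return True if boinccmd exited successfully."""
    if shutil.which("boinccmd") is None:
        return False  # boinccmd is not on PATH on this machine
    completed = subprocess.run(build_reset_command(project_url), check=False)
    return completed.returncode == 0


if __name__ == "__main__":
    # Only print the command here; call reset_project() from the scheduled job.
    print(" ".join(build_reset_command()))
```

Calling `reset_project()` from cron or the Windows Task Scheduler every 6 or 12 hours would give the schedule suggested above. Keep in mind that a reset throws away all work in progress for the project.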

____________

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 74,972,850
RAC: 42,868
Argentina
Message 1342192 - Posted: 1 Mar 2013, 20:28:54 UTC - in response to Message 1342188.
Last modified: 1 Mar 2013, 20:31:29 UTC

If a reset works you could automate running it on a schedule every 6 or 12 hours. The command line syntax would be: boinccmd --project http://setiathome.berkeley.edu/ reset

I've thought of something like that, but with the current issues downloading work it's not that simple: if I do a reset on a regular basis, I'll effectively not be crunching for SETI...
I've thought about making a little app that polls the web page looking for abandoned tasks on each host, so the reset is triggered only when it's really needed...
But if a newer version takes care of cleaning out abandoned tasks, then it might be worth trying that first... The only drawback is that, as I've said before, with the longer backoffs and retry times of the new versions, added to my slow (and worse, highly unreliable) internet connection, I'm sure it will be the same as not crunching...

Is anybody able to confirm that newer versions take care of the cleaning?

EDIT: Are you sure they were cleaned? Isn't it possible that by the time you noticed the issue they had all already been crunched and reported?... Once a task is abandoned, even if it gets reported, the web page doesn't change, nor does it give any clue that the task was crunched anyway...
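
The decision part of the polling idea mentioned in this post could look roughly like the Python sketch below. It assumes the host's task list page marks such tasks with the word "Abandoned"; that label, the sample text and both function names are assumptions for illustration, and fetching the page is left out on purpose.

```python
import re


def count_abandoned(page_text: str) -> int:
    """Count occurrences of the word "Abandoned" in a task-list page.

    The label is an assumption about how the website marks these tasks.
    """
    return len(re.findall(r"\babandoned\b", page_text, flags=re.IGNORECASE))


def needs_reset(previous_count: int, page_text: str) -> tuple:
    """Return (reset_needed, new_count) for one polling round."""
    current = count_abandoned(page_text)
    return current > previous_count, current


# Two made-up snapshots of the page text, for illustration only.
earlier = "task_a Completed and validated\ntask_b In progress"
later = "task_a Completed and validated\ntask_b Abandoned\ntask_c Abandoned"

seen = count_abandoned(earlier)
reset_needed, seen = needs_reset(seen, later)
print(reset_needed, seen)
```

A reset would then be triggered only when the count grows between two polls, instead of on a fixed schedule.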
____________

Khangollo
Joined: 1 Aug 00
Posts: 245
Credit: 36,410,524
RAC: 0
Slovenia
Message 1342197 - Posted: 1 Mar 2013, 20:43:58 UTC
Last modified: 1 Mar 2013, 20:45:03 UTC

Is anybody able to confirm that newer versions take care of the cleaning?

I can confirm that it doesn't.

I was using 7.0.x when it happened to me (four times so far) and all tasks stayed, wasting time and electricity like there is no tomorrow. In my case, it always happened during constant scheduler timeouts.

And no, I totally don't believe it's a security issue (as in someone trying to fake your host ID). Definitely another bug in server software.
It all started happening suddenly, at exactly the same time as the scheduler troubles, months ago.
____________

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 74,972,850
RAC: 42,868
Argentina
Message 1342201 - Posted: 1 Mar 2013, 20:57:30 UTC - in response to Message 1342197.

Thanks!
Oh... well...

____________


Copyright © 2014 University of California