Suddenly BOINC Decides to Abandon 71 APs...WTH?

Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1697649 - Posted: 2 Jul 2015, 7:32:01 UTC - in response to Message 1697579.  

Just had 273 APs abandoned due to too short a deadline. Shortest time to report was 12 minutes. Machine number 7248689.

This problem has been around for years. When is someone going to fix it...

Do you mean Error tasks for computer 7248689?

I see no sign of a short deadline, let alone "abandoned due to...". All the tasks I've spot-checked had the normal 25-day deadlines for AP tasks.

Unless we're careful and precise in our error reporting - as I and others have spent the last six days trying to demonstrate - it's unlikely it'll ever get fixed.


Richard, I don't know what you were looking at but when I go to the Error tasks section it shows tasks that have very short deadlines.

The column that you are looking at and calling the deadline time is dual use; at the top of the column it says: Time reported or deadline.
Before the task is reported it shows the deadline time; after it is reported it shows the time it was reported.
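
As a rough illustration of that dual-use column (a sketch only: the function name, the output labels and the example timestamps are all made up here, and the real results page shows just the bare timestamp):

```python
from datetime import datetime
from typing import Optional


def time_column(deadline: datetime, reported: Optional[datetime]) -> str:
    """Mimic the 'Time reported or deadline' column for one task."""
    # Hypothetical helper: the "deadline"/"reported" labels are added
    # for clarity and do not appear on the real page.
    if reported is None:
        # Task not yet reported: the column holds the deadline.
        return f"deadline {deadline:%Y-%m-%d %H:%M:%S} UTC"
    # Task already reported: the same column now holds the report time.
    return f"reported {reported:%Y-%m-%d %H:%M:%S} UTC"


# Made-up example values, not taken from any real host.
deadline = datetime(2015, 7, 27, 7, 0, 0)
print(time_column(deadline, None))
print(time_column(deadline, datetime(2015, 7, 2, 7, 12, 0)))
```

If that reading is right, a task reported 12 minutes after it was sent would show a time 12 minutes after the sent time in that column, which could be misread as a 12-minute deadline.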

Claggy
ID: 1697649
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1697668 - Posted: 2 Jul 2015, 8:53:11 UTC - in response to Message 1697265.  
Last modified: 2 Jul 2015, 8:55:07 UTC

my client trigger just leaves a reportable task off the server request every so often (reporting it in a later request). Haven't tried it lately, because I haven't had ghost tasks, so your guess is as good as mine as to whether that logic is intact.

Are you referring to Claggy's resend trigger? I thought that relied on having a report rejected for duplication - in which case, the trigger would be skipping an ack in the server reply, so a reported task remained live in client_state.


Not sure. I recall bouncing it back and forth with claggy and coding it not exactly as was described, after manual attempts at connection chopping didn't work. certainly wasn't that complex in implementation. Anyway, I made it, it's all MINE *muahahahaha*

The double reporting of a task does still trigger a resend lost tasks event:

Here I reported a task and got a new task (without a resend of lost tasks occurring):

02-Jul-2015 08:33:42 [SETI@home] update requested by user
02-Jul-2015 08:33:42 [SETI@home] Sending scheduler request: Requested by user.
02-Jul-2015 08:33:42 [SETI@home] Reporting 1 completed tasks
02-Jul-2015 08:33:42 [SETI@home] Requesting new tasks for CPU
02-Jul-2015 08:33:43 [SETI@home] Scheduler request completed: got 1 new tasks
02-Jul-2015 08:35:16 [SETI@home] Started download of 31dc12ad.30467.464287.438086664204.12.215.vlar
02-Jul-2015 08:35:19 [SETI@home] Finished download of 31dc12ad.30467.464287.438086664204.12.215.vlar

After restoring the reported task back into client_state and reporting it again (with suitably increased cache values), my four ghosts are resent:

Thu Jul 2 08:44:52 2015 | SETI@home | update requested by user
Thu Jul 2 08:44:55 2015 | SETI@home | sched RPC pending: Requested by user
Thu Jul 2 08:44:55 2015 | SETI@home | [sched_op] Starting scheduler request
Thu Jul 2 08:44:55 2015 | SETI@home | Sending scheduler request: Requested by user.
Thu Jul 2 08:44:55 2015 | SETI@home | Reporting 1 completed tasks
Thu Jul 2 08:44:55 2015 | SETI@home | Requesting new tasks for CPU
Thu Jul 2 08:44:55 2015 | SETI@home | [sched_op] CPU work request: 997588.31 seconds; 0.00 devices
Thu Jul 2 08:44:57 2015 | SETI@home | Scheduler request completed: got 4 new tasks
Thu Jul 2 08:44:57 2015 | SETI@home | [sched_op] Server version 707
Thu Jul 2 08:44:57 2015 | SETI@home | Resent lost task 13se12af.32371.215837.438086664207.12.127.vlar_1
Thu Jul 2 08:44:57 2015 | SETI@home | Resent lost task 19se12af.753.16836.438086664199.12.153_1
Thu Jul 2 08:44:57 2015 | SETI@home | Resent lost task 19se12af.753.16836.438086664199.12.215_1
Thu Jul 2 08:44:57 2015 | SETI@home | Resent lost task 20au12ag.15429.18063.438086664201.12.128_0
Thu Jul 2 08:44:57 2015 | SETI@home | Project requested delay of 303 seconds
Thu Jul 2 08:44:57 2015 | SETI@home | [sched_op] estimated total CPU task duration: 883795 seconds
Thu Jul 2 08:44:57 2015 | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task 11se12ab.7663.16427.438086664201.12.135_0
Thu Jul 2 08:44:57 2015 | SETI@home | [sched_op] Deferring communication for 00:05:03
Thu Jul 2 08:44:57 2015 | SETI@home | [sched_op] Reason: requested by project
Thu Jul 2 08:45:25 2015 | SETI@home | Started download of 13se12af.32371.215837.438086664207.12.127.vlar
Thu Jul 2 08:45:25 2015 | SETI@home | Started download of 19se12af.753.16836.438086664199.12.153
Thu Jul 2 08:45:28 2015 | SETI@home | Finished download of 13se12af.32371.215837.438086664207.12.127.vlar
Thu Jul 2 08:45:28 2015 | SETI@home | Finished download of 19se12af.753.16836.438086664199.12.153
Thu Jul 2 08:45:28 2015 | SETI@home | Started download of 19se12af.753.16836.438086664199.12.215
Thu Jul 2 08:45:28 2015 | SETI@home | Started download of 20au12ag.15429.18063.438086664201.12.128
Thu Jul 2 08:45:36 2015 | SETI@home | Finished download of 19se12af.753.16836.438086664199.12.215
Thu Jul 2 08:45:36 2015 | SETI@home | Finished download of 20au12ag.15429.18063.438086664201.12.128
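
To pull the resent task names out of a log like the one above without reading it by eye, something along these lines should do. It is only a sketch: it assumes nothing beyond the "Resent lost task <name>" wording seen in the log, and the function name is made up.

```python
import re

# Matches the message text only, so the timestamp/project prefix
# in front of it can be in either Event Log layout.
RESENT = re.compile(r"Resent lost task (\S+)")


def resent_tasks(log_text):
    """Return the task names reported as resent, in log order."""
    tasks = []
    for line in log_text.splitlines():
        match = RESENT.search(line)
        if match:
            tasks.append(match.group(1))
    return tasks


# Three lines copied from the log above.
sample = """\
Thu Jul 2 08:44:57 2015 | SETI@home | Scheduler request completed: got 4 new tasks
Thu Jul 2 08:44:57 2015 | SETI@home | Resent lost task 13se12af.32371.215837.438086664207.12.127.vlar_1
Thu Jul 2 08:44:57 2015 | SETI@home | Resent lost task 20au12ag.15429.18063.438086664201.12.128_0
"""
print(resent_tasks(sample))
```

An empty list after a double report would suggest the resend was not triggered on that request.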

http://setiathome.berkeley.edu/results.php?hostid=7506529

Claggy
ID: 1697668
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1697669 - Posted: 2 Jul 2015, 8:59:09 UTC - in response to Message 1697668.  

hoohoo great, feature stands
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1697669
William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1699813 - Posted: 9 Jul 2015, 12:37:45 UTC
Last modified: 9 Jul 2015, 12:37:59 UTC

So, after a nice email we've had the fix applied.

I had a quick test and it doesn't look like it has been deployed yet though.
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1699813
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1699927 - Posted: 9 Jul 2015, 20:25:21 UTC - in response to Message 1699813.  

That's good news. It still doesn't look as if it's been deployed yet though. This host has been suffering multiple Abandon Events a day; he just received another: http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=71714
So, I'll keep watching to see if it helps that host.
Thanks for getting something on the table, hopefully it will work as expected.
ID: 1699927
James Sotherden
Joined: 16 May 99
Posts: 10436
Credit: 110,373,059
RAC: 54
United States
Message 1700425 - Posted: 11 Jul 2015, 7:35:52 UTC

I just had 173 abandoned tasks on my Vista machine (# 5065145).
Stderr just states outcome: abandoned. Client state: new.
Whatever that means.
I've had it happen before. But it still sucks.

Old James
ID: 1700425
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1700427 - Posted: 11 Jul 2015, 7:50:15 UTC - in response to Message 1700425.  

I just had 173 abandoned tasks on my Vista machine (# 5065145).
Stderr just states outcome: abandoned. Client state: new.
Whatever that means.
I've had it happen before. But it still sucks.

Were those tasks still being crunched on the host afterwards?

What does the Event Log say for that time period?

Claggy
ID: 1700427
James Sotherden
Joined: 16 May 99
Posts: 10436
Credit: 110,373,059
RAC: 54
United States
Message 1700636 - Posted: 12 Jul 2015, 0:46:43 UTC - in response to Message 1700427.  

I just had 173 abandoned tasks on my Vista machine (# 5065145).
Stderr just states outcome: abandoned. Client state: new.
Whatever that means.
I've had it happen before. But it still sucks.

Were those tasks still being crunched on the host afterwards?

What does the Event Log say for that time period?

Claggy

Like an idiot I rebooted the computer before I saw your message. So event log just shows the last half hour. Sorry.

Old James
ID: 1700636
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1700679 - Posted: 12 Jul 2015, 8:08:37 UTC - in response to Message 1700636.  

I just had 173 abandoned tasks on my Vista machine (# 5065145).
Stderr just states outcome: abandoned. Client state: new.
Whatever that means.
I've had it happen before. But it still sucks.

Were those tasks still being crunched on the host afterwards?

What does the Event Log say for that time period?

Claggy

Like an idiot I rebooted the computer before I saw your message. So event log just shows the last half hour. Sorry.

You'll find the Event Log information stored in the stdoutdae.txt and stdoutdae.old files in the BOINC data directory.
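
If you'd rather not scroll through both files by hand, a small script can pick out the lines for the period in question. This is a minimal sketch: the function name is made up, and the data directory location is something you have to supply yourself, since it varies by install.

```python
from pathlib import Path

# The two Event Log files, older one first so hits come out in time order.
LOG_NAMES = ("stdoutdae.old", "stdoutdae.txt")


def find_log_lines(data_dir, needle):
    """Yield (file name, line) pairs whose text contains `needle`."""
    for name in LOG_NAMES:
        path = Path(data_dir) / name
        if not path.is_file():
            continue  # e.g. no stdoutdae.old has been written yet
        # errors="replace" keeps odd bytes from aborting the search.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if needle in line:
                    yield name, line.rstrip("\n")
```

For example, `list(find_log_lines(your_data_dir, "11-Jul-2015"))` would return that day's entries, assuming the files write dates the way the first log excerpt earlier in this thread does (`02-Jul-2015 08:33:42`); `your_data_dir` is a placeholder for wherever your BOINC data directory actually is.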

Claggy
ID: 1700679
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1701831 - Posted: 15 Jul 2015, 17:51:19 UTC
Last modified: 15 Jul 2015, 17:53:44 UTC

It appears the problem with Abandoned tasks may be over. This host has gone almost 2 days without an Abandoned task, http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=71714
Unfortunately he just suffered a horde of 5 minute Time-outs, and they don't appear to be VLARs.
One down one to go?
Well, at least his machine won't waste time working those Timed-out tasks...
;-)
ID: 1701831
woohoo
Volunteer tester

Joined: 30 Oct 13
Posts: 972
Credit: 165,671,404
RAC: 5
United States
Message 1701837 - Posted: 15 Jul 2015, 18:02:28 UTC

I had that problem a few weeks ago but I haven't had it since.
ID: 1701837
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1701846 - Posted: 15 Jul 2015, 18:13:29 UTC - in response to Message 1701831.  

It appears the problem with Abandoned tasks may be over. This host has gone almost 2 days without an Abandoned task, http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=71714
Unfortunately he just suffered a horde of 5 minute Time-outs, and they don't appear to be VLARs.
One down one to go?
Well, at least his machine won't waste time working those Timed-out tasks...
;-)

That's certainly very encouraging, if not entirely definitive. Perhaps the fix got deployed during this week's outage. I guess if we hear no more anguished cries of abandonment, it'll be good news!
ID: 1701846
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1701847 - Posted: 15 Jul 2015, 18:14:16 UTC - in response to Message 1701831.  

It appears the problem with Abandoned tasks may be over. This host has gone almost 2 days without an Abandoned task, http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=71714

There was supposed to be a note in Eric's lunchbox on Monday, so a "final abandonment" time of 13 Jul 2015, 22:42:03 UTC (15:43 PDT) sounds about right for a late lunch.

If anyone spots an abandonment significantly later than that (say 15 July onwards), could they draw it to our attention, please?
ID: 1701847
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1701969 - Posted: 15 Jul 2015, 23:57:06 UTC - in response to Message 1701847.  

At the risk of invoking Murphy, or gremlins lurking in that spaghetti-code, I'd say the same mode of failure can't happen now. There's a fair amount of DoS vulnerability in that authentication routine, and still chances of spontaneous rashes of unwanted new hostids and similar, but I suspect that particular piece of duct tape will hold.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1701969