Ghost WU issue (and some talk about deadlines)

Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 569694 - Posted: 17 May 2007, 18:20:17 UTC - in response to Message 569681.  
Last modified: 17 May 2007, 18:34:58 UTC

I just did a little test. Hit the update button on my quad rig a dozen times or so. The first attempt resulted in a http internal server error. Refreshed the results page, and voila! Another WU shown that I did not get. Tried the button a few more times, could not connect to server. Then one more button push, and another http error. Refreshed the results page and there it was, one more WU the server thinks I have that I do not.
So maybe Hank is on to something here.
I hope this gives Matt and Eric a bit of direction as to where to look to try to fix the problem.


Tested and confirmed. From my log:


5/17/2007 2:16:10 PM|SETI@home|Sending scheduler request to http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi
5/17/2007 2:16:10 PM|SETI@home|Reason: To fetch work
5/17/2007 2:16:10 PM|SETI@home|Requesting 149853 seconds of new work
5/17/2007 2:16:31 PM||Project communication failed: attempting access to reference site
5/17/2007 2:16:33 PM||Access to reference site succeeded - project servers may be temporarily down.
5/17/2007 2:16:36 PM|SETI@home|Scheduler request failed: couldn't connect to server
5/17/2007 2:16:36 PM|SETI@home|Deferring scheduler requests for 1 minutes and 0 seconds
5/17/2007 2:16:41 PM|SETI@home|Sending scheduler request to http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi
5/17/2007 2:16:41 PM|SETI@home|Reason: Requested by user
5/17/2007 2:16:41 PM|SETI@home|Requesting 149889 seconds of new work
5/17/2007 2:17:11 PM|SETI@home|Scheduler request failed: HTTP internal server error
5/17/2007 2:17:11 PM|SETI@home|Deferring scheduler requests for 1 minutes and 0 seconds
5/17/2007 2:18:12 PM|SETI@home|Sending scheduler request to http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi
5/17/2007 2:18:12 PM|SETI@home|Reason: Requested by user
5/17/2007 2:18:12 PM|SETI@home|Requesting 149998 seconds of new work
5/17/2007 2:18:32 PM|SETI@home|Scheduler request failed: HTTP internal server error
5/17/2007 2:18:32 PM|SETI@home|Deferring scheduler requests for 1 minutes and 0 seconds
5/17/2007 2:19:32 PM|SETI@home|Sending scheduler request to http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi
5/17/2007 2:19:32 PM|SETI@home|Reason: Requested by user
5/17/2007 2:19:32 PM|SETI@home|(not requesting new work or reporting completed tasks)
5/17/2007 2:19:47 PM|SETI@home|Scheduler request succeeded


Results page for my Intel host now shows:

534644251 129353870 17 May 2007 18:15:22 UTC 11 Jun 2007 16:48:02 UTC In Progress Unknown New --- --- ---
534643482 129353633 17 May 2007 18:13:59 UTC 11 Jun 2007 16:46:39 UTC In Progress Unknown New --- --- ---

These are not waiting for me to work on. Don't know what the time differential is either...
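On the time differential: below is a rough sketch (not something anyone in this thread ran) for lining the client log up against the "sent" times on the results page. It assumes the log timestamps are US Eastern daylight time (UTC-4), which fits the post time but is still an assumption, and `failed_requests` and `parse_sent` are made-up helper names.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: the client log is written in local time, UTC-4.
LOCAL_TZ = timezone(timedelta(hours=-4))

# A log line looks like: timestamp|project|message
LOG_LINE = re.compile(r"^(?P<ts>[^|]+)\|(?P<project>[^|]*)\|(?P<msg>.*)$")


def failed_requests(log_text, needle="HTTP internal server error"):
    """Return the UTC datetimes of log lines whose message contains needle."""
    hits = []
    for line in log_text.splitlines():
        m = LOG_LINE.match(line.strip())
        if m and needle in m.group("msg"):
            local = datetime.strptime(m.group("ts"), "%m/%d/%Y %I:%M:%S %p")
            hits.append(local.replace(tzinfo=LOCAL_TZ).astimezone(timezone.utc))
    return hits


def parse_sent(text):
    """Parse a results-page time such as '17 May 2007 18:15:22 UTC'."""
    parsed = datetime.strptime(text, "%d %b %Y %H:%M:%S UTC")
    return parsed.replace(tzinfo=timezone.utc)


# Three lines copied from the log above, and the two ghost "sent" times.
log = """\
5/17/2007 2:16:36 PM|SETI@home|Scheduler request failed: couldn't connect to server
5/17/2007 2:17:11 PM|SETI@home|Scheduler request failed: HTTP internal server error
5/17/2007 2:18:32 PM|SETI@home|Scheduler request failed: HTTP internal server error
"""
sent_times = [
    parse_sent("17 May 2007 18:15:22 UTC"),
    parse_sent("17 May 2007 18:13:59 UTC"),
]

# Print every failure/sent pairing so the gaps can be eyeballed.
for failure in failed_requests(log):
    for sent in sent_times:
        print(failure.time(), "minus", sent.time(), "=", failure - sent)
```

This only measures the gap between when the client gave up on a request and when the server says it sent the result; it says nothing about why the gap exists.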

What I'd be interested in knowing is if you can get any new work without unloading and reloading the manager? IOW, once the condition happens does it continue to happen until the manager is unloaded and reloaded, or is it simply a server-side issue?
ID: 569694
Rene
Volunteer tester
Joined: 22 Mar 04
Posts: 53
Credit: 323,591
RAC: 0
Netherlands
Message 569695 - Posted: 17 May 2007, 18:21:56 UTC
Last modified: 17 May 2007, 18:43:09 UTC

Here's a small part of the messages from the manager:
(after re-opening network usage)

17-5-2007 9:20:33|SETI@home|Requesting 86400 seconds of new work
17-5-2007 9:20:58|SETI@home|Scheduler RPC succeeded [server version 509]
17-5-2007 9:20:58|SETI@home|Message from server: Incomplete request received.
17-5-2007 9:20:58|SETI@home|New host venue:
17-5-2007 9:20:58||General prefs: from SETI@home (last modified 2007-03-30 20:34:18)
17-5-2007 9:20:58||Host location: none
17-5-2007 9:20:58||General prefs: using your defaults
17-5-2007 9:20:58|SETI@home|Deferring communication for 11 sec
17-5-2007 9:20:58|SETI@home|Reason: requested by project
17-5-2007 9:20:58|SETI@home|Deferring communication for 1 min 0 sec
17-5-2007 9:20:58|SETI@home|Reason: no work from project
17-5-2007 9:21:58|SETI@home|Fetching scheduler list
17-5-2007 9:22:03|SETI@home|Master file download succeeded
17-5-2007 9:22:08|SETI@home|Sending scheduler request: To fetch work
17-5-2007 9:22:08|SETI@home|Requesting 86400 seconds of new work
17-5-2007 9:22:38|SETI@home|Scheduler RPC succeeded [server version 509]
17-5-2007 9:22:38|SETI@home|Message from server: Incomplete request received.
17-5-2007 9:22:38|SETI@home|Deferring communication for 11 sec


And a bit later on...

17-5-2007 9:37:42|SETI@home|Sending scheduler request: To fetch work
17-5-2007 9:37:42|SETI@home|Requesting 86400 seconds of new work
17-5-2007 9:38:07|SETI@home|Scheduler RPC succeeded [server version 509]
17-5-2007 9:38:07|SETI@home|Message from server: Incomplete request received.
17-5-2007 9:38:07|SETI@home|Deferring communication for 11 sec
17-5-2007 9:38:07|SETI@home|Reason: requested by project
17-5-2007 9:38:07|SETI@home|Deferring communication for 8 min 47 sec
17-5-2007 9:38:07|SETI@home|Reason: no work from project


Note: time notation DD-MM-YYYY...

;-)




ID: 569695
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 569706 - Posted: 17 May 2007, 18:42:46 UTC - in response to Message 569681.  

I just did a little test. Hit the update button on my quad rig a dozen times or so. The first attempt resulted in a http internal server error. Refreshed the results page, and voila! Another WU shown that I did not get. Tried the button a few more times, could not connect to server. Then one more button push, and another http error. Refreshed the results page and there it was, one more WU the server thinks I have that I do not.
So maybe Hank is on to something here.
I hope this gives Matt and Eric a bit of direction as to where to look to try to fix the problem.

Also confirmed. 'HTTP internal server error' coincides with a ghost WU, 'couldn't connect to server' doesn't.

But for host 2901600 I now have:

Last time contacted server 16 May 2007 21:33:44 UTC

534648073 129355132 17 May 2007 18:29:16 UTC 11 Jun 2007 17:02:10 UTC In Progress Unknown New

so it's not just failing to send the WU (or rather, the instruction to tell the client to download the WU) - it's failing to update its own table to acknowledge that the host has contacted the server.

The scheduler request was for 38040 seconds of work, which would have been multiple WUs, yet I only got one ghost: if I cut the cache value right down so it only asks for 1 WU, will that help, I wonder?
ID: 569706
Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 569713 - Posted: 17 May 2007, 18:50:17 UTC
Last modified: 17 May 2007, 18:53:49 UTC

Editing subject again

Self-deprecating commentary in the subject title removed...
Really promise this is the last time I'll change the title... LOL
ID: 569713
Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 569725 - Posted: 17 May 2007, 19:10:09 UTC

Posted a link to this thread over in the blog section in this post
ID: 569725
Rene
Volunteer tester
Joined: 22 Mar 04
Posts: 53
Credit: 323,591
RAC: 0
Netherlands
Message 569730 - Posted: 17 May 2007, 19:19:17 UTC

Also got one now on my Athlon running XP.
"Scheduler request failed: HTTP internal server error"
Suspended network usage on that one now.

;-)
ID: 569730
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 569734 - Posted: 17 May 2007, 19:24:13 UTC - in response to Message 569706.  

The scheduler request was for 38040 seconds of work, which would have been multiple WUs, yet I only got one ghost: if I cut the cache value right down so it only asks for 1 WU, will that help, I wonder?

Well, after much fiddling with the magic jumping host venue (and several more ghost WUs), I got the scheduler request down to 1 WU (760 seconds) - and almost immediately got a successful scheduler RPC (which reset my host venue again....). Still "no work from project", though.

Arrrrrgggghhhhh - while I was typing that, another request for 760 seconds got the internal server error, and another ghost. Supposition disproved: back to the drawing board.
ID: 569734
Conrad Human
Volunteer tester
Joined: 17 Nov 00
Posts: 67
Credit: 2,009,224
RAC: 0
South Africa
Message 569737 - Posted: 17 May 2007, 19:31:23 UTC

What would be nice is if I could mark an unfinished unit to be resent to me by the scheduler.

Oh please, Uncle Bruno, I need some work
ID: 569737
Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 569769 - Posted: 17 May 2007, 20:44:18 UTC

Humble suggestion before I leave for school:

Since a fair number of us are seeing these ghost units come up, it might make some sense to not attempt to get new work for a while. More occurrences of this just make the clutter harder to clean up and may be what's causing the other "no work from project" messages.

We might want to give the team some time to figure out what is going on. They can likely go back and pull session logs based on those of us who have reported specific events.

Just a thought, and remember, it is the thought that counts... :)
ID: 569769
arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 569789 - Posted: 17 May 2007, 21:14:57 UTC

I also had several http errors in a row and now have several ghost units that correspond to when I get those errors.

ID: 569789
Gavin Shaw
Joined: 8 Aug 00
Posts: 1116
Credit: 1,304,337
RAC: 0
Australia
Message 569860 - Posted: 17 May 2007, 22:55:24 UTC - in response to Message 569789.  

I also had several http errors in a row and now have several ghost units that correspond to when I get those errors.


Same here. I asked my system to update, got an HTTP internal server error, and now have workunits listed on my results page that are not on my computers.

To the best of my knowledge I have actually only received a total of 6 or 8 units since the new server went online, and those were several days ago. I have had nothing since. It took several days just to report the results from those units and others I had before the server died.

I have spent the week working on Rosetta, since I cannot get anything here. The sooner everything is back to normal here, the sooner I can get going again.

And those who are getting actual workunits to work on, consider yourselves lucky. There are some of us who have had nothing for days.

Never surrender and never give up. In the darkest hour there is always hope.

ID: 569860
AstroNerdBoy
Joined: 3 Jun 99
Posts: 1
Credit: 19,448,583
RAC: 0
United States
Message 570129 - Posted: 18 May 2007, 8:08:11 UTC - in response to Message 569860.  

And those who are getting actual workunits to work on, consider yourselves lucky. There are some of us who have had nothing for days.


In another forum, there was a suggestion to uninstall BOINC, delete the BOINC directory, and re-install fresh. Since it has been two days since my last computer made a successful communication with the SETI server(s), I decided to try it myself. I didn't delete the old directory but did rename it. Sure enough, while there are still communication problems, I have both CPUs on my machine now processing new data.

Now, I've read through this thread and I'm guessing the three WUs in the "Tasks" section were ghost ones, since in the past all completed WUs sat in the "Transfers" section until they could be transferred successfully. Would this be a correct statement?
ID: 570129
ecpa
Volunteer tester
Joined: 3 Apr 99
Posts: 35
Credit: 9,588,416
RAC: 0
Germany
Message 570132 - Posted: 18 May 2007, 8:12:30 UTC - in response to Message 569789.  

I also had several http errors in a row and now have several ghost units that correspond to when I get those errors.

Got about 50 or more ghosts on my 7 computers. I hope this will be resolved soon.

ID: 570132
Henk Haneveld
Volunteer tester
Joined: 16 May 99
Posts: 154
Credit: 1,577,293
RAC: 1
Netherlands
Message 570133 - Posted: 18 May 2007, 8:16:21 UTC - in response to Message 570129.  

And those who are getting actual workunits to work on, consider yourselves lucky. There are some of us who have had nothing for days.


In another forum, there was a suggestion to uninstall BOINC, delete the BOINC directory, and re-install fresh. Since it has been two days since my last computer made a successful communication with the SETI server(s), I decided to try it myself. I didn't delete the old directory but did rename it. Sure enough, while there are still communication problems, I have both CPUs on my machine now processing new data.

Now, I've read through this thread and I'm guessing the three WUs in the "Tasks" section were ghost ones, since in the past all completed WUs sat in the "Transfers" section until they could be transferred successfully. Would this be a correct statement?


You are lucky. You have real work. Ghosts are results that show up on the results page of your SETI account but not in the Tasks section.
ID: 570133
Henk Haneveld
Volunteer tester
Joined: 16 May 99
Posts: 154
Credit: 1,577,293
RAC: 1
Netherlands
Message 570143 - Posted: 18 May 2007, 9:13:59 UTC

It is possible that another problem will come into play pretty soon.

If ghost units have a short deadline, they will time out after a couple of days and be resent to other hosts, possibly again as ghosts.

If this happens too often, the WU will get the "too many results" flag without ever being crunched.

Maybe the Seti staff needs to think about shutting down downloads until the problem is solved.
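To put rough numbers on the timeout-and-resend worry above, here is a toy model. It is a deliberate simplification and not BOINC's actual server logic: every copy is assumed to become a ghost, copies go out in fixed-size rounds, and a new round is issued only when the previous deadline passes. The function names and all example values are hypothetical.

```python
def rounds_until_flagged(copies_per_round, max_total_results):
    """Count issue rounds before the per-WU result limit is used up.

    Assumes every issued copy becomes a ghost and times out.
    """
    if copies_per_round <= 0:
        raise ValueError("copies_per_round must be positive")
    issued = 0
    rounds = 0
    while issued < max_total_results:
        # Never issue more copies than the limit still allows.
        issued += min(copies_per_round, max_total_results - issued)
        rounds += 1
    return rounds


def days_until_flagged(deadline_days, copies_per_round, max_total_results):
    """Days until the last allowed copy has timed out, in this toy model."""
    return rounds_until_flagged(copies_per_round, max_total_results) * deadline_days


# Hypothetical values: 3-day deadline, 2 copies per round, limit of 6 results.
print(days_until_flagged(3, 2, 6))
```

The shorter the deadline, the faster a WU burns through its allowance, which is why short-deadline ghosts are the ones to worry about.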
ID: 570143
GreggyBee
Volunteer tester
Joined: 9 Mar 01
Posts: 203
Credit: 1,600,521
RAC: 0
Message 570154 - Posted: 18 May 2007, 10:31:36 UTC
Last modified: 18 May 2007, 10:36:42 UTC

Just spotted the thread, and checked my results page: 25 ghost units; unfortunately, I rebooted this morning and lost the message log- so I'll keep tabs on what happens if (when) things have settled down- in the meantime, I've set it to 'no new tasks'.

Besides, Beta managed to dump 12 Astropulse units on me before Bruno fell over: the most-processed has taken 45 3/4 hours to crunch 23.8%!!! So, I've got enough to keep me going for weeks.

PLUS, Proteins@ could do with a few more active crunchers, and they're quick WU's.


Patience, my friends; positive thoughts; and a gentle 'thump' for thumper
8:P

/Edit miscounted the Astropulse numbers
ID: 570154
Kirsten
Volunteer tester
Joined: 7 Jul 00
Posts: 190
Credit: 566,047
RAC: 0
Denmark
Message 570161 - Posted: 18 May 2007, 10:53:26 UTC
Last modified: 18 May 2007, 11:00:34 UTC

Let me ask a very stupid question: does the internal server error that produces ghost units have anything to do with the fact that I am using KWSN's optimized applications?

I have got nothing but ghost units for my two hosts the last couple of days.

I saw that another user uninstalled BOINC, manually deleted his BOINC folder and reinstalled BOINC. All the hosts he did this to are now receiving work. His "untouched" hosts are still getting ghosts and/or no new work.

(This is not a solution for me, as I am running other BOINC projects instead of SETI for the time being. At least I think it is bad BOINC behaviour.)

The above mentioned solution does start from scratch, though. It made me think of my optimized applications and the app_info.xml
Kind regards
Kirsten

ID: 570161
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 570170 - Posted: 18 May 2007, 11:08:11 UTC - in response to Message 570161.  
Last modified: 18 May 2007, 11:16:23 UTC

Let me ask a very stupid question: does the internal server error that produces ghost units have anything to do with the fact that I am using KWSN's optimized applications?

I have got nothing but ghost units for my two hosts the last couple of days.

I saw that another user uninstalled BOINC, manually deleted his BOINC folder and reinstalled BOINC. All the hosts he did this to are now receiving work. His "untouched" hosts are still getting ghosts and/or no new work.

(This is not a solution for me, as I am running other BOINC projects instead of SETI for the time being. At least I think it is bad BOINC behaviour.)

The above mentioned solution does start from scratch, though. It made me think of my optimized applications and the app_info.xml

YES!!! I was just about to post the same thing.

The following has worked for me on three systems - two late version 5.8 BOINC, and a 5.3.12.tx. All were service installs, running appropriate Chicken 2.2B, and had run completely dry.

Recipe:

Rename app_info.xml so it won't be recognised
Restart BOINC (service)
Update SETI - may not get through first time, but keep trying
Restore app_info.xml to original name
Wait until all transfers have finished
Restart BOINC (service)

Outcome - decent sized cache (if I haven't nabbed them all already, LOL), still running optimised, time to open a beer.
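For anyone who would rather script the two rename steps of the recipe, a minimal sketch follows. The `.disabled` suffix and the function names are invented for illustration, and you have to pass in your own SETI project directory. Restarting BOINC and hitting Update are still done by hand between the two calls.

```python
from pathlib import Path

# Hypothetical suffix; any name BOINC will not recognise should do.
HIDDEN_SUFFIX = ".disabled"


def hide_app_info(project_dir):
    """Rename app_info.xml so the client no longer sees it."""
    src = Path(project_dir) / "app_info.xml"
    dst = src.with_name(src.name + HIDDEN_SUFFIX)
    src.rename(dst)
    return dst


def restore_app_info(project_dir):
    """Give app_info.xml its original name back."""
    dst = Path(project_dir) / "app_info.xml"
    src = dst.with_name(dst.name + HIDDEN_SUFFIX)
    src.rename(dst)
    return dst


# Intended order, with the manual steps as comments:
#   hide_app_info(project_dir)
#   ... restart BOINC, update SETI until it gets through ...
#   restore_app_info(project_dir)
#   ... wait for transfers to finish, restart BOINC again ...
```

Renaming rather than deleting keeps the original file intact, so the optimised setup comes back exactly as it was.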
ID: 570170
Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 962
Credit: 537,293
RAC: 9
United Kingdom
Message 570178 - Posted: 18 May 2007, 11:25:44 UTC - in response to Message 570161.  
Last modified: 18 May 2007, 11:29:27 UTC

Let me ask a very stupid question: does the internal server error that produces ghost units have anything to do with the fact that I am using KWSN's optimized applications?

I have got nothing but ghost units for my two hosts the last couple of days.

I saw that another user uninstalled BOINC, manually deleted his BOINC folder and reinstalled BOINC. All the hosts he did this to are now receiving work. His "untouched" hosts are still getting ghosts and/or no new work.

(This is not a solution for me, as I am running other BOINC projects instead of SETI for the time being. At least I think it is bad BOINC behaviour.)

The above mentioned solution does start from scratch, though. It made me think of my optimized applications and the app_info.xml


My first thought to this question was NO, the optimized science app should not affect BOINC when fetching work.

But then I remember reading somewhere that the app_info file does take a small amount of time at the server to parse.

So maybe it is possible! I still have 1 SETI WU that is partly crunched, and 1 Beta that will be finished in about 20 minutes. If I don't get any new WUs when the last SETI WU is finished, I may try removing the optimized app temporarily. Worth a try!

P.S. You can switch back to using the optimized app part-way through a WU, though it does cause the "stderr" file to have confusing information in it. This result is an example where I switched back to the optimized app.
Sir Arthur C Clarke 1917-2008
ID: 570178
GreggyBee
Volunteer tester
Joined: 9 Mar 01
Posts: 203
Credit: 1,600,521
RAC: 0
Message 570189 - Posted: 18 May 2007, 11:47:36 UTC - in response to Message 570170.  
Last modified: 18 May 2007, 11:49:33 UTC

Hey Richard H- Thanx, I loved the recipe:

Rename app_info.xml so it won't be recognised
Restart BOINC (service)
Update SETI - may not get through first time, but keep trying
Restore app_info.xml to original name
Wait until all transfers have finished
Restart BOINC (service)


It's working for me too: this should be posted in the Tech News thread ASAP
ID: 570189


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.