Ghost WU issue (and some talk about deadlines)

Brian Silvers Send message Joined: 11 Jun 99 Posts: 1681 Credit: 492,052 RAC: 0	Message 569694 - Posted: 17 May 2007, 18:20:17 UTC - in response to Message 569681. Last modified: 17 May 2007, 18:34:58 UTC
I just did a little test. Hit the update button on my quad rig a dozen times or so. The first attempt resulted in a http internal server error. Refreshed the results page, and voila! Another WU shown that I did not get. Tried the button a few more times, could not connect to server. Then one more button push, and another http error. Refreshed the results page and there it was, one more WU the server thinks I have that I do not. So maybe Hank is on to something here. I hope this gives Matt and Eric a bit of direction as to where to look to try to fix the problem.
Tested and confirmed. From my log:
5/17/2007 2:16:10 PM|SETI@home|Sending scheduler request to http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi
5/17/2007 2:16:10 PM|SETI@home|Reason: To fetch work
5/17/2007 2:16:10 PM|SETI@home|Requesting 149853 seconds of new work
5/17/2007 2:16:31 PM||Project communication failed: attempting access to reference site
5/17/2007 2:16:33 PM||Access to reference site succeeded - project servers may be temporarily down.
5/17/2007 2:16:36 PM|SETI@home|Scheduler request failed: couldn't connect to server
5/17/2007 2:16:36 PM|SETI@home|Deferring scheduler requests for 1 minutes and 0 seconds
5/17/2007 2:16:41 PM|SETI@home|Sending scheduler request to http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi
5/17/2007 2:16:41 PM|SETI@home|Reason: Requested by user
5/17/2007 2:16:41 PM|SETI@home|Requesting 149889 seconds of new work
5/17/2007 2:17:11 PM|SETI@home|Scheduler request failed: HTTP internal server error
5/17/2007 2:17:11 PM|SETI@home|Deferring scheduler requests for 1 minutes and 0 seconds
5/17/2007 2:18:12 PM|SETI@home|Sending scheduler request to http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi
5/17/2007 2:18:12 PM|SETI@home|Reason: Requested by user
5/17/2007 2:18:12 PM|SETI@home|Requesting 149998 seconds of new work
5/17/2007 2:18:32 PM|SETI@home|Scheduler request failed: HTTP internal server error
5/17/2007 2:18:32 PM|SETI@home|Deferring scheduler requests for 1 minutes and 0 seconds
5/17/2007 2:19:32 PM|SETI@home|Sending scheduler request to http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi
5/17/2007 2:19:32 PM|SETI@home|Reason: Requested by user
5/17/2007 2:19:32 PM|SETI@home|(not requesting new work or reporting completed tasks)
5/17/2007 2:19:47 PM|SETI@home|Scheduler request succeeded
Results page for my Intel host now shows:
534644251 129353870 17 May 2007 18:15:22 UTC 11 Jun 2007 16:48:02 UTC In Progress Unknown New --- --- ---
534643482 129353633 17 May 2007 18:13:59 UTC 11 Jun 2007 16:46:39 UTC In Progress Unknown New --- --- ---
These are not waiting for me to work on. Don't know what the time differential is either...
What I'd be interested in knowing is if you can get any new work without unloading and reloading the manager? IOW, once the condition happens does it continue to happen until the manager is unloaded and reloaded, or is it simply a server-side issue?
ID: 569694 ·
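For anyone who wants to check their own message log for the pattern reported above, here is a minimal Python sketch. It assumes the pipe-separated "timestamp|project|message" layout and the M/D/YYYY 12-hour timestamps seen in the log in this post; the function names are made up for illustration and are not part of BOINC. It only flags requests that ended in "HTTP internal server error", which posters in this thread report coinciding with ghost WUs. It cannot confirm that a ghost was actually created.

```python
from datetime import datetime

# Message text that posters in this thread associate with ghost WUs.
GHOST_MARKER = "HTTP internal server error"

def parse_line(line):
    """Split one 'timestamp|project|message' log line into its three parts."""
    # Some copies of the log show the pipes backslash-escaped; normalise them.
    line = line.replace("\\|", "|")
    stamp, project, message = line.split("|", 2)
    when = datetime.strptime(stamp.strip(), "%m/%d/%Y %I:%M:%S %p")
    return when, project, message.strip()

def suspect_requests(lines):
    """Return the timestamps of requests that failed with the marker error."""
    hits = []
    for line in lines:
        if line.replace("\\|", "|").count("|") < 2:
            continue  # not a log line in the expected layout
        when, _project, message = parse_line(line)
        if GHOST_MARKER in message:
            hits.append(when)
    return hits

log = [
    "5/17/2007 2:16:36 PM|SETI@home|Scheduler request failed: couldn't connect to server",
    "5/17/2007 2:17:11 PM|SETI@home|Scheduler request failed: HTTP internal server error",
    "5/17/2007 2:18:32 PM|SETI@home|Scheduler request failed: HTTP internal server error",
    "5/17/2007 2:19:47 PM|SETI@home|Scheduler request succeeded",
]

for when in suspect_requests(log):
    print(when.isoformat())
```

The timestamps are local client time, so they will not line up exactly with the UTC "sent" times on the results page.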

Rene Volunteer tester Send message Joined: 22 Mar 04 Posts: 53 Credit: 323,591 RAC: 0	Message 569695 - Posted: 17 May 2007, 18:21:56 UTC Last modified: 17 May 2007, 18:43:09 UTC
Here's a small part of the messages from the manager (after re-opening network usage):
17-5-2007 9:20:33|SETI@home|Requesting 86400 seconds of new work
17-5-2007 9:20:58|SETI@home|Scheduler RPC succeeded [server version 509]
17-5-2007 9:20:58|SETI@home|Message from server: Incomplete request received.
17-5-2007 9:20:58|SETI@home|New host venue:
17-5-2007 9:20:58||General prefs: from SETI@home (last modified 2007-03-30 20:34:18)
17-5-2007 9:20:58||Host location: none
17-5-2007 9:20:58||General prefs: using your defaults
17-5-2007 9:20:58|SETI@home|Deferring communication for 11 sec
17-5-2007 9:20:58|SETI@home|Reason: requested by project
17-5-2007 9:20:58|SETI@home|Deferring communication for 1 min 0 sec
17-5-2007 9:20:58|SETI@home|Reason: no work from project
17-5-2007 9:21:58|SETI@home|Fetching scheduler list
17-5-2007 9:22:03|SETI@home|Master file download succeeded
17-5-2007 9:22:08|SETI@home|Sending scheduler request: To fetch work
17-5-2007 9:22:08|SETI@home|Requesting 86400 seconds of new work
17-5-2007 9:22:38|SETI@home|Scheduler RPC succeeded [server version 509]
17-5-2007 9:22:38|SETI@home|Message from server: Incomplete request received.
17-5-2007 9:22:38|SETI@home|Deferring communication for 11 sec
And a bit later on...
17-5-2007 9:37:42|SETI@home|Sending scheduler request: To fetch work
17-5-2007 9:37:42|SETI@home|Requesting 86400 seconds of new work
17-5-2007 9:38:07|SETI@home|Scheduler RPC succeeded [server version 509]
17-5-2007 9:38:07|SETI@home|Message from server: Incomplete request received.
17-5-2007 9:38:07|SETI@home|Deferring communication for 11 sec
17-5-2007 9:38:07|SETI@home|Reason: requested by project
17-5-2007 9:38:07|SETI@home|Deferring communication for 8 min 47 sec
17-5-2007 9:38:07|SETI@home|Reason: no work from project
Note: time notation DD-MM-YYYY... ;-)
ID: 569695 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874	Message 569706 - Posted: 17 May 2007, 18:42:46 UTC - in response to Message 569681.
I just did a little test. Hit the update button on my quad rig a dozen times or so. The first attempt resulted in a http internal server error. Refreshed the results page, and voila! Another WU shown that I did not get. Tried the button a few more times, could not connect to server. Then one more button push, and another http error. Refreshed the results page and there it was, one more WU the server thinks I have that I do not. So maybe Hank is on to something here. I hope this gives Matt and Eric a bit of direction as to where to look to try to fix the problem.
Also confirmed. 'HTTP internal server error' coincides with a ghost WU, 'couldn't connect to server' doesn't. But for host 2901600 I now have:
Last time contacted server 16 May 2007 21:33:44 UTC
534648073 129355132 17 May 2007 18:29:16 UTC 11 Jun 2007 17:02:10 UTC In Progress Unknown New
so it's not just failing to send the WU (or rather, the instruction to tell the client to download the WU) - it's failing to update its own table to acknowledge that the host has contacted the server. The scheduler request was for 38040 seconds of work, which would have been multiple WUs, yet I only got one ghost: if I cut the cache value right down so it only asks for 1 WU, will that help, I wonder?
ID: 569706 ·

Brian Silvers Send message Joined: 11 Jun 99 Posts: 1681 Credit: 492,052 RAC: 0	Message 569713 - Posted: 17 May 2007, 18:50:17 UTC Last modified: 17 May 2007, 18:53:49 UTC Editing subject again. Self-deprecating commentary in the subject title removed... Really promise this is the last time I'll change the title... LOL ID: 569713 ·

Brian Silvers Send message Joined: 11 Jun 99 Posts: 1681 Credit: 492,052 RAC: 0	Message 569725 - Posted: 17 May 2007, 19:10:09 UTC Posted a link to this thread over in the blog section in this post ID: 569725 ·

Rene Volunteer tester Send message Joined: 22 Mar 04 Posts: 53 Credit: 323,591 RAC: 0	Message 569730 - Posted: 17 May 2007, 19:19:17 UTC Also got one now on my Athlon running XP. "Scheduler request failed: HTTP internal server error" Suspended network usage on that one now. ;-) ID: 569730 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874	Message 569734 - Posted: 17 May 2007, 19:24:13 UTC - in response to Message 569706. The scheduler request was for 38040 seconds of work, which would have been multiple WUs, yet I only got one ghost: if I cut the cache value right down so it only asks for 1 WU, will that help, I wonder? Well, after much fiddling with the magic jumping host venue (and several more ghost WUs), I got the scheduler request down to 1 WU (760 seconds) - and almost immediately got a successful scheduler RPC (which reset my host venue again....). Still "no work from project", though. Arrrrrgggghhhhh - while I was typing that, another request for 760 seconds got the internal server error, and another ghost. Supposition disproved: back to the drawing board. ID: 569734 ·
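The numbers in the two posts above can be turned into a rough count of how many WUs each request was asking for. This is only a back-of-the-envelope sketch: it assumes the 760-second estimate applies to every WU on that host and that a request is filled with whole WUs until the requested seconds are covered. The function name is invented for illustration.

```python
import math

def estimated_workunits(requested_seconds, seconds_per_wu):
    """Whole WUs needed to cover a work request, under the assumptions above."""
    return math.ceil(requested_seconds / seconds_per_wu)

# The original 38040-second request versus the cut-down 760-second request.
print(estimated_workunits(38040, 760))  # 51
print(estimated_workunits(760, 760))    # 1
```

On those assumptions the first request would have covered roughly fifty WUs, yet only one ghost appeared, and the single-WU request still produced a ghost. Both results fit the conclusion in the post above that request size is not the trigger.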

Conrad Human Volunteer tester Send message Joined: 17 Nov 00 Posts: 67 Credit: 2,009,224 RAC: 0	Message 569737 - Posted: 17 May 2007, 19:31:23 UTC What would be nice is if I could mark an unfinished unit to be resent to me by the scheduler. Oh please, Uncle Bruno, I need some work. ID: 569737 ·

Brian Silvers Send message Joined: 11 Jun 99 Posts: 1681 Credit: 492,052 RAC: 0	Message 569769 - Posted: 17 May 2007, 20:44:18 UTC Humble suggestion before I leave for school: Since a fair number of us are seeing these ghost units come up, it might make some sense to not attempt to get new work for a while. The more occurrences of this there are, the harder the clutter is to clean up, and they may be what's causing the other messages of "no work from project". We might want to give the team some time to figure out what is going on. They can likely go back and pull session logs based on those of us who have reported specific events. Just a thought, and remember, it is the thought that counts... :) ID: 569769 ·

arkayn Volunteer tester Send message Joined: 14 May 99 Posts: 4438 Credit: 55,006,323 RAC: 0	Message 569789 - Posted: 17 May 2007, 21:14:57 UTC I also had several http errors in a row and now have several ghost units that correspond to when I get those errors. ID: 569789 ·

Gavin Shaw Send message Joined: 8 Aug 00 Posts: 1116 Credit: 1,304,337 RAC: 0	Message 569860 - Posted: 17 May 2007, 22:55:24 UTC - in response to Message 569789. I also had several http errors in a row and now have several ghost units that correspond to when I get those errors. Same here. I asked my system to update, got an HTTP internal server error and now have workunits listed in my result page, but they are not on my computers. To the best of my knowledge I have actually only received a total of 6 or 8 units since the new server went online and those were several days ago. I have had nothing since. It took several days just to report the results from those units and others I had before the server died. I have spent the week working on Rosetta, since I cannot get anything here. The sooner everything is back to normal here, the sooner I can get going again. And those who are getting actual workunits to work on, consider yourselves lucky. There are some of us who have had nothing for days. Never surrender and never give up. In the darkest hour there is always hope. ID: 569860 ·

AstroNerdBoy Send message Joined: 3 Jun 99 Posts: 1 Credit: 19,448,583 RAC: 0	Message 570129 - Posted: 18 May 2007, 8:08:11 UTC - in response to Message 569860. And those who are getting actual workunits to work on, consider yourselves lucky. There are some of us who have had nothing for days. In another forum, there was a suggestion to uninstall BOINC, delete the BOINC directory, and re-install fresh. Since it has been two days since my last computer made a successful communication with the SETI server(s), I decided to try it myself. I didn't delete the old directory but did rename it. Sure enough, while there are communication problems seen, I have both CPU's on my machine now processing new data. Now, I've read through this thread and I'm guessing the three WU in the "Tasks" section were ghost ones since in the past, all completed WU's were sitting in the "Transfers" section until they were successfully able to transfer. Would this be a correct statement? ID: 570129 ·

ecpa Volunteer tester Send message Joined: 3 Apr 99 Posts: 35 Credit: 9,588,416 RAC: 0	Message 570132 - Posted: 18 May 2007, 8:12:30 UTC - in response to Message 569789. I also had several http errors in a row and now have several ghost units that correspond to when I get those errors. Got about 50 or more ghosts on my 7 computers. I hope this will be resolved soon. ID: 570132 ·

Henk Haneveld Volunteer tester Send message Joined: 16 May 99 Posts: 154 Credit: 1,577,293 RAC: 1	Message 570133 - Posted: 18 May 2007, 8:16:21 UTC - in response to Message 570129. And those who are getting actual workunits to work on, consider yourselves lucky. There are some of us who have had nothing for days. In another forum, there was a suggestion to uninstall BOINC, delete the BOINC directory, and re-install fresh. Since it has been two days since my last computer made a successful communication with the SETI server(s), I decided to try it myself. I didn't delete the old directory but did rename it. Sure enough, while there are communication problems seen, I have both CPU's on my machine now processing new data. Now, I've read through this thread and I'm guessing the three WU in the "Tasks" section were ghost ones since in the past, all completed WU's were sitting in the "Transfers" section until they were successfully able to transfer. Would this be a correct statement? You are lucky. You have real work. Ghosts are results that show up on the results page of your SETI account but not in the Tasks section. ID: 570133 ·

Henk Haneveld Volunteer tester Send message Joined: 16 May 99 Posts: 154 Credit: 1,577,293 RAC: 1	Message 570143 - Posted: 18 May 2007, 9:13:59 UTC It is possible that another problem will come into play pretty soon. If ghost units have a short deadline, they will time out after a couple of days and be resent to other hosts, possibly again as ghosts. If this happens too often, the WU will get the "too many results" flag without ever being crunched. Maybe the SETI staff needs to think about shutting down downloads until the problem is solved. ID: 570143 ·
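The scenario in the post above can be walked through with a small Python sketch. Everything in it is a simplifying assumption for illustration: copies of a WU are issued one at a time, each copy is either lost as a ghost or crunched, and the project gives up once a fixed number of copies has been issued. The limit of 5 and the 50% ghost rate are made-up values, not SETI@home's real settings, and real validation also needs more than one returned result.

```python
def fate_of_workunit(copy_outcomes, max_results):
    """Follow successive copies of one WU until one is crunched or the limit is hit.

    copy_outcomes is a sequence of "ghost" or "crunched", one entry per issued copy.
    """
    issued = 0
    for outcome in copy_outcomes:
        if issued == max_results:
            break  # the limit was already reached; no further copies go out
        issued += 1
        if outcome == "crunched":
            return "crunched after %d copies" % issued
    if issued >= max_results:
        return "too many results"
    return "still waiting"

def chance_never_crunched(ghost_probability, max_results):
    """Probability that every copy up to the limit is lost as a ghost."""
    return ghost_probability ** max_results

# Hypothetical limit of 5 copies per WU.
print(fate_of_workunit(["ghost", "ghost", "crunched"], 5))
print(fate_of_workunit(["ghost"] * 5, 5))
# Hypothetical 50% chance that any given copy becomes a ghost.
print(chance_never_crunched(0.5, 5))
```

Under these assumptions each timed-out ghost also costs one full deadline period, so a WU that keeps landing on affected hosts could sit unprocessed for several deadlines before it is flagged.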

GreggyBee Volunteer tester Send message Joined: 9 Mar 01 Posts: 203 Credit: 1,600,521 RAC: 0	Message 570154 - Posted: 18 May 2007, 10:31:36 UTC Last modified: 18 May 2007, 10:36:42 UTC Just spotted the thread, and checked my results page: 25 ghost units; unfortunately, I rebooted this morning and lost the message log- so I'll keep tabs on what happens if (when) things have settled down- in the meantime, I've set it to 'no new tasks'. Besides Beta managed to dump 12 Astropulse units on me before Bruno fell over: the most-processed has taken 45 3/4 hours to crunch 23.8%!!! So, I've got enough to keep me going for weeks. PLUS, Proteins@ could do with a few more active crunchers, and they're quick WU's. Patience, my friends; positive thoughts; and a gentle 'thump' for thumper 8:P /Edit miscounted the Astropulse numbers ID: 570154 ·

Kirsten Volunteer tester Send message Joined: 7 Jul 00 Posts: 190 Credit: 566,047 RAC: 0	Message 570161 - Posted: 18 May 2007, 10:53:26 UTC Last modified: 18 May 2007, 11:00:34 UTC Let me ask a very stupid question: does the internal server error that produces ghost units have anything to do with the fact that I am using KWSN's optimized applications? I have got nothing but ghost units for my two hosts the last couple of days. I saw that another user uninstalled BOINC, manually deleted his BOINC folder and reinstalled BOINC. All the hosts he did this to are now receiving work. His "untouched" hosts are still getting ghosts and/or no new work. (This is not a solution for me, as I am running other BOINC projects instead of SETI for the time being. At least I think it is bad BOINC behaviour.) The above mentioned solution does start from scratch, though. It made me think of my optimized applications and the app_info.xml. Kind regards Kirsten ID: 570161 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874	Message 570170 - Posted: 18 May 2007, 11:08:11 UTC - in response to Message 570161. Last modified: 18 May 2007, 11:16:23 UTC
Let me ask a very stupid question: does the internal server error that produces ghost units have anything to do with the fact that I am using KWSN's optimized applications? I have got nothing but ghost units for my two hosts the last couple of days. I saw that another user uninstalled BOINC, manually deleted his BOINC folder and reinstalled BOINC. All the hosts he did this to are now receiving work. His "untouched" hosts are still getting ghosts and/or no new work. (This is not a solution for me, as I am running other BOINC projects instead of SETI for the time being. At least I think it is bad BOINC behaviour.) The above mentioned solution does start from scratch, though. It made me think of my optimized applications and the app_info.xml.
YES!!! I was just about to post the same thing. The following has worked for me on three systems - two late version 5.8 BOINC, and a 5.3.12.tx. All were service installs, running appropriate Chicken 2.2B, and had run completely dry.
Recipe:
1. Rename app_info.xml so it won't be recognised
2. Restart BOINC (service)
3. Update SETI - may not get through first time, but keep trying
4. Restore app_info.xml to original name
5. Wait until all transfers have finished
6. Restart BOINC (service)
Outcome - decent sized cache (if I haven't nabbed them all already, LOL), still running optimised, time to open a beer.
ID: 570170 ·
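The two file-renaming steps of the recipe above could be scripted along these lines. This is a hedged sketch only: the ".disabled" suffix, the function names and the directory used in the demonstration are made up for illustration, and you would need to point it at your own SETI@home project folder. Restarting the BOINC service, pressing Update and waiting for the transfers to finish are left as manual steps, since how to do them depends on the installation.

```python
import tempfile
from pathlib import Path

APP_INFO = "app_info.xml"

def disable_app_info(project_dir, suffix=".disabled"):
    """Recipe step 1: rename app_info.xml so the client no longer recognises it."""
    src = Path(project_dir) / APP_INFO
    dst = src.with_name(src.name + suffix)
    src.rename(dst)
    return dst

def restore_app_info(project_dir, suffix=".disabled"):
    """Recipe step 4: give app_info.xml its original name back."""
    dst = Path(project_dir) / APP_INFO
    src = dst.with_name(dst.name + suffix)
    src.rename(dst)
    return dst

# Demonstration in a throwaway directory standing in for the project folder.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / APP_INFO).write_text("<app_info></app_info>")
    print("renamed to", disable_app_info(demo_dir).name)
    # ...restart BOINC and update SETI by hand here (steps 2 and 3)...
    print("restored as", restore_app_info(demo_dir).name)
    # ...wait for transfers to finish, then restart BOINC again (steps 5 and 6)...
```

Renaming rather than deleting keeps the optimised application's configuration intact, which is why the recipe can finish "still running optimised".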

Keith T. Volunteer tester Send message Joined: 23 Aug 99 Posts: 962 Credit: 537,293 RAC: 9	Message 570178 - Posted: 18 May 2007, 11:25:44 UTC - in response to Message 570161. Last modified: 18 May 2007, 11:29:27 UTC Let me ask a very stupid question: does the internal server error that produces ghost units have anything to do with the fact that I am using KWSN's optimized applications? I have got nothing but ghost units for my two hosts the last couple of days. I saw that another user uninstalled BOINC, manually deleted his BOINC folder and reinstalled BOINC. All the hosts he did this to are now receiving work. His "untouched" hosts are still getting ghosts and/or no new work. (This is not a solution for me, as I am running other BOINC projects instead of SETI for the time being. At least I think it is bad BOINC behaviour.) The above mentioned solution does start from scratch, though. It made me think of my optimized applications and the app_info.xml. My first thought to this question was NO, the optimized science app should not affect BOINC when fetching work. But then I remember reading somewhere that the app_info file does take a small amount of time at the server to parse. So maybe it could be possible! I still have 1 SETI WU that is partly crunched, and 1 Beta that will be finished in about 20 minutes. If I don't get any new WUs when the last SETI WU is finished, I may try removing the optimized app temporarily, worth a try! P.S. You can switch back to using the optimized app part-way through a WU, though it does cause the "stderr" file to have confusing information in it; this result is an example where I switched back to the optimized app. Sir Arthur C Clarke 1917-2008 ID: 570178 ·

GreggyBee Volunteer tester Send message Joined: 9 Mar 01 Posts: 203 Credit: 1,600,521 RAC: 0	Message 570189 - Posted: 18 May 2007, 11:47:36 UTC - in response to Message 570170. Last modified: 18 May 2007, 11:49:33 UTC
Hey Richard H- Thanx, I loved the recipe:
1. Rename app_info.xml so it won't be recognised
2. Restart BOINC (service)
3. Update SETI - may not get through first time, but keep trying
4. Restore app_info.xml to original name
5. Wait until all transfers have finished
6. Restart BOINC (service)
It's working for me too: this should be posted in the Tech News thread ASAP
ID: 570189 ·
