Message boards :
Number crunching :
Ghost Units
Author | Message |
---|---|
Luigi Naruszewicz Send message Joined: 19 Nov 99 Posts: 620 Credit: 23,910,372 RAC: 14 |
It would appear I have 19 work units allocated to me on 5th July that never arrived, plus 2 more on 22nd June. Apologies to my wingmen. A person who makes no mistakes, creates nothing. |
SciManStev Send message Joined: 20 Jun 99 Posts: 6652 Credit: 121,090,076 RAC: 0 |
What I would like to know is what causes ghosts in the first place. I'm not sure if I read and forgot, or never found the information; it is a bit difficult to always be on top of all the vast information shared here. I have had plenty of them, as well as other strange things going on, but I have not figured out why the server says you can have a work unit that never registered in BOINC. I would guess that any limits on WUs would be counting the ghosts as well. If there is a thread describing what is happening, I would be most happy if someone could point me to it. Thanks, Steve Warning, addicted to SETI crunching! Crunching as a member of GPU Users Group. GPUUG Website |
sarmitage Send message Joined: 2 Dec 09 Posts: 56 Credit: 1,123,857 RAC: 0 |
I certainly think you are right; I've got quite a few WUs stacked up in my "in progress" that I don't have in BOINC Manager, and when I update SAH, it says "Reporting x completed tasks, not requesting new tasks", even though I only have 18 CPU WUs (for an 8-core CPU), and 0 GPU WUs in my cache (which is set for 5 days at the moment). Worse still, these are WUs that aren't supposed to expire until mid-to-late August, so I could be over a month without getting new work if these are preventing it =( |
sarmitage Send message Joined: 2 Dec 09 Posts: 56 Credit: 1,123,857 RAC: 0 |
I certainly think you are right; I've got quite a few WUs stacked up in my "in progress" that I don't have in BOINC Manager, and when I update SAH, it says "Reporting x completed tasks, not requesting new tasks", even though I only have 18 CPU WUs (for an 8-core CPU), and 0 GPU WUs in my cache (which is set for 5 days at the moment). I take it back. As soon as I posted that, I got about 72 new tasks (about half GPU, half CPU), and then started seeing a distinct error message: "This computer has reached a limit on tasks in progress"; so it does not seem to be the same issue. |
SciManStev Send message Joined: 20 Jun 99 Posts: 6652 Credit: 121,090,076 RAC: 0 |
It would still be interesting to find out what causes these to happen in the first place. I even detached and reattached the last time to get rid of them, though I would only do that with a very low or empty cache. Actually, I'm really just curious as to why they exist. Steve Warning, addicted to SETI crunching! Crunching as a member of GPU Users Group. GPUUG Website |
perryjay Send message Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0 |
The server starts to send a batch of WUs, the bandwidth gets maxed out, and the work doesn't reach its destination. The server thinks it sent them; the destination doesn't even know they were sent. The WUs are lost in limbo. PROUD MEMBER OF Team Starfire World BOINC |
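A minimal sketch of the mismatch perryjay describes, with made-up names (the real BOINC scheduler is C++ and far more involved): the server records tasks as "sent" in its database before the reply reaches the client, so a dropped reply leaves the two views out of sync.

```python
# Toy model of how "ghosts" arise: the server's view and the client's
# view of assigned tasks diverge when the reply is lost in transit.

def assign_work(server_db, client_tasks, new_tasks, reply_delivered):
    """Server marks tasks as sent; the client only records them if the
    reply actually arrives (reply_delivered=False models a dropped link)."""
    for t in new_tasks:
        server_db[t] = "sent"           # DB update happens first
    if reply_delivered:
        client_tasks.extend(new_tasks)  # client learns about the work
    # Ghosts: tasks the server thinks are in progress but the client lacks.
    return [t for t in server_db
            if server_db[t] == "sent" and t not in client_tasks]

server_db, client = {}, []
ghosts = assign_work(server_db, client, ["wu_1", "wu_2"],
                     reply_delivered=False)
print(ghosts)  # ['wu_1', 'wu_2'] -- marked sent, never received
```

The ordering is the crux: because the "sent" flag is committed before delivery is confirmed, any failure between those two points manufactures a ghost.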
SciManStev Send message Joined: 20 Jun 99 Posts: 6652 Credit: 121,090,076 RAC: 0 |
Isn't there any handshaking that goes on between the server and BOINC? I guess it's time for another donation to SETI to bump up their bandwidth again. :D. I'll see what I can do... Steve Warning, addicted to SETI crunching! Crunching as a member of GPU Users Group. GPUUG Website |
perryjay Send message Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0 |
There is supposed to be a handshake, but it too gets lost, or the server hears a noise it mistakes for it... Who knows!! As for SAH's bandwidth, they are getting fiber run up their hill, but they will still be limited to 100 Mbit because the University owns the line and the fiber is for the whole lab. PROUD MEMBER OF Team Starfire World BOINC |
SciManStev Send message Joined: 20 Jun 99 Posts: 6652 Credit: 121,090,076 RAC: 0 |
Within the next couple of weeks, I'll throw another couple hundred dollars at them before one of my cards expires. (I opted out of the huge interest increase they offered. The last time I use it will be for another donation to SETI.) :) At least that way I will have paid for my ghosts. Steve Warning, addicted to SETI crunching! Crunching as a member of GPU Users Group. GPUUG Website |
Jason Safoutin Send message Joined: 8 Sep 05 Posts: 1386 Credit: 200,389 RAC: 0 |
Just a thought, but I could be wrong: Check how many days of work you have set to gather, aka retrieve work. I am currently set at 3 days. Also, could this have been something that was caused by the last outage? Maybe they got stuck right as the servers went down last? "By faith we understand that the universe was formed at God's command, so that what is seen was not made out of what was visible". Hebrews 11.3 |
SciManStev Send message Joined: 20 Jun 99 Posts: 6652 Credit: 121,090,076 RAC: 0 |
Just a thought, but I could be wrong: Check how many days of work you have set to gather, aka retrieve work. I am currently set at 3 days. Also, could this have been something that was caused by the last outage? Maybe they got stuck right as the servers went down last? I've noticed the problem since the situation with quotas first came about. I have since had it happen several times with cache sizes running the full range. It may have been happening all along, but I never noticed it until I ran out of work the first time and the server said I still had a bunch of WUs. I had seen others speak of it, but ignored it until it happened to me. Oh well! Steve Warning, addicted to SETI crunching! Crunching as a member of GPU Users Group. GPUUG Website |
perryjay Send message Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0 |
Jason, yes that causes it too. But again it has to do with bandwidth. The server starts the process, gets interrupted, thinks it got them out to the client but.... Ain't 'lectronical stuff fun?? :-) PROUD MEMBER OF Team Starfire World BOINC |
perryjay Send message Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0 |
Most people don't even notice them unless they are looking for them. Going back through your tasks page and suddenly noticing a bunch of unfinished or timed-out work units is how most find them. Now, with the quotas, they show up more readily when people try to figure out why they aren't getting work. PROUD MEMBER OF Team Starfire World BOINC |
Gatekeeper Send message Joined: 14 Jul 04 Posts: 887 Credit: 176,479,616 RAC: 0 |
I've killed about 500 of them on my three rigs. I waited until everything I had on the rigs was finished, uploaded, and reported, and then looked at the database's list of what I supposedly still had. The 12-core had 380 ghosts, the 8-core 118. It seems somewhat proportional to how much work each box can do, or at least how much it tries to download. My ghosts seem to have a temporal relationship: I received some units at or around the same time, but groups of 30-50 others became "ghosts". That would, for me, lend credence to the lost-packet theory. |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
It would still be interesting to find out what causes these to happen in the first place. I even detached and reattached the last time to get rid of them. I would only do that if I had a very low or empty cache. Actually, I'm really just curious as to why they exist. Quick summary of what I think is the most usual cause:
1. Your host sends a request to the scheduler, reporting some work and asking for more.
2. Apache on the server opens a connection, receives the request, and triggers a Scheduler process to handle it.
3. That process works through database lookups of host, user, and team to authenticate the request, and database updates for the reported work.
4. The process starts going through the Feeder slots looking for work to send. As each feasible slot is found, a database update marks that result "sent" to your host, the slot is marked empty, and the task information is added to the reply being built up.
5. If more work is needed on the host, there are unchecked Feeder slots, and no limit or quota has been reached, the previous step repeats.
2a, 3a, or 4a. OOPS: the connection has remained open as long as the Apache setting allows, so Apache sends a SIGTERM to the Scheduler process and an HTTP 500 error reply to your host.
That timeout mechanism is what's called "dropping a TCP connection", though there are other possible causes. The cause I'm describing is commented in the sched_main.cpp file as probably due to the database being slow. Note that if the drop happens before step 4, no ghosts will be created.
On Jan. 20, 2008 I started a thread on the boinc_dev mailing list, "resend_lost_results improvement?", which over the next few days discussed this kind of lost result as well as those created by a user doing Reset on a project without the resend feature enabled. Dr. Anderson's replies in that thread included what seemed to me practical fix ideas for both of those causes. I don't know whether he forgot in the press of trying to adapt BOINC as the computing world changes, or whether those ideas are on a "To Do" list and have simply never reached the top spot. Joe |
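Joe's timeline can be sketched as a toy simulation (hypothetical parameters and time units, not real BOINC or Apache settings): each feasible Feeder slot costs one database update that marks the result "sent" before the reply goes out, so a timeout that fires mid-loop ghosts everything marked so far, while a timeout before step 4 ghosts nothing.

```python
# Toy timeline of steps 1-5 above. The per-slot DB update commits
# BEFORE the reply is delivered; if the Apache timeout fires mid-loop
# (the 2a/3a/4a case), the reply is never sent and every result already
# marked "sent" becomes a ghost.

def scheduler_pass(feeder_slots, db_time_per_update, apache_timeout, auth_time):
    """Return (tasks_in_reply, ghosts). Times are illustrative units."""
    clock = auth_time                    # steps 1-3: auth + reported-work updates
    if clock >= apache_timeout:          # dropped before step 4: no ghosts
        return [], []
    marked_sent, reply = [], []
    for slot in feeder_slots:            # step 4: one DB update per slot
        clock += db_time_per_update      # slow database stretches the loop
        marked_sent.append(slot)         # result flagged "sent" in the DB
        if clock >= apache_timeout:      # SIGTERM: reply never delivered
            return [], marked_sent       # everything marked so far is ghosted
        reply.append(slot)
    return reply, []                     # reply delivered, nothing ghosted

# Slow database: the timeout fires partway through filling the reply.
slots = ["wu%d" % i for i in range(6)]
print(scheduler_pass(slots, db_time_per_update=2.0,
                     apache_timeout=10.0, auth_time=3.0))
```

This also shows why a sluggish database is a prime suspect: raising `db_time_per_update` is what pushes the loop past the timeout.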
SciManStev Send message Joined: 20 Jun 99 Posts: 6652 Credit: 121,090,076 RAC: 0 |
Thank you for the reply. I understand a lot better. Steve Warning, addicted to SETI crunching! Crunching as a member of GPU Users Group. GPUUG Website |
-= Vyper =- Send message Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537 |
It would still be interesting to find out what causes these to happen in the first place. I even detached and reattached the last time to get rid of them. I would only do that if I had a very low or empty cache. Actually, I'm really just curious as to why they exist. Hmm, aren't those "ghost" units the result of one of the servers hanging, which they can do? If, for instance, the DB server flags them as assigned to you and then crashes in the middle of that procedure, the computer never gets the response needed to even download the results to your "personal queue", because you only get the response that "system is down xxx"... I'm not sure, but I believe there is a lot of handshaking, and routines that work when the systems are actually up and running even with saturated bandwidth, but a sporadic server crash would fit the question more handily in my mind. 2 cents Kind regards Vyper _________________________________________________________________________ Addicted to SETI crunching! Founder of GPU Users Group |
Luigi Naruszewicz Send message Joined: 19 Nov 99 Posts: 620 Credit: 23,910,372 RAC: 14 |
I have had the odd one in the past and just accepted it, but to get 19 in one go is a bit steep. Fortunately, most of them are shorties and due to time out on 19 July, so my wingmen will not have too long to wait for their credit, though I suspect I will get clobbered for timing out on so many units at once. A person who makes no mistakes, creates nothing. |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Hmm aren't those "ghosts" units the result of when one of the servers are hung which they can do. Certainly it could be anything which keeps the Scheduler process from finishing normally, so the reply never gets sent. I don't have any practical experience of that kind of server environment; your judgement may well be better than mine. But the most likely thing to delay a Scheduler, as I see it, is the BOINC database being temporarily tied up, much as it was a couple of days ago before Jeff hid those two news threads, and sluggishness of the database happens frequently enough that it still seems a prime candidate to me. Joe |
Iona Send message Joined: 12 Jul 07 Posts: 790 Credit: 22,438,118 RAC: 0 |
I think you're pretty well right, Vyper; the situation you described was very much what I noticed last Friday (9th July). The 9 'ghost WUs' I had were apparently downloaded to me at 15:21 UTC, only a few minutes, if that, after I managed to report the last of the tasks from the 'outage'. Work was being requested, yet the messages I was getting at about that time said the servers might be temporarily down, and no work was downloaded until some 10 minutes or so later. Needless to say, I thought that the work I actually got was the work that had 'originally' been requested. Obviously, I did a detach early this morning, after doing all the work I had... it'll save those 'wingmen' waiting a long time for their credit. I think I read somewhere else in the forums that by doing the detach there would be very little, if any, 'penalty' involved, so it was a 'no-brainer' as well as being courteous. Now I have to be patient while the estimated WU times zero themselves in! Don't take life too seriously, as you'll never come out of it alive! |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.