Ghost Units

Message boards : Number crunching : Ghost Units

Luigi Naruszewicz
Joined: 19 Nov 99
Posts: 620
Credit: 23,910,372
RAC: 14
United Kingdom
Message 1014623 - Posted: 11 Jul 2010, 11:19:53 UTC

It would appear I have 19 work units allocated to me on 5th July that I never got, plus 2 more on 22nd June.

Apologies to wingmen.


.


A person who makes no mistakes, creates nothing.
ID: 1014623 · Report as offensive
SciManStev (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Jun 99
Posts: 6652
Credit: 121,090,076
RAC: 0
United States
Message 1014724 - Posted: 11 Jul 2010, 17:56:54 UTC
Last modified: 11 Jul 2010, 17:57:28 UTC

What I would like to know is what causes ghosts in the first place. I'm not sure if I read and forgot, or never found the information. It is a bit difficult to always be on top of all the vast information shared here. I have had plenty of them, as well as other strange things going on, but I have not figured out why the server says you can have a work unit that never registered in BOINC. I would guess that any limits on WUs would count the ghosts as well. If there is a thread describing what is happening, I would be most happy if someone could point me to it.

Thanks,

Steve
Warning, addicted to SETI crunching!
Crunching as a member of GPU Users Group.
GPUUG Website
ID: 1014724 · Report as offensive
sarmitage

Joined: 2 Dec 09
Posts: 56
Credit: 1,123,857
RAC: 0
Canada
Message 1014730 - Posted: 11 Jul 2010, 18:02:52 UTC - in response to Message 1014724.  

I certainly think you are right; I've got quite a few WUs stacked up in my "in progress" that I don't have in BOINC Manager, and when I update SAH, it says "Reporting x completed tasks, not requesting new tasks", even though I only have 18 CPU WUs (for an 8-core CPU), and 0 GPU WUs in my cache (which is set for 5 days at the moment).

Worse still, these are WUs that aren't supposed to expire until mid-to-late August, so I could be over a month without getting new work if these are preventing it =(
ID: 1014730 · Report as offensive
sarmitage

Joined: 2 Dec 09
Posts: 56
Credit: 1,123,857
RAC: 0
Canada
Message 1014731 - Posted: 11 Jul 2010, 18:07:39 UTC - in response to Message 1014730.  

I certainly think you are right; I've got quite a few WUs stacked up in my "in progress" that I don't have in BOINC Manager, and when I update SAH, it says "Reporting x completed tasks, not requesting new tasks", even though I only have 18 CPU WUs (for an 8-core CPU), and 0 GPU WUs in my cache (which is set for 5 days at the moment).

Worse still, these are WUs that aren't supposed to expire until mid-to-late August, so I could be over a month without getting new work if these are preventing it =(


I take it back. As soon as I posted that, I got about 72 new tasks (about half GPU, half CPU), and then started seeing a distinct error message: "This computer has reached a limit on tasks in progress"; so it does not seem to be the same issue.
ID: 1014731 · Report as offensive
SciManStev (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Jun 99
Posts: 6652
Credit: 121,090,076
RAC: 0
United States
Message 1014733 - Posted: 11 Jul 2010, 18:11:11 UTC - in response to Message 1014731.  

It would still be interesting to find out what causes these to happen in the first place. I even detached and reattached the last time to get rid of them. I would only do that if I had a very low or empty cache. Actually, I'm really just curious as to why they exist.

Steve
Warning, addicted to SETI crunching!
Crunching as a member of GPU Users Group.
GPUUG Website
ID: 1014733 · Report as offensive
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1014736 - Posted: 11 Jul 2010, 18:17:15 UTC - in response to Message 1014733.  

The server starts to send a batch of WUs, the bandwidth gets maxed out, and the work doesn't reach its destination. The server thinks it sent them; the destination doesn't even know they were sent. The WUs are lost in limbo.


PROUD MEMBER OF Team Starfire World BOINC
ID: 1014736 · Report as offensive
SciManStev (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Jun 99
Posts: 6652
Credit: 121,090,076
RAC: 0
United States
Message 1014737 - Posted: 11 Jul 2010, 18:20:58 UTC - in response to Message 1014736.  

Isn't there any handshaking that goes on between the server and BOINC? I guess it's time for another donation to SETI to bump up their bandwidth again. :D I'll see what I can do...

Steve
Warning, addicted to SETI crunching!
Crunching as a member of GPU Users Group.
GPUUG Website
ID: 1014737 · Report as offensive
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1014742 - Posted: 11 Jul 2010, 18:37:07 UTC - in response to Message 1014737.  

There is supposed to be a handshake, but it too gets lost, or the server hears a noise it thinks is the handshake... Who knows!! As for SAH's bandwidth, they are getting fiber run up their hill, but they will still be limited to 100MB because the University owns the line and the fiber is for the whole lab.


PROUD MEMBER OF Team Starfire World BOINC
ID: 1014742 · Report as offensive
SciManStev (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Jun 99
Posts: 6652
Credit: 121,090,076
RAC: 0
United States
Message 1014745 - Posted: 11 Jul 2010, 18:42:08 UTC - in response to Message 1014742.  

Within the next couple of weeks, I'll throw another couple hundred dollars at them before one of my cards expires. (I opted out of the huge interest increase they offered. The last time I use it will be for another donation to SETI.) :) At least that way I will have paid for my ghosts.

Steve
Warning, addicted to SETI crunching!
Crunching as a member of GPU Users Group.
GPUUG Website
ID: 1014745 · Report as offensive
Jason Safoutin
Volunteer tester
Joined: 8 Sep 05
Posts: 1386
Credit: 200,389
RAC: 0
United States
Message 1014752 - Posted: 11 Jul 2010, 18:57:35 UTC

Just a thought, but I could be wrong: Check how many days of work you have set to gather, aka retrieve work. I am currently set at 3 days. Also, could this have been something that was caused by the last outage? Maybe they got stuck right as the servers went down last?
"By faith we understand that the universe was formed at God's command, so that what is seen was not made out of what was visible". Hebrews 11.3

ID: 1014752 · Report as offensive
SciManStev (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Jun 99
Posts: 6652
Credit: 121,090,076
RAC: 0
United States
Message 1014755 - Posted: 11 Jul 2010, 19:06:45 UTC - in response to Message 1014752.  

Just a thought, but I could be wrong: Check how many days of work you have set to gather, aka retrieve work. I am currently set at 3 days. Also, could this have been something that was caused by the last outage? Maybe they got stuck right as the servers went down last?


I've noticed the problem since the situation with quotas first came about. I have since had it happen several times with cache sizes running the full range. It may have been happening all along, but I never noticed it until I ran out of work the first time, and the server said I still had a bunch of WUs. I had seen others speak of it, but ignored it until it happened to me. Oh well!

Steve
Warning, addicted to SETI crunching!
Crunching as a member of GPU Users Group.
GPUUG Website
ID: 1014755 · Report as offensive
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1014757 - Posted: 11 Jul 2010, 19:07:49 UTC - in response to Message 1014752.  

Jason, yes that causes it too. But again it has to do with bandwidth. The server starts the process, gets interrupted, thinks it got them out to the client but.... Ain't 'lectronical stuff fun?? :-)


PROUD MEMBER OF Team Starfire World BOINC
ID: 1014757 · Report as offensive
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1014760 - Posted: 11 Jul 2010, 19:12:42 UTC - in response to Message 1014755.  

Most people don't even notice them unless they are looking for them. Going back through your tasks page and suddenly noticing a bunch of unfinished or timed-out work units is how most find them. Now, with the quotas, they show up more easily when people try to figure out why they aren't getting work.


PROUD MEMBER OF Team Starfire World BOINC
ID: 1014760 · Report as offensive
Gatekeeper
Joined: 14 Jul 04
Posts: 887
Credit: 176,479,616
RAC: 0
United States
Message 1014800 - Posted: 11 Jul 2010, 21:52:36 UTC

I've killed about 500 of them on my three rigs. I waited until everything I had on the rigs was finished, uploaded and reported, and then looked at the database's list of what I supposedly still had. The 12 core had 380 ghosts, the 8 core 118. That seems somewhat proportional to how much work each box can do, or at least how much it tries to download. My ghosts seem to have a temporal relationship; I received some units at or around the same time, but groups of 30-50 others became "ghosts". For me, this lends credence to the lost packet theory.
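
The comparison described here amounts to a set difference between what the database lists as in progress and what the client actually holds, followed by grouping the leftovers by send time. A minimal Python sketch, assuming both lists have been copied out by hand; the task names, timestamps, and the one-minute gap are made up for illustration, and `find_ghosts` and `group_by_send_time` are hypothetical helpers, not part of BOINC.

```python
from datetime import datetime, timedelta


def find_ghosts(server_in_progress, local_tasks):
    """Return tasks the server lists as in progress that the client does not hold.

    server_in_progress: dict mapping task name -> time the server marked it sent.
    local_tasks: set of task names actually present on the client.
    """
    return {name: sent for name, sent in server_in_progress.items()
            if name not in local_tasks}


def group_by_send_time(ghosts, gap=timedelta(minutes=1)):
    """Cluster ghosts whose send times fall within `gap` of the previous ghost."""
    groups = []
    for name, sent in sorted(ghosts.items(), key=lambda item: item[1]):
        if groups and sent - groups[-1][-1][1] <= gap:
            groups[-1].append((name, sent))   # same burst as the previous ghost
        else:
            groups.append([(name, sent)])     # start a new burst
    return groups


# Made-up example data: five tasks on the server's list, one on the client.
server_list = {
    "wu_a": datetime(2010, 7, 5, 10, 0, 0),
    "wu_b": datetime(2010, 7, 5, 10, 0, 5),
    "wu_c": datetime(2010, 7, 5, 10, 0, 7),
    "wu_d": datetime(2010, 7, 5, 12, 30, 0),
    "wu_e": datetime(2010, 7, 5, 12, 30, 2),
}
local = {"wu_a"}

ghosts = find_ghosts(server_list, local)
groups = group_by_send_time(ghosts)
for group in groups:
    print(group[0][1], "ghosts in this burst:", len(group))
```

If the ghosts cluster into a few tight bursts like this, that would be consistent with whole replies going missing rather than individual tasks.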
ID: 1014800 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1014804 - Posted: 11 Jul 2010, 22:05:17 UTC - in response to Message 1014733.  

It would still be interesting to find out what causes these to happen in the first place. I even detached and reattached the last time to get rid of them. I would only do that if I had a very low or empty cache. Actually, I 'm really just curious as to why they exist.

Steve

Quick summary of what I think is the most usual cause:

1. Your host sends a request to the scheduler reporting some work and asking for more.
2. Apache on the server opens a connection, receives the request, and triggers a Scheduler process to handle it.
3. That process gets through database lookups of host, user, and team to authenticate the request, and database updates for the reported work.
4. The process starts going through the Feeder slots finding work to send. As each feasible one is found, a database update marks that result "sent" to your host, the slot is marked empty, and the task information is added to the reply being built up.
5. If more work is needed on the host and there are unchecked Feeder slots and no limit or quota has been reached, repeat the previous step.
2a or 3a or 4a. OOPS, the connection has remained open as long as the Apache setting allows, so it sends a SIGTERM to the Scheduler process, and an HTTP 500 error reply to your host.

That timeout mechanism is what's called "dropping a TCP connection", though there are other possible causes. The cause I'm describing is commented in the sched_main.cpp file as probably caused by the database being slow. Note that if the drop happens before step 4 no ghosts will be created.
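
The sequence above can be sketched as a toy simulation. This is only an illustration under stated assumptions, not BOINC's scheduler code: `ToyServer`, `handle_request`, `request_work`, and the `budget` tick counter are hypothetical names, a dictionary stands in for the database, and an exception stands in for the SIGTERM. It shows how results already marked "sent" become ghosts when the reply is never delivered.

```python
from dataclasses import dataclass, field


class SchedulerKilled(Exception):
    """Stands in for the SIGTERM sent when the connection has been open too long."""


@dataclass
class ToyServer:
    # Stand-in for the database: result name -> host id it was marked "sent" to.
    sent_to: dict = field(default_factory=dict)
    # Stand-in for the Feeder slots: results waiting to be handed out.
    feeder_slots: list = field(default_factory=list)

    def handle_request(self, host_id, wanted, budget):
        """Assign up to `wanted` results; each database update costs one tick.

        `budget` is a hypothetical stand-in for the connection timeout. Once it
        is used up, the process is killed and the reply is never sent.
        """
        reply = []
        ticks = 0
        while self.feeder_slots and len(reply) < wanted:
            if ticks >= budget:
                raise SchedulerKilled("HTTP 500: reply never sent")
            result = self.feeder_slots.pop(0)  # slot marked empty
            self.sent_to[result] = host_id     # database now says "sent"
            reply.append(result)               # reply still being built up
            ticks += 1
        return reply


def request_work(server, host_id, wanted, budget):
    """Client side: on an error the host receives no tasks at all."""
    try:
        return server.handle_request(host_id, wanted, budget)
    except SchedulerKilled:
        return []


server = ToyServer(feeder_slots=[f"wu_{i}" for i in range(6)])
host_tasks = set(request_work(server, host_id=1, wanted=5, budget=3))

# Ghosts: marked "sent" to this host in the database, but never received.
ghosts = {r for r, h in server.sent_to.items() if h == 1} - host_tasks
print(sorted(ghosts))
```

With `budget=0` the process is killed before any result is marked sent, which matches the note that a drop before step 4 creates no ghosts.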

Jan. 20 2008 I started a thread on the boinc_dev mailing list, "resend_lost_results improvement?" which over the next few days discussed this kind of lost results as well as those created by a user doing Reset on a project without the resend feature enabled. Dr. Anderson's replies in that thread included what seemed to me practical fix ideas for both of those causes. I don't know whether he forgot in the press of trying to adapt BOINC as the computing world changes, or he has those ideas on a "To Do" list and they've simply never reached the top spot.
                                                                  Joe
ID: 1014804 · Report as offensive
SciManStev (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Jun 99
Posts: 6652
Credit: 121,090,076
RAC: 0
United States
Message 1014807 - Posted: 11 Jul 2010, 22:18:40 UTC - in response to Message 1014804.  

Thank you for the reply. I understand a lot better.

Steve
Warning, addicted to SETI crunching!
Crunching as a member of GPU Users Group.
GPUUG Website
ID: 1014807 · Report as offensive
-= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 1014810 - Posted: 11 Jul 2010, 22:34:30 UTC - in response to Message 1014804.  

It would still be interesting to find out what causes these to happen in the first place. I even detached and reattached the last time to get rid of them. I would only do that if I had a very low or empty cache. Actually, I 'm really just curious as to why they exist.

Steve

Quick summary of what I think is the most usual cause:

1. Your host sends a request to the scheduler reporting some work and asking for more.
2. Apache on the server opens a connection, receives the request, and triggers a Scheduler process to handle it.
3. That process gets through database lookups of host, user, and team to authenticate the request, and database updates for the reported work.
4. The process starts going through the Feeder slots finding work to send, as each one which is feasible is found a database update marking that result "sent" to your host is done, the slot is marked empty, the task information is added to the reply being built up.
5. If more work is needed on the host and there are unchecked Feeder slots and no limit or quota has been reached, repeat the previous step.
2a or 3a or 4a. OOPS, the connection has remained open as long as the Apache setting allows, so it sends a SIGTERM to the Scheduler process, and an HTTP 500 error reply to your host.

That timeout mechanism is what's called "dropping a TCP connection", though there are other possible causes. The cause I'm describing is commented in the sched_main.cpp file as probably caused by the database being slow. Note that if the drop happens before step 4 no ghosts will be created.

Jan. 20 2008 I started a thread on the boinc_dev mailing list, "resend_lost_results improvement?" which over the next few days discussed this kind of lost results as well as those created by a user doing Reset on a project without the resend feature enabled. Dr. Anderson's replies in that thread included what seemed to me practical fix ideas for both of those causes. I don't know whether he forgot in the press of trying to adapt BOINC as the computing world changes, or he has those ideas on a "To Do" list and they've simply never reached the top spot.
                                                                  Joe


Hmm, aren't those "ghost" units the result of one of the servers hanging, which they can do?

If, for instance, the DB server flags them as assigned to you and then crashes in the middle of that procedure, your computer never gets the response telling it to download the results to your "personal queue", because you only get the response that "system is down xxx".

I'm not sure, but I believe there is a lot of handshaking, and there are routines that work when the systems are actually up and running, even with saturated bandwidth. A sporadic server crash would fit the question better in my mind.

2 cents

Kind regards Vyper

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 1014810 · Report as offensive
Luigi Naruszewicz
Joined: 19 Nov 99
Posts: 620
Credit: 23,910,372
RAC: 14
United Kingdom
Message 1014813 - Posted: 11 Jul 2010, 22:37:31 UTC

I have had the odd one in the past and just accepted it, but to get 19 in one go is a bit steep. Fortunately, most of them are shorties and due to time out on 19 July, so my wingmen will not have too long to wait for their credit, though I suspect I will get clobbered for timing out on so many units at once.
.


A person who makes no mistakes, creates nothing.
ID: 1014813 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1014879 - Posted: 12 Jul 2010, 3:47:30 UTC - in response to Message 1014810.  

Hmm aren't those "ghosts" units the result of when one of the servers are hung which they can do.

If for instance the DB server flags them to be assigned to you and then crashes in the middle of that procedure , the computer never got the response to even download the results to your "personal queue" because you only get the response that "system is down xxx" ..

I'm not sure but i believe there are a lot of handshaking and routines that work when the systems are actually up and running even with a saturated bandwidth, but a sporadic server crash would definitely fit the question more handy in my mind..

2 cents

Kind regards Vyper

Certainly it could be anything which keeps the Scheduler process from finishing normally so the reply gets sent. I don't have any practical experience of that kind of server environment, so your judgement may well be better than mine.

But the essential thing which could delay a Scheduler, as I judged likely, is the BOINC database being temporarily tied up, much as it was a couple of days ago before Jeff hid those two news threads. Sluggishness of the database happens frequently enough that it still seems a prime candidate to me.
                                                                 Joe
ID: 1014879 · Report as offensive
Iona
Joined: 12 Jul 07
Posts: 790
Credit: 22,438,118
RAC: 0
United Kingdom
Message 1015119 - Posted: 12 Jul 2010, 20:00:52 UTC - in response to Message 1014810.  

I think you're pretty well right, Vyper; the situation you described was very much what I noticed last Friday (9th July). The 9 'ghost WUs' I had were apparently downloaded to me at 15:21 UTC, only a few minutes, if that, after I managed to report the last of the tasks from the 'outage'. Work was being requested, yet the messages I was getting at about that time said the servers may be temporarily down, and no work was downloaded until some 10 mins or so later. Needless to say, I thought that the work I actually got was the work that had 'originally' been requested. Obviously, I did a detach early this morning, after doing all the work I had... it'll save those 'wingmen' waiting a long time for their credit. I think I read somewhere else in the forums that by doing the detach there would be very little, if any, 'penalty' involved, so it was a 'no-brainer' as well as being courteous. Now I have to be patient while the estimated WU times zero themselves in!



Don't take life too seriously, as you'll never come out of it alive!
ID: 1015119 · Report as offensive
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.