Post-Weekend Roundup (Feb 05 2007)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 513868 - Posted: 6 Feb 2007, 3:08:16 UTC
Last modified: 6 Feb 2007, 3:08:52 UTC

Yes, we are still tweaking our network, and therefore the IP addresses of any of our servers (the scheduling server, the upload server, the download server, and the two web servers) may be a 128.32.18.x or a 66.28.250.x or even a 208.68.240.x address at any given time and may change without notice. In theory this should be okay, but apparently this has been messing some clients up, probably because of DNS/proxy caching of some kind beyond users' control. This is an unusual period and hopefully soon (within a week) things will change and be more or less in a "permanent" state.
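For anyone checking which of these ranges an address falls in, here is a minimal sketch using Python's standard `ipaddress` module. It assumes the three prefixes above are /24 networks (the post only gives them as "x" wildcards), and the function name and the sample addresses are made up for illustration.

```python
import ipaddress

# The three address blocks mentioned above, written as /24 networks.
# Assumption: the post only says "128.32.18.x" and so on.
KNOWN_NETWORKS = [
    ipaddress.ip_network("128.32.18.0/24"),
    ipaddress.ip_network("66.28.250.0/24"),
    ipaddress.ip_network("208.68.240.0/24"),
]


def matching_network(address):
    """Return the known network that contains `address`, or None."""
    ip = ipaddress.ip_address(address)
    for network in KNOWN_NETWORKS:
        if ip in network:
            return network
    return None


if __name__ == "__main__":
    # Sample addresses are placeholders, not actual server addresses.
    for addr in ["128.32.18.150", "208.68.240.13", "192.0.2.1"]:
        print(addr, "->", matching_network(addr))
```

An address that matches none of the three blocks returns `None`, which may point at a stale or unrelated DNS answer.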

Kryten has been getting a lot of heat for this, but outside of some inexplicable load issues on Sunday it was well behaved over the weekend. No lost mounts, and nothing noteworthy in /var/adm/messages.

I was busy today doing the usual Monday whack-a-mole, on top of the usual ad-hoc discussions and the weekly general meeting. I had to reboot one non-public administrative server (/tmp was full of old log files), debug some CVS issues (some BOINC developers couldn't check in their code), deal with some donation-related stuff, and work on some database diagnostics (collecting more info to determine what's behind our weird "slow query" periods). I also wrote and deployed a script to clean a surprising number of zombie results off the upload server, i.e. results on disk that aren't in the database. Why is this happening?! Maybe cleaning these up, and thereby reducing directory sizes, will grease the wheels on kryten.
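The idea of a zombie cleanup (files on disk with no matching database entry) can be sketched roughly as below. This is a hypothetical illustration, not the script that was deployed: it assumes a result file's name is what ties it to a database row, and that the set of known names has already been fetched from the database by some other means. The function names are invented.

```python
import os
import tempfile


def find_zombies(upload_dir, known_result_names):
    """Return sorted paths of files under upload_dir whose names are not
    in known_result_names (the names that still have a database row)."""
    zombies = []
    for dirpath, _dirnames, filenames in os.walk(upload_dir):
        for name in filenames:
            if name not in known_result_names:
                zombies.append(os.path.join(dirpath, name))
    return sorted(zombies)


def delete_files(paths, dry_run=True):
    """Delete the given files and return how many were removed.

    With dry_run=True (the default) nothing is deleted and 0 is returned.
    """
    removed = 0
    for path in paths:
        if dry_run:
            print("would remove", path)
        else:
            os.remove(path)
            removed += 1
    return removed


if __name__ == "__main__":
    # Small self-contained demo in a throwaway directory.
    with tempfile.TemporaryDirectory() as upload_dir:
        for name in ("result_a", "result_b", "result_c"):
            open(os.path.join(upload_dir, name), "w").close()
        in_database = {"result_a", "result_c"}  # hypothetical query result
        zombies = find_zombies(upload_dir, in_database)
        delete_files(zombies, dry_run=True)
```

Defaulting to a dry run is a deliberate choice here: with millions of files involved, it seems safer to list candidates first and delete only after the list looks sane.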

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 513868
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 513870 - Posted: 6 Feb 2007, 3:17:12 UTC

Tell us more about the zombies; it's intriguing. How many? How big? Wasn't this a problem about 18 months ago? Sounds familiar.
May this Farce be with You
ID: 513870
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 513874 - Posted: 6 Feb 2007, 3:24:07 UTC

Zombies: They've taken over. About 4.5 million of them (compared to the 1.5 million that aren't zombies). Most of them are old, i.e. the respective workunit came and went a long time ago. However, we still get about 2000 a day (a completely rough estimate). This is most likely due to results being uploaded long after they are due: the respective workunit is gone, so nothing gets entered into the database and the file is left to rot. Maybe not. Just a thought. I'm gathering a lot of data about it so Jeff, David and I can discuss what's going on in BOINC-land. But this is low-priority stuff.
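To put those rough figures side by side, here is the arithmetic as a few lines of Python. Nothing here is measured; the inputs are just the estimates quoted above.

```python
zombies = 4_500_000   # files with no database entry (rough figure above)
live = 1_500_000      # files that still have a database entry
new_per_day = 2_000   # "completely rough" daily arrival estimate

zombie_fraction = zombies / (zombies + live)
days_at_current_rate = zombies / new_per_day

print(f"zombie share of files on disk: {zombie_fraction:.0%}")
print(f"days to accumulate 4.5 million at 2000/day: {days_at_current_rate:.0f}")
```

So about three out of four files on disk would be zombies, and 4.5 million is 2250 days' worth at the current estimated rate.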

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 513874
Marta Holt
Joined: 28 Mar 02
Posts: 11
Credit: 576,381
RAC: 0
United States
Message 513914 - Posted: 6 Feb 2007, 6:00:12 UTC - in response to Message 513874.  

Is this the reason that I cannot get any new work, and that the work I have has been running for over 100 hours?

thanks for your help
marta
ID: 513914
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 514117 - Posted: 6 Feb 2007, 15:41:45 UTC

Most people are getting work, Marta, so your issue is probably elsewhere.

I can't imagine 4.5 million files, but I would think the hole in BOINC that creates them should get more than low priority. Does Einstein@Home have the same problem? If not, why not? What's unique about SETI?

Thanks for the info, Matt!
May this Farce be with You
ID: 514117
alexnuke
Joined: 24 Nov 02
Posts: 3
Credit: 160,483
RAC: 0
Germany
Message 514145 - Posted: 6 Feb 2007, 16:55:47 UTC - in response to Message 513874.  

I think I know where these can come from. I had a problem after installing BOINC 5.4.11: something went wrong with the benchmark, and before I realized anything I had 80-100 WUs downloaded (I couldn't count them), because BOINC thought I needed only 16 minutes for each. This surely produces a lot of zombies.
ID: 514145
daniel
Volunteer tester
Joined: 17 Aug 06
Posts: 183
Credit: 495,473
RAC: 0
United States
Message 514211 - Posted: 6 Feb 2007, 21:07:03 UTC - in response to Message 514145.  

I can't upload or download work.
ID: 514211
littlegreenmanfrommars
Volunteer tester
Joined: 28 Jan 06
Posts: 1410
Credit: 934,158
RAC: 0
Australia
Message 514266 - Posted: 7 Feb 2007, 0:04:02 UTC - in response to Message 514145.  

I had a problem where WUs were listed as being in my cache, but which had not actually been downloaded by my machine.

There was a runaway effect, as the number of WUs listed under "Results" on my "Your Account" page was increasing hourly, yet I was receiving nothing. I have seen something similar happen to other machines on several occasions, where their "Results" list is chock-a-block with non-returned results. I feel, after what happened to me, that none of these were ever received by the afflicted machine.

Maybe this is the source of the "Zombies"? A glitch at S@h, producing WUs that are never sent, and just pile up, drowning poor old Kryten.
ID: 514266
littlegreenmanfrommars
Volunteer tester
Joined: 28 Jan 06
Posts: 1410
Credit: 934,158
RAC: 0
Australia
Message 514273 - Posted: 7 Feb 2007, 0:09:07 UTC - in response to Message 514211.  

Hi Daniel

The recent re-config of the S@h network has caused some of us grief, as changing IP addresses has caused DNS info to become out of date. (a bit clumsy, but I hope you understand)

The way to fix this at your end is to flush your DNS cache. The easiest way of doing this is to reboot your modem/router. If you have a DNS server on your network, stopping and restarting the service, or rebooting the server, should also help.
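To see what a hostname resolves to from a given machine, before and after a flush, a small Python sketch using the standard `socket` module may help. The function name is made up, and `localhost` in the demo is only a stand-in so the example runs without network access; substitute the project host you are having trouble with.

```python
import socket


def resolve_addresses(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses reported for hostname.

    `resolver` defaults to the system resolver; it is a parameter only so
    the function can be exercised with canned data.
    """
    infos = resolver(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); the address
    # string is the first element of sockaddr for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    print(resolve_addresses("localhost"))
```

Note that this asks the operating system's resolver, so it shows what this machine currently believes, including anything cached between it and the authoritative DNS servers.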
ID: 514273
The Jedi Alliance - Ranger
Joined: 27 Dec 00
Posts: 72
Credit: 60,982,863
RAC: 0
United States
Message 514290 - Posted: 7 Feb 2007, 0:30:26 UTC - in response to Message 513868.  

I originally thought it may be some sort of DNS caching at my end, but flushing/registering didn't help. What did help, oddly enough, was stopping BOINC and restarting it. This worked on 42 machines on 3 different networks.

ID: 514290
Dena Wiltsie
Volunteer tester
Joined: 19 Apr 01
Posts: 1628
Credit: 24,230,968
RAC: 26
United States
Message 514314 - Posted: 7 Feb 2007, 1:36:13 UTC

I have another source of zombies. A number of people set their time between connects to 10 days in order to be sure of having work when SETI is down. The problem is that I have been issued jobs that expire in as little as 4 days, and these would expire if they had to wait up to 10 days for processing. If this parameter is changed, it should not be set to more than a day or two. If you always want work, join another project; that will ensure work is always available to your system. Zombies can also be created by someone not flushing their system before leaving on vacation and then shutting down the system for a week or two with work still in the system.
ID: 514314
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65640
Credit: 55,293,173
RAC: 49
United States
Message 514350 - Posted: 7 Feb 2007, 2:18:48 UTC - in response to Message 514290.  

Yeah, I tried that today too. No effect, of course. My DHCP server is my router; I unplugged it for 10 seconds. I also flushed the DNS and registered it too (flushed, yep, down the electronic drain ;)). At least I don't have to turn off my cable modem, like that would do anything.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 514350
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 514417 - Posted: 7 Feb 2007, 3:44:21 UTC - in response to Message 514350.  

One of the libraries that BOINC uses caches the IP addresses...
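Why an in-process cache would survive a DNS flush but not a client restart can be shown with a toy model. This is only an illustration of the general effect: which library is involved and how it actually caches is not stated here, and the class, hostname and addresses below are all hypothetical.

```python
class CachingResolver:
    """Toy resolver that remembers the first answer for each name forever."""

    def __init__(self, lookup):
        self._lookup = lookup  # function: hostname -> address
        self._cache = {}

    def resolve(self, hostname):
        if hostname not in self._cache:
            self._cache[hostname] = self._lookup(hostname)
        return self._cache[hostname]


# Simulated DNS records; hostname and addresses are placeholders.
dns = {"upload.example.org": "128.32.18.150"}

resolver = CachingResolver(lambda name: dns[name])
before = resolver.resolve("upload.example.org")

# The server moves. Flushing the OS or router cache changes nothing for
# this process, because the stale answer lives in its own memory.
dns["upload.example.org"] = "208.68.240.13"
stale = resolver.resolve("upload.example.org")

# Restarting the client amounts to starting again with an empty cache.
restarted = CachingResolver(lambda name: dns[name])
fresh = restarted.resolve("upload.example.org")

print(before, stale, fresh)
```

In this model `stale` is still the old address while `fresh` is the new one, which matches the report that stopping and restarting BOINC helped where flushing did not.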


BOINC WIKI
ID: 514417
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 514418 - Posted: 7 Feb 2007, 3:45:33 UTC - in response to Message 514314.  

This is an unlikely source. If they get a task with a deadline of 4 days, that task will be done first. Of course, if they get 10 days of tasks all of which have the same 4 day deadline - that would indeed be trouble.
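The two cases can be checked with a small earliest-deadline-first sketch. It is a simplification, not the BOINC client's actual scheduler: it assumes one CPU crunching 24 hours a day, fixed run times, and made-up task sizes of 6 hours each.

```python
def missed_deadlines(tasks):
    """tasks: list of (name, runtime_hours, deadline_hours_from_now).

    Runs the tasks back to back in earliest-deadline-first order and
    returns the names of those that finish after their deadline.
    """
    clock = 0.0
    late = []
    for name, runtime, deadline in sorted(tasks, key=lambda t: t[2]):
        clock += runtime
        if clock > deadline:
            late.append(name)
    return late


HOURS_PER_DAY = 24

# One 4-day task among work with a longer (hypothetical 14-day) deadline:
# it simply runs first.
mixed = [("short", 6, 4 * HOURS_PER_DAY)] + [
    (f"long{i}", 6, 14 * HOURS_PER_DAY) for i in range(39)
]

# Ten days of work (40 tasks x 6 h = 240 h) that is all due in 4 days.
all_short = [(f"t{i}", 6, 4 * HOURS_PER_DAY) for i in range(40)]

print(len(missed_deadlines(mixed)))
print(len(missed_deadlines(all_short)))
```

Under these assumptions the mixed queue misses nothing, while the queue where everything is due in 4 days can only finish 16 of its 40 tasks in time.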


BOINC WIKI
ID: 514418
TimeLord04
Volunteer tester
Joined: 9 Mar 06
Posts: 21140
Credit: 33,933,039
RAC: 23
United States
Message 514575 - Posted: 7 Feb 2007, 14:27:11 UTC - in response to Message 514350.  

Actually, under Time Warner, with the changes going on at their end of things (especially in SoCal), they have had me reset my cable modem many times since the first week of August 2006, when they took over Adelphia. Their latest "tips" for resetting the cable modem state: "Unplug the cable modem, unplug your router, turn off your computer. After two minutes plug in the cable modem, wait two more minutes, plug in the router, wait one minute, turn back on the computer." Following these steps has resolved various afflictions when Time Warner makes modifications that they don't bother to warn or contact me about, especially now that they are admitting to their own DNS issues specifically in the SoCal area.

Combine this with the DNS flush trick that Little Green Man has posted here in the forums, and I now have little trouble maintaining contact between BOINC and Berkeley. Resetting all of these things at various times can be seen as a hassle, but it is working. I hope this additional information helps, as not all Time Warner CSR personnel are even aware of the Time Warner DNS issues. I had to get to their Tier 3 tech support, and even then the rep had to check with his supervisor. That's when I received confirmation of all of this, almost a week and a half ago.


TimeLord04
Have TARDIS, will travel...
Come along K-9!
Join Calm Chaos
ID: 514575
Benher
Volunteer developer
Volunteer tester
Joined: 25 Jul 99
Posts: 517
Credit: 465,152
RAC: 0
United States
Message 514604 - Posted: 7 Feb 2007, 16:16:04 UTC

Zombies should not be created under the circumstances mentioned so far in this thread.

SETI issues 4 copies of results for a WU to 4 hosts.
Each result has a deadline (but not written in stone).
If that deadline is reached and fewer than 3 cross-comparable results have been returned, the SETI servers send more copies to other hosts until the required 3 have been returned.

There are NOT 4 (or more) actual files sitting on SETI's servers (with blank spaces for host computations). There are 4 result "slots" in the big, single-file database awaiting returned result information.

So if some host doesn't return a result, or cancels it or whatever, this doesn't create an orphan file. Other things might, just not this situation.
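The slot bookkeeping described above can be sketched as a toy model. The numbers 4 and 3 come from the post; everything else (the class, the state names, treating all copies as sharing one deadline) is an assumption made for illustration and not the real server code.

```python
from dataclasses import dataclass, field

INITIAL_COPIES = 4  # results sent out per workunit (from the post)
QUORUM = 3          # comparable results required (from the post)


@dataclass
class Workunit:
    name: str
    # One entry per result "slot": "sent", "returned" or "timed_out".
    # These are database rows in the description above, not files on disk.
    slots: list = field(default_factory=lambda: ["sent"] * INITIAL_COPIES)

    def returned(self):
        return self.slots.count("returned")


def reissue_after_deadline(wu):
    """Mark unreturned slots as timed out and add new slots to reach quorum.

    Returns the number of new copies sent.
    """
    wu.slots = ["timed_out" if s == "sent" else s for s in wu.slots]
    needed = max(0, QUORUM - wu.returned())
    wu.slots.extend(["sent"] * needed)
    return needed


wu = Workunit("wu_example")  # hypothetical workunit name
wu.slots[0] = "returned"
wu.slots[1] = "returned"
extra = reissue_after_deadline(wu)
print(extra, wu.slots)
```

A host that never reports back only changes a slot's state in this model; no file is created or orphaned by the timeout itself.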

ID: 514604
Dena Wiltsie
Volunteer tester
Joined: 19 Apr 01
Posts: 1628
Credit: 24,230,968
RAC: 26
United States
Message 514623 - Posted: 7 Feb 2007, 17:21:47 UTC - in response to Message 514604.  

What happens if three good results are received, the deadline comes and goes, and then a late work unit is reported?
ID: 514623
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 514692 - Posted: 7 Feb 2007, 20:39:02 UTC - in response to Message 514623.  

Actually uploaded and reported. There is the hole. If the upload happens after the DB entry has been removed, the hook to delete the result is gone.
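The ordering problem can be reproduced in a toy model. This is a sketch under stated assumptions, not the real server: it assumes the upload handler stores a file without consulting the database, and that cleanup of a file is triggered only through its database row. The class and result names are hypothetical.

```python
class ToyServer:
    def __init__(self):
        self.db = set()    # result names that have a database row
        self.disk = set()  # result files in the upload directory

    def issue(self, result):
        self.db.add(result)

    def upload(self, result):
        # Assumption: the file is stored whether or not a row exists.
        self.disk.add(result)

    def purge(self, result):
        # Cleanup is driven by the row: remove the file, then the row.
        self.disk.discard(result)
        self.db.discard(result)

    def zombies(self):
        return self.disk - self.db


server = ToyServer()

# Normal order: upload arrives first, cleanup removes file and row.
server.issue("on_time")
server.upload("on_time")
server.purge("on_time")

# Late order: the row is purged first, then the upload arrives.
server.issue("late")
server.purge("late")
server.upload("late")

print(server.zombies())
```

Only the late result is left on disk, because nothing in the database remains to trigger its deletion.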


BOINC WIKI
ID: 514692
Dena Wiltsie
Volunteer tester
Joined: 19 Apr 01
Posts: 1628
Credit: 24,230,968
RAC: 26
United States
Message 514706 - Posted: 7 Feb 2007, 21:56:36 UTC - in response to Message 514692.  

I rest my case.
ID: 514706
Dena Wiltsie
Volunteer tester
Joined: 19 Apr 01
Posts: 1628
Credit: 24,230,968
RAC: 26
United States
Message 545617 - Posted: 13 Apr 2007, 19:48:16 UTC

Additional information on zombie creation. The records do not seem to be processed in report-date order; they seem to be processed in download order instead. I watched this happen several times and have recorded one event as an example. In order, the jobs to be run were dated May 10, April 21 and May 10, with run times of 7, 2 and 7 hours. When the next job was selected, the first in the list (May 10) was processed. The system is an Apple OS X PPC machine and has been running for several days now. This is not a problem for me because I limit data to only one day's worth of processing, but people loading a week or more of data could have problems with this and create zombies by having jobs time out before being processed.
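The difference between the two orderings for the three jobs in this example can be shown in a few lines of Python. The dates and run times are the ones given above; the year (taken from the posting date) and the job labels are assumptions.

```python
from datetime import date

# (label, report deadline, estimated run time in hours), in download order.
queue = [
    ("job A", date(2007, 5, 10), 7),
    ("job B", date(2007, 4, 21), 2),
    ("job C", date(2007, 5, 10), 7),
]

# Order observed in the post: jobs are taken as they were downloaded.
download_order = [label for label, _deadline, _hours in queue]

# Order that would protect the April 21 job: earliest deadline first.
# sorted() is stable, so jobs with equal deadlines keep download order.
deadline_order = [
    label for label, _deadline, _hours in sorted(queue, key=lambda job: job[1])
]

print("download order:", download_order)
print("deadline order:", deadline_order)
```

With only 16 hours of work queued both orders finish well before April 21, which fits the remark that a one-day cache is safe; the gap only matters when many days of work sit ahead of a short-deadline job.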
ID: 545617


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.