Post-Weekend Roundup (Feb 05 2007)

Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 513868 - Posted: 6 Feb 2007, 3:08:16 UTC Last modified: 6 Feb 2007, 3:08:52 UTC Yes, we are still tweaking our network, and therefore the IP addresses of any of our servers (the scheduling server, the upload server, the download server, and the two web servers) may be a 128.32.18.x or a 66.28.250.x or even a 208.68.240.x address at any given time and may change without notice. In theory this should be okay, but apparently this has been messing some clients up, probably because of DNS/proxy caching of some kind beyond users' control. This is an unusual period and hopefully soon (within a week) things will change and be more or less in a "permanent" state. Kryten has been getting a lot of heat for this, but outside of some inexplicable load issues on Sunday it was well behaved over the weekend. No lost mounts, and nothing noteworthy in /var/adm/messages. I was busy today doing the usual monday whack-a-mole. Usual ad-hoc discussions and the weekly general meeting. Had to reboot one non-public administrative server (/tmp was full of old log files), had to debug some CVS issues (some BOINC developers couldn't check in their code), deal with some donation-related stuff, work on some database diagnostics (collecting more info to determine what's behind our weird "slow query" periods), and wrote/deployed a script to clean a surprising number of zombie results off the upload server (i.e. results on disk that aren't in the database - why is this happening?! - maybe cleaning these up and therefore reducing directory sizes will grease the wheels on kryten). - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 513868 ·
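
For anyone wanting to check whether an address their client resolved is one of the three ranges Matt lists, here is a minimal Python sketch. It assumes the "x" in each range stands for the whole last octet (a /24 network); the function name `in_known_range` is made up for illustration and is not part of BOINC.

```python
import ipaddress

# The three ranges from the post, written as /24 networks.
# Assumption: "x" means any value of the last octet.
SETI_RANGES = [
    ipaddress.ip_network("128.32.18.0/24"),
    ipaddress.ip_network("66.28.250.0/24"),
    ipaddress.ip_network("208.68.240.0/24"),
]


def in_known_range(address: str) -> bool:
    """Return True if the address falls inside one of the listed ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in SETI_RANGES)


if __name__ == "__main__":
    for candidate in ("128.32.18.5", "208.68.240.13", "10.0.0.1"):
        print(candidate, in_known_range(candidate))
```

An address outside all three ranges would suggest a stale or wrong DNS answer somewhere between the client and Berkeley, though it cannot say where the caching happens.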

PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1	Message 513870 - Posted: 6 Feb 2007, 3:17:12 UTC Tell us more about the zombies; it's intriguing. How many? How big? Wasn't this a problem about 18 months ago? Sounds familiar. May this Farce be with You ID: 513870 ·

Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 513874 - Posted: 6 Feb 2007, 3:24:07 UTC Zombies: They've taken over. About 4.5 million of them (compared to the 1.5 million that aren't zombies). Most of them are old, i.e. the respective workunit has come and gone a long time ago. However, we still get about 2000 a day (a completely rough estimate). This is most likely due to results being uploaded long after they are due, so the respective workunit is gone, so nothing gets input into the database and the file is left to rot. Maybe not. Just a thought. I'm getting a lot of data about it so me, Jeff and David can discuss what's going on in BOINC-land. But this is low priority stuff. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 513874 ·
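
The definition of a zombie given here (a result file on disk with no matching database row) amounts to a set difference. The Python sketch below illustrates that idea only; it is not Matt's script. The function name `find_zombies`, the flat directory layout, and the notion of passing in the known result names as a plain collection are all assumptions made for illustration.

```python
from pathlib import Path


def find_zombies(upload_dir, known_results):
    """Return sorted names of files on disk that have no database entry.

    upload_dir    -- directory holding uploaded result files (assumed flat)
    known_results -- names of results the database still knows about
    """
    on_disk = {p.name for p in Path(upload_dir).iterdir() if p.is_file()}
    return sorted(on_disk - set(known_results))


def zombie_fraction(zombies, live):
    """Share of files on disk that are zombies."""
    return zombies / (zombies + live)


if __name__ == "__main__":
    # Figures from the post: about 4.5 million zombies vs 1.5 million live files.
    print(zombie_fraction(4_500_000, 1_500_000))  # 0.75
```

With the numbers quoted above, roughly three out of every four files in the upload directories would be zombies, which fits the hope that removing them shrinks directory sizes noticeably.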

Marta Holt Send message Joined: 28 Mar 02 Posts: 11 Credit: 576,381 RAC: 0	Message 513914 - Posted: 6 Feb 2007, 6:00:12 UTC - in response to Message 513874. Zombies: They've taken over. About 4.5 million of them (compared to the 1.5 million that aren't zombies). Most of them are old, i.e. the respective workunit has come and gone a long time ago. However, we still get about 2000 a day (a completely rough estimate). This is most likely due to results being uploaded long after they are due, so the respective workunit is gone, so nothing gets input into the database and the file is left to rot. Maybe not. Just a thought. I'm getting a lot of data about it so me, Jeff and David can discuss what's going on in BOINC-land. But this is low priority stuff. - Matt Is this the reason that I cannot get any new work and the work I have got has been going for over 100 hours..... thanks for your help marta ID: 513914 ·

PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1	Message 514117 - Posted: 6 Feb 2007, 15:41:45 UTC Most people are getting work, Marta, so your issue is probably elsewhere. I can't imagine 4.5 million files, but I would think the hole in BOINC creating them should get more than low priority. Does Einstein have the same problem? If not, why not? What's unique about SETI? Thanks for the info, Matt! May this Farce be with You ID: 514117 ·

alexnuke Send message Joined: 24 Nov 02 Posts: 3 Credit: 160,483 RAC: 0	Message 514145 - Posted: 6 Feb 2007, 16:55:47 UTC - in response to Message 513874. Zombies: They've taken over. About 4.5 million of them (compared to the 1.5 million that aren't zombies). Most of them are old, i.e. the respective workunit has come and gone a long time ago. However, we still get about 2000 a day (a completely rough estimate). This is most likely due to results being uploaded long after they are due, so the respective workunit is gone, so nothing gets input into the database and the file is left to rot. Maybe not. Just a thought. I'm getting a lot of data about it so me, Jeff and David can discuss what's going on in BOINC-land. But this is low priority stuff. - Matt I think I know where these can come from. I had a problem after installing BOINC 5.4.11: something went wrong with the benchmark, and before I realized anything I had 80-100 (couldn't count) WUs downloaded (because BOINC thought I needed only 16 mins for each!). This surely produces a lot of zombies. ID: 514145 ·

daniel Volunteer tester Send message Joined: 17 Aug 06 Posts: 183 Credit: 495,473 RAC: 0	Message 514211 - Posted: 6 Feb 2007, 21:07:03 UTC - in response to Message 514145. i can't upload or download work ID: 514211 ·

littlegreenmanfrommars Volunteer tester Send message Joined: 28 Jan 06 Posts: 1410 Credit: 934,158 RAC: 0	Message 514266 - Posted: 7 Feb 2007, 0:04:02 UTC - in response to Message 514145. Zombies: They've taken over. About 4.5 million of them (compared to the 1.5 million that aren't zombies). Most of them are old, i.e. the respective workunit has come and gone a long time ago. However, we still get about 2000 a day (a completely rough estimate). This is most likely due to results being uploaded long after they are due, so the respective workunit is gone, so nothing gets input into the database and the file is left to rot. Maybe not. Just a thought. I'm getting a lot of data about it so me, Jeff and David can discuss what's going on in BOINC-land. But this is low priority stuff. - Matt I think I know where these can come from. I had a problem after installing BOINC 5.4.11: something went wrong with the benchmark, and before I realized anything I had 80-100 (couldn't count) WUs downloaded (because BOINC thought I needed only 16 mins for each!). This surely produces a lot of zombies. I had a problem where WUs were listed as being in my cache but had not actually been downloaded by my machine. There was a runaway effect, as the number of WUs listed under "Results" in my "Your Account" page was increasing hourly. I was receiving nothing. I have seen something similar happen to other machines on several occasions, where their "Results" list is chock-a-block with non-returned results. I feel, after what happened to me, that none of these were ever received by the afflicted machine. Maybe this is the source of the "Zombies"? A glitch at S@h, producing WUs that are never sent and just pile up, drowning poor old Kryten. ID: 514266 ·

littlegreenmanfrommars Volunteer tester Send message Joined: 28 Jan 06 Posts: 1410 Credit: 934,158 RAC: 0	Message 514273 - Posted: 7 Feb 2007, 0:09:07 UTC - in response to Message 514211. i can't upload or download work Hi Daniel The recent re-config of the S@h network has caused some of us grief, as changing IP addresses has caused DNS info to become out of date. (a bit clumsy, but I hope you understand) The way to fix this at your end, is to flush your DNS cache. The easiest way of doing this is to reboot your modem/router. If you have a DNS server on your network, stopping and restarting the service, or rebooting the server should also help. ID: 514273 ·

The Jedi Alliance - Ranger Send message Joined: 27 Dec 00 Posts: 72 Credit: 60,982,863 RAC: 0	Message 514290 - Posted: 7 Feb 2007, 0:30:26 UTC - in response to Message 513868. Yes, we are still tweaking our network, and therefore the IP addresses of any of our servers (the scheduling server, the upload server, the download server, and the two web servers) may be a 128.32.18.x or a 66.28.250.x or even a 208.68.240.x address at any given time and may change without notice. In theory this should be okay, but apparently this has been messing some clients up, probably because of DNS/proxy caching of some kind beyond users' control. This is an unusual period and hopefully soon (within a week) things will change and be more or less in a "permanent" state. I originally thought it may be some sort of DNS caching at my end, but flushing/registering didn't help. What did help, oddly enough, was stopping BOINC and restarting it. This worked on 42 machines on 3 different networks. ID: 514290 ·

Dena Wiltsie Volunteer tester Send message Joined: 19 Apr 01 Posts: 1628 Credit: 24,230,968 RAC: 26	Message 514314 - Posted: 7 Feb 2007, 1:36:13 UTC I have another source of zombies. A number of people set their time between connects to 10 days in order to be sure of having work when SETI is down. The problem is that I have been issued jobs with deadlines as short as 4 days, which would expire if they had to wait up to 10 days for processing. If this parameter is changed, it should not be set to more than a day or two. If you always want work, join another project and that will always ensure work is available to your system. Zombies can also be created by someone not flushing their system before leaving on vacation and then shutting down the system for a week or two with work still in the system. ID: 514314 ·

zoom3+1=4 Volunteer tester Send message Joined: 30 Nov 03 Posts: 65735 Credit: 55,293,173 RAC: 49	Message 514350 - Posted: 7 Feb 2007, 2:18:48 UTC - in response to Message 514290. Yes, we are still tweaking our network, and therefore the IP addresses of any of our servers (the scheduling server, the upload server, the download server, and the two web servers) may be a 128.32.18.x or a 66.28.250.x or even a 208.68.240.x address at any given time and may change without notice. In theory this should be okay, but apparently this has been messing some clients up, probably because of DNS/proxy caching of some kind beyond users' control. This is an unusual period and hopefully soon (within a week) things will change and be more or less in a "permanent" state. I originally thought it may be some sort of DNS caching at my end, but flushing/registering didn't help. What did help, oddly enough, was stopping BOINC and restarting it. This worked on 42 machines on 3 different networks. Yeah, I tried that today too, No effect of course, My DHCP server is My router, I unplugged It for 10 seconds, I also flushed the dns and registered It too(flushed, yep down the electronic drain);), At least I don't have to turn off My cable modem, Like that would do anything. The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's ID: 514350 ·

John McLeod VII Volunteer developer Volunteer tester Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0	Message 514417 - Posted: 7 Feb 2007, 3:44:21 UTC - in response to Message 514350. Yes, we are still tweaking our network, and therefore the IP addresses of any of our servers (the scheduling server, the upload server, the download server, and the two web servers) may be a 128.32.18.x or a 66.28.250.x or even a 208.68.240.x address at any given time and may change without notice. In theory this should be okay, but apparently this has been messing some clients up, probably because of DNS/proxy caching of some kind beyond users' control. This is an unusual period and hopefully soon (within a week) things will change and be more or less in a "permanent" state. I originally thought it may be some sort of DNS caching at my end, but flushing/registering didn't help. What did help, oddly enough, was stopping BOINC and restarting it. This worked on 42 machines on 3 different networks. Yeah, I tried that today too, No effect of course, My DHCP server is My router, I unplugged It for 10 seconds, I also flushed the dns and registered It too(flushed, yep down the electronic drain);), At least I don't have to turn off My cable modem, Like that would do anything. One of the libraries that BOINC uses caches the IP addresses... BOINC WIKI ID: 514417 ·

John McLeod VII Volunteer developer Volunteer tester Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0	Message 514418 - Posted: 7 Feb 2007, 3:45:33 UTC - in response to Message 514314. I have another source of zombies. A number of people set their time between connects to 10 days in order to be sure of having work when SETI is down. The problem is that I have been issued jobs with deadlines as short as 4 days, which would expire if they had to wait up to 10 days for processing. If this parameter is changed, it should not be set to more than a day or two. If you always want work, join another project and that will always ensure work is available to your system. Zombies can also be created by someone not flushing their system before leaving on vacation and then shutting down the system for a week or two with work still in the system. This is an unlikely source. If they get a task with a deadline of 4 days, that task will be done first. Of course, if they get 10 days of tasks all of which have the same 4 day deadline, that would indeed be trouble. BOINC WIKI ID: 514418 ·
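
The two cases in this reply can be checked with a little arithmetic. The Python sketch below runs tasks in earliest-deadline-first order and counts how many finish late. It is an illustration only, not the BOINC client's scheduler; the function name `missed_deadlines`, the 24-hour task length, the 14-day deadline in the first case, and the assumption that the machine crunches around the clock are all invented for the example.

```python
def missed_deadlines(tasks):
    """Count tasks that finish after their deadline.

    tasks -- list of (run_hours, deadline_hours) pairs, with deadlines
             measured from now. Tasks run back to back, earliest
             deadline first, on a machine assumed to run 24 hours a day.
    """
    clock = 0.0
    missed = 0
    for run_hours, deadline_hours in sorted(tasks, key=lambda t: t[1]):
        clock += run_hours
        if clock > deadline_hours:
            missed += 1
    return missed


if __name__ == "__main__":
    day = 24
    # One task with a 4-day deadline in a 10-day cache of 14-day tasks.
    mixed = [(day, 4 * day)] + [(day, 14 * day)] * 9
    # Ten days of work where every task has the same 4-day deadline.
    all_short = [(day, 4 * day)] * 10
    print(missed_deadlines(mixed))      # 0
    print(missed_deadlines(all_short))  # 6
```

Under these assumptions the single short-deadline task is harmless because it runs first, while a full 10-day cache of 4-day tasks leaves six of the ten finishing late.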

TimeLord04 Volunteer tester Send message Joined: 9 Mar 06 Posts: 21140 Credit: 33,933,039 RAC: 23	Message 514575 - Posted: 7 Feb 2007, 14:27:11 UTC - in response to Message 514350. Yes, we are still tweaking our network, and therefore the IP addresses of any of our servers (the scheduling server, the upload server, the download server, and the two web servers) may be a 128.32.18.x or a 66.28.250.x or even a 208.68.240.x address at any given time and may change without notice. In theory this should be okay, but apparently this has been messing some clients up, probably because of DNS/proxy caching of some kind beyond users' control. This is an unusual period and hopefully soon (within a week) things will change and be more or less in a "permanent" state. I originally thought it may be some sort of DNS caching at my end, but flushing/registering didn't help. What did help, oddly enough, was stopping BOINC and restarting it. This worked on 42 machines on 3 different networks. Yeah, I tried that today too, No effect of course, My DHCP server is My router, I unplugged It for 10 seconds, I also flushed the dns and registered It too(flushed, yep down the electronic drain);), At least I don't have to turn off My cable modem, Like that would do anything. Actually; under Time Warner; with the changes going on at their end of things, (especially in SoCAL), they have had me reset my cable modem many times since the first week of August, 2006. (When they took over Adelphia.) Their latest "tips" in resetting the cable modem state, "Unplug the cable modem, unplug your router, turn off your computer. After two minutes plug in the cable modem, wait two more minutes, plug in the router, wait one minute, turn back on the computer." Following these steps has resolved various afflictions when Time Warner makes modifications that they don't bother to warn nor contact me about... 
(Especially with the fact that now they are admitting to their own DNS issues at Time Warner specifically for the SoCAL area.) Combine this with the DNS Flush trick that Little Green Man has posted here in the Forums, and I now have little trouble maintaining contact between BOINC and Berkeley. Well, other than the fact that resetting all of these things at various times can be seen as hassle; still, it is working. So, I hope that this additional information helps; as not all Time Warner CSR Personnel are even aware of the Time Warner DNS Issues... I had to get to their Tier 3 Tech Support; even then, the Rep had to check with his Supervisor, that's when I received confirmation of all of this - that was almost a week and a half ago... TimeLord04 Have TARDIS, will travel... Come along K-9! Join Calm Chaos ID: 514575 ·

Benher Volunteer developer Volunteer tester Send message Joined: 25 Jul 99 Posts: 517 Credit: 465,152 RAC: 0	Message 514604 - Posted: 7 Feb 2007, 16:16:04 UTC Zombies should not be created under the circumstances mentioned so far in this thread. SETI issues 4 copies of results for a WU to 4 hosts. Each result has a deadline (but not written in stone). If that deadline is reached and fewer than 3 cross-comparable results have been returned, the SETI servers then send more copies to some other hosts until the required 3 have been returned. There are NOT 4 (or more) actual files sitting on SETI's servers (with blank spaces for host computations). There are 4 result "slots" in the big, single-file database awaiting returned result information. So if some host doesn't return a result, or cancels it or whatever, this doesn't create an orphan file. Other things might, just not this situation. ID: 514604 ·
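
The slot bookkeeping described in this post can be pictured with a small Python sketch. It is a toy model, not BOINC server code: the class `Workunit`, its methods, and the exact reissue rule (top up to the quorum once deadlines pass) are assumptions made for illustration. Only the numbers 4 and 3 come from the post.

```python
from dataclasses import dataclass, field

INITIAL_COPIES = 4  # copies sent out at first, per the post
QUORUM = 3          # comparable results required, per the post


@dataclass
class Workunit:
    name: str
    # Each slot is a database row, not a file on disk.
    slots: list = field(default_factory=list)

    def issue(self, host, deadline):
        """Create a result slot for a host."""
        self.slots.append({"host": host, "deadline": deadline, "returned": False})

    def mark_returned(self, host):
        for slot in self.slots:
            if slot["host"] == host:
                slot["returned"] = True

    def returned_count(self):
        return sum(1 for s in self.slots if s["returned"])

    def copies_to_reissue(self, now):
        """Extra copies needed at time `now` to still reach the quorum."""
        pending = sum(
            1 for s in self.slots if not s["returned"] and s["deadline"] > now
        )
        return max(0, QUORUM - self.returned_count() - pending)


if __name__ == "__main__":
    wu = Workunit("example_wu")
    for host in range(INITIAL_COPIES):
        wu.issue(host, deadline=100)
    wu.mark_returned(0)
    wu.mark_returned(1)
    print(wu.copies_to_reissue(now=50))   # 0, two copies are still in time
    print(wu.copies_to_reissue(now=150))  # 1, deadlines passed with 2 results
```

Nothing in this model touches the file system, which is the point being made: a host that never reports leaves an unfilled row behind, not an orphan file.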

Dena Wiltsie Volunteer tester Send message Joined: 19 Apr 01 Posts: 1628 Credit: 24,230,968 RAC: 26	Message 514623 - Posted: 7 Feb 2007, 17:21:47 UTC - in response to Message 514604. Zombies should not be created under the circumstances mentioned so far in this thread. SETI issues 4 copies of results for a WU to 4 hosts. Each result has a deadline (but not written in stone). If that deadline is reached and fewer than 3 cross-comparable results have been returned, the SETI servers then send more copies to some other hosts until the required 3 have been returned. There are NOT 4 (or more) actual files sitting on SETI's servers (with blank spaces for host computations). There are 4 result "slots" in the big, single-file database awaiting returned result information. So if some host doesn't return a result, or cancels it or whatever, this doesn't create an orphan file. Other things might, just not this situation. What happens if three good results are received, the deadline comes and goes, and then a late work unit is reported? ID: 514623 ·

John McLeod VII Volunteer developer Volunteer tester Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0	Message 514692 - Posted: 7 Feb 2007, 20:39:02 UTC - in response to Message 514623. Zombies should not be created under the circumstances mentioned so far in this thread. SETI issues 4 copies of results for a WU to 4 hosts. Each result has a deadline (but not written in stone). If that deadline is reached and fewer than 3 cross-comparable results have been returned, the SETI servers then send more copies to some other hosts until the required 3 have been returned. There are NOT 4 (or more) actual files sitting on SETI's servers (with blank spaces for host computations). There are 4 result "slots" in the big, single-file database awaiting returned result information. So if some host doesn't return a result, or cancels it or whatever, this doesn't create an orphan file. Other things might, just not this situation. What happens if three good results are received, the deadline comes and goes, and then a late work unit is reported? Actually uploaded and reported. There is the hole. If the upload happens after the DB entry has been removed, the hook to delete the result is gone. BOINC WIKI ID: 514692 ·

Dena Wiltsie Volunteer tester Send message Joined: 19 Apr 01 Posts: 1628 Credit: 24,230,968 RAC: 26	Message 514706 - Posted: 7 Feb 2007, 21:56:36 UTC - in response to Message 514692. Zombies should not be created under the circumstances mentioned so far in this thread. SETI issues 4 copies of results for a WU to 4 hosts. Each result has a deadline (but not written in stone). If that deadline is reached and fewer than 3 cross-comparable results have been returned, the SETI servers then send more copies to some other hosts until the required 3 have been returned. There are NOT 4 (or more) actual files sitting on SETI's servers (with blank spaces for host computations). There are 4 result "slots" in the big, single-file database awaiting returned result information. So if some host doesn't return a result, or cancels it or whatever, this doesn't create an orphan file. Other things might, just not this situation. What happens if three good results are received, the deadline comes and goes, and then a late work unit is reported? Actually uploaded and reported. There is the hole. If the upload happens after the DB entry has been removed, the hook to delete the result is gone. I rest my case. ID: 514706 ·

Dena Wiltsie Volunteer tester Send message Joined: 19 Apr 01 Posts: 1628 Credit: 24,230,968 RAC: 26	Message 545617 - Posted: 13 Apr 2007, 19:48:16 UTC Additional information on zombie creation. The records do not seem to be processed in report-date order; they seem to be processed in download order instead. I watched this happen several times and have recorded one event as an example. The jobs waiting to run were, in order, dated May 10, April 21 and May 10, with run times of 7, 2 and 7 hours. When the next job was selected, the first in the list (May 10) was processed rather than the one due April 21. The system is an Apple OS X PPC machine and has been running for several days now. This is not a problem for me because I limit data to only one day's worth of processing, but people loading a week or more of data could have problems with this and create zombies by having jobs time out before being processed. ID: 545617 ·
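
The difference between the two orderings in this observation can be shown with the three jobs from the post. The Python sketch below is only a comparison of sort orders, not the client's actual selection logic; the job names A, B, C and the year 2007 (taken from the posting date) are assumptions.

```python
from datetime import date

# Jobs in the order they were downloaded, with the report dates and
# run times given in the post. Names and the year are assumed.
jobs = [
    {"name": "A", "due": date(2007, 5, 10), "hours": 7},
    {"name": "B", "due": date(2007, 4, 21), "hours": 2},
    {"name": "C", "due": date(2007, 5, 10), "hours": 7},
]


def download_order(queue):
    """Order observed in the post: first downloaded, first run."""
    return [job["name"] for job in queue]


def deadline_order(queue):
    """Order a report-date-first policy would give (stable for ties)."""
    return [job["name"] for job in sorted(queue, key=lambda job: job["due"])]


if __name__ == "__main__":
    print(download_order(jobs))  # ['A', 'B', 'C']
    print(deadline_order(jobs))  # ['B', 'A', 'C']
```

With a one-day cache the two orders finish everything in time either way; the gap only matters once the queue holds more work than fits before the earliest report date.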
