Weird Day (Sep 05 2007)

Author	Message
Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 634393 - Posted: 5 Sep 2007, 23:00:12 UTC A drive on thumper failed this morning. No major tragedy - there were many spare drives and one was pulled into place immediately and the whole device was resynced by mid-afternoon. We'll have to replace that drive at some point I guess. Spent a chunk of time learning about the current state of the Astropulse research. Also started setting up a small NAS recently purchased by Andrew (who is working on Optical SETI among other things) for his own research. More of the day was occupied tracking down some splitter issues which came to light only after I finished my new multibeam status program and ran it a couple times. We found certain sequence numbers in our data headers were, as it turns out, not necessarily in sequence. This doesn't affect the raw data, so the scientific analysis is just fine. However, we have some annoying cleanup ahead of us as well as some band-aid programming. By the way, I'm finding that, given current client work demand, that running three splitters is a good amount, even though we're not creating work fast enough to fill the result-to-send queue. People are mostly getting what they ask for, with an occasional polite "no work right now come back soon" message. If we add just one more splitter, we will start filling the queue, which in turn means all demands for work will be met, which means more traffic at the download server, which means extra load on the workunit file server from both ends (the splitter and the download server) and everything will go to hell. So, oddly enough, as it stands right now making less work means more work can be sent out. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 634393 ·

Dr. C.E.T.I. Send message Joined: 29 Feb 00 Posts: 16019 Credit: 794,685 RAC: 0	Message 634397 - Posted: 5 Sep 2007, 23:07:16 UTC Thank You Matt for the Update . . . and to the Rest of You @ Berkeley working on the Issues - Thank You too!!! ID: 634397 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51469 Credit: 1,018,363,574 RAC: 1,004	Message 634623 - Posted: 6 Sep 2007, 7:03:21 UTC Thank you so much Matt for the continued updates. It makes it much easier to exercise patience during the problematic times on Seti when we at least have your insight as to what is going on in the background. Some folks do not realize how much interaction there is between the various subsystems that make up the Seti project. The time you take to give us these informative posts is very much appreciated! "Freedom is just Chaos, with better lighting." Alan Dean Foster ID: 634623 ·

ML1 Volunteer moderator Volunteer tester Send message Joined: 25 Nov 01 Posts: 20359 Credit: 7,508,002 RAC: 20	Message 634690 - Posted: 6 Sep 2007, 12:42:52 UTC - in response to Message 634393. ... By the way, I'm finding that, given current client work demand, that running three splitters is a good amount, even though we're not creating work fast enough to fill the result-to-send queue. People are mostly getting what they ask for, ... So, oddly enough, as it stands right now making less work [immediately available] means more work can be sent out. Very interesting. That should give a little insight on the performance bottlenecks. I guess that in effect, the fileserver load is getting interleaved between the work generators and the downloads/uploads... Happy crunchin', Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) ID: 634690 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874	Message 634712 - Posted: 6 Sep 2007, 13:34:07 UTC - in response to Message 634690. ... By the way, I'm finding that, given current client work demand, that running three splitters is a good amount, even though we're not creating work fast enough to fill the result-to-send queue. People are mostly getting what they ask for, ... So, oddly enough, as it stands right now making less work [immediately available] means more work can be sent out. Very interesting. That should give a little insight on the performance bottlenecks. I guess that in effect, the fileserver load is getting interleaved between the work generators and the downloads/uploads... It feels as if the whole splitter/file store/download system is balancing on a knife-edge at the moment. Overnight, the splitters were mostly churning out long deadline / long crunchtime WUs. They kept ahead of the demand, and even built up a small surplus. The WU download rate was low enough that most downloads went through at the first attempt, which kept the packet-pester rate low too. But in the last few hours, the tapes being split have started to include some short deadline / short crunch WUs. That change has been enough to make demand outstrip supply: WU requests increase, the 'ready to send' queue has dropped, congestion starts to set in with the repeat requests. Result - we're seeing both "system connect" download errors, and "no work from project", at the same time. I think we're seeing the throughput capacity limit on the workunit file server: and with the current network topology, it feels like a pretty hard limit. ID: 634712 ·

Claggy Volunteer tester Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4	Message 634726 - Posted: 6 Sep 2007, 13:45:49 UTC Another hour or so and the results ready to send queue will get to zero, and the number of downloads trying to happen should start to go down, 6500 ready to send @ 1300 UTC and 4500 @ 1330 UTC. Claggy ID: 634726 ·
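
For readers who want to check the "another hour or so" estimate, here is a minimal sketch of a straight-line extrapolation from the two readings quoted in the post above. It assumes the drain rate stays constant, which the following reply disputes. The function name `time_queue_empties` is made up for illustration and is not part of any BOINC tool.

```python
from datetime import datetime, timedelta

# The two "ready to send" readings quoted in the post (times in UTC).
t1, q1 = datetime(2007, 9, 6, 13, 0), 6500
t2, q2 = datetime(2007, 9, 6, 13, 30), 4500


def time_queue_empties(t1, q1, t2, q2):
    """Linearly extrapolate two queue readings to the moment the queue hits zero.

    Returns None if the queue is not shrinking between the two readings.
    Assumes a constant drain rate, which is a simplification.
    """
    drain_per_sec = (q1 - q2) / (t2 - t1).total_seconds()
    if drain_per_sec <= 0:
        return None
    return t2 + timedelta(seconds=q2 / drain_per_sec)


empty_at = time_queue_empties(t1, q1, t2, q2)
print("Queue would reach zero around", empty_at.strftime("%H:%M UTC"))
```

On these figures the queue loses 2,000 results per half hour, so the remaining 4,500 would last a little over an hour past the 1330 reading.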

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874	Message 634740 - Posted: 6 Sep 2007, 14:08:23 UTC - in response to Message 634726. Another hour or so and the results ready to send queue will get to zero, and the number of downloads trying to happen should start to go down, 6500 ready to send @ 1300 UTC and 4500 @ 1330 UTC. Claggy Unfortunately, it's not as simple as that. The splitters have been making a fairly steady 9 results/sec, or 32,400 results per hour. Another 4,000 per hour more have been coming out of buffer storage recently, but the rate will only drop by ~10% when the buffer store is empty. Also, I'm pretty sure that the server status page should really be read as "Results ready to be allocated" - in other words, the number "Ready to ..." on the status page is decreased by one as soon as your computer makes contact with the server and is assigned a WU, whether the file download has completed or not. I don't think Berkeley would have any easy way of knowing whether a particular file download has been completed or not: the whole point of separating scheduling from file transfer was to avoid a (costly) database access for every download/upload: and if it isn't stored in the database, it probably isn't stored anywhere. So we have no way of knowing how many downloads, globally, are pending at any one time: the cricket graphs only give an idea of how many are succeeding. But I strongly suspect we won't see any reduction in the download 'system connect' errors until well after the "Ready to ..." figure has started to increase above zero again, meaning that all personal caches are full and the WU request rate has dropped. ID: 634740 ·
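
The arithmetic in the post above can be checked with a short sketch. It uses only the two figures quoted there (9 results/sec from the splitters, about 4,000 results/hour coming out of the buffer); the variable names are illustrative and are not taken from any BOINC code.

```python
# Figures quoted in the post above.
SPLITTER_RESULTS_PER_SEC = 9
BUFFER_RESULTS_PER_HOUR = 4_000

# Hourly supply from the splitters alone.
splitter_per_hour = SPLITTER_RESULTS_PER_SEC * 3600

# Hourly supply while the buffer is still being drained.
total_per_hour = splitter_per_hour + BUFFER_RESULTS_PER_HOUR

# Fraction of the outgoing rate that disappears once the buffer is empty.
drop_fraction = BUFFER_RESULTS_PER_HOUR / total_per_hour

print(f"Splitters alone: {splitter_per_hour:,} results/hour")
print(f"With buffer:     {total_per_hour:,} results/hour")
print(f"Drop when buffer empties: {drop_fraction:.1%}")
```

The drop works out to roughly 11%, in line with the "~10%" quoted: an empty buffer barely reduces the number of results being handed out, so it would not by itself relieve the download congestion.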

K2XM Send message Joined: 19 Aug 99 Posts: 2 Credit: 197,477 RAC: 0	Message 634771 - Posted: 6 Sep 2007, 15:24:48 UTC - in response to Message 634740. I just started getting a weird error on the work computer, anybody know what this means?? 9/6/2007 9:47:00 AM||Can't rename state file; The process cannot access the file because it is being used by another process. (0x20) 9/6/2007 9:47:00 AM||[error] Couldn't write state file: system rename Thanks, Peter The Rocky Coast of Maine ID: 634771 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874	Message 634772 - Posted: 6 Sep 2007, 15:31:52 UTC - in response to Message 634771. I just started getting a weird error on the work computer, anybody know what this means?? 9/6/2007 9:47:00 AM||Can't rename state file; The process cannot access the file because it is being used by another process. (0x20) 9/6/2007 9:47:00 AM||[error] Couldn't write state file: system rename Something other than BOINC is looking at the files in the BOINC folder: probably a virus scanner or some such. Whatever it is, it's in your computer, rather than anything to do with the Technical News from Berkeley. If you need further help, it would be better to post in Q&A or Number Crunching. ID: 634772 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 634791 - Posted: 6 Sep 2007, 16:22:34 UTC - in response to Message 634771. I just started getting a weird error on the work computer, anybody know what this means?? 9/6/2007 9:47:00 AM||Can't rename state file; The process cannot access the file because it is being used by another process. (0x20) 9/6/2007 9:47:00 AM||[error] Couldn't write state file: system rename Thanks, Peter The Rocky Coast of Maine I had that error at the start of the year, spontaneously on one of two machines every few hours, when I was running BoincLogX. Removing BoincLogX stopped it and it hasn't happened since. I had searched high and low, tried removing antivirus programs, and done all sorts of investigation before I pinned that down. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 634791 ·

DJStarfox Send message Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2	Message 634793 - Posted: 6 Sep 2007, 16:32:45 UTC - in response to Message 634393. That's a brilliant observation about having 3 splitters. Just as in road construction, it doesn't pay to have a nice, wide road if the traffic has to merge to one lane down the pipe. The widest pipe should be at the end first, then upgrade each component upstream on the chain until you get the performance desired. That way, in this case, the file servers can keep up by throttling the load at the splitters. ID: 634793 ·

K2XM Send message Joined: 19 Aug 99 Posts: 2 Credit: 197,477 RAC: 0	Message 635420 - Posted: 7 Sep 2007, 15:06:27 UTC - in response to Message 634791. I just started getting a weird error on the work computer, anybody know what this means?? 9/6/2007 9:47:00 AM||Can't rename state file; The process cannot access the file because it is being used by another process. (0x20) 9/6/2007 9:47:00 AM||[error] Couldn't write state file: system rename Thanks, Peter The Rocky Coast of Maine I had that error at the start of the year, spontaneously on one of two machines every few hours, when I was running BoincLogX. Removing BoincLogX stopped it and it hasn't happened since. I had searched high and low, tried removing antivirus programs, and done all sorts of investigation before I pinned that down. I gave the computer a good swift kick in the CPU (aka shutting down and restarting) and I haven't had the problem since... only time will tell if this cured it. ID: 635420 ·

William Roeder Volunteer tester Send message Joined: 19 May 99 Posts: 69 Credit: 523,414 RAC: 0	Message 636410 - Posted: 8 Sep 2007, 14:19:51 UTC - in response to Message 634771. I just started getting a weird error on the work computer, anybody know what this means?? 9/6/2007 9:47:00 AM||Can't rename state file; The process cannot access the file because it is being used by another process. (0x20) 9/6/2007 9:47:00 AM||[error] Couldn't write state file: system rename On Windows, get a copy of Unlocker. ID: 636410 ·

John McLeod VII Volunteer developer Volunteer tester Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0	Message 636496 - Posted: 8 Sep 2007, 15:09:50 UTC - in response to Message 634771. I just started getting a weird error on the work computer, anybody know what this means?? 9/6/2007 9:47:00 AM||Can't rename state file; The process cannot access the file because it is being used by another process. (0x20) 9/6/2007 9:47:00 AM||[error] Couldn't write state file: system rename Thanks, Peter The Rocky Coast of Maine Typically it means that the user account that BOINC is running under does not have sufficient rights to write to the BOINC directory tree. The default location is in the Program Files directory. All users in Vista and restricted users in NT/2K/XP cannot write into the Program Files directory. In NT/2K/XP this can be fixed by changing the rights required to write into the BOINC directory. In Vista this is best fixed by installing someplace other than the Program Files directory (this works in the earlier versions as well). BOINC WIKI ID: 636496 ·

DJStarfox Send message Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2	Message 636502 - Posted: 8 Sep 2007, 15:18:14 UTC - in response to Message 634771. I just started getting a weird error on the work computer, anybody know what this means?? 9/6/2007 9:47:00 AM||Can't rename state file; The process cannot access the file because it is being used by another process. (0x20) 9/6/2007 9:47:00 AM||[error] Couldn't write state file: system rename Thanks, Peter The Rocky Coast of Maine Windows got a file locked but lost the pointer. Shut down all applications and reboot. That should fix it. ID: 636502 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874	Message 638055 - Posted: 10 Sep 2007, 13:57:15 UTC - in response to Message 634393. By the way, I'm finding that, given current client work demand, that running three splitters is a good amount, even though we're not creating work fast enough to fill the result-to-send queue. People are mostly getting what they ask for, with an occasional polite "no work right now come back soon" message. If we add just one more splitter, we will start filling the queue, which in turn means all demands for work will be met, which means more traffic at the download server, which means extra load on the workunit file server from both ends (the splitter and the download server) and everything will go to hell. So, oddly enough, as it stands right now making less work means more work can be sent out. - Matt It depends on the work being split. To a rough approximation, the *supply* of workunits is constant, at about 3 per splitter per second. But the *demand* varies according to the AR, and hence the estimated crunching time, of the results being downloaded - by a factor of up to 4 or more. To put some very rough figures on it, Matt's "sweet spot" of three splitters running at any one time should generate about three-quarters of a million results per day. BOINCStats shows that SETI awards around 25 million credits per day. Divide one figure by the other, and you get a rough 'balance point' around 33 credits per WU. If the work being split averages more than that, Matt will be winning - more work will be being split than is being crunched. Cold CPUs will warm up, caches will be filling, and the 'Results ready to send' buffer will start to fill (hopefully in that order).
On the other hand, if the work being split averages less than 33 credits per WU, then the crunchers will be winning (in the short term) - first the 'Results ready to send' buffer will shrink, and then CPUs will start to go cold and host caches will be drawn down. As we've been discussing here, the Arecibo telescope (not yet RIP) has three main recording modes:
Looking intently at a single point - AR = ~0.01 or lower
Stationary, letting the sky move overhead - AR = ~0.39
Basketweave, doing a rapid sky survey - AR = ~1.49
Other angle ranges are much rarer, and probably correspond to the telescope switching modes during a 107-second data recording period. The three main recording modes translate into WUs yielding approximately 64, 74/54 and 19 credits respectively. (The stationary recording position is very close to a discontinuity in the credit/angle range curve). So Matt wins if the data being split comes from an 'intense study' or 'stationary' period of observing, but loses badly if it comes from a 'basketweave' sky survey. Matt - do you know in advance what kind of recording is on each of your "tapes"? I know they're disks really. If so, it might be an idea to choose them so that no more than one 'basketweave' recording is being split at any one time. ID: 638055 ·
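
The 'balance point' estimate in the post above can be reproduced with a small sketch. The inputs are the rough figures quoted there (3 splitters, about 3 results per splitter per second, around 25 million credits awarded per day, and the approximate credits per WU for each recording mode). The names, and the helper `supply_outpaces_demand`, are invented for illustration, and the comparison is only the post's back-of-envelope rule, not a model of the real scheduler.

```python
SECONDS_PER_DAY = 86_400

# Rough figures quoted in the post above.
SPLITTERS = 3
RESULTS_PER_SPLITTER_PER_SEC = 3
CREDITS_AWARDED_PER_DAY = 25_000_000

# Approximate credits per WU for each recording mode, as quoted.
# "Stationary" sits near a discontinuity, so both quoted values are listed.
CREDITS_PER_WU = {
    "single point (AR ~0.01 or lower)": [64],
    "stationary (AR ~0.39)": [74, 54],
    "basketweave (AR ~1.49)": [19],
}

# Daily supply of results from the splitters.
results_per_day = SPLITTERS * RESULTS_PER_SPLITTER_PER_SEC * SECONDS_PER_DAY

# Credits per WU at which supply and crunching capacity are in balance.
balance_point = CREDITS_AWARDED_PER_DAY / results_per_day


def supply_outpaces_demand(credits_per_wu):
    """True if WUs of this size take long enough that splitting stays ahead."""
    return credits_per_wu > balance_point


print(f"Results per day: {results_per_day:,}")
print(f"Balance point:   {balance_point:.1f} credits per WU")
for mode, credit_values in CREDITS_PER_WU.items():
    verdicts = ["ahead" if supply_outpaces_demand(c) else "behind" for c in credit_values]
    print(f"{mode}: credits {credit_values} -> splitters {verdicts}")
```

With 9 results/sec the daily supply comes to 777,600 results, slightly above the rounded "three-quarters of a million", which puts the balance point a little over 32 credits per WU, close to the post's figure of 33. Either way, only the basketweave work falls below it.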
