Weird Day (Sep 05 2007)

Message boards : Technical News : Weird Day (Sep 05 2007)
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 634393 - Posted: 5 Sep 2007, 23:00:12 UTC

A drive on thumper failed this morning. No major tragedy - there were many spare drives and one was pulled into place immediately and the whole device was resynced by mid-afternoon. We'll have to replace that drive at some point I guess. Spent a chunk of time learning about the current state of the Astropulse research. Also started setting up a small NAS recently purchased by Andrew (who is working on Optical SETI among other things) for his own research.

More of the day was occupied tracking down some splitter issues which came to light only after I finished my new multibeam status program and ran it a couple times. We found certain sequence numbers in our data headers were, as it turns out, not necessarily in sequence. This doesn't affect the raw data, so the scientific analysis is just fine. However, we have some annoying cleanup ahead of us as well as some band-aid programming.

By the way, I'm finding that, given current client work demand, running three splitters is a good amount, even though we're not creating work fast enough to fill the result-to-send queue. People are mostly getting what they ask for, with an occasional polite "no work right now come back soon" message. If we add just one more splitter, we will start filling the queue, which in turn means all demands for work will be met, which means more traffic at the download server, which means extra load on the workunit file server from both ends (the splitter and the download server) and everything will go to hell. So, oddly enough, as it stands right now making less work means more work can be sent out.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 634393
Profile Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 634397 - Posted: 5 Sep 2007, 23:07:16 UTC


Thank You Matt for the Update . . .

and to the Rest of You @ Berkeley working on the Issues - Thank You too!!!

ID: 634397
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51469
Credit: 1,018,363,574
RAC: 1,004
United States
Message 634623 - Posted: 6 Sep 2007, 7:03:21 UTC

Thank you so much Matt for the continued updates. It makes it much easier to exercise patience during the problematic times on Seti when we at least have your insight as to what is going on in the background. Some folks do not realize how much interaction there is between the various subsystems that make up the Seti project. The time you take to give us these informative posts is very much appreciated!
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 634623
Profile ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20359
Credit: 7,508,002
RAC: 20
United Kingdom
Message 634690 - Posted: 6 Sep 2007, 12:42:52 UTC - in response to Message 634393.  

... By the way, I'm finding that, given current client work demand, running three splitters is a good amount, even though we're not creating work fast enough to fill the result-to-send queue. People are mostly getting what they ask for, ... So, oddly enough, as it stands right now making less work [immediately available] means more work can be sent out.

Very interesting. That should give a little insight on the performance bottlenecks.

I guess that in effect, the fileserver load is getting interleaved between the work generators and the downloads/uploads...

Happy crunchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 634690
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 634712 - Posted: 6 Sep 2007, 13:34:07 UTC - in response to Message 634690.  

... By the way, I'm finding that, given current client work demand, running three splitters is a good amount, even though we're not creating work fast enough to fill the result-to-send queue. People are mostly getting what they ask for, ... So, oddly enough, as it stands right now making less work [immediately available] means more work can be sent out.

Very interesting. That should give a little insight on the performance bottlenecks.

I guess that in effect, the fileserver load is getting interleaved between the work generators and the downloads/uploads...

It feels as if the whole splitter/file store/download system is balancing on a knife-edge at the moment.

Overnight, the splitters were mostly churning out long deadline / long crunchtime WUs. They kept ahead of the demand, and even built up a small surplus. The WU download rate was low enough that most downloads went through at the first attempt, which kept the packet-pester rate low too.

But in the last few hours, the tapes being split have started to include some short deadline / short crunch WUs. That change has been enough to make demand outstrip supply: WU requests have increased, the 'ready to send' queue has dropped, and congestion is starting to set in with the repeat requests. Result - we're seeing both "system connect" download errors, and "no work from project", at the same time.

I think we're seeing the throughput capacity limit on the workunit file server: and with the current network topology, it feels like a pretty hard limit.
ID: 634712
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 634726 - Posted: 6 Sep 2007, 13:45:49 UTC

Another hour or so and the results ready to send queue will get to zero,
and the number of downloads trying to happen should start to go down,
6500 ready to send @ 1300 UTC and 4500 @ 1330 UTC.
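
The "another hour or so" estimate can be checked with a straight-line extrapolation from the two readings above. This is only a rough sketch: it assumes the drain rate stays constant, and the function name is made up for illustration.

```python
# Linear extrapolation of the "results ready to send" queue from the two
# readings quoted above (6500 at 13:00 UTC, 4500 at 13:30 UTC).
# Assumption: the queue keeps draining at the same constant rate.

def minutes_until_empty(count_then, count_now, minutes_between):
    """Minutes until the queue reaches zero at a constant drain rate."""
    drain_per_minute = (count_then - count_now) / minutes_between
    if drain_per_minute <= 0:
        raise ValueError("queue is not shrinking")
    return count_now / drain_per_minute

eta = minutes_until_empty(6500, 4500, 30)
print(f"Queue empty roughly {eta:.1f} minutes after the second reading")
```

With these two readings the queue loses 2000 results in 30 minutes, so the remaining 4500 would last a little over an hour.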

Claggy
ID: 634726
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 634740 - Posted: 6 Sep 2007, 14:08:23 UTC - in response to Message 634726.  

Another hour or so and the results ready to send queue will get to zero,
and the number of downloads trying to happen should start to go down,
6500 ready to send @ 1300 UTC and 4500 @ 1330 UTC.

Claggy

Unfortunately, it's not as simple as that.

The splitters have been making a fairly steady 9 results/sec, or 32,400 results per hour. Another 4,000 per hour more have been coming out of buffer storage recently, but the rate will only drop by ~10% when the buffer store is empty.
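
For anyone who wants to check those numbers, here is the arithmetic as a short sketch. Both input figures are the ones quoted in this post; the variable names are just for illustration.

```python
# Arithmetic behind the rates quoted above.
SPLITTER_RESULTS_PER_SEC = 9       # steady splitter output, from the post
BUFFER_RESULTS_PER_HOUR = 4_000    # extra results drawn from buffer storage

splitter_per_hour = SPLITTER_RESULTS_PER_SEC * 3600
total_per_hour = splitter_per_hour + BUFFER_RESULTS_PER_HOUR

# Fraction of the outgoing rate that disappears once the buffer is empty.
drop_fraction = BUFFER_RESULTS_PER_HOUR / total_per_hour

print(f"Splitters alone: {splitter_per_hour} results/hour")
print(f"Splitters plus buffer: {total_per_hour} results/hour")
print(f"Drop when the buffer empties: {drop_fraction:.0%}")
```

The drop works out to about 11%, in line with the "~10%" above.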

Also, I'm pretty sure that the server status page should really be read as "Results ready to be allocated" - in other words, the number "Ready to ..." on the status page is decreased by one as soon as your computer makes contact with the server and is assigned a WU, whether the file download has completed or not.

I don't think Berkeley would have any easy way of knowing whether a particular file download has been completed or not: the whole point of separating scheduling from file transfer was to avoid a (costly) database access for every download/upload: and if it isn't stored in the database, it probably isn't stored anywhere.

So we have no way of knowing how many downloads, globally, are pending at any one time: the cricket graphs only give an idea of how many are succeeding. But I strongly suspect we won't see any reduction in the download 'system connect' errors until well after the "Ready to ..." figure has started to increase above zero again, meaning that all personal caches are full and the WU request rate has dropped.
ID: 634740
Profile K2XM
Joined: 19 Aug 99
Posts: 2
Credit: 197,477
RAC: 0
United States
Message 634771 - Posted: 6 Sep 2007, 15:24:48 UTC - in response to Message 634740.  

I just started getting a weird error on the work computer, anybody know what this means??


9/6/2007 9:47:00 AM||Can't rename state file; The process cannot access the file because it is being used by another process. (0x20)

9/6/2007 9:47:00 AM||[error] Couldn't write state file: system rename


Thanks,
Peter
The Rocky Coast of Maine
ID: 634771
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 634772 - Posted: 6 Sep 2007, 15:31:52 UTC - in response to Message 634771.  

I just started getting a weird error on the work computer, anybody know what this means??


9/6/2007 9:47:00 AM||Can't rename state file; The process cannot access the file because it is being used by another process. (0x20)

9/6/2007 9:47:00 AM||[error] Couldn't write state file: system rename

Something other than BOINC is looking at the files in the BOINC folder: probably a virus scanner or some such.
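
To illustrate the failure mode, here is a minimal sketch of the write-then-rename pattern. It is not BOINC's actual code: it assumes the client writes a temporary file and renames it over the old state file, which is what the "system rename" message hints at. The function name and the retry settings are hypothetical. On Windows, a rename onto a file that another process holds open typically fails with a sharing violation (the 0x20 in the log), which Python reports as PermissionError.

```python
import os
import tempfile
import time

def write_state_file(path, text, retries=5, delay=0.2):
    """Write text to a temp file, then rename it over path, retrying on locks."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
        for attempt in range(retries):
            try:
                # Replaces the destination in one step if nothing has it locked.
                os.replace(tmp_path, path)
                return
            except PermissionError:
                # Another process (e.g. a scanner) may have the file open.
                if attempt == retries - 1:
                    raise
                time.sleep(delay)
    finally:
        # Clean up the temp file if the rename never succeeded.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

If the other program only holds the file briefly, a short retry like this gets past it; if it holds the file for a long time, the rename keeps failing, which matches the repeated errors in the log.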

Whatever it is, it's in your computer, rather than anything to do with the Technical News from Berkeley. If you need further help, it would be better to post in Q&A or Number Crunching.
ID: 634772
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 634791 - Posted: 6 Sep 2007, 16:22:34 UTC - in response to Message 634771.  

I just started getting a weird error on the work computer, anybody know what this means??


9/6/2007 9:47:00 AM||Can't rename state file; The process cannot access the file because it is being used by another process. (0x20)

9/6/2007 9:47:00 AM||[error] Couldn't write state file: system rename


Thanks,
Peter
The Rocky Coast of Maine


I had that error at the start of the year, spontaneously on one of two machines every few hours, when I was running BoincLogX. Removing BoincLogX stopped it and it hasn't happened since. I had searched high and low, tried removing antivirus programs, and done all sorts of investigation before I pinned that down.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 634791
DJStarfox
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 634793 - Posted: 6 Sep 2007, 16:32:45 UTC - in response to Message 634393.  

That's a brilliant observation about having 3 splitters. Just as in road construction, it doesn't pay to have a nice, wide road if the traffic has to merge to one lane down the pipe. The widest pipe should be at the end first, then upgrade each component upstream on the chain until you get the performance desired. That way, in this case, the file servers can keep up by throttling the load at the splitters.
ID: 634793
Profile K2XM
Joined: 19 Aug 99
Posts: 2
Credit: 197,477
RAC: 0
United States
Message 635420 - Posted: 7 Sep 2007, 15:06:27 UTC - in response to Message 634791.  

I just started getting a weird error on the work computer, anybody know what this means??


9/6/2007 9:47:00 AM||Can't rename state file; The process cannot access the file because it is being used by another process. (0x20)

9/6/2007 9:47:00 AM||[error] Couldn't write state file: system rename


Thanks,
Peter
The Rocky Coast of Maine


I had that error at the start of the year, spontaneously on one of two machines every few hours, when I was running BoincLogX. Removing BoincLogX stopped it and it hasn't happened since. I had searched high and low, tried removing antivirus programs, and done all sorts of investigation before I pinned that down.


I gave the computer a good swift kick in the CPU ( aka shutting down and restarting ) and I haven't had the problem since...only time will tell if this cured it.


ID: 635420
William Roeder
Volunteer tester
Joined: 19 May 99
Posts: 69
Credit: 523,414
RAC: 0
United States
Message 636410 - Posted: 8 Sep 2007, 14:19:51 UTC - in response to Message 634771.  

I just started getting a weird error on the work computer, anybody know what this means??

9/6/2007 9:47:00 AM||Can't rename state file; The process cannot access the file because it is being used by another process. (0x20)

9/6/2007 9:47:00 AM||[error] Couldn't write state file: system rename

On Windows, get a copy of Unlocker.
ID: 636410
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 636496 - Posted: 8 Sep 2007, 15:09:50 UTC - in response to Message 634771.  

I just started getting a weird error on the work computer, anybody know what this means??


9/6/2007 9:47:00 AM||Can't rename state file; The process cannot access the file because it is being used by another process. (0x20)

9/6/2007 9:47:00 AM||[error] Couldn't write state file: system rename


Thanks,
Peter
The Rocky Coast of Maine

Typically it means the user account that BOINC is running under does not have sufficient rights to write to the BOINC directory tree. The default location is in the Program Files directory. All users in Vista and restricted users in NT/2K/XP cannot write into the Program Files directory. In NT/2K/XP this can be fixed by changing the rights required to write into the BOINC directory. In Vista this is best fixed by installing someplace other than the Program Files directory (this works in the earlier versions as well).


BOINC WIKI
ID: 636496
DJStarfox
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 636502 - Posted: 8 Sep 2007, 15:18:14 UTC - in response to Message 634771.  

I just started getting a weird error on the work computer, anybody know what this means??


9/6/2007 9:47:00 AM||Can't rename state file; The process cannot access the file because it is being used by another process. (0x20)

9/6/2007 9:47:00 AM||[error] Couldn't write state file: system rename


Thanks,
Peter
The Rocky Coast of Maine


Windows got a file locked but lost the pointer. Shut down all applications and reboot. That should fix it.
ID: 636502
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 638055 - Posted: 10 Sep 2007, 13:57:15 UTC - in response to Message 634393.  

By the way, I'm finding that, given current client work demand, running three splitters is a good amount, even though we're not creating work fast enough to fill the result-to-send queue. People are mostly getting what they ask for, with an occasional polite "no work right now come back soon" message. If we add just one more splitter, we will start filling the queue, which in turn means all demands for work will be met, which means more traffic at the download server, which means extra load on the workunit file server from both ends (the splitter and the download server) and everything will go to hell. So, oddly enough, as it stands right now making less work means more work can be sent out.

- Matt

It depends on the work being split. To a rough approximation, the supply of workunits is constant, at about 3 per splitter per second. But the demand varies according to the AR, and hence the estimated crunching time, of the results being downloaded - by a factor of 4 or more.

To put some very rough figures on it, Matt's "sweet spot" of three splitters running at any one time should generate about three-quarters of a million results per day. BOINCStats shows that SETI awards around 25 million credits per day. Divide one figure by the other, and you get a rough 'balance point' around 33 credits per WU.
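
Spelled out, the rough figures above come from the following arithmetic. The rate of about 3 per splitter per second and the 25 million credits per day are the figures quoted in this post; the variable names are just for illustration.

```python
# Rough balance point between work created and work crunched.
SPLITTERS = 3
RESULTS_PER_SPLITTER_PER_SEC = 3      # rough supply rate quoted above
CREDITS_AWARDED_PER_DAY = 25_000_000  # BOINCStats figure quoted above
SECONDS_PER_DAY = 24 * 60 * 60

results_per_day = SPLITTERS * RESULTS_PER_SPLITTER_PER_SEC * SECONDS_PER_DAY
balance_credits_per_wu = CREDITS_AWARDED_PER_DAY / results_per_day

print(f"Results generated per day: {results_per_day:,}")
print(f"Balance point: about {balance_credits_per_wu:.1f} credits per WU")
```

That gives 777,600 results per day and a balance point a little above 32 credits per WU, which is where the rough figure of 33 comes from given how approximate the inputs are.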

If the work being split averages more than that, Matt will be winning - more work will be being split than is being crunched. Cold CPUs will warm up, caches will be filling, and the 'Results ready to send' buffer will start to fill (hopefully in that order).

On the other hand, if the work being split averages less than 33 credits per WU, then the crunchers will be winning (in the short term) - first the 'Results ready to send' buffer will shrink, and then CPUs will start to go cold and host caches will be drawn down.

As we've been discussing here, the Arecibo telescope (not yet RIP) has three main recording modes:

Looking intently at a single point - AR = ~0.01 or lower
Stationary, letting the sky move overhead - AR = ~0.39
Basketweave, doing a rapid sky survey - AR = ~1.49

Other angle ranges are much rarer, and probably correspond to the telescope switching modes during a 107-second data recording period.

The three main recording modes translate into WUs yielding approximately 64, 74/54 and 19 credits respectively. (The stationary recording position is very close to a discontinuity in the credit/angle range curve).

So Matt wins if the data being split comes from an 'intense study' or 'stationary' period of observing, but loses badly if it comes from a 'basketweave' sky survey.
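
The same comparison as a small sketch, using the approximate credit yields and the rough 33-credit balance point from this post. The function name is made up, and comparing the lowest yield of each mode against the balance point is a simplifying assumption.

```python
# Compare each recording mode's approximate credit yield with the balance point.
BALANCE_CREDITS_PER_WU = 33  # rough balance point from the estimate above

CREDITS_BY_MODE = {
    "single point, AR ~0.01": [64],
    "stationary, AR ~0.39": [74, 54],  # two values: near a discontinuity
    "basketweave, AR ~1.49": [19],
}

def splitters_keep_up(credit_values, balance=BALANCE_CREDITS_PER_WU):
    """True if even the lowest yield for this mode is above the balance point."""
    return min(credit_values) > balance

for mode, credits in CREDITS_BY_MODE.items():
    verdict = "supply keeps up" if splitters_keep_up(credits) else "demand outstrips supply"
    print(f"{mode}: {credits} credits -> {verdict}")
```

Only the basketweave mode falls below the balance point, and by a wide margin.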

Matt - do you know in advance what kind of recording is on each of your "tapes"? I know they're disks really. If so, it might be an idea to choose them so that no more than one 'basketweave' recording is being split at any one time.
ID: 638055


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.