Movin' Along (Apr 18 2007)

Author	Message
Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 548713 - Posted: 18 Apr 2007, 20:37:16 UTC Yesterday we started the creation of a new index in the science database on a field in the Gaussian table. When creating an index, the table gets locked, so you can't insert anything, so we disabled the assimilators. This is a step towards developing the near time persistency checker (the thing that actually hunts for ET automatically in the background as signals come in without waiting for our intervention - we might get some science done after all!). However, during the post-outage recovery yesterday and starting up the assimilators this morning we found bruno was dropping TCP connections. Eric adjusted various tcp parameters last night and again this morning to alleviate this bottleneck. That helped a bit, but it wasn't until I bumped up the MaxClients in the apache config that the dam really broke open. As is common with such problems, I'm not sure why we were choked in the first place, as the previous tcp/apache settings were more than adequate 24 hours earlier. In brighter news, db_dump seems to be working again. Cool. Today's batch is being generated as I type. Stats all around! - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 548713 ·

Alinator Volunteer tester Send message Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0	Message 548715 - Posted: 18 Apr 2007, 20:42:08 UTC Last modified: 18 Apr 2007, 20:43:58 UTC Well, I might be the only happy camper participant about the bottleneck. It may end up letting a few more of the detached results I had on my inaccessible host get credit because everyone else got slowed big time and it got a chance to catch up more. ;-) In any event, hopefully things will simmer down a bit over in NC. :-) <edit> Oh yeah, new personal high day for me on SAH! Alinator ID: 548715 ·

Mugsy Send message Joined: 10 Jan 03 Posts: 2 Credit: 669,231 RAC: 0	Message 548750 - Posted: 18 Apr 2007, 22:32:21 UTC Why no credit for the last seven days for me? Problem on my end or does your message explain why? Marc Mugmon ID: 548750 ·

ML1 Volunteer moderator Volunteer tester Send message Joined: 25 Nov 01 Posts: 20267 Credit: 7,508,002 RAC: 20	Message 548756 - Posted: 18 Apr 2007, 22:48:41 UTC - in response to Message 548713. ... it wasn't until I bumped up the MaxClients in the apache config that the dam really broke open. As common with such problems, I'm not sure why we were choked in the first place, as the previous tcp/apache settings were more than adequate 24 hours earlier. Is that possibly a knock-on effect from the Boinc clients requiring to open a series of connections to complete an upload/download transaction? For example, the first enquiry succeeds, and then a second subsequent connection request to actually make the upload/download fails due to no more free connections being available. The whole process then must start from the beginning again... Just a wild guess hypothesis for if the MaxClients was actually getting hit... Cheers, Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) ID: 548756 ·
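
Martin's hypothesis above can be illustrated with a little arithmetic. This is a minimal sketch, assuming (hypothetically) that every connection attempt succeeds independently with the same probability `p` and that one transaction needs `k` successful connections in a row. Neither figure comes from the thread, and this is not a description of BOINC's actual transfer protocol.

```python
def transaction_success(p: float, k: int) -> float:
    """Probability that k independent connection attempts all succeed.

    p and k are illustrative assumptions, not measured values.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return p ** k


def expected_restarts(p: float, k: int) -> float:
    """Expected number of times the whole transaction is started,
    if any failed connection forces a restart from the beginning."""
    success = transaction_success(p, k)
    if success == 0.0:
        return float("inf")
    return 1.0 / success


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.2):
        print(p, transaction_success(p, 2), expected_restarts(p, 2))
```

Under these assumptions, a server that accepts only half of all connection attempts completes a two-connection transaction a quarter of the time, so the load from restarts grows faster than the raw refusal rate would suggest.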

Pappa Volunteer tester Send message Joined: 9 Jan 00 Posts: 2562 Credit: 12,301,681 RAC: 0	Message 548764 - Posted: 18 Apr 2007, 23:01:04 UTC - in response to Message 548750. Marc, welcome to Seti. One of the previous posts explained that the Replica Database had issues and the db_dump process ran against that database. So after the database repair, the db_dump process was refusing to run properly... Today it had decided it was done messing with Matt... So no, it was not on your end... Why no credit for the last seven days for me? Problem on my end or does your message explain why? Marc Mugmon Please consider a Donation to the Seti Project. ID: 548764 ·

1mp0£173 Volunteer tester Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 548777 - Posted: 18 Apr 2007, 23:43:18 UTC - in response to Message 548750. Why no credit for the last seven days for me? Problem on my end or does your message explain why? Marc Mugmon If this work unit is typical, you're the first one to return it, and you'll get credit when it is validated against at least one other result. I didn't study all of your results, and I didn't look to see what the validator backlog might be. ID: 548777 ·

Walla Volunteer tester Send message Joined: 14 May 06 Posts: 329 Credit: 177,013 RAC: 0	Message 548808 - Posted: 19 Apr 2007, 1:44:19 UTC - in response to Message 548777. Last modified: 19 Apr 2007, 1:44:58 UTC and I didn't look to see what the validator backlog might be. a really big number at the moment ID: 548808 ·

Viking Send message Joined: 2 Nov 03 Posts: 17 Credit: 1,051,900 RAC: 1	Message 548830 - Posted: 19 Apr 2007, 2:34:52 UTC And the Validator backlog keeps getting bigger, not smaller, over the last 4 hours or so... 138,728 at the moment. * Viking * ID: 548830 ·

JohnAlton Send message Joined: 28 Aug 01 Posts: 54 Credit: 164,417,653 RAC: 369	Message 548844 - Posted: 19 Apr 2007, 3:01:47 UTC - in response to Message 548830. And the Validator backlog keeps getting bigger, not smaller, over the last 4 hours or so... 138,728 at the moment. Is it validating at all? Mine are really mounting up. ID: 548844 ·

Alinator Volunteer tester Send message Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0	Message 548858 - Posted: 19 Apr 2007, 3:20:48 UTC I have had some validate in the last few hours, so it must still be doing something, just not very quickly. Alinator ID: 548858 ·

Haos.PL Volunteer tester Send message Joined: 18 Mar 04 Posts: 63 Credit: 3,268,546 RAC: 0	Message 548867 - Posted: 19 Apr 2007, 4:23:48 UTC This is quite a backlog. I'm a small-scale kruncher, and having three WUs pending concurrently... this is a rare sight for me. ID: 548867 ·

W-K 666 Volunteer tester Send message Joined: 18 May 99 Posts: 19048 Credit: 40,757,560 RAC: 67	Message 548913 - Posted: 19 Apr 2007, 6:47:11 UTC According to Scarecrow's graphs the awaiting validation is coming down slowly. Andy ID: 548913 ·

M4rtyn Volunteer tester Send message Joined: 4 Aug 03 Posts: 48 Credit: 799,965 RAC: 0	Message 548955 - Posted: 19 Apr 2007, 10:53:49 UTC Last modified: 19 Apr 2007, 10:54:53 UTC Strange! In the last 24 hours I've seen my pending credit plummet from around my usual 2000 to 450, lower than it's been for a long time, and no real change in my work pattern either. m4rtyn ID: 548955 ·

Mugsy Send message Joined: 10 Jan 03 Posts: 2 Credit: 669,231 RAC: 0	Message 548960 - Posted: 19 Apr 2007, 11:13:13 UTC - in response to Message 548764. Marc, welcome to Seti. One of the previous posts explained that the Replica Database had issues and the db_dump process ran against that database. So after the database repair, the db_dump process was refusing to run properly... Today it had decided it was done messing with Matt... So no, it was not on your end... Why no credit for the last seven days for me? Problem on my end or does your message explain why? Marc Mugmon Thanks. Today I see over 10,000 credited, so I guess all is well. I've been on Seti for a long time, but only started in December to again become active, after adding two screamingly fast Macs to my collection. Marc ID: 548960 ·

1mp0£173 Volunteer tester Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 549050 - Posted: 19 Apr 2007, 16:41:18 UTC This is time for a reality check, folks. The reason SETI has separate processes for different purposes (validators, schedulers, upload servers, download servers, etc.) is so that parts of the system can be shut down with very little impact on the project as a whole. If the validator is shut down, work is queued, and when the validator comes back it catches up. If the scheduler is down, work can still be uploaded, and reported later. If the upload server is down, work can queue on the BOINC clients. This is not intended to be a real-time system. It usually is near-real-time, but when things slow down it isn't an automatic crisis. In other news: Matt explained why the validators are down. Has to do with the real-time-persistency-checker we're all interested in -- the code that will tell us if the signals we've found are in the same place over, and over, and over. ID: 549050 ·
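
The queue-and-catch-up behaviour described in the post above can be sketched as a toy simulation. Everything here is an illustrative assumption: the `Stage` class, the arrival rate of two results per tick, the processing rate of three per tick, and the four-tick outage are all hypothetical and are not taken from the BOINC server code.

```python
from collections import deque


class Stage:
    """One server-side process (for example a validator) fed by its own queue."""

    def __init__(self, name, rate):
        self.name = name
        self.rate = rate        # items processed per tick while running
        self.running = True
        self.queue = deque()    # work waiting for this stage
        self.done = []          # work this stage has finished

    def submit(self, item):
        # Work is accepted even while the stage is shut down.
        self.queue.append(item)

    def tick(self):
        # Process up to `rate` queued items; do nothing while stopped.
        if not self.running:
            return 0
        count = min(self.rate, len(self.queue))
        for _ in range(count):
            self.done.append(self.queue.popleft())
        return count


def simulate(ticks=10, outage=range(2, 6)):
    """Return the backlog after each tick and the list of finished items."""
    validator = Stage("validator", rate=3)
    backlog = []
    for t in range(ticks):
        validator.running = t not in outage
        validator.submit(f"result-{t}-a")   # two results arrive per tick
        validator.submit(f"result-{t}-b")
        validator.tick()
        backlog.append(len(validator.queue))
    return backlog, validator.done


if __name__ == "__main__":
    backlog, done = simulate()
    print(backlog)
```

In this sketch the backlog grows only while the stage is stopped and shrinks afterwards, because the stage is assumed to process work faster than it arrives. No result is lost; it is only delayed.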

Alinator Volunteer tester Send message Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0	Message 549052 - Posted: 19 Apr 2007, 16:49:39 UTC Actually I believe it was the assimilators he mentioned because mods were being made to the MSD for RTPC, but your point is well taken in any event. Alinator ID: 549052 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 549053 - Posted: 19 Apr 2007, 16:51:57 UTC - in response to Message 549050. This is time for a reality check, folks. The reason SETI has separate processes for different purposes (validators, schedulers, upload servers, download servers, etc.) is so that parts of the system can be shut down with very little impact on the project as a whole. If the validator is shut down, work is queued, and when the validator comes back it catches up. If the scheduler is down, work can still be uploaded, and reported later. If the upload server is down, work can queue on the BOINC clients. This is not intended to be a real-time system. It usually is near-real-time, but when things slow down it isn't an automatic crisis. In other news: Matt explained why the validators are down. Has to do with the real-time-persistency-checker we're all interested in -- the code that will tell us if the signals we've found are in the same place over, and over, and over. nice points thanks, I was thinking about it in terms of hardware fault tolerance, but yours says it nicer. The whole "real time persistency checker" issue sends chills down my spine, I vote that takes precedence .... Jason "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 549053 ·

1mp0£173 Volunteer tester Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 549104 - Posted: 19 Apr 2007, 18:09:10 UTC - in response to Message 549053. Last modified: 19 Apr 2007, 18:09:42 UTC nice points thanks, I was thinking about it in terms of hardware fault tolerance, but yours says it nicer. The whole "real time persistency checker" issue sends chills down my spine, I vote that takes precedence .... Jason Jason, The design concepts behind BOINC start with the idea that fault tolerance and redundancy are expensive. We saw this with Classic: the classic screensaver downloaded one work unit, did it, and uploaded. If SETI was down, it waited. ... and if you're someone like Amazon or Google, you address this by spreading servers across many datacenters, you have multiple connections to the net, etc. At a minimum, SETI would buy two or three connections to the net, buy a bunch more servers, and bring them on campus by diverse paths. Expensive. The alternative is to build a BOINC client that tolerates outages. By doing that SETI can avoid things like hot-failover, redundant connections, and the crunching keeps going. If things are down for a few hours, it's no big deal. It means that SETI can be done on a shoestring budget. It means that we can actually have BOINC based projects run by individuals out of their own pockets. ... and that's very cool. We need to look at SETI stability and let go of the "Amazon" model. If I want to buy something, and Amazon can't take my order, someone else will. If the BOINC servers are down, the BOINC client will deal with it later. We need to think of 90% uptime as a reasonable goal. -- Ned ID: 549104 ·
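
The idea of a client that "will deal with it later" can be sketched as a retry loop with capped exponential backoff. This is a minimal illustration, not the BOINC client's actual policy: the function names, the 60-second base delay, the four-hour cap, and the jitter range are all hypothetical values chosen for the example.

```python
import random


def backoff_delays(attempts, base=60.0, cap=4 * 3600.0, rng=None):
    """Return one wait time in seconds per attempt.

    Each delay is drawn uniformly from the upper half of an
    exponentially growing window, and the window is capped at `cap`.
    All constants are assumptions made for this sketch.
    """
    rng = rng or random.Random()
    delays = []
    for n in range(attempts):
        ceiling = min(cap, base * (2 ** n))
        delays.append(rng.uniform(ceiling / 2, ceiling))
    return delays


def upload_with_retry(try_upload, max_attempts=10, sleep=lambda s: None, rng=None):
    """Call try_upload() until it returns True or attempts run out.

    Returns the number of the attempt that succeeded, or None.
    `sleep` is injectable so the sketch can run without really waiting;
    pass time.sleep to wait for real.
    """
    delays = backoff_delays(max_attempts, rng=rng)
    for attempt, delay in enumerate(delays, start=1):
        if try_upload():
            return attempt
        sleep(delay)
    return None


if __name__ == "__main__":
    # Pretend the server is down for the first two attempts.
    outcomes = iter([False, False, True])
    print(upload_with_retry(lambda: next(outcomes)))
```

The randomised delay is there so that many clients recovering from the same outage do not all reconnect at the same moment. For scale, 90% uptime corresponds to an average of 2.4 hours of downtime per day, which a client that retries over a span of hours can absorb.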

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 549200 - Posted: 19 Apr 2007, 21:01:20 UTC - in response to Message 549104. Last modified: 19 Apr 2007, 21:03:58 UTC We need to think of 90% uptime as a reasonable goal. Can't see any reason to argue with that. One approach the boinc developers may or may not have thought of is the "minimal feedback" approach from the system. Okay there are those of us that like to look at our credit graphs etc.. that's important and should stay IMO... but boinc logs tend to induce Panic... even though it handles most stuff (pretty much everything) well on its own. one small example was when I flicked to the messages pane, although everything was running smoothly there was a bold red line indicating a certain workunit file could not be deleted. boinc in its fault tolerant manner handled it fine and removed the file later. of course I frantically rechecked my antivirus wasn't locking boinc folder files etc... not necessary, the log made me do it :D If boinc can't upload a workunit, it tries again later. do we really need to see the verbose logs? I guess this question is partially answered by the introduction of the simple interface. do we need to ride the suspend network activity button? probably not, but it's there. have option will fiddle. Jason "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 549200 ·

John McLeod VII Volunteer developer Volunteer tester Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0	Message 549267 - Posted: 19 Apr 2007, 23:03:18 UTC - in response to Message 549104. We need to think of 90% uptime as a reasonable goal. There are some projects where I could wish for 5% uptime... BOINC WIKI ID: 549267 ·

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.