Movin' Along (Apr 18 2007)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 548713 - Posted: 18 Apr 2007, 20:37:16 UTC

Yesterday we started creating a new index in the science database on a field in the Gaussian table. Creating an index locks the table, so nothing can be inserted, which is why we disabled the assimilators. This is a step towards developing the near-time persistency checker (the thing that actually hunts for ET automatically in the background as signals come in, without waiting for our intervention; we might get some science done after all!).

However, during the post-outage recovery yesterday and while starting up the assimilators this morning, we found bruno was dropping TCP connections. Eric adjusted various TCP parameters last night and again this morning to alleviate this bottleneck. That helped a bit, but it wasn't until I bumped up MaxClients in the Apache config that the dam really broke open. As is common with such problems, I'm not sure why we were choked in the first place, as the previous TCP/Apache settings were more than adequate 24 hours earlier.

In brighter news, db_dump seems to be working again. Cool. Today's batch is being generated as I type. Stats all around!

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 548713
Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 548715 - Posted: 18 Apr 2007, 20:42:08 UTC
Last modified: 18 Apr 2007, 20:43:58 UTC

Well, I might be the only happy camper about the bottleneck. It may end up letting a few more of the detached results I had on my inaccessible host get credit, because everyone else got slowed down big time and it got a chance to catch up more. ;-)

In any event, hopefully things will simmer down a bit over in NC. :-)

<edit> Oh yeah, new personal high day for me on SAH!

Alinator
ID: 548715
Mugsy
Joined: 10 Jan 03
Posts: 2
Credit: 669,231
RAC: 0
United States
Message 548750 - Posted: 18 Apr 2007, 22:32:21 UTC

Why no credit for the last seven days for me? Problem on my end or does your message explain why?

Marc Mugmon
ID: 548750
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20267
Credit: 7,508,002
RAC: 20
United Kingdom
Message 548756 - Posted: 18 Apr 2007, 22:48:41 UTC - in response to Message 548713.  

... it wasn't until I bumped up MaxClients in the Apache config that the dam really broke open. As is common with such problems, I'm not sure why we were choked in the first place, as the previous TCP/Apache settings were more than adequate 24 hours earlier.

Is that possibly a knock-on effect of the BOINC clients needing to open a series of connections to complete an upload/download transaction?

For example, the first enquiry succeeds, and then a subsequent connection request to actually make the upload/download fails because no free connections are available. The whole process must then start from the beginning again...

Just a wild-guess hypothesis, in case MaxClients was actually getting hit...
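
To put rough numbers on that guess, here is a minimal Python sketch. It assumes each connection attempt succeeds independently with the same probability, and that a failed step restarts the whole transaction from the beginning; neither assumption is confirmed for the real BOINC client or server, and the function names are made up for illustration.

```python
def transaction_success_probability(p_connect: float, steps: int) -> float:
    """Probability that all `steps` sequential connections succeed.

    Assumes every connection attempt succeeds independently with
    probability `p_connect` (a simplification, not a measured value).
    """
    if not 0.0 <= p_connect <= 1.0:
        raise ValueError("p_connect must be between 0 and 1")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return p_connect ** steps


def expected_attempts(p_connect: float, steps: int) -> float:
    """Expected number of full passes until one transaction completes.

    A pass restarts from the first step whenever any step fails, so the
    number of passes follows a geometric distribution.
    """
    p_all = transaction_success_probability(p_connect, steps)
    if p_all == 0.0:
        return float("inf")
    return 1.0 / p_all
```

Under these assumptions, if only 80% of connection attempts get through and a transaction needs three of them, roughly half of all transactions (0.8 cubed, about 0.51) complete on the first pass, so a modest shortage of free connections could hurt multi-step transfers disproportionately.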

Cheers,
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 548756
Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 548764 - Posted: 18 Apr 2007, 23:01:04 UTC - in response to Message 548750.  

Marc

Welcome to Seti

One of the previous posts explained that the replica database had issues and that the db_dump process ran against that database. So after the database repair, the db_dump process was refusing to run properly... Today it decided it was done messing with Matt...

So no, it was not on your end...

Why no credit for the last seven days for me? Problem on my end or does your message explain why?

Marc Mugmon


Please consider a Donation to the Seti Project.

ID: 548764
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 548777 - Posted: 18 Apr 2007, 23:43:18 UTC - in response to Message 548750.  

Why no credit for the last seven days for me? Problem on my end or does your message explain why?

Marc Mugmon

If this work unit is typical, you're the first one to return it, and you'll get credit when it is validated against at least one other result.

I didn't study all of your results, and I didn't look to see what the validator backlog might be.
ID: 548777
Walla
Volunteer tester
Joined: 14 May 06
Posts: 329
Credit: 177,013
RAC: 0
United States
Message 548808 - Posted: 19 Apr 2007, 1:44:19 UTC - in response to Message 548777.  
Last modified: 19 Apr 2007, 1:44:58 UTC

and I didn't look to see what the validator backlog might be.


a really big number at the moment
ID: 548808
*Viking*
Joined: 2 Nov 03
Posts: 17
Credit: 1,051,900
RAC: 1
Canada
Message 548830 - Posted: 19 Apr 2007, 2:34:52 UTC

And the Validator backlog keeps getting bigger, not smaller, over the last 4 hours or so... 138,728 at the moment.
* Viking *
ID: 548830
JohnAlton
Joined: 28 Aug 01
Posts: 54
Credit: 164,417,653
RAC: 369
United States
Message 548844 - Posted: 19 Apr 2007, 3:01:47 UTC - in response to Message 548830.  

And the Validator backlog keeps getting bigger, not smaller, over the last 4 hours or so... 138,728 at the moment.


Is it validating at all? Mine are really mounting up.

ID: 548844
Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 548858 - Posted: 19 Apr 2007, 3:20:48 UTC

I have had some validate in the last few hours, so it must still be doing something, just not very quickly.

Alinator
ID: 548858
Haos.PL
Volunteer tester
Joined: 18 Mar 04
Posts: 63
Credit: 3,268,546
RAC: 0
Poland
Message 548867 - Posted: 19 Apr 2007, 4:23:48 UTC

This is quite a backlog. I'm a small-scale cruncher, and having three WUs pending concurrently... this is a rare sight for me.
ID: 548867
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19048
Credit: 40,757,560
RAC: 67
United Kingdom
Message 548913 - Posted: 19 Apr 2007, 6:47:11 UTC

According to Scarecrow's graphs, the awaiting-validation count is coming down slowly.



Andy
ID: 548913
M4rtyn
Volunteer tester
Joined: 4 Aug 03
Posts: 48
Credit: 799,965
RAC: 0
United Kingdom
Message 548955 - Posted: 19 Apr 2007, 10:53:49 UTC
Last modified: 19 Apr 2007, 10:54:53 UTC

Strange! In the last 24 hours I've seen my pending credit plummet from around my usual 2000 to 450, lower than it's been for a long time, and with no real change in my work pattern either.

m4rtyn
ID: 548955
Mugsy
Joined: 10 Jan 03
Posts: 2
Credit: 669,231
RAC: 0
United States
Message 548960 - Posted: 19 Apr 2007, 11:13:13 UTC - in response to Message 548764.  

Marc

Welcome to Seti

One of the previous posts explained that the replica database had issues and that the db_dump process ran against that database. So after the database repair, the db_dump process was refusing to run properly... Today it decided it was done messing with Matt...

So no, it was not on your end...

Why no credit for the last seven days for me? Problem on my end or does your message explain why?

Marc Mugmon




Thanks. Today I see over 10,000 credited, so I guess all is well.

I've been on Seti for a long time, but only became active again in December, after adding two screamingly fast Macs to my collection.

Marc
ID: 548960
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 549050 - Posted: 19 Apr 2007, 16:41:18 UTC

This is time for a reality check, folks.

The reason SETI has separate processes for different purposes (validators, schedulers, upload servers, download servers, etc.) is so that parts of the system can be shut down with very little impact on the project as a whole.

If the validator is shut down, work is queued, and when the validator comes back it catches up.

If the scheduler is down, work can still be uploaded, and reported later.

If the upload server is down, work can queue on the BOINC clients.

This is not intended to be a real-time system. It usually is near-real-time, but when things slow down it isn't an automatic crisis.
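
The queue-and-catch-up behaviour described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the `ValidatorQueue` class, its `batch_size` parameter and the in-memory queue are all hypothetical, and the real BOINC server keeps this state in its database rather than in a Python object.

```python
from collections import deque


class ValidatorQueue:
    """Toy model of a validator that can go offline without losing work."""

    def __init__(self, batch_size: int = 2):
        self.pending = deque()   # results waiting for validation
        self.validated = []      # results already validated, in order
        self.online = True       # whether the validator process is running
        self.batch_size = batch_size

    def report_result(self, result_id) -> None:
        """Accept a result at any time, even while the validator is down."""
        self.pending.append(result_id)

    def tick(self) -> int:
        """Run one processing cycle and return how many results it validated."""
        if not self.online:
            return 0
        done = 0
        while self.pending and done < self.batch_size:
            self.validated.append(self.pending.popleft())
            done += 1
        return done
```

While `online` is false the backlog simply grows; once it is set back to true, repeated calls to `tick()` drain the queue in arrival order, which is the "it catches up" behaviour in miniature.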

In other news: Matt explained why the validators are down. Has to do with the real-time-persistency-checker we're all interested in -- the code that will tell us if the signals we've found are in the same place over, and over, and over.
ID: 549050
Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 549052 - Posted: 19 Apr 2007, 16:49:39 UTC

Actually, I believe it was the assimilators he mentioned, because mods were being made to the MSD for the RTPC, but your point is well taken in any event.

Alinator
ID: 549052
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 549053 - Posted: 19 Apr 2007, 16:51:57 UTC - in response to Message 549050.  

This is time for a reality check, folks.

The reason SETI has separate processes for different purposes (validators, schedulers, upload servers, download servers, etc.) is so that parts of the system can be shut down with very little impact on the project as a whole.

If the validator is shut down, work is queued, and when the validator comes back it catches up.

If the scheduler is down, work can still be uploaded, and reported later.

If the upload server is down, work can queue on the BOINC clients.

This is not intended to be a real-time system. It usually is near-real-time, but when things slow down it isn't an automatic crisis.

In other news: Matt explained why the validators are down. Has to do with the real-time-persistency-checker we're all interested in -- the code that will tell us if the signals we've found are in the same place over, and over, and over.


Nice points, thanks. I was thinking about it in terms of hardware fault tolerance, but you put it more nicely. The whole "real time persistency checker" issue sends chills down my spine; I vote that it takes precedence ....

Jason

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 549053
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 549104 - Posted: 19 Apr 2007, 18:09:10 UTC - in response to Message 549053.  
Last modified: 19 Apr 2007, 18:09:42 UTC



Nice points, thanks. I was thinking about it in terms of hardware fault tolerance, but you put it more nicely. The whole "real time persistency checker" issue sends chills down my spine; I vote that it takes precedence ....

Jason

Jason,

The design concepts behind BOINC start with the idea that fault tolerance and redundancy are expensive.

We saw this with Classic: the classic screensaver downloaded one work unit, did it, and uploaded. If SETI was down, it waited.

... and if you're someone like Amazon or Google, you address this by spreading servers across many datacenters, you have multiple connections to the net, etc.

At a minimum, SETI would buy two or three connections to the net, buy a bunch more servers, and bring them on campus by diverse paths. Expensive.

The alternative is to build a BOINC client that tolerates outages. By doing that SETI can avoid things like hot-failover, redundant connections, and the crunching keeps going. If things are down for a few hours, it's no big deal.
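
As an illustration of what "tolerates outages" can look like on the client side, here is a hedged Python sketch of retrying with exponential backoff and jitter. It is not the actual BOINC client's retry policy; the function names, the 60-second base delay and the 4-hour cap are assumptions picked only for the example.

```python
import random


def backoff_delays(max_retries, base_seconds=60.0, cap_seconds=4 * 3600.0, rng=None):
    """Return one randomized delay per retry, doubling up to a cap.

    Each delay is drawn uniformly from the upper half of the current
    ceiling, so many clients do not all retry at the same instant.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap_seconds, base_seconds * 2 ** attempt)
        delays.append(rng.uniform(ceiling / 2, ceiling))
    return delays


def upload_with_retry(try_upload, max_retries=10, sleep=lambda seconds: None, rng=None):
    """Call try_upload() until it returns True.

    Returns the number of attempts used, or None if every attempt failed.
    The default `sleep` does nothing so the sketch runs instantly; pass
    time.sleep to wait for real.
    """
    for attempt, delay in enumerate(backoff_delays(max_retries, rng=rng), start=1):
        if try_upload():
            return attempt
        sleep(delay)
    return None
```

The point of the design is that the client needs no knowledge of why the server is unreachable: it waits longer after each failure, so a server coming back from an outage is not hit by every host at once.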

It means that SETI can be done on a shoestring budget. It means that we can actually have BOINC based projects run by individuals out of their own pockets.

... and that's very cool.

We need to look at SETI stability and let go of the "Amazon" model. If I want to buy something, and Amazon can't take my order, someone else will. If the BOINC servers are down, the BOINC client will deal with it later.

We need to think of 90% uptime as a reasonable goal.
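
For a sense of scale, the downtime that a given uptime target allows is simple arithmetic. The helper below is only an illustrative sketch; the function name and constants are made up for the example.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 365 * 24


def downtime_hours(uptime_fraction: float, period_hours: float) -> float:
    """Hours of allowed downtime in a period for a given uptime fraction."""
    if not 0.0 <= uptime_fraction <= 1.0:
        raise ValueError("uptime_fraction must be between 0 and 1")
    return (1.0 - uptime_fraction) * period_hours


per_day = downtime_hours(0.90, HOURS_PER_DAY)
per_year_days = downtime_hours(0.90, HOURS_PER_YEAR) / HOURS_PER_DAY
```

At 90% uptime that works out to roughly 2.4 hours of downtime per day, or about 36.5 days per year.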

-- Ned
ID: 549104
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 549200 - Posted: 19 Apr 2007, 21:01:20 UTC - in response to Message 549104.  
Last modified: 19 Apr 2007, 21:03:58 UTC

We need to think of 90% uptime as a reasonable goal.


Can't see any reason to argue with that. One approach the BOINC developers may or may not have thought of is "minimal feedback" from the system. Okay, there are those of us who like to look at our credit graphs etc.; that's important and should stay, IMO. But the BOINC logs tend to induce panic, even though BOINC handles most stuff (pretty much everything) well on its own.

One small example: when I flicked to the messages pane, although everything was running smoothly, there was a bold red line indicating that a certain workunit file could not be deleted. BOINC, in its fault-tolerant manner, handled it fine and removed the file later. Of course, I frantically rechecked that my antivirus wasn't locking BOINC folder files etc. Not necessary; the log made me do it :D

If BOINC can't upload a workunit, it tries again later. Do we really need to see the verbose logs? I guess this question is partially answered by the introduction of the simple interface. Do we need to ride the "suspend network activity" button? Probably not, but it's there. Have option, will fiddle.

Jason

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 549200
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 549267 - Posted: 19 Apr 2007, 23:03:18 UTC - in response to Message 549104.  

We need to think of 90% uptime as a reasonable goal.

There are some projects where I could wish for 5% uptime...


BOINC WIKI
ID: 549267


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.