Splitter off? WHY?

Josh Linscott

Joined: 10 Sep 03
Posts: 121
Credit: 32,250
RAC: 0
United States
Message 89969 - Posted: 23 Mar 2005, 21:10:43 UTC

Server status
Program                 Host      Status
data-driven web pages   klaatu    Running
scheduler               kryten    Running
feeder                  kryten    Running
file_deleter            koloth    Running
transitioner1           klaatu    Running
transitioner2           klaatu    Running
transitioner3           koloth    Running
transitioner4           koloth    Running
sah_validate1           koloth    Running
sah_validate2           koloth    Running
sah_validate3           kosh      Running
sah_validate4           kosh      Running
sah_assimilator         galileo   Running
sah_splitter            galileo   Running
sah_splitter2           milkyway  Not running
sah_splitter3           philmor   Running

ID: 89969 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 89978 - Posted: 23 Mar 2005, 21:21:33 UTC - in response to Message 89969.  

From the very same page you saw that told you the splitters were off:

Sah_splitter: Reads tapes (or tape images on disk) containing raw telescope data and creates workunits for the BOINC/SETI@home clients. At least one needs to be running to produce work, and that's usually enough. If some are not running, it's probably because we are doing fine with less, and are keeping excess production down to prevent the upload/download disks from getting hit too hard.

> Server status
> Program Host Status
> data-driven web pages klaatu Running
> scheduler kryten Running
> feeder kryten Running
> file_deleter koloth Running
> transitioner1 klaatu Running
> transitioner2 klaatu Running
> transitioner3 koloth Running
> transitioner4 koloth Running
> sah_validate1 koloth Running
> sah_validate2 koloth Running
> sah_validate3 kosh Running
> sah_validate4 kosh Running
> sah_assimilator galileo Running
> sah_splitter galileo Running
> sah_splitter2 milkyway Not running
> sah_splitter3 philmor Running
>
>
ID: 89978 · Report as offensive
Profile MikeSW17
Volunteer tester

Joined: 3 Apr 99
Posts: 1603
Credit: 2,700,523
RAC: 0
United Kingdom
Message 90016 - Posted: 23 Mar 2005, 22:42:15 UTC - in response to Message 89969.  
Last modified: 23 Mar 2005, 22:47:00 UTC

> Server status
> Program Host Status
> data-driven web pages klaatu Running
> sah_splitter2 milkyway Not running
> sah_splitter3 philmor Running
>
>

From the same page....

The splitters read tapes of approximately 33000 blocks. milkyway is at block 33070 (or so).... It's reached the end of the tape and is waiting for another....

(Educated guess :-) )

/edit
Oops wasn't that :( [Re-education required]- it's started running again and moved on to
13ja05ab milkyway 33121

Must have hung/stopped somehow and it's been kicked-off again.

It will still reach end of tape soonish....

ID: 90016 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 90021 - Posted: 23 Mar 2005, 22:50:58 UTC - in response to Message 90016.  

> > Server status
> > Program Host Status
> > data-driven web pages klaatu Running
> > sah_splitter2 milkyway Not running
> > sah_splitter3 philmor Running
> >
> >
>
> From the same page....
>
> The splitters read tapes of approximately 33000 blocks. milkyway is at block
> 33070 (or so).... It's reached the end of the tape and waiting for
> another....
>
> (Educated guess :-) )

Actually, the same page says that they don't run all the splitters all the time, they turn them off and on to stay a little bit ahead of the demand for work.
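
To illustrate that kind of throttling, here is a minimal sketch. It is not the project's actual code: the function name, the idea of a "ready workunits" count, and the low/high water marks are all assumptions made up for the example. The only thing taken from the status page is that at least one splitter has to stay running.

```python
def splitters_wanted(ready_workunits: int, running: int, available: int,
                     low_water: int, high_water: int) -> int:
    """Return how many splitters should run next (hypothetical policy).

    ready_workunits: workunits already split and waiting to be sent (assumed metric)
    running:         splitters currently running
    available:       splitters that could be run at most
    low_water/high_water: made-up thresholds for the queue size
    """
    if ready_workunits < low_water:
        wanted = running + 1      # falling behind demand: start one more
    elif ready_workunits > high_water:
        wanted = running - 1      # well ahead of demand: stop one
    else:
        wanted = running          # comfortably ahead: leave things alone
    # Never drop below one splitter, never exceed what exists.
    return max(1, min(available, wanted))
```

With thresholds like these, a "Not running" splitter on the status page would simply mean the queue is above the high-water mark, not that something is broken.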
ID: 90021 · Report as offensive
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 90024 - Posted: 23 Mar 2005, 22:54:36 UTC
Last modified: 23 Mar 2005, 22:54:51 UTC

/tmp filled up on milkyway. Had to stop splitter to clear it up. Then restarted it. Really no big deal.

FYI, we don't need to "switch" tapes. The tape images are already read onto disk in advance and the splitter finishes one file and quickly starts chewing on the next unread file.
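
For the curious, the two points above (scratch space running out, and moving straight on to the next unread file) can be sketched roughly like this. This is only an illustration under assumptions: the function names, the idea of tracking already-split images in a set, and the free-space threshold are hypothetical, not how the real splitter is written. `shutil.disk_usage` is the standard-library call for checking free space.

```python
import shutil
from typing import Iterable, Optional, Set


def has_scratch_space(path: str, min_free_bytes: int) -> bool:
    """True if the filesystem holding `path` has at least min_free_bytes free."""
    return shutil.disk_usage(path).free >= min_free_bytes


def next_unread_image(images: Iterable[str], already_split: Set[str]) -> Optional[str]:
    """Return the first tape image not yet split, or None if all are done.

    `images` is the list of tape-image names on disk, in the order they
    should be processed; `already_split` is a hypothetical record of
    finished images.
    """
    for name in images:
        if name not in already_split:
            return name
    return None
```

A splitter loop built on these would check `has_scratch_space("/tmp", ...)` before each file and pause instead of failing when the check comes back false.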

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 90024 · Report as offensive
Profile MikeSW17
Volunteer tester

Joined: 3 Apr 99
Posts: 1603
Credit: 2,700,523
RAC: 0
United Kingdom
Message 90029 - Posted: 23 Mar 2005, 23:02:40 UTC - in response to Message 90024.  

> /tmp filled up on milkyway. Had to stop splitter to clear it up. Then
> restarted it. Really no big deal.
>
> FYI, we don't need to "switch" tapes. The tape images are already read onto
> disk in advance and the splitter finishes one file and quickly starts chewing
> on the next unread file.
>
> - Matt
>

Learn something new every day, Thanks.

BTW, on the classic page it shows about 9 tapes currently being split. Does this mean that classic has 9 splitter processes, or is that therefore a list of the disk files being/waiting to be processed?

Put another way, when classic shuts down, will BOINC need/have more splitters?
[Assumes some/most of the classic hardware is earmarked for transfer to BOINC?]

ID: 90029 · Report as offensive
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 90039 - Posted: 23 Mar 2005, 23:14:32 UTC

The 9 tapes on the classic status page refer to the disk images on line that are ready to be split (or have been split, they just haven't been deleted yet).

The machines used to split workunits for classic are the same as BOINC - so they are competing for CPU cycles (and internal bandwidth for that matter). So, in essence, we will gain more splitter power when classic gets shut off. But.. we will want to add more CPUs to the fray soon enough.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 90039 · Report as offensive
AdRoc

Joined: 3 Apr 99
Posts: 5
Credit: 10,443
RAC: 0
United States
Message 90043 - Posted: 23 Mar 2005, 23:27:51 UTC

Why aren't the tapes split in order? For example, I see that there are tapes from as far back as Dec 19th, and then the dates seem to jump around non-sequentially.

19 December 2004 (ab)
25 December 2004 (aa)
27 December 2004 (aa)
27 December 2004 (ab)
11 January 2005 (ab)
12 January 2005 (aa)
13 January 2005 (ab)
16 January 2005 (aa)
18 January 2005 (aa)


Why not split like below?

19 December 2004 (aa)
20 December 2004 (aa)
21 December 2004 (aa)
22 December 2004 (aa)
.
.
.


Just curious.


ID: 90043 · Report as offensive
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 90050 - Posted: 23 Mar 2005, 23:49:16 UTC - in response to Message 90043.  

> Why aren't the tapes split in order

There is a little rhyme to this reason. I have boxes of tapes (about half a year's worth of backlogged data) next to my desk. If the splitter stacker needs to be filled again, I grab 9 tapes at semi-random, do some preliminary data checking, and then throw them in the stacker.

BUT! If I notice I have a bunch of tapes all recorded on sequential days, I try to mix it up a bit. Why? Well.. if one of the tapes has a bunch of RFI on it, there's a good chance the tape on the next day has some RFI as well. RFI (as in radio frequency interference) tends to make workunits full of junk, so they only get crunched on a few minutes before the client throws them back at us. This causes an increase in our server activity. Not much, but still an increase. As well, it makes our users upset.

So mixing up the tapes chronologically vastly reduces the chance that we'll be making RFI workunits for days on end. Instead we'll send out healthy workunits between the unhealthy ones, so we won't be DOS'ing ourselves forever and users won't be stuck chewing on several bad workunits in a row.
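
If you wanted a computer to do the same mixing, a minimal sketch might look like the following. To be clear, the real process described above is done by hand; this code, its function names, the "at least 2 days apart" rule, and the retry count are all assumptions for illustration. It shuffles the tapes and keeps the ordering with the fewest neighbours recorded on the same or consecutive days.

```python
import random
from datetime import date
from typing import List, Optional, Sequence


def close_pairs(order: Sequence[date], min_gap_days: int = 2) -> int:
    """Count neighbouring tapes recorded fewer than min_gap_days apart."""
    return sum(1 for a, b in zip(order, order[1:])
               if abs((b - a).days) < min_gap_days)


def mix_tapes(dates: Sequence[date], min_gap_days: int = 2,
              attempts: int = 1000,
              rng: Optional[random.Random] = None) -> List[date]:
    """Shuffle tapes, keeping the ordering with the fewest close neighbours."""
    rng = rng or random.Random()
    best = list(dates)
    best_bad = close_pairs(best, min_gap_days)
    for _ in range(attempts):
        if best_bad == 0:
            break                      # nothing left to improve
        candidate = list(dates)
        rng.shuffle(candidate)
        bad = close_pairs(candidate, min_gap_days)
        if bad < best_bad:
            best, best_bad = candidate, bad
    return best


# The nine recording dates listed earlier in the thread.
tapes = [date(2004, 12, 19), date(2004, 12, 25), date(2004, 12, 27),
         date(2004, 12, 27), date(2005, 1, 11), date(2005, 1, 12),
         date(2005, 1, 13), date(2005, 1, 16), date(2005, 1, 18)]
mixed = mix_tapes(tapes, rng=random.Random(0))
```

Random retries are a crude choice, but for a stacker of 9 tapes they are plenty; nothing here tries to detect RFI itself, it only spreads out tapes that are likely to share it.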

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 90050 · Report as offensive
Dave Mickey

Joined: 19 Oct 99
Posts: 178
Credit: 11,122,965
RAC: 0
United States
Message 90063 - Posted: 24 Mar 2005, 0:21:31 UTC

Ahhhh, very good!

I take 2 good points from this:

1) That the sah "machine" that we might tend to imagine is in
fact composed of such things as Matt grabbing a bunch of
tapes from a box by his desk, pretty much at random

2) There is, as yet, value in having humans use their brains
in deciding what the computers should do...... getting too
many RFI-laced units in a row is indeed a bummer after very
carefully adjusting cache levels.

Carry (and crunch) on!

Dave
ID: 90063 · Report as offensive
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 90087 - Posted: 24 Mar 2005, 1:16:22 UTC

Funny thing..

Turns out the splitters on milkyway and philmor haven't been working since we came up from the power outage yesterday. So we've actually been getting by working with just one splitter (on galileo).

So.. we're going to have to resplit some tapes. No big deal - I still have yet to figure out what the fallout will be, but the worst is that users will be getting spurious "file not found" type errors when trying to download work until this clears up, or users will get workunits with no work in them and return them right away. Either case amounts to more server load for a while, but that's about it. Shouldn't mess with credit or science. At any rate, *sigh*.

I'm posting this here just so people hopefully won't freak out if they see tapes being resplit on the server pages. Maybe I'll cough up a tech news item once I have a better idea what's going to happen.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 90087 · Report as offensive
Josh Linscott

Joined: 10 Sep 03
Posts: 121
Credit: 32,250
RAC: 0
United States
Message 90116 - Posted: 24 Mar 2005, 2:06:23 UTC

Ok, thank you for the info!
ID: 90116 · Report as offensive
Alan M. MacRobert

Joined: 10 Apr 99
Posts: 13
Credit: 4,385,900
RAC: 1
United States
Message 90373 - Posted: 24 Mar 2005, 14:45:52 UTC - in response to Message 90050.  


> I have boxes of tapes (about half a
> year's worth of backlogged data)...

Matt, thanks for the explanation. A question: If there's half a year of backlogged unsplit data, while S@H users are recrunching the same work units more times than is useful, why not split the tapes, clear out the backlog, and put more of us crunchers to good use? I've wondered about this for years. Are the splitters a bottleneck? Dave Anderson once said no -- but then what is?

Thanks again,

Alan MacRobert

ID: 90373 · Report as offensive
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 90412 - Posted: 24 Mar 2005, 16:22:35 UTC - in response to Message 90373.  
Last modified: 24 Mar 2005, 17:22:16 UTC

> Matt, thanks for the explanation. A question: If there's half a year of
> backlogged unsplit data, while S@H users are recrunching the same work units
> more times than is useful, why not split the tapes, clear out the backlog, and
> put more of us crunchers to good use? I've wondered about this for years. Are
> the splitters a bottleneck? Dave Anderson once said no -- but then what is?

Good question. Like all things around here, there is a tangled web of dependencies that is usually too difficult to fully explain. But here's the gist of it.

First and foremost, the data recorder down at Arecibo is working 24/7. In case you haven't noticed, our data servers are not. Every minute our servers are off and people can't get data, that's one minute we fall behind on splitting. Over the years we've been off for many, many days (last weekend for 3, that time those people stole copper wire down the hill and broke our internet connection for a week, not to mention weekly 30-60 outages for maintenance - it all adds up).

Of course, there are many times we need to do database repair. These outages, to fix broken indexes or corrupt tables or simply add rows to tables, etc., result in a "behind the scenes" outage. All the data servers are up serving data, but we are unable to split data - so we fall behind. Yes, this means users get overly redundant data at this time - we decided more people would rather have data to chew on than another full outage which typically results in message board posts of the sort "Berkeley is down again - those idiots have no idea what they are doing, blah blah blah".

There are many other pieces to the puzzle. For instance, there's a "garbage collector" in SETI classic which fails from time to time (doesn't restart after a power outage, for example), which causes a backlog in the workunit queue and therefore leaves the splitter idle until the queue clears.

Even if things are going well, demand in SETI@home classic is so high that we can't split fast enough, which means the redundancy level is more than scientifically necessary, but more so means that our servers in general are hit with a greater load - more results coming in that need to be validated and shoveled into the database - and if the database is busy the splitters might run slower, etc. This all gets fixed in BOINC - we only send out as much work as we need to, and beyond that we encourage users to run other projects to keep their CPUs busy.
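
As a back-of-the-envelope illustration of why redundancy drives server load, here is a tiny sketch. The function names and the notion of a fixed "target" number of copies per workunit are assumptions for the example, not a description of the actual scheduler.

```python
def results_to_validate(workunits: int, copies_per_workunit: int) -> int:
    """Every copy sent out comes back as a result the servers must handle."""
    return workunits * copies_per_workunit


def copies_still_needed(copies_sent: int, target_copies: int) -> int:
    """How many more copies to send if we stop at a target (never negative)."""
    return max(0, target_copies - copies_sent)
```

The point is only that incoming results scale linearly with the number of copies per workunit, so capping copies at a target caps the validation and database load as well.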

DESPITE ALL THIS there have been many times during the course of SETI@home when every system was working properly and we caught up on splitting. As well, there are times when Arecibo is not recording, or the data is so littered with RFI that we don't split it. So it's all a game of give and take. An important fact to remember (or learn) is that we started recording SETI@home data in December of 1998, and the project started in May 1999, so we had a six month backlog at the onset.

Okay.. I better finish my breakfast and get up to the lab.

- Matt


-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 90412 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34257
Credit: 79,922,639
RAC: 80
Germany
Message 90456 - Posted: 24 Mar 2005, 18:04:19 UTC

Hi

Thanks for this information Matt.

greetz from Germany Mike



With each crime and every kindness we birth our future.
ID: 90456 · Report as offensive
Profile D.J. Schweitz
Volunteer tester
Joined: 29 Oct 02
Posts: 157
Credit: 871,078
RAC: 0
United States
Message 90478 - Posted: 24 Mar 2005, 19:15:30 UTC

Matt, a question for you from someone on my team. Figured since you were already doing such a great job of answering other WU questions I would jump in.

Member has 130 classic WU's left to crunch, and wonders about the "ethics" of dumping them to un-install classic and install BOINC. I've heard that some classic WU's get sent out as many as 40 times, whereas with BOINC it's more like 5 or 6.

Will his dumping classic WU's compromise the science of those WU's?

Are those same classic WU's sent out as BOINC WU's?

Did you enjoy your breakfast?


ID: 90478 · Report as offensive
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 90488 - Posted: 24 Mar 2005, 19:28:40 UTC - in response to Message 90478.  

> Will his dumping classic WU's compromise the science of those WU's?
> Are those same classic WU's sent out as BOINC WU's?

At this point in time (and for a while) the workunits sent to classic and BOINC have been the same (good for validation that classic and BOINC return the same scientific results). So there is no harm in science for dumping classic workunits. Hope that encourages your team member(s) to move to BOINC sooner.

> Did you enjoy your breakfast?

It was adequate. Will have Chinese food for lunch, and probably something light for dinner so I'm not burping into the microphone at my gig tonight.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 90488 · Report as offensive
Profile D.J. Schweitz
Volunteer tester
Joined: 29 Oct 02
Posts: 157
Credit: 871,078
RAC: 0
United States
Message 90498 - Posted: 24 Mar 2005, 19:48:27 UTC

> > Will his dumping classic WU's compromise the science of those WU's?
> > Are those same classic WU's sent out as BOINC WU's?
>
> At this point in time (and for a while) the workunits sent to classic and BOINC have been the same (good for validation that classic and BOINC return the same scientific results). So there is no harm in science for dumping classic workunits. Hope that encourages your team member(s) to move to BOINC sooner.
>
> > Did you enjoy your breakfast?
>
> It was adequate. Will have Chinese food for lunch, and probably something light for dinner so I'm not burping into the microphone at my gig tonight.

Thanks for the quick reply Matt, am headed out for sushi myself :-P
ID: 90498 · Report as offensive



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.