Penguin doing too much?

Anigel
Volunteer tester
Joined: 5 Dec 99
Posts: 101
Credit: 643,544
RAC: 0
United Kingdom
Message 149610 - Posted: 10 Aug 2005, 7:55:59 UTC

Is it just me, or does it look like there is too much reliance on penguin at the moment in the server status list?

Whilst penguin is the same spec as kosh, kosh is apparently being used only to run the 5th and 6th instances of the transitioner.

Penguin, by comparison, is currently running:
file_deleter1
file_deleter2
file_deleter3
file_deleter4
transitioner1
transitioner2
sah_validate1
sah_validate2
sah_validate3
sah_validate4
sah_assimilator1
sah_assimilator2
sah_assimilator3
sah_assimilator4

This means that penguin is the only validator, so with all this other load on the box it is hardly surprising that we have a validator backlog of well over half a million units.

Not knowing the full network situation, it would appear to make sense either to run 2 instances of the validator on kosh to reduce the validator backlog, or to move some of these other instances off penguin so it has more capacity to work through that backlog.

Part of Teamseti
For SetiBoinc status graphs visit Teamseti status graphs
ID: 149610 · Report as offensive
Harry.nl
Joined: 21 Apr 03
Posts: 53
Credit: 67,821
RAC: 0
Netherlands
Message 149614 - Posted: 10 Aug 2005, 8:20:26 UTC

They are probably rearranging the available server power.
ID: 149614 · Report as offensive
Swibby Bear

Joined: 1 Aug 01
Posts: 246
Credit: 7,945,093
RAC: 0
United States
Message 149665 - Posted: 10 Aug 2005, 12:21:42 UTC

Yesterday, Matt answered this question --


Message 149135 - Posted 9 Aug 2005 16:39:45 UTC - in response to Message ID 149132.


It looks like "Penguin" could really use some help. To have one kinda small machine working 12 major processes seems silly. "Penguin" should be a 6-CPU with lots of RAM to handle all that.


Oddly enough, penguin is probably the machine with the lightest load and least memory being used at this point. Of course, bottlenecks abound, and adding more to this machine wouldn't really solve much.

- Matt
ID: 149665 · Report as offensive
tekwyzrd
Volunteer tester
Joined: 21 Nov 01
Posts: 767
Credit: 30,009
RAC: 0
United States
Message 149788 - Posted: 10 Aug 2005, 23:26:18 UTC

The only obvious difference I see is the addition of six deleters on penguin. I don't see how that's going to help with the backlog of unvalidated results but they're the experts.








Nothing travels faster than the speed of light with the possible exception of bad news, which obeys its own special laws.
Douglas Adams (1952 - 2001)
ID: 149788 · Report as offensive
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20291
Credit: 7,508,002
RAC: 20
United Kingdom
Message 149796 - Posted: 10 Aug 2005, 23:39:07 UTC - in response to Message 149788.  

The only obvious difference I see is the addition of six deleters on penguin. I don't see how that's going to help with the backlog of unvalidated results but they're the experts.

If you can junk all the unneeded files, the file system can then access the remaining files more quickly.

If penguin is lightly loaded by its processes, then you can add more processes to push the loading harder elsewhere...

Regards,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 149796 · Report as offensive
tekwyzrd
Volunteer tester
Joined: 21 Nov 01
Posts: 767
Credit: 30,009
RAC: 0
United States
Message 149798 - Posted: 10 Aug 2005, 23:43:22 UTC - in response to Message 149796.  

The only obvious difference I see is the addition of six deleters on penguin. I don't see how that's going to help with the backlog of unvalidated results but they're the experts.

If you can junk all the unneeded files, the file system can then access the remaining files more quickly.

If penguin is lightly loaded by its processes, then you can add more processes to push the loading harder elsewhere...

Regards,
Martin



Well, if it's so lightly loaded why not add a couple of validators as well?

Nothing travels faster than the speed of light with the possible exception of bad news, which obeys its own special laws.
Douglas Adams (1952 - 2001)
ID: 149798 · Report as offensive
Anigel
Volunteer tester
Joined: 5 Dec 99
Posts: 101
Credit: 643,544
RAC: 0
United Kingdom
Message 149801 - Posted: 10 Aug 2005, 23:46:41 UTC - in response to Message 149665.  


Oddly enough, penguin is probably the machine with the lightest load and least memory being used at this point. Of course, bottlenecks abound, and adding more to this machine wouldn't really solve much.

- Matt



It just seems strange that the only machine doing any validation, with a queue currently in excess of 630,000 items to get through, should be the most lightly loaded machine when it is theoretically doing the same work as kosh plus all the other extra tasks.

I don't understand what type of bottleneck would allow kosh to function effectively with just 2 transitioner instances while penguin (same spec) operates at less load doing the same plus another 18 tasks (counting the 6 extra deleter instances added on penguin).

Either it is not working very effectively as a transitioner, or the other processes are working at such a low priority that they hardly ever get any CPU cycles.


Part of Teamseti
For SetiBoinc status graphs visit Teamseti status graphs
ID: 149801 · Report as offensive
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20291
Credit: 7,508,002
RAC: 20
United Kingdom
Message 149803 - Posted: 10 Aug 2005, 23:58:34 UTC - in response to Message 149801.  
Last modified: 10 Aug 2005, 23:59:53 UTC

It just seems strange that the only machine doing any validation, with a queue currently in excess of 630,000 items to get through, should be the most lightly loaded machine when it is theoretically doing the same work as kosh plus all the other extra tasks.

That all depends on how complicated the tasks are!

The real bottlenecks are likely the database hardware and the fileserver hardware and possibly the ethernet connections into them.

A wild guess is that all this is more a systems and efficiency balancing problem for the hardware that they have available.

If they have their file systems crippled with too many files per directory, then literally everything will inexorably get dragged down to a halt...
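As an illustration of the "too many files per directory" point, here is a minimal Python sketch of one common way to keep directories small: hash each filename into one of a fixed number of subdirectories. This is a generic technique, not a description of what the project's servers actually do. The function name `fanout_path`, the `upload` root, the `.sah` suffix, and the fan-out of 256 are all assumptions made up for the example.

```python
import hashlib
from collections import Counter
from pathlib import PurePosixPath


def fanout_path(root: str, filename: str, fanout: int = 1024) -> PurePosixPath:
    """Map a filename to root/<bucket>/<filename>.

    The bucket is derived from a hash of the filename, so files spread
    roughly evenly over `fanout` subdirectories instead of piling up
    in a single huge directory. (Hypothetical helper for illustration.)
    """
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % fanout
    return PurePosixPath(root) / format(bucket, "x") / filename


if __name__ == "__main__":
    # 10,000 made-up result files spread over at most 256 subdirectories.
    names = [f"result_{i}.sah" for i in range(10_000)]
    per_dir = Counter(fanout_path("upload", n, fanout=256).parent.name for n in names)
    print(len(per_dir), "directories, largest holds", max(per_dir.values()), "files")
```

Because the bucket depends only on the filename, any process can recompute where a file lives without scanning a directory listing.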

I'm sure they're working through how best to get things working better.


It's a bit like tweaking the air/fuel/EMCU mixtures on a high-performance engine whilst trying to convert from petrol to diesel and trying not to miss a single stroke!

Regards,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 149803 · Report as offensive
Anigel
Volunteer tester
Joined: 5 Dec 99
Posts: 101
Credit: 643,544
RAC: 0
United Kingdom
Message 149807 - Posted: 11 Aug 2005, 0:08:32 UTC - in response to Message 149803.  


A wild guess is that all this is more a systems and efficiency balancing problem for the hardware that they have available.


Quite possibly.

From a technical viewpoint, though, I don't understand how two machines can be doing the same tasks, one of them then gets another 18 tasks added, and it ends up under less load than the one that is doing just 2 of those 20 tasks.

The only way I can envisage it is if the 2 instances of the transitioner are more intensive than all the other tasks added together and the 2 transitioner instances running on penguin are throttled right back.



Part of Teamseti
For SetiBoinc status graphs visit Teamseti status graphs
ID: 149807 · Report as offensive
htrae
Volunteer tester
Joined: 3 Apr 99
Posts: 241
Credit: 768,379
RAC: 0
Canada
Message 149812 - Posted: 11 Aug 2005, 0:19:47 UTC

Things are really getting changed around on the Server Status Page now...!!!.

Also a new process......db_purge.

http://setiathome.berkeley.edu/sah_status.html


ID: 149812 · Report as offensive
Anigel
Volunteer tester
Joined: 5 Dec 99
Posts: 101
Credit: 643,544
RAC: 0
United Kingdom
Message 149819 - Posted: 11 Aug 2005, 0:30:15 UTC - in response to Message 149812.  

Things are really getting changed around on the Server Status Page now...!!!.

Also a new process......db_purge.



I see the additional 6 file_deleter instances have gone from penguin as well: back to just the 4 again, instead of the 10 after the outage.
Part of Teamseti
For SetiBoinc status graphs visit Teamseti status graphs
ID: 149819 · Report as offensive
tekwyzrd
Volunteer tester
Joined: 21 Nov 01
Posts: 767
Credit: 30,009
RAC: 0
United States
Message 149820 - Posted: 11 Aug 2005, 0:34:39 UTC

The number of results waiting for validation is up by almost 8500.



Nothing travels faster than the speed of light with the possible exception of bad news, which obeys its own special laws.
Douglas Adams (1952 - 2001)
ID: 149820 · Report as offensive
Ingleside
Volunteer developer

Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 149821 - Posted: 11 Aug 2005, 0:40:56 UTC - in response to Message 149807.  

Quite possibly.

From a technical viewpoint, though, I don't understand how two machines can be doing the same tasks, one of them then gets another 18 tasks added, and it ends up under less load than the one that is doing just 2 of those 20 tasks.

The only way I can envisage it is if the 2 instances of the transitioner are more intensive than all the other tasks added together and the 2 transitioner instances running on penguin are throttled right back.


You seem to forget that Kosh also acts as splitter. ;)

As for db_purge, this process hasn't been listed before, but it has been around for many months. db_purge is responsible for archiving and removing old "done" WUs from the database, to keep it small. Also, it doesn't need to run continuously.
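To make the "archive, then remove" idea concrete, here is a small Python sketch using an in-memory SQLite database. It is only an illustration of the general pattern and says nothing about how the real db_purge is written. The table names `workunit` and `workunit_archive`, the columns `done` and `finished_at`, and the function `purge_done_workunits` are all hypothetical.

```python
import sqlite3
from typing import Tuple


def purge_done_workunits(conn: sqlite3.Connection, cutoff: int) -> int:
    """Archive finished workunits older than `cutoff`, then delete them.

    Both statements run in one transaction, so a row is never deleted
    without having been copied to the archive table first.
    """
    with conn:
        conn.execute(
            "INSERT INTO workunit_archive "
            "SELECT * FROM workunit WHERE done = 1 AND finished_at < ?",
            (cutoff,),
        )
        cur = conn.execute(
            "DELETE FROM workunit WHERE done = 1 AND finished_at < ?",
            (cutoff,),
        )
    return cur.rowcount


def demo() -> Tuple[int, int, int]:
    """Build a tiny made-up table and purge it once."""
    conn = sqlite3.connect(":memory:")
    schema = "(id INTEGER PRIMARY KEY, done INTEGER, finished_at INTEGER)"
    conn.execute("CREATE TABLE workunit " + schema)
    conn.execute("CREATE TABLE workunit_archive " + schema)
    conn.executemany(
        "INSERT INTO workunit VALUES (?, ?, ?)",
        [(1, 1, 100), (2, 1, 900), (3, 0, 50), (4, 1, 200)],
    )
    purged = purge_done_workunits(conn, cutoff=500)
    live = conn.execute("SELECT COUNT(*) FROM workunit").fetchone()[0]
    archived = conn.execute("SELECT COUNT(*) FROM workunit_archive").fetchone()[0]
    return purged, live, archived
```

Since the purge only touches rows that are already done and old, running it periodically rather than continuously loses nothing, which fits the point above.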



ID: 149821 · Report as offensive
Don Erway
Volunteer tester

Joined: 18 May 99
Posts: 305
Credit: 471,946
RAC: 0
United States
Message 149822 - Posted: 11 Aug 2005, 0:42:54 UTC

I do not buy the answer that penguin is lightly loaded, so validation cannot go faster even with more instances, because the rest of the system is operating so flawlessly and is clearly not having trouble with access to files or databases.

To me, it is obvious that 1 or 2 of the splitter processes should be turned into validators, at least while the "awaiting validation" queue is growing continuously.

Do you just go on hoping something will change and the queue will suddenly start going down? It isn't going to happen without changing priorities. It will just keep growing and growing. A bunch of us have already deprioritized SETI crunching, but the back end still can't keep up.

In this day and age of BOINC and multiple projects, it is better to throttle WU production whenever there is a big queue elsewhere in the system, because that is the only place you have to do any throttling.

If you end up trickling out WUs, it will be at the maximum rate your back-end hardware can assimilate them, which is the rate you want them crunched at.
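The throttling idea can be sketched as a simple back-pressure loop in Python. This is a toy model with made-up numbers, not a description of any project code: `units_to_release`, `simulate`, the backlog limit, and the rates are all assumptions, and it pretends that every released unit comes back as a result awaiting validation within the same cycle.

```python
from typing import List


def units_to_release(backlog: int, backlog_limit: int, max_batch: int) -> int:
    """How many new work units to release this cycle.

    Nothing is released once the validation backlog has reached the
    limit; otherwise release up to the remaining headroom.
    """
    if backlog >= backlog_limit:
        return 0
    return min(max_batch, backlog_limit - backlog)


def simulate(cycles: int, validate_rate: int, max_batch: int, backlog_limit: int) -> List[int]:
    """Return the backlog after each cycle of release followed by validation."""
    backlog = 0
    history = []
    for _ in range(cycles):
        backlog += units_to_release(backlog, backlog_limit, max_batch)
        backlog -= min(backlog, validate_rate)  # the validator drains what it can
        history.append(backlog)
    return history


if __name__ == "__main__":
    print(simulate(cycles=6, validate_rate=100, max_batch=500, backlog_limit=1000))
```

In this model the backlog settles just under the limit and the release rate drops to the validation rate, which is the "trickling out at the rate the back end can assimilate" behaviour described above.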

Don



ID: 149822 · Report as offensive
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 149880 - Posted: 11 Aug 2005, 3:00:58 UTC - in response to Message 149822.  

I'm writing a tech note about all this. It'll be up in the next 24 hours. But yep, it's true penguin is barely doing anything. Database has plenty of throughput. Our "ancient" servers aren't even close to CPU/memory bound.

After some diagnosis today we pretty much confirmed our biggest current problem: Large directories. See - not so obvious. Stay tuned...

- Matt

I do not buy the answer that penguin is lightly loaded, so validation cannot go faster even with more instances, because the rest of the system is operating so flawlessly and is clearly not having trouble with access to files or databases.

To me, it is obvious that 1 or 2 of the splitter processes should be turned into validators, at least while the "awaiting validation" queue is growing continuously.


-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 149880 · Report as offensive
ampoliros
Volunteer tester
Joined: 24 Sep 99
Posts: 152
Credit: 3,542,579
RAC: 5
United States
Message 149965 - Posted: 11 Aug 2005, 4:36:07 UTC - in response to Message 149880.  

After some diagnosis today we pretty much confirmed our biggest current problem: Large directories.


So penguin isn't stressed by the processes themselves? It's the fact that it has to do all those lookups inside huge directories to do its work?

So splitting only creates files/directories, and deleting traverses directories and, well... deletes tagged ones.

But validation needs to (a) grab a completed record, (b) traverse the entire directory structure to find 2 or more (all) successfully (and unsuccessfully) completed instances of that work unit, (c) compare all 3, 4 or more results, (d) look up whether the work unit needs to remain open if there are unreturned results for that unit, and (e) mark the unit based on what needs to be done next. Only to have to come back to the same result later because only one, two, or three of four had been returned.

All this while other processes are traversing the same disks (to write or delete). Is that how it works? Seems to me that you would end up with a process that spends most of its time looking through directories if it works this way.
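The validation steps guessed at above can be sketched in Python as a generic quorum check. This is not the actual BOINC validator, just an illustration of the grouping and comparing logic. The names `Result` and `validate`, the quorum of 2, and the numeric tolerance are all assumptions for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Result:
    workunit_id: int
    host_id: int
    value: Optional[float]  # None: not returned yet, or returned as a failure


def validate(results: Iterable[Result], quorum: int = 2,
             tolerance: float = 1e-6) -> Dict[int, str]:
    """Group results by work unit and report a status for each unit.

    A unit is "valid" once at least `quorum` returned values agree
    within `tolerance`; otherwise it stays "pending" and has to be
    revisited when more results arrive.
    """
    by_wu = defaultdict(list)
    for r in results:
        by_wu[r.workunit_id].append(r)

    status = {}
    for wu_id, group in by_wu.items():
        values = [r.value for r in group if r.value is not None]
        # Size of the largest set of values that agree with some value.
        agreeing = max(
            (sum(1 for other in values if abs(other - v) <= tolerance)
             for v in values),
            default=0,
        )
        status[wu_id] = "valid" if agreeing >= quorum else "pending"
    return status
```

Here the grouping is a dictionary lookup keyed by work unit ID. If the same grouping had to be done by walking very large directories, the lookup step rather than the comparison would dominate, which is the concern raised in this post.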

Oh, and is it that the directories are too large or too deep? Either way, that's not going to be an easy fix (e.g. add more processes). Seems to me that it would mean an overhaul of the data storage structure (tree).

Feel free to ignore my questions if you are either (a) working on a solution, (b) on a hot date, or (c) tired of listening to us armchair quarterbacks. :) I'll just wait for the Tech News.

-Thanks

7,049 S@H Classic Credits
ID: 149965 · Report as offensive
EclipseHA

Joined: 28 Jul 99
Posts: 1018
Credit: 530,719
RAC: 0
United States
Message 150000 - Posted: 11 Aug 2005, 5:22:30 UTC

A suggestion to the UCB administrators.

On the "server status" page, why not include an "uptime" for each server? That way it can be seen how busy the server is.

This would eliminate this type of thread.
ID: 150000 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 150006 - Posted: 11 Aug 2005, 5:42:01 UTC - in response to Message 149965.  


Oh, and is it that the directories are too large or too deep? Either way, that's not going to be an easy fix (e.g. add more processes). Seems to me that it would mean an overhaul of the data storage structure (tree).

Feel free to ignore my questions if you are either (a) working on a solution, (b) on a hot date, or (c) tired of listening to us armchair quarterbacks. :) I'll just wait for the Tech News.

-Thanks

Left out (d) is playing a gig tonight.
ID: 150006 · Report as offensive
EclipseHA

Joined: 28 Jul 99
Posts: 1018
Credit: 530,719
RAC: 0
United States
Message 150015 - Posted: 11 Aug 2005, 5:53:42 UTC - in response to Message 150006.  


Oh, and is it that the directories are too large or too deep? Either way, that's not going to be an easy fix (e.g. add more processes). Seems to me that it would mean an overhaul of the data storage structure (tree).

Feel free to ignore my questions if you are either (a) working on a solution, (b) on a hot date, or (c) tired of listening to us armchair quarterbacks. :) I'll just wait for the Tech News.

-Thanks

Left out (d) is playing a gig tonight.


and (e) - got a pompom in my ear and another up my nose and make calls about things I know nothing about

ID: 150015 · Report as offensive
Anigel
Volunteer tester
Joined: 5 Dec 99
Posts: 101
Credit: 643,544
RAC: 0
United Kingdom
Message 150023 - Posted: 11 Aug 2005, 6:42:51 UTC
Last modified: 11 Aug 2005, 6:44:55 UTC

Ingleside, thanks for pointing that out. Somehow I had managed to miss the fact that kosh also acts as a splitter; that helps me make more sense of it.

azwoody, it depends what you mean by uptime. If you mean the full output of the *nix command uptime, which includes load averages, then yes, it would stop the question I asked from being asked. However, it may start questions like: "Well, if it's so lightly loaded, and so is x, y or z, then why not add an extra A instance to it?" If you just mean uptime as in the last reset date of the box, then it wouldn't really give any more insight into what was happening.
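For anyone curious what the load-average part of uptime looks like programmatically, here is a small Python sketch using `os.getloadavg()`, which is available on Unix-like systems. The helper `describe_load` and its "busier than the CPU count" rule of thumb are assumptions for illustration, not anything the status page does.

```python
import os
from typing import Tuple


def describe_load(load: Tuple[float, float, float], cpus: int) -> str:
    """Summarise 1, 5 and 15 minute load averages against the CPU count.

    Rule of thumb only: a 1 minute load above the number of CPUs
    suggests processes are queuing for the machine.
    """
    one, five, fifteen = load
    state = "busy" if one > cpus else "has headroom"
    return f"load {one:.2f} {five:.2f} {fifteen:.2f} on {cpus} CPU(s): {state}"


if __name__ == "__main__":
    # Same three numbers that the uptime command prints at the end of its line.
    print(describe_load(os.getloadavg(), os.cpu_count() or 1))
```

A line like this per server would show how busy each box is, though as this thread suggests, a lightly loaded machine can still sit behind a bottleneck elsewhere.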

Matt Lebofsky - It would be interesting to see that tech note when it is written. Thanks. I have to be honest and admit that I am interested in behind-the-scenes geeking / gawking and how things happen on the side of things us mere mortals do not get to see. Whilst we can look at source code and see what is being done, it is sometimes difficult to get a feel for things without any idea of the underlying database / storage capacity and scales. As a pro developer regularly dealing with millions of datasets at a time, it is easy to generalise without knowing the complexity of those files and datasets.

In general I think Ingleside's note about kosh also acting as a splitter, which I did not notice when starting this thread, has gone a long way to making my original question irrelevant, and it is easy to see why the load on this box is lower than on something that has to continuously chug through tapes and split / assemble / arrange and duplicate results based on tape-based data.

Thanks all, it's helped me better understand what is happening.

Sarah



Part of Teamseti
For SetiBoinc status graphs visit Teamseti status graphs
ID: 150023 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.