Penguin doing too much?
Author | Message |
---|---|
Anigel Joined: 5 Dec 99 Posts: 101 Credit: 643,544 RAC: 0 |
Is it just me, or is there too much reliance on penguin at the moment in the server status list? Penguin is the same spec as kosh, yet kosh is apparently only being used to run the 5th and 6th instances of the transitioner, whilst penguin is currently running file_deleter1, file_deleter2, file_deleter3, file_deleter4, transitioner1, transitioner2, sah_validate1, sah_validate2, sah_validate3, sah_validate4, sah_assimilator1, sah_assimilator2, sah_assimilator3 and sah_assimilator4. This means that penguin is the only validator, so with all this other load on the box it is hardly surprising that we have a validator backlog of well over half a million units. Not knowing the full network situation, it would appear to make sense either to run 2 instances of the validator on kosh to reduce the backlog, or to move some of the other instances off penguin so that it has more capacity to work through the backlog. Part of Teamseti For SetiBoinc status graphs visit Teamseti status graphs |
Harry.nl Joined: 21 Apr 03 Posts: 53 Credit: 67,821 RAC: 0 |
They are probably rearranging the available server power. |
Swibby Bear Joined: 1 Aug 01 Posts: 246 Credit: 7,945,093 RAC: 0 |
Yesterday, Matt answered this question -- Message 149135 - Posted 9 Aug 2005 16:39:45 UTC - in response to Message ID 149132. It looks like "Penguin" could really use some help. To have one kinda small machine working 12 major processes seems silly. "Penguin" should be a 6-CPU with lots of RAM to handle all that. Oddly enough, penguin is probably the machine with the lightest load and least memory being used at this point. Of course, bottlenecks abound and adding more to this machine wouldn't really solve much. - Matt |
tekwyzrd Joined: 21 Nov 01 Posts: 767 Credit: 30,009 RAC: 0 |
The only obvious difference I see is the addition of six deleters on penguin. I don't see how that's going to help with the backlog of unvalidated results but they're the experts. Nothing travels faster than the speed of light with the possible exception of bad news, which obeys its own special laws. Douglas Adams (1952 - 2001) |
ML1 Joined: 25 Nov 01 Posts: 20291 Credit: 7,508,002 RAC: 20 |
The only obvious difference I see is the addition of six deleters on penguin. I don't see how that's going to help with the backlog of unvalidated results but they're the experts. If you can junk all the unneeded files, the file system can then access the remaining files more quickly. If penguin is lightly loaded by its processes, then you can add more processes to push the loading harder elsewhere... Regards, Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
tekwyzrd Joined: 21 Nov 01 Posts: 767 Credit: 30,009 RAC: 0 |
The only obvious difference I see is the addition of six deleters on penguin. I don't see how that's going to help with the backlog of unvalidated results but they're the experts. Well, if it's so lightly loaded why not add a couple of validators as well? Nothing travels faster than the speed of light with the possible exception of bad news, which obeys its own special laws. Douglas Adams (1952 - 2001) |
Anigel Joined: 5 Dec 99 Posts: 101 Credit: 643,544 RAC: 0 |
It just seems strange that the only machine doing any validation, with a queue currently in excess of 630,000 items to get through, should be the most lightly loaded machine when it is theoretically doing the same work as kosh plus all the other extra tasks. I don't understand what type of bottleneck would allow kosh to function effectively with just 2 transitioner instances while penguin (same spec) operates at less load doing the same plus another 18 tasks (counting the 6 additional deleter instances on penguin). Either it is not working very effectively as a transitioner, or the other processes are running at such a low priority that they hardly ever get any CPU cycles. Part of Teamseti For SetiBoinc status graphs visit Teamseti status graphs |
ML1 Joined: 25 Nov 01 Posts: 20291 Credit: 7,508,002 RAC: 20 |
It just seems strange that the only machine doing any validation, with a queue currently in excess of 630,000 items to get through, should be the most lightly loaded machine when it is theoretically doing the same work as kosh plus all the other extra tasks. That all depends on how complicated the tasks are! The real bottlenecks are likely the database hardware and the fileserver hardware, and possibly the ethernet connections into them. A wild guess is that all this is more a systems and efficiency balancing problem for the hardware that they have available. If their file systems are crippled by too many files per directory, then literally everything will inexorably get dragged down to a halt... I'm sure they're working through how best to get things working better. It's a bit like tweaking the air/fuel/EMCU mixtures on a high performance engine whilst trying to convert from petrol to diesel and trying not to miss a single stroke! Regards, Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
Anigel Joined: 5 Dec 99 Posts: 101 Credit: 643,544 RAC: 0 |
Quite possibly. From a technical viewpoint, though, I don't understand how two machines can be doing the same tasks, one of them then has another 18 tasks added, and it still ends up under less load than the one doing just 2 of those 20 tasks. The only way I can envisage it is if 2 instances of the transitioner are more intensive than all the other tasks added together and the 2 transitioner instances running on penguin are throttled right back. Part of Teamseti For SetiBoinc status graphs visit Teamseti status graphs |
htrae Joined: 3 Apr 99 Posts: 241 Credit: 768,379 RAC: 0 |
Things are really getting changed around on the Server Status Page now...!!!. Also a new process......db_purge. http://setiathome.berkeley.edu/sah_status.html |
Anigel Joined: 5 Dec 99 Posts: 101 Credit: 643,544 RAC: 0 |
Things are really getting changed around on the Server Status Page now...!!!. I see the additional 6 file_deleter instances have also gone from penguin, so it is back to just the 4 again instead of the 10 it had after the outage. Part of Teamseti For SetiBoinc status graphs visit Teamseti status graphs |
tekwyzrd Joined: 21 Nov 01 Posts: 767 Credit: 30,009 RAC: 0 |
The number of results waiting for validation is up by almost 8500. Nothing travels faster than the speed of light with the possible exception of bad news, which obeys its own special laws. Douglas Adams (1952 - 2001) |
Ingleside Joined: 4 Feb 03 Posts: 1546 Credit: 15,832,022 RAC: 13 |
Quite possibly. You seem to forget that Kosh also acts as a splitter. ;) As for db_purge, this process hasn't been listed before, but it has been around for many months. Db_purge is responsible for archiving and removing old "done" WUs from the database, to keep it small. Also, it doesn't need to run continuously. |
Don Erway Joined: 18 May 99 Posts: 305 Credit: 471,946 RAC: 0 |
I do not buy the answer that penguin is lightly loaded and so validation cannot go faster, even with more instances, because the rest of the system is operating so flawlessly and is clearly not having trouble with access to files or databases. To me, it is obvious that 1 or 2 of the splitter processes should be turned into validators, at least while the "awaiting validation" queue is growing continuously. Do you just go on hoping something will change, and the queue will suddenly start going down? It isn't going to happen without changing priorities. It will just keep growing and growing. A bunch of us have already deprioritized SETI crunching, but the backend still can't keep up. In this day and age of BOINC and multiple projects, it is better to throttle the WU production whenever there is a big queue elsewhere in the system, because that is the only place you have to do any throttling. If you end up trickling out WUs, it will be at the maximum rate your back end hardware can assimilate them, which is the rate you want them crunched at. Don |
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
In response to Don Erway: "I do not buy the answer that penguin is lightly loaded, so validation cannot go faster, even with more instances. Because, the rest of the system is operating so flawlessly, and clearly are not having trouble with access to files or databases." I'm writing a tech note about all this. It'll be up in the next 24 hours. But yep, it's true penguin is barely doing anything. Database has plenty of throughput. Our "ancient" servers aren't even close to CPU/memory bound. After some diagnosis today we pretty much confirmed our biggest current problem: Large directories. See - not so obvious. Stay tuned... - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
ampoliros Joined: 24 Sep 99 Posts: 152 Credit: 3,542,579 RAC: 5 |
After some diagnosis today we pretty much confirmed our biggest current problem: Large directories. So penguin isn't stressed by the processes themselves? It's the fact that it has to do all those lookups inside huge directories to do its work? So splitting only creates files/directories, and deleting traverses directories and, well... deletes tagged ones. But validation needs to (a) grab a completed record, (b) traverse the entire directory structure to find 2 or more (all) successfully (and unsuccessfully) completed instances of that work unit, (c) compare all 3, 4 or more results, (d) look up whether the work unit needs to remain open if there are unreturned results for that unit, and (e) mark the unit based on what needs to be done next. Only to have to come back to the same result later because only one, two, or three of four had been returned. All this while other processes are traversing the same disks (to write or delete). Is that how it works? Seems to me that you would end up with a process that spends most of its time looking through directories if it works this way. Oh, and is it that the directories are too large or too deep? Either way, that's not going to be an easy fix (e.g. adding more processes). Seems to me that it would mean an overhaul of the data storage structure (tree). Feel free to ignore my questions if you are either (a) working on a solution, (b) on a hot date, or (c) tired of listening to us armchair quarterbacks. :) I'll just wait for the Tech News. -Thanks 7,049 S@H Classic Credits |
EclipseHA Joined: 28 Jul 99 Posts: 1018 Credit: 530,719 RAC: 0 |
A suggestion to the UCB administrators: on the "server status" page, why not include an "uptime" for each server? That way it can be seen how busy the server is. This would eliminate this type of thread. |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
Left out (d) is playing a gig tonight. |
EclipseHA Joined: 28 Jul 99 Posts: 1018 Credit: 530,719 RAC: 0 |
and (e) - got a pompom in my ear and another up my nose and make calls about things I know nothing about |
Anigel Joined: 5 Dec 99 Posts: 101 Credit: 643,544 RAC: 0 |
Ingleside: Thanks for pointing that out. Somehow I had managed to miss the fact that kosh also acts as a splitter; that helps me make more sense of it. azwoody: It depends what you mean by uptime. If you mean the full output of the *nix command uptime, which includes load averages, then yes, it would stop the question I asked from being asked. However, it may start questions like "Well, if it's so lightly loaded, and so is x, y or z, then why not add an extra A instance to it?" If you just mean uptime as in the last reset date of the box, then it wouldn't really give any more insight into what was happening. Matt Lebofsky: It would be interesting to see that tech note when it is written. Thanks. I have to be honest and admit that I am interested in behind-the-scenes geeking / gawking and how things happen on the side of things us mere mortals do not get to see. Whilst we can look at source code and see what is being done, it is sometimes difficult to get a feel for things without any idea of the underlying database / storage capacity and scales. As a pro developer regularly dealing with millions of datasets at a time, I find it easy to generalise without knowing the complexity of those files and datasets. In general I think Ingleside's note about kosh also acting as a splitter, which I did not notice when writing this thread, has gone a long way to making my original question irrelevant, and it is easy to see why the load on this box is lower than something that has to continuously chug through tapes and split / assemble / arrange and duplicate results based on tape based data. Thanks all, it's helped me better understand what is happening. Sarah Part of Teamseti For SetiBoinc status graphs visit Teamseti status graphs |
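
A footnote on the `uptime` idea discussed above: the load averages that `uptime` prints can also be read from a script and scaled by CPU count, which makes two boxes easier to compare. The sketch below is only an illustration, not anything the project runs. The helper names `load_per_cpu` and `describe_load` and the threshold of 1.0 are assumptions made up for this example. It uses Python's standard `os.getloadavg()`, which is available on Unix-like systems.

```python
import os


def load_per_cpu(load_averages, cpu_count):
    """Scale raw load averages by CPU count so different boxes can be compared."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return tuple(avg / cpu_count for avg in load_averages)


def describe_load(per_cpu_1min, busy_threshold=1.0):
    """Label a host from its 1-minute load per CPU.

    The threshold of 1.0 is an assumption for illustration, not a project rule.
    """
    return "busy" if per_cpu_1min >= busy_threshold else "lightly loaded"


if __name__ == "__main__" and hasattr(os, "getloadavg"):
    # os.getloadavg() returns the 1, 5 and 15 minute load averages.
    raw = os.getloadavg()
    cpus = os.cpu_count() or 1
    scaled = load_per_cpu(raw, cpus)
    print("load averages:", raw)
    print("per CPU:", scaled)
    print("verdict:", describe_load(scaled[0]))
```

As the thread itself shows, a low number here would not prove there is no bottleneck: Matt's diagnosis was large directories, not CPU or memory. What the load average counts also varies by system. On Linux it includes tasks blocked in uninterruptible I/O as well as runnable ones, while other Unix variants may count differently, so the figure is a hint rather than a diagnosis.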