Message boards : Number crunching : About Deadlines or Database reduction proposals
Author | Message |
---|---|
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
"Nobody answered my question: How do we raise the request to evaluate the squeeze of the deadlines with the SETI powers?" Well, someone could PM Eric. Not sure who would be the one; Kittyman and MrKevvy have had success contacting him, but I do not know who else. Maybe Richard would be a good candidate. Stephen |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
I nominate Mr. Kevvy. I find it's best to go to him with one issue at a time, and I've got one in the pipeline already (update Beta server to check that the Christmas 'Anonymous Platform' bug is properly fixed). |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"I nominate Mr. Kevvy. I find it's best to go to him with one issue at a time, and I've got one in the pipeline already (update Beta server to check that the Christmas 'Anonymous Platform' bug is properly fixed)." Just send him a PM asking for his help. Let's see what we get. |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
"I nominate Mr. Kevvy. I find it's best to go to him with one issue at a time, and I've got one in the pipeline already (update Beta server to check that the Christmas 'Anonymous Platform' bug is properly fixed)." A good one. When that bug is put to rest they can roll out 7.15 properly in main and maybe that will exorcise some of the gremlins that are annoying us. I will even move my Beta testing machine back into Beta to help ... :) {even though there is NO Parkes data yet - I have to get that in :) } Stephen <fingers crossed> |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
"I nominate Mr. Kevvy. I find it's best to go to him with one issue at a time, and I've got one in the pipeline already (update Beta server to check that the Christmas 'Anonymous Platform' bug is properly fixed)." Yes, but who? We don't want to flood him with 50 messages asking the same thing ... Dare I nominate Juan??? :) Stephen :( |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"I nominate Mr. Kevvy. I find it's best to go to him with one issue at a time, and I've got one in the pipeline already (update Beta server to check that the Christmas 'Anonymous Platform' bug is properly fixed)." Sure, he's not going to read the whole thread. I just ask him to raise with the powers that be the 3 main options we discussed: halve the current value; fix it at 25 days (like AP); or up to 28 or 30 days. All while keeping the shorties at 10 days. Did I forget something? |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13855 Credit: 208,696,464 RAC: 304 |
"Results returned and awaiting validation 0 39,153 13,810,892" New database server (and ideally a matching replica). More cores, faster clock speeds, and more than double the RAM. That would then leave the present database server hardware available to replace other less powerful systems (the Scheduler, download & upload servers get my vote for upgrading with the displaced hardware). A couple of All Flash Arrays for storage would make an even bigger difference, but they'd cost a whole lot more than a couple of new database server systems. Grant Darwin NT |
Darrell Wilcox Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118 |
What is the process for Results to move from "Results returned and awaiting validation" to the next process? Is it matching returned Results? or Compute intensive validation? or Complex DB lookups? Does the top to bottom order on the Server Status page reflect the order of processing? Is this process documented where I can study it? EDIT: Does anyone have hard data on resource utilization of the servers that we can see? |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13855 Credit: 208,696,464 RAC: 304 |
"What is the process for Results to move from 'Results returned and awaiting validation' to the next process?" The general work flow: A data file is loaded, WUs are split off from it, and they then go into the Ready to send buffer (unless it's empty, in which case they go straight out to hosts as they request work). Results out in the field: the number of WUs that have been sent out and are waiting for a result to be returned. Results returned and awaiting validation: results that have been returned by crunchers, but whose WU has yet to reach quorum (with the need to re-check any overflow result in case it's because of a dodgy driver/RX5000 combination, it means that there are way, way more resends (_2, _3, _4 WUs) than usual, which is what has blown this figure out recently). Workunits waiting for validation: WUs that have reached quorum and are waiting to be validated. Workunits waiting for assimilation: WUs that have been validated and are waiting to have data from their canonical task input into the master science database. Workunit/Files waiting for deletion: the number of files which can be deleted from disk, as the workunit has been assimilated and there is no more use for it or its constituent results. Workunits/Results waiting for db purging: the number of workunits or results which have been deleted from disk and after 24 hours will be purged from the database. It is during this period that completed results can still be viewed in your personal account pages. "Is it matching returned Results?" There is a lot of disk reading & writing activity involved with pretty much each one of the steps (of course some more & some less than others). With the database no longer being cached in the database server RAM, the slow I/O rate of the file storage is causing the Assimilation/Deletion & Purge backlogs. Grant Darwin NT |
rob smith Joined: 7 Mar 03 Posts: 22536 Credit: 416,307,556 RAC: 380 |
The data used for validation is held in RAM (as far as I'm aware). When a task is returned, after making sure it is not an "error result", the first step is to see if its twin is back yet or not. If it is, the two results are compared, and if sufficiently similar they are valid; one (normally the first one back) is declared the "canonical result" and heads off to the main database for storage and future analysis. If they don't agree then another identical task is sent out. The whole process may be repeated until the maximum number of replicate tasks have been sent out, returned and compared; the current maximum is 10, with rules around types of error. Completed and validated tasks are meant to be on display for 24 hours from final validation, so that the user has a chance of seeing them in their completed state. Depending on server settings a new replica task may be placed in the send queue immediately, or may sit on the side for some time. There are some background queries running, such as updating credit, host error figures and the like. The worst process of all is the resending of a task, which requires some quite complex queries that may need to be run a couple of times at the time of resend, as a task is not sent to the same user more than once and the trigger for the resend is taken into account. The top to bottom order on the Server Status Page does not reflect the actual sequence of actions, as some are batched and some are running in parallel. For example, while the validators run more or less continuously, the file deleters tend to run in a batch mode. There is a figure at the top of the SSP that gives the queries per second - this normally sits in the 1000 to 1500 range, but I've seen it as high as 5000 - this figure is not a particularly good metric for server load as that will depend on what type the query is and how complex it is. 
Additionally there is at least one site (someone will chime in with a link to it) that shows the data flows using various tools. But even together they may not reflect the critical server load, which arises when one part of the system hits a limit and so drags other parts down (e.g. a sudden massive surge in IP traffic may hit the overall performance as it needs disk and CPU time to process). There are some answers in the project science pages, but they may not go far enough for you. I hope that helps to answer some of your questions. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Darrell Wilcox Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118 |
@ Grant (SSSF) and rob smith Thank you. "Depending on server settings a new replica task may be placed in the send queue" Since it appears that "Results returned and awaiting validation" is the largest of the queues by number, wouldn't it make sense to give the "Resend" the highest priority, to clean up the maximum number of WUs in that queue? Keep this queue smaller and the 20 million number isn't reached, the data is kept in RAM, etc. Has this been suggested, or is it another half-baked idea? After all, if ALL the "Results out in the field" were returned and matched perfectly, there might still be many in that queue for "Resend". |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13855 Credit: 208,696,464 RAC: 304 |
"Depending on server settings a new replica task may be placed in the send queue" Given all the present bottlenecks there are probably delays in checking when a WU is due for return/re-issue & then making it available. That aside, I'd have thought once a WU has been re-issued, it would be made available then & there for download. However, if there are already WUs in the Ready-to-send buffer I figure it would go to the end of the queue (judging from the time it takes from when a file starts being split to when you see WUs from it, the Ready-to-send buffer is First in First out). It would probably be possible to have WUs fed to the feeder in order of initial date of issue to speed the return of resends up, but that would be at the expense of yet more database querying. Grant Darwin NT |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
I don't think that proposal will make much difference. Think about it in terms of time, rather than absolute numbers. The big one, 'Results returned and awaiting validation', is currently just over 14 million. Volunteers are currently returning just over 140 thousand per hour. So 'awaiting validation' would require 4 days of 'resend only' continuous work to clear. Even when the splitters aren't being throttled, the 'ready to send' queue is rarely above 4 hours, and at the moment it's about 2.5 seconds. Allocating resends immediately, rather than waiting for them to move up the queue from back to front, will save at most 4 hours against that 4 day backlog - trivial, and hardly worth doing. What would help most would be reducing the time we have to wait for a resend task to be created - and reducing the deadlines would take a big chunk out of that. |
rob smith Joined: 7 Mar 03 Posts: 22536 Credit: 416,307,556 RAC: 380 |
I just did a very rapid count of tasks sitting in my "in progress" queue. I took the first 100 tasks on my list and had a look at the deadline dates; all are tasks sent to my host today: 71% have a 53 day deadline, 12% have deadlines in excess of 70 days (now I didn't expect that), and 4% have 20 day deadlines (more or less as expected - they are all re-sends which have shorter deadlines). Binning results into ten-day groups we get: 21-30 days = 4%; 31-40 days = 0; 41-50 days = 8%; 51-60 days = 72%; 61-70 days = 4%; 71-80 days = 11%; 81-90 days = 1%. A quick comment: there are a substantial number of tasks with deadlines over 60 days - first step, cap the maximum deadline at mode+5% of mode (mode = 51-60 days, so cap at no more than 59 days); this would move >15% of the tasks back into the mode group. Leave the very short deadlines as they are, they represent less than 5% of the total. See what that does to the pending queue size; my guess is it would reduce by about 10% (remember there would be an increase in re-sends, which are currently sitting at about 4%). If that reduction isn't enough, look at pulling the mode group deadline down to around 40 days, but leave the short times as they are - scale the max deadline down so it remains the same ratio to the mode group (48 days?). That should start to have an impact on pending queue size, probably pulling it down by at least 10%, maybe 25% - but the resend pile will grow somewhat. Remember this is based on a small sample, on one day, at one time, so may not be representative of what everybody sees in their in-work queue; to get a better picture there needs to be more data over a longer period of time and over more hosts. 
Also, changing deadlines will not have an instant effect on the size of any of the queues, as it will take time for the majority of hosts to start to see the effects. My first guess is no real change would be seen for at least a month, as there are so many tasks on so many hosts that take their time to return results, and of course even longer to fail to return a result - that backlog has to be worked through. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Darrell Wilcox Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118 |
@ Grant (SSSF) "Given all the present bottlenecks there are probably delays in checking when a WU is due for ..." Agreed. I imagine the feeder queries the DB at some interval looking for Results Ready to Send and then puts them on the RTS queue. "It would probably be possible to have WUs fed to the feeder in order of initial date of issue to ..." The existing query could have the results returned in the "time of first issue" sort order. No extra query. By using a "Resend" instead of a "Fresh" Result, the "Results out in the field" would grow by only one instead of two. And when it is returned and validated, it would result in decreasing the "Results returned and awaiting validation" by three (or more) instead of only two. |
rob smith Joined: 7 Mar 03 Posts: 22536 Credit: 416,307,556 RAC: 380 |
"Agreed. I imagine the feeder queries the DB at some interval looking for Results Ready to Send and then puts them on the RTS queue." Not quite - the splitters place the work unit into the ready to send table - only one entry is made. This is a rotating FIFO buffer. If there are no tasks being resent when a new batch of 200 tasks is required, the top 100 work units are pulled from the RTS and made into tasks, which are then allocated and sent out. If there are any tasks to be re-sent they are loaded into the dispatch buffer first, which is topped up with new tasks as required. "It would probably be possible to have WUs fed to the feeder in order of initial date of issue to speed the return of resends up, but that would be at the expense of yet more database querying." While possible it would be far from practical to do so - re-sends by their very nature tend to become available at very random times, so very often totally out of sequence (from the evidence I've seen the majority of re-sends come from computational errors, and inconclusive results being returned). A re-send is a new task, copied from the work unit (this is where the use of "result" to mean several different things can get very confusing), so there is only one additional (main) entry in that part of the database - two tasks are only created at the time of first distribution. If you look at these two task names you will see that they are from different "tapes". One is a re-send, as shown by the "_2" at the end: 23fe20aa.3924.9065.10.37.58_2. The other is the second of the two initial tasks generated from the work unit, as shown by the "_1" at the end: 23fe20ad.20097.476.11.38.24.vlar_1 Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Darrell Wilcox Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118 |
@rob smith "Not quite - the splitters place the work unit into the ready to send table - only one entry is made. This is a rotating FIFO buffer" NOT in a DB then? Or is the Server Status page wrong where it states "feeder: Fills up the scheduler work queue with workunits ready to be sent."? The "Feeder" runs on Synergy, and the splitters don't. "While possible it would be far from practical to do so - re-sends by their very nature tend to become available at very random times ..." And a Resend should just be a Resend no matter when it is determined to be one. Since they increase the "Results out in the field" queue by only 1, and decrease the "Results returned and awaiting validation" queue by more than 2 when validated, they should be the priority to be processed next. I think that is what you said above when they are put into the top of the next batch to go out, so we agree. The question is about how they are picked by the feeder. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
When the 'Ready to send' queue is down to 14 (as on current show), where the feeder picks from is a moot point - it'll pick all 14, whatever they are, and that's the most you'll get in a single request. My last got resends only - all three of them - which won't make a dent in the 'returned, waiting for validation' queue. The bigger problem is that there aren't any resend tasks available for those 14 million tasks, because they're all waiting for a current task to either be returned by its current host, or to pass its individual deadline. That's the problem we've got to solve. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
I'm still in favor of a much shorter Work Cache. As it stands, the Server appears to be more interested in sending tasks to machines that won't use them for another Nine Days instead of sending tasks to much more capable machines that will send the completed tasks back in five minutes. My one machine is still not receiving enough work to keep it running, and it doesn't make any sense to me to send work to slower machines that have Days worth of work already. I have repeatedly suggested a One Day Cache, however, at this point any reduction would help. The current policy of sending tasks to machines that won't run them for over a week is just exacerbating the problem of Pending tasks. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
"I just did a very rapid count..." I don't think we should get too hung up on the details of individual task deadlines. The curve which Joe, I, and others mapped out involved a lot of observation over what felt like months. It was robustly researched, and convincing enough for the project to adopt it. But that was 12 years ago. It was valid (deliberately) for stock CPU apps only - no GPUs back then. And it was true - two major science revisions ago. There are, now, distinct outliers. I've just reported Task 8591013178. Deadline 44.33 days, so near the top of your table. But with AR=0.135278, it's a (rare) observation just outside VLAR, and it took twice as long, on my GPU, as I expected. On a CPU, it would have whistled through. So, please forget about engineering this within an inch of its life. The project hasn't got time for that. My 'halve it' suggestion was made because it'll involve the minimum possible code change - two bytes: /2. All we need to do is find out where to put it... Edit - found a copy of the splitter_pfb folder which I downloaded from svn just before Christmas (2019!). Looks like they haven't been recording subsequent changes in their repository - tut, tut. There's a file header: `mb_wufiles.cpp,v 1.1.2.6 2007/08/10 18:21:13 korpela` but it'll do as a start. This looks like the position just before Joe started work: `db_wu.delay_bound=std::max(86400.0*7,db_wu.rsc_fpops_est/4e+7);` 'delay_bound' is boincspeak for what we know as a deadline - "minimum 7 days: the rest according to flops". The current version will be different, but that's the only place delay_bound is mentioned in the splitter code.
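
The 'halve it' idea from the last post can be sketched against the quoted 2007 line. This is only an illustration under stated assumptions: the helper `delay_bound_seconds` and its `scale` and `min_days` parameters are hypothetical, not taken from the SETI@home source, and the factor is assumed to apply to the flops-dependent term only so that the 7-day floor is kept. The thread itself leaves open where the "/2" should go, and the current splitter code is known to differ from the 2007 version.

```cpp
#include <algorithm>

// Hypothetical helper, not part of the SETI@home splitter.
// It mirrors the 2007 line quoted in the thread:
//   db_wu.delay_bound = std::max(86400.0*7, db_wu.rsc_fpops_est/4e+7);
// and adds a scale factor on the flops-dependent term (an assumption).
constexpr double kSecondsPerDay = 86400.0;

double delay_bound_seconds(double rsc_fpops_est,
                           double scale = 1.0,       // 0.5 models "halve it"
                           double min_days = 7.0) {  // floor from the quoted line
    // Deadline component that grows with the estimated work in the task.
    const double flops_term = rsc_fpops_est / 4e+7;
    // Never go below the minimum deadline, whatever the scale.
    return std::max(kSecondsPerDay * min_days, flops_term * scale);
}
```

With `scale = 0.5`, a workunit whose estimate gives a 20-day deadline under the quoted line would get 10 days, while one already at the 7-day floor stays at 7 days. Putting the factor outside the `std::max` instead would also halve the floor, which is a different policy.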
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.