Message boards : Number crunching : About Deadlines or Database reduction proposals
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
> Results returned and awaiting validation 0 39,153 13,810,892

> New database server (and ideally a matching replica). More cores, faster clock speeds, and more than double the RAM.

Well, two of those exist. The new scheduler/upload/download server is supposed to be Muarae2, but that stalled in Beta. Is it still deployed there? And a new storage array with 32 SOTA drives was being built. Not sure where those projects stand.

Stephen
Darrell Wilcox · Joined: 11 Nov 99 · Posts: 303 · Credit: 180,954,940 · RAC: 118
I have been thinking about the situation we have here. In round numbers: 6 million WUs in the field, and 14 million returned and awaiting validation. Using a thought experiment, I think our problem is not the deadline; it is that validation is too slow.

Try this: imagine empty queues to start, with the validator not running. Split 10 million Results into 20 million WUs. Process 4 million pairs (primary and secondary) to completion and return them. Process only the primary of each pair of the remaining six million pairs. The result will be what we now have in our queues. It doesn't matter if the 6 million still in the field are being processed or sitting in a cache for the past two months.

You can think of the splitting, processing, and validating as standing waves, where each feeds into the next. Splitters work at 80/second, processors at 40/second, and validators at less than 40/second. Start the validators running, feed them more than they can process, and that queue will grow. Ditto the splitters and processors if they are asked for more WUs.

Solutions include faster validators, more validators, smarter validation code, and freeing more resources on the current validators (moving work off to other systems). Comments?
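To make the standing-wave picture concrete, here is a toy simulation using only the rates quoted above. The exact validator rate (35/second) and the fixed one-second ticks are assumptions for illustration, not measurements of the real servers.

```python
# Toy simulation of the three-stage pipeline described above, using the
# rates quoted in the post (splitters ~80/s, processors ~40/s, validators
# under 40/s). Purely illustrative; real BOINC daemons do not run in
# fixed one-second ticks.

SPLIT_RATE = 80    # work units created per second (from the post)
PROC_RATE = 40     # results returned per second (from the post)
VALID_RATE = 35    # assumed validator throughput; the post only says "< 40"

to_process = 0     # work waiting to be crunched
to_validate = 0    # results returned, awaiting validation

for _ in range(3600):                            # simulate one hour
    to_process += SPLIT_RATE                     # splitters feed the field
    done = min(to_process, PROC_RATE)            # crunchers return results
    to_process -= done
    to_validate += done
    to_validate -= min(to_validate, VALID_RATE)  # validators drain the queue

# The pending-validation queue grows by (PROC_RATE - VALID_RATE) each tick:
# on this model the backlog is set by the slowest stage, not by the deadline.
print(f"after 1 hour: {to_process:,} to process, {to_validate:,} awaiting validation")
```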
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14679 · Credit: 200,643,578 · RAC: 874
When a WU is first created, the 'primary' and 'secondary' tasks, as you call them - _0 and _1 - are also created. If you pick any WU at random, you will see that both tasks are sent out to crunchers within a second or two of each other (somebody will find a counter-example, but that's the general practice). What happens after that is entirely up to us, the crunchers.

No attempt at validation can take place until both _0 and _1 have been completed and returned. The speed of the validator plays no part in the process until it has something to validate. If one volunteer fails to return their work, the whole WU - both tasks - is stalled until the deadline is reached: neither the project, nor the other cruncher, can speed up the missing volunteer.
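A sketch of the gating Richard describes: nothing about a work unit can be validated until both initial replicas are back. The WorkUnit class and its field names are invented for this illustration; they are not BOINC's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class WorkUnit:
    # Suffixes of the replicas completed and returned so far,
    # e.g. {"_0"} or {"_0", "_1"}. Hypothetical representation.
    returned: set = field(default_factory=set)

    def ready_to_validate(self) -> bool:
        # Both initial replicas must be in hand; until then the WU
        # just waits, however fast the validator itself might run.
        return {"_0", "_1"} <= self.returned

wu = WorkUnit()
wu.returned.add("_0")
print(wu.ready_to_validate())   # False: stalled on the missing volunteer
wu.returned.add("_1")
print(wu.ready_to_validate())   # True: only now does validator speed matter
```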
Darrell Wilcox · Joined: 11 Nov 99 · Posts: 303 · Credit: 180,954,940 · RAC: 118
Reducing the deadline will cause some number of today's slow computers to stop contributing to SETI (even slow computers process WUs). Aren't the splitters already faster than the processors? We artificially throttle the splitters because we aren't validating the processed WUs quickly enough.

If the "goal" is a smaller waiting-validation queue, without regard to the overall number of WUs processed, then throttling back the processing will do that. The big systems will like it, but participation in SETI will decrease. Sitting in a user's cache is not causing this problem. This problem is a constipated validation process.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14679 · Credit: 200,643,578 · RAC: 874
Agreed - sitting in an (active) cache isn't the problem. Unless somebody with laboratory security credentials (i.e. staff) can run some queries on the actual database, we won't know for certain. But if the choice is between a large number of very slow computers, slowly but surely plodding their way through their tasks just fast enough to return each one just before its seven-week deadline, or a number of shiny new Christmas presents, with a bundle of tasks downloaded in a burst of early enthusiasm but then abandoned by uninstalling BOINC - my money is on the latter.
Darrell Wilcox · Joined: 11 Nov 99 · Posts: 303 · Credit: 180,954,940 · RAC: 118
@ Richard Haselgrove
I agree with what you wrote. Here is my problem: there are about 6 million WUs in the field, and 14 million in the waiting-validation queue. If ALL 6 million in the field are the _1 tasks, and ALL 6 million _0 tasks have been returned, are the 8 million others in the waiting-validation queue ALL resends? Or are they actually both the _0 and _1 of 4 million pairs that have NOT been validated (less a few resends)? You can adjust the proportions, but this is a worst-case, easy-to-understand scenario. If I have misunderstood, please forgive me and correct my understanding.
rob smith · Joined: 7 Mar 03 · Posts: 22535 · Credit: 416,307,556 · RAC: 380
> Try this: imagine empty queues to start, with the validator not running. Split 10 million Results into 20 million WUs.

That is not right - in this context "result" is the work unit that is split into two tasks. Someone must get into the system and sort out the nomenclature so confusion does not reign.

> Process 4 million pairs (primary and secondary) to completion and return them.

As Richard says, this is not how it happens - the first two tasks sent out are a pair of twins; they are not "primary" and "secondary", they are equal.

> Process only the primary of each pair of the remaining six million pairs. The result will be what we now have in our queues. It doesn't matter if the 6 million still in the field are being processed or sitting in a cache for the past two months.

No, it cannot work like that: validation requires two tasks from the same work unit that match each other. Anything else could well result in a whole load of garbage arriving in the main database (we've just had a near-miss due to a major issue with AMD GPUs).

> You can think of the splitting, processing, validating, as standing waves, where each feeds into the next.

No, the splitters only work the "tapes" (a historical name; these days the data comes in as ~50 GB files, either on disks or down the wires from the telescopes). They split the massive data into manageable parts called "results", or the more usable "work units". They are throttled by a number of factors, demand being only one of them - they only need to run at something like 39 work units per second to keep up with demand, but they actually work in fairly short bursts of much more than that, and the length and frequency of those bursts can be varied from zero up to some value that equates to around 80 work units per second (maybe more).

The validation process, as Richard says, is more than fast enough, even on the current hardware. It is solely controlled by the availability of at least two returned tasks for a given work unit. Normally these are the "_0" and "_1" pair, but if one of those two doesn't appear in time, or is returned as an error task, or the two don't agree, a "_2" will be sent out, whereupon we have to wait for that to be returned (and so on, up to the maximum number allowed). Having looked at the validation code in some detail, I would say it is very efficiently written, if not very well structured or documented.

Richard's final comment is one that is well worth remembering in these discussions:

> neither the project, nor the other cruncher, can speed up the missing volunteer.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
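A sketch of the replica/resend escalation rob outlines. The function, the 'ok'/'bad' encoding, and the cap of five results per work unit are assumptions for illustration only, not SETI@home's actual validator code or configured limits.

```python
MAX_TOTAL_RESULTS = 5   # assumed cap on replicas per work unit (hypothetical)

def next_action(returned: list[str], outstanding: int) -> str:
    """Decide what happens next for one work unit.

    returned:    resolved replicas, each 'ok' (usable result) or 'bad'
                 (error, timeout, or mismatch with its partner)
    outstanding: replicas still out in the field
    """
    if returned.count("ok") >= 2:
        return "quorum met: validate and assimilate"
    if outstanding > 0:
        # Richard's point: nothing to validate yet, so the WU just waits.
        return "wait for the missing volunteer"
    if len(returned) >= MAX_TOTAL_RESULTS:
        return "give up on this work unit"
    return "issue a resend (_2, _3, ...) and wait again"

print(next_action(["ok"], outstanding=1))          # wait for the missing volunteer
print(next_action(["ok", "bad"], outstanding=0))   # issue a resend (_2, _3, ...) and wait again
print(next_action(["ok", "ok"], outstanding=0))    # quorum met: validate and assimilate
```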
Darrell Wilcox · Joined: 11 Nov 99 · Posts: 303 · Credit: 180,954,940 · RAC: 118
Sitting in an INactive cache is not causing much of this problem, either. AT MOST, it could represent 6 million of the 14 million in the waiting-validation queue. Let's look at getting the 8 million presumably good WUs validated; then there is enough room for all the producers, big and fast, and slow but many. Eventually the inactive caches will time out, and those WUs will be available for resend. It has taken many years for the data to arrive on Earth - we can wait a few more weeks to process it.

Earlier I asked if anyone had data on resource utilization, but I have not seen any response.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14679 · Credit: 200,643,578 · RAC: 874
In the last couple of months, we have been hit by both major GPU manufacturers releasing drivers which don't play reliably with the science applications we use here. If you look in the topmost sticky thread (the AMD one), you'll see an oblivious volunteer who is perfectly happy with his 'stable' new toy, which has a quality rating of:

> State: All (2403) · In progress (300) · Validation pending (529) · Validation inconclusive (545) · Valid (986) · Invalid (43) · Error (0)

He's turned the wick up to cache the maximum number of tasks possible, and he's producing - cr*p. And he seems happy with it. It's machines - and drivers - and users - like that who (I suspect) are causing the size of the database to bloat with extra resend and confirmation tasks: 545 for that user's inconclusives alone. That's our biggest problem at the moment, from my cursory survey of my own pendings a few days ago: all the extra re-checks made necessary by bad or missing volunteers.
rob smith · Joined: 7 Mar 03 · Posts: 22535 · Credit: 416,307,556 · RAC: 380
> There are 14 million in the waiting validation queue. If ALL 6 million in the field are the _1, and ALL 6 million _0 have been returned, are the 8 million other WUs in the waiting validation queue ALL resends?

This really shows the nomenclature problem quite clearly - the use of one word to mean different things depending on the context. There are 14 million TASKS in the field, and 6 million WORK UNITS in the field. At two tasks per work unit, that accounts for 12 million tasks, leaving 2 million TASKS "unaccounted for". In reality these will be resends (either actually sent out, or sitting around waiting to be sent out), which, looking at my figures, is just about the number I would expect. As I said in my previous post, there may be several resends (TASKS) associated with one WORK UNIT, so the numbers can get out of step.

I would like to see the base figure (6 million WORK UNITS) getting down to a more reasonable size, say around 3-4 million, but even then there would be more than twice that figure of TASKS in the field due to resends. (Exactly twice would be a utopian state, as that would imply there were no resends out there.)

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
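rob's reconciliation, spelled out as arithmetic. The counts come from the thread; the only assumption is the standard two initial replicas (_0 and _1) per work unit.

```python
tasks_in_field = 14_000_000       # rob's figure for TASKS
workunits_in_field = 6_000_000    # rob's figure for WORK UNITS

initial_replicas = workunits_in_field * 2          # one _0 and one _1 each
implied_resends = tasks_in_field - initial_replicas

print(f"{implied_resends:,} tasks implied to be resends")   # 2,000,000
```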
Darrell Wilcox · Joined: 11 Nov 99 · Posts: 303 · Credit: 180,954,940 · RAC: 118
@ rob smith
Sorry if my "split" upset you. Perhaps I should have written "divide". And remember, this is an imaginary thought experiment to illustrate that the WUs sitting in caches are not the problem. The problem is that the validators are not processing fast enough to keep up with the processors.

> The validation process, as Richard says, is more than fast enough, even on the current hardware.

Ahh! Someone with actual resource utilization data! How can I get it for my analysis?
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14679 · Credit: 200,643,578 · RAC: 874
> The problem is validators are not processing fast enough to keep up with the processors.

The validation process (again from my own survey of pendings) is not slow: it's stopped, because it's waiting for the necessary data to be returned so that it has something to validate.
rob smith · Joined: 7 Mar 03 · Posts: 22535 · Credit: 416,307,556 · RAC: 380
> There are, now, distinct outliers. I've just reported Task 8591013178. Deadline 44.33 days, so near the top of your table. But with AR=0.135278, it's a (rare) observation just outside VLAR, and it took twice as long, on my GPU, as I expected. On a CPU, it would have whistled through.

Richard clearly shows one issue with making radical changes to the deadline without careful thought - different processors and applications handle the same task in very different periods of time. Hence my suggestion of taking a small step and waiting to see what happens, rather than hacking the time down radically and finding another issue further along the line.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
> ...If one volunteer fails to return their work, the whole WU - both tasks - is stalled until the deadline is reached: neither the project, nor the other cruncher, can speed up the missing volunteer.

This seems to be in conflict. The longer a task sits in a cache, the longer it will be listed as Pending. Logic says the quicker a task can be completed, the quicker the work unit will be validated. If you send a resend to a machine that takes weeks before running it, the work unit will be listed as Pending for weeks. Send the resend to one of my machines and it will be back within hours, not weeks. The way I see it, a machine with a slow turnaround time will result in a larger number of Pending tasks.
Darrell Wilcox · Joined: 11 Nov 99 · Posts: 303 · Credit: 180,954,940 · RAC: 118
@ rob smith

> There are 14 million TASKS in the field, there are 6 million WORK UNITS in the field.

Ahh, please reread my sentence. It was "14 million in the waiting validation queue". Thanks for the correction of "task" versus "work unit". Even so, that still means there are 8 million TASKS reported and waiting validation. They can NOT all be inconclusives, or there would be more tasks in the field.
rob smith · Joined: 7 Mar 03 · Posts: 22535 · Credit: 416,307,556 · RAC: 380
I doubt that you will find any externally available metric that will help. I had to resort to using a clock-cycle estimating tool to see how fast the code for the validators could possibly run, and had to make assumptions about how efficiently the source code (C++) was converted into CPU instructions. I did try a variety of optimisation levels, and thankfully the spread between the lowest and highest counts wasn't as bad as some I've seen; even the worst case was still fast enough not to worry about.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Darrell Wilcox · Joined: 11 Nov 99 · Posts: 303 · Credit: 180,954,940 · RAC: 118
@ TBar
I agree with your logic that a quicker send and quicker processing will result in a quicker return to the validation queue. However, sitting in a queue which has plenty of free space is not a problem per se. Sitting in a person's cache for weeks is not a problem so long as the queue of tasks in the field is not causing a problem. If it causes the waiting-validation queue to grow to a problematic size, then it IS causing a downstream problem. It messes with the RAC, but not the science.

IF, and this is IF, the system did not misbehave when the sum of some queues reached about 20 million, but only when it hit 50 million, we wouldn't be having this issue now. We would have it when the queues hit 50 million.

I agree that resends should have higher priority than fresh work. See prior posts for reasons.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14679 · Credit: 200,643,578 · RAC: 874
It depends what you mean by the word 'active'. I'd need to check, but I don't think any of my machines - even the ones processing on a single, low-power GPU like a 1050 Ti - has a turnround average of as much as one day. I think that my machines can be classed as 'active', and are not part of the problem.

> ...If one volunteer fails to return their work, the whole WU - both tasks - is stalled until the deadline is reached: neither the project, nor the other cruncher, can speed up the missing volunteer.

> This seems to be in conflict. The longer the task sits in the cache, the longer the task will be listed as Pending. Logic would say the quicker the task can be completed the quicker the Work Unit will be Validated. If you send a Resend to a machine that takes weeks before running the resend, the Work Unit will be listed as Pending for weeks. Send the Resend to one of my machines and it will be back within hours, not weeks. It would seem a machine with a slow turn around time will result in a larger number of Pending tasks the way I see it.

Look at the table in my message 2033755 (tasks which have been pending for more than four weeks). See how many of them say 'last contact in January', 'no contact since that allocation', 'never recontacted'. Those are the ones I consider INactive caches.
rob smith · Joined: 7 Mar 03 · Posts: 22535 · Credit: 416,307,556 · RAC: 380
...We are both victims of the poor, imprecise nomenclature :-(

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Darrell Wilcox · Joined: 11 Nov 99 · Posts: 303 · Credit: 180,954,940 · RAC: 118
@ Richard Haselgrove

> The validation process (again from my own survey of pendings) is not slow: it's stopped,

I believe what you see is your truth. I ask you to consider whether we could have 14 million tasks, most of which require a resend, without having a much larger count of tasks in the field. From prior posts, resends are placed at the top of the queue for sending work out. If you are getting work with a long deadline, isn't that new work?

EDIT: I just downloaded a new task. It has a deadline of 4/20.