Message boards :
Technical News :
Emergence (Aug 13 2013)
Message board moderation
Previous · 1 · 2
Author | Message |
---|---|
Link Send message Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0 |
The case I'm talking about though, the late wingmate returns and reports before the third wingmate does and it causes it to just hang, and the third wingmate is the one that ends up being punished for something they didn't do. I see... Well, that works on MB also without any issues, though I don't know wether the validator waits for the third result or not before it vaildates the first two and how it validates the third one. The only thing that could be optimized there is activating the feature, that sends message to the host with the third (now unnecessary) result, to abort this task. I know BOINC can do that, I've seen that on Collatz for example, if I had a resend and the late wingman reported his result before I started to crunch it, the server told my computer to abort it and on the next scheduler request it was reported as "cancelled by server" or so. Much better than waisting resources on such task. |
Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
And I've seen all 3 validate at times also. I would imagine that would work if say.. _0 was the late one that started this problem. _1 and _2 validate, 24 hours has not passed yet, _0 reports and manages to sneak in there before db_purge gets called for that WU. I guess it also depends on when the files get deleted for the WU. Do the files get deleted before db_purge? How soon after validation do they get deleted? Does deletion depend on the backlog for file_deleter? So this raises the question.. in the above case, _0 can report and get validated after _1 and _2 already did, as long as the files haven't been deleted yet. After the files get deleted, but db_purge hasn't run yet, it should get "too late to validate." Logically, I would think it makes sense to go in the order of validate > [check for others that have not reached the deadline] > assimilate > db_purge > file_deleter. Files should be deleted last, and only after db_purge. Maybe it does actually go in that order already. There just seems to be some kind of logic fault in the validation process that is supposed to see if all of the non-late tasks have reported before moving forward. It just kind of seems like it checks the first two that it can and then calls it done, without considering any deadlines that are still green, or even the late task that manages to report before db_purge. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) |
Link Send message Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0 |
As I understand the describtion on the server status page, files are deleted as soon as the canonical result has been choosen and entered into the science database, i.e. assimilated. Files are not needed after that step, so it's logical to delete them here for to free up some disk space ASAP (the entry in the BOINC database is actually not needed either, but it's left there, so we can check our results). The issue is, that this step happens sometimes when one result (which is not late) is not returned yet and that gets stuck than, i.e. in your example if _0 reports back before _2, it validates against _1 and _2 gets stuck than. |
Ingleside Send message Joined: 4 Feb 03 Posts: 1546 Credit: 15,832,022 RAC: 13 |
Logically, I would think it makes sense to go in the order of validate > [check for others that have not reached the deadline] > assimilate > db_purge > file_deleter. Since db_purge removes all info about the wu, results and files from database, it's impossible to run file_deleter afterwards, since file_deleter wouldn't know which files it should delete. While it's possible to wait with file_deletion until just before db_purge, this would unnessessarily tie-up disk-resources by files what woldn't be used by the project any longer. The logical order it should work if where's no bugs anywhere would be like this: Validate wu -> Assimilator -> wait for deadline for any unreturned tasks -> Possibly Validate extra results against the Canonical_result -> file_deleter -> wait for N hours before doing db_purge. By adding some extra overhead, it's possible to run file_deletion twice, if so it would be: validate wu -> Assimilator -> file_delete all input-files and result-files except canonical and un-reported not reached their deadline -> wait for deadline ... While it's possible to do Assimilation after waited until deadline, this would be a really bad idea for any BOINC-projects where next wu depends on result of the last. For the majority of wu's where's no more results to wait for and would regardless go directly from validation through assimilation to file_deletion. "I make so many mistakes. But then just think of all the mistakes I don't make, although I might." |
David S Send message Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12 |
Logically, I would think it makes sense to go in the order of validate > [check for others that have not reached the deadline] > assimilate > db_purge > file_deleter. IIRC, anyone who is interested enough to do so and knows how can still download a validated WU until it's purged, right? So that must mean the deleter doesn't run until the same time as the purge. David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
IIRC, anyone who is interested enough to do so and knows how can still download a validated WU until it's purged, right? So that must mean the deleter doesn't run until the same time as the purge. No, the WU file is not available during the 1 day delay before the records are purged. See the FileDeleter documentation. Joe |
Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
Well I will say that putting more minds on this issue is starting to point in the direction of some broken or missing logic to check for tasks that still have a green deadline and to halt all actions on the WU until there are no more green deadlines (either _2 goes red, or _2 reports before the late _0 or _1 does), then validate, assimilate, delete, tell the late one "too late to validate" (if it even reports before the 24-hour db_purge) and move on. So it seems the problem is going to be in.. the validator? Or the step just before it that says "yes, you can try to validate these results now?" I think we're getting closer. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) |
Speedy Send message Joined: 26 Jun 04 Posts: 1643 Credit: 12,921,799 RAC: 89 |
That is a reasonable amount of data to find a place for |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13854 Credit: 208,696,464 RAC: 304 |
I must say 300 MB per second average is very impressive for data to be transferred to media. The advantage of a SSD is that a single consumer based SSD can sustain writes of 400MB/s- no array required. With a mechanical HDD the rate drops off significantly as you get to the end of the drive. However the really important thing about SSDs is their random input/output (I/O) abilities- and that is what counts with databases, not sequential I/O. When it comes to random I/O HDDs are pittifull, SSDs are in a league of their own. EDIT- Keep in mind these graphs are for consumer devices- a enterprise HDD is faster, but not that much. An enterprise based SSD is much faster, and has much higher write endurance. Grant Darwin NT |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13854 Credit: 208,696,464 RAC: 304 |
Enterprise PCIe based Solid State Storage benchmarks. Note that for consumer based drives the number of I/Os is in the 10s of thousands. For the enterprise PCIe based drive, it's in the 100s of thousands (an enterprise HDD is in the hundreds, between 150 & 450. ie almost 100,000 times slower than the fastest PCIe based SS storage). PCIe Solid State storage. Enterprise HDD. Grant Darwin NT |
Cheopis Send message Joined: 17 Sep 00 Posts: 156 Credit: 18,451,329 RAC: 0 |
I do not know enough about your database structure to say if this might be of interest, but it might be worth looking into. Rather than try to put large parts (or all) of the database on SSD's, could you keep everything on HDD's and only cache on SSD's? Some high end databases have some pretty extreme performance improvements by simply caching to SSD, rather than a full SSD solution. However, if the data complexity is growing at a rate that storage I/O just can't keep up, maybe the project would be better off trying to push more analysis of existing data to the field rather than just letting us continue to pile more barely analyzed data on the database? I know it's been said that trying to organize more in depth analysis in the field would be something of a nightmare, but it sounds like there's a problem brewing no matter what way you turn. Maybe it's time to try to figure out a way to push different analysis to the field, so that we can start reducing the size of the database, rather than constantly growing it? Or at least reduce it's growth rate to a point that the hardware and finances can keep up? |
Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
.. or if it is possible, do the caching in RAM, which is what I believe they've been doing or trying to do for a while now. You don't need the entire DB to be available for instant access. Just what they call the "active" tables. I think I've heard before that all the WUs that you can see on your tasks page, then project wide (so on the server status page, everything that is 'out in the field' and 'awaiting validation') can manage to fit into a RAM cache at a little over 100GB of RAM. The data is on disk, but it gets cached/mirrored into RAM. If 128GB is no longer enough RAM for it, then maybe swap some more memory modules around or get larger modules and increase the server up to 192 or 256GB? Point is, mechanical drives are definitely more reliable and have more capacity for less cost, but their downside is relatively slow transfer speed and access speed. SSDs are great for transfer rate and access speed, but the cost of an SSD for the capacity, plus that inherent limited write cycle trait would make you want to stray away from SSD for primary storage. So I agree.. SSD for a cache would be a better idea than moving entirely over to SSD. RAM would be even better than SSD for that scenario. Just my two cents. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) |
Cheopis Send message Joined: 17 Sep 00 Posts: 156 Credit: 18,451,329 RAC: 0 |
Aye if possible, a RAM expansion for caching would be best, but I would imagine more system RAM being better used for analysis, rather than caching. An actual RAM cache the same size as a modern cache SSD would be awesome, but REALLY pricey. But I'm shooting in the dark here because I don't know enough. I can only toss out an idea and let the folks who know what the servers need for analysis look at it and see if there's any potential use in such a solution. SSD's as a caching solution aren't always terribly useful. Depends on the size of the SSD, and how repetitive the data is that is being pulled for analysis, as well as a few other potential bottlenecks or exceptions. |
PAJTL_Computing Send message Joined: 24 Apr 00 Posts: 2 Credit: 160,462,107 RAC: 157 |
HI All, Just my two cents worth. I think what we need is another two copies of the science database. These copies would only hold the data for a current time frame [weekly outage?] and therefore only current work loads and IO. At the predefined interval the data is copied from these working science database copies to the main and therefore the replica science database. So to make it clearer, working and working replica handle day to day IO, Main and main replica science database with only IO from analysis. Once a week or defined period the data is "moved" from working databases to main databases. This has the added advantage that minimal programing is needed to existing programs, jobs or website and you would be able to just "copy" the existing main science database table structure. Of course you would require two new servers to handle the working and working replica databases, and these could be obtained for free? from the co-location facility http://ist.berkeley.edu/services/catalog/datacenter/serverdonation/terms In my opinion RAM expansion would not provide a large benefit as a complete table from the database would have to be cached and the current size of tables would prohibit this. Thanks |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.