Emergence (Aug 13 2013)

Profile Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1407585 - Posted: 25 Aug 2013, 11:08:58 UTC - in response to Message 1407551.  

The case I'm talking about though, the late wingmate returns and reports before the third wingmate does and it causes it to just hang, and the third wingmate is the one that ends up being punished for something they didn't do.

I see... Well, that works on MB also without any issues, though I don't know whether the validator waits for the third result before it validates the first two, or how it validates the third one. The only thing that could be optimized there is activating the feature that sends a message to the host holding the third (now unnecessary) result, telling it to abort this task. I know BOINC can do that; I've seen it on Collatz, for example: if I had a resend and the late wingman reported his result before I started to crunch it, the server told my computer to abort it, and on the next scheduler request it was reported as "cancelled by server" or so. Much better than wasting resources on such a task.
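
A minimal sketch of the server-side decision described above, in Python for illustration only. This is not BOINC's scheduler code: the `Task` fields and the `tasks_to_abort` helper are hypothetical names, and the rule (abort only tasks that have not started yet) is taken from the Collatz behaviour described in the post.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    started: bool           # has the host begun crunching it?
    wu_has_canonical: bool  # has the workunit already validated without it?

def tasks_to_abort(tasks_on_host):
    """Return names of tasks the server could tell this host to abort."""
    return [t.name for t in tasks_on_host
            if t.wu_has_canonical and not t.started]

host_tasks = [
    Task("wu1_2", started=False, wu_has_canonical=True),   # resend no longer needed
    Task("wu2_0", started=True,  wu_has_canonical=True),   # already running, left alone
    Task("wu3_1", started=False, wu_has_canonical=False),  # still needed
]
print(tasks_to_abort(host_tasks))
```

Leaving already-started tasks alone is a design choice in this sketch; a project could just as well decide to abort those too.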
ID: 1407585 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1407828 - Posted: 26 Aug 2013, 7:09:03 UTC - in response to Message 1407582.  
Last modified: 26 Aug 2013, 7:11:00 UTC

And I've seen all 3 validate at times also.

I would imagine that would work if say.. _0 was the late one that started this problem. _1 and _2 validate, 24 hours has not passed yet, _0 reports and manages to sneak in there before db_purge gets called for that WU.

I guess it also depends on when the files get deleted for the WU. Do the files get deleted before db_purge? How soon after validation do they get deleted? Does deletion depend on the backlog for file_deleter?

So this raises the question.. in the above case, _0 can report and get validated after _1 and _2 already did, as long as the files haven't been deleted yet. After the files get deleted, but db_purge hasn't run yet, it should get "too late to validate."
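
The three cases in the paragraph above can be written down as a tiny decision function. This is only a sketch of the reasoning in this post, not of the real validator; the function name is hypothetical, and the outcome after db_purge is an assumption, since the post does not say what happens then.

```python
def late_result_outcome(files_deleted: bool, db_purged: bool) -> str:
    """Classify what happens to a result that reports after its workunit validated."""
    if db_purged:
        # Assumption: once the records are purged there is nothing left to match against.
        return "no record of workunit"
    if files_deleted:
        return "too late to validate"
    return "validate against canonical result"

for files_deleted, db_purged in [(False, False), (True, False), (True, True)]:
    print(files_deleted, db_purged, "->", late_result_outcome(files_deleted, db_purged))
```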

Logically, I would think it makes sense to go in the order of validate > [check for others that have not reached the deadline] > assimilate > db_purge > file_deleter.

Files should be deleted last, and only after db_purge. Maybe it does actually go in that order already. There just seems to be some kind of logic fault in the validation process that is supposed to see if all of the non-late tasks have reported before moving forward. It just kind of seems like it checks the first two that it can and then calls it done, without considering any deadlines that are still green, or even the late task that manages to report before db_purge.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1407828 · Report as offensive
Profile Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1407898 - Posted: 26 Aug 2013, 11:48:38 UTC - in response to Message 1407828.  

As I understand the description on the server status page, files are deleted as soon as the canonical result has been chosen and entered into the science database, i.e. assimilated. Files are not needed after that step, so it's logical to delete them at that point to free up some disk space ASAP (the entry in the BOINC database is actually not needed either, but it's left there so we can check our results). The issue is that this step sometimes happens while one result (which is not late) has not been returned yet, and that result then gets stuck; i.e. in your example, if _0 reports back before _2, it validates against _1, and _2 then gets stuck.
ID: 1407898 · Report as offensive
Ingleside
Volunteer developer

Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 1407919 - Posted: 26 Aug 2013, 13:05:54 UTC - in response to Message 1407828.  

Logically, I would think it makes sense to go in the order of validate > [check for others that have not reached the deadline] > assimilate > db_purge > file_deleter.

Since db_purge removes all info about the wu, results and files from the database, it's impossible to run file_deleter afterwards, since file_deleter wouldn't know which files it should delete. While it's possible to wait with file_deletion until just before db_purge, this would unnecessarily tie up disk resources with files that wouldn't be used by the project any longer.

The logical order, if there are no bugs anywhere, would be like this:
Validate wu -> Assimilator -> wait for deadline for any unreturned tasks -> Possibly Validate extra results against the Canonical_result -> file_deleter -> wait for N hours before doing db_purge.

By adding some extra overhead, it's possible to run file_deletion twice; if so, it would be:
validate wu -> Assimilator -> file_delete all input-files and result-files except the canonical result and unreported results that have not reached their deadline -> wait for deadline ...

While it's possible to do Assimilation after waiting until the deadline, this would be a really bad idea for any BOINC projects where the next wu depends on the result of the last.

For the majority of wu's there are no more results to wait for, and they would regardless go directly from validation through assimilation to file_deletion.
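
The ordering above can be sketched as a function that returns the next stage for a workunit. This is a toy model in Python, not the actual BOINC daemons: the dictionary keys are hypothetical, times are plain hours, and `PURGE_DELAY_HOURS = 24` is an assumption standing in for "N hours", based on the one-day delay mentioned elsewhere in this thread.

```python
PURGE_DELAY_HOURS = 24  # the "N hours"; assumed value, see lead-in

def next_step(wu, now):
    """Return the next pipeline stage for a workunit; times are in hours."""
    if not wu["validated"]:
        return "validate"
    if not wu["assimilated"]:
        return "assimilate"
    # Unreturned tasks whose deadline has not passed yet hold up file deletion.
    if any(deadline > now for deadline in wu["open_deadlines"]):
        return "wait for deadline"
    if wu["extra_results"]:
        return "validate extra results against canonical"
    if wu["files_deleted_at"] is None:
        return "file_deleter"
    if now - wu["files_deleted_at"] < PURGE_DELAY_HOURS:
        return "wait before db_purge"
    return "db_purge"

# Typical workunit: nothing outstanding, so it goes straight on to file deletion.
typical = {"validated": True, "assimilated": True, "open_deadlines": [],
           "extra_results": [], "files_deleted_at": None}
print(next_step(typical, now=0))

# Workunit with one unreturned task whose deadline is still 48 hours away.
waiting = dict(typical, open_deadlines=[48])
print(next_step(waiting, now=0))
```

Note that assimilation comes before the wait, matching the point above about projects where the next wu depends on the previous result.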
"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
ID: 1407919 · Report as offensive
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1407955 - Posted: 26 Aug 2013, 15:14:28 UTC - in response to Message 1407919.  

Logically, I would think it makes sense to go in the order of validate > [check for others that have not reached the deadline] > assimilate > db_purge > file_deleter.

Since db_purge removes all info about the wu, results and files from the database, it's impossible to run file_deleter afterwards, since file_deleter wouldn't know which files it should delete. While it's possible to wait with file_deletion until just before db_purge, this would unnecessarily tie up disk resources with files that wouldn't be used by the project any longer.

The logical order, if there are no bugs anywhere, would be like this:
Validate wu -> Assimilator -> wait for deadline for any unreturned tasks -> Possibly Validate extra results against the Canonical_result -> file_deleter -> wait for N hours before doing db_purge.

By adding some extra overhead, it's possible to run file_deletion twice; if so, it would be:
validate wu -> Assimilator -> file_delete all input-files and result-files except the canonical result and unreported results that have not reached their deadline -> wait for deadline ...

While it's possible to do Assimilation after waiting until the deadline, this would be a really bad idea for any BOINC projects where the next wu depends on the result of the last.

For the majority of wu's there are no more results to wait for, and they would regardless go directly from validation through assimilation to file_deletion.

IIRC, anyone who is interested enough to do so and knows how can still download a validated WU until it's purged, right? So that must mean the deleter doesn't run until the same time as the purge.

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1407955 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1407994 - Posted: 26 Aug 2013, 17:13:14 UTC - in response to Message 1407955.  

IIRC, anyone who is interested enough to do so and knows how can still download a validated WU until it's purged, right? So that must mean the deleter doesn't run until the same time as the purge.

No, the WU file is not available during the 1 day delay before the records are purged. See the FileDeleter documentation.
Joe
ID: 1407994 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1408270 - Posted: 27 Aug 2013, 7:17:13 UTC

Well I will say that putting more minds on this issue is starting to point in the direction of some broken or missing logic to check for tasks that still have a green deadline and to halt all actions on the WU until there are no more green deadlines (either _2 goes red, or _2 reports before the late _0 or _1 does), then validate, assimilate, delete, tell the late one "too late to validate" (if it even reports before the 24-hour db_purge) and move on.
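
A rough sketch of the check described above: hold the workunit while any unreturned task still has a green deadline. It is a guess at what such a gate could look like, not the actual transitioner or validator logic; `can_validate`, the result dictionaries and the quorum of 2 are all assumptions made for illustration.

```python
def can_validate(results, now, quorum=2):
    """True when enough results are in and no unreturned task is still inside its deadline."""
    returned = [r for r in results if r["returned"]]
    green = [r for r in results if not r["returned"] and r["deadline"] > now]
    return len(returned) >= quorum and not green

results = [
    {"name": "_0", "returned": True,  "deadline": 10},  # was late, but has now reported
    {"name": "_1", "returned": True,  "deadline": 10},
    {"name": "_2", "returned": False, "deadline": 50},  # resend, deadline still green
]
print(can_validate(results, now=20))  # _2 is still green, so hold
print(can_validate(results, now=60))  # _2 has gone red, so proceed
```

The trade-off is the one raised earlier in the thread: holding everything until the last green deadline also delays assimilation, which matters for projects where later work depends on earlier results.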

So it seems the problem is going to be in.. the validator? Or the step just before it that says "yes, you can try to validate these results now?" I think we're getting closer.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1408270 · Report as offensive
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 1418737 - Posted: 21 Sep 2013, 23:51:01 UTC - in response to Message 1404913.  


When I can find somewhere to dump 1400GB of data to, I'm going to rebuild the array with a 64KB stripe (actually, I think the newer firmware for my Areca card now supports 128KB).

That is a reasonable amount of data to find a place for
ID: 1418737 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13822
Credit: 208,696,464
RAC: 304
Australia
Message 1418759 - Posted: 22 Sep 2013, 0:59:43 UTC - in response to Message 1404913.  
Last modified: 22 Sep 2013, 1:03:40 UTC

I must say 300 MB per second average is very impressive for data to be transferred to media.

300MB/sec isn't really all that unrealistic. Even with 7200RPM mechanical disks, if you put 3-4 in a RAID array so that you are writing to at least 3-4 disks at any given time, you can achieve that number.
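
As back-of-the-envelope arithmetic for the claim above: the 100 MB/s per-disk figure is an assumption for a 7200RPM disk's sustained sequential write (it is not stated in this thread), and the model ignores controller overhead and parity. The 1400GB figure is the array size mentioned earlier, with 1 GB taken as 1000 MB.

```python
PER_DISK_MB_S = 100  # assumed sustained sequential write rate of one 7200RPM disk

def striped_throughput(disks, per_disk_mb_s=PER_DISK_MB_S):
    """Idealised aggregate write rate when data is striped evenly over all disks."""
    return disks * per_disk_mb_s

def transfer_hours(size_gb, rate_mb_s):
    """Hours needed to move size_gb at rate_mb_s (1 GB = 1000 MB)."""
    return size_gb * 1000 / rate_mb_s / 3600

for disks in (3, 4):
    print(disks, "disks:", striped_throughput(disks), "MB/s")

print("1400 GB at 300 MB/s:", round(transfer_hours(1400, 300), 2), "hours")
```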

The advantage of an SSD is that a single consumer based SSD can sustain writes of 400MB/s, no array required. With a mechanical HDD the rate drops off significantly as you get to the end of the drive.





However the really important thing about SSDs is their random input/output (I/O) abilities- and that is what counts with databases, not sequential I/O.


When it comes to random I/O, HDDs are pitiful; SSDs are in a league of their own.
EDIT- Keep in mind these graphs are for consumer devices- an enterprise HDD is faster, but not that much. An enterprise based SSD is much faster, and has much higher write endurance.
Grant
Darwin NT
ID: 1418759 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13822
Credit: 208,696,464
RAC: 304
Australia
Message 1418763 - Posted: 22 Sep 2013, 1:18:04 UTC - in response to Message 1418759.  


Enterprise PCIe based Solid State Storage benchmarks.
Note that for consumer based drives the number of I/Os is in the 10s of thousands. For the enterprise PCIe based drive, it's in the 100s of thousands (an enterprise HDD is in the hundreds, between 150 & 450. ie almost 100,000 times slower than the fastest PCIe based SS storage).
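
To put rough numbers on the gap, here is the ratio worked out from the ranges quoted in the text. The graphs themselves are not reproduced here, so reading "100s of thousands" as 100,000 to 900,000 I/Os is an assumption; a device that peaks above that range would widen the gap further.

```python
def iops_ratio(fast_iops, slow_iops):
    """How many times more I/Os per second the fast device handles."""
    return fast_iops / slow_iops

HDD_IOPS = (150, 450)           # enterprise HDD range quoted in the post
PCIE_IOPS = (100_000, 900_000)  # assumed reading of "100s of thousands"

smallest_gap = iops_ratio(PCIE_IOPS[0], HDD_IOPS[1])  # slowest PCIe vs fastest HDD
largest_gap = iops_ratio(PCIE_IOPS[1], HDD_IOPS[0])   # fastest PCIe vs slowest HDD
print(f"between about {smallest_gap:.0f}x and {largest_gap:.0f}x")
```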

PCIe Solid State storage.


Enterprise HDD.

Grant
Darwin NT
ID: 1418763 · Report as offensive
Cheopis

Joined: 17 Sep 00
Posts: 156
Credit: 18,451,329
RAC: 0
United States
Message 1422380 - Posted: 30 Sep 2013, 14:30:21 UTC - in response to Message 1402757.  

I do not know enough about your database structure to say if this might be of interest, but it might be worth looking into.

Rather than try to put large parts (or all) of the database on SSD's, could you keep everything on HDD's and only cache on SSD's?

Some high end databases have some pretty extreme performance improvements by simply caching to SSD, rather than a full SSD solution.

However, if the data complexity is growing at a rate that storage I/O just can't keep up, maybe the project would be better off trying to push more analysis of existing data to the field rather than just letting us continue to pile more barely analyzed data on the database?

I know it's been said that trying to organize more in-depth analysis in the field would be something of a nightmare, but it sounds like there's a problem brewing no matter which way you turn. Maybe it's time to try to figure out a way to push different analysis to the field, so that we can start reducing the size of the database, rather than constantly growing it? Or at least reduce its growth rate to a point that the hardware and finances can keep up?
ID: 1422380 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1422419 - Posted: 30 Sep 2013, 16:06:32 UTC

.. or if it is possible, do the caching in RAM, which is what I believe they've been doing or trying to do for a while now. You don't need the entire DB to be available for instant access. Just what they call the "active" tables. I think I've heard before that all the WUs that you can see on your tasks page, then project wide (so on the server status page, everything that is 'out in the field' and 'awaiting validation') can manage to fit into a RAM cache at a little over 100GB of RAM. The data is on disk, but it gets cached/mirrored into RAM.

If 128GB is no longer enough RAM for it, then maybe swap some more memory modules around or get larger modules and increase the server up to 192 or 256GB?



Point is, mechanical drives are definitely more reliable and have more capacity for less cost, but their downside is relatively slow transfer speed and access speed. SSDs are great for transfer rate and access speed, but the cost of an SSD for the capacity, plus that inherent limited write cycle trait would make you want to stray away from SSD for primary storage.

So I agree.. SSD for a cache would be a better idea than moving entirely over to SSD. RAM would be even better than SSD for that scenario.
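
The caching idea above (a small fast tier in front of big slow disks) can be sketched as a read-through cache with least-recently-used eviction. This is a generic illustration, not how the project's database server actually caches; the class, the key names and the sizes are all made up for the example.

```python
from collections import OrderedDict

class ReadThroughCache:
    """Small fast tier (RAM or SSD) in front of a large slow store (HDD)."""

    def __init__(self, backing_store, capacity):
        self.backing_store = backing_store
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.backing_store[key]      # slow path: go to the disk tier
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used entry
        return value

disk = {f"wu{i}": f"row{i}" for i in range(1000)}  # stands in for the full database
cache = ReadThroughCache(disk, capacity=100)

# Repeatedly touch a small "active" set, as with the active tables mentioned above.
for _ in range(5):
    for i in range(50):
        cache.get(f"wu{i}")

print("hits:", cache.hits, "misses:", cache.misses)
```

Because the active set (50 rows) fits inside the cache (100 rows), only the first pass goes to disk. If the active set were larger than the cache, the hit count would drop sharply.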

Just my two cents.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1422419 · Report as offensive
Cheopis

Joined: 17 Sep 00
Posts: 156
Credit: 18,451,329
RAC: 0
United States
Message 1422800 - Posted: 1 Oct 2013, 22:39:05 UTC - in response to Message 1422419.  
Last modified: 1 Oct 2013, 22:40:26 UTC

Aye if possible, a RAM expansion for caching would be best, but I would imagine more system RAM being better used for analysis, rather than caching. An actual RAM cache the same size as a modern cache SSD would be awesome, but REALLY pricey.

But I'm shooting in the dark here because I don't know enough. I can only toss out an idea and let the folks who know what the servers need for analysis look at it and see if there's any potential use in such a solution. SSD's as a caching solution aren't always terribly useful. Depends on the size of the SSD, and how repetitive the data is that is being pulled for analysis, as well as a few other potential bottlenecks or exceptions.
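
The point about repetitiveness can be made concrete with the usual weighted-average formula for a cache. The latencies below (0.1 ms for SSD, 10 ms for HDD) are illustrative assumptions, not measurements from the project's servers.

```python
def effective_latency_ms(hit_rate, cache_ms, disk_ms):
    """Average access time when a fraction hit_rate of reads is served from the cache."""
    return hit_rate * cache_ms + (1 - hit_rate) * disk_ms

SSD_MS = 0.1   # assumed random-read latency of the SSD cache
HDD_MS = 10.0  # assumed random-read latency of the mechanical disks

for hit_rate in (0.0, 0.5, 0.9, 0.99):
    print(f"hit rate {hit_rate:.2f}: {effective_latency_ms(hit_rate, SSD_MS, HDD_MS):.2f} ms")
```

With these numbers the cache only pays off strongly at high hit rates, which is why data that is rarely re-read gains little from an SSD cache.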
ID: 1422800 · Report as offensive
PAJTL_Computing
Volunteer tester

Joined: 24 Apr 00
Posts: 2
Credit: 160,462,107
RAC: 157
Australia
Message 1449803 - Posted: 4 Dec 2013, 6:10:18 UTC

Hi All,
Just my two cents worth.

I think what we need is another two copies of the science database. These copies would only hold the data for a current time frame [weekly outage?] and therefore only current work loads and IO.

At the predefined interval the data is copied from these working science database copies to the main and therefore the replica science database.

So to make it clearer: the working and working replica databases handle the day-to-day IO, while the main and main replica science databases see only IO from analysis.

Once a week or defined period the data is "moved" from working databases to main databases.
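
A minimal sketch of that periodic "move", using two in-memory SQLite databases purely as stand-ins for the working and main science databases. The `signals` table and its columns are invented for the example; the real schema and database engine are not described in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")                      # stands in for the main science database
conn.execute("ATTACH DATABASE ':memory:' AS working")   # stands in for the working copy

# Same table structure in both, as suggested above (hypothetical schema).
conn.execute("CREATE TABLE main.signals (id INTEGER PRIMARY KEY, power REAL)")
conn.execute("CREATE TABLE working.signals (id INTEGER PRIMARY KEY, power REAL)")

# A week's worth of day-to-day inserts lands in the working database only.
conn.executemany("INSERT INTO working.signals VALUES (?, ?)",
                 [(1, 0.5), (2, 1.5), (3, 2.5)])
conn.commit()

def move_working_to_main(conn):
    """Copy all working rows into main, then empty working, in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute("INSERT INTO main.signals SELECT * FROM working.signals")
        conn.execute("DELETE FROM working.signals")

move_working_to_main(conn)
print(conn.execute("SELECT COUNT(*) FROM main.signals").fetchone()[0],
      conn.execute("SELECT COUNT(*) FROM working.signals").fetchone()[0])
```

Doing the copy and the delete in a single transaction means a failure part-way through leaves the rows in the working database rather than losing or duplicating them.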

This has the added advantage that only minimal programming changes are needed to existing programs, jobs or the website, and you would be able to just "copy" the existing main science database table structure.

Of course you would require two new servers to handle the working and working replica databases, and these could be obtained for free? from the co-location facility
http://ist.berkeley.edu/services/catalog/datacenter/serverdonation/terms

In my opinion RAM expansion would not provide a large benefit as a complete table from the database would have to be cached and the current size of tables would prohibit this.

Thanks
ID: 1449803 · Report as offensive
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.