Proposal/idea for a near-real-time, low-resource, inexpensive NTPCkr

Mr. Kevvy (Crowdfunding Project Donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 1761382 - Posted: 1 Feb 2016, 19:42:28 UTC

This thread is intended as a discussion to bat ideas around and get feedback. I've posted it in Number Crunching, rather than SETI@Home Science where the other NTPCkr thread is, because this is a computing rather than astrophysical topic, and this forum seems to be where the most technically savvy folks with the greatest knowledge of SETI@Home's infrastructure can be found.

I hope that it proves fruitful, as I think that a working NTPCkr or equivalent is very important to the future of this project. If it seems to pass muster, or if any flaws found with it seem reasonably solvable, it could certainly be proposed to the SETI@Home team for consideration.

Note: Please do not reply quoting this entire post unless you are inserting commentary throughout! I will become rather annoyed. :^p

First, we need some preliminaries, of course, for new volunteers:

What is NTPCkr?
NTPCkr is the Near-Time Persistency Checker. Simply put, it's a server that takes our crunched results and quickly matches them up (correlates them) with the billions of other results on file to see whether multiple interesting signals have been found at one point in the sky. Ideally, we could also access it through our browsers to see where work units we've crunched are located on a sky map, and whether other interesting results have been found there.

Why is NTPCkr so important to SETI@Home?
NTPCkr's importance is two-fold:

1) Science: By virtue of being near-real-time, NTPCkr offers project scientists much faster responsiveness should a transient signal be detected, so that it can be investigated further before it is gone. This has become both easier and a higher priority with SETI@Home's access to the Green Bank Telescope, which can be pointed at interesting locations on demand, in contrast to Arecibo, which can only listen wherever the telescope happens to be pointed for other research.

2) Project participation and long-term success: One statistic I've seen the project team use is the unfortunate 8% retention rate of SETI@Home volunteers: 92% of everyone who ever did any project work is now gone, or at least isn't doing any more work. Over the years I've seen several BOINC and other distributed-computing projects fail or stagger from lack of participant enthusiasm, and the biggest contributing factor is always that participants don't believe their work is valuable to the project and being used. With the flood of new Breakthrough Listen data from Green Bank (and possibly elsewhere) requiring maximum participation, a captivating way of showing volunteers that their results are being analyzed, and may contain interesting data, would not only keep them from leaving but might also convince those who have left to return.

What's passing for NTPCkr now?
Currently, NTPCkr requires the entire science database to be exported and the job then run on a supercomputer or cloud computing facility. This takes weeks, so it should really just be called "PCkr", as it's anything but Near-Time. (Perhaps Far-Time, which would make it FTPCkr, so it would ideally be run on the IBM computing cluster in Poughkeepsie... and props to anyone who gets the reference.) The initial cost estimate for an NTPCkr owned by and local to the SETI@Home facility to replace this was $30-40K. I hope to demonstrate that it can be done for far less.

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

Synopsis
Doing a complete run of the entire SETI@Home science database to find persistent signals more than once is unnecessary and inefficient. Instead, only one initial run is required: because the output is itself a database, further runs can build on it incrementally (much as an incremental backup copies only files new or changed since the last backup), adding only results received since the last run rather than starting from scratch. These further runs can be done on a much smaller and cheaper computer at the SETI@Home facility than initially estimated, and eventually this process would catch up to the newest results in the science database, fulfilling the "Near Time" requirement of correlating results almost as soon as they are received.

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

Methodology


paddym is the SETI@Home science server for the science database of MB (Multi-Beam) results which NTPCkr would be examining. While I don't know the exact number of results it holds, we know that the total credit granted across SETI@Home is about 3.2 x 10^11, growing by about 1.5 x 10^8 credits per day. Thus, assuming the MB/AP credit ratio hasn't changed too drastically averaged over time, the science database contains 3.2 x 10^11 / 1.5 x 10^8 ≈ 2,133 days' worth of work. We can then estimate the number of results paddym holds, knowing that about 92,000 results are returned per hour (which means about 25 writes/second to the science database). This works out to (2.2 x 10^6 results per day x 2,133 days' worth) / 2 results per work unit ≈ 2.3 x 10^9 results in total.
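The arithmetic above can be checked in a few lines. This is a back-of-envelope sketch using only the figures quoted in this post; none of them are official project numbers.

```python
# Rough size estimate for the science database, from the figures above.
total_credit = 3.2e11      # total credit granted to date (estimate)
credit_per_day = 1.5e8     # credit granted per day (estimate)
results_per_hour = 92_000  # results returned per hour (estimate)

days_of_work = total_credit / credit_per_day      # ~2,133 days
writes_per_sec = results_per_hour / 3600          # ~25 writes/sec
results_per_day = results_per_hour * 24           # ~2.2 million/day

# Each work unit needs two matching results to validate.
total_results = results_per_day * days_of_work / 2

print(f"{days_of_work:,.0f} days of work, "
      f"{total_results:.2e} results on file")
```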



The current time/date is noted as the checkpoint for later use, and the SETI@Home science database is exported and sent to a cloud or supercomputing facility to run the NTPCkr job. Note that during this time the science database keeps growing, as it collects results continuously. This initial run needs to be done elsewhere, as it is computationally infeasible with SETI@Home's in-house resources (nor can it be distributed to volunteers, since each would need a large enough fraction of the science database in local storage that there is not enough network bandwidth to distribute it). The actual wall-clock time for the job is 35 days; the processing requirement is about 5,000 CPU days, i.e. a quad-core desktop would take about 5,000/4 = 1,250 days (roughly three and a half years) to complete it.



The resulting NTPCkr database from the run is returned to SETI@Home and installed on the NTPCkr computer. With the 35-day run time plus an estimated week of transport and setup each way, 49 days have elapsed, with the SETI@Home science database growing continuously during this time.



The cloud or supercomputing facility is no longer required. The SETI@Home science database is then copied to NTPCkr so that the complete database is available locally, as otherwise the correlations would be too slow.



NTPCkr is now started up and runs the same correlation job that was run on the cloud or supercomputing facility, correlating its local replica copy of the science database. The critical difference is that NTPCkr only runs the job on results that arrived between the checkpoint time/date noted above and the current time/date.



NTPCkr has now finished additions to its result database. However, it hasn't caught up with the science database yet, because in the time it took to do this plenty of new results arrived.



NTPCkr now sets the checkpoint time as the previous current time, and runs the job again. However from this point forward copying the entire science database over is no longer required. Instead, NTPCkr directly queries paddym for the new results since checkpoint and adds them to its local copy of the science database, then correlates them into the NTPCkr database.
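The incremental cycle just described is essentially a checkpoint loop. A minimal sketch, assuming hypothetical stand-in functions for the science-database query and the correlation job (none of this is actual NTPCkr code):

```python
from datetime import datetime, timezone

def incremental_pass(checkpoint, fetch_results_since, correlate):
    """One NTPCkr cycle: note the current time *before* fetching, pull
    every result newer than the previous checkpoint, fold the batch into
    the persistency database, and return the new checkpoint.  Taking the
    timestamp first means results arriving mid-correlation are picked up
    on the next pass rather than lost."""
    now = datetime.now(timezone.utc)
    correlate(fetch_results_since(checkpoint))
    return now

# Toy stand-ins for the real paddym query and correlation job:
science_db = [(datetime(2016, 2, 1, tzinfo=timezone.utc), "signal-A"),
              (datetime(2016, 2, 2, tzinfo=timezone.utc), "signal-B")]
ntpckr_db = []

checkpoint = datetime(2016, 1, 31, tzinfo=timezone.utc)
checkpoint = incremental_pass(
    checkpoint,
    lambda cp: [r for t, r in science_db if t > cp],
    ntpckr_db.extend)
print(ntpckr_db)  # both new results are now correlated
```

Run on a timer (say, once per minute once caught up), each pass touches only results newer than the last checkpoint, which is what keeps the steady-state load so small.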
This process repeats with NTPCkr converging on the state of the science database with each iteration. The time to convergence can be easily determined... no calculus required!
We know from the initial run that 5,000 CPU days correlated 2,133 days' worth of science database results, so results accumulate at the rate of 5,000/2,133 ≈ 2.34 CPU days per day. Thus, an NTPCkr with three or more CPU cores could keep up with the science database results as they arrive.
We need to allow for NTPCkr outages requiring catch-up to the last checkpoint, sub-optimal utilization and other issues, and that the NTPCkr computer is also expected to serve out local and web queries. I would therefore expect a minimum of a six-core Xeon.
There is initially a 49-day offset between the science database and the NTPCkr database. This is 49 x 2.34 = 114.66 CPU days of computation.
Every day the science database adds 2.34 CPU days of results to correlate, and NTPCkr's six-core Xeon reduces the backlog by 6 CPU days, a net reduction of 6 - 2.34 = 3.66 CPU days per day.
NTPCkr would thus catch up to the science database in 114.66/3.66 ≈ 31.3 days.
During this time it can be limited to querying the science database at (6 / 2.34) x 25 ≈ 64 queries/sec. (The 25 is the number of new results per second written to the science database.) This should not present an issue, considering the BOINC database handles roughly 20x as many queries/sec. with fewer resources.
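The convergence arithmetic from the steps above, collected in one place (all inputs are the estimates quoted in this post):

```python
cpu_days_per_day = 5000 / 2133   # ~2.34 CPU days of new results per day
cores = 6                        # the proposed six-core Xeon
initial_offset_days = 49         # export + run + return transit time

backlog_cpu_days = initial_offset_days * cpu_days_per_day    # ~114.7
net_reduction_per_day = cores - cpu_days_per_day             # ~3.66
days_to_catch_up = backlog_cpu_days / net_reduction_per_day  # ~31.4

writes_per_sec = 25              # new results/sec written to science DB
query_cap = cores / cpu_days_per_day * writes_per_sec        # ~64/sec

print(f"catch-up in {days_to_catch_up:.1f} days, "
      f"capped at {query_cap:.0f} queries/sec")
```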



After the 31.3 days, NTPCkr would then be near-real-time, checkpointing perhaps once per minute or whatever is determined to be a proper interval. At this point, it could be linked with muarae1 to serve web queries from anyone interested in checking results for correlation. If NTPCkr crashed or was otherwise cut off from the science database, it would catch up in 2.34 / 3.66 ≈ 0.64 of the outage time (new results keep arriving while it catches up), so for example if it were down for a day it would catch up in about 15 hours.

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

Proposed Costing/parts list

The initial NTPCkr spec. indicated 10TB of SSD and 512GB RAM. Given the differing methodology, with only 64 queries/sec. to the science database it doesn't seem that SSDs would be required. (This would also be 180 MBit/sec. even if complete work units were being extracted, which they aren't, so the built-in GBit networking on the mainboard would be more than sufficient.) A single six-core CPU has been shown to be enough, but an eight-core is only a couple of hundred dollars more, so would be worth the difference. As well, to facilitate further upgrades and mitigate a possible lack of resources, a dual-CPU board would be ideal. Not knowing how the requirements would change, the original 512GB RAM spec. should also be kept.
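The networking claim above can be sanity-checked. The ~0.35 MB per query is an assumed worst case (roughly a full multi-beam work unit; actual result rows would be far smaller):

```python
# Worst-case link load for the capped catch-up query rate.
queries_per_sec = 64
mb_per_query = 0.35  # assumption: each query returns a whole work unit

mbits_per_sec = queries_per_sec * mb_per_query * 8
print(f"{mbits_per_sec:.0f} MBit/s")  # well under GBit Ethernet
```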

SuperMicro rackmount with dual-Xeon LGA2011 mainboard and 950W PSU: $1,850
Xeon eight-core CPU: $700
512GB memory (32GB DDR4 ECC DIMM server memory @ $300 x 16) = $4,800 (well over half the cost... it appears that the price-fixing convictions haven't stopped RAM from being way too expensive!)
12TB hard disk (3 TB NAS SATA III HD @ $120 x 5 [one as a parity drive]) = $600

Total: $7,950. (This is between one-fifth and one-quarter the cost of full-correlation NTPCkr estimated at $30-40K. It also matches the actual cost of equivalent SETI@Home servers that I know of.)

Time estimate for raising this amount: One week with matching fundraising.
Mr. Kevvy
Message 1761389 - Posted: 1 Feb 2016, 20:05:57 UTC - in response to Message 1761388.  
Last modified: 1 Feb 2016, 20:07:21 UTC

Too late. Already on the way.


I doubt that is going to be used long-term, as it would be prohibitively expensive (I had a gander at Amazon's rates recently). Also, as indicated, even with a run of a few weeks it's anything but "Near Time", i.e. being able to click on your recently completed work after a few minutes to see how it matches up with others' in NTPCkr.
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1761546 - Posted: 2 Feb 2016, 8:57:36 UTC
Last modified: 2 Feb 2016, 9:00:58 UTC

I have suggested pretty much what was described in the first post of this thread, numerous times over the past couple of years.

There is at least one thing that I think is worth adding, though. If I'm remembering right, "renting some computing time in one of the multiple clouds that are available" (MS, Amazon, IBM... doesn't HP have one, too?) is a two-part problem: 1) it costs money, a lot of it; 2) for data-integrity concerns, and to prevent critics/skeptics from making false claims of falsified data, analysis needs to be kept in-house.

I have read from Matt more than once (and from both Richard and Josef, as well) that the main problem with why we aren't running NTPCkr right now is because of the massive I/O load and the massive queries to the DB, and it ends up putting the DB in a "deadlock" and normal function ceases until those massive queries finish running.


The alternative solution that I proposed multiple times, even though it would not be blazing fast: take any one of the weekly complete backups of the DB, put it on an isolated machine, and then let NTPCkr run for however long it needs in order to get "caught up". Then once it is caught up, give it a DB update for everything that has happened since it started this task. It will probably take another 2-3 weeks (or months) to chew through that. Then do that once or twice more, and once it is basically caught up, THEN it can run in "near real-time" just fine.


And here is the exact answer to the specs for an 'ideal server' for in-house analysis. "10 Terabytes of solid state storage and half a terabyte of RAM." Reason being: HUGE DB queries need TONS of RAM.

If you ever wanted to cringe at a price tag... go look at server boards that can handle 512-1024 GB of RAM. THEN...go look up how much it would cost to get the ECC Registered RAM to get at least 512GB of it (in the module size needed for the board.. probably 8 or 16GB modules, times however many). That's painful.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
Mr. Kevvy
Message 1761584 - Posted: 2 Feb 2016, 10:52:41 UTC - in response to Message 1761546.  

If you ever wanted to cringe at a price tag... go look at server boards that can handle 512-1024 GB of RAM. THEN...go look up how much it would cost to get the ECC Registered RAM to get at least 512GB of it (in the module size needed for the board.. probably 8 or 16GB modules, times however many). That's painful.


Both of those were in the price spec. at the bottom. As Chris noted the RAM would be in 16x32GB sticks. Here is an example dual-Xeon board rackmount for the quoted cost. It actually has 1TB of max. RAM capacity in 16 DIMM slots whereas the one I had seen originally had 512GB, and also has 10GBit Ethernet.

I guess someone from the SAH team would be the only ones who could say one way or another if the capped "catch-up" rate of 64 queries/sec. to the science server would cause an issue, but as indicated the BOINC database can be seen to get 1,200+ queries/sec. on the server status page (when I looked at it right now it was 1,360.)
Cosmic_Ocean
Message 1761617 - Posted: 2 Feb 2016, 12:15:29 UTC - in response to Message 1761584.  

Both of those were in the price spec. at the bottom. [...] I guess someone from the SAH team would be the only ones who could say one way or another if the capped "catch-up" rate of 64 queries/sec. to the science server would cause an issue, but as indicated the BOINC database can be seen to get 1,200+ queries/sec. on the server status page (when I looked at it right now it was 1,360.)

Noted. I... admittedly began to skim about halfway through all the pictures and cloud diagrams, and missed the spec list at the end.

As I said though, getting 512GB of RAM is not cheap. And Matt said before that it would need to be 10TB of SSDs (for the I/O and near-zero latency).

As far as the DB being able to "handle" 1,200 queries/sec: all of those queries are simple, quick look-ups, like one single machine reporting a small pile of completed tasks and asking for new work (reporting and requesting would be two separate queries, not to mention other queries that check that the tasks being assigned aren't also being given to another machine owned by that user account, etc.). Those are light-weight, inexpensive (for CPU and I/O) queries.

The type of query that NTPCkr needs to run requires access to the entire database for one single query, which could take hours or even days to finish running. The little tiny queries access only a few records in one table, for about the amount of time it takes for disk access to read those records, and then the query is done. The NTPCkr queries have to read every single record (in a DB that has billions of rows).

That's why going through the backlog is going to take a while, and once it gets caught-up, the NTPCkr server can act as a secondary replica and be fed "what has changed in the last 60 seconds" to chew on, without hammering the main DB.
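The contrast between the two query classes is easy to demonstrate with a toy SQLite table (purely illustrative; the project's science database is not SQLite): a keyed lookup hits an index and touches one row, while a correlation-style aggregate has to scan the whole table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE result (id INTEGER PRIMARY KEY, host INTEGER, power REAL)")
con.executemany("INSERT INTO result VALUES (?, ?, ?)",
                [(i, i % 100, i * 0.1) for i in range(10_000)])

# BOINC-style query: one row, served via the primary-key index (a SEARCH).
plan_lookup = con.execute(
    "EXPLAIN QUERY PLAN SELECT power FROM result WHERE id = 42").fetchall()

# NTPCkr-style query: an aggregate over every row in the table (a SCAN).
plan_scan = con.execute(
    "EXPLAIN QUERY PLAN SELECT host, MAX(power) FROM result GROUP BY host"
).fetchall()

print(plan_lookup)  # ...SEARCH...USING INTEGER PRIMARY KEY...
print(plan_scan)    # ...SCAN...
```

Both plans return instantly here, but on a table of billions of rows the SCAN is the one that runs for hours and starves other queries, which is exactly the deadlock problem described above.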
Zalster (Special Project $250 donor)
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1761686 - Posted: 2 Feb 2016, 16:31:13 UTC - in response to Message 1761617.  

Well I would be interested if and when it's decided.
Dr Grey
Joined: 27 May 99
Posts: 154
Credit: 104,147,344
RAC: 21
United Kingdom
Message 1761725 - Posted: 2 Feb 2016, 23:59:08 UTC

Considering the cost of RAM, have things moved forward enough now that it's reasonable to run an application like this from an SSD? I read an article earlier where three Samsung M.2 950 Pro SSDs were run in RAID on an Asus board, getting an insane bandwidth of around 3 GB/s. That's about a quarter of the speed of DDR3, but at a much smaller fraction of the cost.
tbret
Volunteer tester
Joined: 28 May 99
Posts: 3380
Credit: 296,162,071
RAC: 40
United States
Message 1761776 - Posted: 3 Feb 2016, 4:54:05 UTC

After Matt said something in his short YouTube video about the specs, I thought, "This is doable." We've raised money before.

Then I re-listened carefully.

"...no room to expand..."

I assume that if we gave them whatever it is they thought they needed, they still couldn't run it.

Getting it all to an SSD JBOD RAID with that much RAM... it could be SLOW RAM. Heck, with an I/O bottleneck it could be slow processors, too. The important thing would be the first part -> Get the data from the science DB onto a separate DB.

I think Mr. Kevvy has a good plan, but then... what do I know?

Honestly, I think there is a general lack of urgency. You would think that someone would have just "floated a trial balloon" to see if the hardware was doable if they really wanted it.
Mr. Kevvy
Message 1762076 - Posted: 4 Feb 2016, 3:39:22 UTC
Last modified: 4 Feb 2016, 3:42:37 UTC

The type of queries that NTPCkr needs to run.. needs access to the entire database..for one single query, that could take hours or even days to finish running.


It doesn't seem that way... again, it takes 5,000 CPU days to correlate 2,133 actual days' results, so 2.34 or more cores could do it in real time. The queries must be returning smaller results, and I expect that as much as possible of that database is kept in RAM, hence the huge memory requirement. I would imagine that the sky map is "pixelized" so that each object has a hard-and-fast location that can be indexed for instantaneous retrieval, rather than everything having a long-decimal RA + Dec. location that requires extensive searching to find nearby objects that may be matches... that would just be too slow.
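A pixelized sky index of the kind guessed at here can be as simple as hashing RA/Dec into fixed-size cells, so co-located signals are found by probing a cell and its neighbours instead of searching coordinates. This is an illustrative sketch only; real pixelization schemes such as HEALPix use equal-area pixels and handle the poles properly, and nothing here reflects NTPCkr's actual internals.

```python
from collections import defaultdict

CELL_DEG = 0.1  # cell size; naive grid cells shrink in RA near the poles

def cell(ra_deg, dec_deg):
    """Map a sky position to an integer cell ID for O(1) hash lookup."""
    return (int(ra_deg / CELL_DEG), int((dec_deg + 90) / CELL_DEG))

sky_index = defaultdict(list)

def add_signal(ra, dec, signal):
    sky_index[cell(ra, dec)].append(signal)

def signals_near(ra, dec):
    """Return signals in the same cell and its eight neighbours."""
    cx, cy = cell(ra, dec)
    return [s for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            for s in sky_index[(cx + dx, cy + dy)]]

add_signal(201.37, -43.02, "gaussian-1")
add_signal(201.41, -43.05, "spike-7")  # a nearby detection
add_signal(12.00, 30.00, "pulse-9")    # far away on the sky

print(signals_near(201.38, -43.03))    # only the two co-located signals
```

The point of the scheme is that a persistency check becomes a handful of dictionary probes rather than a range search over billions of coordinate pairs.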

This would be an ideal question for the SAH team to provide more info on...

Well I would be interested if and when it's decided.


I will definitely pass the word. :^)

Considering the cost of RAM, have things moved forward enough now that its reasonable to run an application like this from an SSD? I read an article earlier where 3 Samsung M.2 950 pro SSDs were run in RAID on an Asus board getting insane bandwidth of around 3 GB/s. That's about a 1/4 of the speed of DDR3 but at a much smaller fraction of the cost.


The original spec. for NTPCkr was 512GB of RAM and 10TB of SSD, and as you noted, RAIDed SSD is so fast that I'm wondering why both are needed. Then again, that spec. is for an NTPCkr that does it all.

I think Mr. Kevvy has a good plan, but then... what do I know?


Enough that I say thank you. :^)

Honestly, I think there is a general lack of urgency. You would think that someone would have just "floated a trial balloon" to see if the hardware was doable if they really wanted it.


I agree; I think they are just too busy elsewhere, and will be for the immediate future with all of the changes happening. The "cloud" NTPCkr happened because an intern worked on it, but a local one requires a substantial development investment that I doubt can be made right now. Still, it's worth floating this idea if it can be done for a small fraction of the cost... it may be an ideal project for the next intern or post-grad.

©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.