LotzaCores and a GTX 1080 FTW

Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1791668 - Posted: 29 May 2016, 11:37:20 UTC - in response to Message 1791664.  

Look in the Event Log again. Do you see "This computer has reached a limit of tasks in progress", or words to that effect?

That again is in place to protect the servers, and apportion the work more evenly among the ~150K volunteers. If there was no restriction, and everybody turned every knob up to 11 (as they did at one point), the fastest hosts would be caching thousands of tasks. Each task requires an entry in the database at Berkeley: thousands of tasks times thousands of computers times many days equals a hugely bloated and inefficient database.
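
To put rough numbers on that, here is a minimal sketch of the arithmetic, using the ~150K volunteers mentioned above and treating the per-host task counts as illustrative round figures rather than official statistics:

# Rough illustration of why per-host caps matter for the BOINC task database.
# The host count is the ~150K figure quoted above; the per-host task counts
# are just illustrative round numbers, not project statistics.

def in_flight_rows(hosts, tasks_per_host):
    # Each task in progress is one row in the task table at Berkeley.
    return hosts * tasks_per_host

print(f"{in_flight_rows(hosts=150_000, tasks_per_host=100):,} rows with a 100-task cap")
print(f"{in_flight_rows(hosts=150_000, tasks_per_host=1_000):,} rows if everyone cached 'thousands' of tasks")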

I'll be shouted down for this, but no permanently-connected SETI host *needs* a cache longer than about six hours, to cover 'Maintenance Tuesdays'. Make that 12 hours, to cover the congestion period after the end of maintenance, but no more.
ID: 1791668 · Report as offensive
Al · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1791672 - Posted: 29 May 2016, 11:55:16 UTC - in response to Message 1791668.  
Last modified: 29 May 2016, 12:22:39 UTC

Richard, thanks for the explanation, I will check it out. I wasn't sure if it had anything to do with this being a new installation and the project not yet 'trusting' this new host that is throwing bunches of tasks back at it. I suppose you're correct: as long as the cache is sufficient for any (normal) downtime, that should be enough. It's just a little weird to look at that screen and see that half of all the tasks listed are processing at one time. Kinda neat, actually. :-)

*edit* Richard, I was just thinking about this, remembering a comment on another thread about how just being here on the forum puts us in the less-than-1% crowd, as opposed to the set-and-forget folks who make up the vast majority of users. If relatively few of us (in the scheme of things) would even bother to turn it up to 11 (and the current limitation appears to have turned it down to 1), would nudging that value up to, say, 3 (raising the limit to maybe 200 instead of 100 for those who can specifically use it) make that big a difference? I guess the question is, how close to the edge is the database running currently? And if it's close, does that mean we will have issues in the near future because of all the new data being generated? Does that then mean we need to upgrade our database for such things - especially if we somehow get a bump in the number of users, which would certainly affect the number of tasks out in the field?

I don't have a super good grasp of most of the background processes of the project, or their capabilities, or how close to the edge any of them are running, so all of these statements/questions are coming from a place of ignorance; that's why I am asking them. Thanks for tolerating my lack of knowledge - I can imagine it gets old repeating the same things to the masses over and over again.

ID: 1791672 · Report as offensive
Profile Brent Norman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1791673 - Posted: 29 May 2016, 11:56:38 UTC

You are limited to 100 tasks per device (CPU or GPU). For your current runtimes, I calculate you have about a 7.8-hour cache.
ID: 1791673 · Report as offensive
Al · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1791674 - Posted: 29 May 2016, 12:02:19 UTC - in response to Message 1791673.  

You are limited to 100 tasks per device (CPU or GPU). For your current runtimes, I calculate you have about a 7.8-hour cache.

Well, it's not 12 hours, but it should hopefully be enough to get me through most Tuesdays fairly unscathed? When I install the GPU, I will be dedicating a few cores to properly feed it, which will help extend it a little as well. Thanks for doing the math for me!

ID: 1791674 · Report as offensive
rob smith · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22158
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1791681 - Posted: 29 May 2016, 12:30:46 UTC

The limit is NOT 100 per 24 hours, but 100 tasks reported as "in progress". This has been the case for a few years.
The only time you are capped per day is when a computer returns a number of error results, in which case the concurrent limit is reduced - as I had a couple of weeks back when I screwed up an update and had an invalid driver installed on one of my rigs. Thankfully I caught that quickly and the cap was removed within a few hours as the computer concerned returned enough valid results.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1791681 · Report as offensive
Al · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1791683 - Posted: 29 May 2016, 12:42:04 UTC - in response to Message 1791681.  

Rob, yep, I know, but as I am running 48 tasks at any given time, when I get to that limit (100 x 2 CPUs) I am running 25% of all my available tasks at one time, and, as was mentioned, have less than an 8-hour cache. Which _should_ be enough to cover most eventualities, but it would be nice to have a little bigger buffer.

ID: 1791683 · Report as offensive
Profile Brent Norman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1791687 - Posted: 29 May 2016, 12:57:56 UTC - in response to Message 1791683.  

Actually, my math was wrong on that - I was thinking 32 cores.

With 48 running, you could say half of the 100 allotted tasks are in progress, and those 48 'in progress' tasks can be assumed to be about 50% completed on average.

So you are probably closer to 3.75 - 4 hours.

Don't worry though - with a 1080 in there, the cache might work out to even less than that :)
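
If you want to redo that back-of-the-envelope estimate with your own numbers, here is a minimal sketch of the arithmetic; the average runtime below is just a placeholder, so plug in the figure from your own task list:

# Rough cache-length estimate using the assumptions from this exchange:
# 100 tasks on hand, 48 running at once, and the running ones ~50% complete on average.
# avg_runtime_h is a placeholder value, not a measured figure.

def cache_hours(on_hand=100, running=48, avg_runtime_h=3.0, fraction_done=0.5):
    # Remaining work, expressed in whole-task equivalents
    remaining = (on_hand - running) + running * (1.0 - fraction_done)
    # With `running` tasks crunched in parallel, wall-clock hours of work left:
    return remaining * avg_runtime_h / running

print(f"~{cache_hours():.1f} hours of work on hand")  # ~4.8 h with these placeholder numbers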
ID: 1791687 · Report as offensive
Al · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1791691 - Posted: 29 May 2016, 13:11:12 UTC - in response to Message 1791687.  

Well, actually it is 24 tasks per CPU, so with 2 CPUs it should be 200 tasks in the hopper waiting to crunch on all 48 cores, so I believe your initial calcs were probably pretty close. And it will be interesting to see what effect the 1080 has on it.

ID: 1791691 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1791692 - Posted: 29 May 2016, 13:20:13 UTC - in response to Message 1791691.  

BOINC only counts the CPU cores, and doesn't know anything about the packaging. It wouldn't even know the difference between 24 hyperthreaded cores in one package and 24 single-threaded cores in each of two packages.

Better to think of the "in progress" limit as '100 tasks assigned to CPU applications'.
ID: 1791692 · Report as offensive
Al · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1791696 - Posted: 29 May 2016, 13:33:23 UTC - in response to Message 1791692.  
Last modified: 29 May 2016, 13:40:09 UTC

So to make sure I have this correct: say my name was Gates and I had an 8-way system set up running BOINC, with 12 cores per CPU (24 with HT), I would actually have about half the cores sitting around idling, because the CPU side of things only gets 100 tasks max, regardless of cores? And the same goes for the GPU side, regardless of how many GPUs are installed in a single system? (Though obviously you can't hit the GPU tasks-in-progress limit of the GPU cache, at least not in any system that _I_ can afford.. lol) Just want to make sure I have this correct for future reference.

ID: 1791696 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1791700 - Posted: 29 May 2016, 13:50:29 UTC - in response to Message 1791696.  

That's right for CPUs, but not for GPUs. BOINC does count those separately, so with 2 GPUs you could have 198 tasks sitting around doing nothing (or 196, if two are running on each GPU, and so on). Remember, as soon as you return a completed task, you are eligible to receive another one in return.

And also remember that all server-side limits are entirely at the project's discretion, and liable to change without notice (or even notification). If, for example, Breakthrough Listen got a lot of publicity: maybe one of their other programs caught an interesting candidate signal. And say the public at large mistook the announcement for a SETI@Home discovery, and arrived here in their droves. I'd actually expect the project staff to *reduce* the 'per device' limits, to make sure there was enough work to go round.

At least in the short term. It makes sense when you look at it from the project's end of the telescope.
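
This isn't actual BOINC scheduler code, but the rule as described in this thread can be summed up in a tiny sketch - CPUs count as one device class no matter how many packages or cores, while each GPU gets its own allocation (the function name and the 100-task figure are simply taken from the posts above):

# The 'in progress' limits as described in this thread - not BOINC source code.

def max_in_progress(cpu_packages: int, gpus: int, per_device_limit: int = 100) -> int:
    # cpu_packages is intentionally ignored: BOINC treats all CPU cores,
    # however they are packaged, as a single device class.
    cpu_allowance = per_device_limit            # '100 tasks assigned to CPU applications'
    gpu_allowance = per_device_limit * gpus     # GPUs are counted separately, per GPU
    return cpu_allowance + gpu_allowance

print(max_in_progress(cpu_packages=2, gpus=0))  # dual-CPU box with no GPU: 100, not 200
print(max_in_progress(cpu_packages=2, gpus=1))  # after adding the GTX 1080: 200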
ID: 1791700 · Report as offensive
Al · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1791711 - Posted: 29 May 2016, 14:45:10 UTC - in response to Message 1791700.  

Good info to know, thanks. Last thought: is the problem the size of the database in terms of its software capabilities, its hardware limitations, or storage limitations? Do you happen to know what the bottleneck is, and do you have any thoughts on what would be needed to address and correct it, so we wouldn't run into problems with it again for many years? Money, obviously, but on what? Thanks!

ID: 1791711 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1791728 - Posted: 29 May 2016, 15:48:23 UTC - in response to Message 1791711.  

Keeping it lean, mean, and efficiently indexed, I guess. Remember that we're talking about the BOINC task database: at the time of writing, that's recording 76,899 results reported each hour [I'm suspicious of that figure - it doesn't seem to have changed since we started running guppies - but let's run with it for now]. That equates to about 13 million rows being deleted between each weekly maintenance session, and the same number of new rows being added at the same time. And that's just the task table - add about half that again, say 6 million rows, being added to and deleted from the workunit table each week as well.

That leads to a very 'holey' database: I think Matt has written (though many years ago) that the greatest part of the weekly maintenance time is taken up in compacting the records and re-indexing them so that the response time doesn't get too sluggish. I don't know if you've ever worked with a database of that size changing that quickly - I certainly haven't - but it frightens me. I reckon the staff in our lab can probably count themselves among the best in the world at keeping that show on the road, with so few staff and so little money. Unless you happen to have a world-class database consultant you can lend them - for free - I think it's probably best to leave them to do what they know best.
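
For anyone who wants to check the arithmetic behind those weekly figures, here is a quick sketch (the 76,899-per-hour rate is the one quoted above; the rest is just multiplication):

# Back-of-the-envelope check on the weekly database churn described above.
results_per_hour = 76_899        # reported rate at the time of writing
hours_per_week = 24 * 7

task_rows_per_week = results_per_hour * hours_per_week    # ~12.9 million
workunit_rows_per_week = task_rows_per_week // 2           # roughly half, per the post

print(f"task table:     ~{task_rows_per_week / 1e6:.1f} million rows added and deleted per week")
print(f"workunit table: ~{workunit_rows_per_week / 1e6:.1f} million rows added and deleted per week")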
ID: 1791728 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1791730 - Posted: 29 May 2016, 15:53:34 UTC - in response to Message 1791711.  

Good info to know, thanks. Last thought: is the problem the size of the database in terms of its software capabilities, its hardware limitations, or storage limitations? Do you happen to know what the bottleneck is, and do you have any thoughts on what would be needed to address and correct it, so we wouldn't run into problems with it again for many years? Money, obviously, but on what? Thanks!

I think All of the Above is probably the most accurate answer regarding the db limitations. I don't believe anything has changed since you were asking about the limits a few weeks ago.
It was mentioned, maybe last year, that other database options were being explored. Breakthrough Listen may have interrupted that process. It was also mentioned they were looking to reduce their server footprint. So smaller more powerful servers may go hand in hand with new db software.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1791730 · Report as offensive
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1791732 - Posted: 29 May 2016, 15:57:34 UTC - in response to Message 1791728.  

Keeping it lean, mean, and efficiently indexed, I guess. Remember that we're talking about the BOINC task database: at the time of writing, that's recording 76,899 results reported each hour [I'm suspicious of that figure - it doesn't seem to have changed since we started running guppies - but let's run with it for now]. That equates to about 13 million rows being deleted between each weekly maintenance session, and the same number of new rows being added at the same time. And that's just the task table - add about half that again, say 6 million rows, being added to and deleted from the workunit table each week as well.

That leads to a very 'holey' database: I think Matt has written (though many years ago) that the greatest part of the weekly maintenance time is taken up in compacting the records and re-indexing them so that the response time doesn't get too sluggish. I don't know if you've ever worked with a database of that size changing that quickly - I certainly haven't - but it frightens me. I reckon the staff in our lab can probably count themselves among the best in the world at keeping that show on the road, with so few staff and so little money. Unless you happen to have a world-class database consultant you can lend them - for free - I think it's probably best to leave them to do what they know best.

And folks wonder why the weekly outage takes so long to accomplish.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1791732 · Report as offensive
Al · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1791734 - Posted: 29 May 2016, 16:08:53 UTC

Thanks for the detailed info, and the reminder. I know they are doing an amazing job - sometimes it's a wonder it's still working at all - but talent and dedication are good traits to have, and our team has them in spades. Well, nothing but kudos to them, and I'll just happily crunch away knowing that it is in very good hands, and that whatever changes are made are done with the best interests of the project as a whole in mind, regardless of the impact on the individual, which is how it should be.

ID: 1791734 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1791746 - Posted: 29 May 2016, 16:42:04 UTC - in response to Message 1791663.  
Last modified: 29 May 2016, 16:47:10 UTC

Well, just got up, so I went down, paused and then exited BOINC, uninstalled and then reinstalled Lunatics, and it seemed to start right where it left off, with no drama. The only thing slightly unusual was that Windows security asked if it was alright to allow BOINC through the firewall; I've never seen that one before.

I am running Hynix 1866 memory, one stick per bank, to allow the system to run it at full speed, as I read that more sticks = slower speeds. And 32 GB is more than enough for what I am running on this.

So far, looking at temps, it appears they may have crept up a few degrees, maybe an average of 3-5, but it looks like they are still for the most part at 50°C or below, except on 3-4 cores out of 24 on each CPU. But I suppose that will vary depending on the type of WU being processed.

I have seen several benchmarks over the years that have shown an increased latency as the number of DIMMs per memory channel is increased. So I try to stick to one DIMM per channel as well.

Normal AR tasks with the AVX app appear to be running slightly faster, with run times as low as 2 hr 50 min. Once you start using GPUs that could change: the GPUs could either choke the CPU tasks or be restricted by already-saturated memory I/O. Then again, maybe nothing will change. This system has been defying all previous attempts at logical behavior thus far.

Is the increase in temp you mentioned after using the AVX app? That would normally be reasonable behavior.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1791746 · Report as offensive
Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65709
Credit: 55,293,173
RAC: 49
United States
Message 1791754 - Posted: 29 May 2016, 17:17:17 UTC - in response to Message 1791654.  
Last modified: 29 May 2016, 17:19:02 UTC


I suspect that AVX may prove to be slower on that system. With 48 tasks at once that is a lot to stuff down the memory pipeline all at once.
It may not be the most correct way to say it, but I think higher level SIMD instructions tend to be more memory intensive.
I was already very surprised by the performance of the E5 v2 CPUs versus the E5 previous generation. So I'm split 50/50 on how AVX will compare to SSE3 & will have to find out if they are using DDR3 1600 or 1866 memory.

AVX apps proved to be the most efficient on my i5-4670K systems with DDR3 1600 memory.



. . FWIW, on my i5 6400 with DDR4 2333 RAM, AVX works a treat, almost halving the runtimes and not running that hot - it stays mainly in the 50s. But efficiency drops off sharply if I run crunching on all 4 cores (all four cores flat-line at 100% and runtimes increase). So I just run 3 and live with a happy PC.

Mid-to-low 50s (°C) here on an i7 3820 in Turbo mode (3.81 GHz), with 4 WUs running on the CPU all at once; that Alphacool Eisberg 240 works pretty well on the Asus RIVE (X79 chipset). I also have an EVGA X79 DARK, complete with a 3820 too, and I recently ordered a 4820K via the PayPal Credit line I have on eBay (I'd gotten a $398.83 bump in my credit limit). Plus I have 2 WUs running on the PNY LC 580 - sure, it's not a 980 or a 970 or even a 1080, but it will do for a bit, and then I'll upgrade to an EVGA 1070 Hybrid, or I'll make one if EVGA doesn't.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1791754 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1791787 - Posted: 29 May 2016, 19:12:31 UTC - in response to Message 1791730.  

Good info to know, thanks. Last thought: is the problem the size of the database in terms of its software capabilities, its hardware limitations, or storage limitations? Do you happen to know what the bottleneck is, and do you have any thoughts on what would be needed to address and correct it, so we wouldn't run into problems with it again for many years? Money, obviously, but on what? Thanks!

I think All of the Above is probably the most accurate answer regarding the db limitations. I don't believe anything has changed since you were asking about the limits a few weeks ago.
It was mentioned, maybe last year, that other database options were being explored. Breakthrough Listen may have interrupted that process. It was also mentioned they were looking to reduce their server footprint. So smaller more powerful servers may go hand in hand with new db software.

As a matter of fact, regarding the database performance issues.. Matt recently said
[...]we are making some huge advances in reducing the science database. All the database performance problems I've been moaning about for years are finally getting solved, or worked around, basically. This is really good news.

So that sounds like they must have found a solution/workaround for the low I/O performance that has been plaguing the database for years. Basically.. all the whitepapers for the hardware, software, and the db itself all said that it should be just fine above a certain I/O threshold, and in theory, they should have been well above that, but in practice, they couldn't get anywhere near the I/O requirements. I remember hearing that it might have been the RAID controller (that was like.. two years ago?) but never heard anything in the form of progress on that front since then--understandably so, because their time for troubleshooting something that isn't technically broken is very limited.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1791787 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1791789 - Posted: 29 May 2016, 19:23:28 UTC - in response to Message 1791787.  
Last modified: 29 May 2016, 19:23:39 UTC

Not sure if it's related to those DB performance changes at all, but while looking into some other BOINC-related odd behaviour changes (some things started behaving 'more Windowsy'), I stumbled on the fact that recent Linux kernel changes include a near-complete redo of the block-layer I/O device interfaces, supporting better scaling and threaded I/O. Quite possibly some sort of filesystem and/or kernel change combination might be part of that.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1791789 · Report as offensive