LotzaCores and a GTX 1080 FTW

kittyman Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 50143
Credit: 951,149,780
RAC: 233,514
United States
Message 1791732 - Posted: 29 May 2016, 15:57:34 UTC - in response to Message 1791728.  

Keeping it lean, mean, and efficiently indexed, I guess. Remember that we're talking about the BOINC task database: at the time of writing, that's recording 76,899 results reported each hour [I'm suspicious of that figure - it doesn't seem to have changed since we started running guppies - but let's run with it for now]. That equates to about 13 million rows being deleted between each weekly maintenance session, and the same number of different rows being added at the same time. And that's just the task table - add about half that again, say 6 million rows being added to and deleted from the workunit table each week as well.
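
As a quick sanity check on the figures above, here is a minimal sketch of the arithmetic. It assumes the quoted 76,899 results per hour holds steady for a full 7-day week; the variable names are purely illustrative.

```python
# Rough weekly row churn, assuming the quoted hourly rate holds all week.
RESULTS_PER_HOUR = 76_899   # figure quoted in the post
HOURS_PER_WEEK = 24 * 7     # assumption: one maintenance session every 7 days

# Each reported result corresponds to one task row that is eventually
# deleted and replaced by a newly created row.
task_rows_per_week = RESULTS_PER_HOUR * HOURS_PER_WEEK

# The post estimates workunit-table churn at about half the task-table churn.
workunit_rows_per_week = task_rows_per_week // 2

print(f"task rows per week:     {task_rows_per_week:,}")
print(f"workunit rows per week: {workunit_rows_per_week:,}")
```

That works out to roughly 12.9 million task rows and 6.5 million workunit rows per week, in line with the rounded figures in the post.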

That leads to a very 'holey' database: I think Matt has written (though many years ago) that the greatest part of the weekly maintenance time is taken up in compacting the records and re-indexing them so that the response time doesn't get too sluggish. I don't know if you've ever worked with a database of that size changing that quickly - I certainly haven't - but it frightens me. I reckon the staff in our lab can probably count themselves among the best in the world at keeping that show on the road, with so few staff and so little money. Unless you happen to have a world-class database consultant you can lend them - for free - I think it's probably best to leave them to do what they know best.

And folks wonder why the weekly outage takes so long to accomplish.
"The secret o' life is enjoying the passage of time." 1977, James Taylor
"With cats." 2018, kittyman

ID: 1791732 · Report as offensive
Al Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Joined: 3 Apr 99
Posts: 1676
Credit: 395,499,334
RAC: 286,546
United States
Message 1791734 - Posted: 29 May 2016, 16:08:53 UTC

Thanks for the detailed info, and the reminder. I know they are doing an amazing job; sometimes it's a wonder it's still working at all, but talent and dedication are good traits to have, and our team has them in spades. Well, nothing but kudos to them, and I'll just happily crunch away knowing that it is in very good hands, and that whatever changes are made are made with the best interests of the project as a whole in mind, regardless of the impact on the individual, which is how it should be.

ID: 1791734 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6530
Credit: 190,596,941
RAC: 14,745
United States
Message 1791746 - Posted: 29 May 2016, 16:42:04 UTC - in response to Message 1791663.  
Last modified: 29 May 2016, 16:47:10 UTC

Well, just got up, so I went down, paused and then exited BOINC, uninstalled and then reinstalled Lunatics, and it seemed to start right where it left off, with no drama. The only slightly unusual thing was that Windows security asked if it was alright to allow BOINC through the firewall; I've never seen that one before.

I am running Hynix 1866 memory, in singles per bank to allow the system to utilize it at the full speed, as I read that more sticks = slower speeds. And 32 gig is more than enough for what I am running on this.

So far, looking at temps, it appears that they may have crept up a few degrees, maybe an average of 3-5, but it looks like they are still for the most part at 50 or below except on 3-4 cores out of 24 on each CPU. But I suppose that will vary depending on the type of WU being processed.

I have seen several benchmarks over the years that have shown an increased latency as the number of DIMMs per memory channel is increased. So I try to stick to one DIMM per channel as well.

Normal AR tasks with the AVX app appear to be running slightly faster, with run times as low as 2hr50min. Once you start using GPUs that could change. The GPUs could either choke the CPU tasks or be restricted by already saturated memory I/O. Then again, maybe nothing will change. This system has been defying all previous attempts at logical behavior thus far.

Is the increase in temp you mentioned after using the AVX app? That would normally be reasonable behavior.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group today!
ID: 1791746 · Report as offensive
Profile zoom314
Volunteer tester
Joined: 30 Nov 03
Posts: 61237
Credit: 46,159,923
RAC: 11,428
United States
Message 1791754 - Posted: 29 May 2016, 17:17:17 UTC - in response to Message 1791654.  
Last modified: 29 May 2016, 17:19:02 UTC


I suspect that AVX may prove to be slower on that system. With 48 tasks at once that is a lot to stuff down the memory pipeline all at once.
It may not be the most correct way to say it, but I think higher level SIMD instructions tend to be more memory intensive.
I was already very surprised by the performance of the E5 v2 CPUs versus the E5 previous generation. So I'm split 50/50 on how AVX will compare to SSE3 & will have to find out if they are using DDR3 1600 or 1866 memory.

AVX apps proved to be the most efficient on my i5-4670K systems with DDR3 1600 memory.



FWIW, on my i5 6400 with DDR4 2333 RAM, AVX works a treat, almost halving the runtimes and not running that hot; it stays mainly in the 50s. But efficiency drops off sharply if I run crunching on all 4 cores (all four cores flatline at 100% and runtimes increase). So I just run 3 and live with a happy PC.

Mid to low 50s (°C) here on an i7 3820 in Turbo mode (3.81GHz), with 4 WUs running on the CPU all at once; that Alphacool Eisberg 240 works pretty well on the Asus RIVE (X79 chipset). I also have an EVGA X79 DARK, complete with a 3820 too, and I recently ordered a 4820K via the PayPal Credit line that I have on eBay, after my credit limit was raised by $398.83. Plus I have 2 WUs running on the PNY LC 580. Sure, it's not a 980 or a 970 or even a 1080, but it will do for a bit; then I'll upgrade to an EVGA 1070 Hybrid, or I'll make one if EVGA doesn't.
My Computer Builds and Other Projects
My Amazon Wishlist
ID: 1791754 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2949
Credit: 11,122,683
RAC: 300
United States
Message 1791787 - Posted: 29 May 2016, 19:12:31 UTC - in response to Message 1791730.  

Good info to know, thanks. Last thought: is the problem the size of the database in terms of its software abilities, its hardware limitations, or its storage limitations? Do you happen to know what the bottleneck is, and have any thoughts on what would be needed to address and correct it, so we wouldn't run into problems with it again for many years? Money, obviously, but on what? Thanks!

I think All of the Above is probably the most accurate answer regarding the db limitations. I don't believe anything has changed since you were asking about the limits a few weeks ago.
It was mentioned, maybe last year, that other database options were being explored. Breakthrough Listen may have interrupted that process. It was also mentioned they were looking to reduce their server footprint. So smaller more powerful servers may go hand in hand with new db software.

As a matter of fact, regarding the database performance issues.. Matt recently said
[...]we are making some huge advances in reducing the science database. All the database performance problems I've been moaning about for years are finally getting solved, or worked around, basically. This is really good news.

So that sounds like they must have found a solution/workaround for the low I/O performance that has been plaguing the database for years. Basically.. all the whitepapers for the hardware, software, and the db itself all said that it should be just fine above a certain I/O threshold, and in theory, they should have been well above that, but in practice, they couldn't get anywhere near the I/O requirements. I remember hearing that it might have been the RAID controller (that was like.. two years ago?) but never heard anything in the form of progress on that front since then--understandably so, because their time for troubleshooting something that isn't technically broken is very limited.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1791787 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1791789 - Posted: 29 May 2016, 19:23:28 UTC - in response to Message 1791787.  
Last modified: 29 May 2016, 19:23:39 UTC

Not sure if it's related to those DB performance changes at all, but while looking into some other BOINC-related odd behaviour changes (some things started behaving 'more Windowsy'), I stumbled on the fact that the recent Linux kernel changes include a near-complete redo of the block layer IO device interfaces, supporting better scaling/threaded IO. Quite possibly some sort of filesystem and/or kernel change combination might be part of that.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1791789 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 12347
Credit: 127,047,834
RAC: 35,883
United Kingdom
Message 1791794 - Posted: 29 May 2016, 19:42:15 UTC - in response to Message 1791787.  
Last modified: 29 May 2016, 19:43:39 UTC

As a matter of fact, regarding the database performance issues.. Matt recently said [...]we are making some huge advances in reducing the science database. All the database performance problems I've been moaning about for years are finally getting solved, or worked around, basically. This is really good news.

Which is great - but the recent discussion has been about the BOINC task/workunit processing database. The science database is the (huge) repository of all the signals found since the project started - it doesn't affect day-to-day matters like 'tasks in progress' limits at all.
ID: 1791794 · Report as offensive
Al Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Joined: 3 Apr 99
Posts: 1676
Credit: 395,499,334
RAC: 286,546
United States
Message 1791796 - Posted: 29 May 2016, 19:50:07 UTC - in response to Message 1791746.  

Is the increase in temp you mentioned after using the AVX app? That would normally be reasonable behavior.

Yep, and just checked it a bit ago, still hanging in there around 50 after installing the AVX app, so I am pretty happy with the way it's been performing so far. And about defying logical behavior, well, that figures and is about par for me. lol

I am interested in seeing how the GPU affects it, and SuperMicro has a lot of information about memory and the different configs/speeds on their site and in their manual. It truly is a server board, and as such they have taken their documentation seriously. This is the first server board I've bought in probably 20-some years, and the last one I bought was to roll my own server and install Novell Netware 4.01. I had an account with Tech Data, and 4.0 had just come out. All I'll say is that I thought I wanted it because of NDS, because that made sense to me logically as opposed to bindery, but I think they said I was one of the 1st 10 customers to get it in the country at that time. I had a direct support line to Novell, and talk about half baked... I was one of their unofficial beta testers, it turned out, because I had run into things that they had never seen before, even after buying an Intel-branded server (this was the 90s, remember) to try and get it to work. I think I still have that thing somewhere down in the basement; I should try firing it up one day for old times' sake.

Anywho, not sure if other brands of server boards have this thorough of docs, but I have to say these seem to cover it pretty well, and the couple times I called them before the purchase with questions, they seemed to have their stuff in a group, and I was off the phone in less than 5 mins both times.

ID: 1791796 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2949
Credit: 11,122,683
RAC: 300
United States
Message 1791868 - Posted: 29 May 2016, 22:23:57 UTC - in response to Message 1791794.  

As a matter of fact, regarding the database performance issues.. Matt recently said [...]we are making some huge advances in reducing the science database. All the database performance problems I've been moaning about for years are finally getting solved, or worked around, basically. This is really good news.

Which is great - but the recent discussion has been about the BOINC task/workunit processing database. The science database is the (huge) repository of all the signals found since the project started - it doesn't affect day-to-day matters like 'tasks in progress' limits at all.

Mmm.. noted. I must have misread that. But it is quite likely that maybe some of the solutions for the performance issues of the science DB can translate to the BOINC DB.

I remember the limits were put in place because of the lower-than-expected I/O performance of the DB, which was causing slowdowns to the point of outright crashing. I remember the thought being "if we get more disk spindles, we should be able to increase the I/O," but then it turned out that it was looking more like software/kernel limitations and not so much the hardware, but it was also suspected that it may have been the RAID controller itself and the drivers for it.

It's just been a long time since there were any details about it, so it's all a bit fuzzy now.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1791868 · Report as offensive
Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 8516
Credit: 20,301,200
RAC: 8,945
United States
Message 1791923 - Posted: 30 May 2016, 1:22:02 UTC - in response to Message 1791796.  

We really deserve some pictures of this machine.
ID: 1791923 · Report as offensive
1 % Main, 99 % Beta (Avoid Linux, at all costs)
Volunteer tester

Joined: 1 Nov 08
Posts: 7346
Credit: 45,404,989
RAC: 6,189
Sweden
Message 1791925 - Posted: 30 May 2016, 1:26:42 UTC - in response to Message 1791923.  

We really deserve some pictures of this machine.

+1
Too much hormone treated meat.
Too much Monsanto veggies.
Too old and outdated constitution.
A crazy problem, as you Yanks use to say......

There is no God, and God never existed.
ID: 1791925 · Report as offensive
Al Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Joined: 3 Apr 99
Posts: 1676
Credit: 395,499,334
RAC: 286,546
United States
Message 1791962 - Posted: 30 May 2016, 3:49:26 UTC

The one I just built here with the 48 cores? I'll see if I can take a couple and find some way to get them on a photo hosting site, since our server isn't able to handle pics locally I was told when I wanted to post some here in the past. But, to be perfectly honest, it's less than impressive, because as with most of my systems, it's an open board with a PSU and a (couple, in this case) disk drives.

Not much to see at all, but if you really want to check it out, I suppose I can do that, to satisfy people's curiosity. Maybe I will also post another of my setups, the one that I built an extended video card rack for, which has all the video cards running about 8" above the motherboard. That one I feel is much more interesting, and actually took some effort to build. :-)

ID: 1791962 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 10530
Credit: 143,231,156
RAC: 78,947
Australia
Message 1792004 - Posted: 30 May 2016, 7:05:45 UTC - in response to Message 1791962.  

... because as with most of my systems, it's an open board with a PSU and a (couple, in this case) disk drives.

Ah, that would explain why you've got such good CPU temperatures. Running in its proper server case, even with all the fans screaming along, I'd expect it to be hotter than it is.
Grant
Darwin NT
ID: 1792004 · Report as offensive
Al Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Joined: 3 Apr 99
Posts: 1676
Credit: 395,499,334
RAC: 286,546
United States
Message 1792110 - Posted: 30 May 2016, 14:56:36 UTC - in response to Message 1792004.  

... because as with most of my systems, it's an open board with a PSU and a (couple, in this case) disk drives.

Ah, that would explain why you've got such good CPU temperatures. Running in its proper server case, even with all the fans screaming along, I'd expect it to be hotter than it is.

Oh Heck, Yeah. It's because of that accurate description that I am doing it this way. Don't have a datacenter to muffle the sound, and they can get unbearably loud, especially having to listen to them 24/7. I was unsure what to expect from the OEM coolers, but they appear to be working pretty well. Looking at the temps right now, room temp is 74°F, and about 1/2 are running between 43 and 46, the other half are running 47-51, with only 4 cores currently at 50 or above, but that probably depends on the WU's they are crunching?

ID: 1792110 · Report as offensive
archae86

Joined: 31 Aug 99
Posts: 909
Credit: 1,582,816
RAC: 0
United States
Message 1792118 - Posted: 30 May 2016, 15:19:45 UTC - in response to Message 1792110.  

and about 1/2 are running between 43 and 46, the other half are running 47-51, with only 4 cores currently at 50 or above, but that probably depends on the WU's they are crunching?

The on-chip sensors were not originally intended to be accurate thermometers, and both in my personal experience and as reported by others sensors on the same die can be surprisingly mismatched.

For the specific differences you are reporting, I'd guess there may be some component of real cross-die temperature variation (probably your heat sink and thermal compound don't remove heat perfectly uniformly) with some component of "thermometer error".

As not even the slope of the devices is necessarily well matched, the ideal calibration method would involve setting the whole CPU die to near-zero power idle, but warming it up with external means (with HSF still on). If you warm it to near the actual operating point of interest, but there is very little power dissipation in the chip, then the real temperatures at all the sensors should be well matched, and you can take their reported differences as calibration offset errors. That gives you relative error, but still leaves the overall offset error uncontrolled.

Or you could say "looks pretty good to me" and leave it alone, which is what I'd do in your specific situation.
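
For anyone who does want to try it, the relative-calibration idea described above can be sketched roughly as follows. This is only an illustration: the readings are made up, the function names are hypothetical, and it corrects a single offset per sensor near one operating point, so it does nothing about the slope mismatch or the overall offset error mentioned in the post.

```python
from statistics import mean

def relative_offsets(idle_readings):
    """Per-sensor offset from the mean reading, taken while the die is
    externally warmed and near idle (so real temperatures should match)."""
    reference = mean(idle_readings.values())
    return {core: temp - reference for core, temp in idle_readings.items()}

def apply_offsets(load_readings, offsets):
    """Subtract each sensor's calibration offset from its reading under load."""
    return {core: temp - offsets[core] for core, temp in load_readings.items()}

# Hypothetical readings in degrees C, invented purely for illustration.
idle = {"core0": 44.0, "core1": 46.0, "core2": 45.0, "core3": 49.0}
load = {"core0": 47.0, "core1": 50.0, "core2": 48.0, "core3": 53.0}

offsets = relative_offsets(idle)
corrected = apply_offsets(load, offsets)
print(offsets)
print(corrected)
```

In this made-up case most of the apparent spread under load turns out to be sensor mismatch rather than a real temperature difference between cores.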
ID: 1792118 · Report as offensive
Al Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Joined: 3 Apr 99
Posts: 1676
Credit: 395,499,334
RAC: 286,546
United States
Message 1792127 - Posted: 30 May 2016, 15:38:17 UTC - in response to Message 1792118.  

...Or you could say "looks pretty good to me" and leave it alone, which is what I'd do in your specific situation.

You nailed it! :-D

But thank you for the detailed reply, that is one that would fall under 'Good to know'!

ID: 1792127 · Report as offensive
1 % Main, 99 % Beta (Avoid Linux, at all costs)
Volunteer tester

Joined: 1 Nov 08
Posts: 7346
Credit: 45,404,989
RAC: 6,189
Sweden
Message 1792304 - Posted: 30 May 2016, 21:56:47 UTC
Last modified: 30 May 2016, 21:58:16 UTC

Congratulations Al.

The computer is now 2 days old, and the credit is 50,460 and counting. No invalids, and no errors.

Not bad at all for a CPU only machine.
Too much hormone treated meat.
Too much Monsanto veggies.
Too old and outdated constitution.
A crazy problem, as you Yanks use to say......

There is no God, and God never existed.
ID: 1792304 · Report as offensive
Al Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Joined: 3 Apr 99
Posts: 1676
Credit: 395,499,334
RAC: 286,546
United States
Message 1792327 - Posted: 30 May 2016, 22:52:55 UTC - in response to Message 1792304.  

Thanks! My RAC in the software shows that it has rocketed from 0 to 4600 in those 2 short days. Who knows how high it just might go? :-)
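
For a rough idea of where that might level off, here is a sketch that assumes RAC behaves as an exponentially weighted average of daily credit with a one-week half-life (my understanding of how BOINC computes it), and that credit keeps arriving at a constant rate. The 50,460 credits over 2 days come from the posts above; the function name is hypothetical.

```python
def estimated_rac(credit_per_day, days_running, half_life_days=7.0):
    """Approximate RAC after `days_running` days of steady credit, starting from 0.

    Assumes an exponentially decaying average with the given half-life.
    """
    decay = 2.0 ** (-days_running / half_life_days)
    return credit_per_day * (1.0 - decay)

credit_per_day = 50_460 / 2   # credit reported after the first 2 days

rac_after_2_days = estimated_rac(credit_per_day, 2)
rac_after_8_weeks = estimated_rac(credit_per_day, 56)

print(f"estimated RAC after 2 days:  {rac_after_2_days:,.0f}")
print(f"estimated RAC after 8 weeks: {rac_after_8_weeks:,.0f}")
```

Under those assumptions the estimate after 2 days comes out at roughly 4,500, close to the 4,600 reported, and the long-run ceiling is simply the daily credit rate, about 25,000 if the machine keeps up this pace on CPU alone.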

ID: 1792327 · Report as offensive
Al Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Joined: 3 Apr 99
Posts: 1676
Credit: 395,499,334
RAC: 286,546
United States
Message 1792434 - Posted: 1 Jun 2016, 3:35:18 UTC

Well, I just found out exactly how long my cache will last before it runs out of work: about 4 or so hours. The server went down at about 11 my time, and I ran out of work at about 3:30. So, it looks like the system will be hanging around taking a break every Tuesday for 3-5 hours without much to do, though once the GPU is in, that should be fine, as it will probably only be running 3-4 tasks at a time. We'll have to see how it goes. I learned something today about my new system, so that is a good thing!
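
A back-of-the-envelope sketch of what that observation implies about the cache size. It assumes the "11" and "3:30" are 11:00 and 15:30 on the same day, that all 48 tasks ran concurrently until the work ran out, and that a typical task takes about 2hr50min (the best-case AVX run time mentioned earlier in the thread). All names are illustrative.

```python
from datetime import timedelta

# Assumed clock times: server down at 11:00, out of work at 15:30.
outage_start = timedelta(hours=11)
out_of_work = timedelta(hours=15, minutes=30)
drain_hours = (out_of_work - outage_start) / timedelta(hours=1)

CONCURRENT_TASKS = 48        # one task per logical core
HOURS_PER_TASK = 170 / 60    # assumption: 2hr50min per task

# Core-hours consumed while draining, divided by the cost of one task.
implied_cached_tasks = drain_hours * CONCURRENT_TASKS / HOURS_PER_TASK

print(f"cache lasted {drain_hours} hours")
print(f"implied cache: about {implied_cached_tasks:.0f} tasks' worth of work")
```

That suggests the cache held on the order of 75 tasks' worth of CPU work; if typical tasks run longer than the assumed 2hr50min, the real number would be lower.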

ID: 1792434 · Report as offensive
1 % Main, 99 % Beta (Avoid Linux, at all costs)
Volunteer tester

Joined: 1 Nov 08
Posts: 7346
Credit: 45,404,989
RAC: 6,189
Sweden
Message 1792442 - Posted: 1 Jun 2016, 3:41:58 UTC - in response to Message 1792434.  

Well, I just found out exactly how long my cache will last before it runs out of work: about 4 or so hours. The server went down at about 11 my time, and I ran out of work at about 3:30. So, it looks like the system will be hanging around taking a break every Tuesday for 3-5 hours without much to do, though once the GPU is in, that should be fine, as it will probably only be running 3-4 tasks at a time. We'll have to see how it goes. I learned something today about my new system, so that is a good thing!

The servers have been offline for almost 24 hours due to Oscar's crash. They just came back up.
Too much hormone treated meat.
Too much Monsanto veggies.
Too old and outdated constitution.
A crazy problem, as you Yanks use to say......

There is no God, and God never existed.
ID: 1792442 · Report as offensive


 
©2018 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.