Message boards : Number crunching : Congratulations. We're the 171st fastest supercomputer
tbret Send message Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40 |
Let's pick at this a little bit. The Linpack benchmark used for estimating the Top 500's "real" power is pretty specific. How would any machine we might own fare running it? But, of course, the more basic question is, "Does Linpack represent a benchmark that has relevance to what we do?"

I read that Linpack is a double-precision benchmark. We don't use double-precision. Is it possible that Eric's number is correct using Linpack as its yardstick, but that in single-precision calculations we would double, triple, or more our performance? (Consider the huge difference between single- and double-precision times on an NVIDIA GPU.) Is it possible that the differences in AP and MB task "credits" we see are real, based on what Linpack would say IF Linpack were running? It would be very interesting to know how Eric derived his number.

My question is really this: is it possible that the types of calculations being done at other projects really are better suited to the hardware we run, and they really are getting multiple times the FLOPS we get?

Take GPUGRID, for instance. I've long considered those "credits" to be fictional. BUT... is that my failure? When I ran GPUGRID I found it almost impossible to keep my cards cool. They were consuming a lot of power, producing a lot of heat, and racking up a lot of credits. If my card shows a 95% GPU busy-state at Einstein, and a 95% busy-state at GPUGRID, but the card is 20C hotter at GPUGRID, additional electrons must be swirling around in there somewhere. Additional work is being done. More FLOPS per unit of "busy-ness" of the card, right? ...or wrong?

EDIT - Maybe what's called for is to look at identical computers across the projects and see what their RAC is. The highest-scoring machines at Einstein are all using AMD cards; not so much here. The difference is partially due to which hardware is better suited to the tasks. Einstein's still running CUDA 3.2. If I ran my hardware here on CUDA 3.2, my TFLOPS would be much reduced.
I'd really like to know how Eric estimates our collective capacity. |
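The single- versus double-precision gap tbret is asking about can be put in rough numbers. A back-of-envelope sketch, assuming the published peak figures for a GTX 680-class consumer Kepler card of that era (the ratios below come from spec sheets of the period, not from this thread):

```python
# Back-of-envelope for the single- vs double-precision gap described above.
# Assumed figures: a GTX 680-class card peaks around 3.09 SP TFLOPS, and
# consumer Kepler runs double precision at roughly 1/24 the SP rate.

sp_peak_tflops = 3.09        # assumed single-precision peak, GTX 680 class
dp_fraction = 1 / 24         # assumed DP:SP throughput ratio, consumer Kepler

dp_peak_tflops = sp_peak_tflops * dp_fraction
speedup_if_sp = sp_peak_tflops / dp_peak_tflops

print(f"DP peak ~{dp_peak_tflops:.2f} TFLOPS")  # far below the SP peak
print(f"SP workloads could run ~{speedup_if_sp:.0f}x faster than a DP benchmark implies")
```

On a card like that, a double-precision Linpack score would understate single-precision capacity by an order of magnitude or more, which is exactly the question being raised.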
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
I'd really like to know how Eric estimates our collective capacity.

(I'm not an expert on benchmarking, so I'll leave that part.) If you haven't already, have a read of CreditNew, to remind yourself how we got into this situation. The bit which is relevant to your question is: The second credit system. Although this project is now running the third (i.e. New) credit system, the mechanics of the second system are still in place.

Looking at the most recent task I reported from this computer, I see that stderr_txt contains

Flopcounter: 38994473207145.844000

What's more, the actual report back to the project (I'm looking at the sched_request file) also contains

<fpops_cumulative>111104300000000.000000</fpops_cumulative>

A second task contains

Flopcounter: 17692153721950.531000
<fpops_cumulative>50314560000000.000000</fpops_cumulative>

Current workunit headers still contain <credit_rate>2.8499999</credit_rate>, aka the old "Credit Multiplier: 2.85" that old-timers like me will remember - and that's very close to the ratio of 'counted' and 'cumulative' floating point operations.

So, all the components are there for a true estimate of FPOPs to be saved in the SETI database, and - since, obviously, we record time too - we can calculate true FLOPS directly. My guess (and it can only be a guess, until Eric gets back from the conference) is that Eric has access to the raw FLOPs data and had a quick look at it while preparing his speech. |
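Richard's ratio check can be reproduced directly from the numbers he quotes. A minimal sketch (both value pairs are taken verbatim from the post; nothing else is assumed):

```python
# Verifying that fpops_cumulative / Flopcounter lands near the old 2.85
# "Credit Multiplier". Value pairs are quoted from the post above.

tasks = [
    # (Flopcounter from stderr_txt, <fpops_cumulative> from sched_request)
    (38994473207145.844, 111104300000000.0),
    (17692153721950.531, 50314560000000.0),
]

ratios = [cumulative / counted for counted, cumulative in tasks]
for ratio in ratios:
    print(f"ratio = {ratio:.4f}")   # both come out close to 2.85
```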
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
Richard........ You have to perpetuate the screwed-up credit scenario again? I thought it was calming down. I truly don't give the back end of a rat about it anymore. I just don't. In the true meaning of the project, it is simply a non-issue. Not that I don't luv my creds. But slicing and dicing about why they could not be higher or better? Not my issue anymore. I do not care. Every other user here is issued credits on the same level and basis as I am. So it's all good, as far as I am concerned. I know, let's have a spelling contest. That would work. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
tbret Send message Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40 |
I hadn't put that two and two together, Richard. Thanks. I suspect you are right, then. I also suspect you are correct about the other projects. Maybe Eric can tell us if that's how he got his number. I'd like to know.

Still, it does leave the heat issue unresolved. Maybe Jason will drop by to tell us if heat is an indication of FLOPing, or if heat is the result of some sort of hardware electrical effect caused by fewer memory accesses or something.

(Mark - this isn't about credits, it's about calculating FLOPS. It's possible to talk about this without whining about the credit numbers. We just did.)

So, if our total RAC is ________, and our actual speed is 237 TFLOPS, I should be able to calculate my own speed within a reasonable margin of error for such things. Or did I miss something in your message? I've read it three times, but that doesn't mean I couldn't have misunderstood. |
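The calculation being proposed here is simple proportional scaling. A sketch with made-up RAC figures (only the 237 TFLOPS total comes from the thread; both RAC numbers are placeholders):

```python
# If credit is granted on a uniform basis, a host's share of the project's
# total FLOPS should roughly track its share of total RAC.

TOTAL_TFLOPS = 237.0          # Eric's figure, quoted in the thread
project_rac = 2_000_000.0     # hypothetical project-wide RAC
my_rac = 40_000.0             # hypothetical single-host RAC

my_tflops = TOTAL_TFLOPS * (my_rac / project_rac)
print(f"estimated host speed ~{my_tflops:.2f} TFLOPS")  # ~4.74 for these numbers
```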
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
A higher temperature most probably indicates fewer arithmetic-unit stalls while waiting for data. The FLOP concept doesn't reflect an algorithm's need for memory. Some algorithms are more memory-demanding, some are not. Let's consider two simplified examples.

One computes Y[i] = X[i] + C
Another computes Y[i] = C * sin(X[i])

Both need to read a constant, read one variable, and save (write) one result per array element, but the amount of calculation will be quite different. The first algorithm will be constrained by the speed of memory access on all modern hardware, CPU or GPU. The second one has some chance of hiding at least part of the memory access time behind the sine computations. SETI apps news We're not gonna fight them. We're gonna transcend them. |
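Raistmer's two kernels can be written out to make the contrast concrete. A sketch (array size and constant are arbitrary; the point is the arithmetic per element moved, not the timings):

```python
import math

# Kernel 1: Y[i] = X[i] + C        -> one add per element
# Kernel 2: Y[i] = C * sin(X[i])   -> one multiply plus a sine (many internal flops)
# Both read one element and write one element, so memory traffic is identical;
# only the arithmetic intensity (flops per byte moved) differs.

N = 1000
C = 2.5
X = [0.001 * i for i in range(N)]

Y1 = [x + C for x in X]              # memory-bound: the adder idles waiting on data
Y2 = [C * math.sin(x) for x in X]    # the sine work can overlap the memory latency

print(Y1[0], Y2[0])
```

On a GPU, the second kernel keeps the arithmetic units busier for the same memory traffic, which is consistent with the hotter-card observation earlier in the thread.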
tbret Send message Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40 |
So, if I understand the implication of what you are saying, then the hotter card IS doing more FLOPS because it waits less per second. If that is true, then there is a real difference in the FLOPS performance of the hardware, and the hotter project is making more complete use of the resources. Said differently: a real count of the floating point operations WILL be higher per second. That means that if "credit" were granted on actual FLOPS, the code that produces the fewest waits would get the largest "credit." Interesting... |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
First algorithm will be constrained by speed of memory access on all modern hardware CPU or GPU. Second one has some chances to hide at least part of memory access time by sine computations.

Yes, that's a valid way of looking at it: one of the tricks of optimisation is to make sure that the information you're going to need in a moment is exactly where you're going to need it, with the fastest possible route from there to the computational engine that's going to process it. The same applies to CPUs too: it's nothing special for GPUs (although the effects may be more dramatic, because of the longer route the data has to travel over the PCI bus from main memory). It's one of the reasons why the old AK_v8 applications for CPU (now withdrawn) were so good: Alex Kan put a lot of time and cleverness into getting the memory access right for that particular generation of CPUs. Or so I'm told, anyway. |
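The locality trick Richard describes — keeping the next operand where the compute engine can reach it quickly — shows up even in plain traversal order. A sketch (sizes and function names are illustrative, not from any SETI app):

```python
# Summing the same grid in layout order (good locality) versus strided
# order (poor locality). The answers are identical; on real hardware the
# first is typically faster, and on GPUs, where uncoalesced access costs
# far more, the gap is dramatic.

N = 200
grid = [[i * N + j for j in range(N)] for i in range(N)]

def sum_row_major(g):
    # visits elements in the order they sit in memory
    return sum(v for row in g for v in row)

def sum_col_major(g):
    # hops a full row's worth of elements between consecutive reads
    return sum(g[i][j] for j in range(len(g[0])) for i in range(len(g)))

print(sum_row_major(grid) == sum_col_major(grid))   # same answer, different traffic
```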
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Slightly wrong, because AK_v8 was not withdrawn, first of all. It's still our base for the MB7 optimized CPU apps. Only the ICC/IPP version is abandoned for now. The hand-made optimizations are still in use, or have been replaced by even better ones. SETI apps news We're not gonna fight them. We're gonna transcend them. |
tbret Send message Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40 |
Just a general comment, then: I'm more enthusiastic about NVIDIA's Maxwell line and the rumored inclusion of a Tegra processor on the card. IF I understand what I think I understand about what Jason said in an old thread somewhere, that will allow a programmer to put the entirety of the data in VRAM on the card, then program the Tegra to feed the GPU's "processors" when needed. Suddenly even antiquated PCIe bus speeds wouldn't affect total processing time very much. We ought to be able to screeeeeeam through some data if we can keep the cards cool enough and the Tegra can handle whatever CPU tasks our work requires (beyond RAM reads/writes). I hope it is as large a leap as it sounds. If so, we might either: A) challenge the bandwidth even at the COLO, or B) risk running out of data to process. The latter would be a good thing. The former might get us booted out of the COLO. |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
I highly suspect that if the bandwidth usage ever became an issue with the colo facilities, they would simply throttle the project's bandwidth, not boot the project. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
ExchangeMan Send message Joined: 9 Jan 00 Posts: 115 Credit: 157,719,104 RAC: 0 |
I don't think that running 5 GPU cards versus 4 explains the gap between your RAC and mine. We both have 8 GPUs; that's the bottom line. I have 1 Titan, 1 680 and 6 690s. You have 8 690s. The likely explanation is the higher performance of the Titan. |
Claggy Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
There are app and parameter differences between your hosts. You both run the Cuda50 app, Batterup as stock, ExchangeMan as anonymous platform.

Batterup uses:

mbcuda.cfg, Global pfblockspersm key being used for this device
pulsefind: blocks per SM 8
mbcuda.cfg, Global pfperiodsperlaunch key being used for this device
pulsefind: periods per launch 200
Priority of process set to HIGH successfully
Priority of worker thread set successfully

ExchangeMan uses:

mbcuda.cfg, Global pfblockspersm key being used for this device
pulsefind: blocks per SM 16
mbcuda.cfg, Global pfperiodsperlaunch key being used for this device
pulsefind: periods per launch 200
Priority of process set to ABOVE_NORMAL successfully
Priority of worker thread set successfully

Batterup runs the stock r1316 AP app with the default parameters, but with the -hp switch:

DATA_CHUNK_UNROLL at default:2
Priority of worker thread raised successfully
Priority of process adjusted successfully, high priority class used

ExchangeMan runs the r2058 AP app with the following parameters:

DATA_CHUNK_UNROLL set to:16
FFA thread block override value:16384
FFA thread fetchblock override value:8192
Sleep() & wait for event loops will be used in some places
Priority of worker thread raised successfully
Priority of process adjusted successfully, high priority class used

ExchangeMan is also doing AP on the CPU with r2137 x64; that will give him a good boost to his RAC too. Claggy |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
And don't forget a Titan + 680 produces a lot more than a 690; actually the 690 is equivalent to a little less than 2x670, approximately. |
Batter Up Send message Joined: 5 May 99 Posts: 1946 Credit: 24,860,347 RAC: 0 |
And not forget Titan+680 produces a lot more than a 690, actualy the 690 is equivalent to a little less than 2x670 aproximately.

Point of order: the 690 has two detuned 680 chips. My RAC just took a hit of 10,000 cobblestones because of two CPU shutdowns, about 16 hours lost. I had to cut back the 4.25 overclock because of the warmer weather. The top cruncher is tuned a bit better than mine and most likely has an uptime of 99 and 44/100%, no crashes. When I made changes to the code or hardware, the time it took caused me to fall behind, so I kept changes to a minimum. Summer is coming, so there has to be less crunching, but that will give me a chance to tune. So, I'll be back. |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Point of order; the 690 has two 680 detuned chips.

You are right, but they are so detuned that they actually produce about the same daily output as 2x670s running (+/- 50K/day), which is not too bad anyway. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.