Message boards :
Number crunching :
setiathome v7 7.00 MultiBeam
JBird Joined: 3 Sep 02 Posts: 297 Credit: 325,260,309 RAC: 549 |
I've noticed (in Process Lasso) that the AKv8c_r2549_winx86-64_AVXxjfs app runs at 24% CPU usage. Is there a way to improve the runtimes of these MBs by, say, increasing the CPU usage? i.e. adding another app_config entry referencing it, such as: = <cpu_versions> <cpu_usage>0.75</cpu_usage> </cpu_versions> = Could this help? Would it work? Or should it just be <cpu_usage>1</cpu_usage>? = And where did 24% usage come from anyway? |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874 |
Is that on a quad-core machine, by any chance? |
JBird Joined: 3 Sep 02 Posts: 297 Credit: 325,260,309 RAC: 549 |
Yes, Richard. i5 2500 quad with 4 threads, all cores available in prefs. And I run 4x/4-up - that is, always 4 of them running, 1 on each core. |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
I didn't really understand the issue when you PMed me about this previously. However, after looking at the run times for some of your CPU tasks:
Run time (sec) | CPU time (sec) | Credit | Application
2,344.23 | 1,957.52 | 43.18 | SETI@home v7 Anonymous platform (CPU)
2,506.23 | 1,973.60 | 47.56 | SETI@home v7 Anonymous platform (CPU)
7,234.75 | 5,661.46 | 119.40 | SETI@home v7 Anonymous platform (CPU)
2,401.64 | 1,854.82 | 40.20 | SETI@home v7 Anonymous platform (CPU)
I think I understand your issue, which is really the question "Why are your run times so high compared to your CPU times?". The answer to that is normally that you have another application running, causing the SETI@home CPU app to wait for CPU cycles, or that you have in some other way overcommitted your system resources. I don't have a Core i CPU from that generation, but I suppose it could also be that the AVX app is not as efficient on AVX v1.0 hardware? I don't think that is as likely, though. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url] |
JBird Joined: 3 Sep 02 Posts: 297 Credit: 325,260,309 RAC: 549 |
Thanks HAL - I see where you're coming from there. My contention is: since this *is a CPU app and I have 4 cores at 3.3GHz, why isn't it *using the whole thing vs 24% of it? AVX is pretty strong, but why cripple it by calling 24% instead of the whole ball o' wax? Is app_info or app_config messed up somehow? Would my "suggested" app_config entry fix it? = Yes, I do run GPU versions alongside (CUDA50) - they only draw 1-4% CPU usage. |
OzzFan Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
In Windows, the CPU resource is looked at as a whole, with 100% being the sum total of all cores and sockets in a system. So on your single-socket, quad-core CPU, 100% is all 4 cores, which means that no individual core can go over 25%. Note that Task Manager shows decaying averages for CPU usage. Ultimately, then: no, you cannot change any configuration in BOINC to use more than 24/25% CPU on a single core of your machine. |
JBird Joined: 3 Sep 02 Posts: 297 Credit: 325,260,309 RAC: 549 |
Man! Thanks for the feedback on this, y'all. Illuminated my misconceptions about what I *thought I was working with. = Darn it! Every time I *think I'm onto something/getting somewhere, I get punked by Windows/WinTel I should say - *Thinking (and Marketing, of course). = Well, dunno *what to expect from the upgrade I'm working on, then. Just got it and am building it out next week: Intel Core i7-4790K Devil's Canyon, quad-core 4.0GHz, 8 threads with HT. Which Windows and BOINC will *read/see/show as 8 processors. (Should I expect it has only 4 FPUs? And will the AVX apps still run at 24%, whether on a physical *or virtual core?) The i7 does boast AVX 2.0 - maybe that'll help. And +700MHz surely will too. Dunno how Hyperthreading will *act (as far as my above questions go). |
OzzFan Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
Which Windows and Boinc will *Read/See-Show as 8 processors
You will have 4 real FPUs and 4 virtual ones, for a total of 8 FPUs to go along with your 8 ALUs. And again, since Windows sees the CPU resource as a whole of 100% (Windows doesn't really care about virtual or real cores), with a single socket and 8 total threads your 100% is now divided by 8, so each core cannot go over 12.5%. My Core i7 3930K has 6 cores and 6 Hyperthreaded ones, so each CPU will not go above about 8.33% (100 / 12). My Xeon X5660 box has two sockets, 6 cores in each CPU, and each CPU has 6 Hyperthreaded cores, so a single CPU will not go above about 4.17% (100 / 24). Mind you, you are not losing performance: 12.5% is 100% of one individual CPU core on an 8-CPU system, just as 25% is 100% of one individual core on a quad-core system, and 4.17% is 100% of one individual core on a 24-"CPU" system.
Darn it! Every time I *think I'm onto something/getting somewhere, I get punked by Windows/WinTel I should say - *Thinking(and Marketing, of course).
This wasn't designed and/or implemented by the Marketing team, for once. The way the OS handles CPU resources as a whole dates all the way back to the first Symmetric Multi-Processor (SMP) enabled OS co-developed by IBM and Microsoft, OS/2 2.11, and it continued on when Microsoft forked from that and started Windows NT (which of course is the precursor to your Windows 7 OS). If I expanded my studies beyond the x86 market, I'm sure the design philosophy dates back even earlier than IBM and Microsoft's attempt, too. |
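The arithmetic OzzFan walks through is just the whole-system 100% divided by the logical processor count; a minimal sketch using the machines quoted in this thread:

```python
def per_core_cap(sockets, cores_per_socket, threads_per_core=1):
    """Max share of total Windows CPU% that one logical core can show."""
    logical_cores = sockets * cores_per_socket * threads_per_core
    return 100.0 / logical_cores

# i5-2500: 1 socket, 4 cores, no Hyperthreading -> 25% per core
print(per_core_cap(1, 4))                 # 25.0
# i7-4790K: 4 cores with HT -> 8 logical processors
print(per_core_cap(1, 4, 2))              # 12.5
# Core i7 3930K: 6 cores with HT
print(round(per_core_cap(1, 6, 2), 2))    # 8.33
# Dual Xeon X5660: 2 sockets x 6 cores with HT
print(round(per_core_cap(2, 6, 2), 2))    # 4.17
```

So a CPU app pegging one core flat-out on the i5-2500 shows up as roughly the 24-25% JBird saw in Process Lasso.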
JBird Joined: 3 Sep 02 Posts: 297 Credit: 325,260,309 RAC: 549 |
Ah! I'm beginning to catch on - to the fractional *nature/the gist of it, anyway. Thanks for 'splainin it to me that way. So from a *performance standpoint - that is, runtimes and the like - brute force, i.e. +700MHz at stock, is the performance boost I'll *see (compared to my quad 3.3GHz cores, anyway). AVX 2.0 boost = unknown benefit til I launch/crunch/analyze and see. = OS is 64-bit; hardware is 64-bit; and bigger, faster busses with more lanes are coming as well with the new Z97 board and its PCIe 3.0 bus. Plus the advent of Hyperthreading increases I/O generally, as I understand it, as well as igniting my Maxwell GPU's Unified Memory (another I/O enhancement for *it). = Experimenting with just how many cores to use (in prefs) and/or experimenting with P Lasso affinity is next. My initial thought about affinities is to map CPU0 to Windows/BOINC and run everything else (apps) on the remaining 7 cores; unless of course it should just be Windows on CPU0 and BOINC *must be on the remaining - uncertain. The other idea is separate core groups for CPU and GPU apps. Beats me - just venturing into the fray. I still don't *get why I would need to "free a core" for my GPU apps, for example. Onward thru the fog! Thanks again for your feedback! = Edit: I do disable SpeedStep in BIOS and keep Turbo Boost ON - because I don't want *any down-clocking *anywhere. And Jason G caught me up on why the 64-bit app is slower than the 32-bit one (at this time due to app dependencies/latency, I think he said) - unless maybe the bigger busses on this new board will *allow double-wide traffic without a performance hit? |
OzzFan Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
So from a *performance standpoint - that is, runtimes and the like, Brute force ie +700MHz at Stock is the performance Boost I'll *see (compared to my Quad 3.3GHz cores anyway)
Well, raw clock speeds are only comparable when you are within the same CPU generation. Every other release (at least in Intel's Tick/Tock cadence for CPU releases) increases the number of instructions per clock cycle, known as Instructions Per Cycle or IPC. So a 3.3GHz 5xxx series Core i7/i5 is going to be faster than a 3.3GHz 2xxx series Core i7/i5, and faster still than, say, a 3.3GHz Intel Core 2 series. So you're gaining more than just 700MHz in raw clock speed, but only about ~5-10% (depending on the application and optimizations) on a clock-for-clock comparison basis.
Other idea is separate cores(group) for CPU and GPU apps.
It has to do with resource contention and how any modern OS handles multitasking and thread-level priorities. A CPU core that is 100% busy working on a thread and has to constantly stop to respond to other active threads (such as feeding the GPU) won't provide all threads the level of attention they need to work efficiently. Given that the GPU is far more efficient than a single CPU core (or all of them put together, for that matter), it is recommended to sacrifice the performance of a single core so as to allow it to be responsive to the GPU thread... to sort of "feed" it.
Edit: I do Disable Speedstep in BIOS and TurboBoost ON - because I don't want *any Down-clocking *anywhere.
Yes, that's correct.
And Jason G caught me up on why 64bit app is slower than 32bit(at this time due to app dependencies/latency I think he said) -- unless maybe the bigger Busses on this new Board will *allow dbl wide traffic- without a performance hit?)
Nope. PCI Express (or PCI-e / PCIe) uses the same number of bits, or "width", per lane from PCIe 1.0 through PCIe 3.0. It is a serial bus that gains its speed increases from pushing data faster through the same leads, and from optimizations in the protocols that send the data. Also, the CPU doesn't sit on the PCIe bus, so the faster PCIe 3.0 bus won't change performance with respect to the number of bits in the CPU registers. |
JBird Joined: 3 Sep 02 Posts: 297 Credit: 325,260,309 RAC: 549 |
Aha! (again) - Thanks OzzFan. So to take advantage on my current machine, I should go to .75/75% cores in prefs? And the same percent on the new one (more, I think), since I only have one discrete GPU, on which I run CUDAs at a .25 GPU config - but AP-OpenCL NV is configured with 1+1. = I'm currently planning to DVI my new iGD HD 4600 to my monitor only - to free up discrete GPU VRAM for crunching only. I'll "unlist"/not use the Intel GPU/iGD in prefs - until I see OC options for it in BIOS, that is (options that would come close to the 960 GPU's base clock of 540MHz). Who knows? Not I, til I *see this BIOS. |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Aha!(again) - Thanks OzzFan.
Telling BOINC to only use 3 of the 4 CPU cores in your current machine is likely to increase its overall performance. If you set BOINC CPU usage to 75%, the setting in your app_config for the OpenCL AP app to use a whole core will still be applied, so while AP tasks are running only 2 CPU cores will be used. You may want to update your app_config settings so that does not occur. Alternatively, you could modify your MB GPU app settings to also reserve a whole core instead of changing the BOINC CPU setting. I'm not 100% sure what you are referring to in regards to the iGPU clock vs the GTX 960. Most of the iGPUs have a clock that runs at 1.0-1.3 GHz. That doesn't really mean much on its own, as they have far fewer shader units and thus a much lower GFLOPS rating. The iGPU in the i7-4790K should be rated at 400 GFLOPS according to the formula they use. |
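For reference, reserving a whole CPU core for a GPU app is done with <avg_ncpus> in BOINC's app_config.xml. A sketch along these lines (the app name and plan class here are illustrative - check your own app_info.xml or the client log for the exact names your installation uses):

```xml
<app_config>
    <app_version>
        <app_name>setiathome_v7</app_name>   <!-- must match your app_info.xml -->
        <plan_class>cuda50</plan_class>      <!-- illustrative plan class -->
        <avg_ncpus>1.0</avg_ncpus>           <!-- reserve a full core to feed the GPU -->
        <ngpus>0.25</ngpus>                  <!-- 4 tasks per GPU, as discussed above -->
    </app_version>
</app_config>
```

After editing, use "Read config files" in BOINC Manager (or restart the client) for the change to take effect.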
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Aha!(again) - Thanks OzzFan.
BOINC gets mine equally wrong as well. OpenCL: Intel GPU 0: Intel(R) HD Graphics 4600 (driver version 10.18.14.4156, device version OpenCL 1.2, 1195MB, 1195MB available, 32 GFLOPS peak) The formula used for the seventh generation (HD Graphics 4000, 5000) is EU * 8 * 2 * clock speed. Given there are 20 execution units: 20*8*2*1200 = 384,000 MFLOPS, or 384 GFLOPS. The extra 50MHz by default on the i7-4790K makes it 400 GFLOPS. BOINC is much more optimistic about my HD 6870 GPU. OpenCL: ATI 0: ATI Radeon HD 6870 (Barts XT) (driver version 1573.4 (VM), device version OpenCL 1.2 AMD-APP (1573.4), 1024MB, 991MB available, 4032 GFLOPS peak) When in reality it is rated at half of that. Unless your GTX 980s are overclocked, it looks like BOINC is a bit over their rated 4612 GFLOPS as well. NVIDIA and ATI both use the formula shaders * 2 * clock speed to compute the SP GFLOP rating. Trying to apply that to the iGPUs doesn't seem to come up with the numbers BOINC is spitting out either. [sarcasm]It is troubling. As I know we all deeply depend on the benchmark values provided by BOINC for so many things.[/sarcasm] |
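The two rating formulas HAL quotes are easy to sanity-check; a small sketch (clock speeds in MHz, results in GFLOPS, using the figures from this thread):

```python
def intel_gen7_gflops(eu, clock_mhz):
    """Intel Gen7 iGPU rating: EU * 8 * 2 * clock (MHz -> GFLOPS)."""
    return eu * 8 * 2 * clock_mhz / 1000

def sp_gflops(shaders, clock_mhz):
    """NVIDIA/ATI single-precision rating: shaders * 2 * clock."""
    return shaders * 2 * clock_mhz / 1000

print(intel_gen7_gflops(20, 1200))  # HD 4600 at 1.20 GHz -> 384.0
print(intel_gen7_gflops(20, 1250))  # i7-4790K iGPU at 1.25 GHz -> 400.0
print(sp_gflops(1024, 1127))        # GTX 960, 1024 shaders -> ~2308
```

These are theoretical peaks only; as HAL notes later, memory bandwidth decides how much of that peak a real workload can actually use.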
JBird Joined: 3 Sep 02 Posts: 297 Credit: 325,260,309 RAC: 549 |
Glad you brought it up, HAL. This, from Intel ARK: Processor Graphics: Intel® HD Graphics 4600; Graphics Base Frequency: 350 MHz; Graphics Max Dynamic Frequency: 1.25 GHz; Graphics Video Max Memory: 1.7 GB; Graphics Output: eDP/DP/HDMI/VGA; Execution Units: 20. = The NVidia GTX 960 SC *base clock is 540 MHz. There is a *multiplier at play here - for the life of me I can't pin it down. NV lists "DirectCompute" = 5.2 in one util, 5.0 in another (which I have always associated with shaders) - the 960 has 1024 unified shaders but says SM5.0. Yet there are 8 multiprocessors - *read as 8 CUs in the AP stderr. (So, what and which *multiplier do I use to get a clean number *here (Intel)? Nor do I *understand the Intel Execution Units = 20, for comparative purposes.) Memory clocks are a little more obvious. Intel uses my DDR3 sys RAM, which will be 1600 MHz - but bandwidth there? NV = the 2048MB GDDR5 VRAM, 128-bit/112 GB/s stuff. So *close/no cigar on the comparisons, albeit whatever I find OC-wise in the ASUS BIOS may exceed ARK specs. |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Glad you brought it up HAL
The Intel "Execution Units" are the shaders. I'm not sure what you are talking about in regards to a multiplier. The "Base Frequency" listed for the GPU is the GPU's idle frequency, when it slows down to save power - not to be confused with a CPU "base clock" value. In the case of the GTX 960, its default clock is 1127MHz when actively under load. That's where the default FLOP rating comes from: with 1024 shaders, 1024*2*1127 ≈ 2308 GFLOPS in SP. The 540MHz, I imagine, is your GPU's idle frequency, which someone labeled as "base clock"? "Base clock" in that instance could be used to describe the GPU's frequency floor, rather than a root clock frequency. |
JBird Joined: 3 Sep 02 Posts: 297 Credit: 325,260,309 RAC: 549 |
Ah. The core graphics (base) clock is associated with the NV RAMDAC - the heart of the thing - which until recently (Fermi, then Kepler, and now Maxwell) has been 400 MHz, jumping about 50MHz per change. The multiplier I refer to is associated with the number of processor cores, each with its own RAMDAC backbone; similar to SRAM cache on a CPU. Shader units figure into it *somewhere - as a bandwidth thing - carburetor or engine of a sort, as I understand it: the SM module. Although the DirectCompute unit numbers don't match the number of processors - rather, the SM module - when it comes to discussions of *which unit is a CU; will the real CU please stand up? = Yes, the stark difference between Intel and NV in the shaders and base clocks departments is pretty close to radical. Add GDDR5 vs sys RAM DDR3 and the difference widens even more. Comes down to apples 'n' oranges comparisons and expectations, i.e. don't expect a Volkswagen to beat that Vette off the line - live with it! |
JBird Joined: 3 Sep 02 Posts: 297 Credit: 325,260,309 RAC: 549 |
Well, I did *try reducing cores to 3/75% in BOINC, and therefore on both the SETI and Einstein sites. Overnight (about 8 hrs +/-) experiment involving about 30 GPU/CPU tasks. = Zero *observed benefits - of course, nearly 50% of the tasks were _0 trailers and went to Pending, so it was difficult to track runtime improvements. = Seemingly smoother, faster ops with CUDAs by moving down to a .33 config (from .25) - but nothing dynamic/dramatic. = I terminated the plan and reverted to 100% cores. = The computer case I'm waiting on arrived in Houston at noon - hope they're kidding that I must wait til June 1 for it to drive 200 miles to San Antonio and get to my porch! |
rob smith Joined: 7 Mar 03 Posts: 22237 Credit: 416,307,556 RAC: 380 |
"_0" indicates that this task is the first of the initial replicates of a work unit, the other being "_1". To the user there is no significance in these two; however, if you see "_5" or upwards it is a good indication of a problematic work unit. You will personally only ever get one "version" of a given work unit. The best way of looking at changes in performance is to look at trends over several days (a week at least), as this will allow a wide range of work units to be processed and so give you a reasonable average. This is particularly true when there is a "shorty" storm (tasks which run abnormally quickly), or a pile of very slow tasks. "Pending" indicates that validation has not been completed on that task. When validation is complete the status will change to "valid" (and it will vanish from your visible list after 24 hours), or "inconclusive" (which means the two of you didn't agree on the result), or "invalid" (which means your result has been rejected for one of a number of reasons). Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Ah. The Core Graphics(Base)clock is associated with NV RAMDAC-the heart of the thing; which until recently(Fermi then Kepler and now Maxwell) has been a 400 MHz. Jumped about 50MHz/per change.
The shader units are a result of how many SM units there are in the GPU. For Maxwell there are 32 shaders per scheduler and 4 schedulers per SM. For a GTX 960 with 1024 shaders: 1024/32/4, or more simply 1024/128, = 8 SMs. A view with clinfo, or the output of the OpenCL apps, will refer to the SMs as "Max compute units:". For Intel GPUs the shaders are the same as what you see for "Max compute units:" in the OpenCL apps. Then for ATI it can be found by taking texture mapping units / 4. So everyone is doing it a completely different way!!! The memory speed is important, but isn't used when calculating FLOPS. FLOPS just gives us the maximum potential of the GPU; the memory coupled to it then determines how much of that performance you can actually use - just like using different speeds of memory with a CPU. I'm not sure, but I think there are 1 or 2 Volkswagens that can take a Vette off of the line. :P |
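HAL's shader-to-SM arithmetic for Maxwell can be written out as a tiny sketch (the 32-per-scheduler and 4-per-SM counts are the ones he quotes above):

```python
def maxwell_sm_count(shaders, shaders_per_scheduler=32, schedulers_per_sm=4):
    """Number of SMs ("Max compute units" in clinfo) for a Maxwell GPU."""
    shaders_per_sm = shaders_per_scheduler * schedulers_per_sm  # 128 for Maxwell
    return shaders // shaders_per_sm

print(maxwell_sm_count(1024))  # GTX 960: 1024 / 128 -> 8
```

This matches the "8 CUs" JBird saw reported in the AP stderr for his GTX 960.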
JBird Joined: 3 Sep 02 Posts: 297 Credit: 325,260,309 RAC: 549 |
Yeah, the math is impossible to figure out so that the *reported numbers *match. Either sandbagging, or an unknown multiplier, or some other formula is afoot. i.e. GPU-Z v0.8.3 reports the Maxwell GM206-A as SM5/DirectCompute 5. = Actually, the best numbers I've seen from a what's-what and how-many utility, by far, have to be from GPU Caps Viewer (Geeks3D). Do scroll down to the yellow banner for v1.23.0.2 - very comprehensive. I can't do math without fractions in my results to save me... *!* = It is refreshing to see the actual live clock speed in stderr, even tho it refers to a Kepler; at least it's not referencing the card's base clock. Tuning, schmooning ;) |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.