Linux CUDA 'Special' App finally available, featuring Low CPU use
| Author | Message |
|---|---|
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873
|
Stock clock. Yes, after looking at your benchmark ratings, I figured you were at stock CPU clock. I would try to find the time to try out the AVX Linux version. It seems to be really fast on my 1700X. I don't have any SSE4.x version to compare against in Windows, though. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873
|
I noticed this part of a comment from Elmor over in the ROG overclocking thread regarding changes to the SMU/FW update to AGESA 1.0.0.6.
Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628
|
Curious where you have your 1700 clocked at. The second thing is did you finally determine that the SSE4.2 CPU app is faster than the AVX app? I was wondering why your CPU task completion times are significantly longer than mine. . . Glad you mentioned that Keith, I have been wondering about SSE4.1/4.2 versus AVX myself ... . . And we still have to wait for the results ... ??? Stephen ?? |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
|
As time has permitted over the last few weeks, I've run many different test scenarios in an attempt to identify what factor(s) might be causing the poor performance of the GTX 960 on my host 8253697 when running the Special App (both 6.5 and 8.0) under Linux. The one obvious factor that differentiates that card from the other two in the box is the use of a riser cable from a PCIe x8(x4) slot. That, by itself, seemed unlikely to cause a problem (at least to me, if not to TBar), since that card/cable/slot configuration has worked just fine in Windows with both Cuda50 and SoG. On the hardware side, I first tried swapping that 960 with another 960 that's working just fine with the Linux Special App in another box (no difference), then with a GTX 750Ti (slightly worse performance than the 960), and then tried a different riser cable (no difference). I also moved the monitor cable from the 960 to one of the other cards (possible infinitesimal difference). There's not much I could do with the PCIe slot itself, the bus or the motherboard. To see if the Linux OS might be a problem, I ran the Linux SoG app (2 tasks at a time) for a few hours. As it happened, the 960 only caught guppi VLARs in that stretch, but the run times, although slightly longer (perhaps due to a different BLCnn batch), were still roughly comparable to those from the Windows SoG app. Noticing that at least a few of the other Linux Special App users are adding a "pfb" value to the command line, I tried running with a value of '15', since that's what I used to use for "pfblockspersm" in Cuda50. This might have shown some barely perceptible improvement, though probably not statistically significant. Then I decided to try Petri's latest Cuda 8.0 app, the "zi3t2b" which runs without Blocking Sync. BINGO! Run times dropped dramatically, down to approximately what I'm getting with the four 960s in my other Linux host 8257247. 
On that box, 2 of the 960s are on the board in x16(x16) slots, while 2 are on risers from x16(x8) slots (which appears to have no impact on performance). For good measure, I also tried running that app with the "-bs" option set, which caused the performance to go in the toilet again.

Here are some performance comparisons, with each Average Run Time representing six tasks in the specified AR:

BASELINE: Host 7057115 running Win8.1 "SoG" (2/GPU)
High AR ---- Avg RT = 10:33, Tasks/Hr = 11.37
Normal AR -- Avg RT = 19:36, Tasks/Hr = 6.12
VLAR ------- Avg RT = 32:01, Tasks/Hr = 3.74

Host 8253697 running Linux "Cuda6.5"
High AR ---- Avg RT = 7:10, Tasks/Hr = 8.37, Change = -26.4%
Normal AR -- Avg RT = 14:11, Tasks/Hr = 4.23, Change = -30.9%
VLAR ------- Avg RT = 13:55, Tasks/Hr = 4.31, Change = +15.2%

Host 8253697 running Linux "Cuda8.0" (w/ built-in Blocking Sync)
High AR ---- Avg RT = 6:35, Tasks/Hr = 9.11, Change = -19.9%
Normal AR -- Avg RT = 13:20, Tasks/Hr = 4.50, Change = -26.5%
VLAR ------- Avg RT = 13:08, Tasks/Hr = 4.57, Change = +22.2%

Host 8253697 running Linux "Cuda8.0" (w/o Blocking Sync)
High AR ---- Avg RT = 2:28, Tasks/Hr = 24.32, Change = +113.9%
Normal AR -- Avg RT = 5:08, Tasks/Hr = 11.69, Change = +91.0%
VLAR ------- Avg RT = 8:28, Tasks/Hr = 7.09, Change = +89.6%

So, at least for a GTX 960, it would appear that Blocking Sync might not be a good choice for a card tied into a PCIe slot that's less than x8 electrical. Why that would be the case is not something I have the expertise to explain. Whether this also applies to other cards, or to other motherboard/bus setups which have an x4 or x1 slot, is beyond my ability to test. Perhaps others can do so. In any event, hopefully this info will be useful to other users in the future. |
|
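For anyone who wants to run the same comparison on their own hosts, here is a minimal sketch of the arithmetic behind the Tasks/Hr and Change columns. It assumes the baseline figure counts both concurrent SoG tasks (2/GPU) and that Change is measured against that baseline; the function names are made up for illustration.

```python
def parse_mmss(text):
    """Convert a run time written as M:SS into seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def tasks_per_hour(avg_run_time, concurrent=1):
    """Tasks finished per hour when `concurrent` tasks share the GPU."""
    return 3600.0 / parse_mmss(avg_run_time) * concurrent


def percent_change(new_rate, baseline_rate):
    """Relative change of new_rate against baseline_rate, in percent."""
    return (new_rate / baseline_rate - 1.0) * 100.0


# High AR figures from the post: Windows SoG at 2 tasks per GPU versus
# the Linux Cuda8.0 app without Blocking Sync at 1 task per GPU.
baseline = tasks_per_hour("10:33", concurrent=2)
no_sync = tasks_per_hour("2:28")
print(f"{baseline:.2f} {no_sync:.2f} {percent_change(no_sync, baseline):+.1f}%")
```

Feeding in the other Avg RT values the same way gives a quick consistency check on each row.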
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
|
Interesting, considering I have two machines running the same App with cards in electrical x4 slots that don't show this behavior with the Blocking Sync. In fact, in over a Year of testing I have never seen such a slowdown as your One machine shows, and there have been many, many different App versions tested. This machine is running a GTX 950 in an electrical x4 slot listed as device 3, http://setiathome.berkeley.edu/results.php?hostid=6796479&offset=340. On that machine device 2 is an identical GTX 950 in an electrical x16 slot. This machine is running a GTX 750Ti in an electrical x4 slot listed as device 3, https://setiathome.berkeley.edu/results.php?hostid=7769537&offset=300. Both machines are running an Old Intel Motherboard classed somewhere around 'workstation' with the first running dual processors. For comparison with the 750Ti, this other machine is running 750Ti cards in electrical x8 slots, https://setiathome.berkeley.edu/results.php?hostid=6906726&offset=220. As you can see the difference is nowhere near the difference your one machine shows, so, it must be something other than just the x4 slot with the Blocking Sync. |
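Since the discussion turns on which cards sit in electrical x4 versus x8 or x16 slots, a sketch like the one below may help confirm what each card actually negotiated under Linux. It reads the `vendor`, `current_link_width` and `max_link_width` attributes that the kernel exposes under `/sys/bus/pci/devices`. Filtering on NVIDIA's PCI vendor ID `0x10de` is a choice made here, and it will also list the audio function that shares a card. Note that `max_link_width` describes the device, not the slot.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def read_attr(device_dir, name):
    """Return a sysfs attribute as stripped text, or None if it is absent."""
    try:
        return (Path(device_dir) / name).read_text().strip()
    except OSError:
        return None


def gpu_link_widths(sysfs_root="/sys/bus/pci/devices"):
    """Map PCI address -> (current width, max width) for NVIDIA devices."""
    widths = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return widths
    for device_dir in sorted(root.iterdir()):
        if read_attr(device_dir, "vendor") != NVIDIA_VENDOR_ID:
            continue
        current = read_attr(device_dir, "current_link_width")
        maximum = read_attr(device_dir, "max_link_width")
        if current is None or maximum is None:
            continue
        widths[device_dir.name] = (int(current), int(maximum))
    return widths


if __name__ == "__main__":
    for address, (current, maximum) in gpu_link_widths().items():
        print(f"{address}: running x{current}, device supports up to x{maximum}")
```

A card on a riser that reports a narrower current width than expected would be one more data point for the riser theory.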
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
|
Yes, there certainly could be other factors at work. I was trying to isolate one variable at a time and Blocking Sync was the only thing that really popped. BTW, I did the testing with Petri's 8.0 app last Thursday and most of the tasks have validated and gone, but here are links to a couple that are still pending at the moment: http://setiathome.berkeley.edu/result.php?resultid=5745745690 and http://setiathome.berkeley.edu/result.php?resultid=5745322027. I think I've captured most of the others in my archives, though, should their Stderr output ever prove useful. "This machine is running a GTX 950 in an electrical x4 slot listed as device 3, http://setiathome.berkeley.edu/results.php?hostid=6796479&offset=340" One thing that does pique my curiosity in comparing a couple of your tasks from that device with similar ones from my testing is that yours appear to be using somewhat more CPU time than mine. For instance, on High AR tasks, mine used between 29 and 35 seconds, whereas the one of yours I looked at used 48 seconds. Similarly, on a normal AR, yours looked to be using about 100 seconds where mine only used between 61 and 71 seconds. Perhaps there's a difference in the polling frequency that has an impact. |
|
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
|
"...in comparing a couple of your tasks from that device with similar ones from my testing is that yours appear to be using somewhat more CPU time than mine. For instance, on High AR tasks, mine used between 29 and 35 seconds, whereas the one of yours I looked at used 48 seconds. Similarly, on a normal AR, yours looked to be using about 100 seconds where mine only used between 61 and 71 seconds. Perhaps there's a difference in the polling frequency that has an impact." Have you tried running the Apps at Normal Priority as discussed back here? https://setiathome.berkeley.edu/forum_thread.php?id=80636&postid=1855061#1855061
<cc_config>
  <log_flags>
    <sched_op_debug>1</sched_op_debug>
  </log_flags>
  <options>
    <no_priority_change>1</no_priority_change>
    <use_all_gpus>1</use_all_gpus>
    <max_file_xfers_per_project>8</max_file_xfers_per_project>
    <save_stats_days>365</save_stats_days>
    <skip_cpu_benchmarks>1</skip_cpu_benchmarks>
  </options>
</cc_config>
I think others found the no_priority_change option doesn't work when using the version of BOINC from the Repository; it works with the Berkeley version of BOINC. |
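A quick way to double-check that the option is really present in the file BOINC reads is to parse it. The sketch below only uses Python's standard `xml.etree.ElementTree`; the `SAMPLE` text is a trimmed copy of the config above, and the helper name is made up. To test a real installation, read your own `cc_config.xml` (its location depends on how BOINC was installed) and pass the text in.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cc_config.xml shown above, for demonstration only.
SAMPLE = """<cc_config>
  <options>
    <no_priority_change>1</no_priority_change>
  </options>
</cc_config>"""


def no_priority_change_enabled(xml_text):
    """True if <options><no_priority_change> holds a non-zero integer."""
    root = ET.fromstring(xml_text)
    value = root.findtext("options/no_priority_change", default="0")
    try:
        return int(value.strip()) != 0
    except ValueError:
        return False


print(no_priority_change_enabled(SAMPLE))
```

This only confirms the file content; whether the client honours the option still has to be checked on the running processes.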
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
|
I hadn't tried it before, but have set it now. BOINC seems to accept it.
Mon 22 May 2017 08:49:39 PM PDT | | Config: run apps at regular priority
It shouldn't take too long to see if it makes a difference, at least with the 6.5 app that I'm currently running. |
|
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
|
I believe you need to look at Top to see if it's actually working, https://setiathome.berkeley.edu/forum_thread.php?id=80636&postid=1855477#1855477 If it's working the GPU App will run at Priority 20 Nice 0, if not it will be Priority 30 Nice 10. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
|
Looks OK, I think.
2748 jeff 20 0 28.382g 406416 294872 S 14.6 6.7 0:56.12 setiathome_x41p
2756 jeff 20 0 28.393g 435824 312876 S 10.3 7.1 0:34.73 setiathome_x41p
2735 jeff 20 0 28.393g 435452 312552 S 8.3 7.1 0:46.89 setiathome_x41p
EDIT 1: But no discernible impact on the 960's first task, a guppi VLAR, after adding the option. http://setiathome.berkeley.edu/result.php?resultid=5755772741 The 13:44 Run Time is within the previous range.
EDIT 2: And a shortie (AR=5.602958) still took 6:33, although that is slightly faster than the 7:10 average in my test sample. http://setiathome.berkeley.edu/result.php?resultid=5755834797 |
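The check described earlier in the thread (Priority 20 / Nice 0 means the option took effect, Priority 30 / Nice 10 means it did not) can be automated. Here is a rough sketch, assuming top's default column layout (PID, USER, PR, NI, VIRT, RES, SHR, S, %CPU, %MEM, TIME+, COMMAND); the helper names are invented for this example.

```python
def parse_top_row(row):
    """Split one default-layout top row into the fields used here."""
    fields = row.split()
    return {
        "pid": int(fields[0]),
        "priority": fields[2],  # kept as text because top prints "rt" for real-time tasks
        "nice": int(fields[3]),
        "command": fields[11],
    }


def runs_at_normal_priority(row):
    """True when the row shows PR 20 / NI 0, i.e. no niceness was applied."""
    info = parse_top_row(row)
    return info["priority"] == "20" and info["nice"] == 0


# The three rows from the post above.
rows = [
    "2748 jeff 20 0 28.382g 406416 294872 S 14.6 6.7 0:56.12 setiathome_x41p",
    "2756 jeff 20 0 28.393g 435824 312876 S 10.3 7.1 0:34.73 setiathome_x41p",
    "2735 jeff 20 0 28.393g 435452 312552 S 8.3 7.1 0:46.89 setiathome_x41p",
]
for row in rows:
    print(parse_top_row(row)["pid"], runs_at_normal_priority(row))
```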
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
|
... Present blocking sync implementation is relatively simple/naive, and there are likely too many of them. Once the pulsefinding (and any other serious) wrinkles are ironed out, we can look at scaling the synchronisation on a timed basis, with something akin to a frames per second target (perhaps launches per second, and scale the launches). Before other issues are addressed, that would be putting the cart before the horse though. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
|
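To make the "launches per second" idea a little more concrete, here is a purely hypothetical sketch of a feedback rule that scales how many kernel launches are queued between synchronisations so that the sync rate approaches a target. Nothing here is taken from the Special App's source; `next_batch_size`, the clamp limits and the simulated 2 ms launch time are all assumptions for illustration.

```python
def next_batch_size(batch_size, batch_seconds, target_syncs_per_second,
                    min_batch=1, max_batch=64):
    """Rescale the number of launches between syncs toward the target rate.

    batch_seconds is the measured wall time of the last batch
    (its launches plus one synchronisation).
    """
    if batch_seconds <= 0:
        return batch_size
    target_seconds = 1.0 / target_syncs_per_second
    scaled = round(batch_size * target_seconds / batch_seconds)
    return max(min_batch, min(max_batch, scaled))


# Simulated device where every launch takes 2 ms (an assumption).
launch_seconds = 0.002
batch = 1
for _ in range(5):
    measured = batch * launch_seconds
    batch = next_batch_size(batch, measured, target_syncs_per_second=50)
print(batch)
```

With a fixed per-launch cost the rule settles after one step; a real GPU workload would need smoothing, since launch times vary with the angle range of the task.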
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
|
OK, let's see how this works on the HP 'Workstation' board seen here, https://setiathome.berkeley.edu/forum_thread.php?id=80636&postid=1855727#1855727 From the image, it has One x1, and One electrical x4 slot that I usually don't use. They are spaced in a manner that you can run the Two 750Ti in either those slots or the Two x16 slots. Since this is a $22 board, I had no qualms about removing the end plate on the x1 slot so it can take a x16 card. I performed the operation the day I received the board. Other than having to reconfigure the xorg.conf file, the only problem so far is I can't convince it to use the fan control on the Top card using the x1 slot. So, it's running a little hotter than normal. It doesn't look that much different than running both cards in the x16 slots, https://setiathome.berkeley.edu/results.php?hostid=6906726&offset=200. The monitor is connected to the lower card in the x4 slot. I'll let it run for a while, but for now, it doesn't appear to be having any trouble. I'm Not using any risers; both cards are mounted in slots and do not use external power connections. Yes, that means the 750Ti is pulling the power from the x1 slot only. The pciBusID = 52 is the x1 slot:
setiathome_CUDA: Found 2 CUDA device(s):
Device 1: GeForce GTX 750 Ti, 1998 MiB, regsPerBlock 65536
computeCap 5.0, multiProcs 5
pciBusID = 40, pciSlotID = 0
Device 2: GeForce GTX 750 Ti, 2000 MiB, regsPerBlock 65536
computeCap 5.0, multiProcs 5
pciBusID = 52, pciSlotID = 0 |
|
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
|
Hmmmm, it appears the HP board isn't having any trouble using the x4 slot with the Blocking Sync either. In fact, the times are within a few seconds of the 750Ti in the Intel board running in a x4 slot. The times for the HP x1 slot are about what would be expected. The HP board will run the slots at x8 if both x16 slots are used. The times for the HP board using the Blocking Sync are around:
Shorties: x1 around 203 seconds, x4 around 173 seconds, x16(8) around 162 seconds
BLC03: x1 around 700 seconds, x4 around 660 seconds, x16(8) around 640 seconds
There is another machine I'm aware of running the Blocking Sync with a x4 slot. It's similar to one of my machines, Slots 1 & 2 are x16, Slots 3 & 4 are electrically x4, http://setiathome.berkeley.edu/results.php?hostid=7942417&offset=340 But wait, there is at least one more. Jeff's machine with the 4 GTX 960s ran the Blocking Sync for a while and didn't appear to be having problems with the Two x4 slots, https://setiathome.berkeley.edu/results.php?hostid=8257247&offset=2280 So, the score would be 5 different machines Not having problems with the Blocking Sync and a x4 slot, and One machine having problems. 5 to 1... For now, the Only machine I know of having that problem is Jeff's One machine. |
|
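To put the slot timings above on a common scale, a small sketch like this expresses each slot's run time as a percentage slowdown relative to the x16(8) slot. The dictionaries just restate the approximate times quoted in the post; the function name is invented here.

```python
# Approximate run times in seconds, as quoted for the HP board.
shorties = {"x1": 203, "x4": 173, "x16(8)": 162}
blc03 = {"x1": 700, "x4": 660, "x16(8)": 640}


def slowdown_vs_reference(times, reference="x16(8)"):
    """Percent extra run time of each slot compared with the reference slot."""
    ref = times[reference]
    return {slot: (seconds / ref - 1.0) * 100.0 for slot, seconds in times.items()}


for label, times in (("Shorties", shorties), ("BLC03", blc03)):
    for slot, extra in slowdown_vs_reference(times).items():
        print(f"{label} {slot}: {extra:+.1f}%")
```

On these numbers the x4 slot costs only a few percent and the x1 slot roughly a quarter on shorties.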
Grant (SSSF) Joined: 19 Aug 99 Posts: 14039 Credit: 208,696,464 RAC: 304
|
Could it be IRQ/DPC related? Is the odd system out low on physical RAM? Or lots of video cards, lots of video memory to manage resulting in system resource contention/overhead? Grant Darwin NT |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
|
"But wait, there is at least one more. Jeff's machine with the 4 GTX 960s ran the Blocking Sync for a while and didn't appear to be having problems with the Two x4 slots, https://setiathome.berkeley.edu/results.php?hostid=8257247&offset=2280" No x4 slots on that machine. It's an HP xw9400 with 2 x16(x16) slots and 2 x16(x8) slots. I've mentioned several times that the riser cables on that machine are on the x16(x8) slots. That's why I was drawing a distinction between x8 (which works fine with Blocking Sync) and x4 (which doesn't). |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
|
"Could it be IRQ/DPC related?" I can certainly try to check the IRQ, though I'm not sure where that info shows up in Linux. The system has 6GB of ECC memory, and the last time I checked, GKrellM was showing less than half of that in use. In fact, I actually removed the memory krell from the monitor display since there was always so much free memory available. The GTX 960 itself has 2GB RAM but slightly less than 1.6GB appears to be in use with the pfb set to 8 by autotune. |
|
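On the question of where IRQ information shows up in Linux: the kernel publishes per-IRQ counters in `/proc/interrupts`. Below is a minimal parsing sketch. The `SAMPLE` text and its counts are invented to show the layout, and matching on the string `nvidia` assumes the proprietary driver registers its interrupt under that name.

```python
# Invented sample in the usual /proc/interrupts layout.
SAMPLE = """\
           CPU0       CPU1
  0:         22          0   IO-APIC    2-edge      timer
 16:       1200        300   IO-APIC   16-fasteoi   nvidia
NMI:          5          7   Non-maskable interrupts
"""


def parse_interrupts(text):
    """Return {irq label: (total count, description)} from /proc/interrupts text."""
    lines = text.strip("\n").splitlines()
    cpu_count = len(lines[0].split())
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        tokens = rest.split()
        counts = []
        for token in tokens[:cpu_count]:
            if not token.isdigit():
                break
            counts.append(int(token))
        description = " ".join(tokens[len(counts):])
        table[label.strip()] = (sum(counts), description)
    return table


def interrupts_matching(text, keyword):
    """Pick the rows whose description mentions keyword (e.g. a driver name)."""
    return {label: row for label, row in parse_interrupts(text).items()
            if keyword in row[1]}


print(interrupts_matching(SAMPLE, "nvidia"))
```

On a live system, pass in the contents of `/proc/interrupts` read a few seconds apart and compare the totals to see how fast each counter is climbing.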
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
|
OK, so we're back to 4 machines Not having trouble with x4 Slots versus your One machine with the riser that is. The one item that sticks in your face is the Riser, it's probably introducing just enough delay that the Blocking Sync doesn't work correctly with That Riser in That Slot. The x4 Slot in my HP xw4600s (I have 2) is actually an open ended x8 slot that's wired for x4. It will take a x16 card without any trouble. If your x4 slot can take a x16 card I'd suggest you try it with the card mounted in the slot rather than the Riser. I have 3 different machines that don't have any trouble with the card mounted in the x4 Slot. Two of the machines work 24/7 using a x4 Slot which is why I don't think there is a problem for Most people who don't use a Riser cable. There doesn't seem to be any trouble using even the x1 Slot with the card mounted in the Slot. "But wait, there is at least one more. Jeff's machine with the 4 GTX 960s ran the Blocking Sync for a while and didn't appear to be having problems with the Two x4 slots, https://setiathome.berkeley.edu/results.php?hostid=8257247&offset=2280" "No x4 slots on that machine. It's an HP xw9400 with 2 x16(x16) slots and 2 x16(x8) slots. I've mentioned several times that the riser cables on that machine are on the x16(x8) slots. That's why I was drawing a distinction between x8 (which works fine with Blocking Sync) and x4 (which doesn't)." If these ever hit NewEgg, no one will need a Riser, and my Old HP will take Four 1050 Ti with just an additional 8 inch Desk fan and an open case. http://techreport.com/news/31888/inno3d-squeezes-a-geforce-gtx-1050-ti-into-a-single-slot#metal |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
|
"...it's probably introducing just enough delay that the Blocking Sync doesn't work correctly with That Riser in That Slot." Certainly could be a possibility. I think Jason's the latency expert around here. The 6" cable is the shortest I can use, just enough to get the card outside the box. "If your x4 slot can take a x16 card I'd suggest you try it with the card mounted in the slot rather than the Riser." Wish I could, but that machine is a Dell T7400 and the slot arrangement just won't allow it. The x8(x4) slot is #28 and is tucked in tight under the x16 slot (#27) where the 980 currently resides. And just below the slot is a cover which forms an airflow tunnel over the memory sticks and CPUs. I can barely get my fingers in to seat the riser cable without pulling something else out. |
|
Grant (SSSF) Joined: 19 Aug 99 Posts: 14039 Credit: 208,696,464 RAC: 304
|
"Could it be IRQ/DPC related?" Not so much the IRQ used, but the OS/CPU overhead in servicing requests. On my C2D with 2 GTX 750Tis running CUDA 50 with 2 WUs at a time, when crunching Arecibo WUs the IRQ/DPC CPU overhead can peak as high as 20%, with periods of 15% for several seconds. While the CPU is handling that load, it's not supplying CPU time to crunching WUs. I thought system overhead may be playing a role in your reduced GPU performance depending on the blocking sync setting. Grant Darwin NT |
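For a Linux host, a rough counterpart to that IRQ/DPC percentage can be derived from the aggregate `cpu` line of `/proc/stat`, which reports time spent in hardware interrupts (`irq`) and deferred interrupt work (`softirq`). Treating softirq as the analogue of Windows DPC time is an assumption made here, and the two sample lines are invented for the demonstration.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Read the aggregate 'cpu' line of /proc/stat into a dict of jiffies."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))
    raise ValueError("no aggregate cpu line found")


def interrupt_share(before, after):
    """Percent of CPU time spent in irq + softirq between two samples."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        return 0.0
    return 100.0 * (delta["irq"] + delta["softirq"]) / total


# Two invented samples, as if read a few seconds apart.
first = parse_cpu_line("cpu  1000 0 500 8000 100 50 50 0 0 0")
second = parse_cpu_line("cpu  1500 0 700 8500 100 150 150 0 0 0")
print(f"{interrupt_share(first, second):.1f}% of CPU time in interrupt handling")
```

Sampling the real file once with Blocking Sync and once without would show whether interrupt handling differs much between the two modes.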
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
|
Oh, okay, another possible latency issue, then. The thing is, with Blocking Sync on, the CPU usage is fairly minimal, but with it off, a full CPU is basically dedicated to each GPU task. If there was a DPC latency issue, I would think it would be more obvious without Blocking Sync than with it, though my understanding of that interaction is certainly pretty fuzzy. ;^) |
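The CPU-usage contrast described here (minimal with Blocking Sync, a full core without it) is the usual trade-off between sleeping until signalled and polling in a loop. The sketch below imitates it with plain Python threads rather than a GPU, so it is only an analogy: a timer thread stands in for the device finishing its work, and the function names are invented.

```python
import threading
import time


def cpu_seconds_while_waiting(wait, delay=0.2):
    """Measure process CPU time consumed while waiting `delay` seconds for an event."""
    done = threading.Event()
    timer = threading.Timer(delay, done.set)  # stands in for the GPU finishing
    start = time.process_time()
    timer.start()
    wait(done)
    used = time.process_time() - start
    timer.join()
    return used


def spin_wait(event):
    """Poll in a tight loop, the analogue of running without Blocking Sync."""
    while not event.is_set():
        pass


def blocking_wait(event):
    """Sleep until signalled, the analogue of Blocking Sync."""
    event.wait()


spin_cpu = cpu_seconds_while_waiting(spin_wait)
block_cpu = cpu_seconds_while_waiting(blocking_wait)
print(f"spin: {spin_cpu:.3f}s CPU, blocking: {block_cpu:.3f}s CPU")
```

The polling version reacts as soon as the event fires but burns CPU the whole time, while the blocking version costs almost nothing yet depends on being woken promptly, which is where extra latency could hurt.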