Message boards :
Number crunching :
I've Built a Couple OSX CUDA Apps...
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
Yes, all the apps compiled with the v8 'Baseline' code run with 'normal' CPU usage on a Mac. It's only the 'Special' code that uses a full CPU core. After months of looking and prodding it's still the same. Unfortunately I wouldn't know which parts of the code to compare; the code in the 'baseline' section works normally, the code in the Alpha section doesn't.

I decided to go really retro, and BOINC 6.10.56 trashes everything when going back from 7.2.33 no matter what settings you use, so I'll have to play with Plan Classes later. First is to find a BOINC that works with Ubuntu 11.04 and actually updates the counters without having to run the mouse across the screen. But hey, the CUDA 42 app works as expected. Maybe I should update the driver to 304? The plan is to run the old setup in Beta as stock, then switch over to the CUDA 42 app and compare the results. However, I really need counters that work.

I'll also have to see about resurrecting these recent Ghosties... Hmmm, it did the same as last time. The server doesn't mind resending the GPU tasks, or the normal CPU tasks, but sometimes insists it must expire the VLARs. Oh well, they've already been sent to someone else.
petri33 · Joined: 6 Jun 02 · Posts: 1668 · Credit: 623,086,772 · RAC: 156
TBar wrote:
    Yes, All the Apps compiled with the v8 'Baseline' code run with 'normal' CPU usage on a Mac. It's only the 'Special' Code that uses a Full CPU core. After months of looking and prodding it's still the same. Unfortunately I wouldn't know which part of the codes to compare. The code in the 'baseline' section works normally, the code in the Alpha section doesn't.

In cudaAcceleration.cu the stock code is:

    bool cudaAcc_setBlockingSync(int device)
    {
    //  CUdevice  hcuDevice;
    //  CUcontext hcuContext;

    /*  CUresult status = cuInit(0);
        if (status != CUDA_SUCCESS)
            return false;

        status = cuDeviceGet(&hcuDevice, device);
        if (status != CUDA_SUCCESS)
            return false;

        status = cuCtxCreate(&hcuContext, 0x4, hcuDevice); // 0x4 is CU_CTX_BLOCKING_SYNC
        if (status != CUDA_SUCCESS)
            return false; */

    #if CUDART_VERSION < 4000
        CUDA_ACC_SAFE_CALL(cudaSetDeviceFlags(cudaDeviceBlockingSync), false);
    //  CUDA_ACC_SAFE_CALL(cudaSetDeviceFlags(cudaDeviceScheduleYield), false);
    #else
        CUDA_ACC_SAFE_CALL(cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync), false);
    //  CUDA_ACC_SAFE_CALL(cudaSetDeviceFlags(cudaDeviceScheduleYield), false);
    #endif
        return true;
    }

My code is different. Try using the same as in stock. I'll get back in 10 hours or so and point out the other place(s) to look.

To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
Thanks Petri, I'll give that a go tomorrow. I think I've already tried the bottom part though.

I'm not having much luck tonight. First, I can't get BOINC to work correctly in Ubuntu 11.04; it seems the old version doesn't want to see OpenCL. Even with the recently updated 304.128 driver, with OpenCL links scattered throughout the filesystem, BOINC says No OpenCL... CUDA works fine. So, back to Ubuntu 14.04 with the same driver: BOINC says OpenCL exists, but nothing from the Beta server. The status page says they have tasks, but it won't send any. They don't give a reason either:

    Thu 17 Mar 2016 01:16:48 AM EDT | | Starting BOINC client version 7.2.33 for x86_64-pc-linux-gnu
    Thu 17 Mar 2016 01:16:48 AM EDT | | CUDA: NVIDIA GPU 0: GeForce GTS 250 (driver version unknown, CUDA version 5.0, compute capability 1.1, 1023MB, 844MB available, 705 GFLOPS peak)
    Thu 17 Mar 2016 01:16:48 AM EDT | | CUDA: NVIDIA GPU 1: GeForce 8800 GT (driver version unknown, CUDA version 5.0, compute capability 1.1, 512MB, 474MB available, 544 GFLOPS peak)
    Thu 17 Mar 2016 01:16:48 AM EDT | | OpenCL: NVIDIA GPU 0: GeForce GTS 250 (driver version 304.128, device version OpenCL 1.0 CUDA, 1023MB, 844MB available, 705 GFLOPS peak)
    Thu 17 Mar 2016 01:16:48 AM EDT | | OpenCL: NVIDIA GPU 1: GeForce 8800 GT (driver version 304.128, device version OpenCL 1.0 CUDA, 512MB, 474MB available, 544 GFLOPS peak)
    Thu 17 Mar 2016 01:16:48 AM EDT | | OS: Linux: 3.13.0-79-generic
    Thu 17 Mar 2016 01:16:48 AM EDT | | Memory: 1.95 GB physical, 13.00 GB virtual
    Thu 17 Mar 2016 01:16:48 AM EDT | | Disk: 58.93 GB total, 54.18 GB free
    Thu 17 Mar 2016 01:16:48 AM EDT | | Local time is UTC -4 hours
    Thu 17 Mar 2016 01:23:30 AM EDT | SETI@home Beta Test | work fetch resumed by user
    Thu 17 Mar 2016 01:23:31 AM EDT | SETI@home Beta Test | [sched_op] Starting scheduler request
    Thu 17 Mar 2016 01:23:31 AM EDT | SETI@home Beta Test | Sending scheduler request: To fetch work.
    Thu 17 Mar 2016 01:23:31 AM EDT | SETI@home Beta Test | Requesting new tasks for NVIDIA
    Thu 17 Mar 2016 01:23:31 AM EDT | SETI@home Beta Test | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
    Thu 17 Mar 2016 01:23:31 AM EDT | SETI@home Beta Test | [sched_op] NVIDIA work request: 33386.07 seconds; 2.00 devices
    Thu 17 Mar 2016 01:23:33 AM EDT | SETI@home Beta Test | Scheduler request completed: got 0 new tasks
    Thu 17 Mar 2016 01:23:33 AM EDT | SETI@home Beta Test | [sched_op] Server version 707
    Thu 17 Mar 2016 01:23:33 AM EDT | SETI@home Beta Test | Project requested delay of 7 seconds
    Thu 17 Mar 2016 01:23:33 AM EDT | SETI@home Beta Test | [sched_op] Deferring communication for 00:00:07
    Thu 17 Mar 2016 01:23:33 AM EDT | SETI@home Beta Test | [sched_op] Reason: requested by project
    Thu 17 Mar 2016 01:24:14 AM EDT | SETI@home Beta Test | [sched_op] Starting scheduler request
    Thu 17 Mar 2016 01:24:14 AM EDT | SETI@home Beta Test | Sending scheduler request: To fetch work.
    Thu 17 Mar 2016 01:24:14 AM EDT | SETI@home Beta Test | Requesting new tasks for NVIDIA
    Thu 17 Mar 2016 01:24:14 AM EDT | SETI@home Beta Test | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
    Thu 17 Mar 2016 01:24:14 AM EDT | SETI@home Beta Test | [sched_op] NVIDIA work request: 33355.29 seconds; 2.00 devices
    Thu 17 Mar 2016 01:24:15 AM EDT | SETI@home Beta Test | Scheduler request completed: got 0 new tasks
    Thu 17 Mar 2016 01:24:15 AM EDT | SETI@home Beta Test | [sched_op] Server version 707
    Thu 17 Mar 2016 01:24:15 AM EDT | SETI@home Beta Test | Project requested delay of 7 seconds
    Thu 17 Mar 2016 01:24:15 AM EDT | SETI@home Beta Test | [sched_op] Deferring communication for 00:00:07
    Thu 17 Mar 2016 01:24:15 AM EDT | SETI@home Beta Test | [sched_op] Reason: requested by project
    Thu 17 Mar 2016 01:31:46 AM EDT | SETI@home Beta Test | Sending scheduler request: To fetch work.
    Thu 17 Mar 2016 01:31:46 AM EDT | SETI@home Beta Test | Requesting new tasks for NVIDIA
    Thu 17 Mar 2016 01:31:46 AM EDT | SETI@home Beta Test | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
    Thu 17 Mar 2016 01:31:46 AM EDT | SETI@home Beta Test | [sched_op] NVIDIA work request: 33109.97 seconds; 2.00 devices
    Thu 17 Mar 2016 01:31:47 AM EDT | SETI@home Beta Test | Scheduler request completed: got 0 new tasks
    Thu 17 Mar 2016 01:31:47 AM EDT | SETI@home Beta Test | [sched_op] Server version 707
    Thu 17 Mar 2016 01:31:47 AM EDT | SETI@home Beta Test | Project requested delay of 7 seconds
    Thu 17 Mar 2016 01:31:47 AM EDT | SETI@home Beta Test | [sched_op] Deferring communication for 00:00:07
    Thu 17 Mar 2016 01:31:47 AM EDT | SETI@home Beta Test | [sched_op] Reason: requested by project

I suppose Beta just doesn't have a place for the older hardware. Suppose it will be Anonymous platform or nothing. I wonder how -poll would work on these cards...
Gianfranco Lizzio · Joined: 5 May 99 · Posts: 39 · Credit: 28,049,113 · RAC: 87
Petri, I replaced this part of the stock code in yours, but the result is the same: CPU usage is still 100%.

I don't want to believe, I want to know!
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
Worth looking at is the actual CPU time, under both modes. Some of Petri's code is fast enough that it could easily turn 20-40%-looking CPU into 40-80%, just by using the same amount of CPU over a shorter elapsed time. That's one of the indicators that we need to push more onto the GPU.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
Gianfranco is using a much newer CPU than mine, yet the CPU use is the same. The runtimes aren't much different, but his still uses a full CPU. To complicate matters further, Chris is using an older version of my Mountain Lion build and is only using around 70% CPU. The same version on my machine uses nearly 100% even though the CPUs aren't much different: http://setiathome.berkeley.edu/results.php?hostid=7942417&offset=240. He is running more CPU tasks as well and has a higher total CPU load than mine, yet the same app uses less CPU on his machine. Note he is still getting the SIGBUS errors.

The -poll option works on the older cards, but the increase isn't as much as it is with the GTX 750Ti. I just removed the option on the cards at Beta so the last few will run without it: http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=72013. Too bad Beta doesn't currently have an app that will work with these cards in Linux... my 4.2 app seems to work well there...
petri33 · Joined: 6 Jun 02 · Posts: 1668 · Credit: 623,086,772 · RAC: 156
@TBar: one of the listings a few posts back says OpenCL 1.0 for your machine:

    OpenCL: NVIDIA GPU 0: GeForce GTS 250 (driver version 304.128, device version OpenCL 1.0 CUDA, 1023MB, 844MB available, 705 GFLOPS peak)
    OpenCL: NVIDIA GPU 1: GeForce 8800 GT (driver version 304.128, device version OpenCL 1.0 CUDA, 512MB,

so the driver is too old. The plan class should have something that doesn't say OpenCL for that machine.

@jason_gee: yes, a faster app may use more CPU. But I'll check my code where I have those nanosleep loops.

To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
Apparently it was the driver version, not the OpenCL version. After updating to 337.25, which is about the highest driver that can be used in Linux with pre-Fermi cards, I was able to download some cuda 60 tasks. They ran the same as before, almost twice as slow as the cuda 42 tasks: http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=72013. I was never able to download any OpenCL tasks. The almost-twice-as-slow phenomenon is also what I saw with these cards in OSX with the old cuda 5 & 5.5 apps; with the cuda 42 app in OSX the cards are almost twice as fast as with the older v5.x versions. Fortunately the cuda 42 app also works well with my GTX 750Ti, which, when used with the -poll option, produces the fastest 750Ti times I could find on Beta: http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=72013&offset=60. Hmmm, I'll have to try the -poll option in Linux with the 750 and cuda 60 the next chance I get.

I did notice the apps on Beta are having the same problem with a large number of Apple laptops as the current stock app: much slower than they should be, sometimes up to 4x as slow, with a few giving errors as well. It doesn't appear any help for the laptops is coming anytime soon.

Has anyone noticed with the Special CUDA App in OSX that, of the three different machines, the one with the weakest CPU has the lowest CPU times? Likewise, the machine with the fastest CPU uses the most CPU. Also, this problem with nearly 100% CPU use doesn't exist in Linux, even with a weaker CPU.
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
In the case of the pre-Fermis, you're dealing with native 32-bit cards, so there's an extra layer of latency mapping that into 64-bitness. That's pretty much why, on Windows, pre-Fermi speed dropped off after Cuda 2.3. In the case of Linux, the driver model is considerably leaner (no whopping DirectX behemoth sitting in the middle). The picture with Mac is murkier, but likely even higher latency.

With Petri's default code, GPUs will spin on a CPU core; with his suggested reversion to my blocking sync, you'll sacrifice throughput and maybe drop CPU a bit, but more likely there is still a CPU spin embedded in the driver/OS to keep the device responsive. Preparing to support these heavier driver models with more appropriate techniques is where a lot of the threaded infrastructure being laid out in alpha is headed, though it's not trivial. I'm hoping that vk over Metal surfaces first (which would then give OpenCL 2), although automated self-scaling is likely before that, in order to minimise synchronisation.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
petri33 · Joined: 6 Jun 02 · Posts: 1668 · Credit: 623,086,772 · RAC: 156
And yes, a faster GPU needs more attention from the CPU.

Btw, did you specify maxrregcount=64 when not using the Makefile? I guess you did, and if so, that is not the issue. Not specifying it would have caused some major register spilling and induced a huge performance penalty on the GPU code.

To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
Well, if maxrregcount=64 is the same as -m64, then yes; and since Chris is using my older app, and I gave Gianfranco the line to use in the Makefile, I'd say all three machines are using the same Makefile line:

    NVCCFLAGS = -O3 --use_fast_math --ptxas-options="-v" --compiler-options "$(AM_CXXFLAGS) $(CXXFLAGS) -fno-strict-aliasing" -m64 \
        -gencode arch=compute_32,code=sm_32 -gencode arch=compute_35,code=sm_35 \
        -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50

Gianfranco may have added -gencode arch=compute_52,code=sm_52, since it appears he is using ToolKit 7.5.

It may be time to post the Mac app, so perhaps the handful of people that can use an app requiring CC 3.2 or higher can post their results. Maybe require a link to the machine they are using so at least we can see how it's performing; it wouldn't be any help if we couldn't see how it was running.

I looked over my inconclusive results from yesterday: 14 inconclusives against ~750 valid results. Of those 14, 4 were immediate overflows and 8 were from obviously misbehaving wingpeople. That left 2 legitimate inconclusives against ~750 valid tasks. Some days are better, some worse, but I don't think it is going to get much better anytime soon. Oh, the SIGBUS errors are gone, and I haven't seen an invalid with this app either.
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
Nah, not the same thing. -m64 is the bitness of the GPU code (and required to match the OS executable binary these days). You'll need to set maxrregcount to 32 for any nvcc code meant to run on anything less than GK110 (GTX 780, i.e. below compute capability 3.5), and 64 for cc 3.5 or above. Symptoms of setting this too high can range from slow operation to crashes, or some silent failures in between, depending on driver model and GPU. This is quite likely one of the aspects that breaks Cuda 3.2 libraries on Maxwell or later, internally, outside of our control.

So I'd suggest limiting it to 32 for generic compatibility, until I get the dispatch/plugin architecture built. 64 would be fine for personal builds that will never be run on a GPU below cc 3.5 (or so; probably best to stick to Maxwell, cc 5.0+, for safety if using 64 for maxrregcount).

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
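[Editor's aside: the register spilling petri33 and Jason mention can be observed at compile time. ptxas reports per-kernel register counts and spills when asked to be verbose; something along these lines, where the .cu file name is just an example and the flags mirror the Makefile lines quoted in this thread:]

    nvcc -c cuda/cudaAcceleration.cu -o /dev/null \
         --maxrregcount=32 --ptxas-options=-v \
         -gencode arch=compute_50,code=sm_50

Any "spill stores" or "spill loads" greater than 0 in the -v output means the register cap is forcing values out to slow local memory, which is the performance penalty being discussed.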
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
Okay, I just searched the Xbranch folder from the last build and found:

    $(cuda_cu_objs): cuda/$(@:.o=.cu)
        $(NVCC) -c cuda/$(@:.o=.cu) -o $@ -Icuda $(DEFS) $(DEFAULT_INCLUDES) $(INCLUDES) \
            -I$(top_srcdir)/db $(BOINC_CFLAGS) --maxrregcount=32 $(NVCCFLAGS) $(CUDA_CFLAGS)

Since changing it has never been mentioned before, all the apps have been built with the default, which appears to be 32. Hmmm, I might try setting it to 64 the next time I have the urge to pull all the 750s out and boot to Mountain Lion... just to see how it works on the 750s. Any other suggestions?
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
The only suggestion I have is that if you're going for a target-specific build for your own use, then there are many options for both host and GPU code you can try out (like Petri does). Exhaustively going through every setting on every .cu GPU file and every host .cpp file would not be entirely practical, and would limit builds to very narrow applicability, but small incremental improvements can add up if you have the patience.

For wide distribution (e.g. stock), it will become easier as the infrastructure evolves to do this kind of fine-tuning itself, at install, first run, and at runtime. Taking the limitations of having to support the lowest common denominator out of the picture and automating the tuning process is where the next-gen apps will be, and Petri's additions of streaming etc. are more or less the ideal testing tool to work out that dispatch/plugin architecture (meant for x42). As it stands, there are still some issues to work out for generic builds, though each day we're slightly closer to a clearer roadmap that will let us have our cake and eat it too.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
I'm afraid I need to save all the patience available for things of more importance. As long as it works decently without screen lag I'll be satisfied.

It appears the 100% CPU usage does occur on Linux machines as well: http://setiathome.berkeley.edu/results.php?hostid=7907890&offset=320. Just determine why that Linux build uses 100% CPU and maybe we'll have a clue. Perhaps there is hope after all...
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
That's right. It's nothing to do with how Cuda or OpenCL work, but with how the applications are engineered. The good thing about the way things are headed is that people will be able to choose. I like choices.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Gianfranco Lizzio · Joined: 5 May 99 · Posts: 39 · Credit: 28,049,113 · RAC: 87
TBar wrote:
    Okay, I just searched the Xbranch folder from the last build and found;

Tom, I recompiled the code with maxrregcount=64 and the results are very promising. The app seems about 1 minute faster on average-AR tasks; however, CPU use is still close to 100%. I'll run more tests and let you know.

I don't want to believe, I want to know!
Gianfranco Lizzio · Joined: 5 May 99 · Posts: 39 · Credit: 28,049,113 · RAC: 87
http://setiathome.berkeley.edu/result.php?resultid=4800007576 with maxrregcount=32
http://setiathome.berkeley.edu/result.php?resultid=4800260714 with maxrregcount=64, at the same AR (0.415)

Performance increase: 19.8%. These results are from my GTX 960 with gencode 35, 50 and 52.

I don't want to believe, I want to know!
Gianfranco Lizzio · Joined: 5 May 99 · Posts: 39 · Credit: 28,049,113 · RAC: 87
Jason, the same thing happens with your code: setting maxrregcount=64 and testing with the reference work unit from Lunatics, I see a performance increase of 3.6%. The increase is modest compared with the 21% from Petri's code, but it is still there.

I don't want to believe, I want to know!
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
Yes, it looks as though mine shaved about 45 seconds off the 0.41 AR tasks. I was waiting to see how the shorties fared, but it seems shorties are hard to find. Best I can tell, the shorties didn't change much. It still uses nearly 100% CPU...
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.