I've Built a Couple OSX CUDA Apps...

TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1772045 - Posted: 17 Mar 2016, 2:36:41 UTC - in response to Message 1771985.  
Last modified: 17 Mar 2016, 3:33:55 UTC

Yes, all the Apps compiled with the v8 'Baseline' code run with 'normal' CPU usage on a Mac. It's only the 'Special' code that uses a full CPU core. After months of looking and prodding it's still the same. Unfortunately, I wouldn't know which parts of the code to compare. The code in the 'Baseline' section works normally; the code in the Alpha section doesn't.

I decided to go really retro, but BOINC 6.10.56 trashes everything when going from 7.2.33, no matter what settings you use, so I'll have to play with Plan Classes later. First I need to find a BOINC version that works with Ubuntu 11.04 and actually updates the counters without having to run the mouse across the screen. But hey, the CUDA 42 App works as expected. Maybe I should update the driver to 304?

The plan is to run the old setup in Beta as Stock, then switch over to the CUDA 42 App and compare the results. However, I really need counters that work.
I'll also have to see about resurrecting these recent Ghosties...


Hmmm, it did the same as last time. The server doesn't mind resending the GPU tasks or the normal CPU tasks, but sometimes insists it must expire the VLARs. Oh well, they've already been sent to someone else.
ID: 1772045
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1772071 - Posted: 17 Mar 2016, 5:13:35 UTC - in response to Message 1772045.  

Yes, all the Apps compiled with the v8 'Baseline' code run with 'normal' CPU usage on a Mac. It's only the 'Special' code that uses a full CPU core. After months of looking and prodding it's still the same. Unfortunately, I wouldn't know which parts of the code to compare. The code in the 'Baseline' section works normally; the code in the Alpha section doesn't.

I decided to go really retro, but BOINC 6.10.56 trashes everything when going from 7.2.33, no matter what settings you use, so I'll have to play with Plan Classes later. First I need to find a BOINC version that works with Ubuntu 11.04 and actually updates the counters without having to run the mouse across the screen. But hey, the CUDA 42 App works as expected. Maybe I should update the driver to 304?

The plan is to run the old setup in Beta as Stock, then switch over to the CUDA 42 App and compare the results. However, I really need counters that work.
I'll also have to see about resurrecting these recent Ghosties...


Hmmm, it did the same as last time. The server doesn't mind resending the GPU tasks or the normal CPU tasks, but sometimes insists it must expire the VLARs. Oh well, they've already been sent to someone else.


In cudaAcceleration.cu the stock code is
bool cudaAcc_setBlockingSync(int device)
{
//  CUdevice  hcuDevice;
//  CUcontext hcuContext;

/*  CUresult status = cuInit(0);
    if(status != CUDA_SUCCESS)
        return false;

    status = cuDeviceGet( &hcuDevice, device);
    if(status != CUDA_SUCCESS)
        return false;

    status = cuCtxCreate( &hcuContext, 0x4, hcuDevice ); // 0x4 is CU_CTX_BLOCKING_SYNC
    if(status != CUDA_SUCCESS)
        return false; */

#if CUDART_VERSION < 4000
    CUDA_ACC_SAFE_CALL(cudaSetDeviceFlags(cudaDeviceBlockingSync),false);
//  CUDA_ACC_SAFE_CALL(cudaSetDeviceFlags(cudaDeviceScheduleYield),false);
#else
    CUDA_ACC_SAFE_CALL(cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync),false);
//  CUDA_ACC_SAFE_CALL(cudaSetDeviceFlags(cudaDeviceScheduleYield),false);
#endif
    return true;
}


my code is different. Try using the same as in stock.

I'll get back in 10 hours or so and point out the other place(s) to look.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1772071
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1772078 - Posted: 17 Mar 2016, 5:53:23 UTC - in response to Message 1772071.  
Last modified: 17 Mar 2016, 6:10:52 UTC

Thanks Petri, I'll give that a go tomorrow. I think I've already tried the bottom part though.

I'm not having much luck tonight. First, I can't get BOINC to work correctly in Ubuntu 11.04; it seems the old version doesn't want to see OpenCL. Even with the recently updated version, 304.128, and OpenCL links scattered throughout the filesystem, BOINC says No OpenCL... CUDA works fine. So, back to Ubuntu 14.04 with the same driver: BOINC says OpenCL exists, but nothing comes from the Beta server. The status page says they have tasks, but it won't send any.
They don't give a reason either:
Thu 17 Mar 2016 01:16:48 AM EDT |  | Starting BOINC client version 7.2.33 for x86_64-pc-linux-gnu
Thu 17 Mar 2016 01:16:48 AM EDT |  | CUDA: NVIDIA GPU 0: GeForce GTS 250 (driver version unknown, CUDA version 5.0, compute capability 1.1, 1023MB, 844MB available, 705 GFLOPS peak)
Thu 17 Mar 2016 01:16:48 AM EDT |  | CUDA: NVIDIA GPU 1: GeForce 8800 GT (driver version unknown, CUDA version 5.0, compute capability 1.1, 512MB, 474MB available, 544 GFLOPS peak)
Thu 17 Mar 2016 01:16:48 AM EDT |  | OpenCL: NVIDIA GPU 0: GeForce GTS 250 (driver version 304.128, device version OpenCL 1.0 CUDA, 1023MB, 844MB available, 705 GFLOPS peak)
Thu 17 Mar 2016 01:16:48 AM EDT |  | OpenCL: NVIDIA GPU 1: GeForce 8800 GT (driver version 304.128, device version OpenCL 1.0 CUDA, 512MB, 474MB available, 544 GFLOPS peak)
Thu 17 Mar 2016 01:16:48 AM EDT |  | OS: Linux: 3.13.0-79-generic
Thu 17 Mar 2016 01:16:48 AM EDT |  | Memory: 1.95 GB physical, 13.00 GB virtual
Thu 17 Mar 2016 01:16:48 AM EDT |  | Disk: 58.93 GB total, 54.18 GB free
Thu 17 Mar 2016 01:16:48 AM EDT |  | Local time is UTC -4 hours
Thu 17 Mar 2016 01:23:30 AM EDT | SETI@home Beta Test | work fetch resumed by user
Thu 17 Mar 2016 01:23:31 AM EDT | SETI@home Beta Test | [sched_op] Starting scheduler request
Thu 17 Mar 2016 01:23:31 AM EDT | SETI@home Beta Test | Sending scheduler request: To fetch work.
Thu 17 Mar 2016 01:23:31 AM EDT | SETI@home Beta Test | Requesting new tasks for NVIDIA
Thu 17 Mar 2016 01:23:31 AM EDT | SETI@home Beta Test | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Thu 17 Mar 2016 01:23:31 AM EDT | SETI@home Beta Test | [sched_op] NVIDIA work request: 33386.07 seconds; 2.00 devices
Thu 17 Mar 2016 01:23:33 AM EDT | SETI@home Beta Test | Scheduler request completed: got 0 new tasks
Thu 17 Mar 2016 01:23:33 AM EDT | SETI@home Beta Test | [sched_op] Server version 707
Thu 17 Mar 2016 01:23:33 AM EDT | SETI@home Beta Test | Project requested delay of 7 seconds
Thu 17 Mar 2016 01:23:33 AM EDT | SETI@home Beta Test | [sched_op] Deferring communication for 00:00:07
Thu 17 Mar 2016 01:23:33 AM EDT | SETI@home Beta Test | [sched_op] Reason: requested by project
Thu 17 Mar 2016 01:24:14 AM EDT | SETI@home Beta Test | [sched_op] Starting scheduler request
Thu 17 Mar 2016 01:24:14 AM EDT | SETI@home Beta Test | Sending scheduler request: To fetch work.
Thu 17 Mar 2016 01:24:14 AM EDT | SETI@home Beta Test | Requesting new tasks for NVIDIA
Thu 17 Mar 2016 01:24:14 AM EDT | SETI@home Beta Test | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Thu 17 Mar 2016 01:24:14 AM EDT | SETI@home Beta Test | [sched_op] NVIDIA work request: 33355.29 seconds; 2.00 devices
Thu 17 Mar 2016 01:24:15 AM EDT | SETI@home Beta Test | Scheduler request completed: got 0 new tasks
Thu 17 Mar 2016 01:24:15 AM EDT | SETI@home Beta Test | [sched_op] Server version 707
Thu 17 Mar 2016 01:24:15 AM EDT | SETI@home Beta Test | Project requested delay of 7 seconds
Thu 17 Mar 2016 01:24:15 AM EDT | SETI@home Beta Test | [sched_op] Deferring communication for 00:00:07
Thu 17 Mar 2016 01:24:15 AM EDT | SETI@home Beta Test | [sched_op] Reason: requested by project
Thu 17 Mar 2016 01:31:46 AM EDT | SETI@home Beta Test | Sending scheduler request: To fetch work.
Thu 17 Mar 2016 01:31:46 AM EDT | SETI@home Beta Test | Requesting new tasks for NVIDIA
Thu 17 Mar 2016 01:31:46 AM EDT | SETI@home Beta Test | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Thu 17 Mar 2016 01:31:46 AM EDT | SETI@home Beta Test | [sched_op] NVIDIA work request: 33109.97 seconds; 2.00 devices
Thu 17 Mar 2016 01:31:47 AM EDT | SETI@home Beta Test | Scheduler request completed: got 0 new tasks
Thu 17 Mar 2016 01:31:47 AM EDT | SETI@home Beta Test | [sched_op] Server version 707
Thu 17 Mar 2016 01:31:47 AM EDT | SETI@home Beta Test | Project requested delay of 7 seconds
Thu 17 Mar 2016 01:31:47 AM EDT | SETI@home Beta Test | [sched_op] Deferring communication for 00:00:07
Thu 17 Mar 2016 01:31:47 AM EDT | SETI@home Beta Test | [sched_op] Reason: requested by project
....

I suppose Beta just doesn't have a place for the older Hardware.
Suppose it will be Anonymous platform or nothing.
I wonder how -poll would work on these cards...
ID: 1772078
Gianfranco Lizzio
Volunteer tester
Joined: 5 May 99
Posts: 39
Credit: 28,049,113
RAC: 87
Italy
Message 1772083 - Posted: 17 Mar 2016, 6:58:14 UTC - in response to Message 1772071.  


In cudaAcceleration.cu the stock code is

bool cudaAcc_setBlockingSync(int device)
{
//  CUdevice  hcuDevice;
//  CUcontext hcuContext;

/*  CUresult status = cuInit(0);
    if(status != CUDA_SUCCESS)
        return false;

    status = cuDeviceGet( &hcuDevice, device);
    if(status != CUDA_SUCCESS)
        return false;

    status = cuCtxCreate( &hcuContext, 0x4, hcuDevice ); // 0x4 is CU_CTX_BLOCKING_SYNC
    if(status != CUDA_SUCCESS)
        return false; */

#if CUDART_VERSION < 4000
    CUDA_ACC_SAFE_CALL(cudaSetDeviceFlags(cudaDeviceBlockingSync),false);
//  CUDA_ACC_SAFE_CALL(cudaSetDeviceFlags(cudaDeviceScheduleYield),false);
#else
    CUDA_ACC_SAFE_CALL(cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync),false);
//  CUDA_ACC_SAFE_CALL(cudaSetDeviceFlags(cudaDeviceScheduleYield),false);
#endif
    return true;
}

my code is different. Try using the same as in stock.


Petri, I replaced this part of the stock code in yours, but the result is the same: 100% CPU usage.
I don't want to believe, I want to know!
ID: 1772083
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1772088 - Posted: 17 Mar 2016, 8:10:58 UTC - in response to Message 1772071.  

Worth looking at is the actual CPU time under both modes. Some of Petri's code is fast enough that it could easily turn what looks like 20-40% CPU into 40-80%, just by using the same amount of CPU over a shorter elapsed time. That's one of the indicators that we need to push more to the GPU.
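To put illustrative numbers on that (made up, purely to show the arithmetic): 100 s of CPU time over a 400 s elapsed run reads as 25% CPU, while the same 100 s of CPU time over a 200 s elapsed run reads as 50%, even though the CPU cost per task hasn't changed at all.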
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1772088
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1772161 - Posted: 17 Mar 2016, 15:29:11 UTC - in response to Message 1772088.  

Gianfranco is using a much newer CPU than mine, yet the CPU use is the same. The runtimes aren't much different, but his still uses a full CPU. To complicate matters further, Chris is using an older version of my Mountain Lion build and is only using around 70% CPU. The same version on my machine uses nearly 100%, even though the CPUs aren't much different, http://setiathome.berkeley.edu/results.php?hostid=7942417&offset=240. He is running more CPU tasks as well and has a higher total CPU load than mine, yet the same App uses less CPU on his machine. Note he is still getting the SIGBUS errors.

The -poll option works on the older cards, but the increase isn't as much as it is with the GTX 750Ti. I just removed the option on those cards at Beta so the last few will run without it, http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=72013 Too bad Beta doesn't currently have an App that will work with these cards in Linux... my cuda 42 App seems to work well there...
ID: 1772161
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1772173 - Posted: 17 Mar 2016, 17:04:01 UTC

@TBar
One of the listings a few posts back says OpenCL 1.0 for your machine:
OpenCL: NVIDIA GPU 0: GeForce GTS 250 (driver version 304.128, device version OpenCL 1.0 CUDA, 1023MB, 844MB available, 705 GFLOPS peak)
OpenCL: NVIDIA GPU 1: GeForce 8800 GT (driver version 304.128, device version OpenCL 1.0 CUDA, 512MB, 474MB available, 544 GFLOPS peak)

So the driver is too old. The plan class should have a variant that doesn't require OpenCL for that machine.


@jason_gee
Yes, a faster app may use more CPU. But I'll check my code where I have those nanosleep loops.
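For anyone following along, the general shape of such a loop, as a sketch of the technique only (not Petri's actual code), is to poll a CUDA event and nanosleep between polls instead of calling a blocking sync:

#include <time.h>
#include <cuda_runtime.h>

// Sketch only: wait for an event by polling and sleeping briefly between polls,
// so the host thread releases the core instead of spinning on it.
static cudaError_t waitForEventPolling(cudaEvent_t ev)
{
    struct timespec ts = { 0, 50000 };        // 50 microseconds per poll (arbitrary)
    cudaError_t status;

    while ((status = cudaEventQuery(ev)) == cudaErrorNotReady)
        nanosleep(&ts, NULL);                 // give the core back between polls

    return status;                            // cudaSuccess once the event has completed
}

The sleep interval is the obvious tuning knob: too short and it behaves like a spin, too long and the GPU sits idle waiting to be noticed.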
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1772173
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1772420 - Posted: 18 Mar 2016, 15:04:34 UTC - in response to Message 1772173.  
Last modified: 18 Mar 2016, 15:27:04 UTC

Apparently it was the driver version, not the OpenCL version. After updating to 337.25, which is about the highest driver that can be used in Linux with pre-Fermi cards, I was able to download some cuda 60 tasks. They ran the same as before, almost twice as slow as the cuda 42 tasks, http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=72013. I was never able to download any OpenCL tasks. The almost-twice-as-slow phenomenon is also what I saw with these cards in OSX with the old cuda 5 & 5.5 Apps; with the cuda 42 App in OSX the cards are almost twice as fast as with the older v5.x versions.

Fortunately the cuda 42 App also works well with my GTX 750Ti, which, when used with the -poll option, produces the fastest 750Ti times I could find on Beta, http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=72013&offset=60. Hmmm, I'll have to try the -poll option in Linux with the 750 and cuda 60 the next chance I get. I did notice the Apps on Beta are having the same problem with a large number of Apple Laptops as the current stock App: much slower than they should be, sometimes up to 4x as slow, with a few giving errors as well. It doesn't appear any help for the Laptops is coming anytime soon.

Has anyone noticed with the Special CUDA App in OSX that, of the three different machines, the one with the weakest CPU has the lowest CPU times? Likewise, the machine with the fastest CPU uses the most CPU. Also, this problem with nearly 100% CPU use doesn't exist in Linux, even with a weaker CPU.
ID: 1772420
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1772457 - Posted: 18 Mar 2016, 16:59:48 UTC - in response to Message 1772420.  

In the case of pre-Fermis, you're dealing with native 32-bit cards, so there's an extra layer of latency in mapping that into 64-bitness. That's pretty much why, on Windows, pre-Fermi speed dropped off after Cuda 2.3. In the case of Linux, the driver model is considerably leaner (no whopping DirectX behemoth sitting in the middle). The picture with Mac is murkier, but likely even higher latency.

With Petri's default code the GPUs will spin on a CPU core; with his suggested reversion to my blocking sync you'll sacrifice some throughput and maybe drop CPU a bit, but more likely there is still a CPU spin embedded in the driver/OS to keep the device responsive.
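To spell out the scheduling choices being compared, here is a rough runtime-API sketch (not project code; the flag has to be set before the CUDA context is created on the device):

#include <cuda_runtime.h>

// cudaDeviceScheduleSpin         : busy-wait on sync, lowest latency, pegs a CPU core
// cudaDeviceScheduleYield        : spin, but yield the core to other runnable threads
// cudaDeviceScheduleBlockingSync : sleep on an OS primitive, lowest CPU use,
//                                  at the cost of some wake-up latency on each sync
bool setSchedulingMode(int device)
{
    if (cudaSetDevice(device) != cudaSuccess)
        return false;
    return cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync) == cudaSuccess;
}

Which of the three wins depends on how much the driver itself still spins underneath, which is exactly the murky part on OSX.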

Preparing to support these heavier driver models with more appropriate techniques is where a lot of the threaded infrastructure being laid out in alpha is headed, though it's not trivial. I'm hoping that Vulkan over Metal surfaces first (which would then give OpenCL 2), although automated self-scaling is likely before that, in order to minimise synchronisation.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1772457
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1772496 - Posted: 18 Mar 2016, 21:46:28 UTC

And yes, a faster GPU needs more attention from the CPU.
Btw, did you specify maxrregcount=64 when not using the Makefile? I guess you did, and if so, then that is not the issue.
Not specifying it would have caused some major register spilling and induced a huge performance penalty on the GPU code.
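A quick way to confirm either way: the xbranch NVCCFLAGS already pass --ptxas-options="-v", so the build output already reports per-kernel register usage, and any non-zero "spill stores" / "spill loads" figures there would show whether registers are being forced out to local memory.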
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1772496
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1772513 - Posted: 18 Mar 2016, 22:41:52 UTC - in response to Message 1772496.  

Well, if maxrregcount=64 is the same as -m64 then yes; since Chris is using my older App and I gave Gianfranco the line to use in the Makefile, I'd say all three machines are using the same Makefile line:
NVCCFLAGS = -O3 --use_fast_math --ptxas-options="-v" --compiler-options "$(AM_CXXFLAGS) $(CXXFLAGS) -fno-strict-aliasing" -m64 -gencode arch=compute_32,code=sm_32 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50

Gianfranco may have added -gencode arch=compute_52,code=sm_52 since it appears he is using ToolKit 7.5.
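(If so, the only difference from the line above would be that one extra pair tacked onto the end of the same NVCCFLAGS, i.e. ... -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52.)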

It may be time to post the Mac App so the handful of people who can use an app requiring CC 3.2 or higher can post their results. Maybe require a link to the machine they are using, so at least we can see how it's performing; it wouldn't be any help if we couldn't see how it was running. I looked over my inconclusive results from yesterday: 14 inconclusives against ~750 valid results. Of those 14, 4 were immediate overflows and 8 were from obviously misbehaving wingpeople. That left 2 legitimate inconclusives against ~750 valid tasks. Some days are better, some worse, but I don't think it is going to get much better anytime soon.
Oh, the SIGBUS Errors are Gone and I haven't seen an invalid with this App either.
ID: 1772513
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1772534 - Posted: 19 Mar 2016, 0:42:05 UTC - in response to Message 1772513.  
Last modified: 19 Mar 2016, 0:46:54 UTC

Nah, not the same thing. -m64 is the bitness of the GPU code (and it's required to match the OS executable binary these days). You'll need to set maxrregcount to 32 for any nvcc code meant to run on anything less than GK110 (the GTX 780, compute capability 3.5), and 64 for cc 3.5 or above. Symptoms of setting this too high can range from slow operation to crashes, or some silent failures in between, depending on driver model and GPU. This is quite likely one of the aspects that breaks Cuda 3.2 libraries on Maxwell or later, internally, outside of our control.

So I'd suggest limiting it to 32 for generic compatibility, until I get the dispatch/plugin architecture built. 64 would be fine for personal builds that will never be run on a GPU below cc 3.5 (or so; probably best to stick to Maxwell, cc 5.0+, for safety if using 64 for maxrregcount).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1772534
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1772538 - Posted: 19 Mar 2016, 1:08:52 UTC - in response to Message 1772534.  

Okay, I just searched the Xbranch folder from the last build and found:
$(cuda_cu_objs): cuda/$(@:.o=.cu)
	$(NVCC) -c cuda/$(@:.o=.cu) -o $@ -Icuda $(DEFS) $(DEFAULT_INCLUDES) $(INCLUDES) -I$(top_srcdir)/db $(BOINC_CFLAGS) --maxrregcount=32 $(NVCCFLAGS) $(CUDA_CFLAGS)

Since changing it has never been mentioned before, all the Apps have been built with the default, which appears to be 32.

Hmmm, I might try setting it to 64 the next time I have the urge to pull all the 750s out and boot to Mountain Lion... just to see how it works on the 750s.
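For reference, that experiment would just be the same rule with the one flag changed (untested, and per Jason's caveat only for builds that will never run on cards below cc 3.5):
$(cuda_cu_objs): cuda/$(@:.o=.cu)
	$(NVCC) -c cuda/$(@:.o=.cu) -o $@ -Icuda $(DEFS) $(DEFAULT_INCLUDES) $(INCLUDES) -I$(top_srcdir)/db $(BOINC_CFLAGS) --maxrregcount=64 $(NVCCFLAGS) $(CUDA_CFLAGS)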

Any other suggestions?
ID: 1772538
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1772544 - Posted: 19 Mar 2016, 1:34:50 UTC - in response to Message 1772538.  

The only suggestion I have is that if you're going for a target-specific build for your own use, there are many options for both host and GPU code you can try out (like Petri does). Exhaustively going through every setting on every .cu GPU file and every host .cpp file would not be entirely practical, and it would limit builds to very narrow applicability, but small incremental improvements can add up if you have the patience.

For wide distribution (e.g. stock), it will become easier as the infrastructure evolves to basically do this kind of fine-tuning itself, at install, at first run and at runtime. Taking the limitation of having to support the lowest common denominator out of the picture and automating the tuning process is where the next-gen apps will be, and Petri's additions of streaming etc. are more or less the ideal testing tool for working out that dispatch/plugin architecture (meant for x42).

As it stands, there are still some issues to work out for generic builds, though each day brings us slightly closer to a clearer roadmap that will let us have our cake and eat it too.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1772544
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1772577 - Posted: 19 Mar 2016, 3:37:13 UTC - in response to Message 1772544.  

I'm afraid I need to save all the patience available for things of more importance. As long as it works decently without screen lag I'll be satisfied.

It appears the 100% CPU usage Does occur with Linux machines as well, http://setiathome.berkeley.edu/results.php?hostid=7907890&offset=320
Just determine why That Linux build uses 100% CPU and maybe we'll have a clue.
Perhaps there is hope after all...
ID: 1772577
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1772578 - Posted: 19 Mar 2016, 3:45:05 UTC - in response to Message 1772577.  

That's right. It's nothing to do with how Cuda or OpenCL work, but with how the applications are engineered. The good thing about the way things are headed is that people will be able to choose. I like choices.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1772578
Gianfranco Lizzio
Volunteer tester
Joined: 5 May 99
Posts: 39
Credit: 28,049,113
RAC: 87
Italy
Message 1772597 - Posted: 19 Mar 2016, 6:57:44 UTC - in response to Message 1772538.  

Okay, I just searched the Xbranch folder from the last build and found:
$(cuda_cu_objs): cuda/$(@:.o=.cu)
	$(NVCC) -c cuda/$(@:.o=.cu) -o $@ -Icuda $(DEFS) $(DEFAULT_INCLUDES) $(INCLUDES) -I$(top_srcdir)/db $(BOINC_CFLAGS) --maxrregcount=32 $(NVCCFLAGS) $(CUDA_CFLAGS)

Since changing it has never been mentioned before, all the Apps have been built with the default, which appears to be 32.

Hmmm, I might try setting it to 64 the next time I have the urge to pull all the 750s out and boot to Mountain Lion... just to see how it works on the 750s.

Any other suggestions?


Tom, I recompiled the code with maxrregcount=64 and the results are very promising. The client seems about a minute faster on average-AR tasks; however, CPU usage is still close to 100%.

I will do some more tests and let you know.
I don't want to believe, I want to know!
ID: 1772597
Gianfranco Lizzio
Volunteer tester
Joined: 5 May 99
Posts: 39
Credit: 28,049,113
RAC: 87
Italy
Message 1772600 - Posted: 19 Mar 2016, 7:30:22 UTC

http://setiathome.berkeley.edu/result.php?resultid=4800007576

with maxrregcount=32

http://setiathome.berkeley.edu/result.php?resultid=4800260714

with maxrregcount=64 and the same AR 0.415

Performance increase: 19.8%.

These results are on my GTX 960 with gencode 35, 50 and 52.
I don't want to believe, I want to know!
ID: 1772600
Gianfranco Lizzio
Volunteer tester
Joined: 5 May 99
Posts: 39
Credit: 28,049,113
RAC: 87
Italy
Message 1772613 - Posted: 19 Mar 2016, 8:27:44 UTC

Jason, the same thing happens with your code: setting maxrregcount=64 and testing with the reference work unit from Lunatics, I measure a 3.6% increase in performance.
The increase is modest compared with the 21% from Petri's code, but it is still there.
I don't want to believe, I want to know!
ID: 1772613
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1772662 - Posted: 19 Mar 2016, 18:07:15 UTC - in response to Message 1772597.  
Last modified: 19 Mar 2016, 18:15:28 UTC

Yes, it looks as though mine shaved about 45 secs off the 0.41 AR tasks. I was waiting to see how the shorties fared, but it seems shorties are hard to find; best I can tell, they didn't change much.
It still uses nearly 100% CPU...
ID: 1772662