New ATI OpenCL-based AstroPulse (rev516) released

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1098389 - Posted: 17 Apr 2011, 14:04:31 UTC - in response to Message 1098386.  

@ Raistmer:

I got a new validated AP with the new (higher) ffa-block:


Good info, maybe helpful for others who tune their setups.

ID: 1098389
terencewee*
Joined: 10 Oct 09
Posts: 53
Credit: 7,022,510
RAC: 0
Malaysia
Message 1098402 - Posted: 17 Apr 2011, 14:41:54 UTC

I've changed another parameter in my app_info.xml:

<max_ncpus>0.08</max_ncpus>

The CPU load now alternates between 0-2% and 4-12%. It did not go above 12%.

Wow!

Fast AP-processing and low CPU-load.


Thank you so much Raistmer!
ID: 1098402
terencewee*
Joined: 10 Oct 09
Posts: 53
Credit: 7,022,510
RAC: 0
Malaysia
Message 1098411 - Posted: 17 Apr 2011, 14:54:04 UTC
Last modified: 17 Apr 2011, 15:00:06 UTC

@Raistmer:

Something just occurred to me:

For every minute of reduced GPU work there is a corresponding reduction of about 1 sec of CPU work for a 0.00-blanked AP WU.

This means the APgraph.png isn't really reflective of CPU loading...
ID: 1098411
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1098436 - Posted: 17 Apr 2011, 15:51:22 UTC - in response to Message 1098411.  

@Raistmer:
This means the APgraph.png isn't really reflective of CPU loading...

??? Why did you come to that conclusion?
In the GPU build, quite a big share of CPU time is spent waiting for and synchronizing with the GPU.
If the GPU finishes its work faster, fewer check iterations are needed on the CPU side. Hence a decrease in overall time usually leads to a small decrease in CPU time too.
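
The link between GPU run time and CPU time can be shown with a toy model. This is only a sketch under stated assumptions: it supposes the CPU side checks on the GPU at a fixed interval and that each check costs a fixed slice of CPU time. The function name, the 60 ms interval and the 1 ms cost are hypothetical values picked for easy arithmetic; they are not taken from the actual app.

```python
def cpu_seconds_spent_checking(gpu_busy_ms, check_interval_ms=60, cost_per_check_ms=1):
    """Toy model: CPU seconds used while waiting for a GPU that is busy for gpu_busy_ms.

    check_interval_ms and cost_per_check_ms are hypothetical values, not
    measurements of the real app.
    """
    checks = -(-gpu_busy_ms // check_interval_ms)  # ceiling division on integers
    return checks * cost_per_check_ms / 1000.0


slow = cpu_seconds_spent_checking(10 * 60 * 1000)  # GPU busy for 10 minutes
fast = cpu_seconds_spent_checking(9 * 60 * 1000)   # GPU busy for 9 minutes
print(f"slow: {slow} s CPU, fast: {fast} s CPU")
```

In this model, CPU time scales with the number of checks, so a shorter GPU run also lowers CPU time a little, even though the CPU does no extra useful work in either case.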
ID: 1098436
terencewee*
Joined: 10 Oct 09
Posts: 53
Credit: 7,022,510
RAC: 0
Malaysia
Message 1098470 - Posted: 17 Apr 2011, 17:25:51 UTC - in response to Message 1098436.  
Last modified: 17 Apr 2011, 17:33:33 UTC

@Raistmer:
This means the APgraph.png isn't really reflective of CPU loading...

??? Why did you come to that conclusion?
In the GPU build, quite a big share of CPU time is spent waiting for and synchronizing with the GPU.
If the GPU finishes its work faster, fewer check iterations are needed on the CPU side. Hence a decrease in overall time usually leads to a small decrease in CPU time too.


I'm asking for your opinion as you *own* the app. :)


I'm thinking the APgraph is representative of the *length of time* spent checking/fetching/pushing to the GPU, not the actual load placed on the CPU.

The axes of the graph are, after all, "% of blanking" vs. "time(sec)".

When a reduction in GPU time equates to a similar reduction in CPU time on the exact same cruncher, the difference is in the ffa blocks and fetches.

Identical:
Amount of transferred data
Amount of processing required
Amount of blanking

Not-identical:
Size of ffa blocks
Size of ffa fetchblocks


During run-time,

Case 1 (high CPU-load): The app took 30-45% (60%-90% when running 2 instances) of CPU/sec. There was less CPU/sec to process other CPU-WU.

Case 2 (low CPU-load): The app took 0-12% (x2=0-24%) of CPU/sec. There was more CPU/sec for doing other CPU-WUs processing.

The actual CPU-load could be very different - there is no guarantee we won't face high CPU-load.


The APgraph is still very useful: if our AP-WU takes longer than the normal time, we know our app_info.xml settings are not optimized. It is important that the APgraph was done with CPU512x2 having low CPU-load.

Personally, I can accept 0-12% CPU-load during run time.



Put another way:
Suppose 2 crunchers each run exactly 10 AP-WUs with 0.00 blanking. The Case 2 cruncher finishes faster, as expected (length of time, as per the very useful APgraph), and its CPU is then 100% free to do CPU-WUs (bonus time).

Now focus on what happens *during* the run time of Case 2: if less load is placed on the CPU, more CPU-WUs get completed. This is not "captured" in the APgraph.


Sorry if I'm totally wrong - I'm not a scientist, just a Sicituradastra team-member. :)
ID: 1098470
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1098487 - Posted: 17 Apr 2011, 18:32:42 UTC - in response to Message 1098470.  

Well, (CPU time / Elapsed time)*100% should give you the average CPU load in the case of 1 task per GPU.
If 2 instances are running on the same GPU, then the average CPU load can be computed with the following expression:
100%*2*CPU_time_for_single_task_running_together_with_another/Elapsed_time_for_single_task.
On the graph, "elapsed" is shown as a normalized value (Elapsed_time_for_single_task/2, because 2 tasks were done in this wall-clock time).

So, the expression will be: 100%*CPU_time/Norm_elapsed_time.
If the normalized elapsed time in the case of 2 instances per GPU is less than the elapsed time for a single instance, and the CPU time is almost the same, then the CPU load (in %) is increased.

Actually, those graphs were acquired on a fully loaded quad system, that is, 4 AKv8 CPU tasks + 1 or 2 GPU AP/MB tasks. So they reflect real host performance, accounting for all existing overheads (task switching in user space, cache thrashing and so on).
The only CPU overhead not accounted for is OS overhead in kernel space (drivers, task switching in kernel space and so on). AFAIK it should be reported in the CPU time of the System process (and maybe partly in a separate GPU driver process), but I didn't capture it.
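
The expressions above can be written out as a short sketch. The function names are made up for illustration, the extension from "1 or 2 instances" to any number of instances per GPU is an assumption, and the times in the example are hypothetical numbers chosen for easy arithmetic, not measurements.

```python
def avg_cpu_load_percent(cpu_time_s, elapsed_time_s, instances_per_gpu=1):
    """Average CPU load in percent caused by the GPU tasks sharing one GPU.

    cpu_time_s and elapsed_time_s are per-task values (for 2 instances:
    measured for one task while the other one is running alongside).
    """
    if elapsed_time_s <= 0:
        raise ValueError("elapsed time must be positive")
    return 100.0 * instances_per_gpu * cpu_time_s / elapsed_time_s


def normalized_elapsed_s(elapsed_time_s, instances_per_gpu=1):
    """Elapsed time per task as plotted on the graphs (wall clock / tasks done)."""
    return elapsed_time_s / instances_per_gpu


# Hypothetical numbers, not measurements.
single = avg_cpu_load_percent(cpu_time_s=300.0, elapsed_time_s=3000.0)
paired = avg_cpu_load_percent(cpu_time_s=300.0, elapsed_time_s=5000.0, instances_per_gpu=2)
norm = normalized_elapsed_s(5000.0, instances_per_gpu=2)
print(single, paired, norm, 100.0 * 300.0 / norm)
```

With these made-up numbers the normalized elapsed time for 2 instances (2500 s) is below the single-instance elapsed time (3000 s) while CPU time per task is unchanged, so the average CPU load goes up (from 10% to 12%), which is the effect described above.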
ID: 1098487
terencewee*
Joined: 10 Oct 09
Posts: 53
Credit: 7,022,510
RAC: 0
Malaysia
Message 1098554 - Posted: 17 Apr 2011, 22:58:07 UTC

Thank you so much for taking the time to explain the above.

I am still learning. I need to go run a single instance with 16384/8192/0-blanked first and launch Excel... :)

I should look at r512ElapATi and r512CPUATi. Both are single instance I presume.

ID: 1098554
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1098663 - Posted: 18 Apr 2011, 8:29:29 UTC - in response to Message 1098554.  
Last modified: 18 Apr 2011, 8:33:21 UTC

"x2" for running 2 tasks at once on graphs.
Here is a Perl script that I wrote to automate the collection of statistics.
If you feel comfortable with Perl, you could use it to simplify data gathering.


$path="client_state.xml"; 
$results="Times.txt";


open (IN, $path) or die "Cannot open $path: $!";
open (RES, ">".$results) or die "Cannot open $results: $!";
print RES "Task name"."\t"."Result Type"."\t"."Revision"."\t"."Parameter"."\t"."ElapsedTime"."\t"."CPUTime"."\n";

while (<IN>) {   
		if( /<result>/ ){
#R: we need only result sections in file 
			$trueAR=-1;#R: error condition, will not store such blocks
			$WUname="";
			$ElapsedTime=0;
			$CPUTime=0;
			$ResultType=0;#R: 1-AKv8,2-ATI MB,3-ATI AP,4-CPU AP, 5-CUDA MB
			$Revision=0;#R: to distinguish between different builds
			while(<IN>){
				if( /<\/result>/ ){
					#R: ready to analyse collected info
					if( ($trueAR==-1) || ($ElapsedTime==0) ||($CPUTime==0) || ($Revision==0) ||
						($ResultType==0) ){ last;#R: record unready
					}
					print RES $WUname."\t".$ResultType."\t".$Revision."\t".$trueAR."\t".$ElapsedTime."\t".$CPUTime."\n";	
					last;#R: finished with this result record
				}
				if(/<name>(.*)<\/name>/){
					$WUname=$1;
					next;
				}
				if( /<final_cpu_time>(.*)<\/final_cpu_time>/ ){
					$CPUTime=$1;
					next;
				}
				if( /<final_elapsed_time>(.*)<\/final_elapsed_time>/ ){
					$ElapsedTime=$1;
					next;
				}
				if( /<exit_status>(.*)<\/exit_status>/ ){
					if($1 !=0){ last;#R:invalid result, no need to look into it further
					}
					next;
				}
				if( /rev (.*), 5.06 match/ ){#R: AP detection block
					$ResultType=3;
					$Revision=$1;
					next;
				}
				if( /Windows x86 rev (.*), Don't Panic!/ || /Linux 64 bit, rel.  Rev (.*)/ || /Linux 32 bit, rel.  Rev (.*)/){#uje: CPU opt AP detection
					$ResultType=4;
					$Revision=$1;
					next;
				}
				if( /percent blanked: (\S*)/){
					$trueAR=$1;#R: for AP this parameter means blanking instead of AR
					next;
				}
				if( /Found 30 single pulses and 30 repeating pulses, exiting./){
 					$Revision=$CPUTime=$ElapsedTime=0;$trueAR=-1;last;#R: don't count overflows
				}
				if( /repetitive pulses: 30/){#R: no need task with FFA disabled
 					$Revision=$CPUTime=$ElapsedTime=0;$trueAR=-1;last;#R: don't count overflows
				}
				if( /Multibeam x32f Preview/){
					$Revision="32";
					$ResultType=5;
					while (<IN>) {   
					 if( /Informational message -9 result_overflow/){
   					   $Revision=$CPUTime=$ElapsedTime=0;$trueAR=-1;last;#R: don't count overflows
					 }
					 if( /<\/result>/ ){#R: broken stderr
   					   $Revision=$CPUTime=$ElapsedTime=0;$trueAR=-1;last;
					 }
					 if( /WU true angle range is :  (\S*)/ ){
						$trueAR=$1;
						next;
					 }
					 if( /<\/stderr_txt>/){ last;}
					}
				}
				if( /Build (.*) , Ported by/ ){#R: ATI MB/AKv8 detection block
					$Revision=$1;
					$ResultType=1;#R: suppose AKv8 by default, change if it's ATI MB
					while (<IN>) {   
					 if( /Informational message -9 result_overflow/){
   					   $Revision=$CPUTime=$ElapsedTime=0;$trueAR=-1;last;#R: don't count overflows
					 }
					 if( /<\/result>/ ){#R: broken stderr
   					   $Revision=$CPUTime=$ElapsedTime=0;$trueAR=-1;last;
					 }
					 if( /WU true angle range is :  (\S*)/ ){
						$trueAR=$1;next;
					 }
					 if(/OpenCL version by Raistmer/){
						$ResultType=2;#R: do result type correction
						next;
					 }
					 if( /<\/stderr_txt>/){ last;}
					}#R:finish with MB task
				}
			}

		}
}


Usage:
1) Copy client_state.xml into a separate directory, save the script as a .pl file there, and run:
perl <script_name.pl>
The extracted data will be in Times.txt (tab-separated), ready to copy and paste directly into Excel.
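
For those who would rather not use Excel, here is a minimal Python sketch that takes the script's tab-separated output and computes (CPU time / Elapsed time)*100% for each task. The column names are the ones the Perl script prints in its header line; the function name and the sample rows are made up for illustration and are not real results.

```python
import csv
import io


def cpu_load_per_task(tsv_text):
    """Return (task name, CPU load in %) for each row of the script's output."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    loads = []
    for row in rows:
        elapsed = float(row["ElapsedTime"])
        cpu = float(row["CPUTime"])
        loads.append((row["Task name"], 100.0 * cpu / elapsed))
    return loads


# Made-up sample in the same layout as Times.txt; not real measurements.
sample = (
    "Task name\tResult Type\tRevision\tParameter\tElapsedTime\tCPUTime\n"
    "ap_example_task_0\t3\t516\t0.00\t2000.0\t150.0\n"
    "ap_example_task_1\t3\t516\t12.50\t2500.0\t500.0\n"
)
for name, load in cpu_load_per_task(sample):
    print(f"{name}: {load:.1f}% CPU")
```

To run it on real data, pass in the contents of Times.txt instead of the sample string. The value is the load for one task per GPU; with 2 tasks per GPU it has to be doubled.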
ID: 1098663
Miep
Volunteer moderator
Joined: 23 Jul 99
Posts: 2412
Credit: 351,996
RAC: 0
Message 1098701 - Posted: 18 Apr 2011, 12:36:21 UTC - in response to Message 1098663.  

"x2" for running 2 tasks at once on graphs.
Here is a Perl script that I wrote to automate the collection of statistics.
If you feel comfortable with Perl, you could use it to simplify data gathering.

Usage:
1) Copy client_state.xml into a separate directory, save the script as a .pl file there, and run:
perl <script_name.pl>
The extracted data will be in Times.txt (tab-separated), ready to copy and paste directly into Excel.


NB: You will need some sort of Perl installed. The script only works while the tasks have been uploaded but not yet reported, so you might want to disable network activity to catch them. As posted, it doesn't work with stock apps (and hence doesn't work on beta).

BTW with the NV OpenCL MB app taking 1/2 to 2/3 of a core and BOINC happily running two full CPU tasks alongside, even on a count of .5 on npcu, I set a count of 1 and reserved a full core. Quoting DA: 'The client tries to use all GPUs, even if this overcommits the CPUs. The assumption is that this maximizes throughput.'

Consequently, if you are running two instances (or more), you might wish to set a count of .51 and thus make sure BOINC reserves the full CPU needed.



Carola
-------
I'm multilingual - I can misunderstand people in several languages!
ID: 1098701
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1098782 - Posted: 18 Apr 2011, 19:00:30 UTC - in response to Message 1098701.  


BTW with the NV OpenCL MB app taking 1/2 to 2/3 of a core and BOINC happily running two full CPU tasks alongside, even on a count of .5 on npcu, I set a count of 1 and reserved a full core. Quoting DA: 'The client tries to use all GPUs, even if this overcommits the CPUs. The assumption is that this maximizes throughput.'



Miep, that's data for the SETI7 OpenCL NV alpha app, not for the OpenCL MB SETI6 app, which has much lower CPU usage.
ID: 1098782
terencewee*
Joined: 10 Oct 09
Posts: 53
Credit: 7,022,510
RAC: 0
Malaysia
Message 1098834 - Posted: 18 Apr 2011, 22:41:32 UTC

@Raistmer:

Again, thank you for your assistance.


@Miep:
Yes, I do have ActivePerl installed on my computer (Qt git requires it).



ID: 1098834
Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1098903 - Posted: 19 Apr 2011, 9:31:46 UTC - in response to Message 1098834.  

Hi, I've been playing with ffa_block, ffa_block_fetch and unroll: ffa_block 16384, ffa_block_fetch 4096 and unroll 12 (rev. 516).

After trying:
<cmdline>-instances_per_device 1 -hp -unroll 10 -ffa_block 4096 -ffa_block_fetch 2048</cmdline>

CPU load is lower and GPU use is still low (and silent). This host is also running 8x SSSE3 MB.


Will let it run, also have to go.


ID: 1098903
Miep
Volunteer moderator
Joined: 23 Jul 99
Posts: 2412
Credit: 351,996
RAC: 0
Message 1098915 - Posted: 19 Apr 2011, 11:26:07 UTC - in response to Message 1098782.  

Miep, that's data for the SETI7 OpenCL NV alpha app, not for the OpenCL MB SETI6 app, which has much lower CPU usage.


Good :D

That was mainly a general comment, that if you see high CPU load from a GPU app, you might want to provide for it.
Carola
-------
I'm multilingual - I can misunderstand people in several languages!
ID: 1098915
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.