Question about the special app for linux

Message boards : Number crunching : Question about the special app for linux
Oddbjornik (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 15 May 99
Posts: 220
Credit: 349,610,548
RAC: 1,728
Norway
Message 1989646 - Posted: 11 Apr 2019, 13:43:57 UTC

Looking to improve the performance of the special app, I have a question, probably mainly for Petri:

As far as I can see on my hosts, there is an interval of a couple of seconds from when a task completes until the next one is fully loaded and up and running.

Would it be possible to reduce the wasted time by setting BOINC up to run two tasks at a time, and then use semaphores to let two instances of the special app synchronize between themselves, approximately in the following manner:

- Instance 1 starts, acquires the semaphore, loads stuff and starts working.
- Instance 2 starts, does all possible initialization but does not begin any work that loads the GPU.
- Instance 1 completes its work on the GPU, signals instance 2 to start working (releases the semaphore), and then finishes up the tasks that do not load the GPU.
- Instance 2 acquires the semaphore and immediately starts working while instance 1 is finishing up.
- Instance 1 then starts its next task, doing all possible initialization as in step 2 above.

Could there be anything to gain from such an approach?
ID: 1989646
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22158
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1989650 - Posted: 11 Apr 2019, 14:25:02 UTC

The special app is designed to use as much of the GPU's resources as possible when running a single task.
The overhead of using semaphores in the manner you describe may well reduce performance. It is rumoured that Petri has come up with a few more wrinkles that should further improve the performance.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1989650
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 1989653 - Posted: 11 Apr 2019, 14:36:33 UTC - in response to Message 1989646.  

Petri is working on a new version of the app with a similar goal of reducing some of the wasted time, though he may be using a different method. From what I remember he's claiming a 10-15% speed boost, but it's still in the testing phase.

One thing you can do now, if you aren't already, is add the -nobs command-line argument to your app_info.xml file in the appropriate location. It'll work your CPU harder, but you'll get maybe a 5% speedup.
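For reference, the -nobs switch goes in the <cmdline> element of the relevant <app_version> block in app_info.xml. An illustrative fragment follows; the app name, version number, plan class and file name here are placeholders, so use the values already in your own file:

```xml
<!-- Fragment of app_info.xml (anonymous platform).
     Only the <cmdline> line is the point here. -->
<app_version>
  <app_name>setiathome_v8</app_name>
  <version_num>800</version_num>
  <plan_class>cuda_special</plan_class>
  <cmdline>-nobs</cmdline>
  <file_ref>
    <file_name>setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda101</file_name>
    <main_program/>
  </file_ref>
</app_version>
```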
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 1989653
Oddbjornik (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 15 May 99
Posts: 220
Credit: 349,610,548
RAC: 1,728
Norway
Message 1989657 - Posted: 11 Apr 2019, 14:59:28 UTC - in response to Message 1989650.  

> The overhead of using semaphores in the manner you describe may well reduce performance.

I respectfully disagree. My suggestion is that the app would still run as a single task for as long as it has work to do on the GPU. The semaphore (or mutex) would only be acquired once, i.e. before GPU processing starts, and it would then be held for the duration of the GPU work.
But it is a real question whether or not there would be anything substantial to gain from such an approach.
ID: 1989657
petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1989682 - Posted: 11 Apr 2019, 19:30:55 UTC - in response to Message 1989646.  

> Looking to improve the performance of the special app, I have a question, probably mainly for Petri:
>
> As far as I can see on my hosts, there is an interval of a couple of seconds from when a task completes until the next one is fully loaded and up and running.
>
> Would it be possible to reduce the wasted time by setting BOINC up to run two tasks at a time, and then use semaphores to let two instances of the special app synchronize between themselves, approximately in the following manner:
>
> - Instance 1 starts, acquires the semaphore, loads stuff and starts working.
> - Instance 2 starts, does all possible initialization but does not begin any work that loads the GPU.
> - Instance 1 completes its work on the GPU, signals instance 2 to start working (releases the semaphore), and then finishes up the tasks that do not load the GPU.
> - Instance 2 acquires the semaphore and immediately starts working while instance 1 is finishing up.
> - Instance 1 then starts its next task, doing all possible initialization as in step 2 above.
>
> Could there be anything to gain from such an approach?


Hi Oddbjörnik,

In short: Yes! Running one at a time and initializing one in the background makes sense. I'm glad you noticed that too. I like to let my GPU cool off for those seconds, but the super crunchers with their water-cooled units would (I guess) like to have that feature right now, or preferably yesterday.

Those seconds could really make a difference, especially when running a long batch of shorties. Implementing such a scheme is not so hard. The source code is available, and I'd be happy to include it into the code if someone has time to experiment, develop and test.

The upcoming version has a much reduced memory footprint, so you will all be able to experiment. (You can try setting -unroll 1 with the current code and running 2 at a time; my machine was slow with that.)

Petri
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1989682
Oddbjornik (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 15 May 99
Posts: 220
Credit: 349,610,548
RAC: 1,728
Norway
Message 1989838 - Posted: 12 Apr 2019, 22:03:23 UTC - in response to Message 1989682.  

I've started fooling around with mutexes/semaphores/futexes on the Linux platform. I don't know if anything useful will come out of it yet, but it's fun to play with.
I'm a bit surprised by the apparent lack of a simple, robust mutex on Linux. Much error handling and cleanup needs to be taken care of, while the only thing I'm really interested in from my perspective is: do I own the mutex or don't I?
I don't care whether the other task/previous holder died a natural death, quit in a controlled manner, or was shot in the head with kill -9. However, such a carefree mutex mechanism seems elusive on the Linux platform.
Please correct me if I'm wrong, and point me in the right direction.
ID: 1989838
Oddbjornik (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 15 May 99
Posts: 220
Credit: 349,610,548
RAC: 1,728
Norway
Message 1989908 - Posted: 13 Apr 2019, 12:30:07 UTC

I think I have a (relatively) robust test / proof of concept up and running.
The idea is that you can start as many instances of this program as you like, and they will hold the mutex lock one at a time. If you then kill one of the running instances, one of the others will inherit the mutex, do the cleanup, and continue as if nothing had happened.
If anyone would care to take a closer look, here's the code:

//============================================================================
// Name        : robust.cpp
// Author      : Oddbjornik
// Build       : g++ robust.cpp -o robust -pthread -lrt
//============================================================================

#include <string.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <iostream>

using namespace std;

pthread_mutex_t *robustMutex;
pthread_mutexattr_t robustAttr;

int main(int argc, char *argv[]) {

	int delay = 2;
	if (argc > 1) {
		delay = atoi(argv[1]);
	}

	printf("Robust interprocess mutex test program\n");

	int ret;

	if ((ret = pthread_mutexattr_init(&robustAttr)) != 0) {
		printf("pthread_mutexattr_init: %d - %s!\n", ret, strerror(ret));
		exit(9);
	}

	if ((ret = pthread_mutexattr_setrobust(&robustAttr, PTHREAD_MUTEX_ROBUST)) != 0) {
		printf("pthread_mutexattr_setrobust: %d - %s!\n", ret, strerror(ret));
		exit(9);
	}

	if ((ret = pthread_mutexattr_setpshared(&robustAttr, PTHREAD_PROCESS_SHARED)) != 0) {
		printf("pthread_mutexattr_setpshared: %d - %s!\n", ret, strerror(ret));
		exit(9);
	}

	// Mutex must be created in named shared memory.
	// Code partly stolen from https://stackoverflow.com/questions/4068974/initializing-a-pthread-mutex-in-shared-memory

	const char *shmName = "/obliMutex_1";
	int shm = shm_open(shmName, (O_CREAT | O_RDWR | O_EXCL), (S_IRUSR | S_IWUSR));
	if (shm == -1) {
		// We failed, so someone else probably already owns the mutex.
		if (errno == EEXIST) {

			// Yes, right, wait for that other task to properly initialize
			usleep(1000);

			// Then just open it
			shm = shm_open (shmName, O_RDWR, (S_IRUSR | S_IWUSR));
			if (shm == -1) {
				printf("shm_open(O_RDWR): %d - %s!\n", errno, strerror(errno));
				exit(9);
			}

			// And find the already working mutex in shared memory
			robustMutex = (pthread_mutex_t*)mmap(NULL, sizeof *robustMutex, PROT_READ | PROT_WRITE, MAP_SHARED, shm, 0);

			// Check for memory mapping failure
			if (robustMutex == MAP_FAILED) {
				printf("mmap: %d - %s!\n", errno, strerror(errno));
				exit(9);
			}
		}
		else {
			// Some other error occurred
			printf("shm_open(O_CREAT | O_RDWR | O_EXCL): %d - %s!\n", errno, strerror(errno));
			exit(9);
		}
	}
	else {
		// We successfully created the shared memory. Now we must initialize the mutex inside it.

		// First allocate the necessary space
		if ((ret = ftruncate(shm, sizeof *robustMutex)) != 0) {
			printf("ftruncate: %d - %s!\n", errno, strerror(errno));
			exit(9);
		}

		// Then map the memory to our mutex pointer
		robustMutex = (pthread_mutex_t*)mmap(NULL, sizeof *robustMutex, PROT_READ | PROT_WRITE, MAP_SHARED, shm, 0);

		if (robustMutex == MAP_FAILED) {
			printf("mmap: %d - %s!\n", errno, strerror(errno));
			exit(9);
		}

		// And initialize the mutex
		if ((ret = pthread_mutex_init(robustMutex, &robustAttr)) != 0) {
			printf("pthread_mutex_init: %d - %s!\n", ret, strerror(ret));
			exit(9);
		}
	}

	// Mutex object has been obtained, one way or the other. Now loop for days and see if anything fails

	int count = 0;
	while (true)
	{
		// Obtain the lock
		if ((ret = pthread_mutex_lock(robustMutex)) != 0) {
			if (ret == EOWNERDEAD) {
				printf("That one died in a bad way, or maybe it just died.\n");

				if ((ret = pthread_mutex_consistent(robustMutex)) != 0) {
					printf("pthread_mutex_consistent: %d - %s!\n", ret, strerror(ret));
					exit(9);
				}

				// No need to (re-)lock the mutex in here, since EOWNERDEAD means we got the lock, we just
				// have to clean it up before unlocking it.
			}
			else if (ret == ENOTRECOVERABLE) {
				printf("Not recoverable!\n");

				// Once a robust mutex enters the not-recoverable state,
				// pthread_mutex_consistent() can no longer repair it; the
				// only remedy is to destroy the mutex and initialize it again.
				pthread_mutex_destroy(robustMutex);

				if ((ret = pthread_mutex_init(robustMutex, &robustAttr)) != 0) {
					printf("pthread_mutex_init: %d - %s!\n", ret, strerror(ret));
					exit(9);
				}

				if ((ret = pthread_mutex_lock(robustMutex)) != 0) {
					printf("pthread_mutex_lock after reinit: %d - %s!\n", ret, strerror(ret));
					exit(9);
				}
			}
			else {
				printf("pthread_mutex_lock: %d - %s!\n", ret, strerror(ret));
				exit(9);
			}
		}

		printf("Mutex loop count: %d\n", ++count);
		sleep(delay);

		pthread_mutex_unlock(robustMutex);

		usleep(1);	// Yield to next process
	}

	return 0;
}

ID: 1989908
petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1989934 - Posted: 13 Apr 2019, 16:22:17 UTC

Hi oddbjornik,

There is a global int variable gCUDADevPref that holds the -device num parameter value.
Each GPU should be allowed to run one task at a time.

See PM for additional details.

Petri
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1989934
Oddbjornik (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 15 May 99
Posts: 220
Credit: 349,610,548
RAC: 1,728
Norway
Message 1989938 - Posted: 13 Apr 2019, 17:47:20 UTC - in response to Message 1989934.  

> Each GPU should be allowed to run one task at a time.
That makes sense!
ID: 1989938
MarkJ (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 17 Feb 08
Posts: 1139
Credit: 80,854,192
RAC: 5
Australia
Message 1992956 - Posted: 7 May 2019, 12:03:36 UTC
Last modified: 7 May 2019, 12:04:22 UTC

Going off topic here. I recently upgraded one machine from a GTX 1060 to a 1660 Ti, and took the opportunity to upgrade from the CUDA 8.0 app to the CUDA 10.1 app while I was at it. Below is the output from one of each. Should I be worried that the CUDA 10.1 app has decided to use -pfp 1 on the GTX 1660 Ti while the GTX 1060 decided to use -pfp 9? They both have autotune in the command line.

GTX 1660 Ti
unroll limits: min = 1, max = 256. Using unroll autotune.
setiathome_CUDA: Found 1 CUDA device(s):
Device 1: GeForce GTX 1660 Ti, 5914 MiB, regsPerBlock 65536
computeCap 7.5, multiProcs 24
pciBusID = 9, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce GTX 1660 Ti is okay
SETI@home using CUDA accelerated device GeForce GTX 1660 Ti
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1

setiathome v8 enhanced x41p_V0.98b1, Cuda 10.1 special

GTX 1060
unroll limits: min = 1, max = 256. Using unroll autotune.
setiathome_CUDA: Found 1 CUDA device(s):
Device 1: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536
computeCap 6.1, multiProcs 9
pciBusID = 9, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce GTX 1060 3GB is okay
SETI@home using CUDA accelerated device GeForce GTX 1060 3GB
Unroll autotune 9. Overriding Pulse find periods per launch. Parameter -pfp set to 9

setiathome v8 enhanced x41p_zi3v, Cuda 8.00 special
CUDA 8.0 Special version by petri33.
BOINC blog
ID: 1992956
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1992971 - Posted: 7 May 2019, 14:22:17 UTC
Last modified: 7 May 2019, 14:24:47 UTC

No, the new autotune at -unroll 1 is faster in the new 0.98b1 app. You can prove it to yourself by running both unroll values in the benchmark.
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1992971



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.