BOINC assigns device X - Problem

Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1813381 - Posted: 29 Aug 2016, 7:48:31 UTC

Every w/u I've checked that was done with an OpenCL-based app reports being done on Device 0, and that is just plain wrong in about 50% of cases when using more than 1 GPU.

Cheers.
ID: 1813381 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1813461 - Posted: 29 Aug 2016, 15:57:43 UTC - in response to Message 1813375.  


Today's testing was run plain vanilla, with no command line parameters for the SoG tasks. The host is on BOINC 7.6.22.

Put some option into the command line and place a space after it, or just put some spaces into the cmd line. Will it help?

It doesn't appear that a trailing space makes a difference. I've just run two tasks with SoG, the first with a cmdline containing just two blank spaces. It's task 5123812346. It ran on Device 3 but the Stderr shows the usual "BOINC assigns device 0".

The second task, 5123914697, which ran on Device 1, was initially started with a simple "-use_sleep" cmdline. After confirming that "BOINC assigns device 0" was showing in the slot's Stderr and that -use_sleep was recognized, I suspended the task, added a space to the end of the cmdline ("-use_sleep ") and then resumed the task. The Stderr now shows a second occurrence of "BOINC assigns device 0". I also confirmed that the cmdline file with "-use_sleep" had a size of 10 bytes, while "-use_sleep " was 11 bytes, so I'm certain the trailing space wasn't being stripped before the file was saved.
ID: 1813461 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1813495 - Posted: 29 Aug 2016, 17:12:12 UTC - in response to Message 1813461.  

I can also attest to that.

No change with a trailing space in the command line.
ID: 1813495 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1813543 - Posted: 29 Aug 2016, 19:00:32 UTC - in response to Message 1813495.  

I can also attest to that.

No change with a trailing space in the command line.

I concur with my own tests.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1813543 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1813647 - Posted: 30 Aug 2016, 0:05:33 UTC

I have removed the trailing space and am letting it run. My initial observations were not from bench testing; I just looked at task files before and after the change. There could be timing issues that caused a misinterpretation, but if I let it run for a day, I should know for certain.

One thing is certain: before I did all of this, every task said "BOINC assigns device 0". That was with Lunatics. Now, without Lunatics, the device numbers are correct. I noticed they were correct from the start of the new install and thought the problem came back when I added my typical command line arguments.

I will report back in a day.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1813647 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 1813695 - Posted: 30 Aug 2016, 3:39:02 UTC - in response to Message 1813543.  

I can also attest to that.

No change with a trailing space in the command line.

I concur with my own tests.

Same here.
Always "BOINC assigns device 0" in Stderr, even though the manager shows device 0 & device 1.
Grant
Darwin NT
ID: 1813695 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22190
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1813710 - Posted: 30 Aug 2016, 5:09:56 UTC

Just done a check on one of my crunchers, and all tasks are shown as having run on device 0, despite being attributed to either a GTX970 or a GTX1080. BOINC manager shows tasks as running on devices 0, 1 & 2. That indicates to me that BOINC "knows" the correct device numbers while the task is running, but this isn't getting into the "write processor data to file" process.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1813710 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1813719 - Posted: 30 Aug 2016, 5:44:59 UTC

It seems there was a long discussion regarding device assignment logic over in Beta, about a month ago, that might have some bearing on this problem. Based on at least a simplistic understanding of what they were talking about over there, I just tried an experiment that Richard Haselgrove had suggested in that thread, namely to delete the

<api_version>7.5.0</api_version>

line from the SoG <app_version> section in app_info.xml (after shutting down BOINC, of course).

It appears that that might be where this problem originates, as well. Looking at the Stderr for Task 5124498887 which was running on Device 3 both before and after the app_info change, it shows "BOINC assigns device 0" before and both "Running on device number: 3" and "BOINC assigns device 3" after.

I'll also document two tasks that started from scratch after the change: Task 5124498589 also ran on Device 3 and Task 5124498260 ran on Device 2. The Stderr for each shows the correct device.

So, this problem also appears to tie in with that api_version / init_data.xml issue discussed in Beta. What the actual solution to that was, I couldn't actually figure out. ;^)
ID: 1813719 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1813725 - Posted: 30 Aug 2016, 6:12:21 UTC - in response to Message 1813719.  

So, this problem also appears to tie in with that api_version / init_data.xml issue discussed in Beta. What the actual solution to that was, I couldn't actually figure out. ;^)


The root of the problem is that someone decided to change the interface portion of the boincapi partway through a major version.

For the Cuda app builds on Windows, where I use an older (modified for thread safety) boincapi, the solution was indeed to begin always specifying the <api_version> when crafting an app_info. For newer api support (some Linux and Mac builds are built against a more recent BOINC), a small patch for main.cpp courtesy of Juha did the trick, such that the Cuda app may be built with the old or new api, and operate with older or newer clients.

The key point is that <api_version> should be specified as >= 7.5.0 if a newer api and client are used, or lower if the 'quasi-standard' --device command line needs to be used instead.
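
A minimal C++ sketch of that rule, for illustration only. The helper names (parse_api_version, device_channel_for) and the DeviceChannel enum are made up here and are not part of the BOINC API; the only thing taken from the thread is the 7.5.0 threshold. Treating a missing <api_version> as 0.0.0 is an assumption.

```cpp
#include <array>
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>

// Hypothetical helper: split "major.minor.release" into three integers.
// Missing or malformed parts are treated as 0 (an assumption of this sketch).
std::array<int, 3> parse_api_version(const std::string& text) {
    std::array<int, 3> parts{{0, 0, 0}};
    std::istringstream in(text);
    std::string field;
    for (std::size_t i = 0; i < parts.size() && std::getline(in, field, '.'); ++i) {
        try {
            parts[i] = std::stoi(field);
        } catch (const std::exception&) {
            parts[i] = 0;
        }
    }
    return parts;
}

// Where the device number is expected to arrive, per the rule described above.
enum class DeviceChannel { InitData, CommandLine };

// api_version >= 7.5.0: device comes via init_data.xml.
// Anything lower (or absent): device comes via the --device command line.
DeviceChannel device_channel_for(const std::string& api_version) {
    const std::array<int, 3> threshold{{7, 5, 0}};
    // std::array compares lexicographically, so 7.10.0 correctly ranks above 7.5.0.
    return parse_api_version(api_version) >= threshold
               ? DeviceChannel::InitData
               : DeviceChannel::CommandLine;
}
```

Under these assumptions, "7.5.0" selects init_data.xml, while "7.2.4" or an empty string (the line deleted from app_info.xml) falls back to the command line.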
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1813725 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1813733 - Posted: 30 Aug 2016, 6:59:38 UTC - in response to Message 1813725.  

The root of the problem is that someone decided to change the interface portion of the boincapi partway through a major version.

For the Cuda app builds on Windows, where I use an older (modified for thread safety) boincapi, the solution was indeed to begin always specifying the <api_version> when crafting an app_info. For newer api support (some Linux and Mac builds are built against a more recent BOINC), a small patch for main.cpp courtesy of Juha did the trick, such that the Cuda app may be built with the old or new api, and operate with older or newer clients.

The key point is that <api_version> should be specified as >= 7.5.0 if a newer api and client are used, or lower if the 'quasi-standard' --device command line needs to be used instead.


Would the problem be solved if the new BOINC Client Beta (7.6.33) is used with the latest Lunatics 0.45? Currently, I don't have the issue with 7.6.22 used with stock apps. But the strange part is I seem to remember having the device ID issue with r3430 of Lunatics, but don't have the issue now with stock apps, which now is r3430.

I'm using Windows 10, 64bit, and AMD Fiji GPU.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1813733 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1813736 - Posted: 30 Aug 2016, 7:40:48 UTC - in response to Message 1813733.  
Last modified: 30 Aug 2016, 7:43:33 UTC

I'm using Windows 10, 64bit, and AMD Fiji GPU.


As long as the Installer (presumably the newest Beta at least) specifies the correct <api_version> in the generated app_info.xml, then it *should* work fine. Lots of provisos there, but Richard's very careful with that AFAIK. I'm not sure exactly at which point the practice of manually specifying the api version became normal/expected, as with most libraries/apis you usually have some means of asking programmatically. Possibly it's just another sign that a hard major revision might be warranted, given that the libraries+client are increasingly subject to workarounds to deal with technology change not in the original design.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1813736 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1813737 - Posted: 30 Aug 2016, 7:46:48 UTC
Last modified: 30 Aug 2016, 7:56:57 UTC

Guys - we really have to work out whether this is an assignment problem (science application runs on wrong hardware), or a reporting problem (stderr lies to us about what is going on). Let me get some more coffee, then have a poke around.

FWIW, setting <api_version> >= 7.5.0 merely suppresses the ancient 'pass device number via command line' mechanism - that's the only possible vector by which fiddling with spaces in the command line could have any effect. The way to check that would be to run Process Explorer and inspect the actual command line passed to the app - make sure that parameter separators are present in all the right places.

(oh and also FWIW - seeing (device n) in BOINC Manager merely says BOINC scheduled, intended, and instructed the science application to run on that device. Doesn't tell us whether the app obeyed that instruction - that was the problem we had at Beta)
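
If that description is right, the client-side behaviour would look roughly like the sketch below. This is hypothetical and is not the BOINC client's code: build_command_line and its parameters are invented for illustration, and only the --device flag and the 7.5.0 cut-off come from this thread.

```cpp
#include <string>

// Hypothetical launcher sketch: append "--device N" only for apps whose
// declared api_version is older than 7.5.0. For newer apps the device number
// is assumed to travel via init_data.xml instead, so nothing is appended.
std::string build_command_line(const std::string& exe,
                               const std::string& user_args,
                               bool api_is_7_5_0_or_newer,
                               int device) {
    std::string cmd = exe;
    if (!user_args.empty()) {
        cmd += " " + user_args;
    }
    if (!api_is_7_5_0_or_newer) {
        cmd += " --device " + std::to_string(device);
    }
    return cmd;
}
```

In this model, extra spaces in user_args never change which device number reaches the app: either --device is appended regardless, or it is not appended at all.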
ID: 1813737 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1813739 - Posted: 30 Aug 2016, 8:16:53 UTC

First indications are that it's a reporting problem. For a task which BOINC says is running on device 1:

BOINC assigns device 0

OpenCL version by Raistmer, r3500

 OpenCL Platform Name:					 NVIDIA CUDA
Number of devices:				 2

  Max compute units:				 13
  Name:						 GeForce GTX 970

  Max compute units:				 5
  Name:						 GeForce GTX 750 Ti

Used GPU device parameters are:
	Number of compute units: 5

Command line according to Process Explorer:

"projects/setiathome.berkeley.edu/MB8_win_x86_SSE3_OpenCL_NV_SoG_r3500.exe  "

(BOINC clearly adds a couple of spaces anyway for luck)

(I am using a whole lot of tuning parameters, but they're read from a file, not passed on the actual command line.)

Compare the command line for the other GPU:

"projects/www.gpugrid.net/acemd.848-65.exe   --device 0"

That project uses <api_version>7.2.4</api_version>, and also passes the device information via the preferred

<gpu_type>NVIDIA</gpu_type>
<gpu_device_num>0</gpu_device_num>
<gpu_opencl_dev_index>0</gpu_opencl_dev_index>

in init_data.xml
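
For anyone who wants to check those values by hand, here is a rough sketch of pulling the integer fields out of an init_data.xml fragment like the one above. read_int_tag is a hypothetical helper written for illustration; a real application would normally let the BOINC API parse init_data.xml rather than doing its own string search.

```cpp
#include <cstddef>
#include <exception>
#include <string>

// Hypothetical helper: find <tag>value</tag> in a text fragment and convert
// the value to int. Returns false if the tag is absent or not numeric.
// This is a plain substring search, not a real XML parser.
bool read_int_tag(const std::string& xml, const std::string& tag, int& out) {
    const std::string open = "<" + tag + ">";
    const std::string close = "</" + tag + ">";

    const std::size_t start = xml.find(open);
    if (start == std::string::npos) {
        return false;
    }
    const std::size_t value_begin = start + open.size();
    const std::size_t value_end = xml.find(close, value_begin);
    if (value_end == std::string::npos) {
        return false;
    }
    try {
        out = std::stoi(xml.substr(value_begin, value_end - value_begin));
        return true;
    } catch (const std::exception&) {
        return false;  // e.g. <gpu_type>NVIDIA</gpu_type> is not an integer
    }
}
```

Calling it with "gpu_device_num" or "gpu_opencl_dev_index" on the fragment quoted above would yield 0 in both cases.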
ID: 1813739 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 1813740 - Posted: 30 Aug 2016, 8:21:07 UTC - in response to Message 1813737.  

(oh and also FWIW - seeing (device n) in BOINC Manager merely says BOINC scheduled, intended, and instructed the science application to run on that device. Doesn't tell us whether the app obeyed that instruction - that was the problem we had at Beta)

On my system I've got a GTX 1070 & a GTX 750Ti.
Going by the runtimes, the 750Ti is reported as Device 1 & the 1070 as Device 0.


From one of my Stderr outputs.
OpenCL platform detected: NVIDIA Corporation
BOINC assigns device 0
Info: BOINC provided OpenCL device ID used



Further along it then gives the details of the card that did the work.
Used GPU device parameters are:
	Number of compute units: 5
	Single buffer allocation size: 256MB
	Total device global memory: 2048MB


Which is the GTX 750Ti, which in the manager is Device 1.


In every GPU WU run on this system that I've clicked on, the entry in Stderr is always "BOINC assigns device 0", never device 1.
Grant
Darwin NT
ID: 1813740 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1813743 - Posted: 30 Aug 2016, 8:58:00 UTC - in response to Message 1813740.  

In every GPU WU run on this system that I've clicked on, the entry in Stderr is always "BOINC assigns device 0", never device 1.

I'm increasingly convinced that the phrase in quotation marks should be

"BOINC passes device 0 in the command line"

and pays no attention to the actual assignment (which happens elsewhere). And of course '0' is an application default used in place of 'null' when the command line is empty.

One more coffee, then I'll delve into Raistmer's code.
ID: 1813743 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1813775 - Posted: 30 Aug 2016, 12:45:04 UTC

My god. It took one hell of a lot of coffee to wade through that spaghetti. But...

The message is printed by line 1159 of GPU_lock.cpp:

fprintf(stderr,"BOINC assigns device %d\n",BOINCs_device);

BOINCs_device is set either from the command line at line 475 of main.cpp:

AquireExecutionSlot::BOINCs_device=selected_device;

or at line 952 of GPU_lock.cpp:

unsigned AquireExecutionSlot::BOINCs_device=0;

But when we're using proper, native, OpenCL detection, we get to line 1290 of GPU_lock.cpp:

					fprintf(stderr,"Info: BOINC provided OpenCL device ID used\n");
					device_id=boinc_device_id;
				}
//R: now we know on what device program will be executed so query this particular device for capabilities
		host.Init(device_id);

(as we know, because stderr contains that 'Info:' print)

BOINCs_device <> boinc_device_id: the message prints the first variable, but the device actually used is taken from the second.

And now I'm late for lunch...
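
Stripped down, the mismatch can be sketched as below. The struct and function names are invented, and using -1 to mean "no device supplied in init_data.xml" is an assumption; only the two-variable situation and the message text come from the codewalk above. It is not Raistmer's code.

```cpp
#include <string>

// Hypothetical model of the two values discussed in the codewalk.
struct DeviceState {
    int cmdline_device = 0;     // plays the role of BOINCs_device: stays 0 without --device
    int init_data_device = -1;  // plays the role of boinc_device_id: -1 = not supplied (assumed)
};

// The device the app would actually run on: init_data wins when present.
int effective_device(const DeviceState& s) {
    return s.init_data_device >= 0 ? s.init_data_device : s.cmdline_device;
}

// What stderr shows today: only the command-line value is printed.
std::string message_as_observed(const DeviceState& s) {
    return "BOINC assigns device " + std::to_string(s.cmdline_device);
}

// One possible cure at source: print the value that is really used.
std::string message_reporting_effective(const DeviceState& s) {
    return "BOINC assigns device " + std::to_string(effective_device(s));
}
```

With init_data_device set to 1 and no --device argument, the first message still says device 0 while the work runs on device 1, which matches the reports earlier in the thread.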
ID: 1813775 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22190
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1813780 - Posted: 30 Aug 2016, 12:53:33 UTC

Richard's small coffee cup:


Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1813780 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1813793 - Posted: 30 Aug 2016, 13:44:22 UTC - in response to Message 1813780.  

Thanks rob. I'm off out to get a better pair of glasses for that sort of work (my new varifocals aren't really ideal...)
ID: 1813793 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1813812 - Posted: 30 Aug 2016, 15:33:58 UTC

I let my host 8064262 run all night with SoG and with the "<api_version>" line for the SoG app removed from app_info.xml. Everything looks fine, with device numbers being consistently reported correctly in the Stderr.
ID: 1813812 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1813822 - Posted: 30 Aug 2016, 15:54:05 UTC - in response to Message 1813812.  

I let my host 8064262 run all night with SoG and with the "<api_version>" line for the SoG app removed from app_info.xml. Everything looks fine, with device numbers being consistently reported correctly in the Stderr.

Yes, it'll work like that - and it confirms the mechanism causing the bug that I identified with my codewalk.

You can do that by editing app_info.xml, but if this application, or one of its derivatives, is ever deployed as a stock application, stock users will see the bug and not be able to work round it. Better to cure it at source. [<api_version> is extracted automatically from the final compiled application as part of the stock deployment process]
ID: 1813822 · Report as offensive


 