Problem with multi GPUs

Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1407433 - Posted: 24 Aug 2013, 20:30:52 UTC - in response to Message 1407431.  

Thanks. I'll break off and sleep on the problem for now.

I have an almost identical machine - i7, Win7/64, 2x GTX 670 - so I'll try to reproduce this next time I see any APs available to download.
ID: 1407433 · Report as offensive
Profile Cliff Harding
Volunteer tester
Joined: 18 Aug 99
Posts: 1432
Credit: 110,967,840
RAC: 67
United States
Message 1407434 - Posted: 24 Aug 2013, 20:35:34 UTC - in response to Message 1407136.  

Try this simple one:

<cc_config>
<options>
<use_all_gpus>1</use_all_gpus>
</options>
</cc_config>

See if it works, and if not, post the first 20 lines of the startup log with it.

Remember to totally exit BOINC before trying (not just boincmgr) to be sure it's working fine.


Juan - pasted into an empty app_config.xml and restarted BOINC. It was as if it wasn't read when BOINC started.


I don't buy computers, I build them!!
ID: 1407434 · Report as offensive
Profile Cliff Harding
Volunteer tester
Joined: 18 Aug 99
Posts: 1432
Credit: 110,967,840
RAC: 67
United States
Message 1407436 - Posted: 24 Aug 2013, 20:37:16 UTC

Going off-line, will pick up in the am.


I don't buy computers, I build them!!
ID: 1407436 · Report as offensive
tbret
Volunteer tester
Joined: 28 May 99
Posts: 3380
Credit: 296,162,071
RAC: 40
United States
Message 1407437 - Posted: 24 Aug 2013, 20:39:32 UTC - in response to Message 1407431.  

Tbret - The machine has always run at 80% (6 CPU tasks/cores). The GPU tasks are assigned .5 CPU and .5 count. So that's 6 CPU tasks & 4 GPU tasks. Since this problem, I did drop the percentage to 50% and still only one GPU was working, so I reset it back to 80%.




Cliff - I AM guessing, but I am NOT shooting in the dark. Reduce the CPU usage to 25%, or all the way down to 0%, and try it again. AP "wants" a full core per instance no matter what you try to tell it. Things changed with the NV OpenCL application.

It won't take but a minute, it's easy, and it eliminates a potential problem.

Then let's get 4 running on the GPU, THEN go back and add CPU.

Don't follow me down this rabbit-hole. I'm just reporting, not suggesting you start editing files: The only times I have ever had the problem you are having was when I set total GPU usage too high in app_config (in other words, told it 51% one thing, 50% another thing) OR I ran the system out of CPU resources.

I've had it happen with triple GPUs, double GPUs, and single GPUs and in each case it was a matter of having too few cores available for NV OpenCL. It never used to happen with just MB.

The fact that you can get work on both cards says the cards are working.

The fact that one card would run two instances of AP while the other was idle shows that you have the correct config to run two instances per card.

The text files you are posting say that the parameters are correct.

The problem seems to be the total number of APs running at once.

It is not a malfunction for an AP not-to-start if it can't.

What keeps it from starting is a lack of resources. I suspect this strongly.

You need to leave 4 free cores, not 3 or 2.

Just try it, please. I think we may be hunting pink elephants.
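
The arithmetic behind "leave 4 free cores" can be sketched roughly as below. This is only an illustration under stated assumptions: an 8-thread i7 (the thread never gives Cliff's logical CPU count), the "use at most N% of the CPUs" setting being rounded down to a whole number, and each AP GPU instance wanting one whole free core, as suggested above. The function names are made up for the example and are not part of BOINC.

```python
def cpus_for_cpu_tasks(logical_cpus: int, cpu_percent: int) -> int:
    """Whole CPUs that may be filled with CPU tasks (assumed to round down)."""
    return (logical_cpus * cpu_percent) // 100


def free_cores(logical_cpus: int, cpu_percent: int) -> int:
    """Cores left over to feed GPU tasks."""
    return logical_cpus - cpus_for_cpu_tasks(logical_cpus, cpu_percent)


if __name__ == "__main__":
    ap_instances = 4  # 2 GPUs x 2 AP tasks each
    for pct in (80, 50, 25):
        spare = free_cores(8, pct)  # 8 logical CPUs is an assumption
        verdict = "enough" if spare >= ap_instances else "too few"
        print(f"{pct}%: {spare} free cores -> {verdict} for {ap_instances} AP instances")
```

Under those assumptions 80% leaves only 2 cores free, which is why dropping to 50% or lower is the quick test.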
ID: 1407437 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1407440 - Posted: 24 Aug 2013, 20:49:03 UTC - in response to Message 1407437.  
Last modified: 24 Aug 2013, 21:00:00 UTC

Tbret

I think he has something wrong in his app_info file, you know, a single letter in the wrong place, since all is working with 1 WU and he says it all started after an HD failure. He could have something hidden in one file that the editor can't show.

My suggestion is to try the same thing with a small app_config.xml file, exactly to avoid that, something like this (he doesn't do MB):

<app_config>
<app>
<name>astropulse_v6</name>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>1.00</cpu_usage>
</gpu_versions>
</app>
</app_config>

It would do the same as the changes in the app_info file, but as the configuration in app_config overrides app_info, that could prove my point. For debugging I prefer to use a small, well-tested file with less chance of typing errors.
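
One way to rule out a stray character in a file like this is to run it through an XML parser before handing it to BOINC. The sketch below uses Python's standard xml.etree module on the app_config shown above; read_usage is a made-up helper name, and the check only proves the file is well-formed XML, not that BOINC will accept it.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<app_config>
<app>
<name>astropulse_v6</name>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>1.00</cpu_usage>
</gpu_versions>
</app>
</app_config>"""


def read_usage(xml_text: str) -> dict:
    """Return {app name: (gpu_usage, cpu_usage)}; raises ET.ParseError on bad XML."""
    root = ET.fromstring(xml_text)
    usage = {}
    for app in root.findall("app"):
        name = app.findtext("name", default="").strip()
        # Missing values show up as NaN rather than being guessed.
        gpu = float(app.findtext("gpu_versions/gpu_usage", default="nan"))
        cpu = float(app.findtext("gpu_versions/cpu_usage", default="nan"))
        usage[name] = (gpu, cpu)
    return usage


if __name__ == "__main__":
    print(read_usage(SAMPLE))
```

A mismatched or mistyped tag makes ET.fromstring raise ParseError with a line and column, which is usually enough to find the bad character.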

Yes, we are chasing the evil creature... but it is grey, not pink...
ID: 1407440 · Report as offensive
tbret
Volunteer tester
Joined: 28 May 99
Posts: 3380
Credit: 296,162,071
RAC: 40
United States
Message 1407448 - Posted: 24 Aug 2013, 21:16:29 UTC - in response to Message 1407440.  



My suggestion is to try the same thing with a small app_config.xml file, exactly to avoid that, something like this (he doesn't do MB):





Of course, you are correct to suggest that. With any troubleshooting procedure the first thing you want to do is to make it simple.

What seems to be going-on here is that we are changing multiple parameters and then trying again. Cliff may be solving his problem, but simultaneously creating another.

If the goal is to get both GPUs to run two instances I think "we" should stop trying to do anything else. Fix that. Then add. Your suggestion to start simple is the same idea. I agree.
ID: 1407448 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1407651 - Posted: 25 Aug 2013, 18:28:41 UTC

OK, we have some AP tasks again, so I tried to replicate Cliff's situation as closely as I could. Used the Lunatics v0.41 apps and app_info: instead of modifying app_info, I over-rode it with an app_config specifying 0.5 CPU and 0.5 GPU. I run 6 cores normally (i7 with HT on), so running 4xAP knocks out another two cores.

I'm running 2xGTX 670 - close enough - so I copied Cliff's command line, but did not add the -instances switch we discussed yesterday.

25/08/2013 18:49:13 | SETI@home | [coproc] NVIDIA instance 0: confirming for 26mr08ag.22530.17655.13.12.12_0
25/08/2013 18:49:13 | SETI@home | [coproc] NVIDIA instance 1: confirming for 28fe08ad.23910.17250.12.12.93_0
25/08/2013 18:49:13 | SETI@home | [coproc] NVIDIA instance 1: confirming for 26mr08ag.26842.15610.14.12.30_0
25/08/2013 18:49:13 | SETI@home | [coproc] NVIDIA instance 0: confirming for 26mr08ag.26842.15610.14.12.27_1

25/08/2013 18:49:31 | SETI@home | Computation for task 26mr08ag.22530.17655.13.12.12_0 finished
25/08/2013 18:49:31 | SETI@home | [coproc] NVIDIA instance 1: confirming for 28fe08ad.23910.17250.12.12.93_0
25/08/2013 18:49:31 | SETI@home | [coproc] NVIDIA instance 1: confirming for 26mr08ag.26842.15610.14.12.30_0
25/08/2013 18:49:31 | SETI@home | [coproc] NVIDIA instance 0: confirming for 26mr08ag.26842.15610.14.12.27_1
25/08/2013 18:49:31 | SETI@home | [coproc] Assigning 0.500000 of NVIDIA instance 0 to ap_02no08ab_B3_P0_00056_20130825_20140.wu_1
25/08/2013 18:49:31 | SETI@home | Starting task ap_02no08ab_B3_P0_00056_20130825_20140.wu_1 using astropulse_v6 version 604 (cuda_opencl_100) in slot 0

25/08/2013 18:49:46 | SETI@home | Computation for task 28fe08ad.23910.17250.12.12.93_0 finished
25/08/2013 18:49:46 | SETI@home | [coproc] NVIDIA instance 1: confirming for 26mr08ag.26842.15610.14.12.30_0
25/08/2013 18:49:46 | SETI@home | [coproc] NVIDIA instance 0: confirming for 26mr08ag.26842.15610.14.12.27_1
25/08/2013 18:49:46 | SETI@home | [coproc] NVIDIA instance 0: confirming for ap_02no08ab_B3_P0_00056_20130825_20140.wu_1
25/08/2013 18:49:46 | SETI@home | [coproc] Assigning 0.500000 of NVIDIA instance 1 to ap_02no08ab_B2_P1_00091_20130825_19253.wu_1
25/08/2013 18:49:46 | SETI@home | Starting task ap_02no08ab_B2_P1_00091_20130825_19253.wu_1 using astropulse_v6 version 604 (cuda_opencl_100) in slot 9

25/08/2013 18:56:16 | SETI@home | Computation for task 26mr08ag.26842.15610.14.12.30_0 finished
25/08/2013 18:56:16 | SETI@home | [coproc] NVIDIA instance 0: confirming for 26mr08ag.26842.15610.14.12.27_1
25/08/2013 18:56:16 | SETI@home | [coproc] NVIDIA instance 0: confirming for ap_02no08ab_B3_P0_00056_20130825_20140.wu_1
25/08/2013 18:56:16 | SETI@home | [coproc] NVIDIA instance 1: confirming for ap_02no08ab_B2_P1_00091_20130825_19253.wu_1
25/08/2013 18:56:16 | SETI@home | [coproc] Assigning 0.500000 of NVIDIA instance 1 to ap_02no08ab_B3_P0_00121_20130825_20140.wu_1
25/08/2013 18:56:16 | SETI@home | Starting task ap_02no08ab_B3_P0_00121_20130825_20140.wu_1 using astropulse_v6 version 604 (cuda_opencl_100) in slot 1

25/08/2013 19:03:52 | SETI@home | Computation for task 26mr08ag.26842.15610.14.12.27_1 finished
25/08/2013 19:03:52 | SETI@home | [coproc] NVIDIA instance 0: confirming for ap_02no08ab_B3_P0_00056_20130825_20140.wu_1
25/08/2013 19:03:52 | SETI@home | [coproc] NVIDIA instance 1: confirming for ap_02no08ab_B2_P1_00091_20130825_19253.wu_1
25/08/2013 19:03:52 | SETI@home | [coproc] NVIDIA instance 1: confirming for ap_02no08ab_B3_P0_00121_20130825_20140.wu_1
25/08/2013 19:03:52 | SETI@home | [coproc] Assigning 0.500000 of NVIDIA instance 0 to ap_02no08ab_B2_P1_00190_20130825_19253.wu_1
25/08/2013 19:03:52 | SETI@home | Starting task ap_02no08ab_B2_P1_00190_20130825_19253.wu_1 using astropulse_v6 version 604 (cuda_opencl_100) in slot 6

25/08/2013 19:04:41 | SETI@home | [coproc] NVIDIA instance 0: confirming for ap_02no08ab_B3_P0_00056_20130825_20140.wu_1
25/08/2013 19:04:41 | SETI@home | [coproc] NVIDIA instance 1: confirming for ap_02no08ab_B2_P1_00091_20130825_19253.wu_1
25/08/2013 19:04:41 | SETI@home | [coproc] NVIDIA instance 1: confirming for ap_02no08ab_B3_P0_00121_20130825_20140.wu_1
25/08/2013 19:04:41 | SETI@home | [coproc] NVIDIA instance 0: confirming for ap_02no08ab_B2_P1_00190_20130825_19253.wu_1

which is as tidy a handover as one could wish for. Whatever the problem is, I withdraw my suggestion it might be to do with the AP -instances switch.
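
For anyone comparing their own log against this one, a small script can tally which GPU each task was assigned to. The sketch below matches only the "[coproc] Assigning ... of NVIDIA instance N to ..." lines exactly as they appear in the excerpt above; the function name is made up, and the tally is cumulative over the log, not a count of what is running at any one moment.

```python
import re
from collections import defaultdict

# Pattern written against the [coproc] lines quoted in this post.
ASSIGN = re.compile(r"\[coproc\] Assigning ([\d.]+) of NVIDIA instance (\d+) to (\S+)")

SAMPLE_LOG = """\
25/08/2013 18:49:31 | SETI@home | [coproc] Assigning 0.500000 of NVIDIA instance 0 to ap_02no08ab_B3_P0_00056_20130825_20140.wu_1
25/08/2013 18:49:46 | SETI@home | [coproc] Assigning 0.500000 of NVIDIA instance 1 to ap_02no08ab_B2_P1_00091_20130825_19253.wu_1
25/08/2013 18:56:16 | SETI@home | [coproc] Assigning 0.500000 of NVIDIA instance 1 to ap_02no08ab_B3_P0_00121_20130825_20140.wu_1
25/08/2013 19:03:52 | SETI@home | [coproc] Assigning 0.500000 of NVIDIA instance 0 to ap_02no08ab_B2_P1_00190_20130825_19253.wu_1
"""


def assignments_per_instance(log_text: str) -> dict:
    """Map GPU instance number -> list of (task name, fraction assigned)."""
    tasks = defaultdict(list)
    for line in log_text.splitlines():
        match = ASSIGN.search(line)
        if match:
            fraction, instance, task = match.groups()
            tasks[int(instance)].append((task, float(fraction)))
    return dict(tasks)


if __name__ == "__main__":
    for instance, items in sorted(assignments_per_instance(SAMPLE_LOG).items()):
        print(f"instance {instance}: {len(items)} task(s) assigned")
```

If one instance never appears in the tally while the other keeps receiving work, that matches the symptom Cliff describes.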

The only difference is that I'm running BOINC v7.2.10, and Cliff has v7.0.64: we did have to get some changes made between those two versions, to accommodate multi-threaded applications and GPUs. I have to go out again, so can't test that right now, but it's on my ToDo list.
ID: 1407651 · Report as offensive
Profile Cliff Harding
Volunteer tester
Joined: 18 Aug 99
Posts: 1432
Credit: 110,967,840
RAC: 67
United States
Message 1407694 - Posted: 25 Aug 2013, 21:09:11 UTC
Last modified: 25 Aug 2013, 21:15:47 UTC

Day 2 --

After a quick perusal of the posts since I signed off yesterday, I think a quick review of the situation is in order.

1) This machine has been primarily an AP machine for over two years, always working 2 tasks per GPU with multiple GPUs, through several iterations of GPU upgrades, with the CPU and count being set to .5, and using 80% of the CPUs (6 cores) with no apparent problems. The current GPUs are 2 x EVGA GTX660SC @ 2 GB.

2) On 19 Aug the system SSD (C:) died and had to be RMAed. The data directories are on the data drive which is on another physical device (D:). After installation of the new SSD and a clean install of all programs/modules including the OS and BOINC/Lunatics, BOINC/Seti developed a condition which I consider to be unacceptable for this machine and am attempting to resolve.

3) The condition being that only 1 GPU device is currently working Seti tasks instead of both. BOINC recognizes both GPUs as cuda and opencl devices.

4) This machine has 2 x pci-e x 16 slots. It doesn't matter whether the machine is running both GPUs or in single GPU configuration, with or without BOINC running, there is a suspected pci-e slot that shows the GPU in that slot to be under performing power/clock wise according to EVGA Precision X. This condition is what I believe is preventing BOINC/Seti from using both GPUs.

5) Suggestions from several people have been explored, including a simple app_config.xml file and decreasing the number of cores (btw, 0 cores is the same as 100%), none of which has changed the condition. The app_config was modified to include scenarios of the cpu and gpu count both being 1 and both being .5. All scenarios failed to accomplish anything different.

--------------------

As of today --

1) After reading Richard's post from today, I upgraded BOINC to 7.2.10. At the same time I modified the app_info.xml file to set the cpu and count to 1 for the AP tasks, this includes opencl. There was no change to the result. Only 1 GPU was working. I uninstalled 7.2.10, recycled the machine and reinstalled 7.0.64.

2) CPU & count is still set to 1 in the app_info.xml and BOINC is currently running only 1 opencl & 6 CPU tasks. Using the coproc debug flag, I can see device1 getting assigned work, but it is never confirmed.

3) The instances switch is still included in the ap_cmdline_winx86_SSE2_OpenCL_NV.txt file.

4) Question -- Does anyone think that completely removing everything under and including the parent data directory (D:\BOINC), and regenerating it all, would help the situation?

[edit] I checked out a couple of tasks for this machine and one thing I noticed is that I don't see where BOINC/Seti/Lunatics is checking device 1 to see if it is an available/compatible device to receive work. Usually I see this happening.[/edit]


I don't buy computers, I build them!!
ID: 1407694 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1407705 - Posted: 25 Aug 2013, 21:45:55 UTC - in response to Message 1407694.  

If you think you have a bad slot, remove the card from the good slot and see if the bad slot is still bad with only one card in the machine. If the 'bad' slot works fine with just one card in the machine, you might look at the power supply.

I don't think you've tried a different power supply yet?
ID: 1407705 · Report as offensive
musicplayer

Joined: 17 May 10
Posts: 2430
Credit: 926,046
RAC: 0
Message 1407706 - Posted: 25 Aug 2013, 21:48:54 UTC

Cliff.

I did not read it all, but have you tried swapping the two cards?
ID: 1407706 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1407707 - Posted: 25 Aug 2013, 21:49:43 UTC

If you have an app_config.xml file, remove it completely. Then restart BOINC.
Also, remove the -number* switch from the config file. Leave only <count>0.5</count> in the coprocessor block in app_info.xml.

SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1407707 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1407709 - Posted: 25 Aug 2013, 21:52:13 UTC

And at least this recent task of yours, http://setiathome.berkeley.edu/result.php?resultid=3121144507, was executed on the secondary GPU according to stderr.

SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1407709 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1407719 - Posted: 25 Aug 2013, 22:47:40 UTC - in response to Message 1407706.  

Cliff.

I did not read it all, but have you tried swapping the two cards?

Yes he did. It proved they are both working.
ID: 1407719 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1407721 - Posted: 25 Aug 2013, 22:59:10 UTC

I managed to create a similar situation to Cliff.

Scenario: this is the first time I've run AP on the i7. So, I have no APR established to guide BOINC's estimation of runtime. So, BOINC estimated (and is still estimating, 14 tasks later) that AP tasks will take 11 hours. So, in order to be able to get some sleep tonight without micromanaging, I turned up the cache setting to fetch more tasks.

Now, I run SETI - usually MB only, but for this test AP too - only on GPUs. I run other projects on CPU cores, and they mostly have short deadlines: 7 days, for the two I'm running currently.

So, when I tried setting the cache to 10 days, my CPU tasks were in deadline trouble and were switched into 'Earliest Deadline First' mode, shown as 'high priority'. And BOINC reclaimed every CPU possible to meet those deadlines: it pre-empted the GPU tasks that I'd set to use 0.5 CPUs.

That's one way to stop the full number of AP tasks running. I'll find a cc_config log flag that demonstrates the behaviour some other time, perhaps in the morning. In the meantime, I've turned the cache numbers back down again.

And please could my wingmates return some AP tasks from this new batch?
ID: 1407721 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1407722 - Posted: 25 Aug 2013, 23:00:23 UTC - in response to Message 1407707.  

if you have app_config.xml file remove it completely. Then restart BOINC.

What's wrong with app_config? What are you trying to discover here?
ID: 1407722 · Report as offensive
Profile Cliff Harding
Volunteer tester
Joined: 18 Aug 99
Posts: 1432
Credit: 110,967,840
RAC: 67
United States
Message 1407724 - Posted: 25 Aug 2013, 23:22:03 UTC

Richard -

All of my CPU tasks are running high-priority, but I've seen them all run hp before and still have the GPUs working @ 2x. When this happens, I've usually dropped down to 70% and let it run.

Since the app_config in theory is read after the app_info and overrides it, a suggestion was made to create one to see if it resolved the situation under any of the scenarios that I presented. Unfortunately it looks like the file was not read and did not override the app_info. I will keep it, but rename it for retention.

Raistmer -

Even though the task was reported on 19 Aug., it was reported prior to the crash. I will remove the switch.

TBar -

Already ahead of you, luckily I have an identical psu in the secondary machine and swapping them did not resolve the problem.


I don't buy computers, I build them!!
ID: 1407724 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1407728 - Posted: 25 Aug 2013, 23:46:40 UTC - in response to Message 1407724.  

[quote]Unfortunately it looks like the file was not read and did not override the app_info. I will keep it, but rename it for retention.[/quote]
Could it be because of the .txt that Notepad normally adds to the end of the file name?

ID: 1407728 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1407733 - Posted: 25 Aug 2013, 23:56:09 UTC - in response to Message 1407728.  

Unfortunately it looks like the file was not read and did not override the app_info. I will keep it, but rename it for retention.

Could it be because of the .txt that Notepad normally adds to the end of the file name?

You would see the line

26/08/2013 00:50:35 | SETI@home | Found app_config.xml

in the message log if the file is properly named and placed, and read by BOINC.

It should be read (and logged) by BOINC at every restart, and whenever you 'Read config files' from the advanced menu in BOINC Manager - no restart needed.
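
Both checks discussed here, whether the log contains the "Found app_config.xml" line and whether the file picked up an extra extension, are easy to script. The sketch below works on text and file names you supply yourself (for example a saved copy of the message log and a listing of the project folder); the function names are made up for the example and nothing here talks to BOINC directly.

```python
def app_config_was_read(log_text: str) -> bool:
    """True if any log line reports that app_config.xml was found."""
    return any("Found app_config.xml" in line for line in log_text.splitlines())


def misnamed_candidates(filenames):
    """Names that look like app_config files but are not exactly app_config.xml."""
    return [
        name for name in filenames
        if name.lower().startswith("app_config") and name != "app_config.xml"
    ]


if __name__ == "__main__":
    log = "26/08/2013 00:50:35 | SETI@home | Found app_config.xml"
    print(app_config_was_read(log))
    # Hypothetical folder listing, showing the Notepad ".txt" case.
    print(misnamed_candidates(["app_info.xml", "app_config.xml.txt"]))
```

If the first check fails and the second returns something like app_config.xml.txt, renaming the file is the first thing to try.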

Cliff - what version of BOINC was running before the crash/rebuild? I'll try v7.0.64 in both EDF and normal modes in the morning, but not now, I'm afraid - too tired to risk the inevitable mistakes.
ID: 1407733 · Report as offensive
Profile Cliff Harding
Volunteer tester
Joined: 18 Aug 99
Posts: 1432
Credit: 110,967,840
RAC: 67
United States
Message 1407738 - Posted: 26 Aug 2013, 0:05:32 UTC - in response to Message 1407721.  
Last modified: 26 Aug 2013, 0:10:00 UTC

So, when I tried setting the cache to 10 days, my CPU tasks were in deadline trouble and were switched into 'Earliest Deadline First' mode, shown as 'high priority'. And BOINC reclaimed every CPU possible to meet those deadlines: it pre-empted the GPU tasks that I'd set to use 0.5 CPUs.

That's one way to stop the full number of AP tasks running. I'll find a cc_config log flag that demonstrates the behaviour some other time, perhaps in the morning. In the meantime, I've turned the cache numbers back down again.

And please could my wingmates return some AP tasks from this new batch?


Richard -

I took what you noted and reversed it by suspending all CPU tasks (93) and the GPUs kicked in full force. I resumed the CPU tasks individually and when I hit 4, 1 task on device 1 was suspended. When I hit 6, both tasks on device 1 were suspended.

So it seems that what was perceived to be a problem has been solved. I have restored the app_info.xml to its original state, both cpu & count @ .5. I also have suspended all GPU activity while the machine is busy to allow the CPU tasks the maximum space to get caught up. Time estimates for CPU tasks have been extended from approx. 11 hrs. to 13 hrs. Not going to try to guess when the backlog will be caught up. The GPU tasks should be alright as they only take a little over an hour to complete and the first deadline is 12 Aug.

As far as the app_config.xml file is concerned, I don't remember seeing it in the log either when starting BOINC or when re-reading the config file.

I want to thank everyone for taking the time and effort on this. MANY THANKS!!


I don't buy computers, I build them!!
ID: 1407738 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1407740 - Posted: 26 Aug 2013, 0:08:26 UTC - in response to Message 1407738.  

YAY! That's good to hear - I'll sleep well tonight.

And I'm glad we've saved you the price of a new motherboard along the way...
ID: 1407740 · Report as offensive


 