Problem with multi GPUs

Message boards : Number crunching : Problem with multi GPUs
Profile Cliff Harding
Volunteer tester
Joined: 18 Aug 99
Posts: 1432
Credit: 110,967,840
RAC: 67
United States
Message 1407743 - Posted: 26 Aug 2013, 0:16:14 UTC - in response to Message 1407740.  

YAY! That's good to hear - I'll sleep well tonight.

And I'm glad we've saved you the price of a new motherboard along the way...


A new build is down the pike. My stepson (41) needs a computer, and it's the perfect excuse to give him the i7/930 and build myself another by Christmas. It's the older of the two and it's starting to show its age. It will be a good starter for him, that is, if he can keep my 4-year-old granddaughter off of it. She loves to draw on it using the Paint program when she comes over.


I don't buy computers, I build them!!
ID: 1407743
tbret
Volunteer tester
Joined: 28 May 99
Posts: 3380
Credit: 296,162,071
RAC: 40
United States
Message 1407780 - Posted: 26 Aug 2013, 2:29:46 UTC - in response to Message 1407738.  


I took what you noted and reversed it by suspending all CPU tasks (93) and the GPUs kicked in full force.


Imagine that.

I'm glad you got it working. I think it was driving several of us crazy.
ID: 1407780
tbret
Volunteer tester
Joined: 28 May 99
Posts: 3380
Credit: 296,162,071
RAC: 40
United States
Message 1407786 - Posted: 26 Aug 2013, 2:57:09 UTC

Cliff - I'm sure you are tired of editing and fooling with it. BUT - if your 660s are like my 660Tis, then they will crunch two APs at a time, and you will pick up just a few minutes by doing two at a time.

HOWEVER, since you have v7 and AP running, you may find that your maximum throughput is to run 1 AP and 1 MB on each card.

You do that through the app_config.

Juan helped me through it after someone helped him through it, so he can advise you better than I can.

On one of my dual machines it looks like this:

<app_config>
    <app>
        <name>astropulse_v6</name>
        <max_concurrent>2</max_concurrent>
        <gpu_versions>
            <gpu_usage>.51</gpu_usage>
            <cpu_usage>1.0</cpu_usage>
        </gpu_versions>
    </app>
    <app>
        <name>setiathome_v7</name>
        <gpu_versions>
            <gpu_usage>.49</gpu_usage>
            <cpu_usage>.06</cpu_usage>
        </gpu_versions>
    </app>
</app_config>


That says: "If you've got AP only, run one per card. If you've got AP and MB, run one of each per card. If all you've got is MB, run two per card." It runs two because I also have two-per-card set by <count>.5</count> in app_info.
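For reference, the app_info fragment that sets two-per-card looks something like this (a sketch only - element names are from the stock anonymous-platform app_info format, and the unrelated elements of the <app_version> section are elided):

```xml
<app_version>
    <app_name>setiathome_v7</app_name>
    ...
    <coproc>
        <type>CUDA</type>
        <count>.5</count>
    </coproc>
</app_version>
```

A count of .5 means each task reserves half a GPU, so BOINC will start two per card.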

There's another way to do it, but someone else would have to help with that.



ID: 1407786
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1407811 - Posted: 26 Aug 2013, 5:04:21 UTC - in response to Message 1407722.  
Last modified: 26 Aug 2013, 5:08:51 UTC

If you have an app_config.xml file, remove it completely. Then restart BOINC.

What's wrong with app_config? What are you trying to discover here?


In case app_config has wrong values. Better to use either app_info or app_config for task-number settings, not both. Just one possible source of misconfiguration.

EDIT: And though the problem for this particular host seems to be solved, there is a problem in BOINC task management, IMO. Blocking the GPU completely when the CPU goes into panic mode seems like the wrong decision, particularly because the reasons the CPU goes into panic are too often not justified.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1407811
Profile Cliff Harding
Volunteer tester
Joined: 18 Aug 99
Posts: 1432
Credit: 110,967,840
RAC: 67
United States
Message 1407894 - Posted: 26 Aug 2013, 11:32:33 UTC - in response to Message 1407786.  

Juan -

Looking at your AP tasks, the times seem to be in line with mine. With only 1 task working I would have thought the times would be greatly reduced, but comparing similar times it seems that my running 2x per GPU is approximately the same as yours.

I do have one curiosity though, and it has baffled me from day one, and it's not just you that I see this with -- why is it that people with AMD machines desire to fill them with NVidia GPUs and vice versa? Is it simply to satisfy the requirements of other projects they may be running?

Tbret -

If you think it was driving you crazy, think what it was doing to this old man. The most AP tasks running at one time that I've seen prior to this is 3, and then the next 3 came online when they were done. I guess the 4-5 days that the machine was down threw everything out of whack.

I'm already running 2x on the GPUs, which I actually prefer, and I'll try your sample, since it is troublesome going through the app_info to make changes - but with a couple of changes: for AP both will be set to .5, and for MB, when it has to be used, the GPU will be .25 and the CPU @ .5. I've found that when I have to run MB, I can run 4x per GPU.
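Something like this, modeled on your sample (untested - the values are my reading of the plan above, taking "both set to .5" as gpu_usage and cpu_usage):

```xml
<app_config>
    <app>
        <name>astropulse_v6</name>
        <gpu_versions>
            <gpu_usage>.5</gpu_usage>
            <cpu_usage>.5</cpu_usage>
        </gpu_versions>
    </app>
    <app>
        <name>setiathome_v7</name>
        <gpu_versions>
            <gpu_usage>.25</gpu_usage>
            <cpu_usage>.5</cpu_usage>
        </gpu_versions>
    </app>
</app_config>
```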


ID: 1407894
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1407897 - Posted: 26 Aug 2013, 11:47:14 UTC - in response to Message 1407811.  

If you have an app_config.xml file, remove it completely. Then restart BOINC.

What's wrong with app_config? What are you trying to discover here?

In case app_config has wrong values. Better to use either app_info or app_config for task-number settings, not both. Just one possible source of misconfiguration.

Ah - if you just meant temporarily to eliminate confusion and possible side effects from a forgotten file while we diagnosed, I'd agree with you.

But you made such an assertive statement (without explanation) that I thought you meant that people shouldn't use it at all. Personally, I find it one of the most useful recent innovations - though it's incomplete, and still shows the rough edges of hasty coding.
ID: 1407897
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1407911 - Posted: 26 Aug 2013, 12:38:18 UTC - in response to Message 1407811.  
Last modified: 26 Aug 2013, 13:16:42 UTC

EDIT: And though the problem for this particular host seems to be solved, there is a problem in BOINC task management, IMO. Blocking the GPU completely when the CPU goes into panic mode seems like the wrong decision, particularly because the reasons the CPU goes into panic are too often not justified.

But I don't think that's a fair description. BOINC isn't blocking the GPUs per se: it's blocking an application which needs both CPU and GPU. Because it needs the CPU back for something more urgent, a GPU app that requires high CPU input can't be accommodated.

I repeated the experiment this morning, with proper logging in place - the output is complicated, but I'll try to demonstrate.

First running 4x AP - count of 0.5 on 2 GPUs, and requiring 0.5 CPU per task. That works as expected:

26/08/2013 11:11:00 |  | [cpu_sched_debug] schedule_cpus(): start
26/08/2013 11:11:00 | SETI@home | [cpu_sched_debug] scheduling ap_30mr08ab_B0_P1_00204_20130825_07727.wu_1 (coprocessor job, FIFO) (prio -1.979064)
26/08/2013 11:11:00 | SETI@home | [cpu_sched_debug] reserving 0.500000 of coproc NVIDIA
26/08/2013 11:11:00 | SETI@home | [cpu_sched_debug] scheduling ap_30mr08ab_B0_P1_00307_20130825_07727.wu_0 (coprocessor job, FIFO) (prio -1.999741)
26/08/2013 11:11:00 | SETI@home | [cpu_sched_debug] reserving 0.500000 of coproc NVIDIA
26/08/2013 11:11:00 | SETI@home | [cpu_sched_debug] scheduling ap_11mr08ad_B5_P0_00145_20130825_01524.wu_1 (coprocessor job, FIFO) (prio -2.020418)
26/08/2013 11:11:00 | SETI@home | [cpu_sched_debug] reserving 0.500000 of coproc NVIDIA
26/08/2013 11:11:00 | SETI@home | [cpu_sched_debug] scheduling ap_21ap08ac_B2_P0_00284_20130825_32234.wu_1 (coprocessor job, FIFO) (prio -2.041096)
26/08/2013 11:11:00 | SETI@home | [cpu_sched_debug] reserving 0.500000 of coproc NVIDIA
...
26/08/2013 11:11:00 |  | [cpu_sched_debug] preliminary job list:
26/08/2013 11:11:00 | SETI@home | [cpu_sched_debug] 0: ap_30mr08ab_B0_P1_00204_20130825_07727.wu_1 (MD: no; UTS: yes)
26/08/2013 11:11:00 | SETI@home | [cpu_sched_debug] 1: ap_30mr08ab_B0_P1_00307_20130825_07727.wu_0 (MD: no; UTS: yes)
26/08/2013 11:11:00 | SETI@home | [cpu_sched_debug] 2: ap_11mr08ad_B5_P0_00145_20130825_01524.wu_1 (MD: no; UTS: yes)
26/08/2013 11:11:00 | SETI@home | [cpu_sched_debug] 3: ap_21ap08ac_B2_P0_00284_20130825_32234.wu_1 (MD: no; UTS: yes)
...
26/08/2013 11:11:00 |  | [cpu_sched_debug] final job list:
26/08/2013 11:11:00 | SETI@home | [cpu_sched_debug] 0: ap_30mr08ab_B0_P1_00204_20130825_07727.wu_1 (MD: no; UTS: yes)
26/08/2013 11:11:00 | SETI@home | [cpu_sched_debug] 1: ap_30mr08ab_B0_P1_00307_20130825_07727.wu_0 (MD: no; UTS: yes)
26/08/2013 11:11:00 | SETI@home | [cpu_sched_debug] 2: ap_11mr08ad_B5_P0_00145_20130825_01524.wu_1 (MD: no; UTS: yes)
26/08/2013 11:11:00 | SETI@home | [cpu_sched_debug] 3: ap_21ap08ac_B2_P0_00284_20130825_32234.wu_1 (MD: no; UTS: yes)
...
26/08/2013 11:11:00 | SETI@home | [coproc] NVIDIA instance 1: confirming for ap_30mr08ab_B0_P1_00204_20130825_07727.wu_1
26/08/2013 11:11:00 | SETI@home | [coproc] NVIDIA instance 0: confirming for ap_30mr08ab_B0_P1_00307_20130825_07727.wu_0
26/08/2013 11:11:00 | SETI@home | [coproc] NVIDIA instance 1: confirming for ap_11mr08ad_B5_P0_00145_20130825_01524.wu_1
26/08/2013 11:11:00 | SETI@home | [coproc] NVIDIA instance 0: confirming for ap_21ap08ac_B2_P0_00284_20130825_32234.wu_1

But then I drove the CPUs into EDF by setting an absurdly high cache (minimum work buffer - 'connect every') level.

26/08/2013 11:12:04 |  | [cpu_sched_debug] schedule_cpus(): start
26/08/2013 11:12:04 | SETI@home | [cpu_sched_debug] scheduling ap_30mr08ab_B0_P1_00204_20130825_07727.wu_1 (coprocessor job, FIFO) (prio -1.979067)
26/08/2013 11:12:04 | SETI@home | [cpu_sched_debug] reserving 0.500000 of coproc NVIDIA
26/08/2013 11:12:04 | SETI@home | [cpu_sched_debug] scheduling ap_30mr08ab_B0_P1_00307_20130825_07727.wu_0 (coprocessor job, FIFO) (prio -1.999744)
26/08/2013 11:12:04 | SETI@home | [cpu_sched_debug] reserving 0.500000 of coproc NVIDIA
26/08/2013 11:12:04 | SETI@home | [cpu_sched_debug] scheduling ap_11mr08ad_B5_P0_00145_20130825_01524.wu_1 (coprocessor job, FIFO) (prio -2.020422)
26/08/2013 11:12:04 | SETI@home | [cpu_sched_debug] reserving 0.500000 of coproc NVIDIA
26/08/2013 11:12:04 | SETI@home | [cpu_sched_debug] scheduling ap_21ap08ac_B2_P0_00284_20130825_32234.wu_1 (coprocessor job, FIFO) (prio -2.041099)
26/08/2013 11:12:04 | SETI@home | [cpu_sched_debug] reserving 0.500000 of coproc NVIDIA
...
26/08/2013 11:12:04 | LHC@home 1.0 | [cpu_sched_debug] earliest deadline: 1378026746 sd_sixt15_540_2.8_4D_err__19__s__62.31_60.32__2_4__6__20_1_sixvf_boinc1534_0
26/08/2013 11:12:04 | LHC@home 1.0 | [cpu_sched_debug] scheduling sd_sixt15_540_2.8_4D_err__19__s__62.31_60.32__2_4__6__20_1_sixvf_boinc1534_0 (CPU job, EDF) (prio -0.020811)
26/08/2013 11:12:04 | LHC@home 1.0 | [cpu_sched_debug] earliest deadline: 1378050429 sd_sixt10_490_1.8_4D_err__49__s__62.31_60.32__2_4__6__10_1_sixvf_boinc3913_0
26/08/2013 11:12:04 | LHC@home 1.0 | [cpu_sched_debug] scheduling sd_sixt10_490_1.8_4D_err__49__s__62.31_60.32__2_4__6__10_1_sixvf_boinc3913_0 (CPU job, EDF) (prio -0.021123)
26/08/2013 11:12:04 | LHC@home 1.0 | [cpu_sched_debug] earliest deadline: 1378050429 sd_sixt10_490_1.8_4D_err__49__s__62.31_60.32__4_6__6__55_1_sixvf_boinc3939_0
26/08/2013 11:12:04 | LHC@home 1.0 | [cpu_sched_debug] scheduling sd_sixt10_490_1.8_4D_err__49__s__62.31_60.32__4_6__6__55_1_sixvf_boinc3939_0 (CPU job, EDF) (prio -0.021435)
26/08/2013 11:12:04 | NumberFields@home | [cpu_sched_debug] earliest deadline: 1378054516 wu_sf2_DS-24x12_Grp179495of284940_0
26/08/2013 11:12:04 | NumberFields@home | [cpu_sched_debug] scheduling wu_sf2_DS-24x12_Grp179495of284940_0 (CPU job, EDF) (prio -0.021055)
26/08/2013 11:12:04 | NumberFields@home | [cpu_sched_debug] earliest deadline: 1378054516 wu_12E10_SF120-2_Idx6_Grp15485of32887_0
26/08/2013 11:12:04 | NumberFields@home | [cpu_sched_debug] scheduling wu_12E10_SF120-2_Idx6_Grp15485of32887_0 (CPU job, EDF) (prio -0.021367)
26/08/2013 11:12:04 | NumberFields@home | [cpu_sched_debug] earliest deadline: 1378054516 wu_sf2_DS-24x12_Grp180055of284940_0
...
26/08/2013 11:12:04 |  | [cpu_sched_debug] preliminary job list:
26/08/2013 11:12:04 | SETI@home | [cpu_sched_debug] 0: ap_30mr08ab_B0_P1_00204_20130825_07727.wu_1 (MD: no; UTS: yes)
26/08/2013 11:12:04 | SETI@home | [cpu_sched_debug] 1: ap_30mr08ab_B0_P1_00307_20130825_07727.wu_0 (MD: no; UTS: yes)
26/08/2013 11:12:04 | SETI@home | [cpu_sched_debug] 2: ap_11mr08ad_B5_P0_00145_20130825_01524.wu_1 (MD: no; UTS: yes)
26/08/2013 11:12:04 | SETI@home | [cpu_sched_debug] 3: ap_21ap08ac_B2_P0_00284_20130825_32234.wu_1 (MD: no; UTS: yes)
26/08/2013 11:12:04 | LHC@home 1.0 | [cpu_sched_debug] 4: sd_sixt15_540_2.8_4D_err__19__s__62.31_60.32__2_4__6__20_1_sixvf_boinc1534_0 (MD: yes; UTS: yes)
26/08/2013 11:12:04 | LHC@home 1.0 | [cpu_sched_debug] 5: sd_sixt10_490_1.8_4D_err__49__s__62.31_60.32__2_4__6__10_1_sixvf_boinc3913_0 (MD: yes; UTS: yes)
26/08/2013 11:12:04 | LHC@home 1.0 | [cpu_sched_debug] 6: sd_sixt10_490_1.8_4D_err__49__s__62.31_60.32__4_6__6__55_1_sixvf_boinc3939_0 (MD: yes; UTS: no)
26/08/2013 11:12:04 | NumberFields@home | [cpu_sched_debug] 7: wu_sf2_DS-24x12_Grp179495of284940_0 (MD: yes; UTS: yes)
26/08/2013 11:12:04 | NumberFields@home | [cpu_sched_debug] 8: wu_12E10_SF120-2_Idx6_Grp15485of32887_0 (MD: yes; UTS: no)
26/08/2013 11:12:04 | NumberFields@home | [cpu_sched_debug] 9: wu_sf2_DS-24x12_Grp180055of284940_0 (MD: yes; UTS: no)
...
26/08/2013 11:12:04 | SETI@home | [coproc] NVIDIA instance 1: confirming for ap_30mr08ab_B0_P1_00204_20130825_07727.wu_1
26/08/2013 11:12:04 | SETI@home | [coproc] NVIDIA instance 0: confirming for ap_30mr08ab_B0_P1_00307_20130825_07727.wu_0
26/08/2013 11:12:04 | SETI@home | [coproc] NVIDIA instance 1: confirming for ap_11mr08ad_B5_P0_00145_20130825_01524.wu_1
26/08/2013 11:12:04 | SETI@home | [coproc] NVIDIA instance 0: confirming for ap_21ap08ac_B2_P0_00284_20130825_32234.wu_1
...
26/08/2013 11:12:05 |  | [cpu_sched_debug] final job list:
26/08/2013 11:12:05 | LHC@home 1.0 | [cpu_sched_debug] 0: sd_sixt15_540_2.8_4D_err__19__s__62.31_60.32__2_4__6__20_1_sixvf_boinc1534_0 (MD: yes; UTS: yes)
26/08/2013 11:12:05 | LHC@home 1.0 | [cpu_sched_debug] 1: sd_sixt10_490_1.8_4D_err__49__s__62.31_60.32__2_4__6__10_1_sixvf_boinc3913_0 (MD: yes; UTS: yes)
26/08/2013 11:12:05 | LHC@home 1.0 | [cpu_sched_debug] 2: sd_sixt10_490_1.8_4D_err__49__s__62.31_60.32__4_6__6__55_1_sixvf_boinc3939_0 (MD: yes; UTS: yes)
26/08/2013 11:12:05 | NumberFields@home | [cpu_sched_debug] 3: wu_sf2_DS-24x12_Grp179495of284940_0 (MD: yes; UTS: yes)
26/08/2013 11:12:05 | NumberFields@home | [cpu_sched_debug] 4: wu_12E10_SF120-2_Idx6_Grp15485of32887_0 (MD: yes; UTS: yes)
26/08/2013 11:12:05 | NumberFields@home | [cpu_sched_debug] 5: wu_sf2_DS-24x12_Grp180055of284940_0 (MD: yes; UTS: yes)
26/08/2013 11:12:05 | SETI@home | [cpu_sched_debug] 6: ap_30mr08ab_B0_P1_00204_20130825_07727.wu_1 (MD: no; UTS: yes)
26/08/2013 11:12:05 | SETI@home | [cpu_sched_debug] 7: ap_30mr08ab_B0_P1_00307_20130825_07727.wu_0 (MD: no; UTS: yes)
26/08/2013 11:12:05 | SETI@home | [cpu_sched_debug] 8: ap_11mr08ad_B5_P0_00145_20130825_01524.wu_1 (MD: no; UTS: no)
26/08/2013 11:12:05 | SETI@home | [cpu_sched_debug] 9: ap_21ap08ac_B2_P0_00284_20130825_32234.wu_1 (MD: no; UTS: no)
...
26/08/2013 11:12:05 | SETI@home | [cpu_sched_debug] scheduling ap_30mr08ab_B0_P1_00204_20130825_07727.wu_1
26/08/2013 11:12:05 | SETI@home | [cpu_sched_debug] scheduling ap_30mr08ab_B0_P1_00307_20130825_07727.wu_0
26/08/2013 11:12:05 | SETI@home | [cpu_sched_debug] skipping GPU job ap_11mr08ad_B5_P0_00145_20130825_01524.wu_1; CPU committed
26/08/2013 11:12:05 | SETI@home | [cpu_sched_debug] skipping GPU job ap_21ap08ac_B2_P0_00284_20130825_32234.wu_1; CPU committed
...
26/08/2013 11:12:05 | SETI@home | [coproc] NVIDIA instance 1: confirming for ap_30mr08ab_B0_P1_00204_20130825_07727.wu_1
26/08/2013 11:12:05 | SETI@home | [coproc] NVIDIA instance 0: confirming for ap_30mr08ab_B0_P1_00307_20130825_07727.wu_0
26/08/2013 11:12:05 | SETI@home | [coproc] Assigning 0.500000 of NVIDIA instance 0 to ap_11mr08ad_B5_P0_00145_20130825_01524.wu_1
26/08/2013 11:12:05 | SETI@home | [coproc] Assigning 0.500000 of NVIDIA instance 1 to ap_21ap08ac_B2_P0_00284_20130825_32234.wu_1

And there you have Cliff's problem. The CPU tasks have been moved up from priority 4 - 9 to priority 0 - 5, and they have grabbed all the available CPUs before the lower-priority GPU jobs get a chance. Clear the EDF, and the GPU tasks get allocated first, and get first pick of the CPUs.

I tried again, reserving more cores via CPU %usage and reducing the CPU demand for AP tasks to 0.24:

26/08/2013 11:22:47 |  | [cpu_sched_debug] final job list:
26/08/2013 11:22:47 | LHC@home 1.0 | [cpu_sched_debug] 0: sd_sixt15_540_2.8_4D_err__19__s__62.31_60.32__2_4__6__20_1_sixvf_boinc1534_0 (MD: yes; UTS: yes)
26/08/2013 11:22:47 | LHC@home 1.0 | [cpu_sched_debug] 1: sd_sixt10_490_1.8_4D_err__49__s__62.31_60.32__2_4__6__10_1_sixvf_boinc3913_0 (MD: yes; UTS: yes)
26/08/2013 11:22:47 | LHC@home 1.0 | [cpu_sched_debug] 2: sd_sixt10_490_1.8_4D_err__49__s__62.31_60.32__4_6__6__55_1_sixvf_boinc3939_0 (MD: yes; UTS: yes)
26/08/2013 11:22:47 | NumberFields@home | [cpu_sched_debug] 3: wu_sf2_DS-24x12_Grp179495of284940_0 (MD: yes; UTS: yes)
26/08/2013 11:22:47 | SETI@home | [cpu_sched_debug] 4: ap_30mr08ab_B0_P1_00204_20130825_07727.wu_1 (MD: no; UTS: yes)
26/08/2013 11:22:47 | SETI@home | [cpu_sched_debug] 5: ap_30mr08ab_B0_P1_00307_20130825_07727.wu_0 (MD: no; UTS: yes)
26/08/2013 11:22:47 | SETI@home | [cpu_sched_debug] 6: ap_11mr08ad_B5_P0_00145_20130825_01524.wu_1 (MD: no; UTS: yes)
26/08/2013 11:22:47 | SETI@home | [cpu_sched_debug] 7: ap_21ap08ac_B2_P0_00284_20130825_32234.wu_1 (MD: no; UTS: yes)
...
26/08/2013 11:22:47 | SETI@home | [cpu_sched_debug] scheduling ap_30mr08ab_B0_P1_00204_20130825_07727.wu_1
26/08/2013 11:22:47 | SETI@home | [cpu_sched_debug] scheduling ap_30mr08ab_B0_P1_00307_20130825_07727.wu_0
26/08/2013 11:22:47 | SETI@home | [cpu_sched_debug] scheduling ap_11mr08ad_B5_P0_00145_20130825_01524.wu_1
26/08/2013 11:22:47 | SETI@home | [cpu_sched_debug] scheduling ap_21ap08ac_B2_P0_00284_20130825_32234.wu_1
...
26/08/2013 11:22:47 | SETI@home | [coproc] NVIDIA instance 1: confirming for ap_30mr08ab_B0_P1_00204_20130825_07727.wu_1
26/08/2013 11:22:47 | SETI@home | [coproc] NVIDIA instance 0: confirming for ap_30mr08ab_B0_P1_00307_20130825_07727.wu_0
26/08/2013 11:22:47 | SETI@home | [coproc] NVIDIA instance 0: confirming for ap_11mr08ad_B5_P0_00145_20130825_01524.wu_1
26/08/2013 11:22:47 | SETI@home | [coproc] NVIDIA instance 1: confirming for ap_21ap08ac_B2_P0_00284_20130825_32234.wu_1

All four GPU tasks ran, even though the CPUs were again scheduled first in EDF.

That worked because the rule is: "in EDF, CPU tasks come first. GPUs can have extra CPU resources, up to a maximum of 1.00 CPU". 2x0.5 takes you to that limit: 4x0.24 is still less than one, and allowed.
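In pseudo-code terms, the rule works out like this (a sketch of the behaviour seen in the logs above, not the actual client source - the 1.00-CPU budget is taken from the quoted rule):

```python
def runnable_gpu_tasks(cpu_fractions, budget=1.0):
    """How many GPU tasks get started when all CPUs are committed to
    EDF work: tasks are admitted in order while their declared CPU
    reservations still fit inside the 1.00-CPU budget."""
    total = 0.0
    started = 0
    for frac in cpu_fractions:
        if total + frac <= budget:
            total += frac
            started += 1
    return started

# 4 AP tasks at 0.5 CPU each: only the first two fit (2 x 0.5 = 1.0).
print(runnable_gpu_tasks([0.5, 0.5, 0.5, 0.5]))      # 2
# The same 4 tasks at 0.24 CPU each all fit (4 x 0.24 = 0.96 < 1.0).
print(runnable_gpu_tasks([0.24, 0.24, 0.24, 0.24]))  # 4
```

That reproduces both log extracts: two of four 0.5-CPU AP tasks skipped, all four 0.24-CPU tasks scheduled.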

My trouble is that I'd like to allow CUDA MB tasks to be scheduled in the mix alongside OpenCL AP tasks - not least because I'm about to run out of AP tasks again. CUDA doesn't need such heavy CPU support, so I'd like to use an extra CPU or two for other projects while MB is running, and claw them back when AP needs the helping hand. So I'll be experimenting with different CPU usage values in app_config, and I'll let you know how I get on. I still need to track down a couple of EXIT_TIME_LIMIT_EXCEEDED errors from overnight and see what brought those on first.
ID: 1407911
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1407945 - Posted: 26 Aug 2013, 14:15:12 UTC - in response to Message 1407894.  
Last modified: 26 Aug 2013, 14:31:01 UTC

Juan -

Looking at your AP tasks, the times seem to be in line with mine. With only 1 task working I would have thought the times would be greatly reduced, but comparing similar times it seems that my running 2x per GPU is approximately the same as yours.

I run 2 WUs at a time too, 1 AP + 1 MB, so the times must be very similar. But don't forget, even if my "smallest GPU" is a GTX670FTW, almost all of my hosts run with slow i5 CPUs (only 4 cores available), even the ones with 2xGPU (590/690), and they run at stock speeds (I have no AC and use no OC - Brazil is a hot tropical country, and even if I live in a not-so-hot place, 30C or more is normal here), so they are really slower than I'd like. On the other hand, they can't run 2 AP on the same GPU even if the GPU has the capacity; there are simply no free cores to keep the GPU fed.

That's one of the main reasons I don't like to run AP: my hosts are fairly superior when they crunch MB (more WUs per hour) than when they crunch AP, but the CreditNew problem messes with all the balance. So for now I try to run all I can, and that is 1 AP + 1 MB... but I know my hosts work better (do a lot more science) when they crunch MB only.

Somebody could ask me why I use almost only slow i5s with big GPUs. The answer is easy: here the hardware is so expensive due to the taxes that spending more than U$400 on a single slow i7 is out of the question, and we don't run a heavily CPU-intensive app here. And when you crunch MB only (as I did in the past), CPU power matters very little compared with the GPU power itself (I don't even normally crunch on the CPUs), and I don't expect to continue to crunch AP in the future.

BTW: The explanation of why your 2 GPUs did not start their tasks is very interesting, and you gave us the opportunity to learn something new (or at least different, as you can see from the thread). I have never encountered a situation like yours on my 2xGPU hosts, and I have a few. So it is good to know about that apparent limitation in the BOINC scheduler.

@Richard: I asked about the same question as your last paragraph a long time ago, and the answer I received on the forum was something like: BOINC has no support for that; it looks like you need to ask for that feature on the BOINC wish list. If you find a way, I'm sure you will share it with us. This function is very important when you run more than one project at a time.
ID: 1407945
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1407959 - Posted: 26 Aug 2013, 15:36:44 UTC - in response to Message 1407897.  


Ah - if you just meant temporarily to eliminate confusion and possible side effects from a forgotten file while we diagnosed, I'd agree with you.

Sure.

But you made such an assertive statement (without explanation) that I thought you meant that people shouldn't use it at all. Personally, I find it one of the most useful recent innovations - though it's incomplete, and still shows the rough edges of hasty coding.

I use it too and have found it very useful (when everything is configured right). :)

ID: 1407959
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1407961 - Posted: 26 Aug 2013, 15:40:09 UTC - in response to Message 1407911.  

EDIT: And though the problem for this particular host seems to be solved, there is a problem in BOINC task management, IMO. Blocking the GPU completely when the CPU goes into panic mode seems like the wrong decision, particularly because the reasons the CPU goes into panic are too often not justified.

But I don't think that's a fair description. BOINC isn't blocking the GPUs per se: it's blocking an application which needs both CPU and GPU. Because it needs the CPU back for something more urgent, a GPU app that requires high CPU input can't be accommodated.


Well, we all know that ANY device in a PC requires some CPU fraction to handle its interrupts, at least - even the keyboard and mouse ;)
So the question is, "what is high"? If a GPU app uses ~5% of a CPU and provides several times more processing power than a full CPU core, is that "high" or not? For example, the CUDA app uses about that much with proper drivers.
ID: 1407961
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1407966 - Posted: 26 Aug 2013, 15:47:50 UTC - in response to Message 1407961.  

EDIT: And though the problem for this particular host seems to be solved, there is a problem in BOINC task management, IMO. Blocking the GPU completely when the CPU goes into panic mode seems like the wrong decision, particularly because the reasons the CPU goes into panic are too often not justified.

But I don't think that's a fair description. BOINC isn't blocking the GPUs per se: it's blocking an application which needs both CPU and GPU. Because it needs the CPU back for something more urgent, a GPU app that requires high CPU input can't be accommodated.

Well, we all know that ANY device in a PC requires some CPU fraction to handle its interrupts, at least - even the keyboard and mouse ;)
So the question is, "what is high"? If a GPU app uses ~5% of a CPU and provides several times more processing power than a full CPU core, is that "high" or not? For example, the CUDA app uses about that much with proper drivers.

In this case, 'high' is what we declare it to be, through app_info or app_config. BOINC doesn't (yet) actually measure CPU usage and dynamically feed it back into the scheduler; nor does it measure the difference between 'light' CPU usage, spinning to be ready for the next kernel launch, and 'heavy' CPU usage, with a heavy load on the FPU to carry out the (variable amount of) radar blanking.

You could try pitching that one to David... ;-)
ID: 1407966
tbret
Volunteer tester
Joined: 28 May 99
Posts: 3380
Credit: 296,162,071
RAC: 40
United States
Message 1407985 - Posted: 26 Aug 2013, 16:50:11 UTC - in response to Message 1407894.  



Why is it that people with AMD machines desire to fill them with NVidia GPUs and vice versa?



The reason I run the NVIDIA cards is because they have been better at SETI@Home using CUDA apps than the AMD cards have been under OpenCL.

With the new applications, I am not sure that is still true.

With the CUDA applications it made little difference if you were running AMD or Intel CPUs. Spending extra money on a CPU was a waste for SETI@Home.

In my *opinion* the useful work done by a CPU per watt is so inefficient compared with the GPU that it did not make sense *for me* to run CPU work (a situation made worse by choosing AMD CPUs).

So, I settled into the habit of buying cheap CPUs so I could spend the money on GPUs.

The unexpected situation with the new applications is a need for more free cores per instance of AP. I should have bought more cores in my cheap CPUs.

I suspected you were out of CPU resources and that was preventing your GPU work units from starting because I now have that problem with AP. In some of my systems I run out of CPU before I run out of GPU because I bought four-core CPUs. I should have been buying six or eight core CPUs just to support the OpenCL AP GPU applications' "using up" a core with each instance.

The downside to having as many different configurations as I do is that experimentation takes a lot of time. The limited experiments I have done seem to be telling me that if I force more AP work, I slow everything down. The jury is still out, but it looks that way.

Please notice that I'm not even attempting to tell you what the best situation is.
ID: 1407985
tbret
Volunteer tester
Joined: 28 May 99
Posts: 3380
Credit: 296,162,071
RAC: 40
United States
Message 1407988 - Posted: 26 Aug 2013, 16:59:13 UTC - in response to Message 1407966.  



In this case, 'high' is what we declare it to be, through app_info or app_config. BOINC doesn't (yet) actually measure CPU usage and dynamically feed it back in to the scheduler: nor does it measure the difference between 'light' CPU usage, spinning to be ready for the next kernel launch, and 'heavy' CPU usage, with heavy load on the FPU to carry out the (variable amount of) radar blanking.



Exactly.

As a "clueless user" I have to assume that a core of the CPU is going to be busy doing real work. Task Manager shows me that the core is busy even if it is only busy twiddling its thumbs.

The fact is that it *might* become fully occupied and if I over-burden it I can bring a system to its knees.

It "looks" busy all the time, but often has a lot of spare capacity. If I set up for that spare capacity and it really becomes busy, I can make a 30-minute AP work unit take several hours.

That's probably just a situation I bring on myself by having marginal CPU resources available.
ID: 1407988
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1408049 - Posted: 26 Aug 2013, 19:15:28 UTC - in response to Message 1407966.  


In this case, 'high' is what we declare it to be, through app_info or app_config.

Really? How?
Are you saying that if I set

<avg_ncpus>0.04</avg_ncpus>
<max_ncpus>0.04</max_ncpus>

for example, BOINC will not suspend GPU tasks in its panic mode? Did you really test this?


ID: 1408049
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1408076 - Posted: 26 Aug 2013, 19:43:49 UTC - in response to Message 1408049.  
Last modified: 26 Aug 2013, 19:50:46 UTC


In this case, 'high' is what we declare it to be, through app_info or app_config.

Really? How?
Are you saying that if I set

<avg_ncpus>0.04</avg_ncpus>
<max_ncpus>0.04</max_ncpus>

for example, BOINC will not suspend GPU tasks in its panic mode? Did you really test this?

Yes, I tested it and posted the log in this thread, earlier today. [final block of log extracts in message 1407911]

You only have to modify <avg_ncpus>: <max_ncpus> is "not currently used for anything", according to http://boinc.berkeley.edu/trac/wiki/PlanClassFunc. Alternatively, you can use <cpu_usage> in app_config - that maps to the same internal parameter in the client data structure.

Add up all the fractional CPUs that would be defined for all the GPU apps that are running at the same time. If the total is less than one (or possibly equal, but let's play safe), then CPU starvation will not block a GPU task from running, even if all permitted CPUs are occupied by EDF tasks.
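For example (a sketch only, with illustrative values - not a tested config): four concurrent MB tasks declared at <cpu_usage>.04</cpu_usage> each reserve just 0.16 CPUs in total, comfortably under the limit, so none of them should be blocked:

```xml
<app_config>
    <app>
        <name>setiathome_v7</name>
        <gpu_versions>
            <gpu_usage>.25</gpu_usage>
            <cpu_usage>.04</cpu_usage>
        </gpu_versions>
    </app>
</app_config>
```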

Edit - although EDF works well, I prefer not to invoke it, by keeping a modest cache size setting. Some projects like to get their work back quickly.
ID: 1408076
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1408101 - Posted: 26 Aug 2013, 21:19:35 UTC - in response to Message 1408076.  
Last modified: 26 Aug 2013, 21:19:51 UTC

Fine, then this works as it should, no bugs here.
ID: 1408101