No GPU work on one machine


Starman
Joined: 15 May 99
Posts: 134
Credit: 36,077,193
RAC: 56,725
Canada
Message 1177084 - Posted: 10 Dec 2011, 15:53:22 UTC

I haven't been able to get any GPU work on one of my machines (5360487) for at least 2-3 weeks now, even though everything seems to be running smoothly again. My other machines are all getting what they want now, except for this one. It has the max 100 CPU WUs but just can't get any GPU work. Any ideas?

Thanks

Brett
____________

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4072
Credit: 32,911,093
RAC: 7,780
United Kingdom
Message 1177097 - Posted: 10 Dec 2011, 17:10:35 UTC - in response to Message 1177084.
Last modified: 10 Dec 2011, 17:57:08 UTC

I haven't been able to get any GPU work on one of my machines (5360487), for at least 2-3 weeks now. Even though everything seems to be running smoothly again. My other machines are all getting what they want now, except for this one. It has the max 100 CPU WU's but just can't get any GPU work. Any ideas.

Thanks

Brett
First, update that host to BOINC 6.10.60, and no later. (BOINC 6.10.18 is just a bit too buggy; I think it had a libcurl DNS bug.)

Then move that host to a different venue from the others (home, school or work), set up a new set of project preferences for that venue, disable CPU work fetch by setting "Use CPU" to no, and update that host.
Now it can only do ATI GPU work fetches. When it has enough ATI GPU work, enable CPU work fetch again. (BOINC 6.10.18 and earlier also had a bug where preferences across venues didn't work correctly.)

Edit: I've now noticed you have an HD 2600 on that host and are running hybrid Astropulse. Is it really worth it? Do you get any speedup over the r409 CPU app? I certainly didn't get a speedup with my overclocked E8500 & HD5770.

Claggy

Team kizb
Joined: 8 Mar 01
Posts: 219
Credit: 3,709,162
RAC: 0
Germany
Message 1177112 - Posted: 10 Dec 2011, 18:20:39 UTC - in response to Message 1177097.

I'm having a similar issue. Blue has been crunching WUs just fine, but for some reason Green keeps saying the project has no new work.
____________
My Computers:
Blue Offline
Green Offline
Red Offline

Team kizb
Joined: 8 Mar 01
Posts: 219
Credit: 3,709,162
RAC: 0
Germany
Message 1177115 - Posted: 10 Dec 2011, 18:43:54 UTC - in response to Message 1177112.

It seems the workaround described above is working. Interesting. Any idea why it doesn't like to download GPU tasks when CPU is enabled?
____________
My Computers:
Blue Offline
Green Offline
Red Offline

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4072
Credit: 32,911,093
RAC: 7,780
United Kingdom
Message 1177119 - Posted: 10 Dec 2011, 18:54:44 UTC - in response to Message 1177115.

Seems as the above talked about work around is working, interesting. Any idea why it doesn't like to download GPU tasks when CPU is enabled?

The scheduler probably thinks the CPU is faster than the GPU, so it sends work to the CPU first. Why? I don't know, but probably because of the dodgy DCF (Duration Correction Factor).

Claggy

Kevin Olley
Joined: 3 Aug 99
Posts: 368
Credit: 35,235,790
RAC: 1,606
United Kingdom
Message 1177129 - Posted: 10 Dec 2011, 19:58:35 UTC - in response to Message 1177119.

Seems as the above talked about work around is working, interesting. Any idea why it doesn't like to download GPU tasks when CPU is enabled?

The scheduler probably thinks the CPU is faster than the GPU, so sends work to that first, why?, don't know, but probably because of the Dodgy DCF,

Claggy


I find at times that the "to completion" times for CPU WUs are shorter than the times for the same type of WU on GPU.


____________
Kevin


Starman
Joined: 15 May 99
Posts: 134
Credit: 36,077,193
RAC: 56,725
Canada
Message 1177170 - Posted: 11 Dec 2011, 0:44:16 UTC

Before I started running the GPU app, my RAC was around 1,000. With the GPU running it went up to 1,700, so a 70% improvement is worth it. I'll try the fixes and see what happens.
____________

Josef W. Segur
Project donor
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4230
Credit: 1,043,255
RAC: 311
United States
Message 1177180 - Posted: 11 Dec 2011, 2:50:52 UTC - in response to Message 1177129.

Kevin Olley wrote:
I find at times that the "to completion" times on CPU WU's is shorter than the times for the same type of WU on GPU.

It may be the VLARs you rescheduled onto GPU which have made the servers think your CPU can do 63 GFLOPS on MB tasks. The servers don't know those ~3800 second runtimes were done on GPU so they affect the CPU average.
Joe
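
Joe's point can be illustrated with a little arithmetic. The sketch below is a simplified model, not the server's actual averaging code: it assumes the processing rate is just the estimated floating-point operations divided by the elapsed time. The task size (`FPOPS_PER_VLAR`) and the true CPU runtime (`CPU_RUNTIME`) are hypothetical values picked for illustration; only the ~3800 second runtime comes from the post.

```python
def apparent_gflops(fpops_per_task, elapsed_seconds):
    """Processing rate implied by a set of equally sized tasks, in GFLOPS."""
    total_fpops = fpops_per_task * len(elapsed_seconds)
    return total_fpops / sum(elapsed_seconds) / 1e9


FPOPS_PER_VLAR = 2.4e14  # hypothetical estimated operations per VLAR task
CPU_RUNTIME = 16000.0    # hypothetical seconds the task would take on the CPU
GPU_RUNTIME = 3800.0     # seconds it took after being rescheduled to the GPU

# Tasks that really ran on the CPU give a modest rate.
honest = apparent_gflops(FPOPS_PER_VLAR, [CPU_RUNTIME] * 10)

# The same tasks run on the GPU, but still reported as CPU work,
# make the CPU look several times faster than it is.
inflated = apparent_gflops(FPOPS_PER_VLAR, [GPU_RUNTIME] * 10)

print(f"honest CPU rate:   {honest:.1f} GFLOPS")
print(f"inflated CPU rate: {inflated:.1f} GFLOPS")
```

With these assumed numbers the honest rate is about 15 GFLOPS and the inflated one is about 63 GFLOPS, which is the kind of gap discussed in this thread.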

Kevin Olley
Joined: 3 Aug 99
Posts: 368
Credit: 35,235,790
RAC: 1,606
United Kingdom
Message 1177226 - Posted: 11 Dec 2011, 10:54:58 UTC - in response to Message 1177180.

Kevin Olley wrote:
I find at times that the "to completion" times on CPU WU's is shorter than the times for the same type of WU on GPU.

It may be the VLARs you rescheduled onto GPU which have made the servers think your CPU can do 63 GFLOPS on MB tasks. The servers don't know those ~3800 second runtimes were done on GPU so they affect the CPU average.
Joe


I doubt it. I was seeing this long before I had any reason to reschedule any WUs from CPU to GPU. The only reason I reschedule is a lack of GPU WUs.

I have been having DCF problems, or problems with "to completion" times affecting cache size.

Too many shorties on GPU reduce DCF to a level where, if it were not for the CPU limits in place, I would have been flooded with CPU WUs.

Too many shorties or regular WUs on CPU while the GPU was processing regular WUs would increase DCF and the "to completion" times to the point that it affected my GPU cache or BOINC: the GPUs, when their cache was at the limits, would be forced into high priority mode until DCF came back down to a more normal level.

This can be controlled to a certain extent with APs. With careful use of the suspend button you can force an AP to start; you then unsuspend the suspended units and the AP is next in line. If no APs are available, I have a couple of CPND WUs that I use.

I use BoincTasks on this machine, and one of the things it shows is the total estimated completion time for the cache. The way the estimated runtime of the GPU cache jumps when a VLAR completes on the CPU is horrifying (18 days to 250+ days is not unusual; divide by 9).




____________
Kevin


Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4072
Credit: 32,911,093
RAC: 7,780
United Kingdom
Message 1177227 - Posted: 11 Dec 2011, 11:12:55 UTC - in response to Message 1177129.

Seems as the above talked about work around is working, interesting. Any idea why it doesn't like to download GPU tasks when CPU is enabled?

The scheduler probably thinks the CPU is faster than the GPU, so sends work to that first, why?, don't know, but probably because of the Dodgy DCF,

Claggy


I find at times that the "to completion" times on CPU WU's is shorter than the times for the same type of WU on GPU.


Do you have <flops> values in your app_info?

Claggy

Kevin Olley
Joined: 3 Aug 99
Posts: 368
Credit: 35,235,790
RAC: 1,606
United Kingdom
Message 1177230 - Posted: 11 Dec 2011, 12:06:39 UTC - in response to Message 1177227.

Seems as the above talked about work around is working, interesting. Any idea why it doesn't like to download GPU tasks when CPU is enabled?

The scheduler probably thinks the CPU is faster than the GPU, so sends work to that first, why?, don't know, but probably because of the Dodgy DCF,

Claggy


I find at times that the "to completion" times on CPU WU's is shorter than the times for the same type of WU on GPU.


Do you have <flops> values in your app_info?

Claggy



No, the only alteration I have made is to the count. To be honest, calculating flops and making more advanced modifications to app_info or similar files is above my comfort level.

I keep 3 copies of app_info.txt (count 1, 0.5, 0.33) and copy and paste the right one into place when I need it. I know it's cheating, but it works for me.

My "programming" abilities are very low, so I have a bad habit of finding other ways that seem to work for me.
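
For anyone who would rather script the count change than keep three copies, here is a minimal sketch. It assumes the GPU section of app_info carries a `<coproc>` block with a `<count>` child; the `SAMPLE` fragment and the `set_gpu_count` helper are hypothetical and are not taken from anyone's real file. A count of 0.5 would ask for two tasks per GPU.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down app_info fragment used only for this example.
SAMPLE = """<app_info>
  <app_version>
    <app_name>setiathome_enhanced</app_name>
    <version_num>610</version_num>
    <plan_class>cuda_fermi</plan_class>
    <coproc>
      <type>CUDA</type>
      <count>1</count>
    </coproc>
  </app_version>
</app_info>"""


def set_gpu_count(root, count):
    """Set every <coproc><count> value to `count`; return how many changed."""
    changed = 0
    for node in root.findall(".//coproc/count"):
        node.text = count
        changed += 1
    return changed


root = ET.fromstring(SAMPLE)
changed = set_gpu_count(root, "0.5")
print(changed, root.find(".//coproc/count").text)
```

On a real file you would parse and write the file itself instead of `SAMPLE`, and keep a backup copy first.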




____________
Kevin


arkayn
Project donor
Volunteer tester
Joined: 14 May 99
Posts: 3624
Credit: 48,552,009
RAC: 27,058
United States
Message 1177346 - Posted: 11 Dec 2011, 22:42:51 UTC - in response to Message 1177230.


Do you have <flops> values in your app_info?

Claggy



No the only alteration I have made is to the count, To be honest calculating flops and more advanced modifications to app_info or similar type files is above my comfort level.

I keep 3 copys of app_info.txt (count 1, 0.5, 0.33) and copy and paste in the right format when I need them. I know its cheating but it works for me.

My "programing" abilities are very low, so I have a bad habit of finding other ways that seem to work for me.


Flops are fairly easy to calculate, as the needed numbers are in the application details for the machine in question.

For example on the SETI@home Enhanced (anonymous platform, nvidia GPU)
It shows 156.60088862089

All you have to do is append e09 to it (the figure shown is in GFLOPS) and you get
<flops>156.60088862089e09</flops>

Your CPU is
<flops>62.12696862414e09</flops>

and AP is
<flops>32.098248233676e09</flops>
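
The conversion above can be written as a tiny helper. This is only a sketch: `flops_tag` is a hypothetical name, and it assumes the figure copied from the application details page is in GFLOPS, so appending e09 multiplies it by 10^9.

```python
def flops_tag(apr_gflops):
    """Build a <flops> line from a GFLOPS figure given as text."""
    text = apr_gflops.strip()
    if float(text) <= 0:  # float() also rejects text that is not a number
        raise ValueError("the processing rate must be positive")
    # Keep the digits exactly as copied and append the exponent.
    return f"<flops>{text}e09</flops>"


print(flops_tag("156.60088862089"))
print(flops_tag("32.098248233676"))
```

The value is passed as text so the digits are reproduced exactly as they appear on the page.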
____________

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4072
Credit: 32,911,093
RAC: 7,780
United Kingdom
Message 1177354 - Posted: 11 Dec 2011, 23:16:21 UTC - in response to Message 1177346.
Last modified: 11 Dec 2011, 23:17:08 UTC


Do you have <flops> values in your app_info?

Claggy



No the only alteration I have made is to the count, To be honest calculating flops and more advanced modifications to app_info or similar type files is above my comfort level.

I keep 3 copys of app_info.txt (count 1, 0.5, 0.33) and copy and paste in the right format when I need them. I know its cheating but it works for me.

My "programing" abilities are very low, so I have a bad habit of finding other ways that seem to work for me.


Flops are fairly easy to calculate as the needed numbers are in your applications details for the machine in question.

For example on the SETI@home Enhanced (anonymous platform, nvidia GPU)
It shows 156.60088862089

All you have to do is add a e09 to it and you get
<flops>156.60088862089e09</flops>

Your CPU is
<flops>62.12696862414e09</flops>

and AP is
<flops>32.098248233676e09</flops>


I wouldn't use the CPU MB flops value: CPU work has been rebranded to the GPU, so the APR of the CPU has been inflated. I'd use a figure of around 15e09 instead of 62.12e09.

Claggy

Kevin Olley
Joined: 3 Aug 99
Posts: 368
Credit: 35,235,790
RAC: 1,606
United Kingdom
Message 1177366 - Posted: 12 Dec 2011, 0:17:16 UTC - in response to Message 1177354.


Flops are fairly easy to calculate as the needed numbers are in your applications details for the machine in question.

For example on the SETI@home Enhanced (anonymous platform, nvidia GPU)
It shows 156.60088862089

All you have to do is add a e09 to it and you get
<flops>156.60088862089e09</flops>

Your CPU is
<flops>62.12696862414e09</flops>

and AP is
<flops>32.098248233676e09</flops>


I wouldn't use the CPU MB flops value as CPU work has been rebranded to the GPU, and the APR of the CPU has been inflated, I'd use a figure of around 15e09 instead of 62.12e09

Claggy


OK, I can see where the figures are coming from and the potential problems that could be caused by my prior actions, but I'm sorry, I am still lost as to where in the app_info file I have to put this information.

I apologize for being a pain, but my computing skills (software, programming etc.) are extremely limited.


____________
Kevin


arkayn
Project donor
Volunteer tester
Joined: 14 May 99
Posts: 3624
Credit: 48,552,009
RAC: 27,058
United States
Message 1177374 - Posted: 12 Dec 2011, 1:00:09 UTC

I stick mine right under the version number.

<app_info>
    <app>
        <name>setiathome_enhanced</name>
    </app>
    <file_info>
        <name>AK_v8b2_win_SSE3_AMD.exe</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>setiathome_enhanced</app_name>
        <version_num>603</version_num>
        <flops>15.812442866308e09</flops>
        <platform>windows_intelx86</platform>
        <file_ref>
            <file_name>AK_v8b2_win_SSE3_AMD.exe</file_name>
            <main_program/>
        </file_ref>
    </app_version>
    <app_version>
        <app_name>setiathome_enhanced</app_name>
        <version_num>603</version_num>
        <flops>15.812442866308e09</flops>
        <platform>windows_x86_64</platform>
        <file_ref>
            <file_name>AK_v8b2_win_SSE3_AMD.exe</file_name>
            <main_program/>
        </file_ref>
    </app_version>
    <app>
        <name>setiathome_enhanced</name>
    </app>
    <file_info>
        <name>Lunatics_x41g_win32_cuda41rc1.exe</name>
        <executable/>
    </file_info>
    <file_info>
        <name>cudart32_41_15.dll</name>
        <executable/>
    </file_info>
    <file_info>
        <name>cufft32_41_15.dll</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>setiathome_enhanced</app_name>
        <version_num>610</version_num>
        <flops>203.99134720153e09</flops>
        <platform>windows_intelx86</platform>
        <plan_class>cuda_fermi</plan_class>
        <avg_ncpus>0.040000</avg_ncpus>
        <max_ncpus>0.040000</max_ncpus>
        <coproc>


You will need to add those numbers to all 6 occurrences of the GPU part.
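
If editing every occurrence by hand feels error-prone, the placement described above (a `<flops>` line right under `<version_num>` in each `<app_version>`) can be sketched in a few lines of Python. Everything here is illustrative: `SAMPLE` is a cut-down hypothetical file, `set_flops` is a made-up helper, and treating "has a `<plan_class>`" as "is a GPU version" is an assumption that happens to match the excerpt above. The two flops values are simply figures quoted in this thread, not recommendations for any particular host.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down app_info with one CPU and one GPU app_version.
SAMPLE = """<app_info>
  <app><name>setiathome_enhanced</name></app>
  <app_version>
    <app_name>setiathome_enhanced</app_name>
    <version_num>603</version_num>
    <platform>windows_intelx86</platform>
  </app_version>
  <app_version>
    <app_name>setiathome_enhanced</app_name>
    <version_num>610</version_num>
    <platform>windows_intelx86</platform>
    <plan_class>cuda_fermi</plan_class>
  </app_version>
</app_info>"""


def set_flops(root, cpu_flops, gpu_flops):
    """Add or update <flops> in every <app_version>; return how many were set."""
    changed = 0
    for version in root.iter("app_version"):
        # Assumption: only GPU versions carry a <plan_class>.
        is_gpu = version.find("plan_class") is not None
        flops = version.find("flops")
        if flops is None:
            flops = ET.Element("flops")
            children = list(version)
            version_num = version.find("version_num")
            # Place the new element right under <version_num> when present.
            if version_num is not None:
                index = children.index(version_num) + 1
            else:
                index = len(children)
            version.insert(index, flops)
        flops.text = gpu_flops if is_gpu else cpu_flops
        changed += 1
    return changed


root = ET.fromstring(SAMPLE)
count = set_flops(root, cpu_flops="15e09", gpu_flops="203.99134720153e09")
print(ET.tostring(root, encoding="unicode"))
```

Work on a copy of the real file and stop BOINC before replacing it.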
____________

Kevin Olley
Joined: 3 Aug 99
Posts: 368
Credit: 35,235,790
RAC: 1,606
United Kingdom
Message 1177375 - Posted: 12 Dec 2011, 1:09:55 UTC - in response to Message 1177374.

I stick mine right under the version number.

You will need to add those numbers to all 6 occurrences of the GPU part.


Thank you

I shall stop rescheduling and let the numbers stabilise and then give it a try.



____________
Kevin


Josef W. Segur
Project donor
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4230
Credit: 1,043,255
RAC: 311
United States
Message 1177396 - Posted: 12 Dec 2011, 5:05:58 UTC - in response to Message 1177375.

I stick mine right under the version number.

You will need to add those numbers to all 6 occurrences of the GPU part.


Thank you

I shall stop rescheduling and let the numbers stabilise and then give it a try.

Just before you add the <flops> entries, I suggest you use the rescheduler to protect against -177 errors for maximum elapsed time exceeded on cached tasks. The expert tab has a setting called something like "limit rsc_fpops_bound" which provides that protection when you do a reschedule, even if the other settings mean that no tasks are actually rescheduled. That's as much as I know, I hope someone with practical experience will provide a clearer explanation.

Those cached tasks will have scaled down rsc_fpops_est which is used to calculate estimated time, and rsc_fpops_bound which is used to calculate the time limit. The big increase in flops will make the times much shorter, but DCF will soon increase to fix the estimates. Unfortunately DCF isn't used for the limit.
Joe
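
The relationship Joe describes can be sketched numerically. This is a simplified model under stated assumptions: the estimate is taken as rsc_fpops_est / flops scaled by DCF, and the limit as rsc_fpops_bound / flops with no DCF. All the numbers below are hypothetical and chosen only to make the effect visible.

```python
def estimated_seconds(rsc_fpops_est, flops, dcf):
    """Estimated runtime: operation estimate over speed, scaled by DCF."""
    return rsc_fpops_est / flops * dcf


def limit_seconds(rsc_fpops_bound, flops):
    """Maximum allowed elapsed time: note that DCF plays no part here."""
    return rsc_fpops_bound / flops


RSC_FPOPS_EST = 3.0e13    # hypothetical scaled-down estimate on a cached task
RSC_FPOPS_BOUND = 3.0e14  # hypothetical bound, ten times the estimate
OLD_FLOPS = 2.0e9         # hypothetical speed assumed before adding <flops>
NEW_FLOPS = 2.0e11        # hypothetical <flops> value added to app_info
ACTUAL_RUNTIME = 2000.0   # hypothetical real runtime of the task in seconds

old_limit = limit_seconds(RSC_FPOPS_BOUND, OLD_FLOPS)
new_limit = limit_seconds(RSC_FPOPS_BOUND, NEW_FLOPS)

# DCF can grow until the estimate matches reality again...
raw_estimate = estimated_seconds(RSC_FPOPS_EST, NEW_FLOPS, dcf=1.0)
dcf_needed = ACTUAL_RUNTIME / raw_estimate

# ...but the limit stays where it is, so the cached task is aborted.
exceeds_new_limit = ACTUAL_RUNTIME > new_limit

print(f"limit before: {old_limit:.0f} s, after: {new_limit:.0f} s")
print(f"DCF needed to fix the estimate: {dcf_needed:.2f}")
print(f"task exceeds the new limit: {exceeds_new_limit}")
```

In this made-up case the limit drops from 150000 s to 1500 s, so a task that really needs 2000 s would hit the maximum elapsed time error even after DCF has corrected the estimate.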

Kevin Olley
Joined: 3 Aug 99
Posts: 368
Credit: 35,235,790
RAC: 1,606
United Kingdom
Message 1177418 - Posted: 12 Dec 2011, 8:25:49 UTC - in response to Message 1177396.

I stick mine right under the version number.

You will need to add those numbers to all 6 occurrences of the GPU part.


Thank you

I shall stop rescheduling and let the numbers stabilise and then give it a try.

Just before you add the <flops> entries, I suggest you use the rescheduler to protect against -177 errors for maximum elapsed time exceeded on cached tasks. The expert tab has a setting called something like "limit rsc_fpops_bound" which provides that protection when you do a reschedule, even if the other settings mean that no tasks are actually rescheduled. That's as much as I know, I hope someone with practical experience will provide a clearer explanation.

Those cached tasks will have scaled down rsc_fpops_est which is used to calculate estimated time, and rsc_fpops_bound which is used to calculate the time limit. The big increase in flops will make the times much shorter, but DCF will soon increase to fix the estimates. Unfortunately DCF isn't used for the limit.
Joe


That is one of the main reasons I use the rescheduler. The other is to get an estimate of the ratio of VHARs to regulars residing in my cache.


____________
Kevin


Starman
Joined: 15 May 99
Posts: 134
Credit: 36,077,193
RAC: 56,725
Canada
Message 1178212 - Posted: 15 Dec 2011, 5:30:56 UTC

Well, I finally got some GPU work for this machine. In fact, 6 AP WUs no less!
The only change I made (2 days ago) was to downgrade to 6.10.60. :)

____________

AndyJ
Joined: 17 Aug 02
Posts: 248
Credit: 27,380,797
RAC: 0
United Kingdom
Message 1178272 - Posted: 15 Dec 2011, 11:56:34 UTC - in response to Message 1177396.

I stick mine right under the version number.

You will need to add those numbers to all 6 occurrences of the GPU part.


Thank you

I shall stop rescheduling and let the numbers stabilise and then give it a try.

Just before you add the <flops> entries, I suggest you use the rescheduler to protect against -177 errors for maximum elapsed time exceeded on cached tasks. The expert tab has a setting called something like "limit rsc_fpops_bound" which provides that protection when you do a reschedule, even if the other settings mean that no tasks are actually rescheduled. That's as much as I know, I hope someone with practical experience will provide a clearer explanation.

Those cached tasks will have scaled down rsc_fpops_est which is used to calculate estimated time, and rsc_fpops_bound which is used to calculate the time limit. The big increase in flops will make the times much shorter, but DCF will soon increase to fix the estimates. Unfortunately DCF isn't used for the limit.
Joe


Good info, Joe. I just put <flops> onto my main cruncher and got -177s. Will give this a try. Or is there a way to edit fpops in the app_info?

Regards,

A
____________


Copyright © 2014 University of California