Task takes too long error

MikeN

Joined: 24 Jan 11
Posts: 319
Credit: 64,719,409
RAC: 85
United Kingdom
Message 1231924 - Posted: 14 May 2012, 7:24:45 UTC

My main cruncher is fitted with two different graphics cards, a GTX460 and a GT430. There is a significant speed difference (about a factor of 7) between these two cards. I am getting a few computation errors due to tasks being processed by the GT430 exceeding the time allowed. I assume the calculation of the time allowed is dominated by the performance of the GTX460. Most of the time GT430 tasks complete without difficulty; the problem seems to be limited to those with a particularly difficult angle range, which take longer to crunch, and it is resulting in around 2 errors a day.

Is this a case where flops might be useful? I have never experimented with the flops setting, so any advice would be helpful. Is it directly proportional to processing power? I.e. if I work out my current flops value and halve it, will BOINC think my graphics cards are now half as powerful and double the allotted time for each WU? That would be enough to stop the computation errors.
ID: 1231924
arkayn
Volunteer tester

Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 1231928 - Posted: 14 May 2012, 7:40:05 UTC

I am wondering if those tasks are falling back to the CPU for some reason, as that is a very long time for a Fermi-class GPU to try crunching.

Does GPUZ show that the cards are running around 100% all the time?

ID: 1231928
MikeN

Joined: 24 Jan 11
Posts: 319
Credit: 64,719,409
RAC: 85
United Kingdom
Message 1231933 - Posted: 14 May 2012, 9:04:33 UTC - in response to Message 1231928.  
Last modified: 14 May 2012, 9:05:43 UTC

I am wondering if those tasks are falling back to the CPU for some reason, as that is a very long time for a Fermi-class GPU to try crunching.

Does GPUZ show that the cards are running around 100% all the time?


Both cards are running 100% of the time and are >99% used. I run 2 WUs at a time on both cards, as the system is optimized for the GTX460. This may explain why the GT430 tasks take so long.
ID: 1231933
.clair.

Joined: 4 Nov 04
Posts: 1300
Credit: 55,390,408
RAC: 69
United Kingdom
Message 1231941 - Posted: 14 May 2012, 10:31:08 UTC

I had a go at running two at a time on my GT430 and it was a big backwards step; tasks took nearly three times as long to complete.
ID: 1231941
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1232012 - Posted: 14 May 2012, 14:40:15 UTC - in response to Message 1231924.  

...
Is this a case where flops might be useful? I have never experimented with the flops setting, so any advice would be helpful. Is it directly proportional to processing power? I.e. if I work out my current flops value and halve it, will BOINC think my graphics cards are now half as powerful and double the allotted time for each WU? That would be enough to stop the computation errors.

No, flops would not help because they can't be set differently for the different cards. If you halved the flops, it would only help for work already in cache. The servers would compensate and halve the rsc_fpops_est values for new work so you'd be right back where you started.

Boosting the rsc_fpops_bound values using the rescheduler is probably the simplest cure.
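Each task in client_state.xml carries both values, and stock work arrives with the bound at 10 times the estimate, along these lines (name and numbers invented):

<workunit>
    <name>01ap12aa.4321.12345.6.10.99</name>
    <rsc_fpops_est>27000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>270000000000000.000000</rsc_fpops_bound>
</workunit>

Once the client reckons a task has consumed more than rsc_fpops_bound worth of computing, it aborts it as "too long", so raising the bound directly raises the permitted runtime.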
                                                                   Joe


ID: 1232012
Horacio

Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1232058 - Posted: 14 May 2012, 16:35:29 UTC - in response to Message 1231933.  

I am wondering if those tasks are falling back to the CPU for some reason, as that is a very long time for a Fermi-class GPU to try crunching.

Does GPUZ show that the cards are running around 100% all the time?


Both cards are running 100% of the time and are >99% used. I run 2 WUs at a time on both cards, as the system is optimized for the GTX460. This may explain why the GT430 tasks take so long.


The GT430 is a really cheap GPU: low power, low cost... low performance.
Anyway, I have two of them, both running 2 WUs each, and it works for me because the crunching times when running two are well under twice those of running just one.
I guess your best option (without babysitting) is to not mix the 430 with the 460. I've seen that you have a 9500 in another system; pairing the 430 with that could work better, as the crunching times will not be so different (at least not by a factor of 7).


ID: 1232058
MikeN

Joined: 24 Jan 11
Posts: 319
Credit: 64,719,409
RAC: 85
United Kingdom
Message 1232075 - Posted: 14 May 2012, 17:06:56 UTC - in response to Message 1231941.  

I had a go at running two at a time on my GT430 and it was a big backwards step; tasks took nearly three times as long to complete.


Unfortunately BOINC does not let you set different numbers of simultaneous tasks for different graphics cards. I want to run 2 at a time on the GTX460 to get the most out of that card, which dominates the system, so I have to run 2 at a time on the GT430 as well.
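For reference, the 2-at-a-time setting is a single count in app_info.xml that applies to every NVIDIA card alike; the relevant fragment looks something like this (trimmed to the relevant bit, values from memory):

<app_version>
    <app_name>setiathome_enhanced</app_name>
    <coproc>
        <type>CUDA</type>
        <count>0.5</count>
    </coproc>
</app_version>

The count of 0.5 is what gives two tasks per GPU, and there is nowhere to say 0.5 for the GTX460 but 1 for the GT430.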
ID: 1232075
MikeN

Joined: 24 Jan 11
Posts: 319
Credit: 64,719,409
RAC: 85
United Kingdom
Message 1232076 - Posted: 14 May 2012, 17:08:02 UTC - in response to Message 1232058.  

I am wondering if those tasks are falling back to the CPU for some reason, as that is a very long time for a Fermi-class GPU to try crunching.

Does GPUZ show that the cards are running around 100% all the time?


Both cards are running 100% of the time and are >99% used. I run 2 WUs at a time on both cards, as the system is optimized for the GTX460. This may explain why the GT430 tasks take so long.


The GT430 is a really cheap GPU: low power, low cost... low performance.
Anyway, I have two of them, both running 2 WUs each, and it works for me because the crunching times when running two are well under twice those of running just one.
I guess your best option (without babysitting) is to not mix the 430 with the 460. I've seen that you have a 9500 in another system; pairing the 430 with that could work better, as the crunching times will not be so different (at least not by a factor of 7).



Nice idea, but the PC with the 9500 in it only has one PCIe slot :(
ID: 1232076
arkayn
Volunteer tester

Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 1232098 - Posted: 14 May 2012, 18:15:12 UTC - in response to Message 1232076.  

I am wondering if those tasks are falling back to the CPU for some reason, as that is a very long time for a Fermi-class GPU to try crunching.

Does GPUZ show that the cards are running around 100% all the time?


Both cards are running 100% of the time and are >99% used. I run 2 WUs at a time on both cards, as the system is optimized for the GTX460. This may explain why the GT430 tasks take so long.


The GT430 is a really cheap GPU: low power, low cost... low performance.
Anyway, I have two of them, both running 2 WUs each, and it works for me because the crunching times when running two are well under twice those of running just one.
I guess your best option (without babysitting) is to not mix the 430 with the 460. I've seen that you have a 9500 in another system; pairing the 430 with that could work better, as the crunching times will not be so different (at least not by a factor of 7).



Nice idea, but the PC with the 9500 in it only has one PCIe slot :(


Put the 460 in that machine and the 9500 in this one.


ID: 1232098
MikeN

Joined: 24 Jan 11
Posts: 319
Credit: 64,719,409
RAC: 85
United Kingdom
Message 1232099 - Posted: 14 May 2012, 18:17:33 UTC - in response to Message 1232098.  


Nice idea, but the PC with the 9500 in it only has one PCIe slot :(


Put the 460 in that machine and the 9500 in this one.

Wish I could, but the dual-GPU PC is my office machine whilst the single-GPU PC is at home. I cannot take GPUs from work :(
ID: 1232099
Horacio

Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1232102 - Posted: 14 May 2012, 18:24:32 UTC - in response to Message 1232076.  
Last modified: 14 May 2012, 18:25:49 UTC

I guess your best option (without babysitting) is to not mix the 430 with the 460. I've seen that you have a 9500 in another system; pairing the 430 with that could work better, as the crunching times will not be so different (at least not by a factor of 7).

Nice idea, but the PC with the 9500 in it only has one PCIe slot :(

Another option would be to swap GPUs between your hosts: the 460 in the single-PCIe-slot host and the others in another... (It might not be the best option, as I guess you want the best GPU in the faster host for other uses besides crunching... but...)

EDIT: I'm late ... LOL
ID: 1232102
red-ray

Joined: 24 Jun 99
Posts: 308
Credit: 9,029,848
RAC: 0
United Kingdom
Message 1232106 - Posted: 14 May 2012, 18:39:13 UTC
Last modified: 14 May 2012, 19:21:24 UTC

I ran mixed-speed GPUs on http://setiathome.berkeley.edu/show_host_detail.php?hostid=6379672 without getting this issue for 3 months, but over the past few weeks it is happening more and more. I have just written a small C program that updates all the <rsc_fpops_bound> values in client_state.xml to be 25 rather than 10 times the <rsc_fpops_est> values, which should resolve this; I just need to stop BOINC and run it every few days. It also resets the DCF to 1.0.
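The logic is nothing clever. In outline it looks like this (a rough sketch of the idea, not the exact code; it assumes the usual one-tag-per-line layout of client_state.xml, and BOINC must be stopped while it runs):

/* Rewrite every <rsc_fpops_bound> to 25x the task's <rsc_fpops_est>,
   and reset each <duration_correction_factor> to 1.0. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *in  = fopen("client_state.xml", "r");
    FILE *out = fopen("client_state.new", "w");
    char line[4096];
    double est = 0.0;
    char *p;

    if (!in || !out) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof line, in)) {
        if ((p = strstr(line, "<rsc_fpops_est>")) != NULL) {
            est = atof(p + strlen("<rsc_fpops_est>"));   /* remember the task's estimate */
            fputs(line, out);
        } else if (strstr(line, "<rsc_fpops_bound>") && est > 0.0) {
            fprintf(out, "<rsc_fpops_bound>%.6f</rsc_fpops_bound>\n", est * 25.0);
        } else if (strstr(line, "<duration_correction_factor>")) {
            fputs("<duration_correction_factor>1.000000</duration_correction_factor>\n", out);
        } else {
            fputs(line, out);
        }
    }
    fclose(in);
    fclose(out);
    return 0;   /* rename client_state.new over client_state.xml after checking it */
}

The rewritten lines lose their indentation, which the client's XML parser does not mind.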

I think you can do something similar with Fred's rescheduler, but I have never used it.

The real problem is that BOINC thinks all the GPUs in a system must be the same speed, and unless this is fixed, running mixed-speed GPUs will have issues. The main one for me now is that it's impossible to get a stable DCF. I just hope SETI@home starts using the <dont_use_dcf/> facility.
ID: 1232106
Area 51

Joined: 31 Jan 04
Posts: 965
Credit: 42,193,520
RAC: 0
United Kingdom
Message 1232112 - Posted: 14 May 2012, 19:00:19 UTC - in response to Message 1232106.  

I ran mixed-speed GPUs on http://setiathome.berkeley.edu/show_host_detail.php?hostid=6379672 without getting this issue for 3 months, but over the past few weeks it is happening far more. I have just written a small C program that updates all the <rsc_fpops_bound> values in client_state.xml to be 25 rather than 10 times the <rsc_fpops_est> values, which should resolve this; I just need to stop BOINC and run it every few days. It also resets the DCF to 1.0.

I think you can do something similar with Fred's rescheduler, but I have never used it.

The real problem is that BOINC thinks all the GPUs in a system must be the same speed, and unless this is fixed, running mixed-speed GPUs will have issues. The main one is that it's impossible to get a stable DCF. I just hope SETI@home starts using the <dont_use_dcf/> facility.


This sort of thing is a pet hate of mine. BOINC does not work at the correct granularity: each processing resource should have its own set of values. I think the current situation is the result of the knee-jerk reaction when CUDA processing first came about (i.e., make it happen). In the meantime, nobody has gone back and re-coded to do resource management properly. Frustrating, but I don't think it will change very soon - if at all.

I guess the real problem is the legacy apps some people may be running, such that if the structure of the XML files were changed significantly, the old apps still in use would break. This all seems to stem from a fundamental design decision years ago (pre-GPU): BOINC can't handle the current possibility that newer machines may have multiple processing resources, each with its own characteristics.

IMHO, BOINC needs a complete re-write - but that would be a huge task - and that's even less likely to happen!
ID: 1232112
red-ray

Joined: 24 Jun 99
Posts: 308
Credit: 9,029,848
RAC: 0
United Kingdom
Message 1232113 - Posted: 14 May 2012, 19:00:29 UTC - in response to Message 1232012.  
Last modified: 14 May 2012, 19:18:26 UTC

No, flops would not help because they can't be set differently for the different cards. If you halved the flops, it would only help for work already in cache. The servers would compensate and halve the rsc_fpops_est values for new work so you'd be right back where you started.

Thank you for this, Joe. I keep getting told I need to set the FLOPS; when I do so, it works for a while, and then the ERR_RSC_LIMIT_EXCEEDED errors start again. I suspected this was happening, but was unsure.
ID: 1232113
red-ray

Joined: 24 Jun 99
Posts: 308
Credit: 9,029,848
RAC: 0
United Kingdom
Message 1232121 - Posted: 14 May 2012, 19:15:19 UTC - in response to Message 1232112.  
Last modified: 14 May 2012, 19:28:29 UTC

This sort of thing is a pet hate of mine.

+1 (one who is very tempted to at least fix BOINC to allow for different GPU speeds).
ID: 1232121
MikeN

Joined: 24 Jan 11
Posts: 319
Credit: 64,719,409
RAC: 85
United Kingdom
Message 1232124 - Posted: 14 May 2012, 19:21:12 UTC - in response to Message 1232106.  

I ran mixed-speed GPUs on http://setiathome.berkeley.edu/show_host_detail.php?hostid=6379672 without getting this issue for 3 months, but over the past few weeks it is happening more and more. I have just written a small C program that updates all the <rsc_fpops_bound> values in client_state.xml to be 25 rather than 10 times the <rsc_fpops_est> values, which should resolve this; I just need to stop BOINC and run it every few days. It also resets the DCF to 1.0.

I think you can do something similar with Fred's rescheduler, but I have never used it.

The real problem is that BOINC thinks all the GPUs in a system must be the same speed, and unless this is fixed, running mixed-speed GPUs will have issues. The main one is that it's impossible to get a stable DCF. I just hope SETI@home starts using the <dont_use_dcf/> facility.


I agree that the problem seems to have got worse recently (since the limits were raised). I bought the GT430 in February and had no problems until about a week ago. Since then about 2 WUs a day have failed to complete.

I have now removed the GT430 card until someone can come up with a solution or I better understand what is going on. I thought I understood flops, but apparently not, and these parameters are beyond me.
ID: 1232124
MikeN

Joined: 24 Jan 11
Posts: 319
Credit: 64,719,409
RAC: 85
United Kingdom
Message 1232127 - Posted: 14 May 2012, 19:22:47 UTC - in response to Message 1232121.  

This sort of thing is a pet hate of mine.


+1 (one who is very tempted to at least fix BOINC to allow for different GPU speeds).


+2
ID: 1232127
LadyL
Volunteer tester

Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1232432 - Posted: 15 May 2012, 11:03:49 UTC - in response to Message 1232012.  

No, flops would not help because they can't be set differently for the different cards. If you halved the flops, it would only help for work already in cache. The servers would compensate and halve the rsc_fpops_est values for new work so you'd be right back where you started.

Boosting the rsc_fpops_bound values using the rescheduler is probably the simplest cure.
                                                                   Joe




I thought flops override the APR mechanism?
What's the use of adding flops at all, then, if the server will compensate?
I'm not the Pope. I don't speak Ex Cathedra!
ID: 1232432
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1232444 - Posted: 15 May 2012, 11:58:45 UTC - in response to Message 1232432.  

No, flops would not help because they can't be set differently for the different cards. If you halved the flops, it would only help for work already in cache. The servers would compensate and halve the rsc_fpops_est values for new work so you'd be right back where you started.

Boosting the rsc_fpops_bound values using the rescheduler is probably the simplest cure.
                                                                   Joe


I thought flops override the APR mechanism?
What's the use of adding flops at all, then, if the server will compensate?

Good question. I think I started the whole cottage industry of adding <flops> entries with this beta post - but that was back in March 2009, just a couple of months after CUDA (no ATI at that stage) crunching got properly under way. At that point, just after the launch of the first BOINC version which attempted to support CUDA properly, there was no way to stabilise DCF except by manually inserting flops estimates.

About 15 months after that initial suggestion, CreditNew arrived here, and with it the automatic APR mechanism - which replaces both DCF and <flops>. My advice then was, and remains, not to use any manual <flops> entries: though if anybody has been using them as a temporary measure while APR has been hors de combat, they should probably think carefully before removing them.
ID: 1232444
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1232528 - Posted: 15 May 2012, 15:35:30 UTC - in response to Message 1232432.  

No, flops would not help because they can't be set differently for the different cards. If you halved the flops, it would only help for work already in cache. The servers would compensate and halve the rsc_fpops_est values for new work so you'd be right back where you started.

Boosting the rsc_fpops_bound values using the rescheduler is probably the simplest cure.
                                                                   Joe


I thought flops override the APR mechanism?
What's the use of adding flops at all, then, if the server will compensate?

No, for anonymous platform the servers scale rsc_fpops_est and rsc_fpops_bound by the ratio of flops/(APR*1e09). Adding <flops> in app_info.xml keeps the client from sending a hugely wrong guesstimate of what they should be. Particularly for new applications it provides a way to keep DCF stable and estimates of runtime reasonable.

For stock, once APR is established, APR*1e09 is sent as the flops for each app_version. Adding <flops> in app_info.xml is the available method to simulate that on anonymous platform; otherwise the client sends flops based on the Whetstone CPU benchmark.
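To put invented numbers on it: suppose app_info.xml contains <flops>25000000000</flops> but the server's APR for that app_version works out to 50 GFLOPS.

/* Worked example of the anonymous platform scaling (all numbers invented). */
#include <stdio.h>

int main(void)
{
    double flops      = 25e9;    /* <flops> sent from app_info.xml */
    double apr_gflops = 50.0;    /* server-side Average Processing Rate */
    double est        = 27e12;   /* nominal rsc_fpops_est of a new task */

    double scale = flops / (apr_gflops * 1e9);            /* = 0.5 */
    printf("scale = %.2f, scaled est = %.3e\n", scale, est * scale);
    return 0;
}

New work arrives with est (and bound) halved, so the ratio between runtime and estimate, which is what DCF and the bound check care about, ends up about where it started.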
                                                                  Joe
ID: 1232528
