Task takes too long error

MikeN

Joined: 24 Jan 11
Posts: 319
Credit: 64,719,409
RAC: 85
United Kingdom
Message 1231924 - Posted: 14 May 2012, 7:24:45 UTC

My main cruncher is fitted with two different graphics cards, a GTX460 and a GT430. There is a significant speed difference (about a factor of 7) between these two cards. I am getting a few computation errors due to tasks being processed by the GT430 exceeding the time allowed. I assume the calculation of the time allowed is dominated by the performance of the GTX460. Most of the time GT430 tasks complete without difficulty; the problem seems to be limited to those with a particularly difficult angle range, which take longer to crunch, and it is resulting in around 2 errors a day.

Is this a case where flops might be useful? I have never experimented with the flops setting, so any advice would be helpful. Is it directly proportional to processing power? I.e. if I work out my current flops value and halve it, will BOINC think my graphics cards are now half as powerful and double the allotted time for each WU? That would be enough to stop the computation errors.
ID: 1231924
arkayn
Volunteer tester

Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 1231928 - Posted: 14 May 2012, 7:40:05 UTC

I am wondering if those tasks are falling back to the CPU for some reason, as that is a very long time for a Fermi-class GPU to try crunching.

Does GPUZ show that the cards are running around 100% all the time?

ID: 1231928
MikeN

Joined: 24 Jan 11
Posts: 319
Credit: 64,719,409
RAC: 85
United Kingdom
Message 1231933 - Posted: 14 May 2012, 9:04:33 UTC - in response to Message 1231928.  
Last modified: 14 May 2012, 9:05:43 UTC

I am wondering if those tasks are falling back to the CPU for some reason, as that is a very long time for a Fermi-class GPU to try crunching.

Does GPUZ show that the cards are running around 100% all the time?


Both cards are running 100% of the time and are >99% used. I run 2 WUs at a time on both cards, as the system is optimized for the GTX460. This may explain why the GT430 tasks take so long.
ID: 1231933
.clair.

Joined: 4 Nov 04
Posts: 1300
Credit: 55,390,408
RAC: 69
United Kingdom
Message 1231941 - Posted: 14 May 2012, 10:31:08 UTC

I had a go at running two at a time on my GT430 and it was a big backwards step; tasks took nearly three times as long to complete.
ID: 1231941
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1232012 - Posted: 14 May 2012, 14:40:15 UTC - in response to Message 1231924.  

...
Is this a case where flops might be useful? I have never experimented with the flops setting, so any advice would be helpful. Is it directly proportional to processing power? I.e. if I work out my current flops value and halve it, will BOINC think my graphics cards are now half as powerful and double the allotted time for each WU? That would be enough to stop the computation errors.

No, flops would not help because they can't be set differently for the different cards. If you halved the flops, it would only help for work already in cache. The servers would compensate and halve the rsc_fpops_est values for new work so you'd be right back where you started.

Boosting the rsc_fpops_bound values using the rescheduler is probably the simplest cure.
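Each task in client_state.xml carries both values, and stock work arrives with the bound at 10 times the estimate, along these lines (name and numbers invented):

<workunit>
    <name>01ap12aa.4321.12345.6.10.99</name>
    <rsc_fpops_est>27000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>270000000000000.000000</rsc_fpops_bound>
</workunit>

Once the client reckons a task has consumed more than rsc_fpops_bound worth of computing, it aborts it as "too long", so raising the bound directly raises the permitted runtime.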
                                                                   Joe


ID: 1232012
Horacio

Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1232058 - Posted: 14 May 2012, 16:35:29 UTC - in response to Message 1231933.  

I am wondering if those tasks are falling back to the CPU for some reason, as that is a very long time for a Fermi-class GPU to try crunching.

Does GPUZ show that the cards are running around 100% all the time?


Both cards are running 100% of the time and are >99% used. I run 2 WUs at a time on both cards, as the system is optimized for the GTX460. This may explain why the GT430 tasks take so long.


The GT430 is a really cheap GPU: low power, low cost... low performance.
Anyway, I have two of them, both running 2 WUs each, and it works for me because the crunching times when running two are well under twice those of running just one.
I guess your best option (without babysitting) is to not mix the 430 with the 460. I've seen that you have a 9500 in another system; pairing the 430 with that could work better, as the crunching times will not be so different (at least not by a factor of 7).


ID: 1232058
MikeN

Joined: 24 Jan 11
Posts: 319
Credit: 64,719,409
RAC: 85
United Kingdom
Message 1232075 - Posted: 14 May 2012, 17:06:56 UTC - in response to Message 1231941.  

I had a go at running two at a time on my GT430 and it was a big backwards step; tasks took nearly three times as long to complete.


Unfortunately BOINC does not let you set different numbers of simultaneous tasks for different graphics cards. I want to run 2 at a time on the GTX460 to get the most out of that card, which dominates the system, so I have to run 2 at a time on the GT430 as well.
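For reference, the 2-at-a-time setting is a single count in app_info.xml that applies to every NVIDIA card alike; the relevant fragment looks something like this (trimmed to the relevant bit, values from memory):

<app_version>
    <app_name>setiathome_enhanced</app_name>
    <coproc>
        <type>CUDA</type>
        <count>0.5</count>
    </coproc>
</app_version>

The count of 0.5 is what gives two tasks per GPU, and there is nowhere to say 0.5 for the GTX460 but 1 for the GT430.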
ID: 1232075
MikeN

Joined: 24 Jan 11
Posts: 319
Credit: 64,719,409
RAC: 85
United Kingdom
Message 1232076 - Posted: 14 May 2012, 17:08:02 UTC - in response to Message 1232058.  

I am wondering if those tasks are falling back to the CPU for some reason, as that is a very long time for a Fermi-class GPU to try crunching.

Does GPUZ show that the cards are running around 100% all the time?


Both cards are running 100% of the time and are >99% used. I run 2 WUs at a time on both cards, as the system is optimized for the GTX460. This may explain why the GT430 tasks take so long.


The GT430 is a really cheap GPU: low power, low cost... low performance.
Anyway, I have two of them, both running 2 WUs each, and it works for me because the crunching times when running two are well under twice those of running just one.
I guess your best option (without babysitting) is to not mix the 430 with the 460. I've seen that you have a 9500 in another system; pairing the 430 with that could work better, as the crunching times will not be so different (at least not by a factor of 7).



Nice idea, but the PC with the 9500 in it only has one PCIe slot :(
ID: 1232076
arkayn
Volunteer tester

Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 1232098 - Posted: 14 May 2012, 18:15:12 UTC - in response to Message 1232076.  

I am wondering if those tasks are falling back to the CPU for some reason, as that is a very long time for a Fermi-class GPU to try crunching.

Does GPUZ show that the cards are running around 100% all the time?


Both cards are running 100% of the time and are >99% used. I run 2 WUs at a time on both cards, as the system is optimized for the GTX460. This may explain why the GT430 tasks take so long.


The GT430 is a really cheap GPU: low power, low cost... low performance.
Anyway, I have two of them, both running 2 WUs each, and it works for me because the crunching times when running two are well under twice those of running just one.
I guess your best option (without babysitting) is to not mix the 430 with the 460. I've seen that you have a 9500 in another system; pairing the 430 with that could work better, as the crunching times will not be so different (at least not by a factor of 7).



Nice idea, but the PC with the 9500 in it only has one PCIe slot :(


Put the 460 in that machine and the 9500 in this one.


ID: 1232098
MikeN

Joined: 24 Jan 11
Posts: 319
Credit: 64,719,409
RAC: 85
United Kingdom
Message 1232099 - Posted: 14 May 2012, 18:17:33 UTC - in response to Message 1232098.  


Nice idea, but the PC with the 9500 in it only has one PCIe slot :(


Put the 460 in that machine and the 9500 in this one.

Wish I could, but the dual-GPU PC is my office machine whilst the single-GPU PC is at home. I cannot take GPUs from work :(
ID: 1232099
Horacio

Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1232102 - Posted: 14 May 2012, 18:24:32 UTC - in response to Message 1232076.  
Last modified: 14 May 2012, 18:25:49 UTC

I guess your best option (without babysitting) is to not mix the 430 with the 460. I've seen that you have a 9500 in another system; pairing the 430 with that could work better, as the crunching times will not be so different (at least not by a factor of 7).

Nice idea, but the PC with the 9500 in it only has one PCIe slot :(

Another option would be to swap GPUs between your hosts: the 460 in the single-PCIe-slot host and the others in another... (It might not be the best option, as I guess you want the best GPU in the faster host for other uses besides crunching... but...)

EDIT: I'm late ... LOL
ID: 1232102
red-ray

Joined: 24 Jun 99
Posts: 308
Credit: 9,029,848
RAC: 0
United Kingdom
Message 1232106 - Posted: 14 May 2012, 18:39:13 UTC
Last modified: 14 May 2012, 19:21:24 UTC

I ran mixed-speed GPUs on http://setiathome.berkeley.edu/show_host_detail.php?hostid=6379672 without getting this issue for 3 months, but over the past few weeks it is happening more and more. I have just written a small C program that updates all the <rsc_fpops_bound> values in client_state.xml to be 25 rather than 10 times the <rsc_fpops_est> values, which should resolve this; I just need to stop BOINC and run it every few days. It also resets the DCF to 1.0.
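The logic is nothing clever. In outline it looks like this (a rough sketch of the idea, not the exact code; it assumes the usual one-tag-per-line layout of client_state.xml, and BOINC must be stopped while it runs):

/* Rewrite every <rsc_fpops_bound> to 25x the task's <rsc_fpops_est>,
   and reset each <duration_correction_factor> to 1.0. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *in  = fopen("client_state.xml", "r");
    FILE *out = fopen("client_state.new", "w");
    char line[4096];
    double est = 0.0;
    char *p;

    if (!in || !out) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof line, in)) {
        if ((p = strstr(line, "<rsc_fpops_est>")) != NULL) {
            est = atof(p + strlen("<rsc_fpops_est>"));   /* remember the task's estimate */
            fputs(line, out);
        } else if (strstr(line, "<rsc_fpops_bound>") && est > 0.0) {
            fprintf(out, "<rsc_fpops_bound>%.6f</rsc_fpops_bound>\n", est * 25.0);
        } else if (strstr(line, "<duration_correction_factor>")) {
            fputs("<duration_correction_factor>1.000000</duration_correction_factor>\n", out);
        } else {
            fputs(line, out);
        }
    }
    fclose(in);
    fclose(out);
    return 0;   /* rename client_state.new over client_state.xml after checking it */
}

The rewritten lines lose their indentation, which the client's XML parser does not mind.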

I think you can do something similar with Fred's rescheduler, but I have never used it.

The real problem is that BOINC thinks all the GPUs in a system must be the same speed, and unless this is fixed, running mixed-speed GPUs will have issues. The main one for me now is that it's impossible to get a stable DCF. I just hope SETI@home starts using the <dont_use_dcf/> facility.
ID: 1232106
Area 51

Joined: 31 Jan 04
Posts: 965
Credit: 42,193,520
RAC: 0
United Kingdom
Message 1232112 - Posted: 14 May 2012, 19:00:19 UTC - in response to Message 1232106.  

I ran mixed-speed GPUs on http://setiathome.berkeley.edu/show_host_detail.php?hostid=6379672 without getting this issue for 3 months, but over the past few weeks it is happening far more. I have just written a small C program that updates all the <rsc_fpops_bound> values in client_state.xml to be 25 rather than 10 times the <rsc_fpops_est> values, which should resolve this; I just need to stop BOINC and run it every few days. It also resets the DCF to 1.0.

I think you can do something similar with Fred's rescheduler, but I have never used it.

The real problem is that BOINC thinks all the GPUs in a system must be the same speed, and unless this is fixed, running mixed-speed GPUs will have issues. The main one is that it's impossible to get a stable DCF. I just hope SETI@home starts using the <dont_use_dcf/> facility.


This sort of thing is a pet hate of mine. BOINC does not work at the correct granularity: each processing resource should have its own set of values. I think the current situation is the result of the knee-jerk reaction when CUDA processing first came about (i.e., make it happen). In the meantime, nobody has gone back and re-coded to do resource management properly. Frustrating, but I don't think it will change very soon - if at all.

I guess the real problem is the legacy apps some people may be running, such that if the structure of the XML files were changed significantly, the old apps still in use would break. This all seems to stem from a fundamental design decision years ago (pre-GPU): BOINC can't handle the current possibility that newer machines may have multiple processing resources, each with its own characteristics.

IMHO, BOINC needs a complete re-write - but that would be a huge task - and that's even less likely to happen!
ID: 1232112
red-ray

Joined: 24 Jun 99
Posts: 308
Credit: 9,029,848
RAC: 0
United Kingdom
Message 1232113 - Posted: 14 May 2012, 19:00:29 UTC - in response to Message 1232012.  
Last modified: 14 May 2012, 19:18:26 UTC

No, flops would not help because they can't be set differently for the different cards. If you halved the flops, it would only help for work already in cache. The servers would compensate and halve the rsc_fpops_est values for new work so you'd be right back where you started.

Thank you for this, Joe. I keep getting told I need to set the FLOPS; when I do so, it works for a while, and then the ERR_RSC_LIMIT_EXCEEDED errors start again. I suspected this was happening, but was unsure.
ID: 1232113
red-ray

Joined: 24 Jun 99
Posts: 308
Credit: 9,029,848
RAC: 0
United Kingdom
Message 1232121 - Posted: 14 May 2012, 19:15:19 UTC - in response to Message 1232112.  
Last modified: 14 May 2012, 19:28:29 UTC

This sort of thing is a pet hate of mine.

+1 (one who is very tempted to at least fix BOINC to allow for different GPU speeds).
ID: 1232121
MikeN

Joined: 24 Jan 11
Posts: 319
Credit: 64,719,409
RAC: 85
United Kingdom
Message 1232124 - Posted: 14 May 2012, 19:21:12 UTC - in response to Message 1232106.  

I ran mixed-speed GPUs on http://setiathome.berkeley.edu/show_host_detail.php?hostid=6379672 without getting this issue for 3 months, but over the past few weeks it is happening more and more. I have just written a small C program that updates all the <rsc_fpops_bound> values in client_state.xml to be 25 rather than 10 times the <rsc_fpops_est> values, which should resolve this; I just need to stop BOINC and run it every few days. It also resets the DCF to 1.0.

I think you can do something similar with Fred's rescheduler, but I have never used it.

The real problem is that BOINC thinks all the GPUs in a system must be the same speed, and unless this is fixed, running mixed-speed GPUs will have issues. The main one is that it's impossible to get a stable DCF. I just hope SETI@home starts using the <dont_use_dcf/> facility.


I agree that the problem seems to have got worse recently (since the limits were raised). I bought the GT430 in February and had no problems until about a week ago. Since then about 2 WUs a day have failed to complete.

I have now removed the GT430 card until someone can come up with a solution or I better understand what is going on. I thought I understood flops, but apparently not, and these parameters are beyond me.
ID: 1232124
MikeN

Joined: 24 Jan 11
Posts: 319
Credit: 64,719,409
RAC: 85
United Kingdom
Message 1232127 - Posted: 14 May 2012, 19:22:47 UTC - in response to Message 1232121.  

This sort of thing is a pet hate of mine.


+1 (one who is very tempted to at least fix BOINC to allow for different GPU speeds).


+2
ID: 1232127
LadyL
Volunteer tester

Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1232432 - Posted: 15 May 2012, 11:03:49 UTC - in response to Message 1232012.  

No, flops would not help because they can't be set differently for the different cards. If you halved the flops, it would only help for work already in cache. The servers would compensate and halve the rsc_fpops_est values for new work so you'd be right back where you started.

Boosting the rsc_fpops_bound values using the rescheduler is probably the simplest cure.
                                                                   Joe




I thought flops override the APR mechanism?
What's the use of adding flops at all, then, if the server will compensate?
I'm not the Pope. I don't speak Ex Cathedra!
ID: 1232432
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1232444 - Posted: 15 May 2012, 11:58:45 UTC - in response to Message 1232432.  

No, flops would not help because they can't be set differently for the different cards. If you halved the flops, it would only help for work already in cache. The servers would compensate and halve the rsc_fpops_est values for new work so you'd be right back where you started.

Boosting the rsc_fpops_bound values using the rescheduler is probably the simplest cure.
                                                                   Joe


I thought flops override the APR mechanism?
What's the use of adding flops at all, then, if the server will compensate?

Good question. I think I started the whole cottage industry of adding <flops> entries with this beta post - but that was back in March 2009, just a couple of months after CUDA (no ATI at that stage) crunching got properly under way. At that point, just after the launch of the first BOINC version which attempted to support CUDA properly, there was no way to stabilise DCF except by manually inserting flops estimates.

About 15 months after that initial suggestion, CreditNew arrived here, and with it the automatic APR mechanism - which replaces both DCF and <flops>. My advice then was, and remains, not to use any manual <flops> entries: though if anybody has been using them as a temporary measure while APR has been hors de combat, they should probably think carefully before removing them.
ID: 1232444
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1232528 - Posted: 15 May 2012, 15:35:30 UTC - in response to Message 1232432.  

No, flops would not help because they can't be set differently for the different cards. If you halved the flops, it would only help for work already in cache. The servers would compensate and halve the rsc_fpops_est values for new work so you'd be right back where you started.

Boosting the rsc_fpops_bound values using the rescheduler is probably the simplest cure.
                                                                   Joe


I thought flops override the APR mechanism?
What's the use of adding flops at all, then, if the server will compensate?

No, for anonymous platform the servers scale rsc_fpops_est and rsc_fpops_bound by the ratio of flops/(APR*1e09). Adding <flops> in app_info.xml keeps the client from sending a hugely wrong guesstimate of what they should be. Particularly for new applications it provides a way to keep DCF stable and estimates of runtime reasonable.

For stock, once APR is established, APR*1e09 is sent as the flops for each app_version. Adding <flops> in app_info.xml is the available method to simulate that on anonymous platform; otherwise the client sends flops based on the Whetstone CPU benchmark.
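To put invented numbers on it: suppose app_info.xml contains <flops>25000000000</flops> but the server's APR for that app_version works out to 50 GFLOPS.

/* Worked example of the anonymous platform scaling (all numbers invented). */
#include <stdio.h>

int main(void)
{
    double flops      = 25e9;    /* <flops> sent from app_info.xml */
    double apr_gflops = 50.0;    /* server-side Average Processing Rate */
    double est        = 27e12;   /* nominal rsc_fpops_est of a new task */

    double scale = flops / (apr_gflops * 1e9);            /* = 0.5 */
    printf("scale = %.2f, scaled est = %.3e\n", scale, est * scale);
    return 0;
}

New work arrives with est (and bound) halved, so the ratio between runtime and estimate, which is what DCF and the bound check care about, ends up about where it started.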
                                                                  Joe
ID: 1232528
