Linux CUDA 'Special' App finally available, featuring Low CPU use

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1888039 - Posted: 6 Sep 2017, 2:32:46 UTC - in response to Message 1888029.  
Last modified: 6 Sep 2017, 3:32:29 UTC

Seems to me you probably hit the APR threshold that TBar warned you about a week or so ago, by rescheduling too many tasks that were originally assigned to the CPU to run on the GPU. If you look at the estimated run time of a similar task in BOINC Manager right after you reschedule it, it might be less than a minute.
ID: 1888039
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1888070 - Posted: 6 Sep 2017, 6:11:55 UTC - in response to Message 1888039.  

Seems to me you probably hit the APR threshold that TBar warned you about a week or so ago, by rescheduling too many tasks that were originally assigned to the CPU to run on the GPU. If you look at the estimated run time of a similar task in BOINC Manager right after you reschedule it, it might be less than a minute.


. . Hi Rob,

. . Thanks for the input, but I cannot see that being the reason. The shortest runtime estimate for any task on my rigs is over 1 min, and even then a task would have had to exceed 20 mins for that threshold to become an issue. I am pretty sure something else caused the problem. It would be interesting if you turned out to be correct, because it would totally invalidate my understanding of that relationship. I hope someone can offer a way of determining a definitive cause.

Stephen

??
ID: 1888070
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22202
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1888079 - Posted: 6 Sep 2017, 7:24:20 UTC

You are experiencing exactly what TBar warned you about the other day - they ran for 8 minutes and then exceeded the maximum time allowed. (I thought it was a factor of 10, which would explain you hitting the buffers before the expected factor of twenty.)
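
As a rough worked example of that limit (illustrative numbers only, not taken from any host here) - the client aborts a task once its elapsed time passes the bound factor times the estimate:

#!/usr/bin/perl
# Sketch of the elapsed-time cutoff - assumed, illustrative numbers.
use strict; use warnings;
my $estimate = 48;                   # estimated runtime, in seconds
my $factor   = 10;                   # the 10x (or 20x) bound discussed here
my $cutoff   = $estimate * $factor;  # elapsed seconds before the task is aborted
printf "cutoff = %d s = %.0f min\n", $cutoff, $cutoff / 60;   # 480 s = 8 min

So an abort at 8 minutes under a 10x bound implies an estimate of only ~48 seconds.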

The best thing you can do to recover the situation is to stop rescheduling for a few days to let the APRs get back to normal.



btw. - your comments should have been addressed to "Jeff" not "Rob", as it was Jeff's post you were referring to, not one of mine.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1888079
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1888081 - Posted: 6 Sep 2017, 8:04:10 UTC - in response to Message 1888079.  
Last modified: 6 Sep 2017, 8:07:15 UTC

You are experiencing exactly what TBar warned you about the other day - they ran for 8 minutes and then exceeded the maximum time allowed. (I thought it was a factor of 10, which would explain you hitting the buffers before the expected factor of twenty.)

The best thing you can do to recover the situation is to stop rescheduling for a few days to let the APRs get back to normal.

btw. - your comments should have been addressed to "Jeff" not "Rob", as it was Jeff's post you were referring to, not one of mine.



. . Firstly, my apologies to both Jeff and yourself; I don't know why I made that error.

. . Maybe I was being prescient about your eventual reply <joke>

. . And secondly, if that is so then I definitely do NOT understand how this time limit relationship works. Since expected time to completion is supposedly the criterion, to time out at 8 mins the estimated run time would have to be less than 45 or 50 seconds, and none of my tasks show an estimate anywhere near that low. Even Arecibo halflings have an estimate of well over 1 min, and those VLAR tasks, while in the CPU Q, were showing an estimate of over 7 mins. Since I only re-schedule for the weekly outage, I don't think there is much I can do to reduce the chance of this recurring. Those VLAR tasks had been in the CPU Q since last week, when I had an issue getting new work and opened up the CPU to downloads because I was getting NO GPU tasks. There were originally almost 100 of them, and I was letting them run down over many days, at about 60 mins each, to increase the average runtime in the stats. So if their ETA was 7 mins and they timed out at 8 mins, I would say the ratio is not 20:1 or even 10:1 but more like 1:1. Oh well ....

. . To confound my understanding even more, surely with an ETA of 7 mins they would have been more likely to time out in the CPU Q, where they were taking 60 mins, than in the GPU Q, where they timed out at 8 mins. So confusing.

Stephen

:(
ID: 1888081
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22202
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1888085 - Posted: 6 Sep 2017, 9:10:37 UTC

Your basic problem is that, having done a large amount of rescheduling, you have left the servers very confused, as is your local BOINC manager. Estimated run times are now far too short, and BOINC sees this, so it abandons these tasks because the actual run time is substantially greater than what the initial estimate allows.
You now have to reset the APR so that the estimated times are in line with the actual run times you experience. This can possibly be done by editing one of the many configuration files, but I can't say which one, and refuse to guess. The safest approach is to stop rescheduling for a good few days; gradually you will find that the APRs return to sane values. A detach and reattach (restarting BOINC between each operation), or a project reset, may also work, but both would probably trash a pile of tasks - though many are going to be trashed anyway, so they are worth a try.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1888085
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1888090 - Posted: 6 Sep 2017, 10:20:29 UTC - in response to Message 1888081.  

So your estimated CPU runtimes are around 7 minutes? Well, that's IF you were to run them on the CPU, which you didn't. When you assign them to the GPU, BOINC automatically reduces the runtime estimate again; that is why they exceeded the time estimates on the GPU. As far as I know you can't reset the APR; that value is saved somewhere on the SETI server and used to determine task assignments and even credit awards. You just have to run the tasks on your CPU so it pulls the APR back down. It would help if you were to enable hyper-threading and run more than one CPU task at a time; running more, slower instances on your CPUs would speed up the process dramatically. You should be able to run 3 or 4 CPU tasks without slowing down your GPUs. It would also help if you looked into that post where I explained how to download and run GPU tasks on your GPUs instead of CPU tasks. Destroying your CPU APR is common with large amounts of rescheduling.
ID: 1888090
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22202
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1888092 - Posted: 6 Sep 2017, 10:43:38 UTC

...and since the APR is used in the calculation of credit, it can have an adverse impact on that as well.

TBar:
Thanks for confirming what I thought about where the APR is stored. I was being a tad optimistic in hoping there would be a local copy that could be used to "resume normality".
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1888092
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1888095 - Posted: 6 Sep 2017, 11:06:56 UTC - in response to Message 1888092.  

You can't change the APR, but you can change the local rsc_fpops values to change the allowed runtime. I think moving the decimal point to the right will lengthen runtimes. The values are usually different for each <workunit>, which makes a simple find & replace on the values ineffective. So it's best just not to bother with it, and to try to avoid needing to - it becomes annoying very quickly.
ID: 1888095
Brent Norman
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1888096 - Posted: 6 Sep 2017, 11:12:36 UTC - in response to Message 1888095.  
Last modified: 6 Sep 2017, 11:17:34 UTC

You can do a search and replace: search for
<rsc_fpops_bound> and replace with <rsc_fpops_bound>1

Essentially that shifts the decimal place by one digit (roughly), increasing the runtime limit from 20x to roughly 200x.

EDIT: With my 750Ti's I found that sometimes that is not enough for Arecibo VLARs, so I used '9'. Going too high doesn't seem to have any adverse effect.

When rescheduling I try to hold back the VLARs (by putting them in a Waiting state) so as not to run into that, but if you get too many you 'just have to' move them.
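
For example, with BOINC fully shut down first, a Perl one-liner can make that edit in place (a sketch only - adjust the path to your client_state.xml for your install, and the .bak copy gives you a way back if it goes wrong):

perl -pi.bak -e 's|<rsc_fpops_bound>|<rsc_fpops_bound>1|g' client_state.xml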
ID: 1888096
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1888098 - Posted: 6 Sep 2017, 11:55:07 UTC - in response to Message 1888090.  

So your estimated CPU runtimes are around 7 minutes? Well, that's IF you were to run them on the CPU, which you didn't. When you assign them to the GPU, BOINC automatically reduces the runtime estimate again; that is why they exceeded the time estimates on the GPU. As far as I know you can't reset the APR; that value is saved somewhere on the SETI server and used to determine task assignments and even credit awards. You just have to run the tasks on your CPU so it pulls the APR back down. It would help if you were to enable hyper-threading and run more than one CPU task at a time; running more, slower instances on your CPUs would speed up the process dramatically. You should be able to run 3 or 4 CPU tasks without slowing down your GPUs. It would also help if you looked into that post where I explained how to download and run GPU tasks on your GPUs instead of CPU tasks. Destroying your CPU APR is common with large amounts of rescheduling.


. . Hi TBar,

. . Prior to Laurent's upgrade to the re-scheduler I was d/l'ing to the CPU and moving to the GPU, but I only did that for one or two weeks, and only on the Tuesday for the outage, so there was not such a large number of tasks moved. At other times I am not crunching on the CPU, as it is an i5 (no hyper-threading) and the CPU is busy enough with the 3 GPUs in there. Since the upgrade I can now d/l to the GPU and stash tasks behind the CPU pending the start of the outage; those tasks were assigned as GPU tasks to begin with. The exceptions, as I explained earlier, were these Arecibo VLARs that I ended up with when there was that server problem last week, where for half a day or so it was only sending work to my CPUs. I left those tasks on the CPU to do what you suggested and run them very slowly (it has taken 3 or 4 days to process 50 or 60 odd tasks, so they were running very slowly, especially as I have the last CPU core set to run at 50% to prevent over-commitment on the CPU). Yet despite that, the manager still seems to regard the estimated run time as very short.

. . I guess the run time cut-off is at least established now as 8 mins. Well, all I can do is refrain from d/l'ing any CPU tasks at all.

Stephen

<shrug>
ID: 1888098
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1888099 - Posted: 6 Sep 2017, 12:04:16 UTC - in response to Message 1888096.  

You can do a search and replace: search for
<rsc_fpops_bound> and replace with <rsc_fpops_bound>1
Essentially that shifts the decimal place by one digit (roughly), increasing the runtime limit from 20x to roughly 200x.


. . Hi Brent,

. . Is that search & replace on the client_state.xml file?

. . I am going to refrain from downloading any CPU tasks in future, but it may be worth remembering this in case of an emergency.

Stephen

.
ID: 1888099
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22202
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1888105 - Posted: 6 Sep 2017, 12:44:18 UTC

The setting of rsc_fpops_bound will not cure your problem of a very low CPU APR. Your APR will not recover unless you process CPU tasks on the CPU, so just allow tasks to come to both CPU and GPU. The Petri/TBar application does not need as much CPU support as the SoG application, so, given you have an i5, you can safely increase the CPU usage to 50%, doubling the CPU throughput while having no noticeable impact on the GPUs. Further, you could enable hyper-threading and get an even bigger improvement in CPU throughput, again with little or no impact on the GPUs. (I run hyper-threading on my Xeon, which is supporting three GTX 1080s, and that rig is currently the number three cruncher, running stock speeds on CPU & GPU, and no rescheduling.)
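
If you prefer a file to the Manager's computing preferences, the "use at most N% of the CPUs" setting can also go in global_prefs_override.xml in the BOINC data directory - a sketch, assuming a stock install (have the client re-read preferences afterwards, e.g. Options -> Read local prefs file in BOINC Manager):

<global_preferences>
   <max_ncpus_pct>50.0</max_ncpus_pct>
</global_preferences>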
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1888105
Brent Norman
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1888108 - Posted: 6 Sep 2017, 12:58:25 UTC - in response to Message 1888105.  

The setting of rsc_fpops_bound will not cure your problem of a very low CPU APR.
True, Rob, it's not a cure - more a way out of a "hole you've dug yourself", a way to get the tasks to process to completion.

Yes, Stephen, that is in the client_state file.
ID: 1888108
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1888111 - Posted: 6 Sep 2017, 13:29:26 UTC - in response to Message 1888105.  

The setting of rsc_fpops_bound will not cure your problem of a very low CPU APR. Your APR will not recover unless you process CPU tasks on the CPU, so just allow tasks to come to both CPU and GPU. The Petri/TBar application does not need as much CPU support as the SoG application, so, given you have an i5, you can safely increase the CPU usage to 50%, doubling the CPU throughput while having no noticeable impact on the GPUs. Further, you could enable hyper-threading and get an even bigger improvement in CPU throughput, again with little or no impact on the GPUs. (I run hyper-threading on my Xeon, which is supporting three GTX 1080s, and that rig is currently the number three cruncher, running stock speeds on CPU & GPU, and no rescheduling.)


. . Thanks again, Rob, for the input, but I have to repeat: the CPU is an i5-6600, so there is NO hyper-threading to turn on. Sadly that is not an option. So my 4 cores are running at 80%, supporting the 3 GPUs (2 x GTX 970s and 1 x GTX 1050). With -nobs and the performance level pushed to P0 they are doing well, but they do not leave much CPU resource untapped. With CPU crunching on, using just 50% of the CPU time, the CPU usage varies between 80 and 100% like the elbow of a fiddler on speed with a caffeine addiction. That is why I only crunch on the CPU during the build-up to the weekly outrage. Those VLAR tasks were taking nearly 2 hours to complete but registering only 1 hour, because the manager only counts active CPU time. If I turned off -nobs I could crunch on one CPU core full time, but that would drop productivity to a noticeable degree. To me it looks like a severe case of a no-win situation.

Stephen

:(
ID: 1888111
W3Perl
Volunteer tester
Joined: 29 Apr 99
Posts: 251
Credit: 3,696,783,867
RAC: 12,606
France
Message 1888123 - Posted: 6 Sep 2017, 14:05:19 UTC - in response to Message 1887979.  

I also had problems running the script in Windows: the first 2 times it was making ghost WUs; the 3rd time I made a backup of BOINC before I tried, and then it worked. So I tried again just now, suspended 50 WUs and ran the script:

cpu2gpu.pl -s
GPU version found : 821 710 - (class : opencl_nvidia_SoG cuda_opencl_100)
CPU version found : 800
AP version found : 703
Found GPU version used : 821 in D:/ProgramData/BOINC/client_state.xml (opencl_nvidia_SoG / windows_x86_64)
Found CPU Version used : 800 in D:/ProgramData/BOINC/client_state.xml
219 wu found
-----
Your current D:/ProgramData/BOINC/client_state.xml :
- CPU (MB8_win_x64_AVX_VS2010_r3330.exe) (800) : 50 wu
==> Converted wu from CPU to GPU : 50

But when I ran BOINC again, the 50 WUs disappeared into ghost state :(


Yes, a bug!
I had the same problem.... cpu2gpu.pl has been updated to fix it.
You can get it at the same location.
ID: 1888123
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22202
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1888139 - Posted: 6 Sep 2017, 16:24:46 UTC - in response to Message 1888111.  

Stephen,
Sorry if I missed what CPU you are using.
But you have a problem that you need to sort out: you can either stick a band-aid on it by playing with things in client_state.xml, or you can fix it on a more permanent basis by letting a good few CPU tasks run to completion on the CPU and, while doing so, stopping the rescheduling.

Think about it - if rescheduling were really such a good thing, Brent's computer, with its GPUs overclocked quite hard (10% I think) and running in P0, would be far more than 10k ahead of mine, which runs stock settings and P?? (as decided by the thermal load). Yes, I do lose a bit during the weekly outrage, but overall I'm about 5% behind him, and mine just sits there without any interference from me.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1888139
JohnDK
Volunteer tester
Joined: 28 May 00
Posts: 1222
Credit: 451,243,443
RAC: 1,127
Denmark
Message 1888146 - Posted: 6 Sep 2017, 17:14:29 UTC - in response to Message 1887980.  

When I try running the script in Mint, I get this error; what do I do?

./cpu2gpu.pl -r

bash: ./cpu2gpu.pl: /usr/bin/perl^M: bad interpreter: No such file or directory


. . It is a "perl" script and you need to have perl installed to run it.

Stephen

..

I did install Perl, using the instructions from the script: apt-get install libxml-simple-perl
ID: 1888146
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1888155 - Posted: 6 Sep 2017, 17:35:41 UTC - in response to Message 1888146.  
Last modified: 6 Sep 2017, 17:38:08 UTC

When I try running the script in Mint, I get this error; what do I do?

./cpu2gpu.pl -r

bash: ./cpu2gpu.pl: /usr/bin/perl^M: bad interpreter: No such file or directory


. . It is a "perl" script and you need to have perl installed to run it.

Stephen

..

I did install Perl, using the instructions from the script: apt-get install libxml-simple-perl


In the script ... /usr/bin/perl^M ... not found, or something like that.
Try typing
whereis perl <enter>
in a terminal window.

I guess it will find it OK. Then edit the ^M (Ctrl+M) away from the end of the script's first line. It is a DOS remnant - a Windows carriage return left over in the file.
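
If you don't want to hunt for it in an editor, a one-liner strips the carriage returns from the whole file (a sketch; 'dos2unix cpu2gpu.pl' does the same where it is installed):

perl -pi -e 's/\r$//' cpu2gpu.pl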
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1888155
W3Perl
Volunteer tester
Joined: 29 Apr 99
Posts: 251
Credit: 3,696,783,867
RAC: 12,606
France
Message 1888184 - Posted: 6 Sep 2017, 19:55:47 UTC - in response to Message 1888146.  



I did install Perl, using the instructions from the script: apt-get install libxml-simple-perl


'apt-get install libxml-simple-perl' installs the Perl XML library... not Perl itself!

On Linux, Perl is normally already installed (type 'whereis perl' to find where the perl binary is located),
so change the first line "#!/usr/bin/perl" to point at your perl binary: "#!/<mypath>/perl", where <mypath> is your perl path.
Or you can run the script with:
'perl cpu2gpu.pl -i'

http://www.w3perl.com/seti/
ID: 1888184
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1888203 - Posted: 6 Sep 2017, 21:15:37 UTC - in response to Message 1888146.  

When I try running the script in Mint, I get this error; what do I do?

./cpu2gpu.pl -r

bash: ./cpu2gpu.pl: /usr/bin/perl^M: bad interpreter: No such file or directory


. . It is a "perl" script and you need to have perl installed to run it.

Stephen

..

I did install Perl, using the instructions from the script: apt-get install libxml-simple-perl


. . Then maybe you need to change the access permissions on perl, something like

sudo chmod -R 777 /usr/bin/perl

. . And you could check that the perl interpreter itself has its execute permission set.

Stephen

..
ID: 1888203