Setting up Linux to crunch CUDA90 and above for Windows users

Tom M Volunteer tester Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462	Message 1960951 - Posted: 19 Oct 2018, 15:49:54 UTC - in response to Message 1960915. Are we using command line parameters on any of the cpu apps? NO What does your "app_config.xml" file look like? See above Are you doing any "tuning" to your motherboard? NO Are you doing any "tuning" to your gpu/cpu? Just the ones amply covered by Petri and TBar before the B1/B2 releases; -pfl 256 and -pfe which are now discarded in the aforementioned releases and -nobs. Stephen . . Stephen, I am sorry. I couldn't find the "app_config.xml" on this page or the previous page using the browser string search. Tom A proud member of the OFA (Old Farts Association). ID: 1960951 · Reply Quote

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1961046 - Posted: 20 Oct 2018, 0:41:18 UTC - in response to Message 1960951. Are we using command line parameters on any of the cpu apps? NO What does your "app_config.xml" file look like? See above Are you doing any "tuning" to your motherboard? NO Are you doing any "tuning" to your gpu/cpu? Just the ones amply covered by Petri and TBar before the B1/B2 releases; -pfl 256 and -pfe which are now discarded in the aforementioned releases and -nobs. Stephen . . Stephen, I am sorry. I couldn't find the "app_config.xml" on this page or the previous page using the browser string search. Tom . . My apologies Tom, that was my oblique way of saying I am not using app_config.xml. The see above comment referred to the NO on the line above. Stephen :( my bad ID: 1961046 · Reply Quote

Tom M Volunteer tester Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462	Message 1961088 - Posted: 20 Oct 2018, 5:43:23 UTC - in response to Message 1961046. Are we using command line parameters on any of the cpu apps? NO What does your "app_config.xml" file look like? See above Are you doing any "tuning" to your motherboard? NO Are you doing any "tuning" to your gpu/cpu? Just the ones amply covered by Petri and TBar before the B1/B2 releases; -pfl 256 and -pfe which are now discarded in the aforementioned releases and -nobs. Stephen . . Stephen, I am sorry. I couldn't find the "app_config.xml" on this page or the previous page using the browser string search. Tom . . My apologies Tom, that was my oblique way of saying I am not using app_config.xml. The see above comment referred to the NO on the line above. Stephen :( my bad I am more than a little thick in the brain on occasion. You should not be apologizing for my inability to read between the "lions" (NY Library joke). Tom A proud member of the OFA (Old Farts Association). ID: 1961088 · Reply Quote

Tom M Volunteer tester Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462	Message 1961094 - Posted: 20 Oct 2018, 6:11:12 UTC I see someone else has joined the 2 GTX 1060 parade using CUDA91 (I think). https://setiathome.berkeley.edu/show_host_detail.php?hostid=8563839 And made it onto the 2nd page of the LeaderBoard. :) A proud member of the OFA (Old Farts Association). ID: 1961094 · Reply Quote

Juhani Karjanlahti Volunteer tester Send message Joined: 23 Jan 03 Posts: 15 Credit: 83,675,733 RAC: 149	Message 1961096 - Posted: 20 Oct 2018, 7:06:14 UTC - in response to Message 1961094. I see someone else has joined the 2 GTX 1060 parade using CUDA91 (I think). https://setiathome.berkeley.edu/show_host_detail.php?hostid=8563839 And made it onto the 2nd page of the LeaderBoard. :) Just doing my best with an old GTX 970 and a 1060 3GB. ID: 1961096 · Reply Quote

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1962342 - Posted: 29 Oct 2018, 0:47:14 UTC . . Hi to everyone who reads this thread ... . . It had to happen, I would have died unsatisfied if I had not done it ... . . I disabled the SMT (Hyperthreading) on the Ryzen 7 (which for the time being is still running Windows and SoG) to see how much effect it was having on the output of the CPU. I have to confess I was gobsmacked with the results. I really expected the run times to reduce significantly (especially as I am only crunching 5 WUs at a time now instead of the 8 running at a time previously) but they only dropped by about 3 to 5 mins. They are now about 59 to 60 mins per task (x5) compared to 61 to 65 mins per task (x8). I could increase that to 6 at a time but the run times would only get longer. I am running 4xGPU tasks as well. This represents approx a 25% reduction in productivity ... as I said, gobsmacked! Oh well, at least the GPU tasks will benefit and run a little quicker, right? Nah! The run times are very consistent and much the same as before, but if there is any change they are actually taking a few seconds longer (5 to 10 seconds in 19 min plus run times). . . The reason for the success with Hyperthreading I put down to crunching only on the physical threads and using the virtual Hyperthreads to support the GPU tasks and handle any overflow or other demands. I knew this would improve the efficiency of SMT mode but I had not expected it to be so dramatic. I am using 'Process Lasso' to manage the control of the thread assignments and it works very well. In that configuration the CPU was performing as well as if I was running without SMT and crunching on all 8 cores, yet at the same time the CPU was supporting the 4 GPU tasks and everything else. Talk about smoke and mirrors, it's like magic! So now it's back to SMT mode ... :) Stephen . . 
And yes Grant, I know, you told me so :) :) ID: 1962342 · Reply Quote
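
For anyone who wants to redo the sums from the post above, here is a rough Python sketch of the throughput comparison. It only uses the figures quoted in the post (8 tasks at 61 to 65 minutes with SMT on, 5 tasks at 59 to 60 minutes with SMT off); taking the midpoint of each range is an assumption made here, and the helper name `tasks_per_hour` is purely illustrative.

```python
def tasks_per_hour(concurrent_tasks: int, minutes_per_task: float) -> float:
    """CPU tasks completed per hour when `concurrent_tasks` run side by side."""
    return concurrent_tasks * 60.0 / minutes_per_task


# Midpoints of the run-time ranges quoted in the post (an assumption).
smt_on = tasks_per_hour(8, 63.0)    # 8 at a time, 61 to 65 min each
smt_off = tasks_per_hour(5, 59.5)   # 5 at a time, 59 to 60 min each

# Fraction of CPU output lost by switching SMT off.
drop = 1.0 - smt_off / smt_on

print(f"SMT on : {smt_on:.2f} tasks/hour")
print(f"SMT off: {smt_off:.2f} tasks/hour")
print(f"Reduction with SMT off: {drop:.0%}")
```

With these midpoints the reduction comes out nearer a third than a quarter; the exact figure moves with whichever run times you plug in, so treat it as a ballpark only.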

Tom M Volunteer tester Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462	Message 1963587 - Posted: 6 Nov 2018, 6:54:48 UTC I was poking around and noticed that one of our colleagues is running "-pfb 32". I thought that was dropped as "experimental" but if it hasn't, does it speed up the processing? Tom A proud member of the OFA (Old Farts Association). ID: 1963587 · Reply Quote

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1963589 - Posted: 6 Nov 2018, 7:11:35 UTC - in response to Message 1963587. That is the default in the app_info in the special app packages by TBar. Best for Pascal and for 1070 and greater. I think you were thinking of the -pfp parameter that was 'experimental' Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1963589 · Reply Quote

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1963590 - Posted: 6 Nov 2018, 7:18:00 UTC Stephen, if you had bothered to read just a little bit about AMD architecture, you would have known this from the beginning. This was well known as far back as Bulldozer. It continues in the present architecture but is not as obvious, since the module architecture changed considerably with Zen. If you give the core's FPU register exclusive access to the data, it processes it faster of course. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1963590 · Reply Quote

Tom M Volunteer tester Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462	Message 1963592 - Posted: 6 Nov 2018, 7:23:59 UTC - in response to Message 1963589. That is the default in the app_info in the special app packages by TBar. Best for Pascal and for 1070 and greater. I think you were thinking of the -pfp parameter that was 'experimental' Maybe, but this task certainly has the command line I mentioned. Does it make a difference? https://setiathome.berkeley.edu/result.php?resultid=7108265328 Tom A proud member of the OFA (Old Farts Association). ID: 1963592 · Reply Quote

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1963607 - Posted: 6 Nov 2018, 11:51:03 UTC - in response to Message 1963587. Last modified: 6 Nov 2018, 12:09:09 UTC I was poking around and noticed that one of our colleagues is running "-pfb 32". I thought that was dropped as "experimental" but if it hasn't, does it speed up the processing? Tom . . There is disagreement about whether any of those parameters have any or much effect on run times. TBar has run many benchmarking tests with and without those parameters and I believe he concluded they were of little or no benefit. I believe that the "experimental" parameters you are thinking about were -pfl nnn and -pfe, which, while not making significant improvements in run times, were associated with an increase in inconclusive results, so the command line switches were removed from the later versions. . . The -pfp command is no longer necessary because it was made superfluous by the -unroll autotune function, which automagically sets the right value to match the CUs of the GPU it is assigned to. . . I think even Petri believes that -pfb 32 does not have a very great effect but I still like to use it myself. . . But the old adage applies, give it a try and see if you observe any measurable changes in run times. Stephen :) ID: 1963607 · Reply Quote

Tom M Volunteer tester Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462	Message 1964156 - Posted: 9 Nov 2018, 16:33:00 UTC I am wondering. Does setting 1 cpu core/thread to 1 gpu card make a difference? For instance, if you left "it" at whatever the default is (0.017?) and used -nobs, you would get all the CPU support you wanted and it might free up one CPU per gpu card for more CPU task processing? I think I remember some kind of discussion about "threads jumping around" that might or should or does slow down the gpu or cpu processing? Just wondering. :) Tom A proud member of the OFA (Old Farts Association). ID: 1964156 · Reply Quote

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1964163 - Posted: 9 Nov 2018, 17:03:19 UTC - in response to Message 1964156. Try it and see. Any gpu app will use as much cpu resource as it needs. The setting for reserving a cpu core is just for scheduling. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1964163 · Reply Quote
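
For reference, the per-GPU-task CPU reservation being discussed is the sort of value normally set in BOINC's "app_config.xml" (the `<cpu_usage>` and `<gpu_usage>` elements under `<gpu_versions>`). Below is a minimal Python sketch that writes such a fragment. The app name "setiathome_v8" and the 1.0/1.0 values are assumptions for illustration only, not anyone's actual settings from this thread, and `build_app_config` is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, gpu_usage: float, cpu_usage: float) -> str:
    """Return an app_config.xml document as text.

    gpu_usage: fraction of a GPU each task is budgeted (1.0 = one task per GPU).
    cpu_usage: fraction of a CPU core BOINC budgets per GPU task. As noted in
               the thread, this only affects scheduling; the app itself still
               takes whatever CPU time it needs.
    """
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu_versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(gpu_versions, "cpu_usage").text = str(cpu_usage)
    return ET.tostring(root, encoding="unicode")


# Hypothetical values: one task per GPU, one full core budgeted per GPU task.
xml_text = build_app_config("setiathome_v8", 1.0, 1.0)
print(xml_text)
```

The output would go in the project's directory as app_config.xml, followed by re-reading the config files from the BOINC Manager. Try it and compare run times before trusting any particular number.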

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1964230 - Posted: 9 Nov 2018, 23:04:33 UTC - in response to Message 1964156. I am wondering. Does setting 1 cpu core/thread to 1 gpu card make a difference? For instance, if you left "it" at whatever the default is (0.017?) and used -nobs, you would get all the CPU support you wanted and it might free up one CPU per gpu card for more CPU task processing? I think I remember some kind of discussion about "threads jumping around" that might or should or does slow down the gpu or cpu processing? Just wondering. :) Tom . . As someone said before, that number is not a limitation on the CPU support for the GPU task but simply a target. When the GPU app is running it will ask for as much CPU support as it needs, and that is determined by the settings for the GPU app. If you are running special sauce with the -nobs option the app will ask for 100% of a CPU core's resources and sometimes more. Changing that value is purely arbitrary but I have found the observable effect is that if your aggregate of CPU portions assigned to GPU tasks is greater than one, BOINC will reserve an extra CPU core and the number of running CPU tasks will reduce by 1, so if you have set it to use 6 cores it will drop back to 5. Running 2 GPUs with -nobs it is best to allow 1 CPU for each GPU and one for the pot. I have seen some people run 2 CPU cores per GPU task but I am not sure that would be all that effective nor an efficient use of resources, but you would need to talk to one of them about their results. . . As for the "threads jumping" thing I am not a programmer but I think that was a discussion about limiting the CPU crunching to the "physical threads" and allowing the GPU to call on support only from the virtual cores. That is a very sound strategy and works well in both Windows and Linux. 
The jumping part is because (I believe) when the GPU app releases a CPU when a function or call is complete it can be re-assigned to another app that is running, so you can try locking threads (too complicated for me) or doing the above. Stephen . . ID: 1964230 · Reply Quote
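
On Linux, the "locking threads" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: it relies on the kernel exposing SMT sibling lists at /sys/devices/system/cpu/cpuN/topology/thread_siblings_list, it treats the lowest-numbered thread of each core as the "physical" one (a convention chosen here, since the two threads of a core are really equivalent), and the helper names are made up. `os.sched_setaffinity` is a Linux-only call, so this is not a substitute for Process Lasso on Windows.

```python
import glob
import os
from typing import Iterable, List, Tuple


def parse_cpu_list(text: str) -> List[int]:
    """Expand a sysfs CPU list such as '0,8' or '0-1' into [0, 8] or [0, 1]."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def split_first_and_sibling(sibling_lists: Iterable[str]) -> Tuple[List[int], List[int]]:
    """Split logical CPUs into 'first thread of each core' and 'the rest'."""
    first, rest = set(), set()
    for text in sibling_lists:
        cpus = sorted(parse_cpu_list(text))
        if cpus:
            first.add(cpus[0])
            rest.update(cpus[1:])
    return sorted(first), sorted(rest)


def read_sibling_lists() -> List[str]:
    """Read every CPU's sibling list from sysfs (Linux only)."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    lists = []
    for path in glob.glob(pattern):
        with open(path) as handle:
            lists.append(handle.read())
    return lists


def pin_process(pid: int, cpus: List[int]) -> None:
    """Restrict a process to the given logical CPUs (Linux only)."""
    os.sched_setaffinity(pid, set(cpus))


# Example with made-up sibling lists for a 2-core, 4-thread CPU.
first_threads, sibling_threads = split_first_and_sibling(["0-1", "0-1", "2-3", "2-3"])
print(first_threads, sibling_threads)
```

In practice you would feed `read_sibling_lists()` into `split_first_and_sibling`, then call `pin_process` with the first list for the CPU tasks and the second list for the GPU apps. Whether that actually helps is something to measure rather than assume.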

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1967000 - Posted: 25 Nov 2018, 15:18:19 UTC . . Just to keep the thread alive... Stephen . . ID: 1967000 · Reply Quote

zoom3+1=4 Volunteer tester Send message Joined: 30 Nov 03 Posts: 65738 Credit: 55,293,173 RAC: 49	Message 1967010 - Posted: 25 Nov 2018, 16:45:28 UTC It's too bad CUDA 90 or whatever can't be used by Windows users, I'm really comfortable with Windows 7. The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's ID: 1967010 · Reply Quote

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1967014 - Posted: 25 Nov 2018, 17:15:53 UTC - in response to Message 1967010. You'll just have to persuade the Windows app developer to make a modern CUDA app. Maybe say 'pretty please'? Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1967014 · Reply Quote

rob smith Volunteer moderator Volunteer tester Send message Joined: 7 Mar 03 Posts: 22190 Credit: 416,307,556 RAC: 380	Message 1967058 - Posted: 25 Nov 2018, 20:19:25 UTC It's not only a case of getting a Windows CUDA developer to produce a new version of the "current" application - that's easy, and wouldn't yield much of a performance gain. The big hurdle is that Windows effectively blocks the synchronization process that Petri used in developing the ultra-performing Linux application. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? ID: 1967058 · Reply Quote

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1967084 - Posted: 25 Nov 2018, 21:17:17 UTC - in response to Message 1967010. It's too bad CUDA 90 or whatever can't be used by Windows users, I'm really comfortable with Windows 7. . . I would have to say that a version that runs under Win7 would be a very nice thing :) Stephen :) ID: 1967084 · Reply Quote

Tom M Volunteer tester Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462	Message 1967092 - Posted: 25 Nov 2018, 21:35:52 UTC - in response to Message 1967058. It's not only a case of getting a Windows CUDA developer to produce a new version of the "current" application - that's easy, and wouldn't yield much of a performance gain. The big hurdle is that Windows effectively blocks the synchronization process that Petri used in developing the ultra-performing Linux application. Not being much of a programmer myself, I have to wonder if we need to start with some kind of library that replaces or adds on to the "stuff" where that synchronization code lives. Yes I know all that is way over my head (not even ducking, it's that high) but I have heard a comment that in "open source" development, if you need something specific that is not already done, you clone and create a modified library to do it. Yes I am aware that M$ libraries are unlikely to be clonable. :( Much less have the source code available. So I expect we need someone who does enough assembler coding to be aware of what can be hand optimized, as well as depending on an optimizing compiler for the rest of the speed up. And an investigation into speeding up the FFT (Fast Fourier Transform?) library. Tom A proud member of the OFA (Old Farts Association). ID: 1967092 · Reply Quote
