Setting up Linux to crunch CUDA90 and above for Windows users
Author | Message |
---|---|
Tom M Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
Stephen, I am sorry. I couldn't find the "app_config.xml" on this page or the previous page using the browser string search. Tom A proud member of the OFA (Old Farts Association). |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
. . My apologies Tom, that was my oblique way of saying I am not using app_config.xml. The "see above" comment referred to the NO on the line above. Stephen :( my bad |
Tom M Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
I am more than a little thick in the brain on occasion. You should not be apologizing for my inability to read between the "lions" (NY Library joke). Tom A proud member of the OFA (Old Farts Association). |
Tom M Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
I see someone else has joined the 2 GTX 1060 parade using CUDA91 (I think). https://setiathome.berkeley.edu/show_host_detail.php?hostid=8563839 And made it onto the 2nd page of the LeaderBoard. :) A proud member of the OFA (Old Farts Association). |
Juhani Karjanlahti Joined: 23 Jan 03 Posts: 15 Credit: 83,675,733 RAC: 149 |
I see someone else has joined the 2 GTX 1060 parade using CUDA91 (I think). Just doing my best with an old GTX 970 and a 1060 3GB. |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
. . Hi to everyone who reads this thread ... . . It had to happen, I would have died unsatisfied if I had not done it ... . . I disabled the SMT (Hyperthreading) on the Ryzen 7 (which for the time being is still running Windows and SoG) to see how much effect it was having on the output of the CPU. I have to confess I was gobsmacked with the results. I really expected the run times to reduce significantly (especially as I am only crunching 5 WUs at a time now instead of the 8 running at a time previously) but they only dropped by about 3 to 5 mins. They are now about 59 to 60 mins per task (x5) compared to 61 to 65 mins per task (x8). I could increase that to 6 at a time but the run times would only get longer. I am running 4 GPU tasks as well. This represents approximately a 25% reduction in productivity ... as I said, gobsmacked! Oh well, at least the GPU tasks will benefit and run a little quicker, right? Nah! The run times are very consistent and much the same as before, but if there is any change they are actually taking a few seconds longer (5 to 10 seconds in 19 min plus run times). . . The reason for the success with Hyperthreading I put down to crunching only on the physical threads and using the virtual Hyperthreads to support the GPU tasks and handle any overflow or other demands. I knew this would improve the efficiency of SMT mode but I had not expected it to be so dramatic. I am using 'Process Lasso' to manage the control of the thread assignments and it works very well. In that configuration the CPU was performing as well as if I was running without SMT and crunching on all 8 cores, while at the same time having the CPU support the 4 GPU tasks and everything else. Talk about smoke and mirrors, it's like magic! So now it's back to SMT mode ... :) Stephen . . And yes Grant, I know, you told me so :) :) |
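The comparison in this post can be checked with a little arithmetic. This is a minimal sketch, assuming that the midpoints of the quoted run-time ranges are representative and that CPU throughput is simply the number of concurrent tasks divided by the run time; the function name is made up for illustration.

```python
def tasks_per_hour(concurrent_tasks: int, minutes_per_task: float) -> float:
    """Steady-state CPU throughput when `concurrent_tasks` run side by side."""
    return concurrent_tasks * 60.0 / minutes_per_task


# Midpoints of the run times quoted in the post (an assumption, not measured data).
smt_off = tasks_per_hour(5, (59 + 60) / 2)  # SMT disabled: 5 tasks, 59 to 60 min each
smt_on = tasks_per_hour(8, (61 + 65) / 2)   # SMT enabled: 8 tasks, 61 to 65 min each

# Fraction of CPU throughput lost by turning SMT off.
reduction = 1.0 - smt_off / smt_on

print(f"SMT off: {smt_off:.2f} tasks/hour")
print(f"SMT on:  {smt_on:.2f} tasks/hour")
print(f"Reduction with SMT off: {reduction:.0%}")
```

With these midpoints the drop works out to roughly a third, a little more than the "approximately 25%" estimate; the exact figure depends on which run times are taken as typical.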
Tom M Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
I was poking around and noticed that one of our colleagues is running "-pfb 32". I thought that was dropped as "experimental" but if it hasn't, does it speed up the processing? Tom A proud member of the OFA (Old Farts Association). |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
That is the default in the app_info in the special app packages by TBar. Best for Pascal and for 1070 and greater. I think you were thinking of the -pfp parameter that was 'experimental'. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Stephen, if you had bothered to read just a little bit about AMD architecture, you would have known this from the beginning. This was well known as far back as Bulldozer. It continues in the present architecture, but is not as obvious since the module architecture changed considerably with Zen. If you give the core's FPU register exclusive access to the data, it processes it faster of course. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Tom M Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
That is the default in the app_info in the special app packages by TBar. Best for Pascal and for 1070 and greater. Maybe, but this task certainly has the command line I mentioned. Does it make a difference? https://setiathome.berkeley.edu/result.php?resultid=7108265328 Tom A proud member of the OFA (Old Farts Association). |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
I was poking around and noticed that one of our colleagues is running "-pfb 32". . . There is disagreement about whether any of those parameters have much, if any, effect on run times. TBar has run many benchmarking tests with and without those parameters and I believe he concluded they were of little or no benefit. I believe that the "experimental" parameters you are thinking about were -pfl nnn and -pfe. These, while not making significant improvements in run times, were associated with an increase in inconclusive results, so the command line switches were removed from the later versions. . . The -pfp command is no longer necessary because it was made superfluous by the -unroll autotune function which automagically sets the right value to match the CUs of the GPU it is assigned to. . . I think even Petri believes that -pfb 32 does not have a very great effect but I still like to use it myself. . . But the old adage applies, give it a try and see if you observe any measurable changes in run times. Stephen :) |
Tom M Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
I am wondering. Does setting the 1 CPU core/thread to 1 GPU card make a difference? For instance, if you left "it" at whatever the default is (0.017?) and used -nobs, you would get all the CPU support you wanted, and it might free up one CPU per GPU card for more CPU task processing? I think I remember some kind of discussion about "threads jumping around" that might or should or does slow down the GPU or CPU processing? Just wondering. :) Tom A proud member of the OFA (Old Farts Association). |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Try it and see. Any gpu app will use as much cpu resource as it needs. The setting for reserving a cpu core is just for scheduling. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
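For anyone who wants to try it, the per-task CPU fraction being discussed is a scheduling hint that can be set in BOINC's app_config.xml through the `gpu_usage` and `cpu_usage` elements. The sketch below only builds such a file as text; the app name `setiathome_v8` and the values of 1.0 are assumptions chosen for illustration, and the helper name `build_app_config` is made up.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, gpu_usage: float, cpu_usage: float) -> str:
    """Return app_config.xml text containing a single <app> entry."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    # Fraction of a GPU each task is budgeted (1.0 = one task per GPU).
    ET.SubElement(gpu_versions, "gpu_usage").text = str(gpu_usage)
    # Fraction of a CPU core budgeted per GPU task; a scheduling hint, not a cap.
    ET.SubElement(gpu_versions, "cpu_usage").text = str(cpu_usage)
    if hasattr(ET, "indent"):  # pretty-printing is only available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Assumed app name and values; adjust to the application actually installed.
config_text = build_app_config("setiathome_v8", gpu_usage=1.0, cpu_usage=1.0)
print(config_text)
```

The printed text would go into a file named app_config.xml in the project's directory, after which the client has to re-read its config files for the change to take effect.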
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
I am wondering. Does setting the 1 CPU core/thread to 1 GPU card make a difference? . . As someone said before, that number is not a limitation on the CPU support for the GPU task but simply a target. When the GPU app is running it will ask for as much CPU support as it needs, and that is determined by the settings for the GPU app. If you are running special sauce with the -nobs option the app will ask for 100% of a CPU core's resources and sometimes more. Changing that value is purely arbitrary, but I have found the observable effect is that if your aggregate of CPU portions assigned to GPU tasks is greater than one, BOINC will reserve an extra CPU core and the running CPU tasks will reduce by 1, so if you have set it to use 6 cores it will drop back to 5. Running 2 GPUs with -nobs it is best to allow 1 CPU for each GPU and one for the pot. I have seen some people run 2 CPU cores per GPU task but I am not sure that would be all that effective nor an efficient use of resources, but you would need to talk to one of them about their results. . . As for the "threads jumping" thing, I am not a programmer but I think that was a discussion about limiting the CPU crunching to the "physical threads" and allowing the GPU to call on support only from the virtual cores. That is a very sound strategy and works well in both Windows and Linux. The jumping part is because (I believe) when the GPU app releases a CPU when a function or call is complete it can be re-assigned to another app that is running, so you can try locking threads (too complicated for me) or doing the above. Stephen . . |
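On Linux, the "locking threads" idea can be sketched with the standard CPU-affinity calls. This is a rough illustration and not how any of the apps in this thread do it: it reads the kernel's `thread_siblings_list` files, treats the lowest-numbered sibling of each core as the "physical" thread (a convention assumed here, since the two hardware threads of a core are equivalent), and restricts a process to one group. All function names are made up for the example.

```python
import os
from typing import Dict, List, Set, Tuple


def parse_cpu_list(text: str) -> List[int]:
    """Parse a sysfs CPU list such as '0,8' or '0-1' into sorted CPU numbers."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        elif part:
            cpus.add(int(part))
    return sorted(cpus)


def split_smt_threads(sibling_lists: Dict[int, str]) -> Tuple[Set[int], Set[int]]:
    """Split logical CPUs into (first thread of each core, remaining SMT siblings)."""
    primary: Set[int] = set()
    secondary: Set[int] = set()
    for cpu, text in sibling_lists.items():
        siblings = parse_cpu_list(text)
        (primary if cpu == siblings[0] else secondary).add(cpu)
    return primary, secondary


def read_sibling_lists() -> Dict[int, str]:
    """Read thread_siblings_list for each logical CPU this process may use (Linux only)."""
    result: Dict[int, str] = {}
    for cpu in os.sched_getaffinity(0):
        path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
        with open(path) as handle:
            result[cpu] = handle.read()
    return result


def pin_process(pid: int, cpus: Set[int]) -> None:
    """Restrict process `pid` to the given logical CPUs (Linux only)."""
    os.sched_setaffinity(pid, cpus)


# Example with hypothetical PIDs:
#   primary, secondary = split_smt_threads(read_sibling_lists())
#   pin_process(cpu_task_pid, primary)    # CPU crunching on one thread per core
#   pin_process(gpu_task_pid, secondary)  # GPU support on the sibling threads
```

Because new tasks start as new processes, the pinning would have to be reapplied as tasks come and go, which is the job a tool like Process Lasso does on Windows.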
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
. . Just to keep the thread alive... Stephen . . |
zoom3+1=4 Joined: 30 Nov 03 Posts: 65738 Credit: 55,293,173 RAC: 49 |
It's too bad CUDA 90 or whatever can't be used by Windows users, I'm really comfortable with Windows 7. The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
You'll just have to persuade the Windows app developer to make a modern CUDA app. Maybe say 'pretty please'? Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
rob smith Joined: 7 Mar 03 Posts: 22190 Credit: 416,307,556 RAC: 380 |
It's not only a case of getting a Windows CUDA developer to produce a new version of the "current" application - that's easy, and wouldn't yield much of a performance gain. The big hurdle is that Windows effectively blocks the synchronization process that Petri used in developing the ultra-performing Linux application. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
It's too bad CUDA 90 or whatever can't be used by Windows users, I'm really comfortable with Windows 7. . . I would have to say that a version that runs under Win7 would be a very nice thing :) Stephen :) |
Tom M Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
It's not only a case of getting a Windows CUDA developer to produce a new version of the "current" application - that's easy, and wouldn't yield much of a performance gain. The big hurdle is that Windows effectively blocks the synchronization process that Petri used in developing the ultra-performing Linux application. Not being much of a programmer myself, I have to wonder if we need to start with some kind of library that replaces or adds on to the "stuff" where that synchronization code lives. Yes, I know all that is way over my head (not even ducking, it's that high) but I have heard a comment that in "open source" development, if you need something specific that is not already done, you clone and create a modified library to do it. Yes, I am aware that M$ libraries are unlikely to be clonable. :( Much less have the source code available. So I expect we need someone who does enough assembler coding to be aware of what can be hand optimized as well as depending on an optimizing compiler for the rest of the speed up. And an investigation into speeding up the FFT (Fast Fourier Transform?) library. Tom A proud member of the OFA (Old Farts Association). |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.