GBT ('guppi') .vlar tasks will be sent to GPUs, what do you think about this?
Author | Message |
---|---|
petri33 Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
OK, I'll wait for that. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Yeah, resurrected the 780 on the Mac Pro while I snoozed. Running 2 up with baseline code looks like an hour-ish, so 28-30 mins each. Looks to me, if it continues that way, we might all have the tasks equalising on the VRAM speeds (different apps/cards, similar VRAM, similar tasks). @Mark, yeah, 3 might be saturating with these kinds of tasks (if settings don't help much). "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
If you choose to run 3, make sure to use command lines and include -use_sleep to try and help. I don't use it, as I have a large enough CPU and don't max it out. It runs at 75%. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Looks like my 980 host (better settings, but a slow CPU) does 2 up in ~34 mins while relatively idle (~17 mins each, effectively), longer when watching Twitch streams or YouTube videos on the second monitor. No particularly noticeable screen lag, which I'm surprised about. Firing the 680 Linux host back up to compare. Ignoring the credits for now, since they look scary, lol. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
petri33 Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
I've got a feeling .... The credit will turn up now, since all kinds of hosts, and especially the standard NV app, are doing Guppi too. That is a feeling, but I hope I'll wake up to a better world, if not tomorrow then in three weeks. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
ace_quaker Send message Joined: 12 Oct 01 Posts: 17 Credit: 33,678,474 RAC: 1 |
I came here to figure out why my quad 780 system's video response was so slow and power usage down so much. (Creditscrew down too, but with any change at all from this project that's a given.) But look! It's another poorly optimized application thrown into the wild WAY before it's ready! You should merge Seti@home and Seti@home beta, because there is obviously no functional separation of the two. |
tullio Send message Joined: 9 Apr 04 Posts: 8797 Credit: 2,930,782 RAC: 1 |
In SETI Beta I saw that guppi .vlars crunched with atiapu_SoG on my Linux box with an AMD HD 7770 take twice the time of those crunched with ati5_cal132 and get double the credits. Tullio |
Lionel Send message Joined: 25 Mar 00 Posts: 680 Credit: 563,640,304 RAC: 597 |
I'm seeing times around 1 hour for processing of guppi vlars. Running 2 per card across 2 cards (4 gpu guppies at a time). It seems to be similar regardless of card; 780 TI, 680 Classified, or 580. I have all the boxes set to 0.25 cpu for gpu wus so that I would end up with a single core feeding the gpus and the rest of the cores free for cpu wus. HT is turned off. It has worked well to date, but does not seem to be at the moment given some of the other comments around processing times. I was wondering if I set core usage to 0.50 cpu for gpu wus whether that might improve things a little by throwing a bit more freedom into the systems. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Both Raistmer's OpenCL SoG builds, and Petri's optimisations seem to be saying that throwing some CPU at the VLARs helps. Provided you reserve enough CPU cores, for the older generic baseline Cuda builds you can achieve something along those lines by setting the -poll command line option via app_info.xml. Still working out what's going on here, but yeah, seems the long pulsefinding on recent cards and generic builds is holding up to the changes, but will tend to make cards more equal because of its dependence on VRAM speed (as opposed to core). That just means different sets of optimisations/apps will prove better for different kinds of tasks. Complicates the short term figuring out phase, but probably will lead to more powerful applications. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13841 Credit: 208,696,464 RAC: 304 |
"I'm seeing times around 1 hour for processing of guppi vlars. Running 2 per card across 2 cards (4 gpu guppies at a time). It seems to be similar regardless of card; 780 TI, 680 Classified, or 580." My GTX 750Tis appear to be doing the Guppie VLARs in approx. 1hr 10min on 2 different systems with 2 different settings. i7 system, Win10 64bit: 2 WUs at a time, 1 CPU core per WU using the -poll option (general use system, 2 video cards). C2D system, Win Vista 32bit: 2 WUs at a time, processpriority=high (dedicated cruncher, and now only 1 video card). I've tried the various other mbcuda.cfg settings, but found that they had little if any effect. It's interesting looking at GPU-Z while the Guppie VLARs are running. As others have noted, power use drops off when running the VLARs. Interestingly, that coincides with 99% GPU load. When the load drops down occasionally to more normal 85-95% levels, power usage increases again. So during those extended periods of 99% GPU load, while the GPU is busy, it's not actually doing much work. Grant Darwin NT |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"I've tried the various other mbcuda.cfg settings, but found that they had little if any effect." Those pulsefinding settings relate to filling computational cores on large devices, so seeing no effect when the VRAM bus is saturated (the bottleneck) instead is as expected. You can probably confirm this by first lowering the core clock, which should have little effect, then putting it back and lowering the VRAM clock to see if the change is more significant. What's needed to relieve the memory subsystem load for the new kind of data is applications better suited to this search. (The full picture of which direction to head isn't clear yet, though likely a combination of supplied optimised code + some CPU use + configuration will result.) "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13841 Credit: 208,696,464 RAC: 304 |
"What's needed to relieve the memory subsystem load for the new data character, are applications better suited to this search. (full picture of which direction to head isn't clear yet, though likely a combination of supplied optimised code + some CPU use + configuration will result)" Why is it that the GPU's ability to process work is so heavily dependent on the CPU? Bus mastering and the like have been around for decades. I would have thought that the CUDA/OpenCL application would execute on the GPU. The only times the CPU would be required would be when starting a new WU (to get that WU and load it up, and then off it goes on the GPU) and once crunching has completed (to transfer the result file to the BOINC manager to upload and report it, and then get the next WU). While crunching, the GPU would need to send out progress reports (for the progress bar and estimated completion time), but I would have thought those could be asynchronous, with no need to wait for a response from the CPU. The only input from the CPU during crunching would be to monitor whether crunching was to pause (Snooze GPU), start another project to honor resource shares (ideally that should be able to wait till the WU is done, given the short runtimes for Seti work) or exit (BOINC shutting down or system shutting down). There would be the usual communication traffic between the CPU and GPU for display purposes, but why the significant need for CPU resources for something being executed on the GPU? Grant Darwin NT |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Primarily an increase over time (since ye old XPDM days) in latencies associated with feeding the GPUs (mostly due to memory model unification, i.e. virtualisation, unification and added complexity). These are the considerations driving DirectX12 and Vulkan (since AMD's Close to Metal and Mantle, bypassing bloated DirectX and OpenGL, became a thing). Add to that, you have more or less static CPU/host performance over the last 5 years, and computational capacity in the GPUs increasing by orders of magnitude .... So what was once very little CPU now looks like a lot :D A CPU portion (reducing results) that cannot be so easily parallelised remains, while the GPU portion shrinks. Shifting to new (reduced) synchronisation techniques is the next stage, and Raistmer's SoG builds give some hints as to what can help there. Here, though, we start to move into territory there aren't any books for (yet :) ) "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Bruce Send message Joined: 15 Mar 02 Posts: 123 Credit: 124,955,234 RAC: 11 |
Looking over my current valid guppi vlars, and this is just a rough eyeball estimate, the run times seem to be about 555 seconds (9 minutes 15 seconds). Some are slower, some are faster. Just a rough average guesstimate. I run a single WU at a time with the 3430 SoG app and an optimized tune. Feel free to check out my results: My v8 Tasks. Bruce |
Bill G Send message Joined: 1 Jun 01 Posts: 1282 Credit: 187,688,550 RAC: 182 |
"It's interesting looking at GPU-Z while the Guppie VLARs are running. As others have noted, power use drops off when running the VLARs. Interestingly, that coincides with 99% GPU load. When the load drops down occasionally to more normal 85-95% levels, power usage increases again." I am running 3 per card on my 750Tis. I am seeing similar times to others, but what I see is that my GPU memory usage is constantly changing from 30 to 90% (in the past it was steady at about 85%). My GPU Meter on the desktop shows this, as does GPU-Z. I see mixed units running, both Guppie and non-Guppie. I was wondering if this is normal or has some meaning. The GPU usage shows 85-95% most of the time. SETI@home classic workunits 4,019 SETI@home classic CPU time 34,348 hours |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
"In SETI Beta I saw that guppi .vlars crunched with atiapu_SoG on my Linux box with an AMD HD 7770 take twice the time of those crunched with ati5_cal132 and get the double of credits." And the link to your report about that on beta is?.... If someone thinks that we have the manpower to look at ALL results... well, to put it very politely, you are wrong. Beta testers (as opposed to beta hosts) are assumed to be sentient beings actively participating in the project, not just mentioning peculiarities "by the way" in a chat thread. To make it plain: in the PulseFind area, SoG performance should be exactly equal to non-SoG performance. Taking into account that VLAR is almost all PulseFind, SoG and non-SoG times should be close. Anything else is worth reporting. That's what beta is about. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Yep, this once more reminds us to think about what a particular metric tells us in reality, not what one thinks it should tell. Just as with RAC as a host performance metric, the GPU load metric is only loosely related to processing performance. Higher load just means more time spent in kernel processing. Nothing more. What the kernel does, and how effectively it does it, is not reflected in the GPU usage metric. So GPU usage relates to performance only "all other conditions being equal". VLAR tasks imply very long kernels in PulseFind. That's the reason the sluggishness occurs. So, naturally, it results in a higher "busy with kernel"/"idle" time ratio. I suspect even how many CUs are loaded is not reflected in that metric... |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
In very short: because there are parts of the control code that must be executed serially, and because currently they are executed on the CPU. One should realize that a GPU is a distinctly different device from a CPU. Most modern CPUs have no more than 32(?) threads. Even a low-end GPU has thousands of threads. Go figure. All these "it should" and "it shouldn't" make me boil ... :P And then, when CPU<=>GPU communication really is decreased, "my BOINC doesn't tick" appears... |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"I suspect even how many CUs loaded not reflected in that metric..." Correct. For Cuda devices it's time slice occupation on the first core (SMX), so pretty misleading. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Depends on what this particular metric actually measures. If it measures uncached accesses to the device global memory domain, then it is roughly equal to "cache misses" in the CPU world. And then, apparently, it's a bad thing. If it measures just memory accesses w/o distinction between cache hits/misses... then it's a bad thing too, but for another reason, and quite a non-fixable one. It would mean that low or non-compute-intensive parts of the code prevail. In other words, too few arithmetic operations per memory load. Just the very thing that makes an app memory-constrained. Unlike Gaussian, where quite a high computation load per sample is implied, PulseFind is very similar to AstroPulse's algorithms. Maybe that's why AMD owners feel differently than NV ones about VLAR and AP: a better memory subsystem (in older generations at least). |
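
The point in the last post about "too few arithmetic operations" per memory load can be put into rough numbers with a roofline-style estimate. This is only a sketch: the roofline model is a general back-of-envelope technique, not something anyone in the thread used, and every figure below (peak GFLOPS, memory bandwidth, operations per byte) is hypothetical rather than measured from any SETI@home application or real card.

```python
# Roofline-style estimate: a kernel's attainable throughput is capped either by
# the compute peak of the device or by memory bandwidth times the kernel's
# arithmetic intensity (operations per byte moved), whichever is lower.
# All numbers used here are made up for illustration.


def attainable_gflops(intensity: float, peak_gflops: float, mem_bw_gbs: float) -> float:
    """Upper bound on throughput for a kernel with the given ops-per-byte."""
    return min(peak_gflops, intensity * mem_bw_gbs)


def is_memory_bound(intensity: float, peak_gflops: float, mem_bw_gbs: float) -> bool:
    """True when memory bandwidth, not the compute cores, sets the limit."""
    return intensity * mem_bw_gbs < peak_gflops


if __name__ == "__main__":
    # Two hypothetical cards: same VRAM bandwidth, very different compute peaks.
    cards = {"big card": (4000.0, 300.0), "small card": (2000.0, 300.0)}
    # Two hypothetical kernels: few ops per byte vs. many ops per byte.
    kernels = {"low-intensity": 0.5, "high-intensity": 20.0}

    for kernel_name, intensity in kernels.items():
        for card_name, (peak, bandwidth) in cards.items():
            bound = attainable_gflops(intensity, peak, bandwidth)
            limit = "memory" if is_memory_bound(intensity, peak, bandwidth) else "compute"
            print(f"{kernel_name:15s} on {card_name:10s}: {bound:7.1f} GFLOPS ({limit}-bound)")
```

With these made-up figures the low-intensity kernel is capped at the same rate on both cards, while the high-intensity kernel scales with the compute peak. That is the same pattern as the earlier observation that run times tend to equalise across cards with similar VRAM speed.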
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.