OpenCL NV MultiBeam v8 SoG edition for Windows
Joined: 27 May 99 · Posts: 5516 · Credit: 528,817,460 · RAC: 242
Was observing the SoG tasks and noticed they start off fast, but when they hit 70% complete they start to crawl, dragging out the completion. Example: 6 minutes into computation they're 70% done, then they slow down and take another 6 minutes to finish that last 30%. I'm wondering if this is the fix Raistmer came up with to prevent the 100% CPU usage when nearing the completion of the work unit.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14474 · Credit: 200,643,578 · RAC: 874
Which precise SoG version was this, please? If it was for an NV card, please cite the r number and I'll check the actual <frac_done> values with my modded client - it was a progress reporting problem last time.
Joined: 27 May 99 · Posts: 5516 · Credit: 528,817,460 · RAC: 242
I'm using the last one Raistmer had on his dropbox: Mb8_win_x86_SSE3_OpenCl_NV_r3430_SoG.7z
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14474 · Credit: 200,643,578 · RAC: 874
OK, got it, thanks. Did you happen to notice any task details - what AR, etc.? Hooked it up to Beta and got guppi_VLAR as the first fetch. It does seem to be an issue related to the one we were discussing at Beta - but at VLAR, perhaps not so drastic.

wu_name: blc3_2bit_guppi_57451_20612_HIP62472_0007.12525.831.18.21.146.vlar
WU true angle range is: 0.008175

| Time | <prog> | <fraction_done> |
|---|---|---|
| 17:13:25 | | |
| 17:14:25 | 0.01748931 | 0.043302 |
| 17:15:25 | 0.04244621 | 0.098808 |
| 17:16:26 | 0.06566018 | 0.159925 |
| 17:17:27 | 0.09050095 | 0.215225 |
| 17:18:32 | 0.11596886 | 0.277116 |
| 17:19:36 | 0.14290232 | 0.341525 |
| 17:20:36 | 0.16774924 | 0.396144 |
| 17:21:43 | 0.18374324 | 0.442169 |
| 17:22:50 | 0.20764196 | 0.497730 |
| 17:23:56 | 0.23405194 | 0.559920 |
| 17:24:59 | 0.26072407 | 0.622799 |
| 17:26:03 | 0.28700744 | 0.684736 |
| 17:27:04 | 0.32917878 | 0.704677 |
| 17:28:05 | 0.37018072 | 0.726249 |
| 17:29:06 | 0.41237318 | 0.745490 |
| 17:30:07 | 0.45388002 | 0.766894 |
| 17:31:08 | 0.49561945 | 0.785576 |
| 17:32:08 | 0.53885943 | 0.807092 |
| 17:33:08 | 0.58042432 | 0.826390 |
| 17:34:10 | 0.62295983 | 0.845603 |
| 17:35:10 | 0.66387519 | 0.866467 |
| 17:36:11 | 0.70599017 | 0.885643 |
| 17:37:12 | 0.74688920 | 0.906469 |

The SoG application keeps two different counters for how far it's got: "Progress" and "Fraction done". We're seeing fraction done in the progress column in BOINC Manager. You'd expect them to be the same, but clearly either fraction done is counting up too fast in the early stages, or progress is counting more slowly (which is what we're used to from the CUDA apps, which - in reporting terms - start slow and speed up). I don't think there's any definite answer for which is right - perhaps we need to arrange another developer brainstorming party, if that desert island is still free.
Joined: 27 May 99 · Posts: 5516 · Credit: 528,817,460 · RAC: 242
Looking at all the SoG tasks I've run this a.m., they are all 0.008175. I'll look further back to last night.

Edit: If it is just the progress we see versus the fraction done, then I guess there is nothing we can do. I do like that they don't suddenly jump up to 100% of a CPU like they were doing. Still, 13 minutes on the GPU is much better than 55 minutes on the CPU.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14474 · Credit: 200,643,578 · RAC: 874
Now I'm back upstairs, I can see that SoG is still using a heck of a lot of CPU: the current display of "CPU efficiency" in BoincView is 0.9788, or 97.88% of a core - pretty good for a CPU application, but compare with ~22% for GPUGrid cuda65, ~11% for SETI cuda50, or <2% for Einstein intel_gpu tasks.

Note that CPU efficiency is (some form of) direct measurement: I think BoincTasks has something similar, though I don't know exactly how either of them works. But they're certainly more realistic than BOINC Manager's echo back of whatever value is written in app_info.xml or estimated by the server.
Joined: 27 May 99 · Posts: 5516 · Credit: 528,817,460 · RAC: 242
Yes, I use BoincTasks as well, and they all use 97% of a core each. It used to be that running more than one per card reduced that amount, but I just finished testing 2 at a time on the cards and there is no difference. About to try 3 at a time and will see if that reduces the time to complete or the % of CPU. If neither of those happens, then I may run r3366 again to see how it compares.
Joined: 27 May 99 · Posts: 5516 · Credit: 528,817,460 · RAC: 242
So 3 per card generates 36-37 minutes, about 1 minute faster than running a single guppi by itself on the GPU. CPU utilization is down to anywhere from 80-92% of a CPU rather than 97%. Only problem is, this requires a large number of CPU cores to make it happen.
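For anyone who wants to reproduce that 3-per-card setup, a minimal app_config.xml sketch would look like the following. This assumes the stock setiathome_v8 application name - check client_state.xml or your app_info.xml for the exact name on your install:

```xml
<app_config>
  <app>
    <name>setiathome_v8</name>        <!-- assumed MB v8 app name; adjust to match your install -->
    <gpu_versions>
      <gpu_usage>0.33</gpu_usage>     <!-- three tasks per GPU -->
      <cpu_usage>1.0</cpu_usage>      <!-- reserve a full core per task, matching the ~80-97% observed -->
    </gpu_versions>
  </app>
</app_config>
```

With <cpu_usage> at 1.0 the client budgets one free core for every running SoG task, which is exactly why a multi-GPU, multi-instance setup eats CPU cores so quickly.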
kittyman · Joined: 9 Jul 00 · Posts: 51445 · Credit: 1,018,363,574 · RAC: 1,004
> So 3 per card generates 36-37 minutes, about 1 minute faster than running a single guppi by itself on the GPU

Not looking very good, is it? I gotz GPU power, but CPU power is a limitation. This may be a hard nut to crack for the optimization crew.

"Freedom is just Chaos, with better lighting." - Alan Dean Foster
Joined: 27 May 99 · Posts: 5516 · Credit: 528,817,460 · RAC: 242
Kinda, but in the long run it is still faster than the CPU version. Regarding guppi: 1 work unit on the CPU is 55 minutes vs 14 minutes on the GPU. Where the issue arises is in comparing how fast non-VLARs are crunched, especially with multiple instances. I would think a separate plan class would be needed in the app_config if one planned on running the guppis along with non-VLARs on a low-core system, and the configurations would have to be worked out. Rough idea would be along these lines:

- 1 GPU on a dual core: single instance only
- 1 GPU on a 4-core CPU: probably not a problem
- 2 GPUs on a 4-core: OK as long as not running multiple instances
- 2 GPUs on an 8-core: probably OK
- 3 GPUs on an 8-core: might be manageable, but limited instances
- 3 GPUs on a 12-core: manageable
- 4 GPUs on a 12-core: wouldn't recommend it other than single or limited instances
- 4 GPUs on a 16-core: might be possible, but limited instances

I didn't throw in 6 cores, but they would sit between the 2- and 3-GPU setups. But here's the good news: each year we get better equipment that allows us to build upon these things. So who is to say what we can do in 1-2 years' time.
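On the plan-class point: app_config.xml also accepts per-plan-class <app_version> blocks, so if the guppi/VLAR work does end up under its own plan class it could be budgeted separately from the non-VLAR work. A sketch only - the plan class name below is a placeholder, use whatever appears in your app_info.xml or scheduler reply:

```xml
<app_config>
  <app_version>
    <app_name>setiathome_v8</app_name>          <!-- assumed MB v8 app name -->
    <plan_class>opencl_nvidia_SoG</plan_class>  <!-- placeholder plan class name -->
    <avg_ncpus>1.0</avg_ncpus>                  <!-- budget a full core for these tasks -->
    <ngpus>0.5</ngpus>                          <!-- e.g. two at a time per card -->
  </app_version>
</app_config>
```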
kittyman · Joined: 9 Jul 00 · Posts: 51445 · Credit: 1,018,363,574 · RAC: 1,004
I am running vintage equipment. And I cannot afford to upgrade. The kitty farm is what it is: 9 old rigs getting older by the day. Every day I wake up and don't find one crashed is a good day.

If some think that it is OK to spend CPU cycles to support a weak GPU app, I am afraid I cannot agree. The GPU apps should be better able to stand on their own with minimal CPU support. That is why they are GPU apps.

"Freedom is just Chaos, with better lighting." - Alan Dean Foster
Joined: 27 May 99 · Posts: 5516 · Credit: 528,817,460 · RAC: 242
It's the GBT data that is requiring such a large amount of CPU usage. On normal MB, the SoG build uses very little CPU time. Actually, it's faster than CUDA on non-VLAR MB. But its main purpose is those VLARs. So, to crunch guppi on GPU or not to crunch guppi on GPU, that is the question, lol...
Joined: 16 Jun 01 · Posts: 6324 · Credit: 106,370,077 · RAC: 121
Try with -use_sleep and increased PulseFind kernel sizes (-sbs 512, for example).
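For anyone unsure where those switches go: with these anonymous-platform builds one way to pass them is the <cmdline> element of the matching <app_version> block in app_info.xml. A sketch only - the version number and plan class below are placeholders, keep whatever your existing app_info.xml already has (some installs instead read the switches from a separate mb_cmdline*.txt file; if yours does, put them there):

```xml
<app_version>
  <app_name>setiathome_v8</app_name>            <!-- assumed MB v8 app name -->
  <version_num>800</version_num>                <!-- placeholder; keep your existing value -->
  <plan_class>opencl_nvidia_SoG</plan_class>    <!-- placeholder; keep your existing value -->
  <cmdline>-use_sleep -sbs 512</cmdline>        <!-- the switches suggested above -->
  <!-- existing <file_ref> entries stay as they are -->
</app_version>
```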
Joined: 27 May 99 · Posts: 5516 · Credit: 528,817,460 · RAC: 242
> So 3 per card generates 36-37 minutes, about 1 minute faster than running a single guppi by itself on the GPU

I need to correct this: I was using command lines supplied by Mike at that point. When I ran the same experiment without the command lines, the times to complete were 46-47 minutes. Going to try Raistmer's recommendations now...
Bruce · Joined: 15 Mar 02 · Posts: 123 · Credit: 124,955,234 · RAC: 11
I'm not seeing the same results that you are, but then I am using AMD CPUs and not Intel. The r3430_SoG seems to be running just like the r3401_SoG: the shorties run pretty quick and use very little CPU resources, but once you get past those and into the mid and lower ARs it takes a full core per WU. This seems to be the same problem that I have always had running APs. I am still using the same command lines that I used for r3401, so I may need to retune, but I don't expect any reduction in CPU usage. Just tried a quick test using the sleep switch and it does not seem to work for me - still used a full core per WU.

Bruce
Joined: 27 May 99 · Posts: 5516 · Credit: 528,817,460 · RAC: 242
Hey Bruce, sorry, I should have specified that I was talking about SoG for Nvidia; I don't know how they do for ATI. I'm restarting my test with -use_sleep and nothing else, and will watch its progress over the evening to see how it does.
Joined: 16 Jun 01 · Posts: 6324 · Credit: 106,370,077 · RAC: 121
-use_sleep can be used along with a full tuning line.
Joined: 27 May 99 · Posts: 5516 · Credit: 528,817,460 · RAC: 242
Was just testing them to see how they all combine. I've found that -use_sleep with -sbs 512, along with the command line Mike gave me, works the best if I use -use_sleep:

- 1 work unit per card: 16 minutes with 3-5% CPU usage
- 3 work units per card: 38 minutes average with 3% CPU, with -use_sleep, -sbs 512 and the command line

I'm going to try 1, 2, and 3 at a time per card again, but without -use_sleep. Tomorrow I will post results in a better format.

I've run into a problem this evening with over 12 errors; not sure why it occurred, as I was only using -use_sleep and nothing else. Maybe a bad batch of work, not sure. Will have to wait and see if wingmen also error out or if they complete the work units.

Edit: Just checked those that errored; some of the wingmen also errored out. Only ARM wingmen seemed to have completed them, but in very, very short times, so I have to doubt those results.
Joined: 27 May 99 · Posts: 5516 · Credit: 528,817,460 · RAC: 242
Looks like a bad batch of GBT on Beta... They start off normal, then CPU usage quickly goes all the way down to below 1% - and I'm not using -use_sleep on these, so they should be using close to 97% of a core each. I've seen this happen with all the ones that have errored out tonight. At first I thought it was the -use_sleep, but I removed it and restarted the machine, and the errors continue to happen; plus I can see some of my wingmen are erroring out too. Anyone else seeing these? I'm not currently crunching any GBT CPU work units on Main, so I don't know if the same thing is happening here...

Edit: Going to run out of work soon on Beta due to the high number of errors and the restricted number of work units downloaded.
Joined: 1 Apr 13 · Posts: 1849 · Credit: 268,616,081 · RAC: 1,349
> Looks like a bad batch of GBT on Beta...

On Beta, 10 of ~150 GPU WUs (both SAH & SoG) errored out with "ERROR: Possible wrong computation state on GPU, host needs reboot or maintenance"; the rest validated OK. Duplicated this on at least 2 of 4 machines crunching Beta. Also saw some errors like that on GBT work done on Main (CPU). Guess it's a GUPPI issue, not just GPU or CPU. Vanilla setup here, no special command line info going on.