Message boards :
Number crunching :
use all gpu setting in cc_config not needed anymore?
Tron | Joined: 16 Aug 09 | Posts: 180 | Credit: 2,250,468 | RAC: 0
Ubuntu 12.04: This is a new oddity that appeared with a kernel update, I just noticed.

opening log lines wrote:
Sun 11 Nov 2012 03:37:02 PM EST | | Starting BOINC client version 7.0.33 for x86_64-pc-linux-gnu

I deleted my cc_config.xml yesterday, since I had no need for the log flags or options at this time. I forgot about the <use_all_gpus> line. Well, when I restarted BOINC today, to my surprise, both GPUs were detected and in use :-) without a cc_config telling it to do so. Well, there is a config file present, but it has no options set, only log flags.
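For anyone reading along, a cc_config.xml of the kind being described might look roughly like this (a sketch using the standard BOINC option names, not the actual file from this host):

```xml
<cc_config>
  <log_flags>
    <!-- extra coprocessor detection logging -->
    <coproc_debug>1</coproc_debug>
  </log_flags>
  <options>
    <!-- the deleted line: tells BOINC to use every GPU,
         not just the most capable one -->
    <use_all_gpus>1</use_all_gpus>
  </options>
</cc_config>
```

The file lives in the BOINC data directory, and the client re-reads it on restart or on "Read config file".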
Claggy | Joined: 5 Jul 99 | Posts: 4654 | Credit: 47,537,079 | RAC: 4
Ubuntu 12.04 : You had coproc_debug enabled. Claggy |
Tron | Joined: 16 Aug 09 | Posts: 180 | Credit: 2,250,468 | RAC: 0
So is that why it was able to detect and use both GPUs without the <use_all_gpus> option set?

Edit: coproc_debug is set to try and figure out why about 5% of my GPU tasks end in error. The information it provides has not helped. Host with the issue: see the error page of the work database.

Edit 2: I tested system-wide memory for 48 hours, no errors. I've tried 5 different versions of Nvidia drivers. Cleaned and reseated the GPU, even cleaned the contacts. The power supply tests A-OK with 40 amps to spare at full load. It really bugs the heck out of me that it produces normal results 95% of the time, then seemingly randomly poops out an error or two, and lately a couple of invalids.
Claggy | Joined: 5 Jul 99 | Posts: 4654 | Credit: 47,537,079 | RAC: 4
So is that why it was able to detect and use both GPUs without the <use_all_gpus> option set?

No, you're suffering the Wacky Nvidia GPU Memory Reporting Bug:

Sun 11 Nov 2012 03:37:02 PM EST | | NVIDIA GPU 0: GeForce GTX 460 (driver version unknown, CUDA version 5.0, compute capability 2.1, 1024MB, 134214564MB available, 941 GFLOPS peak)

At the moment both GPUs are reporting a similar amount of available GPU memory, so to BOINC they are similar enough. That half of the Wacky Nvidia GPU memory reporting bug is fixed in BOINC 7.0.36 (the other half is fixed in BOINC 7.0.38/39).

Claggy
Tron | Joined: 16 Aug 09 | Posts: 180 | Credit: 2,250,468 | RAC: 0
So is there a Linux port of 7.0.36 yet?

Edit: found it; I just changed the 3 to a 6 in the link Ageless sent me for 7.0.33.
Claggy | Joined: 5 Jul 99 | Posts: 4654 | Credit: 47,537,079 | RAC: 4
So is there a Linux port of 7.0.36 yet?

There has been for almost the last two months: BOINC 7 Change Log and news

One thing to note about BOINC versions 7.0.32 and later: they supply a higher internal flops value for the GPU, and this higher value puts existing GPU tasks on the verge of going Maximum Time Exceeded. Any existing GPU tasks should therefore be completed prior to upgrading; new GPU tasks will get new, higher <rsc_fpops_est> and <rsc_fpops_bound> figures and won't be affected.

Claggy
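To put a rough number on the Maximum Time Exceeded effect: the client aborts a task once its elapsed time exceeds roughly <rsc_fpops_bound> divided by the flops value it holds for the device, so a higher internal flops figure shrinks the time allowance of tasks issued under the old estimate. A sketch with entirely made-up fpops numbers (only the 941 GFLOPS peak comes from the log above):

```shell
# Hypothetical figures: a task's rsc_fpops_bound and two device
# flops estimates (old value vs. the higher 7.0.32+ value).
rsc_fpops_bound=440000000000000     # 4.4e14 fpops (made up)
old_flops=550000000000              # 550 GFLOPS (made up)
new_flops=941000000000              # 941 GFLOPS peak, as logged

# BOINC aborts a task whose elapsed time exceeds bound/flops seconds.
echo "old limit: $(( rsc_fpops_bound / old_flops )) s"   # 800 s
echo "new limit: $(( rsc_fpops_bound / new_flops )) s"   # 467 s
```

With the same bound, the higher flops value cuts the allowed run time almost in half in this made-up case, which is why tasks already in progress can suddenly sit right at the limit.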
Tron | Joined: 16 Aug 09 | Posts: 180 | Credit: 2,250,468 | RAC: 0
I don't visit the BOINC development forums, so I would not have known.

This memory reporting bug, do you think it's what's causing this?

Cuda error '(cudaMemcpy(PowerSpectrumSumMax, dev_PowerSpectrumSumMax, (cudaAcc_NumDataPoints / fftlen) * sizeof(*dev_PowerSpectrumSumMax), cudaMemcpyDeviceToHost))' in file 'cuda/cudaAcc_summax.cu' in line 239 : unspecified launch failure.
Claggy | Joined: 5 Jul 99 | Posts: 4654 | Credit: 47,537,079 | RAC: 4
I don't visit the BOINC development forums, so I would not have known.

No, that's totally separate; have a read of one of Jason's recent posts: Message 1302800

Claggy
Tron | Joined: 16 Aug 09 | Posts: 180 | Credit: 2,250,468 | RAC: 0
No, that's totally separate; have a read of one of Jason's recent posts: Message 1302800

I am a participant in that thread already. Is he saying, in so many words, that they have no idea what the problem is?
jason_gee | Joined: 24 Nov 06 | Posts: 7489 | Credit: 91,093,184 | RAC: 0
Is he saying, in so many words, that they have no idea what the problem is?

Sort of correct :D (and rather funny!). The basic timeline goes something like this:

- Everyone uses physical device and driver models, and is happy with the limitations (the XP driver model; similar on Linux with the same hardware).
- Microsoft is as Microsoft does, and decides your GPU is now part of your computer. It designs WDDM looking forward some 15-25 years, from around the early Vista release, and we know it hasn't been well coped with since, in different ways at different levels.
- Hardware engineers caught up and ran with the new models, outrunning software development's capacity to read and understand the fundamental changes and to change deprecated habits.

To a similar question from a user here, right back around when the CUDA apps were first released here from nVidia engineers, I publicly commented that there were parallels (pun intended) to historic supercomputer development (having worked in the field) in the code approaches and technologies, which would inevitably encounter the same set of difficulties along with new ones. At the time I predicted 'growing pains', and these gotchas are part of that process; it seems they have been ongoing. You continue to have the choice of moving with the technologies or sticking with what works for you, both being viable approaches from time to time.

As with multithreading on multicore CPUs, it has taken many years for software development to evolve to catch up, lagging far behind the hardware engineering. That includes the OS, drivers, and tools, but most especially 'programmer sensibilities', which can be very resistant to change. Being relatively 'new' (circa 1998), gpGPU is still very rapidly evolving, and what worked well last month may be marginal this month and completely deprecated the next. With CUDA things have stabilised quite a lot, but at the same time recent developments present an entirely new set of challenges.
So much of what I've learned is basically undocumented that a series of articles directed at developers is on the cards. I'm a proponent of learning and adapting as the technology evolves, while also craving some stability. All that really says is that the technologies are still maturing. Whether that settles down anytime soon, or goes on for a lot longer, your guess is as good as mine, though software developer inertia remains high, as it has with multithreading.

Jason

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Tron | Joined: 16 Aug 09 | Posts: 180 | Credit: 2,250,468 | RAC: 0
Thank you once more for the narrative. I do understand. Having just scratched the surface of my first "modern" computer in 10 years, I can fully empathise with the plight of the developer. I am very willing to step back in time/tech to make this work, but my Tardis lacks the proper coordinates. This computer is for play and distributed computing only, so perhaps an older distro of Linux would be more diagnostic?

Lately, I have been mucking around with BIOS settings, disabling various known "new" features in hopes that one of those features is the cause of my issue. Specifically, where memory is concerned: what memory setting would be least prone to failure? It is currently set to "XMP" (Intel); it is my understanding that this might be up-clocking my RAM on the fly, though during POST the stated settings are in fact the OEM settings for the memory modules. Should the BIOS be set to allow memory remapping? (Currently enabled.)
jason_gee | Joined: 24 Nov 06 | Posts: 7489 | Credit: 91,093,184 | RAC: 0
Actually, the RAM thing could well be a big issue, as later video integrates RAM, bus, and video RAM all quite closely compared to the past. If you're using an 'Auto' XMP profile, check that the voltage supplied is correct to the RAM spec, and that the BIOS is set to a command rate of 2T. If not, you have to go to manual settings. Overeager auto settings have been a problem there lately, as most experienced system builders/overclockers find it exceedingly rare for a system memory controller / RAM combination to work reliably at a command rate of 1T, which unfortunately seems to be the default in many situations.

In addition, you want maximum RAM signal integrity without sinking or sourcing excessive current through the memory controller. That's adjusted by a setting called Vtt, which should be around 75-80% of the RAM rating. It's basically the radio-antenna 'impedance matching' of the terminations and circuit traces. In an increasing number of cases, anything auto could be asking for issues.

These circumstances with 'auto' settings not working right have only really become a big issue as 'mainstream enthusiast' midrange parts have become commonplace. It's a similar situation to the notorious 560 Ti challenges. An exaggerated analogy might be if your local used car dealer suddenly started selling refurbished Apache Longbow helicopters due to excess supply and demand by iPhone-bearing gamer teenagers with too much money.

HTH, Jason
Tron | Joined: 16 Aug 09 | Posts: 180 | Credit: 2,250,468 | RAC: 0
new log clip wrote:
OK, one bug exterminated :) Still chucking occasional CUDA memcpy errors. I looked for the timing setting Jason mentioned and could not find a 1T or 2T setting; the only options are 1N, 2N, 3N <-- same thing?
jason_gee | Joined: 24 Nov 06 | Posts: 7489 | Credit: 91,093,184 | RAC: 0
...Still chucking occasional CUDA memcpy errors.

Yep, and there can be multiple names for 'Vtt' as well.
Tron | Joined: 16 Aug 09 | Posts: 180 | Credit: 2,250,468 | RAC: 0
Argh! This is crazy! I watched over 100 tasks complete perfectly and, the minute I went to bed, it crapped out an error every 15 minutes till I woke. Now I am watching again... guess what... no errors, and nothing has changed except that I am actively typing and mousing. Is the sleep bug really that sneaky? Like, could there be afflicted drivers that are not on the list? What would be some other names for "Vtt"?
jason_gee | Joined: 24 Nov 06 | Posts: 7489 | Credit: 91,093,184 | RAC: 0
Sure; Linux drivers often have different build numbers, so something could still reside there. Those memcpy() failures, though, are not generally symptoms of the sleepy bug. More likely some sort of power-saving setting, by the sound of it, like the CPU throttling down and missing its cue.

For aliases for Vtt, try http://www.masterslair.com/vcore-vtt-dram-pll-pch-voltages-explained-core-i7-i5-i3/#vtt-aka-imc-qpivtt-qpidram
Tron | Joined: 16 Aug 09 | Posts: 180 | Credit: 2,250,468 | RAC: 0
Jason wrote:
Thanks, that web page was very useful. In fact, it happens to be based on my exact model of motherboard.

Updating my issue: I crunched for Einstein while SETI was in repair mode, and this shed some light on the problems I have been having, since I could not (and still can't) get a single GPU CUDA task to even pass the initial validate sanity check at E@H. Members of that forum pointed out that my "twin" GTX 460 GPU runs in SLI mode with itself, and members had reported problems with SLI, so I disabled it in the X config. I also changed the "MultiGPU" option to off. Now both GPUs operate independently of each other. While this did not help my E@H problem at all, it did seem to stop the constant CUDA memcpy errors I had been getting at S@H (3 WUs errored in 60 hours of operation vs. 10 to 20 per day previously).

Now I can plainly see some tasks coming back "invalid" here at S@H, and all of those so far have been issued to "GPU 2" in my host. Looks like I have found the culprit, to a degree. Literally, GPU 2 runs 30°C hotter than GPU 1 on the same card. Utilizing "Coolbits" I can only control the fan for GPU 1; GPU 2's fan runs at the stock variable speeds. I can't quite figure out how to get the Coolbits thermal settings to apply to both fans.
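On the Coolbits question, one thing that may be worth trying (a sketch only, not verified on this exact card): the Coolbits option generally has to appear in a separate xorg.conf "Device" section for each GPU, identified by its PCI BusID, otherwise the fan control is only exposed for the first GPU. The BusID values below are hypothetical; `lspci | grep -i nvidia` shows the real ones.

```
Section "Device"
    Identifier "GPU0"
    Driver     "nvidia"
    BusID      "PCI:1:0:0"      # hypothetical; check with lspci
    Option     "Coolbits" "4"   # bit 2 enables the fan-speed control
EndSection

Section "Device"
    Identifier "GPU1"
    Driver     "nvidia"
    BusID      "PCI:2:0:0"      # hypothetical
    Option     "Coolbits" "4"
EndSection
```

After restarting X, nvidia-settings should then show a fan-speed control under each GPU's Thermal Settings page rather than just the first.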
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.