Linux CUDA 'Special' App finally available, featuring Low CPU use

Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1856288 - Posted: 18 Mar 2017, 12:27:54 UTC

Hi,
here is an error in the software: https://setiathome.berkeley.edu/result.php?resultid=5594593817. It happens, just not very often.

Pulse finding detects multiple 'pulses' at a certain time. Something gets overwritten, I guess. It is hard to debug since I do not have that WU on my computer.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1856288 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1856292 - Posted: 18 Mar 2017, 12:34:25 UTC - in response to Message 1856288.  
Last modified: 18 Mar 2017, 12:35:53 UTC

Hi,
here is an error in the software: https://setiathome.berkeley.edu/result.php?resultid=5594593817. It happens, just not so often..

Pulse finding detects multiple 'pulses' at certain time. Something gets overwritten I guess. Hard to debug since I do not have that wu on my computer.


Superficially, it looks like some of those pulses would [normally] coalesce in the serial CPU case, possibly across the boundary where you unroll periods. If so, that's likely the missing reduction I mentioned. If the issue disappears without unroll (unroll 1), then it's just a reproduction of the race condition I've been describing. The trick will probably be to drag pulse reporting out of the kernels, and either reduce from multiple pulse tables or put atomics on the pulse results and only update if the score is higher. Not sure which is easier, but pulling it all out will probably be better for the next generation, which will be even bigger.
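
A minimal sketch of the "only update if the score is higher" idea, written as plain Python rather than CUDA so it runs anywhere. The `Pulse` fields, the tie-break on period and the sample values are illustrative assumptions, not taken from the app's source.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Pulse:
    score: float   # hypothetical figure of merit; higher is better
    period: int    # hypothetical pulse period, used only as a tie-break
    chunk: int     # which unrolled chunk produced the candidate


def last_writer_wins(candidates):
    """Racy behaviour: whichever chunk writes last overwrites the result."""
    best = None
    for pulse in candidates:
        best = pulse
    return best


def reduce_best(candidates):
    """Order-independent reduction: keep the highest score and break ties on
    the lowest period, so every arrival order yields the same winner."""
    best = None
    for pulse in candidates:
        if best is None or (pulse.score, -pulse.period) > (best.score, -best.period):
            best = pulse
    return best


# Three candidates, one per unrolled chunk (made-up values).
chunks = [Pulse(1.02, 40, 0), Pulse(1.31, 37, 1), Pulse(1.31, 52, 2)]

# Try every arrival order and collect the distinct winners.
racy = {last_writer_wins(order) for order in permutations(chunks)}
stable = {reduce_best(order) for order in permutations(chunks)}

print(len(racy))    # more than one winner: the result depends on ordering
print(len(stable))  # one winner regardless of ordering
```

On the GPU the same comparison would sit inside an atomic update, or in a final reduction pass over per-chunk pulse tables; the point is only that the winner must not depend on which chunk finishes last.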
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1856292 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1856294 - Posted: 18 Mar 2017, 12:41:58 UTC - in response to Message 1856292.  
Last modified: 18 Mar 2017, 12:43:39 UTC

Maybe a better alternative description: the pulses aren't at one single 'spot', and the pulse-finding algorithm is looking for the nicest 'blob'. If you did it visually, it might be less like looking through a static image and more like an 'Enhance...Pan', like Harrison Ford in the Blade Runner intro.

[Edit:] https://youtu.be/qHepKd38pr0
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1856294 · Report as offensive
Profile tazzduke
Volunteer tester

Send message
Joined: 15 Sep 07
Posts: 190
Credit: 28,269,068
RAC: 5
Australia
Message 1856298 - Posted: 18 Mar 2017, 12:55:17 UTC - in response to Message 1856292.  

Greetings

Okay, here is a workunit that has gone to 3 different users but is a -9 overflow - spike count 30.

https://setiathome.berkeley.edu/workunit.php?wuid=2471498679

There might be some info there for you to look at maybe.

Update: I have checked my pv status on my GTX 780 machine running x41p_zi3t1b: 70 tasks, at least 14 of them with the -9 overflow error.

I still have some tasks in the cache to be done, but I have set it to No New Tasks.

Regards
Mark
ID: 1856298 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1856300 - Posted: 18 Mar 2017, 13:01:31 UTC - in response to Message 1856298.  
Last modified: 18 Mar 2017, 13:05:25 UTC

The second host there looks like it broke entirely. It's unclear why yours and the first host mismatch, though your 30 pulses might mean you need a reboot or something.

[Edit:] inconclusives/Invalids seem very high, even for the experimental apps. I'd look at doing the coolbits thing to get the fan up, and keep temps in check.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1856300 · Report as offensive
Profile tazzduke
Volunteer tester

Send message
Joined: 15 Sep 07
Posts: 190
Credit: 28,269,068
RAC: 5
Australia
Message 1856302 - Posted: 18 Mar 2017, 13:11:37 UTC - in response to Message 1856300.  

Greetings Jason

I seem to be having trouble with the BLC workunits; I have bumped the fan up.

I am also going to reboot and see if that helps.

The invalids/errors were on a previous build.

Regards
ID: 1856302 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1856303 - Posted: 18 Mar 2017, 13:15:00 UTC - in response to Message 1856302.  

Greetings Jason

I seem to having trouble with the BLC workunits, I have bumped the fan up.

I am also going to reboot and see if that helps.

The invalids/errors were on a previous build.

Regards


Great. Fingers crossed nothing serious. The 780's are troopers.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1856303 · Report as offensive
Stephen "Heretic"
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1856308 - Posted: 18 Mar 2017, 13:29:23 UTC - in response to Message 1856253.  


An overflow is an overflow (noisy packet). It may contain 60 spikes, autocorrelations and pulses. They are searched in different work queues on GPU. Results are checked and reported. A different implementation (parallel) can find them in any order. No order can be said best.

The rate of inconclusive results can be found here: http://setiathome.berkeley.edu/results.php?hostid=7475713&offset=0&show_names=0&state=4&appid=29

Petri


. . An inconclusive rate of 2.5% does not seem out of place; the error rate is a little more significant, but at 2/3 of one percent it is still not dramatic. Neither degrades the quality of the pool of result data, and both are low enough to have only a very small impact on the work flow/database size.

. . But then I am sure there is an assessment process to define the fitness for purpose.

Stephen

.
ID: 1856308 · Report as offensive
Stephen "Heretic"
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1856311 - Posted: 18 Mar 2017, 13:39:17 UTC - in response to Message 1856273.  

As always, start by reproducing under bench with precisely the same parameters against the stock/win32 CPU reference, then again with unroll set to 1. If you reproduce the mismatch with higher unroll and get a match without unroll, then you just confirm what we already know: the application is fast but broken, as it does not match the serial variants. Anyone who claims it is OK to choose whatever signals they like from the full dataset has no idea what they are talking about, and you should point them out so I can ridicule them with computer science.


a) If a packet is full of crap and the decision is made solely based on number of signals found, say 30, then it is completely irrelevant what sort of crap the packet contains.
b) If a packet has an acceptable amount of signals they must be reported correctly.
c) If a packet does not have reportable signals, finding best non reportable wastes cycles on cosmetics.

Petri


. . The first two points make sense, but what if one process says there are too many signals and another says there are not? Would it not be better to eliminate mismatches in this grey area to be sure of the results? On point three, maybe one day those "best" results from what are now deemed insignificant results will be important, if a signal is confirmed at the very limit of what is seen as significant now. Just my take.
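
A rough sketch of that grey area, assuming the cut-off is the "say 30" figure quoted above and that a result is classed purely by its signal count. The function names are made up, and the real validator compares far more than this.

```python
OVERFLOW_LIMIT = 30  # assumed cut-off, from the "say 30" in the quoted post


def classify(signal_count, limit=OVERFLOW_LIMIT):
    """Class a result by signal count alone (a deliberate simplification)."""
    if signal_count >= limit:
        return "overflow"
    if signal_count > 0:
        return "reportable"
    return "none"


def same_class(count_a, count_b, limit=OVERFLOW_LIMIT):
    """True when two hosts put the same workunit in the same class."""
    return classify(count_a, limit) == classify(count_b, limit)


# Both hosts well past the limit: same class, whatever the signals were.
print(same_class(31, 45))

# One host stops at 29, the other reaches 30: the classes differ.
print(same_class(29, 30))
```

The second case is the mismatch in question: a difference of one signal near the limit flips the whole result from reportable to overflow.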

Stephen

?
ID: 1856311 · Report as offensive
Stephen "Heretic"
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1856316 - Posted: 18 Mar 2017, 13:59:00 UTC - in response to Message 1856300.  

The second host there looks like it broke entirely. It'll be unclear on why yours and the first host mismatch, though your 30 pulses might mean you need a reboot or something.

[Edit:] inconclusives/Invalids seem very high, even for the experimental apps. I'd look at doing the coolbits thing to get the fan up, and keep temps in check.


. . Now this is a subject that is of concern to me. I have run coolbits on the Pentium machine to keep the 1060s cool, but it doesn't do anything except report that an option value has been set to true for both devices and that the config file has been backed up. No matter what I try, it does not open the nvidia graphical app to control things, so my fans are still running way low and the cards way hot. Any idea how to fix this issue?

. . I want to continue with this app but I don't want to damage the 1060s.

. . Of course with a windows version I would have the tools I need to take care of this :)

Stephen

:)
ID: 1856316 · Report as offensive
Stephen "Heretic"
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1856319 - Posted: 18 Mar 2017, 14:22:41 UTC

. . In addition to the previous post ...

Sun Mar 19 00:13:53 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 0000:01:00.0      On |                  N/A |
| 36%   63C    P2    53W / 120W |   1867MiB /  6068MiB |     43%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 106...  Off  | 0000:02:00.0     Off |                  N/A |
| 29%   55C    P2    60W / 120W |   1730MiB /  6072MiB |     41%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1027    G   /usr/lib/xorg/Xorg                             106MiB |
|    0      2192    G   compiz                                          30MiB |
|    0      4476    C   ...ome_x41p_zi3k+_x86_64-pc-linux-gnu_cuda80  1727MiB |
|    1      4444    C   ...ome_x41p_zi3k+_x86_64-pc-linux-gnu_cuda80  1727MiB |
+-----------------------------------------------------------------------------+


. . It would seem the temps have induced throttling on both GPUs; they are both cooler now, but with very low power consumption and very low utilisation. :(
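
One way to check for throttling directly, rather than inferring it from temperature and load, is to ask nvidia-smi for its throttle-reason field. A rough sketch, assuming the query field names shown by `nvidia-smi --help-query-gpu` on your driver (they can differ between driver versions). The sample text reuses the readings from the table above; the throttle bitmask in it is a placeholder, not a real reading.

```python
import csv
import io
import subprocess

# Query fields; names assumed to match `nvidia-smi --help-query-gpu`.
FIELDS = ["index", "temperature.gpu", "fan.speed", "power.draw",
          "utilization.gpu", "clocks_throttle_reasons.active"]


def query_gpus():
    """Run nvidia-smi and return its CSV text (needs the NVIDIA driver)."""
    cmd = ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse(csv_text):
    """Turn nvidia-smi CSV text into one dict per GPU, keyed by field name."""
    rows = []
    for rec in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if rec:
            rows.append(dict(zip(FIELDS, (v.strip() for v in rec))))
    return rows


# Sample shaped like the tool's CSV output; the bitmask is a placeholder.
sample = ("0, 63, 36, 53.00, 43, 0x0000000000000000\n"
          "1, 55, 29, 60.00, 41, 0x0000000000000000\n")

for gpu in parse(sample):
    print(gpu["index"], gpu["temperature.gpu"], gpu["clocks_throttle_reasons.active"])
```

On a live machine, `parse(query_gpus())` would replace the sample; a non-zero bitmask means the driver is holding the clocks back for some reason, which the driver documentation breaks down bit by bit.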

. . the nice values are AOK.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
 4444 stephen   20   0 22.562g 536112 327560 R  62.1 13.2   1:47.61 setiathome+ 
 4476 stephen   20   0 22.592g 567568 327468 R  52.6 14.0   1:12.15 setiathome+ 


. . But this is what I get when I run this command ...
sudo nvidia-xconfig --thermal-configuration-check --cool-bits=28 --enable-all-gpus


Using X configuration file: "/etc/X11/xorg.conf".
Option "ThermalConfigurationCheck" "True" added to Screen "Screen0".
Option "ThermalConfigurationCheck" "True" added to Screen "Screen1".
Backed up file '/etc/X11/xorg.conf' as '/etc/X11/xorg.conf.backup'
New X configuration file written to '/etc/X11/xorg.conf'
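
For reference, the Coolbits value is a bitmask. A rough sketch of how 28 decodes, assuming the bit meanings as I recall them from the NVIDIA driver README (worth checking against the README for your driver version); the dictionary and function names are made up for illustration.

```python
# Bit meanings assumed from the NVIDIA driver README; verify for your driver.
COOLBITS = {
    1: "clock controls for older (pre-Fermi) GPUs",
    2: "SLI across GPUs with differing memory sizes",
    4: "manual fan speed control",
    8: "clock offset controls (PowerMizer page)",
    16: "overvoltage via the nvidia-settings command line",
}


def decode_coolbits(value):
    """List the features switched on by a Coolbits value."""
    return [label for bit, label in COOLBITS.items() if value & bit]


for label in decode_coolbits(28):   # 28 = 4 + 8 + 16
    print(label)
```

Fan control only needs the 4 bit; 28 adds the clock offset and overvoltage controls on top of it.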



. . very depressing :(

Stephen

:(
ID: 1856319 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1856324 - Posted: 18 Mar 2017, 14:43:15 UTC - in response to Message 1856319.  
Last modified: 18 Mar 2017, 15:23:47 UTC


Using X configuration file: "/etc/X11/xorg.conf".
Option "ThermalConfigurationCheck" "True" added to Screen "Screen0".
Option "ThermalConfigurationCheck" "True" added to Screen "Screen1".
Backed up file '/etc/X11/xorg.conf' as '/etc/X11/xorg.conf.backup'
New X configuration file written to '/etc/X11/xorg.conf'
That says the Option has been added. All you have to do is Open "NVIDIA X Server Settings" and go to "Thermal Settings".
Make sure "Enable GPU Fan Settings" is Checked and you should have a Slider control. Slide it to the desired setting and click Apply.
If you don't have the Slider, then Coolbits isn't working and you should try My suggestion on editing the xorg.conf file. You also may need to edit the gpu-manager.conf file to keep from having the xorg.conf overwritten on reboot. Both those items are discussed earlier in this thread.
https://setiathome.berkeley.edu/forum_thread.php?id=80636&postid=1835926#1835926
https://setiathome.berkeley.edu/forum_thread.php?id=80636&postid=1855894
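
The slider also has a command-line equivalent. A hedged sketch that only builds the nvidia-settings command lines, assuming the `GPUFanControlState` and `GPUTargetFanSpeed` attributes exposed by recent drivers, a running X session with the fan-control Coolbits bit set, and that fan N belongs to GPU N (check with `nvidia-settings -q fans`). The function name and the 70% figure are arbitrary.

```python
import shlex


def fan_command(gpu, fan, percent):
    """Build an nvidia-settings call that takes manual control of one fan."""
    if not 0 <= percent <= 100:
        raise ValueError("fan speed must be a percentage")
    return [
        "nvidia-settings",
        "-a", f"[gpu:{gpu}]/GPUFanControlState=1",
        "-a", f"[fan:{fan}]/GPUTargetFanSpeed={percent}",
    ]


# Two cards, assuming fan N belongs to GPU N.
for n in (0, 1):
    print(shlex.join(fan_command(n, n, 70)))
```

The printed lines can be pasted into a terminal or passed to `subprocess.run`; like the slider, the setting does not survive a reboot unless it is reapplied.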
ID: 1856324 · Report as offensive
Profile tazzduke
Volunteer tester

Send message
Joined: 15 Sep 07
Posts: 190
Credit: 28,269,068
RAC: 5
Australia
Message 1856325 - Posted: 18 Mar 2017, 14:43:52 UTC

Greetings

I boosted the fan speed and also made some adjustments to the command line in the app_info file.

Also rebooted and so far so good on the BLC workunits.

I will have a better understanding tomorrow once I have sifted through the latest round of workunits.

@Stephen, I am using coolbits option 4, but I also just realised that you have two GPUs in the machine.

Regards
ID: 1856325 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1856332 - Posted: 18 Mar 2017, 15:10:51 UTC - in response to Message 1856325.  

...Also rebooted and so far so good on the BLC workunits...

I just had to check ;-)
It appears you're suffering the same problem I was having with x41p_zi3t1b. You are receiving Overflows where your WingPeople aren't, https://setiathome.berkeley.edu/workunit.php?wuid=2472661131
I just got zi3t1d running on the Mac, we'll see how that works.
ID: 1856332 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1856431 - Posted: 18 Mar 2017, 21:39:24 UTC - in response to Message 1856332.  

. . It would seem the temps have induced throttling on both GPUs, they are both cooler now but very low power consumption and very low utilisation. :(

63°C on a video card (or CPU) I would describe as warm, certainly not hot, nor even close to causing thermal throttling. Thermal throttling results when they reach their maximum rated temperatures, and before that point their fans would be running at 100%.
The low temperatures & power consumption would be due to the low load.
Grant
Darwin NT
ID: 1856431 · Report as offensive
Stephen "Heretic"
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1856445 - Posted: 18 Mar 2017, 23:06:42 UTC - in response to Message 1856324.  

That says the Option has been added. All you have to do is Open "NVIDIA X Server Settings" and go to "Thermal Settings".
Make sure "Enable GPU Fan Settings" is Checked and you should have a Slider control. Slide it to the desired setting and click Apply.
If you don't have the Slider, then Coolbits isn't working and you should try My suggestion on editing the xorg.conf file. You also may need to edit the gpu-manager.conf file to keep from having the xorg.conf overwritten on reboot. Both those items are discussed earlier in this thread.
https://setiathome.berkeley.edu/forum_thread.php?id=80636&postid=1835926#1835926
https://setiathome.berkeley.edu/forum_thread.php?id=80636&postid=1855894


. . Aaaah, and there's the rub. What is the actual command to <open "NVIDIA X Server Settings">? I have found the file xorg.conf, and the ThermalConfigurationCheck option is there and "true". But I cannot find a gpu-manager file, and no variation of ./xserver or ./nvidia-xserver runs anything. :(

. . I guess I am dumber than I thought but there is nothing about this that makes sense to me.

Stephen

?
ID: 1856445 · Report as offensive
Profile tazzduke
Volunteer tester

Send message
Joined: 15 Sep 07
Posts: 190
Credit: 28,269,068
RAC: 5
Australia
Message 1856447 - Posted: 18 Mar 2017, 23:13:35 UTC

Greetings

I have gone back through my results since I made the changes, and I have found only 3 workunits with the -9 overflow still pending, out of a possible 60 workunits that have been done.

I then checked my valid workunits and found two -9 overflow workunits that have validated. Is this normal?

https://setiathome.berkeley.edu/workunit.php?wuid=2472679069

https://setiathome.berkeley.edu/workunit.php?wuid=2472696759
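
For scale, a quick worked comparison of the overflow share before and after the changes, using only the counts reported in this thread (at least 14 of 70 earlier, 3 pending of 60 now). The function name is made up, and the later figure leaves out the two overflows that validated.

```python
def overflow_rate(overflows, total):
    """Share of tasks that ended in a -9 overflow, in percent."""
    return 100.0 * overflows / total


before = overflow_rate(14, 70)  # "at least 14" of 70 tasks, earlier post
after = overflow_rate(3, 60)    # 3 pending overflows of 60 after the changes

print(f"{before:.1f}% -> {after:.1f}%")
```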

Regards
ID: 1856447 · Report as offensive
Stephen "Heretic"
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1856448 - Posted: 18 Mar 2017, 23:18:28 UTC - in response to Message 1856431.  

. . It would seem the temps have induced throttling on both GPUs, they are both cooler now but very low power consumption and very low utilisation. :(

63°c on a video card (or CPU) I would describe as warm, certainly not hot nor even close to causing thermal throttling. Thermal throttling results when they reach their maximum rated temperatures, and before that point their fans would be running at 100%.
The low temperatures & power consumption would be due to the low load.


. . Thanks Grant but that was after throttling. The temps were hitting the 80 mark the last time I had checked it then boom, down to those numbers and the runtimes had doubled. I shut it down and let it cool a little then fired it back up, and running exactly the same WUs the usage went up to 88% and the temps climbed back up to the high 70s.

. . Maybe it wasn't throttling but if not I would like to know what it was that turned down the wick on the GPUs rather than turning up the fans. The default fan profiles are waaaayyy too adventurous for my liking. Too much pre-occupation with low noise and not enough with good cooling.

. . Where is a Linux version of Afterburner ..... wwwwaaahhhhhhh!

Stephen

:(
ID: 1856448 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1856451 - Posted: 18 Mar 2017, 23:27:58 UTC - in response to Message 1856445.  
Last modified: 19 Mar 2017, 0:00:51 UTC

That says the Option has been added. All you have to do is Open "NVIDIA X Server Settings" and go to "Thermal Settings".
Make sure "Enable GPU Fan Settings" is Checked and you should have a Slider control. Slide it to the desired setting and click Apply.
If you don't have the Slider, then Coolbits isn't working and you should try My suggestion on editing the xorg.conf file. You also may need to edit the gpu-manager.conf file to keep from having the xorg.conf overwritten on reboot. Both those items are discussed earlier in this thread.
https://setiathome.berkeley.edu/forum_thread.php?id=80636&postid=1835926#1835926
https://setiathome.berkeley.edu/forum_thread.php?id=80636&postid=1855894
. . Aaaah, and there's the rub. What is the actual command to <open "NVIDIA X Server Settings">? I have found the file xorg.cnf and the thermalconfigurationcheck option is there and "true". But I cannot find a gpu-manager file and no variation of ./xserver or ./nvidia-xserver runs anything. :(

. . I guess I am dumber than I thought but there is nothing about this that makes sense to me.
Hit the Top Search button in the launcher and enter nv, it should show the App. Once running, lock it to the Launcher so it's easy to find.
If NVIDIA X Server Settings is in the Search box you click it. If NVIDIA X Server Settings is in the Launcher you just click it. Pretty straightforward, and it appears you have managed to open it previously.
Open NVIDIA X Server Settings and see if you have the Fan Control Slider before doing anything else.
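
From a terminal, the same application is normally started by the `nvidia-settings` binary that ships with the driver. A leading `./` only looks in the current directory, which would explain why ./xserver and ./nvidia-xserver ran nothing. A small sketch that checks whether the binary is on PATH before launching it; the function names are made up.

```python
import shutil
import subprocess


def find_on_path(name="nvidia-settings"):
    """Return the full path of a program on PATH, or None if it is absent."""
    return shutil.which(name)


def launch_settings():
    """Start the NVIDIA settings GUI without waiting for it to exit."""
    path = find_on_path()
    if path is None:
        raise FileNotFoundError("nvidia-settings not found on PATH")
    return subprocess.Popen([path])


print(find_on_path() or "nvidia-settings is not installed or not on PATH")
```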
ID: 1856451 · Report as offensive
Stephen "Heretic"
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1856458 - Posted: 19 Mar 2017, 0:03:18 UTC - in response to Message 1856451.  


. . Aaaah, and there's the rub. What is the actual command to <open "NVIDIA X Server Settings">? I have found the file xorg.cnf and the thermalconfigurationcheck option is there and "true". But I cannot find a gpu-manager file and no variation of ./xserver or ./nvidia-xserver runs anything. :(

. . I guess I am dumber than I thought but there is nothing about this that makes sense to me.


Hit the Top Search button in the launcher and enter nv, it should show the App. Once running, lock it to the Launcher so it's easy to find.

If NVIDIA X Server Settings is in the Search box you click it. If NVIDIA X Server Settings is in the Launcher you just click it. Pretty straightforward, and it appears you have managed to open it previously.
Open NVIDIA X Server Settings and see if you have the Fan Control Slider before doing anything else.


. . Thanks TBar,

. . If the app had been in the Launcher I would have had no problems. The nv. search didn't find it, but an nvidia. search did, and that solved the problem. Thanks for that.

. . The 1060s thank you as well; they are now running at around 60°C.

Stephen

:)
ID: 1856458 · Report as offensive