Issues with 1 GPU on Penta Nano System

Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1826546 - Posted: 24 Oct 2016, 18:23:57 UTC - in response to Message 1826528.  

I forgot that SIV - System Information Viewer also shows this (only for BOINC processes)
Right-Click [Windows] button -> BOINC Status

That window also has a "[ ] CPU Affinity" check-box, but I don't know exactly what it does
(the tooltip says something - hover over [ ] to see "Allow %s to optimise the CPU Affinity of BOINC processes")


 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1826546 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1826593 - Posted: 24 Oct 2016, 23:09:19 UTC - in response to Message 1826518.  

No. What I need is:
http://clip2net.com/s/3DCFYVb
for all 5 tasks.


The background looks like task manager, but I am not sure what the dialog box in the foreground is. How do I display that information?

Task manager -> Right click on process line-> Set affinity...


Here is the affinity info: https://flic.kr/p/Nuamo7

Maybe I can try some of the other tools during lunch today...
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1826593 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1826599 - Posted: 24 Oct 2016, 23:54:54 UTC

Thanks BilBg for the System Explorer link!


PentaNanoSystemExplorer by Rick (瑞克), on Flickr
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1826599 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1826618 - Posted: 25 Oct 2016, 1:09:33 UTC - in response to Message 1826544.  

It works with taskmanager also.

Sure, but only a single task per screenshot. And we need to see all 5 of them.
I'm afraid the 5th task will not be pinned. We will see. Unfortunately, such mighty hosts ignore beta testing :/ and the change in affinity handling was made between v8.12 and v8.19...

I would be happy to participate in Beta testing. I will read up on it, but if you can get me started, that would be great. I just seem to have too many active projects and limited time, but I will do what I can.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1826618 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1826644 - Posted: 25 Oct 2016, 4:54:30 UTC
Last modified: 25 Oct 2016, 5:35:07 UTC

I decided to drop back to beta4 to see what happens. Seems like I lost all work in progress in doing this. Not sure why, since I used the installer. I only installed the MB app, so it is only running MB jobs now. Looks like affinity is not a problem in this version. But the same GPU is still underloaded. It is a different device number in BOINC, but HWInfo shows the same GPU to be underloaded. Can someone let me know how to extract my past data to see if this has always been a problem? I am concerned it may be a hardware issue, but Mike mentioned FPU throttling may be the issue. I am hoping past data could help identify best direction to go in.


PentaNanoSystemExplorer_r3500 by Rick (瑞克), on Flickr
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1826644 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1826655 - Posted: 25 Oct 2016, 8:15:40 UTC - in response to Message 1826644.  

If you ran only on the main project: its server keeps only ~1 day of past data. On the beta site http://setiathome.berkeley.edu/beta/ data lasts longer.
Currently there is no need to join beta for Windows GPU apps though - nothing is unreleased. Only for the next round, which has yet to start.

With the overloaded-FPU explanation it is not quite obvious why only one of the 5 GPUs starves.
With r3500 affinity handling there are 2 pairs of fully loaded CPU cores, so 2 GPUs instead of just one should show starvation. It's a puzzle why only one does.
Also, it's a puzzle why it is always the exact same one. CPU pinning depends on task completion time, and this time is quite random, so different GPU processes should get paired together on the same CPU module. So, if it's CPU-dependent starvation, different GPU devices should starve over time. But you say that only one particular GPU device shows a performance drop...
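
The pairing argument can be checked against a set of affinity readings with a small script. This is only a sketch: it assumes cores 2k and 2k+1 form one module with a shared FPU, and the process names and core numbers below are hypothetical, not taken from this host.

```python
from collections import defaultdict

# Assumption: cores 2k and 2k+1 belong to the same module (shared FPU).
CORES_PER_MODULE = 2


def shared_modules(pinning):
    """Return {module: [processes]} for modules hosting more than one process."""
    by_module = defaultdict(list)
    for proc, core in sorted(pinning.items()):
        by_module[core // CORES_PER_MODULE].append(proc)
    return {m: procs for m, procs in by_module.items() if len(procs) > 1}


# Hypothetical pinning of 5 GPU processes, one core each.
example = {"gpu0": 0, "gpu1": 2, "gpu2": 4, "gpu3": 6, "gpu4": 7}
print(shared_modules(example))  # {3: ['gpu3', 'gpu4']}
```

With 5 single-core pins over 4 modules at least one module is always shared, so repeating this over several readings would show whether the shared pair changes while the slow GPU stays the same.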
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1826655 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1826656 - Posted: 25 Oct 2016, 8:20:23 UTC - in response to Message 1826655.  

Since you are now armed with many monitoring tools, I would recommend confirming these observations:
1) only a single GPU device starves
2) it's always the same physical GPU device.

And then it's time to turn to hardware: swap the affected GPU with any other. What does that give?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1826656 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1826666 - Posted: 25 Oct 2016, 10:57:16 UTC - in response to Message 1826656.  

Since you are now armed with many monitoring tools, I would recommend confirming these observations:
1) only a single GPU device starves
2) it's always the same physical GPU device.

And then it's time to turn to hardware: swap the affected GPU with any other. What does that give?


Thanks so much for your help on this! I find that I learn a lot with each issue I work through. From HWInfo and GPUz, I am convinced it is always the same GPU. I need to physically confirm which one it is, but that should be easy from the PCIe configuration of the one that is impacted. I agree with your thoughts on the starving GPU theory. I think it is unlikely. Also, I am pretty obsessive about monitoring my systems and I am quite certain that this started happening recently. Since this system is watercooled, messing with hardware will be tedious, so I am going to start with software. This system went through a mess when Windows Update tried to install Crimson, which gives a BSOD with 5 Nanos, and the latest Windows cumulative update also always fails. So my plan is to do a clean Windows install first and get all of the updates loaded, then manually install the GPU drivers again. If I still have the problem, then I will have to start swapping GPUs around to determine if it is GPU or MB. I will have a busy weekend... Thanks again to everyone who contributed here!
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1826666 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34255
Credit: 79,922,639
RAC: 80
Germany
Message 1826674 - Posted: 25 Oct 2016, 12:08:32 UTC
Last modified: 25 Oct 2016, 12:09:08 UTC

I don't think it's a hardware issue.
I have checked approx. 200 results of this host and it's merely device 0 and device 3.
So not one particular GPU.
90% of your results have a CPU usage between 30% and 50%, but some on those devices have nearly 100% or more.

The easy way would be to exclude one GPU, let's say device 2, via the cc_config file and test for 24 hours.
If the issue doesn't happen within this period, my theory is correct.
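
A check like the one described above could be scripted roughly as follows. This is a sketch only: the (device, CPU seconds, run seconds) tuples are hypothetical sample values, and the 0.9 threshold is an arbitrary stand-in for "nearly 100%".

```python
from collections import Counter


def high_cpu_by_device(results, threshold=0.9):
    """Count, per device, the results whose CPU time / run time ratio reaches the threshold."""
    counts = Counter()
    for device, cpu_seconds, run_seconds in results:
        if run_seconds > 0 and cpu_seconds / run_seconds >= threshold:
            counts[device] += 1
    return dict(counts)


# Hypothetical sample: (device number, CPU seconds, run seconds)
sample = [(0, 95.0, 100.0), (1, 40.0, 100.0), (3, 120.0, 100.0), (0, 35.0, 100.0)]
print(high_cpu_by_device(sample))  # {0: 1, 3: 1}
```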


With each crime and every kindness we birth our future.
ID: 1826674 · Report as offensive
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1826686 - Posted: 25 Oct 2016, 13:27:26 UTC - in response to Message 1826644.  

I decided to drop back to beta4 to see what happens. Seems like I lost all work in progress in doing this. Not sure why, since I used the installer.

<version_num>819</version_num> in v0.45_Beta5 doesn't exist in beta4
<version_num>812</version_num> is the highest in beta4

Look in:
MB8_win_x86_SSE2_OpenCL_ATi_HD5.aistub
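
To see in advance which version numbers would go missing when switching packages, the two sets of <version_num> values can be compared. A minimal sketch, assuming the text of the relevant files (for example the .aistub file named above, or app_info.xml) has already been read into strings; the function names are made up for illustration.

```python
import re


def version_nums(xml_text):
    """Collect every <version_num> value found in the given text."""
    found = re.findall(r"<version_num>\s*(\d+)\s*</version_num>", xml_text)
    return sorted({int(v) for v in found})


def missing_versions(installed_text, replacement_text):
    """Versions declared by the installed package but absent from its replacement."""
    return sorted(set(version_nums(installed_text)) - set(version_nums(replacement_text)))


beta5 = "<version_num>812</version_num> <version_num>819</version_num>"
beta4 = "<version_num>812</version_num>"
print(missing_versions(beta5, beta4))  # [819]
```

Any number this prints belongs to an app version that the replacement package no longer declares.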
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1826686 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1826758 - Posted: 26 Oct 2016, 5:26:46 UTC - in response to Message 1826674.  

I don't think it's a hardware issue.
I have checked approx. 200 results of this host and it's merely device 0 and device 3.
So not one particular GPU.
90% of your results have a CPU usage between 30% and 50%, but some on those devices have nearly 100% or more.

The easy way would be to exclude one GPU, let's say device 2, via the cc_config file and test for 24 hours.
If the issue doesn't happen within this period, my theory is correct.


Hi Mike, Thanks for the recommendation. I think it is a good approach. I will try this before I make any other changes. One point though, the device # did change when I reverted to beta 4, but I verified in HWInfo it was still the same GPU.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1826758 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1826759 - Posted: 26 Oct 2016, 5:27:35 UTC - in response to Message 1826686.  

I decided to drop back to beta4 to see what happens. Seems like I lost all work in progress in doing this. Not sure why, since I used the installer.

<version_num>819</version_num> in v0.45_Beta5 doesn't exist in beta4
<version_num>812</version_num> is the highest in beta4

Look in:
MB8_win_x86_SSE2_OpenCL_ATi_HD5.aistub


Thanks for the explanation. I will try to be more careful in the future.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1826759 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1826770 - Posted: 26 Oct 2016, 10:23:24 UTC - in response to Message 1826758.  

Hi Mike, Thanks for the recommendation. I think it is a good approach. I will try this before I make any other changes. One point though, the device # did change when I reverted to beta 4, but I verified in HWInfo it was still the same GPU.


I used cc_config to exclude device 2. I verified with HWInfo that a GPU other than the one of interest was now idle. I then verified that the original suspect GPU still only has sporadic loading. This was observed in GPUz and as long processing times in BOINC Manager.

I suspect the problem is related to drivers or Windows Updates. Since the latest cumulative update keeps failing, I will do a clean Win10Pro install this weekend.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1826770 · Report as offensive
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1826776 - Posted: 26 Oct 2016, 11:58:22 UTC - in response to Message 1826770.  
Last modified: 26 Oct 2016, 12:10:09 UTC

Post the contents (if any) of app_config.xml and mb_cmdline_win_x86_SSE2_OpenCL_ATi_HD5.txt

Also the relevant lines from Event Log (Ctrl+Shift+E) about BOINC GPU detection
and relevant lines from cc_config.xml where you did <exclude_gpu> or <ignore_nvidia_dev>

I think "the device # did change when I reverted to beta 4" is just some coincidence - if you mean that BOINC GPU detection changed and shows a different device # for the same GPU
 
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1826776 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1826779 - Posted: 26 Oct 2016, 12:17:37 UTC - in response to Message 1826776.  

app_config:
<app_config>
  <app>
    <name>setiathome_v8</name>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>1.4</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>astropulse_v7</name>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>1.4</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

mb_cmdline_win_x86_SSE2_OpenCL_ATi_HD5:
-v 1 -instances_per_device 1 -total_GPU_instances_num 5 -sbs 1024 -period_iterations_num 1 -no_defaults_scaling -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp

cc_config (I had to recreate it since I overwrote the test file):
<cc_config>
  <options>
    <use_all_gpus>1</use_all_gpus>
    <process_priority>3</process_priority>
    <exclude_gpu>
      <url>setiathome.berkeley.edu</url>
      <device_num>2</device_num>
    </exclude_gpu>
  </options>
</cc_config>
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1826779 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1826781 - Posted: 26 Oct 2016, 12:21:18 UTC - in response to Message 1826776.  

Event Log
26-Oct-16 7:52:00 PM | | Starting BOINC client version 7.6.33 for windows_x86_64
26-Oct-16 7:52:00 PM | | log flags: file_xfer, sched_ops, task
26-Oct-16 7:52:00 PM | | Libraries: libcurl/7.47.1 OpenSSL/1.0.2g zlib/1.2.8
26-Oct-16 7:52:00 PM | | Data directory: C:\ProgramData\BOINC
26-Oct-16 7:52:00 PM | | Running under account Rick
26-Oct-16 7:52:01 PM | | OpenCL: AMD/ATI GPU 0: AMD Radeon (TM) R9 Fury Series (driver version 1912.5 (VM), device version OpenCL 2.0 AMD-APP (1912.5), 4096MB, 4096MB available, 8192 GFLOPS peak)
26-Oct-16 7:52:01 PM | | OpenCL: AMD/ATI GPU 1: AMD Radeon (TM) R9 Fury Series (driver version 1912.5 (VM), device version OpenCL 2.0 AMD-APP (1912.5), 4096MB, 4096MB available, 8192 GFLOPS peak)
26-Oct-16 7:52:01 PM | | OpenCL: AMD/ATI GPU 2: AMD Radeon (TM) R9 Fury Series (driver version 1912.5 (VM), device version OpenCL 2.0 AMD-APP (1912.5), 4096MB, 4096MB available, 8192 GFLOPS peak)
26-Oct-16 7:52:01 PM | | OpenCL: AMD/ATI GPU 3: AMD Radeon (TM) R9 Fury Series (driver version 1912.5 (VM), device version OpenCL 2.0 AMD-APP (1912.5), 4096MB, 4096MB available, 8192 GFLOPS peak)
26-Oct-16 7:52:01 PM | | OpenCL: AMD/ATI GPU 4: AMD Radeon (TM) R9 Fury Series (driver version 1912.5 (VM), device version OpenCL 2.0 AMD-APP (1912.5), 4096MB, 4096MB available, 8192 GFLOPS peak)
26-Oct-16 7:52:01 PM | SETI@home | Found app_info.xml; using anonymous platform
26-Oct-16 7:52:01 PM | | Host name: Server01
26-Oct-16 7:52:01 PM | | Processor: 8 AuthenticAMD AMD FX-8370 Eight-Core Processor [Family 21 Model 2 Stepping 0]
26-Oct-16 7:52:01 PM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 htt pni ssse3 fma cx16 sse4_1 sse4_2 popcnt aes f16c syscall nx lm avx svm sse4a osvw ibs xop skinit wdt lwp fma4 tce tbm topx page1gb rdtscp bmi1
26-Oct-16 7:52:01 PM | | OS: Microsoft Windows 10: Professional x64 Edition, (10.00.14393.00)
26-Oct-16 7:52:01 PM | | Memory: 15.98 GB physical, 18.36 GB virtual
26-Oct-16 7:52:01 PM | | Disk: 111.24 GB total, 63.72 GB free
26-Oct-16 7:52:01 PM | | Local time is UTC +8 hours
26-Oct-16 7:52:01 PM | SETI@home | Found app_config.xml
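
For comparing BOINC's device numbering between runs, the GPU detection lines can be pulled out of a saved Event Log. A rough sketch: the regular expression is written against the line format shown above and may need adjusting for other BOINC versions or GPU vendors.

```python
import re

# Matches lines such as:
# "... | | OpenCL: AMD/ATI GPU 0: <name> (driver version <ver>, device version ..."
GPU_LINE = re.compile(r"OpenCL: AMD/ATI GPU (\d+): (.+?) \(driver version ([^,]+),")


def detected_gpus(log_text):
    """Return (device number, name, driver version) for each detected AMD/ATI GPU."""
    return [(int(num), name, driver.strip()) for num, name, driver in GPU_LINE.findall(log_text)]


log = (
    "26-Oct-16 7:52:01 PM | | OpenCL: AMD/ATI GPU 0: AMD Radeon (TM) R9 Fury Series "
    "(driver version 1912.5 (VM), device version OpenCL 2.0 AMD-APP (1912.5), 4096MB, "
    "4096MB available, 8192 GFLOPS peak)\n"
)
for gpu in detected_gpus(log):
    print(gpu)
```

Because all five cards report the same name and driver here, the log alone cannot tell which physical card got which number; it only confirms how many were detected and in what order.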
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1826781 · Report as offensive
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1826782 - Posted: 26 Oct 2016, 12:38:31 UTC - in response to Message 1826779.  
Last modified: 26 Oct 2016, 12:50:54 UTC

cc_config (I had to recreate it since I overwrote the test file):
<cc_config>
    <options>
        <use_all_gpus>1</use_all_gpus>
    	<process_priority>3</process_priority>
        <exclude_gpu>
            <url>setiathome.berkeley.edu</url>
            <device_num>2</device_num>
        </exclude_gpu>
    </options>
</cc_config>

Are you sure this <exclude_gpu> works?
Format is:
<exclude_gpu>
   <url>project_URL</url>
   [<device_num>N</device_num>]
   [<type>NVIDIA|ATI|intel_gpu</type>]
   [<app>appname</app>]
</exclude_gpu>

Your project_URL is not correct

Better try <ignore_ati_dev> (the AMD/ATI counterpart of <ignore_nvidia_dev>, since your GPUs are AMD), which is simpler.
You may ignore all except one GPU to see where/what the GPU load will be, and then change the numbers for the next test:

Test GPU 0:
<ignore_ati_dev>1</ignore_ati_dev>
<ignore_ati_dev>2</ignore_ati_dev>
<ignore_ati_dev>3</ignore_ati_dev>
<ignore_ati_dev>4</ignore_ati_dev>


Test GPU 1:
<ignore_ati_dev>0</ignore_ati_dev>

<ignore_ati_dev>2</ignore_ati_dev>
<ignore_ati_dev>3</ignore_ati_dev>
<ignore_ati_dev>4</ignore_ati_dev>


Test GPU 2:
<ignore_ati_dev>0</ignore_ati_dev>
<ignore_ati_dev>1</ignore_ati_dev>

<ignore_ati_dev>3</ignore_ati_dev>
<ignore_ati_dev>4</ignore_ati_dev>

...
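
The remaining test configurations follow the same pattern and can be generated instead of typed. A small sketch; the tag name is a parameter, and the default of ignore_ati_dev is an assumption based on the cards in this host being AMD (pass "ignore_nvidia_dev" for NVIDIA cards). The lines are meant to go inside <options> in cc_config.xml.

```python
def ignore_lines(test_device, total_devices=5, tag="ignore_ati_dev"):
    """Build the ignore lines that leave only `test_device` visible to BOINC."""
    return [
        f"<{tag}>{n}</{tag}>"
        for n in range(total_devices)
        if n != test_device
    ]


# Print one block per test run, for devices 0 to 4.
for device in range(5):
    print(f"Test GPU {device}:")
    print("\n".join(ignore_lines(device)))
    print()
```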


Also, I don't know why you need/want <process_priority>3</process_priority> (above normal) for CPU tasks
 
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1826782 · Report as offensive
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1826783 - Posted: 26 Oct 2016, 12:57:47 UTC - in response to Message 1826779.  

app_config:
<app_config>
  <app>
    <name>setiathome_v8</name>
  <gpu_versions>
    <gpu_usage>1</gpu_usage>
    <cpu_usage>1.4</cpu_usage>
  </gpu_versions>
  </app>
  <app>
    <name>astropulse_v7</name>
  <gpu_versions>
    <gpu_usage>1</gpu_usage>
    <cpu_usage>1.4</cpu_usage>
  </gpu_versions>
  </app>
</app_config>

With <cpu_usage>1.4</cpu_usage> you are asking BOINC to free 1.4 * 5 = 7 cores (of the 8 your CPU has)
This may sometimes cause BOINC to refuse to start 5 GPU tasks if there is more than one "High Priority" CPU task (in danger of missing its deadline)

On this computer with 5 GPU tasks (over 4 real FPUs) it is probably better not to run CPU tasks.
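
The arithmetic above can be written out as a quick check. This is a simplified sketch of the budget only, not of BOINC's actual scheduler; the function name is made up for illustration.

```python
def cpu_budget(total_cores, gpu_tasks, cpu_usage_per_gpu_task):
    """Return (cores reserved for GPU tasks, cores left for CPU tasks)."""
    reserved = round(gpu_tasks * cpu_usage_per_gpu_task, 6)
    free = round(max(total_cores - reserved, 0.0), 6)
    return reserved, free


reserved, free = cpu_budget(total_cores=8, gpu_tasks=5, cpu_usage_per_gpu_task=1.4)
print(f"reserved for GPU tasks: {reserved}, left for CPU tasks: {free}")
```

With the values from this app_config only one core is left over, so a second CPU task running at high priority would have to take a core that the GPU tasks are counting on.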
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1826783 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1826788 - Posted: 26 Oct 2016, 13:29:24 UTC - in response to Message 1826782.  

cc_config (I had to recreate it since I overwrote the test file):
<cc_config>
    <options>
        <use_all_gpus>1</use_all_gpus>
    	<process_priority>3</process_priority>
        <exclude_gpu>
            <url>setiathome.berkeley.edu</url>
            <device_num>2</device_num>
        </exclude_gpu>
    </options>
</cc_config>

Are you sure this <exclude_gpu> works?


Also, I don't know why you need/want <process_priority>3</process_priority> (above normal) for CPU tasks
 


The cc_config for excluding a GPU definitely works. I have just verified again.

The process priority increase only applies to GPU tasks. It probably makes no difference with the -hp option.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1826788 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1826790 - Posted: 26 Oct 2016, 13:36:29 UTC - in response to Message 1826783.  

With <cpu_usage>1.4</cpu_usage> you are asking BOINC to free 1.4 * 5 = 7 cores (of the 8 your CPU has)
This may sometimes cause BOINC to refuse to start 5 GPU tasks if there is more than one "High Priority" CPU task (in danger of missing its deadline)

On this computer with 5 GPU tasks (over 4 real FPUs) it is probably better not to run CPU tasks.


I tried with no CPU tasks and found no impact on the one underloaded GPU.

I still think this problem is related to software/drivers. It is a bit of a kludge to get 5 Nanos to work in this system. Crimson will fail to install with 5 in the system. Manual install of drivers will also fail. So I have to do a manual install with one unplugged, then plug it in and do a manual install with a different one unplugged. That was working fine until recently. One thing that may have impacted it is that Windows keeps trying and failing to install the latest cumulative update.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1826790 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.