ROCm 1.8

Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1936392 - Posted: 20 May 2018, 11:31:23 UTC

I finally have my 1950X/RX-Vega64x4 machine back to running on Linux. After I swapped the triple ProDuo GPUs for a quad block of Vega64s, the latest drivers from the AMD site did not support Vega on Threadripper, as they were using ROCm 1.6. I installed ROCm 1.8 as described on the ROCm GitHub site and it is working fine. The only issue is that the cards are identified as "[4] AMD Device 687f (8176MB) OpenCL: 1.2" for this computer, Eos. It doesn't seem to be causing any problems, though.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1936392 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34253
Credit: 79,922,639
RAC: 80
Germany
Message 1936396 - Posted: 20 May 2018, 12:40:38 UTC

Looks nice Rick.

Maybe you should upgrade the kernel.
I'm running 4.15.8 on my Mint 18 partition.


With each crime and every kindness we birth our future.
ID: 1936396 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1936406 - Posted: 20 May 2018, 13:46:22 UTC - in response to Message 1936396.  

Looks nice Rick.

Maybe you should upgrade the kernel.
I'm running 4.15.8 on my Mint 18 partition.


I considered loading Ubuntu 18.04, but it looks like ROCm 1.8 was built for 16.04. Maybe not an issue, but I wanted to keep this attempt simple since my last attempt failed. Hope the next version is built for 18.04.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1936406 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1936420 - Posted: 20 May 2018, 15:52:20 UTC - in response to Message 1936406.  

My older Linux machines running on Ubuntu 16.04 are using the latest 4.15.0-20 kernels which have all the Ryzen/TR fixes/patches baked in. That is the kernel that got installed in my newest Ubuntu 18.04 LTS machine.
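
For anyone comparing their own machine against the kernel versions mentioned so far (4.15.8 on Mint 18, 4.15.0-20 on Ubuntu), here is a minimal sketch. The helper name `parse_kernel_release` is made up for illustration, and it only looks at the leading numeric part of the release string, so distro build suffixes such as `-20` are ignored.

```python
import platform
import re

def parse_kernel_release(release):
    """Return (major, minor, patch) from a string such as '4.15.0-20-generic'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))

# Versions quoted in this thread.
MINT_KERNEL = parse_kernel_release("4.15.8")
UBUNTU_KERNEL = parse_kernel_release("4.15.0-20")

# platform.release() returns the running kernel release on Linux.
running = parse_kernel_release(platform.release())
if running is None:
    print("Could not parse kernel release:", platform.release())
else:
    print("Running kernel:", running, "| at least 4.15.0:", running >= UBUNTU_KERNEL)
```

Tuples compare element by element, so `(4, 15, 8) > (4, 15, 0)` holds without any extra logic.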
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1936420 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1936487 - Posted: 21 May 2018, 0:59:37 UTC

I woke this morning to 2 invalids and 1 computation error. I may have to drop back to Win10... I will let it run until the outage to see if it continues.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1936487 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1936513 - Posted: 21 May 2018, 8:16:12 UTC

That's a lot of inconclusives you have there as well as a couple more invalids.

Cheers.
ID: 1936513 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1936514 - Posted: 21 May 2018, 8:35:52 UTC - in response to Message 1936513.  

That's a lot of inconclusives you have there as well as a couple more invalids.

Cheers.

I agree. Here is the task page for the same machine running Windows: https://setiathome.berkeley.edu/results.php?hostid=8507353. I also noticed a difference in the command line arguments and will align them with what is used in Windows when I get back to the machine this evening.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1936514 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34253
Credit: 79,922,639
RAC: 80
Germany
Message 1936519 - Posted: 21 May 2018, 10:14:04 UTC

You should remove -pref_wg_num_per_cu from the command line of your Linux install.
That's the biggest difference between your Windows and Linux tasks.


With each crime and every kindness we birth our future.
ID: 1936519 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1936521 - Posted: 21 May 2018, 11:27:58 UTC - in response to Message 1936519.  
Last modified: 21 May 2018, 11:31:37 UTC

You should remove -pref_wg_num_per_cu from the command line of your Linux install.
That's the biggest difference between your Windows and Linux tasks.


I just updated it to be nearly identical: everything matches except -tt 500 vs -tt 600. Looks like I am still getting inconclusives...
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1936521 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34253
Credit: 79,922,639
RAC: 80
Germany
Message 1936523 - Posted: 21 May 2018, 11:39:26 UTC

You should also change -tune 1 from 2.1.18 to -tune 1 4 4 16.


With each crime and every kindness we birth our future.
ID: 1936523 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1936524 - Posted: 21 May 2018, 11:43:56 UTC - in response to Message 1936523.  

You should also change -tune 1 from 2.1.18 to -tune 1 4 4 16.


Yes, I have changed that. Everything except -tt is aligned with the Windows install for this machine. So far, I have no invalids with the aligned command line, but I do have a few inconclusives, so it is hard to tell if the problem persists. I have it set to not download new tasks in case I need to boot to Windows.
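
Checking that two app command lines are aligned is easy to get wrong by eye, so here is a rough sketch of a flag-by-flag comparison. The two strings are hypothetical: they are built only from the flags named in this thread (-tt and -tune), not the full command lines from either install, and which -tt value belongs to which OS is not stated here, so they are simply labeled a and b.

```python
def parse_args(cmdline):
    """Group tokens into {flag: (values...)}; a flag is a dash followed by a letter."""
    options = {}
    current = None
    for token in cmdline.split():
        if len(token) > 1 and token[0] == "-" and token[1].isalpha():
            current = token
            options[current] = ()
        elif current is not None:
            options[current] += (token,)
    return options

def diff_args(a, b):
    """Return {flag: (values_in_a, values_in_b)} for flags that differ; None means absent."""
    opts_a, opts_b = parse_args(a), parse_args(b)
    return {
        flag: (opts_a.get(flag), opts_b.get(flag))
        for flag in sorted(set(opts_a) | set(opts_b))
        if opts_a.get(flag) != opts_b.get(flag)
    }

# Hypothetical command lines, for illustration only.
cmdline_a = "-tt 500 -tune 1 4 4 16"
cmdline_b = "-tt 600 -tune 1 4 4 16"
print(diff_args(cmdline_a, cmdline_b))  # {'-tt': (('500',), ('600',))}
```

A flag present on only one side shows up with None for the other, which would catch a leftover option such as -pref_wg_num_per_cu.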
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1936524 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1936539 - Posted: 21 May 2018, 14:32:23 UTC

Looks like it still has issues so I booted the machine to Windows. I will do some more research...
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1936539 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1936653 - Posted: 22 May 2018, 11:42:24 UTC

I am hoping that an application expert can comment on the differences I am seeing in the OpenCL platform information between ROCm 1.8 and the latest AMD Windows drivers.

Here are the ROCm details, with items different from Windows colored in red and additional items in yellow:
Max compute units: 64
Max work group size: 256
Max clock frequency: 1630Mhz
Max memory allocation: 7287183769
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 8573157376
Constant buffer size: 7287183769
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 65536
Queue properties:
Out-of-Order: No
Profiling timer offset: 1
Global free memory: 1
SIMD per compute unit: 4
SIMD width: 16
SIMD instruction width: 1
Wavefront width: 64
Global mem channels: 64
Global mem channel banks: 4
Global mem channel bank width: 256
Local mem size per compute unit: 65536
Local mem banks: 32
Thread trace supported: No
Board Name: Device 687f

Name: gfx900
Vendor: Advanced Micro Devices, Inc.
Driver version: 2617.0 (HSA1.1,LC)
Version: OpenCL 1.2
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program

Following is the platform information from the latest AMD Windows driver:
Max compute units: 64
Max work group size: 256
Max clock frequency: 1630Mhz
Max memory allocation: 3221225472
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 3221225472
Constant buffer size: 3221225472
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Queue properties:
Out-of-Order: No
Name: gfx900
Vendor: Advanced Micro Devices, Inc.
Driver version: 2580.6 (PAL,HSAIL)
Version: OpenCL 1.2 AMD-APP (2580.6)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event cl_amd_liquid_flash
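
Since the two listings are long, a small script can pick out the differences instead of comparing them by eye. This is only a sketch: the two strings below are excerpts of the listings above (the extension lists in particular are trimmed to a few entries), and `parse_device_info` is a made-up helper, not part of BOINC or ROCm.

```python
def parse_device_info(text):
    """Parse 'Key: value' lines into a dict; lines without a colon are skipped."""
    info = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info

# Excerpts copied from the two listings (not the full dumps).
ROCM = """
Max memory allocation: 7287183769
Global memory size: 8573157376
Local memory size: 65536
Driver version: 2617.0 (HSA1.1,LC)
Extensions: cl_khr_fp64 cl_khr_fp16 cl_khr_subgroups cl_khr_depth_images
"""
WINDOWS = """
Max memory allocation: 3221225472
Global memory size: 3221225472
Local memory size: 32768
Driver version: 2580.6 (PAL,HSAIL)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_fp16 cl_khr_spir
"""

rocm, windows = parse_device_info(ROCM), parse_device_info(WINDOWS)

# Scalar fields that differ, as {key: (rocm_value, windows_value)}.
changed = {
    key: (rocm.get(key), windows.get(key))
    for key in sorted(set(rocm) | set(windows))
    if key != "Extensions" and rocm.get(key) != windows.get(key)
}

# Extensions are space-separated, so compare them as sets.
rocm_ext = set(rocm["Extensions"].split())
windows_ext = set(windows["Extensions"].split())
rocm_only = sorted(rocm_ext - windows_ext)
windows_only = sorted(windows_ext - rocm_ext)

for key, (r, w) in changed.items():
    print(f"{key}: ROCm={r} Windows={w}")
print("Only in ROCm:", rocm_only)
print("Only in Windows:", windows_only)
```

Pasting the full listings into the two strings would give the complete diff, including the extra ROCm-only fields such as Board Name.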

It seems the most significant difference is that ROCm 1.8 on Linux shows the actual 8GB of memory on the GPU, whereas Windows shows something much smaller, ~3GB, probably limited by 32-bit. Is this what is causing the invalid results? Any recommendations on how to deal with it?
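
The byte counts are easier to judge after conversion. A quick check of the numbers from the two listings, as plain arithmetic with nothing assumed beyond the values quoted above:

```python
MIB = 2**20
GIB = 2**30

rocm_global = 8573157376      # "Global memory size" reported under ROCm 1.8
rocm_max_alloc = 7287183769   # "Max memory allocation" reported under ROCm 1.8
windows_global = 3221225472   # "Global memory size" reported by the Windows driver

print(rocm_global // MIB, "MiB")                # 8176 MiB
print(windows_global / GIB, "GiB")              # 3.0 GiB
print(windows_global < 2**32)                   # True
print(round(rocm_max_alloc / rocm_global, 2))   # 0.85
```

So the ROCm figure is exactly the 8176MB that BOINC shows for the cards, and the Windows figure is exactly 3 GiB, which fits inside a 32-bit address space. That is consistent with a 32-bit limit but does not by itself prove that is the cause.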
ID: 1936653 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34253
Credit: 79,922,639
RAC: 80
Germany
Message 1936656 - Posted: 22 May 2018, 12:15:54 UTC
Last modified: 22 May 2018, 12:17:30 UTC

First of all, the Linux app is 64-bit whilst the Windows app is 32-bit.
Of course the Linux driver can access the full memory range of the GPU.
I'm almost sure this is not related to your invalids.

I would suggest running for a few hours without any app args to see if the invalids stop.
But it could be the driver itself, because it doesn't identify your GPUs correctly IMHO.


With each crime and every kindness we birth our future.
ID: 1936656 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1936658 - Posted: 22 May 2018, 12:25:46 UTC - in response to Message 1936656.  

First of all, the Linux app is 64-bit whilst the Windows app is 32-bit.
Of course the Linux driver can access the full memory range of the GPU.
I'm almost sure this is not related to your invalids.

I would suggest running for a few hours without any app args to see if the invalids stop.
But it could be the driver itself, because it doesn't identify your GPUs correctly IMHO.


Thanks for the insight on the 32/64-bit differences. For the name, I think maybe BOINC is picking up the new parameter "Board Name:" instead of "Name:". Both Windows and the new ROCm 1.8 have the same value for Name.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1936658 · Report as offensive
Rob

Joined: 7 Apr 12
Posts: 9
Credit: 951,019
RAC: 0
Germany
Message 1936661 - Posted: 22 May 2018, 12:30:24 UTC
Last modified: 22 May 2018, 12:30:43 UTC

Don't really have an idea if this problem can be tweaked away - I have opened a ticket at ROCm though, maybe there is a bug in their OpenCL implementation?
https://github.com/RadeonOpenCompute/ROCm/issues/423

Might be worth keeping an eye on that in case they need some more diagnostic info
ID: 1936661 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34253
Credit: 79,922,639
RAC: 80
Germany
Message 1936663 - Posted: 22 May 2018, 12:38:53 UTC - in response to Message 1936661.  
Last modified: 22 May 2018, 12:39:23 UTC

Don't really have an idea if this problem can be tweaked away - I have opened a ticket at ROCm though, maybe there is a bug in their OpenCL implementation?
https://github.com/RadeonOpenCompute/ROCm/issues/423

Might be worth keeping an eye on that in case they need some more diagnostic info


This is possible of course.


With each crime and every kindness we birth our future.
ID: 1936663 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1936666 - Posted: 22 May 2018, 12:54:55 UTC - in response to Message 1936661.  

Don't really have an idea if this problem can be tweaked away - I have opened a ticket at ROCm though, maybe there is a bug in their OpenCL implementation?
https://github.com/RadeonOpenCompute/ROCm/issues/423

Might be worth keeping an eye on that in case they need some more diagnostic info


Thanks for posting. I will keep an eye on the thread. I was thinking of giving the non-SoG version of the app a try.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1936666 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1936667 - Posted: 22 May 2018, 12:58:02 UTC

The ROCm 1.8 package includes a tool to monitor and configure the GPUs: rocmsmi.py
Here is a summary report of the current state of my GPUs:
====================    ROCm System Management Interface    ====================
================================================================================
 GPU  Temp    AvgPwr   SCLK     MCLK     Fan      Perf    SCLK OD
  3   39.0c   175.0W   1630Mhz  945Mhz   0.0%     auto      0%       
  1   43.0c   144.0W   1630Mhz  945Mhz   0.0%     auto      0%       
  2   37.0c   190.0W   1630Mhz  945Mhz   0.0%     auto      0%       
  0   44.0c   178.0W   1630Mhz  945Mhz   0.0%     auto      0%       
================================================================================
====================           End of ROCm SMI Log          ====================

Definitely useful!
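
For logging this over time, the summary table can be parsed with a few lines of Python. This is a sketch that assumes the exact column layout shown above (GPU index, temperature with a trailing "c", average power with a trailing "W"); other rocm-smi versions may print a different layout, and `parse_smi` is a made-up helper, not part of the ROCm package.

```python
SMI_OUTPUT = """\
 GPU  Temp    AvgPwr   SCLK     MCLK     Fan      Perf    SCLK OD
  3   39.0c   175.0W   1630Mhz  945Mhz   0.0%     auto      0%
  1   43.0c   144.0W   1630Mhz  945Mhz   0.0%     auto      0%
  2   37.0c   190.0W   1630Mhz  945Mhz   0.0%     auto      0%
  0   44.0c   178.0W   1630Mhz  945Mhz   0.0%     auto      0%
"""

def parse_smi(text):
    """Return {gpu_index: {'temp_c': float, 'power_w': float}} from the summary table."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with the numeric GPU index; header and rule lines do not.
        if len(parts) >= 3 and parts[0].isdigit():
            rows[int(parts[0])] = {
                "temp_c": float(parts[1].rstrip("c")),
                "power_w": float(parts[2].rstrip("W")),
            }
    return rows

gpus = parse_smi(SMI_OUTPUT)
total_power = sum(g["power_w"] for g in gpus.values())
hottest = max(gpus, key=lambda i: gpus[i]["temp_c"])

print(f"Total average power: {total_power:.1f} W")   # 687.0 W
print(f"Hottest GPU: {hottest} at {gpus[hottest]['temp_c']} C")
```

For the report above this gives 687 W across the four cards, with GPU 0 the warmest at 44.0c.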
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1936667 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1936673 - Posted: 22 May 2018, 13:31:59 UTC - in response to Message 1936666.  

Don't really have an idea if this problem can be tweaked away - I have opened a ticket at ROCm though, maybe there is a bug in their OpenCL implementation?
https://github.com/RadeonOpenCompute/ROCm/issues/423

Might be worth keeping an eye on that in case they need some more diagnostic info


Thanks for posting. I will keep an eye on the thread. I was thinking of giving the non-SoG version of the app a try.


Can someone point me to the latest Linux non-SoG AMD MB app?
Thanks!
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1936673 · Report as offensive
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.