have more GPUs than actually exist

Profile Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 1994861 - Posted: 23 May 2019, 14:30:32 UTC - in response to Message 1994250.  

Sorry, just saw this reply.

I have had a number of systems and have never seen a problem with any NVIDIA boards. This problem (it still exists) seems to be "owned" by Windows and AMD, and first showed up with RX and S series boards. I had not seen it on HD7950s or 7850s.

I am guessing that the search for drivers in the client (gpu_detect.cpp, gpu_amd.cpp … gpu_opencl.cpp) uses OpenCL and CAL and (bigger guess) both of those return a driver and the algorithm selects both instead of just one. I was thinking of building a VS utility program that used just those files to see if I could debug it. IMHO this program has had to support so many platforms that it is difficult to pull out a few modules and build a test unit. I did notice recently that there are some sample programs on GitHub, including an "openclapp.cpp", and the full source for "clinfo" is always available, which could be used as a starter.

The most recent example of the problem was this:

Three S9000s and one S9100 were working fine, but the fan fell off the back of the S9000. The fan holder was a 3D-printed POS. I pulled the S9100 to see if I could do a better job of securing the fan. The system ran fine with three S9000s.

I secured the fan and rebooted with the "taped on fan". The Device Manager showed four S9000s, which is incorrect. I selected the first board and instructed Windows to update the driver using the AMD one that it had been using. After the update the correct GPUs were identified. Unfortunately, the BOINC client now saw 8 GPUs: two S9100s and six S9000s. There is clearly an AMD driver problem somewhere, but the client should not be using "phantom" GPUs. They were assigned tasks and seemed to be running, but from prior experience I know the tasks never finish, so I did my trick of editing that OpenCL info table (the coproc_info file) and then making it read-only. FWIW, I have multiple AMD boards under Ubuntu with AMD drivers and have not seen this problem. Restricted to Windows & AMD it seems.
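For anyone who wants to script the read-only part of that workaround, here is a minimal sketch in Python. It only toggles the write permission on a file you have already edited by hand. The helper names `set_read_only` and `is_read_only` are made up for illustration and are not part of BOINC, and the location of the coproc_info file on your machine is something you have to supply yourself.

```python
import os
import stat


def set_read_only(path, read_only=True):
    """Set or clear the write permission on a file.

    On Windows, os.chmod only toggles the read-only attribute,
    which is all this workaround needs.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    if read_only:
        mode &= ~write_bits      # drop every write bit
    else:
        mode |= stat.S_IWUSR     # give the owner write access back
    os.chmod(path, mode)


def is_read_only(path):
    """Return True when the owner write bit is cleared."""
    return not (os.stat(path).st_mode & stat.S_IWUSR)
```

Call `set_read_only(path)` after saving the edited file, and `set_read_only(path, read_only=False)` before the next edit. Stop the client first, otherwise it may rewrite the file before the attribute is set.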
ID: 1994861 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1994933 - Posted: 23 May 2019, 23:32:23 UTC - in response to Message 1994861.  
Last modified: 23 May 2019, 23:37:01 UTC

Restricted to Windows & AMD it seems.

And now NVidia, on Win10 Pro (v1803 build 17134.706)
I don't upgrade drivers unless I have to, so when I got a new RTX2060 there was no way the old driver would support it.

So I shut down BOINC, upgraded the driver (selecting the clean install and only the video, audio, and USB3 drivers), shut down the computer, installed the new card, and rebooted. Normally Windows will then find the new hardware, install the driver for it, and off you go. This time that didn't occur: the system booted with the default driver, with no detection of new hardware and no driver install, so I re-installed the driver manually (same options as before) and this time it recognised the hardware and installed OK. Rebooted again (out of habit as much as any other reason), still good.
Restarted BOINC and it continued on with its existing WUs. But after a few minutes I noticed that one of my GPUs was showing Fan stop, which only happens after it's been doing nothing for a while. BOINC showed 2 GPU WUs being processed. Ran GPU-Z, and lo and behold only the one card was processing work: the RTX 2060 in the primary PCIe x16 slot.
Then I remembered seeing this thread a while back, found it, checked my BOINC event log & there they were, multiple OpenCL entries for each card. I tried the suggestion in other threads of using an earlier driver, but 3 different drivers (back to the one before the original for the RTX 2060) made no difference.

I checked the Registry entry Richard suggested, but nothing there. Was hoping for further suggestions, as this has the potential to become somewhat of an issue as more people upgrade their hardware, and will require new drivers to do so.
The system is presently working nicely (thanks to your workaround), but it would be nice to have a fix, not a kludge to get around the problem.
Grant
Darwin NT
ID: 1994933 · Report as offensive
Profile Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 1994937 - Posted: 23 May 2019, 23:48:09 UTC - in response to Message 1994933.  
Last modified: 23 May 2019, 23:50:28 UTC

The system is presently working nicely (thanks to your workaround), but it would be nice to have a fix, not a kludge to get around the problem.


I have been successful using Revo Uninstaller with the advanced scan enabled. I make sure the driver I want is ready to be installed, and when the system starts to reboot I pull the ethernet cable to make sure Windows does not fetch an old driver. I also tried DDU and the so-called "clean" install that is an AMD option. I assume Revo will work just as well on NVIDIA as on AMD. It also removes things that are missing an uninstaller.

Revo is free. It worked so well (the free one) that I bought the portable version so I could just connect a USB stick to do a clean uninstall of anything. I did not bother with the uninstall this last time, as I routinely save the coproc_info file that works and it is convenient to copy and paste it into the BOINC folder.
ID: 1994937 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1994940 - Posted: 24 May 2019, 0:01:37 UTC - in response to Message 1994937.  

I pull the ethernet cable to make sure Windows does not fetch an old driver.

I used the Group Policy editor to stop Windows from updating drivers. That's one headache I don't need.
Grant
Darwin NT
ID: 1994940 · Report as offensive
Juha
Volunteer tester

Joined: 7 Mar 04
Posts: 388
Credit: 1,857,738
RAC: 0
Finland
Message 1995965 - Posted: 30 May 2019, 17:37:45 UTC

The next time someone has a problem with too many OpenCL devices detected.

Download Oblomov's clinfo and run it in Terminal / Command Prompt. Count the number of devices reported carefully. Note that the report includes both GPU and CPU devices (if you have drivers for those.)

If clinfo reports too many devices then something has gone wrong with the driver install and it's the vendor who needs to fix things. Go to the vendor's website and open a support ticket. You'll need to say which driver version you installed and exactly how you installed it. Include the clinfo output in the ticket as well, and a link to the clinfo website so that the vendor can easily find it for in-house retesting (though they probably already know about it).

And btw. Turns out HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors is for old drivers. Now it's more complicated and in the future still more complicated.
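Counting by eye is easy to get wrong on a long report, so here is a small sketch that does the count from saved clinfo text. It assumes the report contains "Number of platforms" and "Number of devices" lines (with or without a colon), as in the excerpts posted in this thread. Other clinfo builds may format things differently, and `count_clinfo_devices` is just an illustrative name.

```python
import re

# Accept both "Number of devices: 2" and "Number of devices      2".
_PLATFORMS = re.compile(r"^\s*Number of platforms:?\s+(\d+)", re.MULTILINE)
_DEVICES = re.compile(r"^\s*Number of devices:?\s+(\d+)", re.MULTILINE)


def count_clinfo_devices(text):
    """Return (platform_count, [device count per platform]) from clinfo text.

    platform_count is None when no "Number of platforms" line is found.
    """
    match = _PLATFORMS.search(text)
    platforms = int(match.group(1)) if match else None
    devices = [int(n) for n in _DEVICES.findall(text)]
    return platforms, devices
```

Save the report with something like `clinfo > report.txt`, read the file, and compare `sum(devices)` with the number of cards (plus any CPU device) physically in the box.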
ID: 1995965 · Report as offensive
Profile Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 1996052 - Posted: 31 May 2019, 2:37:32 UTC - in response to Message 1995965.  
Last modified: 31 May 2019, 2:56:35 UTC

The next time someone has a problem with too many OpenCL devices detected.

Download Oblomov's clinfo and run it in Terminal / Command Prompt. Count the number of devices reported carefully. Note that the report includes both GPU and CPU devices (if you have drivers for those.)

If clinfo reports too many devices then something has gone wrong with the driver install and it's the vendor who needs to fix things. Go to the vendor's website and open a support ticket. You'll need to say which driver version you installed and exactly how you installed it. Include the clinfo output in the ticket as well, and a link to the clinfo website so that the vendor can easily find it for in-house retesting (though they probably already know about it).

And btw. Turns out HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors is for old drivers. Now it's more complicated and in the future still more complicated.


I ran clinfo and posted some info back Feb 11 here
https://boinc.berkeley.edu/forum_thread.php?id=12830&postid=90028#90028

The problem then, and I assume now, is that the boinc gpu detect is finding different drivers and, I am guessing, assumes there is a GPU attached to each, and ends up thinking there are 2x as many GPUs as exist.

As you can see in the "message log", driver 2766.5 is on devices 0 and 1 and driver 26.71 (older driver) is on devices 2 and 3. In reality there were exactly two RX-560 cards at that time.
Note that Windows Device Manager sees only 2 GPUs and TechPowerUp's GPU-Z clearly shows only 2. My guess is that the gpudetect looks for GPUs two different ways (opencl and cal) and cal returns the older, unused drivers, and opencl returns the new drivers, so the GPU detect code thinks there are 4 GPUs where in reality there are only 2. That is just a guess. I would offer to help debug the problem using VS2017, but the kitchen sink et al. would have to be removed from the client. That is as unlikely as hell freezing over. The program that reads coproc_info.xml has access to Windows runtime modules (unlike OpenCL) and could easily enumerate the GPUs and compare the results to what OpenCL (or CAL) found, blessed, and stored in that coproc_info file.
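To make the cross-check idea concrete, here is a rough sketch of just the comparison step. It assumes you already have two lists of model names, one from the operating system and one from whatever OpenCL reported, and that the names have been normalised to match. How those lists are collected is left out, and `find_phantoms` is a hypothetical helper, not anything in the BOINC client.

```python
from collections import Counter


def find_phantoms(os_gpus, opencl_gpus):
    """Return {model: surplus} for models OpenCL lists more often than the OS.

    Counter subtraction keeps only positive differences, so models that
    match (or that OpenCL under-reports) do not appear in the result.
    """
    surplus = Counter(opencl_gpus) - Counter(os_gpus)
    return dict(surplus)
```

With the two-card case above, `find_phantoms(["RX 560"] * 2, ["RX 560"] * 4)` gives `{"RX 560": 2}`, i.e. two phantom devices. An empty dict means the two views agree.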
ID: 1996052 · Report as offensive
Juha
Volunteer tester

Joined: 7 Mar 04
Posts: 388
Credit: 1,857,738
RAC: 0
Finland
Message 1996105 - Posted: 31 May 2019, 18:16:49 UTC - in response to Message 1996052.  

My guess is that the gpudetect looks for GPUs two different ways (opencl and cal) and cal returns the older, unused drivers, and opencl returns the new drivers


Pretty sure CAL isn't the problem here or even involved in any way.

I ran clinfo and posted some info back Feb 11 here
https://boinc.berkeley.edu/forum_thread.php?id=12830&postid=90028#90028


As I said back then, the log was cut short. But it actually has enough information to show that it is a driver (install) problem.

Number of platforms: 2
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (2766.5)
Platform Name: AMD Accelerated Parallel Processing
...
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (2671.3)
Platform Name: AMD Accelerated Parallel Processing
...
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
...
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2


I'm guessing 2671.3 uses the key in Software and 2766.5 uses the key in CurrentControlSet, and that something went wrong with the transition so that now you have both. While the ICD loader has some code to handle multiple instances of the same driver, it isn't smart enough to handle this situation. Remove the key from Software and see what happens.
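The tell-tale in that excerpt is one platform name showing up with two different version strings. A sketch that flags this automatically is below. It assumes the "Platform Version" line comes before the matching "Platform Name" line, as in the excerpt above. A clinfo that prints the name first would need the pairing reversed. `duplicate_platforms` is an illustrative name only.

```python
import re
from collections import defaultdict

# Matches "Platform Name: ..." and "Platform Version: ..." (colon optional).
_FIELD = re.compile(r"^\s*Platform (Name|Version):?\s+(.*\S)\s*$")


def duplicate_platforms(text):
    """Return {platform name: sorted versions} for names seen with >1 version."""
    versions = defaultdict(set)
    pending = None  # last version seen that has not been paired with a name
    for line in text.splitlines():
        match = _FIELD.match(line)
        if not match:
            continue
        key, value = match.groups()
        if key == "Version":
            pending = value
        elif pending is not None:
            versions[value].add(pending)
            pending = None  # later bare "Platform Name" lines are ignored
    return {name: sorted(v) for name, v in versions.items() if len(v) > 1}
```

An empty result means every platform name was seen with a single version. Anything else is worth including in the ticket to the vendor.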
ID: 1996105 · Report as offensive
Profile Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 1999047 - Posted: 21 Jun 2019, 14:49:03 UTC - in response to Message 1996105.  
Last modified: 21 Jun 2019, 15:25:26 UTC

I'm guessing 2671.3 uses the key in Software and 2766.5 uses the key in CurrentControlSet, and that something went wrong with the transition so that now you have both. While the ICD loader has some code to handle multiple instances of the same driver, it isn't smart enough to handle this situation. Remove the key from Software and see what happens.


Started looking at this again. I bought a 4-in-1 riser with the idea of using 1xHD7950 + 3xS9000 + 1xS9100 + 2xRX560, as these 7 fit within the 850 W supply rating (it will be close). On first boot I had 10 GPUs, which I expected (twice as many as actually exist).

I have not got to the RX560s yet. I have a stable*** system with the HD7950 & S9x00 boards, but did find a problem.

I tried two AMD drivers; each generated a slightly different Software key:
[HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors]
"C:\\WINDOWS\\System32\\DriverStore\\FileRepository\\c0334912.inf_amd64_5cd8c2a7964a9949\\B334754\\amdocl64.dll"=dword:00000000
---and----
[HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors]
"C:\\WINDOWS\\System32\\DriverStore\\FileRepository\\c0313676.inf_amd64_96bbc33bec5c7fae\\amdocl64.dll"=dword:00000000


That key cannot be deleted. Even with BOINC not running, deleting the key (I exported it first) causes the blue screen where Windows gathers information to send back. When the system reboots the key is back in the registry, but that might be because the system died before the registry could be updated. I looked in the Event Viewer to see what was using that key but didn't see anything of value, just the normal error "last reboot was unexpected" or whatever.
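Since deleting the key is risky, a safer first step is to inspect the exported .reg text without touching the registry. The sketch below lists the DLL paths under the Vendors key from such an export. It assumes the usual .reg text layout shown above (quoted value name, `=dword:` data, doubled backslashes), and `icd_entries` is a made-up helper name. As far as I know, the loader treats a DWORD of 0 as "enabled".

```python
import re

# "<escaped path>"=dword:XXXXXXXX
_VALUE = re.compile(
    r'^"(?P<path>(?:[^"\\]|\\.)*)"=dword:(?P<data>[0-9a-fA-F]{8})\s*$'
)

VENDORS_KEY = r"HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors"


def icd_entries(reg_text, key=VENDORS_KEY):
    """Return [(dll_path, dword_value)] found under `key` in .reg export text."""
    entries = []
    in_key = False
    for raw in reg_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new section starts; remember whether it is the one we want.
            in_key = line.strip("[]").lower() == key.lower()
            continue
        match = _VALUE.match(line) if in_key else None
        if match:
            # Undo the .reg escaping of backslashes and quotes.
            path = match.group("path").replace("\\\\", "\\").replace('\\"', '"')
            entries.append((path, int(match.group("data"), 16)))
    return entries
```

If the export ever shows two enabled entries pointing at different amdocl64.dll copies, that would fit the "two drivers registered at once" theory. A single entry, as in the exports above, suggests the second registration lives somewhere else.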

I cannot find the other key you mentioned, the "CurrentControlSet" key. Must be in another thread?? I can try deleting that. Currently the system is working with 5 boards using my trick of editing that coproc info file and then making it read only.

*** I won't know for sure how stable the system is, as I am running only 2 concurrent tasks (Milkyway) per board until I find out why I cannot run 5 per board for very long.


[EDIT] Found a problem: the d0 GPU (one of the S9000s) has not been assigned a task. Instead, the two tasks it was to get were assigned to another GPU. GPU-Z shows d0 at 300 MHz (idle). I have seen this before: phantom tasks that never complete. In the coproc_info I edited, I simply deleted the last 5 devices, as I assumed they were duplicates. I will have to go back to the original coproc_info and find the correct d0 board. The board indexes I usually go with are 0.0 … 4.4, but I suspect the 4-in-1 riser messes with the index and the d0 board may be 1.0 or 5.5 instead of 0.0. Plus there is no telling if the first board listed in coproc_info is d0 or not. The net effect is that the index is not correct, one of the GPUs never gets assigned a task, and the two "phantom" tasks never complete.

6			6/21/2019 9:02:34 AM	Failed to delete old coproc_info.xml. error code -110	
7			6/21/2019 9:02:53 AM	OpenCL: AMD/ATI GPU 0: AMD FirePro S9000 (driver version 2841.5, device version OpenCL 1.2 AMD-APP (2841.5), 6144MB, 6144MB available, 3154 GFLOPS peak)	
8			6/21/2019 9:02:53 AM	OpenCL: AMD/ATI GPU 1: AMD FirePro S9000 (driver version 2841.5, device version OpenCL 1.2 AMD-APP (2841.5), 6144MB, 6144MB available, 3154 GFLOPS peak)	
9			6/21/2019 9:02:53 AM	OpenCL: AMD/ATI GPU 2: AMD FirePro S9100 (driver version 2841.5, device version OpenCL 2.0 AMD-APP (2841.5), 12288MB, 12288MB available, 4506 GFLOPS peak)	
10			6/21/2019 9:02:53 AM	OpenCL: AMD/ATI GPU 3: AMD FirePro S9000 (driver version 2841.5, device version OpenCL 1.2 AMD-APP (2841.5), 6144MB, 6144MB available, 3154 GFLOPS peak)	
11			6/21/2019 9:02:53 AM	OpenCL: AMD/ATI GPU 4: AMD Radeon HD 7900 Series (driver version 2841.5, device version OpenCL 1.2 AMD-APP (2841.5), 3072MB, 3072MB available, 3604 GFLOPS peak)
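A quick way to sanity-check a startup log like the one above is to tally the "OpenCL: ... GPU n:" lines by model and collect the driver versions. This is only a sketch that assumes the line layout shown in this post. The regex and the name `tally_gpus` are mine, not anything from the client.

```python
import re
from collections import Counter

_GPU_LINE = re.compile(
    r"OpenCL: (?P<vendor>[^:]+?) GPU (?P<index>\d+): "
    r"(?P<model>.+?) \(driver version (?P<driver>[^,]+),"
)


def tally_gpus(log_text):
    """Return (Counter of GPU models, set of driver versions) found in the log."""
    models = Counter()
    drivers = set()
    for match in _GPU_LINE.finditer(log_text):
        models[match.group("model")] += 1
        drivers.add(match.group("driver"))
    return models, drivers
```

More boards of a model than are physically installed, or more than one driver version in the set, would match the doubled-device symptom discussed in this thread.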

ID: 1999047 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.