have more GPUs than actually exist

Message boards : Number crunching : have more GPUs than actually exist
Profile Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 1989199 - Posted: 7 Apr 2019, 21:12:59 UTC
Last modified: 7 Apr 2019, 21:22:47 UTC

I posted about similar problems months ago on the BOINC forum, but they claim it is an AMD driver problem. I'm going to post it here, though it won't do much good, as they (BOINC) claim I am the only one with this problem. I do have a solution.

Using a really old Core 2 Quad motherboard that had five PCI-E slots, I first put in 4 RX560 GPUs on a fresh install of Windows 10 and got SETI crunching on 4 V8-ATI units. All was fine until I added the remaining RX560 for a total of 5.

On reboot, Windows Device Manager shows 5 RX560s, TechPowerUp's GPUz shows 5, and so does CPUID. All Windows apps show 5 GPUs but, unfortunately, BOINC 7.14.2 thinks I have 10 of them and assigns 10 tasks, 5 of which will never complete, as I have gone through this before.

My solution then and now was to edit coproc_info.xml, remove the extra 5 GPUs, and then make that file read-only so it cannot be changed back to 10 GPUs when BOINC restarts.

Hope this helps someone
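That workaround can be sketched as a short script. This is only an illustrative sketch, not BOINC code: the function names and REAL_GPU_COUNT are mine, and it assumes (as described in this thread) that the phantom entries are the <ati_opencl> blocks whose <device_num> is at or above the real GPU count.

```python
import os
import re
import stat

REAL_GPU_COUNT = 5  # hypothetical: the number of GPUs physically installed

def prune_duplicate_gpus(xml_text, real_count):
    """Drop <ati_opencl> blocks whose <device_num> is real_count or higher.

    coproc_info.xml is not guaranteed to be strictly well-formed XML, so this
    works on the raw text rather than using a strict XML parser.
    """
    def keep(match):
        block = match.group(0)
        m = re.search(r"<device_num>(\d+)</device_num>", block)
        if m and int(m.group(1)) >= real_count:
            return ""      # phantom duplicate -- remove it
        return block       # real device -- keep it
    return re.sub(r"<ati_opencl>.*?</ati_opencl>\n?", keep, xml_text,
                  flags=re.DOTALL)

def apply_workaround(path, real_count):
    """Rewrite coproc_info.xml (in the BOINC data directory), then mark it
    read-only so BOINC cannot rewrite it with the phantom GPUs on restart."""
    with open(path) as f:
        pruned = prune_duplicate_gpus(f.read(), real_count)
    with open(path, "w") as f:
        f.write(pruned)
    os.chmod(path, stat.S_IREAD)
```

Marking the file read-only is what stops the client regenerating the 10-GPU list; remember to unlock it again before changing the real GPUs.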


yes, 5 can be done
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1989201 - Posted: 7 Apr 2019, 21:23:28 UTC - in response to Message 1989199.  

It would help us BOINC developers to track down this problem if you could provide the exact details.

I'd like to see the Event Log entries showing the 10 devices detected at startup, please - an exact copy'n'paste, showing driver version numbers and suchlike.

I think you'll find help here from other setizens who have seen the same as you - you're not actually alone, but you need to collaborate with the wider community to get it sorted.

Remember you'll need to unlock coproc_info.xml if you ever decide to change your real GPUs again.
Profile Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 1989213 - Posted: 7 Apr 2019, 22:34:20 UTC
Last modified: 7 Apr 2019, 22:55:03 UTC

OK, I was able to duplicate the problem by renaming that coproc_info file and restarting BOINC.

First, I want to clarify what happened originally. The motherboard had one x16 slot and four x1 slots. I put an RX560 into the x16, which covered the adjacent x1; the other three x1 slots had risers for their RX560s. The monitor was connected to the RX560 in the x16 slot. After verifying SETI was crunching on 4 work units, I powered down, pulled the card from the x16 (which exposed the covered x1) and then used another pair of risers to get all 5 RX boards up and running. I suspect that if I had put all 5 GPUs into position at first, the problem might not have happened.

Here are the 10 work units supposedly being crunched. IMG
Here are the events corresponding to the 10 GPUs. TEXT
Here is the 10 GPU coproc_info.xml file. TEXT

Here are the 5 GPUs that reflect reality (from event file) TEXT
Here is the working fine coproc file. TEXT

If you compare the two coproc_info files you will see that I deleted the bottom five <ati_opencl> blocks.

After the edit, I marked the file read-only to prevent it from being re-created.

(one of the) Original discussion of problem HERE


[EDIT] That XML was edited to delete only the GPUs numbered from "5" on.

Windows and GPUz report only 5; BOINC is not getting the correct info.
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1989220 - Posted: 7 Apr 2019, 22:51:00 UTC - in response to Message 1989213.  

Many thanks for that. I'm on UTC+1, so ten minutes to midnight - on my way to bed. I'll have a proper look in the morning.
Profile Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 1989240 - Posted: 8 Apr 2019, 2:09:22 UTC
Last modified: 8 Apr 2019, 2:10:46 UTC

I left ".xml" off of one of the URLs; corrected below.
Here is the 10 GPU coproc_info.xml file. TEXT

Unaccountably, the device numbers go from 0 to 9 instead of 0 to 4. I have no idea how the AMD driver could cause this problem. I originally thought of debugging this behavior using VS2017 and the BOINC GitHub sources, but the problem is probably in the OpenCL (or CUDA??) library that BOINC uses to enumerate the display devices. Apparently BOINC cannot directly ask Windows for this info, unlike Device Manager, GPUz, or CPUID.
Profile bloodrain
Volunteer tester
Joined: 8 Dec 08
Posts: 231
Credit: 28,112,547
RAC: 1
Antarctica
Message 1989250 - Posted: 8 Apr 2019, 5:35:41 UTC - in response to Message 1989240.  

Thank you. Yeah, for some reason when you have more than 1 GPU it acts strange: not showing the stated amount, or an outright wrong spec for a card.
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1989267 - Posted: 8 Apr 2019, 12:07:44 UTC
Last modified: 8 Apr 2019, 12:08:16 UTC

I have had that issue, in some sense of the word, on both an Nvidia and an AMD GPU (2400G). My fix was to downgrade to significantly lower driver versions. Since I don't care that I am not running the very latest, it was another reasonable solution.

What both of my excess gpus had in common is they were under Windows 10.

Tom
A proud member of the OFA (Old Farts Association).
Profile Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 1989288 - Posted: 8 Apr 2019, 14:29:42 UTC
Last modified: 8 Apr 2019, 14:47:10 UTC

I have a small boinc farm but I generally shut most of them down when summer starts as it is hot here in south Texas. I was experimenting with an old core2 quad Q9550s (low power) with five of those low power RX560 GPUs and put together a web based program that anyone can use to calculate efficiency. I estimate my "breadboard" system as using 45*5 + 65 + 110 (GPUs +CPU + remainder & loss) = 400 watts for a total of 1,253 watts per credit. Stats can be seen HERE. I was going to let this low power system run all summer.

I do have an APC that can show exact power consumption but it is being used on another system.

[EDIT] For some reason I had to take the www out of the above URLs. Some sites require the www; others seem not to. Something does not seem right, since my previous posts used www and they are working, unlike the above in the preview. If the above don't work in your browser then add www to the address. If I am not calculating the wattage correctly in the program then PM me. The sources are at GitHub under my name.
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1989298 - Posted: 8 Apr 2019, 15:56:58 UTC - in response to Message 1989240.  

I left ".xml" off of one of the URLs; corrected below.
Here is the 10 GPU coproc_info.xml file. TEXT

Unaccountably, the device numbers go from 0 to 9 instead of 0 to 4. I have no idea how the AMD driver could cause this problem. I originally thought of debugging this behavior using VS2017 and the BOINC GitHub sources, but the problem is probably in the OpenCL (or CUDA??) library that BOINC uses to enumerate the display devices. Apparently BOINC cannot directly ask Windows for this info, unlike Device Manager, GPUz, or CPUID.
Note that each virtual GPU has both a <device_num> and a <opencl_device_index>. The device numbers go from 0 to 9: the device indexes go from 0 to 4 and then restart, 0 to 4 again.

As Juha - who is a very experienced developer - said in the 'previous' discussion, BOINC evaluates devices through software: it doesn't pretend to have direct hardware detection code. My analysis would be that BOINC enumerates the available drivers, and then runs through each driver, enumerating the devices it reports.

<device_num> is an internal number, created and used only by BOINC
<opencl_device_index> is an external number, reported by the OpenCL stack component from one or more driver installations. We can perhaps suggest to BOINC developers that the enumeration code should watch for and flag duplicated device_index numbers. There is already a complicated process for trying to uniquely identify devices which are both CUDA capable and OpenCL capable, so that BOINC doesn't try to run both a CUDA app and an OpenCL app on the same silicon at the same time.
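The duplicate-index check suggested here could look something like the sketch below. This is hypothetical, not BOINC source: find_duplicate_indexes is my name, and the input is assumed to be a device list already parsed out of coproc_info.xml into dicts with the two fields described above.

```python
def find_duplicate_indexes(devices):
    """devices: list of dicts with 'device_num' and 'opencl_device_index'.

    Returns the opencl_device_index values seen more than once -- the
    signature of one physical GPU being enumerated twice, e.g. by two
    driver installations exposing the same OpenCL devices.
    """
    seen, dupes = set(), set()
    for d in devices:
        idx = d["opencl_device_index"]
        if idx in seen:
            dupes.add(idx)
        seen.add(idx)
    return sorted(dupes)

# The 10-GPU case from this thread: device_num runs 0-9,
# but opencl_device_index runs 0-4 and then restarts.
devices = [{"device_num": n, "opencl_device_index": n % 5} for n in range(10)]
# find_duplicate_indexes(devices) -> [0, 1, 2, 3, 4]
```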

In the previous thread, Juha suggested that you inspect HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors, but I can't see any reply to that particular question. When I look at that key on my machine here, I see

[HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors]
"IntelOpenCL64.dll"=dword:00000000
"C:\\Windows\\System32\\nvopencl.dll"=dword:00000000
showing how two OpenCL libraries can co-exist. It would be worth checking that, since you seem to have found a workaround for the problem but not yet isolated the root cause.
Profile Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 1989299 - Posted: 8 Apr 2019, 16:09:32 UTC - in response to Message 1989298.  
Last modified: 8 Apr 2019, 16:10:40 UTC

thanks for looking Richard!


In the previous thread, Juha suggested that you inspect HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors, but I can't see any reply to that particular question. When I look at that key on my machine here, I see

[HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors]
"IntelOpenCL64.dll"=dword:00000000
"C:\\Windows\\System32\\nvopencl.dll"=dword:00000000

showing how two OpenCL libraries can co-exist. It would be worth checking that, since you seem to have found a workaround for the problem but not yet isolated the root cause.


I have only the AMD driver at that location

C:\WINDOWS\System32\DriverStore\FileRepository\c0340998.inf_amd64_4e7ad8ec950b7e37\B340755\amdocl64.dll dword:0
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 1989300 - Posted: 8 Apr 2019, 16:12:32 UTC

It might be worthwhile to run DDU and then re-install the drivers fresh, on a clean slate, to eliminate the possibility that old drivers are causing this problem.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1989301 - Posted: 8 Apr 2019, 16:18:09 UTC - in response to Message 1989288.  

I have a small boinc farm but I generally shut most of them down when summer starts as it is hot here in south Texas. I was experimenting with an old core2 quad Q9550s (low power) with five of those low power RX560 GPUs and put together a web based program that anyone can use to calculate efficiency. I estimate my "breadboard" system as using 45*5 + 65 + 110 (GPUs +CPU + remainder & loss) = 400 watts for a total of 1,253 watts per credit. Stats can be seen HERE. I was going to let this low power system run all summer.
There's something strange about that argument, with both the numbers and the units. You seem to have used rated TDP values for the power consumption, measuring the maximum possible power draw (needed to specify a safe power cabling solution), rather than the actual power draw during use. Some time ago, I put together this little table of power consumption, taken from a Kill A Watt meter measuring the mains input to the system case only:

Idle - BOINC not running:		22 watts
Running NumberFields on 4 cores:	55 watts
Running SETI x64 AVX on 4 cores:	69 watts
ditto at VHAR:				71 watts
That's a full i5-6500 CPU @ 3.20GHz system with SSD and HDD. Currently, that meter is showing 125 watts maximum, with the 4 cores, the Intel GPU, and a NVidia GTX 1050Ti all running.

To discuss watts and credits in the same breath, you have to take the time dimension into account: 'watts' is an instantaneous measurement, 'credit' is earned over a period of time. Your utility company will bill you in kilowatt-hours: you'd probably measure credits in watt-seconds, aka Joules.

for some reason I had to take the www out of the above urls. Some sites require the www others seem not to.
That would depend how many versions of the address have been registered on the DNS servers for the domain.
Profile Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 1989303 - Posted: 8 Apr 2019, 16:33:00 UTC - in response to Message 1989301.  

There's something strange about that argument, with both the numbers and the units. You seem to have used rated TDP values for the power consumption, measuring the maximum possible power draw (needed to specify a safe power cabling solution), rather than the actual power draw during use. Some time ago, I put together this little table of power consumption, taken from a Kill A Watt meter measuring the mains input to the system case only:

Idle - BOINC not running:		22 watts
Running NumberFields on 4 cores:	55 watts
Running SETI x64 AVX on 4 cores:	69 watts
ditto at VHAR:				71 watts
That's a full i5-6500 CPU @ 3.20GHz system with SSD and HDD. Currently, that meter is showing 125 watts maximum, with the 4 cores, the Intel GPU, and a NVidia GTX 1050Ti all running.

To discuss watts and credits in the same breath, you have to take the time dimension into account: 'watts' is an instantaneous measurement, 'credit' is earned over a period of time. Your utility company will bill you in kilowatt-hours: you'd probably measure credits in watt-seconds, aka Joules.


Bear with me for a sec: your i5-6500 is at
http://setiathome.berkeley.edu/results.php?hostid=8121358&offset=0&show_names=0&state=4&appid=

Your wattage of 125 is nice; I was just guessing on my system. Putting the above URL into my program calculates 11.94 seconds for a single credit, which gives 1,492 joules expended during those 12 or so seconds. Lemme know what you think.
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1989305 - Posted: 8 Apr 2019, 16:43:23 UTC - in response to Message 1989303.  
Last modified: 8 Apr 2019, 16:52:04 UTC

Ah - I see what you've done.

125 watts of power x 11.94 seconds per credit = 1,492.5 watt-seconds, or joules, of energy.

But if I come upstairs and suspend SETI (which was running on the 1050TI), the meter only drops from 125 watts to 80 watts - so the 1050Ti alone is drawing 45 watts (below rated TDP), and the marginal cost of a SETI credit is only 537.3 joules.
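The watts-times-seconds arithmetic in this exchange is easy to capture in code. A sketch only: the function name is mine, while the 125 W, 80 W and 11.94 s/credit figures come from the posts above.

```python
def energy_per_credit(power_watts, seconds_per_credit):
    """Energy (joules) = instantaneous power (watts) x time per credit (seconds)."""
    return power_watts * seconds_per_credit

# Whole system at the wall: 125 W for 11.94 s -> 1,492.5 J per credit.
whole_system = energy_per_credit(125, 11.94)

# Marginal cost of the GPU alone: the meter drops from 125 W to 80 W when
# SETI is suspended, so the card draws 45 W -> 537.3 J per credit.
gpu_marginal = energy_per_credit(125 - 80, 11.94)
```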
Profile bloodrain
Volunteer tester
Joined: 8 Dec 08
Posts: 231
Credit: 28,112,547
RAC: 1
Antarctica
Message 1991383 - Posted: 25 Apr 2019, 4:13:00 UTC - in response to Message 1989305.  

On video cards: I have 3 dual-video-card setups, and the power draw referenced on the box / spec site is wrong under load at times. My 580 from PowerColor draws differently than what's specced on the site.
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1991396 - Posted: 25 Apr 2019, 6:00:54 UTC - in response to Message 1991383.  

The TDP power spec that graphics cards companies publish is for graphics loads in games. Different animal when the card is doing compute loads.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Profile -= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 1991560 - Posted: 26 Apr 2019, 13:27:51 UTC

Oooh nice find..

This is my 2080Ti Host

I took 230W of usage as I'm using the CPU core to feed the GPU.

Avg: 31.7 30.0 63.4
STD: 54.4 54.2 119.4

0.50 seconds per credit from above info one device
1.9998 Credits per second for one device
Times shown above were divided by number of concurrent tasks(1)
7,199 number of credits in an hour this system
170 total watts used by a single producing device (avg each work unit)
7,199 credits per hour for exactly one device(1 tasks)
A kilowatt hour will theoretically produce maximum of 42,348 credits each device this PC
Use the above KWH credits to compare this device with any other device
as the overhead (idle) has been removed
Actual credit product is less because the GPU has idle time between tasks
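The credits-per-kWh normalization in the output above can be reproduced like so. A sketch only: the function name is mine, while the 7,199 credits/hour and 170 W figures are from the output.

```python
def credits_per_kwh(credits_per_hour, device_watts):
    """Normalize throughput by power draw: credits produced per kilowatt-hour,
    so devices with different wattages can be compared directly."""
    return credits_per_hour / (device_watts / 1000.0)

# 7,199 credits/hour at 170 W -> roughly 42,347 credits per kWh,
# in line with the ~42,348 figure shown in the output above.
```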

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1991698 - Posted: 27 Apr 2019, 13:28:06 UTC - in response to Message 1991560.  

V0.99 is coming. You can run 2 at a time to fill the initialisation and post processing gap. GPU part is run one at a time.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1991700 - Posted: 27 Apr 2019, 13:41:39 UTC - in response to Message 1991698.  

V0.99 is coming. You can run 2 at a time to fill the initialisation and post processing gap. GPU part is run one at a time.


Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1991706 - Posted: 27 Apr 2019, 14:47:40 UTC - in response to Message 1991700.  

V0.99 is coming. You can run 2 at a time to fill the initialisation and post processing gap. GPU part is run one at a time.


What Juan said. WoW! I guess you managed to wrangle the code snippet oddbjornik threw in here for pre-initialization.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.