Flakey AMD/ATI GPUs, including RX 5700 XT, Cross Validating, polluting the Database

Tomcat雄猫
Joined: 20 Dec 14
Posts: 9
Credit: 391,588
RAC: 19
Canada
Message 2022937 - Posted: 13 Dec 2019, 3:27:21 UTC - in response to Message 2022930.  

According to Ryan Smith from AnandTech, no. The OpenCL drivers are still garbage.
ID: 2022937

lastsworder
Joined: 9 Dec 19
Posts: 1
Credit: 13,014
RAC: 0
China
Message 2022940 - Posted: 13 Dec 2019, 4:14:21 UTC - in response to Message 2022695.  

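Understood, I have disabled the GPU. But this problem needs to be solved; after all, the 5700's compute performance should still be quite considerable.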
ID: 2022940

Tomcat雄猫
Joined: 20 Dec 14
Posts: 9
Credit: 391,588
RAC: 19
Canada
Message 2022956 - Posted: 13 Dec 2019, 5:57:11 UTC - in response to Message 2022940.  

收到,已经禁用GPU,但是得解决这个问题,毕竟5700的计算能力应该还是很可观的。

Translation: I've received the message and disabled my GPU. However, this issue must be resolved, since the computational capabilities of the RX5700 are quite impressive.
ID: 2022956

Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13331
Credit: 208,696,464
RAC: 304
Australia
Message 2022958 - Posted: 13 Dec 2019, 6:11:44 UTC - in response to Message 2022937.  

According to Ryan Smith from AnandTech, no. The OpenCL drivers are still garbage.
Ryan Smith on RX 5xxx OpenCL support.
While I’m including compute performance for the sake of completeness here, the compute situation on Navi has not substantially changed since the launch of the Radeon RX 5700 series over 5 months ago. AMD’s Adrenaline 2020 software has improved the state of their OpenCL drivers slightly – there are fewer hard crashes and performance is up in some cases – but their drivers are still dysfunctional and not fit for production use. In particular, Folding@Home and parts of CompuBench are still unable to run.

AMD is aware of the issue, unfortunately they don’t have any updates to offer on the situation. I am of the distinct impression that AMD has made OpenCL on Windows a low priority for now, and has opted to focus their software efforts on bringing up additional Navi GPUs, as well as improving Navi gaming performance and continuing to develop their ROCm platform for Linux. So anyone looking to do GPU compute on AMD’s GPUs would best be served by using Vega or Polaris cards if they’re using Windows, or switching to Linux for these matters.

Grant
Darwin NT
ID: 2022958

Justin Turner Arthur
Joined: 20 Oct 03
Posts: 12
Credit: 3,929,052
RAC: 2
United States
Message 2022959 - Posted: 13 Dec 2019, 7:12:47 UTC - in response to Message 2022958.  

Unfortunately, the stock multibeam client excludes the ROCm OpenCL runtime in its plan class at the moment, so you'd have to roll your own to use the card on ROCm.
ID: 2022959

Mr. Kevvy (Crowdfunding Project Donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3422
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2022999 - Posted: 13 Dec 2019, 18:11:00 UTC

Thanks to lastsworder for pointing out that AMD just released the cheaper RX 5500 XT yesterday, which is probably going to have the same issue, and more of them may turn up here due to their affordability.
ID: 2022999

TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2023005 - Posted: 13 Dec 2019, 19:04:42 UTC

If you want to see how the plain RX 5700 would work at SETI, just look at this host; it seems the RX 5700s work fine in macOS Catalina: https://setiathome.berkeley.edu/results.php?hostid=8592369&offset=80
Unfortunately, it appears he stopped producing work on that machine some time ago.
The best action to take on these cards is simply to restrict the WUs so that only one AMD GPU is allowed per MB task in Windows, basically the same as is already being done with the Astropulse tasks on Main. It shouldn't be that difficult to extend the restriction to include Multibeam. If the problem is ever fixed, the same code would still allow AMD work to be produced.
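A minimal Python sketch of the "only one AMD GPU per MB task in Windows" rule described above. This is not the project's scheduler code; the `Host` type, its fields, and the `may_assign` helper are hypothetical names used only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical description of a volunteer host."""
    host_id: int
    os: str          # e.g. "windows", "linux", "darwin"
    gpu_vendor: str  # e.g. "amd", "nvidia", or "" for CPU-only


def is_restricted(host: Host) -> bool:
    """True for the class of hosts the post proposes to limit."""
    return host.os == "windows" and host.gpu_vendor == "amd"


def may_assign(candidate: Host, already_assigned: list[Host]) -> bool:
    """Allow at most one restricted host among those working on one task."""
    if not is_restricted(candidate):
        return True
    return not any(is_restricted(h) for h in already_assigned)
```

With a rule like this, two Windows AMD hosts can never be each other's only wingman, while AMD work keeps flowing if the driver problem is later fixed.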
ID: 2023005

Ian&Steve C.
Joined: 28 Sep 99
Posts: 4243
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2023007 - Posted: 13 Dec 2019, 19:53:16 UTC - in response to Message 2023005.  

I was kind of wondering how other platforms were performing. Are there any RX5700s on Linux? Are they producing bad results too?
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2023007

Wiggo
Joined: 24 Jan 00
Posts: 26177
Credit: 261,360,520
RAC: 489
Australia
Message 2023012 - Posted: 13 Dec 2019, 20:19:02 UTC - in response to Message 2023007.  

I was kind of wondering how other platforms were performing. Are there any RX5700s on Linux? Are they producing bad results too?
None that I've come across as yet; they've all been Windows rigs.

Cheers.
ID: 2023012

Wiggo
Joined: 24 Jan 00
Posts: 26177
Credit: 261,360,520
RAC: 489
Australia
Message 2023034 - Posted: 13 Dec 2019, 23:59:32 UTC

[AfZ]TomServo1 1483720
achimbln 138625
antoi 10856207
Baldarov 9438496
Brandon 8198367
Camiron 7449359
Christopher 9894096
CoffeeSloth 10266313
Derrek 219419
dsharbour 10858679
Dzsozi 8002127
Earendil 146007
egon.sauter 494566
Foaming Mad Cow Industries 219464
fred 1935325
ghostbuster 564989
HawkMedic 10838738
higemayuge 10790664
HMZ 9079227
Jeff 10639246
Jeffrey A. Smith 38247
Jerjes 1291426
Jorge Barrera 9650295
Kekke 46817
lastsworder 10878688
lupaslupas 10002927
Maulwurf 1516335
MaximusPrometheus 10240426
mnelsonx 272885
Niflhuem 113140
No Name@Extraterrestrial Intelligence 8116
Oriah 9838773
Otosan 8547502
Peter Furlong 7965665
phoenix7477 10773411
rgeens 10740140
Saint123 159425
Stephen Diem 36679
stogdan 10865456
StrayCat 177967
Strickland 34273
suhail ahmad 9878177
toby 9442798
Tomik 8972653
Trezy 10367889
Tristan 9778349
VMS Software Inc 45538
xakei 10823091
Zac 10033486
Damn, the mongrels are into me today with a few doing me out of credit and I've 5 more names to add to that list. :-)

calendir 9663884
eryndel 10878567
PantherJon 9801065
Rocky 270621
T66 3336343
ID: 2023034

Mr. Kevvy (Crowdfunding Project Donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3422
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2023035 - Posted: 14 Dec 2019, 0:55:45 UTC - in response to Message 2023034.  
Last modified: 14 Dec 2019, 2:46:51 UTC

calendir 9663884
eryndel 10878567
PantherJon 9801065
Rocky 270621
T66 3336343


Pestered and thank you again! Er... no need in future to quote the whole list. It's getting annoyingly long and I don't see it getting shorter anytime soon. :^)

Edit: I found half a dozen or so more from checking valids on the host queues of the known bad IDs, sent pesterposts and updated my original list on the last page.

One of those computers (T66's) demonstrated perfectly why this is such a deceptively severe issue, which I think the admins here are missing. It only had 8 invalids, but of the valids, 2 were from the GPU, so it is managing to find another AMD RX to cross-validate with a quarter of the time, which seems very disproportionate to the small percentage of AMD RX Windows rigs out there.
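As a rough check of the "quarter of the time" figure, here is a small Python sketch. It assumes all 8 invalids came from the GPU, which the post does not state explicitly; the function name is made up for illustration.

```python
def cross_validation_rates(invalid: int, valid_gpu: int) -> tuple[float, float]:
    """Return (valid-to-invalid ratio, valid share of all GPU results)."""
    ratio = valid_gpu / invalid
    share = valid_gpu / (valid_gpu + invalid)
    return ratio, share


ratio, share = cross_validation_rates(invalid=8, valid_gpu=2)
print(f"valid : invalid = {ratio:.2f}")             # 0.25
print(f"valid share of GPU results = {share:.2f}")  # 0.20
```

Read as a ratio of valid to invalid results the figure is a quarter; read as a share of all 10 GPU results it is a fifth. Either way it is far above what a small population of such cards would suggest.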
ID: 2023035

Ian&Steve C.
Joined: 28 Sep 99
Posts: 4243
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2023036 - Posted: 14 Dec 2019, 1:20:01 UTC

the longer this goes on, the more RX5xxx cards will be in the hands of users, the more cards will be on the project, further increasing the chances that they find another RX5xxx card to cross validate with.
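A back-of-the-envelope Python sketch of why this grows quickly. It assumes two results per workunit, each from an independently and uniformly chosen host, and that two faulty results always agree with each other; the fractions below are illustrative, not measured, and the function name is hypothetical.

```python
import math


def polluted_workunits(total_workunits: int, faulty_fraction: float) -> float:
    """Expected workunits whose two results both come from faulty hosts.

    Simplified model: each of the two results comes from a faulty host
    with probability `faulty_fraction`, independently of the other.
    """
    return total_workunits * faulty_fraction ** 2


# Illustrative fractions only; the real share of faulty hosts is unknown here.
for f in (0.005, 0.01, 0.02):
    n = polluted_workunits(1_000_000, f)
    print(f"{f:.1%} faulty hosts -> about {n:.0f} of 1,000,000 workunits")
```

Under these assumptions the number of wrongly validated workunits scales with the square of the faulty share, so doubling the number of affected cards roughly quadruples the damage.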
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2023036

Wiggo
Joined: 24 Jan 00
Posts: 26177
Credit: 261,360,520
RAC: 489
Australia
Message 2023037 - Posted: 14 Dec 2019, 1:20:41 UTC - in response to Message 2023035.  
Last modified: 14 Dec 2019, 1:21:09 UTC

Pestered and thank you again! Er... no need in future to quote the whole list. It's getting annoyingly long and I don't see it getting shorter anytime soon. :^)

One of those computers (T66's) demonstrated perfectly why this is such a deceptively severe issue, which I think the admins here are missing. It only had 8 invalids, but of the valids, 2 were from the GPU, so it is managing to find another AMD RX to cross-validate with a quarter of the time, which seems very disproportionate to the small percentage of AMD RX Windows rigs out there.
Yeah, T66 and higemayuge teamed up against me for 3 invalids here, and yesterday's download problems meant that I've been teamed up with several users of those cards over quite a large number of tasks.

Yes I expect that list to get very long if these cards are allowed to continue on in the way that they currently are and the science to be further corrupted the longer they stay. :-(

Cheers.
ID: 2023037

Mr. Kevvy (Crowdfunding Project Donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3422
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2023055 - Posted: 14 Dec 2019, 3:13:31 UTC
Last modified: 14 Dec 2019, 3:14:06 UTC

I had a look at the Stderr of these cross-validated results and I think that the reason that they are not being actioned to any great degree is that they are all finishing with result overflow -9. As far as I know, work units like this are thrown out as noise.

So although we are unfortunately losing a lot of observations, as well as wasting large amounts of processing and bandwidth, I don't know if it's actually going to taint the end results (other than by their absence.)
ID: 2023055

Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13139
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2023060 - Posted: 14 Dec 2019, 4:01:48 UTC - in response to Message 2023055.  

On my invalid tasks involving cross-validated AMD results I have had five tasks that were not an overflow. Granted that was 5 out of 14, so there are a lot of overflowed tasks. But somewhere along the way either Eric or Richard stated that even overflows were useful science even when they were noisy. They do inject noisy "birdies" on purpose, you know.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2023060

Mr. Kevvy (Crowdfunding Project Donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3422
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2023069 - Posted: 14 Dec 2019, 9:01:58 UTC - in response to Message 2023060.  
Last modified: 14 Dec 2019, 10:13:59 UTC

Perfect... thanks Keith. That certainly answers that yes, bad data is going to get into the final database and eventually Nebula.

Edit: Also, Niflhuem has responded and disabled GPU computing. It was a given, but I also found this post, which shows that other BOINC projects are affected (as would be expected, since OpenCL is, as far as I know, used for all AMD GPU computing).
ID: 2023069

rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 20771
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2023073 - Posted: 14 Dec 2019, 9:24:45 UTC

Hopefully when Nebula does its thing the "bad data" being puked by these RX5700s will be seen as single events and so slung into the waste bin. It only becomes a problem if the same frequency/location pair comes up on its second scan with a "valid" result, or if a perfectly good "old" result gets slung because it isn't paired due to results from these GPUs.
It is worth remembering that the actual data "tapes" are, as far as I'm aware, kept, so it would (in theory) be possible to re-run the suspect data, having blocked RX5700s from getting anywhere near it. The sooner RX5700s are blocked as a class (the same as happened a few years back when there were issues with nVidia GPUs and VLARs) the better it will be for all (apart from RX5700 owners).
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2023073

MagicEye
Volunteer tester
Joined: 19 Sep 99
Posts: 70
Credit: 40,327,877
RAC: 75
Germany
Message 2023075 - Posted: 14 Dec 2019, 9:26:47 UTC

Is there any way to contact the people responsible for the project?
Are they aware of the problem?
Is any solution in progress?

In the last few days I have seen some tasks that were sent out not only to 1 other host but to 2. On the one hand, a very good idea.
On the other hand, the 2 5700 GPUs still overruled the other 2 PCs. :(

The good thing is that Astropulse seems to run without error and really quite fast, about 1 cr per second of run time. With these tasks, and in most cases also quite good AMD CPUs, they get a lot of credits. Maybe that's the reason they don't see that the 5700 cards are producing so much waste.
ID: 2023075

rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 20771
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2023077 - Posted: 14 Dec 2019, 9:30:16 UTC

The person to contact is Eric Korpela, and I'm pretty sure from earlier correspondence he is already aware of the situation. I know he monitors the forum, but I also know he's been very busy with things outside the atmosphere just now. Next would be Jeff Cobb.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2023077

Wiggo
Joined: 24 Jan 00
Posts: 26177
Credit: 261,360,520
RAC: 489
Australia
Message 2023083 - Posted: 14 Dec 2019, 9:54:48 UTC

The buggers are coming out of the woodwork like termites looking for more to chew on here and I'll have more to add to the list in the morning. :-(

Cheers.
ID: 2023083

©2022 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.