Flakey AMD/ATI GPUs, including RX 5700 XT, Cross Validating, polluting the Database

Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3777
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2025080 - Posted: 26 Dec 2019, 16:00:47 UTC
Last modified: 2 Jan 2020, 10:44:33 UTC

Now that things are back to "normal" (quotes required around here) I've pestered the new additions. Here's the updated list, as it's now three pages back. (I will update the cc in Invalid Host Messaging as always... and please don't reply quoting the entire thing :^p )
(Edit: Thanks to Recedham, the latest person to reply and indicate they've disabled the affected GPU.)

李溪伦 9302807
[AfZ]TomServo1 1483720
achimbln 138625
Alexandr Galushchenko 9609912
alffrommars 9024750
antoi 10856207
aridhol 10288747
Arnab 10093567
Baldarov 9438496
Bigthor 480399
Borktron 10682716
Brandon 8198367
calendir 9663884
Camiron 7449359
Carl 914781
Christopher 9894096
CoffeeSloth 10266313
Crisu 7833612
dalex 10881818
Daniel Conrad Broom 8059986
Daniel frederikson 9813817
Daniel Penz 91581
Dank 49802
Derrek 219419
Doc_Jebus 10863878
dsharbour 10858679
Dzsozi 8002127
Earendil 146007
egon.sauter 494566
Eirikafh 10883218
Eric 9157146
eryndel 10878567
Esta 10624508
Foaming Mad Cow Industries 219464
fred 1935325
fredi 7913572
ghostbuster 564989
gunsnammo 137399
Haiko_N 9198068
HawkMedic 10838738
higemayuge 10790664
HMZ 9079227
Jeff 10639246
Jeffrey A. Smith 38247
Jerjes 1291426
JohnDoe 9166075
Jorge Barrera 9650295
Juraxell 10864786
Kekke 46817
knutella 9880098
lastsworder 10878688
lupaslupas 10002927
MadMikeDelta 8221690
Maulwurf 1516335
MaximusPrometheus 10240426
mgg 279419
mnelsonx 272885
Niflhuem 113140
No Name@Extraterrestrial Intelligence 8116
NYX.consulting 10503661
Oriah 9838773
Otosan 8547502
PantherJon 9801065
Peter Furlong 7965665
phoenix7477 10773411
Rafael 8249913
rame 10738
rAttmAniA 9002301
Recedham 954834
rgeens 10740140
Richard Hartland 9781177
Rocky 270621
Saint123 159425
Stephen Diem 36679
stogdan 10865456
StrayCat 177967
Strickland 34273
suhail ahmad 9878177
Swagstergo 10882690
T66 3336343
toby 9442798
TomasFraus 8445239
Tomik 8972653
Trezy 10367889
Tristan 9778349
vleermuis 1295921
VMS Software Inc 45538
werewolf_007 10880222
xakei 10823091
Zac 100334866

Italicized names have replied and indicated they are disabling their affected GPUs.
Struck-through names are confirmed to no longer be producing these bad results, i.e. via disabling GPU computing.
ID: 2025080
Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1382
Credit: 54,506,847
RAC: 60
United States
Message 2025149 - Posted: 27 Dec 2019, 4:56:35 UTC - in response to Message 2025080.  

Some bad results are still coming through when AMD machines with the bad drivers pair up 2 out of 3. At its worst the problem was on par with bad CPUs in terms of bad inserted results (which are usually a few out of every 100,000 results). Compared to the amount of data we lose to RFI it's small, but we still try to prevent it.

If we decide the database and the download server can handle the additional load, I might bump the overflow quorum to 4 next week.
@SETIEric@qoto.org (Mastodon)
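
For a rough sense of how often faulty hosts can outvote good ones, here is a minimal sketch. It assumes each result for a work unit independently comes from a faulty host with probability p, and that faulty hosts always agree with each other. The function name and the values of p, n and k are illustrative only; the thread does not state the fraction of affected hosts or the validator's exact agreement rule.

```python
from math import comb


def prob_faulty_agreement(p: float, n: int, k: int) -> float:
    """Probability that at least k of n results come from faulty hosts.

    Assumes (hypothetically) that each result independently comes from a
    faulty host with probability p, which gives a binomial tail.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    p = 0.001  # illustrative value only, not a measured fraction
    # (n, k) pairs are examples; which pair matches the real validator
    # is an assumption the thread does not settle.
    for n, k in [(2, 2), (3, 2), (4, 3)]:
        print(f"n={n}, k={k}: {prob_faulty_agreement(p, n, k):.3e}")
```

For small p and k = 2 the leading term is proportional to p squared, which fits the picture of a problem that is rare but not zero.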

ID: 2025149
Profile tazzduke
Volunteer tester
Joined: 15 Sep 07
Posts: 190
Credit: 28,269,068
RAC: 5
Australia
Message 2025157 - Posted: 27 Dec 2019, 7:49:15 UTC - in response to Message 2025149.  

Some bad results are still coming through when AMD machines with the bad drivers pair up 2 out of 3. At its worst the problem was on par with bad CPUs in terms of bad inserted results (which are usually a few out of every 100,000 results). Compared to the amount of data we lose to RFI it's small, but we still try to prevent it.

If we decide the database and the download server can handle the additional load, I might bump the overflow quorum to 4 next week.


Greetings Eric

Thank you for the update :-)

Mark
ID: 2025157
Profile Wiggo
Joined: 24 Jan 00
Posts: 35274
Credit: 261,360,520
RAC: 489
Australia
Message 2025289 - Posted: 28 Dec 2019, 4:33:45 UTC

A new 1 turned up for the 27th UTC here. :-(

Arnab 10093567

Cheers.
ID: 2025289
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3777
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2025314 - Posted: 28 Dec 2019, 12:18:57 UTC - in response to Message 2025289.  
Last modified: 28 Dec 2019, 14:50:18 UTC

Thanks again... two more were found by going through the wingies who "validated" that person's RX 5700 GPU results. All pestered and list updated. :^)
ID: 2025314
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22293
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2025332 - Posted: 28 Dec 2019, 16:30:47 UTC - in response to Message 2025157.  

I suspect we will see a few more RX5xxx GPUs appearing in the next couple of weeks as folks get the Christmas money spent - and with that there will be an increase in the probability of invalid x-validation. It would be interesting (but I suspect difficult) to know how many RX5xxx GPUs there are compared to the total number of active participants.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2025332
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2025354 - Posted: 28 Dec 2019, 20:47:23 UTC
Last modified: 28 Dec 2019, 20:52:27 UTC

It seems the quorum for overflows isn't working well.

https://setiathome.berkeley.edu/workunit.php?wuid=3807599912

Perhaps it's because my result WASN'T an overflow? I think this is what usually happens: only the 5700s are finding the overflow result, where the correct app is finding real signals.

Task        Computer  Sent                      Time reported             Status                        Run time (sec)  CPU time (sec)  Credit  Application
8373826078  8643138   26 Dec 2019, 8:42:00 UTC  26 Dec 2019, 8:47:16 UTC  Completed and validated       15.22           12.06           1.68    SETI@home v8 v8.22 (opencl_ati5_nocal) windows_intelx86
8373826079  8433872   26 Dec 2019, 8:41:58 UTC  27 Dec 2019, 1:06:15 UTC  Completed, marked as invalid  51.22           28.34           0.00    SETI@home v8 Anonymous platform (NVIDIA GPU)
8376632701  8872566   27 Dec 2019, 4:55:43 UTC  27 Dec 2019, 5:20:04 UTC  Completed and validated       16.33           10.97           1.68    SETI@home v8 v8.22 (opencl_ati_nocal) windows_intelx86

Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2025354
Profile Wiggo
Joined: 24 Jan 00
Posts: 35274
Credit: 261,360,520
RAC: 489
Australia
Message 2025355 - Posted: 28 Dec 2019, 20:54:26 UTC - in response to Message 2025354.  

It seems the quorum for overflows isn't working well.

https://setiathome.berkeley.edu/workunit.php?wuid=3807599912

perhaps it's because my result WASN'T an overflow? I think this is what usually happens, only the 5700s are finding the overflow result where the correct app is finding real signals.

8373826078 8643138 26 Dec 2019, 8:42:00 UTC 26 Dec 2019, 8:47:16 UTC Completed and validated      15.22 12.06 1.68 SETI@home v8 v8.22 (opencl_ati5_nocal) windows_intelx86
8373826079 8433872 26 Dec 2019, 8:41:58 UTC 27 Dec 2019, 1:06:15 UTC Completed, marked as invalid 51.22 28.34 0.00 SETI@home v8 Anonymous platform (NVIDIA GPU)
8376632701 8872566 27 Dec 2019, 4:55:43 UTC 27 Dec 2019, 5:20:04 UTC Completed and validated      16.33 10.97 1.68 SETI@home v8 v8.22 (opencl_ati_nocal) windows_intelx86

Well, you just found a new 1 there to add to the list, Ian. ;-)

Eirikafh 10883218

Cheers.
ID: 2025355
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2025356 - Posted: 28 Dec 2019, 21:02:15 UTC - in response to Message 2025354.  
Last modified: 28 Dec 2019, 21:12:29 UTC

It seems the quorum for overflows isn't working well.

A quorum of 3 does not work if the other 2 hosts are 5700s...

IMHO the quorum technique (3 or even more) does not actually solve the problem, it just makes it less frequent.

There are only 3 real fixes:

- Fix the driver - very unlikely in the short term.
- Do not allow a WU to be sent to 2 hosts with 5700s - requires changes to the server code, so not likely to be done either.
- Stop sending new work to this type of GPU, as was done in the past with Nvidia GPUs and VLARs - the simplest and most likely to be done of the 3 fixes, since most of the code changes are already done.
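
The third fix could be sketched roughly as follows. This is not the project's scheduler code: the Host class, the may_send function, the marker strings and the app name are all hypothetical, and the device names that real hosts report may look different.

```python
from dataclasses import dataclass

# Hypothetical substrings identifying the affected GPU family in the
# device name a host reports. Real device strings may differ.
BLOCKED_GPU_MARKERS = ("RX 5700", "RX 5500")

# Hypothetical set of app names the block applies to.
BLOCKED_APPS = {"setiathome_v8"}


@dataclass
class Host:
    host_id: int
    gpu_name: str


def may_send(host: Host, app_name: str, uses_gpu: bool) -> bool:
    """Return False when a GPU task of a blocked app would go to a blocked GPU."""
    if not uses_gpu:
        return True  # CPU tasks are not affected by the GPU driver
    if app_name not in BLOCKED_APPS:
        return True
    name = host.gpu_name.upper()
    return not any(marker in name for marker in BLOCKED_GPU_MARKERS)


if __name__ == "__main__":
    affected = Host(host_id=1, gpu_name="AMD Radeon RX 5700 XT")
    other = Host(host_id=2, gpu_name="GeForce GTX 1080")
    print(may_send(affected, "setiathome_v8", uses_gpu=True))
    print(may_send(other, "setiathome_v8", uses_gpu=True))
```

A filter like this acts before any work is sent, so it would not depend on how large the quorum is.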
ID: 2025356
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2025361 - Posted: 28 Dec 2019, 21:16:02 UTC - in response to Message 2025356.  

I think another option could be to add some more logic to the WU distribution, to make sure that a WU goes to different kinds of device rather than to the same kind of device twice.

Example: one WU goes to an Nvidia host, one goes to an AMD/ATI host. If they don't agree, the tie breaker goes out to a CPU app.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours
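
One way to picture this idea is the sketch below, assuming three device classes and that the third copy of a WU is always the tie breaker. The class names and the allowed_classes helper are made up for illustration and are not part of the BOINC server.

```python
from typing import Iterable, Set

# Hypothetical device classes; a real scheduler would derive these
# from the host's reported hardware and app version.
DEVICE_CLASSES = ("nvidia", "amd", "cpu")


def allowed_classes(already_assigned: Iterable[str]) -> Set[str]:
    """Device classes that may receive the next copy of a work unit.

    The first two copies must go to different device classes.
    Any later copy is treated as a tie breaker and goes to a CPU app.
    """
    assigned = list(already_assigned)
    if len(assigned) >= 2:
        return {"cpu"}
    return set(DEVICE_CLASSES) - set(assigned)


if __name__ == "__main__":
    print(sorted(allowed_classes([])))
    print(sorted(allowed_classes(["nvidia"])))
    print(sorted(allowed_classes(["nvidia", "amd"])))
```

Under this rule two hosts with the same kind of GPU can never form the initial pair, at the cost of fewer eligible hosts for each copy.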

ID: 2025361
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13778
Credit: 208,696,464
RAC: 304
Australia
Message 2025369 - Posted: 28 Dec 2019, 21:43:52 UTC - in response to Message 2025361.  

I think another option could be to add some more logic to the WU distribution, to make sure that a WU goes to different kinds of device rather than to the same kind of device twice.

Example: one WU goes to an Nvidia host, one goes to an AMD/ATI host. If they don't agree, the tie breaker goes out to a CPU app.

Been thinking along the same line myself.
Initial resends can go to anyone, but a second, 3rd etc. resend only goes to a CPU?

Would be a shame if necessary. I always look at resends as instant Credit: you're not waiting on your wingman, they've already returned their result. As soon as you return yours, it's pretty much instant validation & Credit.
Grant
Darwin NT
ID: 2025369
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22293
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2025378 - Posted: 28 Dec 2019, 22:02:33 UTC

One would have to be aware of any re-schedulers using RX5xxx GPUs.
From memory the runt of the code used to block VLARs(?) to nVidia GPUs is still there, so it might be just as easy to twist that a little bit to block sending tasks to that GPU family, and so cut the problem off at the source.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2025378
Profile Wiggo
Joined: 24 Jan 00
Posts: 35274
Credit: 261,360,520
RAC: 489
Australia
Message 2025380 - Posted: 28 Dec 2019, 22:08:34 UTC - in response to Message 2025378.  

One would have to be aware of any re-schedulers using RX5xxx GPUs.
From memory the runt of the code used to block VLARs(?) to nVidia GPUs is still there, so it might be just as easy to twist that a little bit to block sending tasks to that GPU family, and so cut the problem off at the source.

+1

Cheers.
ID: 2025380
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13778
Credit: 208,696,464
RAC: 304
Australia
Message 2025381 - Posted: 28 Dec 2019, 22:09:21 UTC - in response to Message 2025378.  
Last modified: 28 Dec 2019, 22:09:55 UTC

From memory the runt of the code used to block VLARs(?) to nVidia GPUs is still there, so it might be just as easy to twist that a little bit to block sending tasks to that GPU family, and so cut the problem off at the source.

Would it make blocking just one family of GPUs possible? That's the main question, and it would be the best method as it would allow quorums of 2 again for all WUs.

As I posted earlier in the thread, the problem will get much worse, as AMD released the RX 5500 XT a couple of weeks ago: a much cheaper video card using the same architecture and requiring the same problematic drivers.
Grant
Darwin NT
ID: 2025381
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2025392 - Posted: 29 Dec 2019, 0:02:16 UTC - in response to Message 2025361.  

I think another option could be to add some more logic to the WU distribution, to make sure that a WU goes to different kinds of device rather than to the same kind of device twice.

Example: one WU goes to an Nvidia host, one goes to an AMD/ATI host. If they don't agree, the tie breaker goes out to a CPU app.


. . Yes, making sure the tie breaker is not a 5xxx would be a good step. But it still doesn't eliminate the problem if the original hosts are both 5xxx so your idea is necessary so that doesn't happen.

Stephen

. .
ID: 2025392
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2025395 - Posted: 29 Dec 2019, 0:04:20 UTC - in response to Message 2025378.  

One would have to be aware of any re-schedulers using RX5xxx GPUs.
From memory the runt of the code used to block VLARs(?) to nVidia GPUs is still there, so it might be just as easy to twist that a little bit to block sending tasks to that GPU family, and so cut the problem off at the source.


. . That sounds best to me ...

Stephen

. .
ID: 2025395
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2025400 - Posted: 29 Dec 2019, 0:13:40 UTC

Here's a new one for your list Mr. Kevvy. Not an early overflow. Got robbed.

https://setiathome.berkeley.edu/workunit.php?wuid=3809612187

Swagstergo
https://setiathome.berkeley.edu/show_host_detail.php?hostid=8871372
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2025400
Profile Wiggo
Joined: 24 Jan 00
Posts: 35274
Credit: 261,360,520
RAC: 489
Australia
Message 2025403 - Posted: 29 Dec 2019, 0:33:39 UTC

And another 2 Xmas presents hit the list.

dalex 10881818

Daniel Penz 91581

Cheers.
ID: 2025403
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3777
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2025414 - Posted: 29 Dec 2019, 4:11:03 UTC
Last modified: 29 Dec 2019, 4:30:09 UTC

Eirikafh, Swagstergo, dalex and Daniel Penz have all been pestered, and thank you both! dalex is the first multi-RX participant I have seen, with three hosts, all with bad RX 5700s. As well, Peter Furlong has replied and indicated they have stopped producing the bad work units... thanks!

@Ian: Keith had also noted that some were not overflows. Your result was not, but both of the RX 5700 results that formed the quorum on that work unit were. If I see any more like this (i.e. from Dec. 28th on) I will pass them to Dr. Korpela, as this would indicate that the three-quorum isn't checking the two "valid" results for -9s but instead all of them, and clearly in most cases the good platform is not going to have that.

I'm also at a loss why the platform hasn't been blocked from v8 work (their AP results are actually good) but ¯\_(ツ)_/¯
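
The difference between the two checks described above can be sketched as follows. This is only an illustration of the hypothesis in this post, not the project's validator: the Result class and both function names are invented, and the flags for the three results are taken from what is said in the thread about work unit 3807599912 (the two RX 5700 results overflowed, the NVIDIA result did not).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    host_id: int
    overflow: bool  # True if the task ended as a -9 overflow
    agrees: bool    # True if the result is in the agreeing ("valid") group


def raise_quorum_checking_valid(results: List[Result]) -> bool:
    """Ask for a larger quorum when every agreeing result is an overflow."""
    agreeing = [r for r in results if r.agrees]
    return bool(agreeing) and all(r.overflow for r in agreeing)


def raise_quorum_checking_all(results: List[Result]) -> bool:
    """Ask for a larger quorum only when every returned result is an overflow."""
    return bool(results) and all(r.overflow for r in results)


if __name__ == "__main__":
    # Host IDs as listed for the work unit quoted earlier in the thread.
    wu = [
        Result(host_id=8643138, overflow=True, agrees=True),
        Result(host_id=8433872, overflow=False, agrees=False),
        Result(host_id=8872566, overflow=True, agrees=True),
    ]
    print(raise_quorum_checking_valid(wu))
    print(raise_quorum_checking_all(wu))
```

With these inputs the first check asks for more results and the second does not, which is the behaviour the post suspects.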
ID: 2025414
Profile tazzduke
Volunteer tester
Joined: 15 Sep 07
Posts: 190
Credit: 28,269,068
RAC: 5
Australia
Message 2025440 - Posted: 29 Dec 2019, 11:07:34 UTC
Last modified: 29 Dec 2019, 11:08:08 UTC

Greetings All

Yep, and they just keep coming. Here are 2 that hit me, which I also couldn't see in the earlier lists:

Dank https://setiathome.berkeley.edu/show_host_detail.php?hostid=8670499

Rafael https://setiathome.berkeley.edu/show_host_detail.php?hostid=8875140

Has anyone thought of the following,

Email the team founder that all these hosts respectively belong to and/or

Put an advisory on the home page of Seti@home regarding the situation with OpenCL and these cards.

Just thinking out aloud here :-)

Regards
Mark
ID: 2025440


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.