Flakey AMD/ATI GPUs, including RX 5700 XT, Cross Validating, polluting the Database
Author | Message |
---|---|
Mr. Kevvy Joined: 15 May 99 Posts: 3806 Credit: 1,114,826,392 RAC: 3,319 |
Now that things are back to "normal" (quotes required around here) I've pestered the new additions. Here's the updated list, as it's now three pages back (I will update the cc in Invalid Host Messaging as always, and please don't reply quoting the entire thing :^p ). (Edit: Thanks to Recedham, the latest person to reply and indicate they've disabled the affected GPU.) 李溪伦 9302807, achimbln 138625, Alexandr Galushchenko 9609912, antoi 10856207, Arnab 10093567, Bigthor 480399, Borktron 10682716, calendir 9663884, Christopher 9894096, CoffeeSloth 10266313, Crisu 7833612, dalex 10881818, Daniel Conrad Broom 8059986, Daniel frederikson 9813817, Daniel Penz 91581, Dank 49802, Doc_Jebus 10863878, Dzsozi 8002127, Earendil 146007, egon.sauter 494566, Eirikafh 10883218, Eric 9157146, eryndel 10878567, Esta 10624508, Foaming Mad Cow Industries 219464, fred 1935325, fredi 7913572, ghostbuster 564989, Haiko_N 9198068, HawkMedic 10838738, higemayuge 10790664, HMZ 9079227, Jeff 10639246, Jerjes 1291426, JohnDoe 9166075, Jorge Barrera 9650295, Juraxell 10864786, Kekke 46817, knutella 9880098, lupaslupas 10002927, MadMikeDelta 8221690, MaximusPrometheus 10240426, mnelsonx 272885, Niflhuem 113140, No Name@Extraterrestrial Intelligence 8116, Oriah 9838773, Otosan 8547502, PantherJon 9801065, Peter Furlong 7965665, phoenix7477 10773411, Rafael 8249913, rame 10738, rAttmAniA 9002301, rgeens 10740140, Richard Hartland 9781177, Rocky 270621, Saint123 159425, Stephen Diem 36679, StrayCat 177967, Strickland 34273, suhail ahmad 9878177, Swagstergo 10882690, T66 3336343, toby 9442798, TomasFraus 8445239, Tomik 8972653, Trezy 10367889, vleermuis 1295921, VMS Software Inc 45538, werewolf_007 10880222, xakei 10823091. Italicized names have replied and indicated they are disabling their affected GPUs. |
Eric Korpela Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60 |
Some bad results are still coming through when AMD machines with the bad drivers pair up 2 out of 3. At its worst the problem was on par with bad CPUs in terms of bad inserted results (which are usually a few out of every 100,000 results). Compared to the amount of data we lose to RFI it's small, but we still try to prevent it. If we decide the database and the download server can handle the additional load, I might bump the overflow quorum to 4 next week. @SETIEric@qoto.org (Mastodon) |
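To illustrate the "pair up 2 out of 3" failure mode described above, here is a minimal sketch of majority-style validation. It is a toy model, not the actual SETI@home validator: the function name `pick_canonical`, the string labels, and the strict-majority rule are all assumptions made for illustration.

```python
from collections import Counter

def pick_canonical(results, quorum):
    """Toy validator (hypothetical): return the value that a strict
    majority of the returned results agree on, or None if there is
    no majority or fewer than `quorum` results have come back."""
    if len(results) < quorum:
        return None  # not enough results returned yet
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None

# Two hosts with the bad driver return the same bogus overflow;
# one good host finds real signals. With 3 results the bad pair wins.
print(pick_canonical(["overflow", "overflow", "signals"], quorum=3))

# With 4 results (2 bad, 2 good) the bad pair no longer has a majority.
print(pick_canonical(["overflow", "overflow", "signals", "signals"], quorum=4))
```

In this toy model a larger quorum only helps while the bad hosts stay in the minority of a work unit's results, so it reduces the failure rate rather than eliminating it.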
tazzduke Joined: 15 Sep 07 Posts: 190 Credit: 28,269,068 RAC: 5 |
Some bad results are still coming through when AMD machines with the bad drivers pair up 2 out of 3. At its worst the problem was on par with bad CPUs in terms of bad inserted results (which are usually a few out of every 100,000 results). Compared to the amount of data we lose to RFI it's small, but we still try to prevent it. Greetings Eric, thank you for the update :-) Mark |
Wiggo Joined: 24 Jan 00 Posts: 36772 Credit: 261,360,520 RAC: 489 |
A new 1 turned up for the 27th UTC here. :-( Arnab 10093567 Cheers. |
Mr. Kevvy Joined: 15 May 99 Posts: 3806 Credit: 1,114,826,392 RAC: 3,319 |
|
rob smith Joined: 7 Mar 03 Posts: 22526 Credit: 416,307,556 RAC: 380 |
I suspect we will see a few more RX5xxx GPUs appearing in the next couple of weeks as folks get the Christmas money spent - and with that there will be an increase in the probability of invalid x-validation. It would be interesting (but I suspect difficult) to know how many RX5xxx GPUs there are compared to the total number of active participants. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
It seems the quorum for overflows isn't working well. https://setiathome.berkeley.edu/workunit.php?wuid=3807599912 Perhaps it's because my result WASN'T an overflow? I think this is what usually happens: only the 5700s are finding the overflow result where the correct app is finding real signals. Task 8373826078, computer 8643138: sent 26 Dec 2019, 8:42:00 UTC; reported 26 Dec 2019, 8:47:16 UTC; Completed and validated; run time 15.22 s, CPU time 12.06 s, credit 1.68; SETI@home v8 v8.22 (opencl_ati5_nocal) windows_intelx86. Task 8373826079, computer 8433872: sent 26 Dec 2019, 8:41:58 UTC; reported 27 Dec 2019, 1:06:15 UTC; Completed, marked as invalid; run time 51.22 s, CPU time 28.34 s, credit 0.00; SETI@home v8 Anonymous platform (NVIDIA GPU). Task 8376632701, computer 8872566: sent 27 Dec 2019, 4:55:43 UTC; reported 27 Dec 2019, 5:20:04 UTC; Completed and validated; run time 16.33 s, CPU time 10.97 s, credit 1.68; SETI@home v8 v8.22 (opencl_ati_nocal) windows_intelx86. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Wiggo Joined: 24 Jan 00 Posts: 36772 Credit: 261,360,520 RAC: 489 |
It seems the quorum for overflows isn't working well. Well you just found a new 1 there to add to the list Ian. ;-) Eirikafh 10883218 Cheers. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
It seems the quorum for overflows isn't working well. The quorum of 3 does not work if 2 of the hosts are 5700s... IMHO the quorum technique (3 or even more) did not actually solve the problem, it just made it less frequent. There are only 3 real fixes: - Fix the driver: very unlikely in the short term. - Do not allow a WU to be sent to 2 hosts with 5700s: requires changes to the server code, so not likely to be done either. - Stop sending new work to this type of GPU, as was done with Nvidia and the VLARs in the past: the simplest and most likely to be done of the 3 fixes. Most of the code changes were already done. |
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
I think another option could ideally be to add some more logic to the WU distribution to make sure that a WU goes to different devices and not distributed to the same kind of device. Example, One WU goes to an nvidia host, one goes to an AMD/ATI host. They don’t agree, the tie breaker goes out to a CPU app. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
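The distribution rule proposed above can be sketched roughly as follows. This is only an illustration under stated assumptions: the device class names and the function `allowed_device_classes` are hypothetical and do not come from the BOINC scheduler.

```python
# Hypothetical device classes; a real scheduler would distinguish more.
ALL_CLASSES = frozenset({"nvidia", "amd", "cpu"})

def allowed_device_classes(assigned):
    """Return the device classes that may receive the next copy of a
    work unit, given the classes that already hold a copy.

    - The initial two copies must go to two different classes.
    - Any later copy (a tie-breaker) goes to a CPU app only.
    """
    if len(assigned) < 2:
        return set(ALL_CLASSES) - set(assigned)
    return {"cpu"}

print(allowed_device_classes([]))                  # any class may take the first copy
print(allowed_device_classes(["nvidia"]))          # second copy: anything but nvidia
print(allowed_device_classes(["nvidia", "amd"]))   # tie-breaker: CPU only
```

Under this rule two hosts of the same GPU class can never form the agreeing pair on their own, at the cost of a smaller pool of eligible hosts for each copy.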
Grant (SSSF) Joined: 19 Aug 99 Posts: 13854 Credit: 208,696,464 RAC: 304 |
I think another option could ideally be to add some more logic to the WU distribution to make sure that a WU goes to different devices and not distributed to the same kind of device. Been thinking along the same lines myself. Initial resends can go to anyone, but a 2nd, 3rd etc. resend only goes to a CPU? Would be a shame if necessary. I always look at resends as instant Credit: you're not waiting on your wingman, they've already returned their result. As soon as you return yours, it's pretty much instant validation & Credit. Grant Darwin NT |
rob smith Joined: 7 Mar 03 Posts: 22526 Credit: 416,307,556 RAC: 380 |
One would have to be aware of any re-schedulers using RX5xxx GPUs. From memory the runt of the code used to block VLARs(?) to nVidia GPUs is still there, so it might be just as easy to twist that a little bit to block sending tasks to that GPU family, and so cut the problem off at the source. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Wiggo Joined: 24 Jan 00 Posts: 36772 Credit: 261,360,520 RAC: 489 |
One would have to be aware of any re-schedulers using RX5xxx GPUs. +1 Cheers. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13854 Credit: 208,696,464 RAC: 304 |
From memory the runt of the code used to block VLARs(?) to nVidia GPUs is still there, so it might be just as easy to twist that a little bit to block sending tasks to that GPU family, and so cut the problem off at the source. Would it make blocking just one family of GPUs possible? That's the main question, and it would be the best method as it would allow quorums of 2 again for all WUs. As I posted earlier in the thread, the problem will get much worse, as AMD released the RX 5500 XT a couple of weeks ago: a much cheaper video card using the same architecture and requiring the same problematic drivers. Grant Darwin NT |
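Blocking a single GPU family could, in principle, come down to a filter on the model name the host reports. The sketch below is purely illustrative: the pattern, the function name `may_send_gpu_task`, and the assumption that hosts report a model string containing "RX 5xxx" are all hypothetical, and this is not the existing VLAR-blocking code.

```python
import re

# Hypothetical pattern: "RX" followed by a four-digit model number
# starting with 5 (5700 XT, 5500 XT, ...). Three-digit models such as
# the RX 580 are deliberately not matched.
BLOCKED_GPU = re.compile(r"\bRX\s*5\d{3}(?!\d)", re.IGNORECASE)

def may_send_gpu_task(gpu_model: str) -> bool:
    """Return False if the reported GPU model belongs to the blocked family."""
    return BLOCKED_GPU.search(gpu_model) is None

for model in ["AMD Radeon RX 5700 XT", "Radeon RX 5500 XT",
              "Radeon RX 580", "GeForce GTX 1080 Ti"]:
    print(model, "->", "send" if may_send_gpu_task(model) else "block")
```

A name-based filter like this would need to be kept in step with whatever identifier the client actually reports, which may be different from the marketing name.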
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
I think another option could ideally be to add some more logic to the WU distribution to make sure that a WU goes to different devices and not distributed to the same kind of device. . . Yes, making sure the tie breaker is not a 5xxx would be a good step. But it still doesn't eliminate the problem if the original hosts are both 5xxx so your idea is necessary so that doesn't happen. Stephen . . |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
One would have to be aware of any re-schedulers using RX5xxx GPUs. . . That sounds best to me ... Stephen . . |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Here's a new one for your list Mr. Kevvy. Not an early overflow. Got robbed. https://setiathome.berkeley.edu/workunit.php?wuid=3809612187 Swagstergo https://setiathome.berkeley.edu/show_host_detail.php?hostid=8871372 Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Wiggo Joined: 24 Jan 00 Posts: 36772 Credit: 261,360,520 RAC: 489 |
And another 2 Xmas presents hit the list. dalex 10881818 Daniel Penz 91581 Cheers. |
Mr. Kevvy Joined: 15 May 99 Posts: 3806 Credit: 1,114,826,392 RAC: 3,319 |
Eirikafh, Swagstergo, dalex and Daniel Penz all pestered and thank you both! dalex is the first multi-RX participant I have seen with three hosts, all with bad RX 5700s. As well, Peter Furlong has replied and indicated that the host has stopped producing the bad work units... thanks! @Ian: Keith had also noted that some were not overflows. Your result was not, but both the quorum RX 5700 results on that work unit were. If I see any more like this, i.e. from Dec. 28th on, I will pass them to Dr. Korpela, as this would indicate that the three-quorum isn't checking the two "valid" results but instead all of them for -9s, and clearly in most cases the good platform is not going to have that. I'm also at a loss why the platform hasn't been blocked from v8 work (their AP results are actually good) but ¯\_(ツ)_/¯ |
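The suspicion above, that the overflow test looks at all results rather than only the agreeing ones, can be made concrete with a small sketch. Both functions are hypothetical illustrations of the two possible checks, not the project's validator code; each result is reduced to a boolean meaning "this result was a -9 overflow".

```python
from collections import Counter

def overflow_check_all(overflow_flags):
    """Treat the work unit as an overflow only if EVERY result is one."""
    return all(overflow_flags)

def overflow_check_agreeing(overflow_flags):
    """Treat the work unit as an overflow if the results that agree
    with each other (the largest matching group) are overflows."""
    majority_flag, _count = Counter(overflow_flags).most_common(1)[0]
    return majority_flag

# The case described above: two RX 5700 overflows, one good non-overflow.
flags = [True, False, True]
print(overflow_check_all(flags))       # the good result hides the overflow
print(overflow_check_agreeing(flags))  # the agreeing pair is flagged
```

If the server used something like the first check, a single good result would be enough to stop the extra quorum from being triggered, which would match the behaviour reported on that work unit.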
tazzduke Joined: 15 Sep 07 Posts: 190 Credit: 28,269,068 RAC: 5 |
Greetings All. Yep, and they just keep coming. Here are 2 that hit me and that I couldn't see in the earlier lists: Dank https://setiathome.berkeley.edu/show_host_detail.php?hostid=8670499 Rafael https://setiathome.berkeley.edu/show_host_detail.php?hostid=8875140 Has anyone thought of the following: email the founder of the team each of these hosts belongs to, and/or put an advisory on the home page of Seti@home regarding the situation with OpenCL and these cards? Just thinking out loud here :-) Regards Mark |