Flakey AMD/ATI GPUs, including RX 5700 XT, Cross Validating, polluting the Database

Message boards : Number crunching : Flakey AMD/ATI GPUs, including RX 5700 XT, Cross Validating, polluting the Database

TBar
Volunteer tester

Joined: 22 May 99
Posts: 5086
Credit: 771,441,398
RAC: 1,794,236
United States
Message 2006516 - Posted: 9 Aug 2019, 18:56:34 UTC

This is becoming much more common, similar to the problem with the APs,

https://setiathome.berkeley.edu/workunit.php?wuid=3597951375
   Task    Computer            Sent                 Time reported                Status             Run time  CPU time  Credit                         Application
7934624541 8534188  8 Aug 2019,  5:33:40 UTC  8 Aug 2019,  5:50:25 UTC  Completed and validated        17.34    13.97    1.87   SETI@home v8 v8.22 (opencl_ati5_SoG_nocal) windows_intelx86
7934624542 8757016  8 Aug 2019,  5:33:35 UTC  8 Aug 2019,  9:29:35 UTC  Completed, marked as invalid  176.24   149.68    0.00   SETI@home v8 Anonymous platform (NVIDIA GPU)
7935750639 7060821  8 Aug 2019, 15:01:01 UTC  8 Aug 2019, 16:11:08 UTC  Completed, marked as invalid   35.12    17.83    0.00   SETI@home v8 Anonymous platform (ATI GPU)
7936540283 6942127  8 Aug 2019, 21:19:24 UTC  8 Aug 2019, 21:55:40 UTC  Completed and validated        23.18    20.08    1.87   SETI@home v8 v8.22 (opencl_ati_nocal) windows_intelx86
All the AMD GPUs have Hundreds of Invalids elsewhere. I suggest action sooner rather than later as it will undoubtedly become worse with more RX 5700 XT GPUs arriving. Maybe just One AMD GPU per WorkUnit?
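The pattern in the table above (two ATI apps agreeing with each other while the NVIDIA result is thrown out) can be screened for mechanically. A minimal sketch in Python, assuming the task rows have been copied by hand from the workunit page. The `Task` type, the `is_amd` name test, and `amd_only_quorum` are made up for illustration and are not part of BOINC or the SETI@home validator.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    task_id: int
    host_id: int
    status: str       # text of the "Status" column
    application: str  # text of the "Application" column


def is_amd(task: Task) -> bool:
    # Assumption: AMD/ATI apps are recognisable by "opencl_ati" in the
    # plan class or by a standalone "ATI" in the application name.
    return re.search(r"opencl_ati|\bati\b", task.application, re.IGNORECASE) is not None


def amd_only_quorum(tasks: List[Task]) -> bool:
    """True if every validated task ran an AMD/ATI app while at least one
    other task on the same workunit was marked invalid."""
    valid = [t for t in tasks if t.status == "Completed and validated"]
    invalid = [t for t in tasks if t.status == "Completed, marked as invalid"]
    return len(valid) >= 2 and all(is_amd(t) for t in valid) and len(invalid) > 0


# Rows from workunit 3597951375 as quoted above.
wu_3597951375 = [
    Task(7934624541, 8534188, "Completed and validated",
         "SETI@home v8 v8.22 (opencl_ati5_SoG_nocal) windows_intelx86"),
    Task(7934624542, 8757016, "Completed, marked as invalid",
         "SETI@home v8 Anonymous platform (NVIDIA GPU)"),
    Task(7935750639, 7060821, "Completed, marked as invalid",
         "SETI@home v8 Anonymous platform (ATI GPU)"),
    Task(7936540283, 6942127, "Completed and validated",
         "SETI@home v8 v8.22 (opencl_ati_nocal) windows_intelx86"),
]

print(amd_only_quorum(wu_3597951375))  # True
```

A hit only means the workunit deserves a second look. It does not prove the validated results are wrong.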
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 18303
Credit: 411,475,170
RAC: 43,367
United Kingdom
Message 2006522 - Posted: 9 Aug 2019, 19:23:31 UTC

The most likely action is akin to the one they used a few years ago when nVidia GPUs were having grief with VLARs: a blanket embargo on all such GPUs until a "rock solid" solution was found (in the current situation, "is found"). Naturally there will be people who will be significantly affected by such an action, but as this now appears to be jeopardising the integrity of the data (and thus the science), action should be taken sooner rather than later.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3045
Credit: 1,036,835,818
RAC: 2,060,806
Canada
Message 2006525 - Posted: 9 Aug 2019, 19:32:25 UTC
Last modified: 11 Aug 2019, 3:01:56 UTC

Thanks for this, TBar. I had seen this mentioned before but hadn't read closely enough to see that they unfortunately aren't erroring out but are instead running through to boinc_finish completion.

I'm sure he's aware, but just in case, I'll collect the info and let Dr. Korpela know.

Edit: He's acknowledged... I am glad that I advised him as I don't think he had been yet.
“Never doubt that a small group of thoughtful, committed citizens can change the world; indeed, it's the only thing that ever has.”
---Margaret Mead
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 11014
Credit: 1,110,433,014
RAC: 1,680,033
United States
Message 2006538 - Posted: 9 Aug 2019, 20:42:58 UTC - in response to Message 2006525.  

The new Navi 5700 and 5700XT are useless for compute currently. The drivers are not ready for compute. All projects that rely on AMD OpenCL drivers are producing nothing but garbage results and invalids. The AMD developers and the Khronos group are aware of the problem but not a peep from either of them about what the real problem is or when to expect a fix. In the meantime, I think those cards should be banned until the drivers are fixed for all projects.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3045
Credit: 1,036,835,818
RAC: 2,060,806
Canada
Message 2007063 - Posted: 12 Aug 2019, 23:34:07 UTC

Looks like Dr. Korpela is on it as that work unit linked in the OP has now been purged.

This host's work units, linked in the "Invalid Host Messaging" thread, show it only has 64 in progress, none of which are opencl_ati_nocal, so either that platform has been banned or the owner acted on a private message. Another host from that thread shows active opencl_ati_nocal work, but it was sent five days ago.
“Never doubt that a small group of thoughtful, committed citizens can change the world; indeed, it's the only thing that ever has.”
---Margaret Mead
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 12386
Credit: 201,034,485
RAC: 267,377
Australia
Message 2007665 - Posted: 16 Aug 2019, 6:25:26 UTC

No blanket ban for these GPUs yet.
I've noticed a huge jump in my Inconclusives with the start of WoW (to be expected), and in addition to the usual suspects, there are a few RX 5700/ RX 5700 XTs with work that was sent out to them in the last day or two.
Grant
Darwin NT
Ian&Steve C.
Joined: 28 Sep 99
Posts: 2501
Credit: 1,113,784,702
RAC: 3,392,595
United States
Message 2008929 - Posted: 23 Aug 2019, 13:00:04 UTC

Definitely not banned. This one just popped up for me. I got edged out on this WU by two 5700s that cross-validated each other:

https://setiathome.berkeley.edu/workunit.php?wuid=3619893627
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3045
Credit: 1,036,835,818
RAC: 2,060,806
Canada
Message 2008931 - Posted: 23 Aug 2019, 13:10:07 UTC - in response to Message 2008929.  

And by checking the queues of the computers involved, others will inevitably be found. Well, the good word has been passed so it's out of our hands lol. ¯\_(ツ)_/¯
“Never doubt that a small group of thoughtful, committed citizens can change the world; indeed, it's the only thing that ever has.”
---Margaret Mead
Ian&Steve C.
Joined: 28 Sep 99
Posts: 2501
Credit: 1,113,784,702
RAC: 3,392,595
United States
Message 2009192 - Posted: 25 Aug 2019, 13:38:07 UTC - in response to Message 2008929.  

I PM'd Alexander, and he agreed to remove his 5700 from the project for now.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

Nikolaj_sofus

Joined: 25 Aug 19
Posts: 7
Credit: 280,643
RAC: 1
Denmark
Message 2009197 - Posted: 25 Aug 2019, 13:55:12 UTC

I'm new to SETI@home and just started crunching with my Vega 56 today. Are there any issues with the Vega cards, or is it just an issue with the RDNA cards?

Also, how does the validation process work? Can two Vega cards actually validate each other?
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3045
Credit: 1,036,835,818
RAC: 2,060,806
Canada
Message 2009200 - Posted: 25 Aug 2019, 14:11:00 UTC - in response to Message 2009192.  
Last modified: 25 Aug 2019, 14:34:42 UTC

I PM'd Alexander, and he agreed to remove his 5700 from the project for now.


Also that work unit I linked below only two days ago has now been purged as was the last one, so there definitely is some admin. involvement going on.

@Nikolaj: You're fine... no issues with that card. The only check the project makes for quorum as far as I know is that both copies of a work unit aren't sent to the same participant.
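To make the quorum rule concrete, here is a rough sketch in Python of the check as described in this post, plus the "one AMD GPU per workunit" idea floated earlier in the thread. This is a hypothetical illustration, not BOINC scheduler code: `Host`, `can_assign`, and the `vendor` field are invented names, and the AMD rule is only a proposal from this thread, not a confirmed project policy.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Host:
    host_id: int
    user_id: int
    vendor: str  # hypothetical label: "amd", "nvidia", "intel", "cpu", ...


def can_assign(candidate: Host, assigned: List[Host],
               one_amd_per_wu: bool = True) -> bool:
    """Decide whether another copy of a workunit may go to `candidate`."""
    # Rule described in the thread: both copies of a workunit must not
    # go to the same participant.
    if any(h.user_id == candidate.user_id for h in assigned):
        return False
    # Proposed extra rule (assumption, not project policy): at most one
    # AMD GPU host per workunit, so two AMD cards cannot validate each other.
    if one_amd_per_wu and candidate.vendor == "amd":
        if any(h.vendor == "amd" for h in assigned):
            return False
    return True


first = Host(host_id=1, user_id=100, vendor="amd")
print(can_assign(Host(2, 200, "amd"), [first]))                        # False
print(can_assign(Host(2, 200, "amd"), [first], one_amd_per_wu=False))  # True
print(can_assign(Host(3, 300, "nvidia"), [first]))                     # True
print(can_assign(Host(4, 100, "nvidia"), [first]))                     # False
```

With the extra rule switched off, the second call shows why two cards of the same family can end up as each other's wingmen.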
“Never doubt that a small group of thoughtful, committed citizens can change the world; indeed, it's the only thing that ever has.”
---Margaret Mead
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5086
Credit: 771,441,398
RAC: 1,794,236
United States
Message 2009260 - Posted: 25 Aug 2019, 21:37:20 UTC - in response to Message 2009200.  
Last modified: 25 Aug 2019, 21:49:59 UTC

There has been a problem with Astropulse and the Newer AMD cards for a couple of Years. Any driver newer than around 1800 will produce the Wrong AstroPulse results.
Supposedly only one AMD/ATI card is allowed for each AP workunit; however, I've noticed that sometimes that doesn't happen, and two AMD cards will cross-validate with the wrong results.
Here's an example of the differences,
AstroPulse v7 Anonymous platform (NVIDIA GPU)
single pulses: 26
repetitive pulses: 30
percent blanked: 2.40

AstroPulse v7 v7.09 (opencl_ati_100) windows_intelx86
single pulses: 7
repetitive pulses: 30
percent blanked: 2.40


It's been like this for years.
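For anyone who wants to compare result summaries like the two above without eyeballing them, a small Python sketch follows. It assumes the summary lines have been pasted in as plain "name: value" text. `parse_summary`, `differing_fields`, and the 0.01 tolerance are illustrative choices and have nothing to do with how the real AstroPulse validator decides a match.

```python
from typing import Dict, List


def parse_summary(text: str) -> Dict[str, float]:
    """Turn 'name: value' lines into a dict of floats."""
    summary = {}
    for line in text.strip().splitlines():
        name, sep, value = line.partition(":")
        if sep:  # ignore lines that are not 'name: value'
            summary[name.strip()] = float(value)
    return summary


def differing_fields(a: Dict[str, float], b: Dict[str, float],
                     tol: float = 0.01) -> List[str]:
    """Names of fields that are missing on one side or differ by more than tol."""
    diffs = []
    for key in sorted(set(a) | set(b)):
        if key not in a or key not in b or abs(a[key] - b[key]) > tol:
            diffs.append(key)
    return diffs


# Values copied from the two results quoted above.
nvidia_text = """
single pulses: 26
repetitive pulses: 30
percent blanked: 2.40
"""
ati_text = """
single pulses: 7
repetitive pulses: 30
percent blanked: 2.40
"""

print(differing_fields(parse_summary(nvidia_text), parse_summary(ati_text)))
# ['single pulses']
```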
Profile Wiggo "Democratic Socialist"
Joined: 24 Jan 00
Posts: 17652
Credit: 252,779,398
RAC: 195,500
Australia
Message 2010048 - Posted: 30 Aug 2019, 21:37:20 UTC

Another pair of RX 5700's adding to poor science. :-(

https://setiathome.berkeley.edu/workunit.php?wuid=3632373304

They really need to be looked at seriously.

Cheers.
Bluerazor

Joined: 22 May 99
Posts: 15
Credit: 3,615,701
RAC: 4,587
United States
Message 2010575 - Posted: 3 Sep 2019, 23:49:58 UTC - in response to Message 2010048.  

Is there any specific information that RX 5700 and 5700 XT owners can pass along to AMD to complain about this issue? It's not great news to buy a new card and find out you're stuck with just slow CPU crunching.
Profile Wiggo "Democratic Socialist"
Joined: 24 Jan 00
Posts: 17652
Credit: 252,779,398
RAC: 195,500
Australia
Message 2010585 - Posted: 4 Sep 2019, 0:41:45 UTC - in response to Message 2010575.  

Is there any specific information that RX 5700 and 5700 XT owners can pass along to AMD to complain about this issue? It's not great news to buy a new card and find out you're stuck with just slow CPU crunching.
It's really very simple, the OpenCL part of AMD's latest drivers is broken for computational work. ;-)

Cheers.
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 11014
Credit: 1,110,433,014
RAC: 1,680,033
United States
Message 2010586 - Posted: 4 Sep 2019, 0:45:58 UTC - in response to Message 2010585.  

OpenCL is controlled by the Khronos Group. Not AMD. They are the ones that need to be contacted that their driver component is broken.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 12386
Credit: 201,034,485
RAC: 267,377
Australia
Message 2010612 - Posted: 4 Sep 2019, 7:40:28 UTC - in response to Message 2010586.  

OpenCL is controlled by the Khronos Group. Not AMD. They are the ones that need to be contacted that their driver component is broken.
Khronos might be the ones responsible for OpenCL, but the suppliers of the hardware are the ones responsible for the function of their drivers.
If there is an issue with the OpenCL specification, then things will become very ugly and protracted (lots of finger pointing). But if it's just a case of an issue with the driver development, hopefully it shouldn't take too long for it to be resolved by the manufacturer of the hardware affected.
Grant
Darwin NT
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6120
Credit: 104,782,853
RAC: 40,364
Russia
Message 2010625 - Posted: 4 Sep 2019, 12:32:43 UTC

And has someone started a thread about it on the AMD OpenCL forums?
Is there anyone with the ability to do offline testing who has such "broken" hardware + software?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6120
Credit: 104,782,853
RAC: 40,364
Russia
Message 2010627 - Posted: 4 Sep 2019, 12:38:42 UTC - in response to Message 2006516.  

This is becoming much more common, similar to the problem with the APs,

https://setiathome.berkeley.edu/workunit.php?wuid=3597951375
   Task    Computer            Sent                 Time reported                Status             Run time  CPU time  Credit                         Application
7934624541 8534188  8 Aug 2019,  5:33:40 UTC  8 Aug 2019,  5:50:25 UTC  Completed and validated        17.34    13.97    1.87   SETI@home v8 v8.22 ([b]opencl_ati5_SoG_nocal[/b]) windows_intelx86
7936540283 6942127  8 Aug 2019, 21:19:24 UTC  8 Aug 2019, 21:55:40 UTC  Completed and validated        23.18    20.08    1.87   SETI@home v8 v8.22 ([b]opencl_ati_nocal[/b]) windows_intelx86
All the AMD GPUs have Hundreds of Invalids elsewhere. I suggest action sooner rather than later as it will undoubtedly become worse with more RX 5700 XT GPUs arriving. Maybe just One AMD GPU per WorkUnit?




For future reference, could anyone posting such comparisons also grab the stderr outputs while they are still available, please?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 11014
Credit: 1,110,433,014
RAC: 1,680,033
United States
Message 2010648 - Posted: 4 Sep 2019, 15:36:26 UTC - in response to Message 2010625.  

And has someone started a thread about it on the AMD OpenCL forums?
Is there anyone with the ability to do offline testing who has such "broken" hardware + software?

Phoronix did testing and reviews of the RX 5700XT and could not get the card and drivers to pass the OpenCL parts of their standardized test suite.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
 
©2020 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.