Flakey AMD/ATI GPUs, including RX 5700 XT, Cross Validating, polluting the Database

Ian&Steve C.
Joined: 28 Sep 99
Posts: 4243
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2023596 - Posted: 18 Dec 2019, 18:05:30 UTC - in response to Message 2023594.  

I've also seen several people on Reddit mention that. Maybe on other projects; I haven't seen any Linux/Navi systems here at SETI.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2023596
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14472
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2023602 - Posted: 18 Dec 2019, 19:40:44 UTC - in response to Message 2023438.  

I would like a report from Richard at what was decided or discussed during the BOINC steering committee meeting this morning on this problem.
I had a PM from Mr. Kevvy about that (I'm not an AMD user, so I don't usually bother reading this thread).

Mr. Kevvy described it as a SETI@Home team meeting, rather than a BOINC steering committee meeting, which sounds sensible for an issue like this. I'm sorry to say that I'm not invited to SETI@Home team meetings, and so far as I'm aware there are - currently - no published minutes or recordings. I wasn't even aware there was a meeting until Mr. Kevvy asked.
ID: 2023602
Bluerazor
Joined: 22 May 99
Posts: 15
Credit: 3,889,427
RAC: 12
United States
Message 2023643 - Posted: 19 Dec 2019, 2:23:56 UTC - in response to Message 2023593.  

New drivers today, but don't expect a fix.

https://www.reddit.com/r/Amd/comments/ebz5vk/radeon_software_19123_tomorrow_december_18_2019/fb84hpx/?utm_source=share&utm_medium=web2x

tmakedon
Director of AMD Software Strategy

Does this reply prove to you that we are monitoring Reddit? ;-) Yes this will show up as a known issue and we are investigating it.


As promised, it's on the known issues (finally):
https://www.amd.com/en/support/kb/release-notes/rn-rad-win-19-12-3
ID: 2023643
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13139
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2023648 - Posted: 19 Dec 2019, 2:55:00 UTC - in response to Message 2023643.  
Last modified: 19 Dec 2019, 2:55:30 UTC

As promised, it's on the known issues (finally):
https://www.amd.com/en/support/kb/release-notes/rn-rad-win-19-12-3

made clickable
bottom of the known issues list
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2023648
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4243
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2023654 - Posted: 19 Dec 2019, 3:38:44 UTC - in response to Message 2023648.  

What a cop-out too. SETI isn't the only project having issues with these cards. Instead of naming SETI specifically, they should have said something like "some OpenCL compute applications will provide incorrect results."
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2023654
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2023676 - Posted: 19 Dec 2019, 6:39:31 UTC - in response to Message 2023654.  

What a cop-out too. SETI isn't the only project having issues with these cards. Instead of naming SETI specifically, they should have said something like "some OpenCL compute applications will provide incorrect results."


. . Exactly, there is NO maybe about it ... they are giving out nothing but rubbish ...

Stephen

:(
ID: 2023676
Wiggo
Joined: 24 Jan 00
Posts: 26181
Credit: 261,360,520
RAC: 489
Australia
Message 2023833 - Posted: 20 Dec 2019, 5:50:10 UTC

A new one decided to pester me, along with the usual culprits. :-(

JohnDoe 9166075

Cheers.
ID: 2023833
Wiggo
Joined: 24 Jan 00
Posts: 26181
Credit: 261,360,520
RAC: 489
Australia
Message 2023966 - Posted: 21 Dec 2019, 0:42:01 UTC

Either I was very, very lucky yesterday (UTC), as only 2 regulars were a very minor annoyance to me, or more people have shut their GPUs down (or have they finally been locked out?).

Cheers.
ID: 2023966
Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3422
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2023967 - Posted: 21 Dec 2019, 0:45:43 UTC - in response to Message 2023966.  

Either I was very, very lucky yesterday (UTC), as only 2 regulars were a very minor annoyance to me, or more people have shut their GPUs down (or have they finally been locked out?).

Cheers.


I think it's working, between my pestering (JohnDoe 9166075 is the latest, and thank you) and word getting out. [AfZ]TomServo1 is the latest to reply and indicate that the GPU is now disabled.
ID: 2023967
Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1380
Credit: 54,506,847
RAC: 60
United States
Message 2024148 - Posted: 21 Dec 2019, 17:14:48 UTC - in response to Message 2023967.  
Last modified: 21 Dec 2019, 17:38:17 UTC

I've made a change to the validator that raises the effective quorum for overflow results. This should limit the number of successful cross validations for these GPUs.

That change is still missing a few, so I need to check the overflow detection mechanism. No, my mistake. It is working. The results I was looking at were before the change.
@SETIEric

ID: 2024148
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13331
Credit: 208,696,464
RAC: 304
Australia
Message 2024202 - Posted: 21 Dec 2019, 20:58:02 UTC - in response to Message 2024148.  

I've made a change to the validator that raises the effective quorum for overflow results. This should limit the number of successful cross validations for these GPUs.

That change is still missing a few, so I need to check the overflow detection mechanism. No, my mistake. It is working. The results I was looking at were before the change.

Thank you for that.
That should also help reduce the server load.
Grant
Darwin NT
ID: 2024202
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2024218 - Posted: 21 Dec 2019, 22:02:24 UTC - in response to Message 2024202.  
Last modified: 21 Dec 2019, 22:04:02 UTC

That should also help reduce the server load.

I wouldn't be so sure about that. It will take more time to clear the quorum for overflow results.

But it will surely reduce the pollution of the DB, giving them more time to fix the driver.
ID: 2024218
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2024256 - Posted: 22 Dec 2019, 1:28:10 UTC - in response to Message 2024148.  

I've made a change to the validator that raises the effective quorum for overflow results. This should limit the number of successful cross validations for these GPUs.

That change is still missing a few, so I need to check the overflow detection mechanism. No, my mistake. It is working. The results I was looking at were before the change.


. . Many thanks, hope all this does not spoil your Christmas too much.

Stephen

:)
ID: 2024256
Wiggo
Joined: 24 Jan 00
Posts: 26181
Credit: 261,360,520
RAC: 489
Australia
Message 2024270 - Posted: 22 Dec 2019, 2:46:55 UTC

I must have just been lucky yesterday, as today I can report 3 new culprits, with 1 being a Linux job, so it's certainly not just a Windows thing. :-(

fredi 7913572 Linux
Recedham 954834
TomasFraus 8445239

Cheers.
ID: 2024270
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13139
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2024299 - Posted: 22 Dec 2019, 6:14:33 UTC - in response to Message 2024270.  

I must have just been lucky yesterday, as today I can report 3 new culprits, with 1 being a Linux job, so it's certainly not just a Windows thing. :-(

fredi 7913572 Linux
Recedham 954834
TomasFraus 8445239

Cheers.

Good find. I thought there was the same issue with the Linux hosts, but it's rare to find one using the 5700 on the SETI project.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2024299
Chooka Project Donor
Joined: 13 Dec 12
Posts: 10
Credit: 7,913,101
RAC: 0
Australia
Message 2024781 - Posted: 24 Dec 2019, 5:29:32 UTC

That latest AMD driver update, 19.12.2, was garbage. It caused my system to crash, and that was running just 1 WU on a Radeon VII. I reverted to 19.9.2 and it's been stable since.
ID: 2024781
Wiggo
Joined: 24 Jan 00
Posts: 26181
Credit: 261,360,520
RAC: 489
Australia
Message 2024784 - Posted: 24 Dec 2019, 5:50:26 UTC - in response to Message 2024781.  

That latest AMD driver update, 19.12.2, was garbage. It caused my system to crash, and that was running just 1 WU on a Radeon VII. I reverted to 19.9.2 and it's been stable since.
But at least your cards, so long as you stay with your current drivers, are not affected by the huge problem that affects the newer RX cards straight out of the box, and so far no solution is in sight (other than AMD now stating that their RX cards are not "SETI Friendly").

But I would be very wary about newer drivers on their older kits now. ;-)

Cheers.
ID: 2024784
Justin Turner Arthur
Joined: 20 Oct 03
Posts: 12
Credit: 3,929,052
RAC: 2
United States
Message 2024788 - Posted: 24 Dec 2019, 6:21:26 UTC

That is good info. Looks like fredi's computer 8867025 is running the PAL OpenCL driver for Linux (probably from the AMDGPU-Pro package). As ROCm driver cards still aren't chosen in the ATI multibeam client's plan class, the only results we'd get from those will be from anonymous hosts, and that's only when ROCm finally gets Navi support.

So the big question is how this continues to work on the macOS OpenCL runtime according to observations posted to this thread. My guesses are one of these:
- There's an issue with Navi support at the PAL layer. Both the Windows driver and AMDGPU-Pro use the PAL layer provided by device drivers.
- Both the Windows and AMDGPU-Pro OpenCL compilers are shipped with the same GCN bitcode mistakes.

I don't know if the Apple stack uses PAL at all.
ID: 2024788
Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1380
Credit: 54,506,847
RAC: 60
United States
Message 2025033 - Posted: 26 Dec 2019, 4:17:24 UTC - in response to Message 2024788.  
Last modified: 26 Dec 2019, 5:31:13 UTC

Since yesterday, we've fallen back to the old server so anonymous platform apps should be able to get work. Since this morning we should have had the validator that requires 3 results for overflow results. Merry Christmas!

Now I just need to explain why some workunits like this one are sneaking through.

Problem found and fixed....
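
For anyone following along, here is a minimal sketch of the idea behind the validator change described in this post: overflow results need 3 results before they can validate, so two faulty GPUs can no longer validate each other. This is not the actual SETI@home or BOINC validator code. The names `Result`, `required_quorum`, `can_validate`, and the `signature` field are hypothetical, the normal quorum of 2 is an assumption, and reading "3 results" as 3 matching results is also an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

NORMAL_QUORUM = 2    # assumption: matching results normally needed to validate
OVERFLOW_QUORUM = 3  # from the post: overflow results now require 3 results


@dataclass(frozen=True)
class Result:
    host_id: int       # hypothetical host identifier
    is_overflow: bool  # True if the host reported an overflow result
    signature: str     # hypothetical stand-in for the science output being compared


def required_quorum(results: List[Result]) -> int:
    """Return how many matching results are needed before validating."""
    if any(r.is_overflow for r in results):
        return OVERFLOW_QUORUM
    return NORMAL_QUORUM


def can_validate(results: List[Result]) -> bool:
    """Return True once enough results agree with each other."""
    if not results:
        return False
    # Count how many results share the most common signature.
    _, matches = Counter(r.signature for r in results).most_common(1)[0]
    return matches >= required_quorum(results)
```

With this rule, two overflow results that agree with each other leave the workunit open until a third result arrives, while ordinary results still validate at the usual quorum.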
@SETIEric

ID: 2025033
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13331
Credit: 208,696,464
RAC: 304
Australia
Message 2025037 - Posted: 26 Dec 2019, 5:35:39 UTC - in response to Message 2025033.  

Since yesterday, we've fallen back to the old server so anonymous platform apps should be able to get work. Since this morning we should have had the validator that requires 3 results for overflow results. Merry Christmas!
Thanks for all your efforts.

We're just very curious as to how the Scheduler from Beta ended up on main, after all the complaints over at Beta about its issues?
Grant
Darwin NT
ID: 2025037
 
©2022 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.