Flakey AMD/ATI GPUs, including RX 5700 XT, Cross Validating, polluting the Database

Message boards : Number crunching : Flakey AMD/ATI GPUs, including RX 5700 XT, Cross Validating, polluting the Database

rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22257
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2025443 - Posted: 29 Dec 2019, 11:14:01 UTC - in response to Message 2025381.  

It certainly worked when there were issues with nVidia GPUs a few years back, so with a bit of thinking around the logic identifying the GPU family I can't see why it shouldn't work again (but on a different GPU family...)
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2025443 · Report as offensive     Reply Quote
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2025444 - Posted: 29 Dec 2019, 11:37:15 UTC - in response to Message 2025443.  

It certainly worked when there were issues with nVidia GPUs a few years back, so with a bit of thinking around the logic identifying the GPU family I can't see why it shouldn't work again (but on a different GPU family...)
You're not thinking of the Fermi fiasco, are you? That happened because the programmers (NVidia themselves, for those early apps) used an undocumented shortcut which was removed for the later generations. Once identified, NVidia removed it from the SETI application and documented the problem for future programmers.
ID: 2025444 · Report as offensive     Reply Quote
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2025454 - Posted: 29 Dec 2019, 14:34:00 UTC - in response to Message 2025440.  
Last modified: 29 Dec 2019, 17:00:27 UTC

Dank and Rafael added and pestered and thank you. :^) Also thanks to Swagstergo who is the latest to reply and indicate that they disabled their affected GPU.

I also found I was mugged for 3810416472, which is exactly what I was waiting for, as it was created after the indicated quorum fix was in place but still "validated" with two -9 overflow RX 5700s overpowering a non-overflow platform. I will be contacting Dr. Korpela and I'll include the work unit posted earlier as well.

Edit: The pesterposts seem to be working. I went over the entire list checking work queues and found that an equal number or more people had disabled their affected GPUs (or stopped computing entirely) without replying than otherwise, so there are plenty more strikethroughs in there.
ID: 2025454 · Report as offensive     Reply Quote
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2025455 - Posted: 29 Dec 2019, 15:20:56 UTC
Last modified: 29 Dec 2019, 15:24:10 UTC

I still suggest the same procedure currently being used for the AstroPulse Tasks simply be extended to include Multibeam tasks. The only solution which will stop Cross-Validation while still allowing the GPUs to participate in SETI is to only allow One AMD/ATI Host per work unit. It seems to work well for the AstroPulse tasks on Main, while Beta still allows you to be robbed. The MacOS 19.1 Catalina Update provides support for the AMD RX 5xxx series, so, you will start seeing more of those GPUs on Macs soon, and they work on Macs.
ID: 2025455 · Report as offensive     Reply Quote
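For anyone wondering what the one-AMD/ATI-host-per-work-unit rule suggested above could look like, here is a minimal sketch. It is not the project's scheduler code: the `Host`, `WorkUnit`, `can_assign` and `assign` names and the vendor labels are all hypothetical, made up for illustration.

```python
from dataclasses import dataclass, field

# Assumption: vendor labels as a scheduler might record them.
AMD_VENDORS = {"AMD", "ATI"}


@dataclass
class Host:
    host_id: int
    gpu_vendor: str  # e.g. "AMD", "ATI", "NVIDIA", or "" for CPU-only


@dataclass
class WorkUnit:
    wuid: int
    assigned: list = field(default_factory=list)  # hosts already given a task


def can_assign(wu: WorkUnit, host: Host) -> bool:
    """True if sending wu to host keeps at most one AMD/ATI host on it."""
    # Never send two tasks from the same work unit to the same host.
    if any(h.host_id == host.host_id for h in wu.assigned):
        return False
    # Non-AMD/ATI hosts are not restricted by this rule.
    if host.gpu_vendor.upper() not in AMD_VENDORS:
        return True
    # An AMD/ATI host is allowed only if no other AMD/ATI host is present.
    return not any(h.gpu_vendor.upper() in AMD_VENDORS for h in wu.assigned)


def assign(wu: WorkUnit, host: Host) -> bool:
    """Assign the host if the rule allows it; report whether it happened."""
    if can_assign(wu, host):
        wu.assigned.append(host)
        return True
    return False
```

With a rule like this, two RX 5700s could never form a quorum between themselves, because the second one would simply never be sent the same work unit.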
Profile Wiggo
Joined: 24 Jan 00
Posts: 35078
Credit: 261,360,520
RAC: 489
Australia
Message 2025526 - Posted: 30 Dec 2019, 3:43:40 UTC - in response to Message 2025444.  
Last modified: 30 Dec 2019, 3:48:44 UTC

It certainly worked when there were issues with nVidia GPUs a few years back, so with a bit of thinking around the logic identifying the GPU family I can't see why it shouldn't work again (but on a different GPU family...)
You're not thinking of the Fermi fiasco, are you? That happened because the programmers (NVidia themselves, for those early apps) used an undocumented shortcut which was removed for the later generations. Once identified, NVidia removed it from the SETI application and documented the problem for future programmers.
Or was it also before then? (295.xx-296.xx were the "sleepy drivers", but I seem to remember something even before them [edit: wasn't there also a bad driver group back in the 140.xx-160.xx range as well?]).

But IIRC it was Matt that used to work that kind of magic.

Anyhow, here are a couple more bad SETI choice Xmas presents.

Daniel Conrad Broom 8059986

rAttmAniA 9002301

Cheers.
ID: 2025526 · Report as offensive     Reply Quote
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2025562 - Posted: 30 Dec 2019, 11:13:52 UTC - in response to Message 2025526.  

Thanks! Daniel Conrad Broom pestered. rAttmAniA was already in the list. And also thanks to fredi, the latest person to reply to indicate that they disabled their affected GPU.
ID: 2025562 · Report as offensive     Reply Quote
Profile MagicEye
Volunteer tester
Joined: 19 Sep 99
Posts: 70
Credit: 40,327,877
RAC: 75
Germany
Message 2025577 - Posted: 30 Dec 2019, 15:25:58 UTC - in response to Message 2025033.  

Since yesterday, we've fallen back to the old server so anonymous platform apps should be able to get work. Since this morning we should have had the validator that requires 3 results for overflow results. Merry Christmas!

What is the difference between the old and the new validator settings?
In the past I have always seen a quorum of 3 when 2 hosts return different results, and the practice still looks the same when I look at today's results from some 5700 XTs.
I just want to understand what the effort achieves; to me it still looks the same.

From the science point of view, a quorum of 4 would be a very good way to reduce the number of bad results marked as good by two 5700 GPUs until a solution is found to exclude such hosts completely.
Or will the results from the 5700 GPUs be deleted at a later stage of processing on the servers, so that none of them end up in the science database?
ID: 2025577 · Report as offensive     Reply Quote
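To make the old-versus-new question concrete, here is a minimal sketch of one possible reading of "requires 3 results for overflow results". It is not the project's validator. It assumes the strict reading, where three matching results are needed as soon as any returned result is an overflow, and the function names and the `signature` / `overflow` fields are hypothetical.

```python
from collections import Counter


def required_quorum(results, base=2, overflow=3):
    """Matching results needed: more if any result is flagged as overflow."""
    return overflow if any(r["overflow"] for r in results) else base


def canonical_signature(results, base=2, overflow=3):
    """Return the agreed signature once enough results match, else None."""
    if not results:
        return None
    needed = required_quorum(results, base, overflow)
    # Most common signature among the returned results, with its count.
    signature, count = Counter(r["signature"] for r in results).most_common(1)[0]
    return signature if count >= needed else None


# Two matching overflow results plus one non-overflow result.
returned = [
    {"signature": "overflow", "overflow": True},
    {"signature": "overflow", "overflow": True},
    {"signature": "normal", "overflow": False},
]
old_rule = canonical_signature(returned, overflow=2)  # two matches are enough
new_rule = canonical_signature(returned)              # three matches required
```

Under the looser reading (three results returned, any two matching) the same case would still validate on the two overflows, which would look just like the old behaviour from the outside.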
Profile Wiggo
Joined: 24 Jan 00
Posts: 35078
Credit: 261,360,520
RAC: 489
Australia
Message 2025665 - Posted: 31 Dec 2019, 2:59:55 UTC

Another 2 that have crossed my path.

Alexandr Galushchenko 9609912

Richard Hartland 9781177

Cheers.
ID: 2025665 · Report as offensive     Reply Quote
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2025730 - Posted: 31 Dec 2019, 19:04:54 UTC - in response to Message 2025628.  
Last modified: 31 Dec 2019, 19:06:33 UTC

I still suggest the same procedure currently being used for the AstroPulse Tasks simply be extended to include Multibeam tasks. The only solution which will stop Cross-Validation while still allowing the GPUs to participate in SETI is to only allow One AMD/ATI Host per work unit. It seems to work well for the AstroPulse tasks on Main, while Beta still allows you to be robbed. The MacOS 19.1 Catalina Update provides support for the AMD RX 5xxx series, so, you will start seeing more of those GPUs on Macs soon, and they work on Macs.

Does not work on Main either for AP. They do allow two ATI hosts per WU here too, even for AP.

I just got robbed on an AP WU by two ATI GPUs:

https://setiathome.berkeley.edu/workunit.php?wuid=3812949035
I suppose the recent Server swaps could have removed the AP code again...it has happened before. The AP results are different from the Multibeam results. The AP results actually DO result in bad Data being entered into the Database because the task has already passed the RFI test. In the Multibeam tasks, All of those overflows being done by the AMD 5xxx GPUs are at Chirp = Zero, and Will be removed as RFI. So, the Multibeam overflows are just a waste of Volunteers' time, and missed observations for the Project, whereas the False AP results really do enter bad Data. Strange that all the results matched though, it could be something else...
ID: 2025730 · Report as offensive     Reply Quote
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2025958 - Posted: 2 Jan 2020, 3:11:10 UTC

new one that robbed Zalster:

https://setiathome.berkeley.edu/workunit.php?wuid=3816043248

AMD Jesus: https://setiathome.berkeley.edu/show_user.php?userid=70887
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2025958 · Report as offensive     Reply Quote
Kiska
Volunteer tester
Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 2025996 - Posted: 2 Jan 2020, 10:26:35 UTC - in response to Message 2025958.  

new one that robbed Zalster:

https://setiathome.berkeley.edu/workunit.php?wuid=3816043248

AMD Jesus: https://setiathome.berkeley.edu/show_user.php?userid=70887


Do you mean https://setiathome.berkeley.edu/show_host_detail.php?hostid=8823664
Cause AMD Jesus has a Radeon VII and Vega is known good?
ID: 2025996 · Report as offensive     Reply Quote
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22257
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2025997 - Posted: 2 Jan 2020, 10:27:57 UTC

That's interesting - if you look at the task summaries for those two robbers you will see that between them they have over 1000 "invalid" returns. Sadly, "invalid" returns do not count as "strikes against" in the same way as errors do. Perhaps it is time for invalid returns to be treated in a similar way to errors (too many in a given period and the number of tasks sent out to that host is reduced).
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2025997 · Report as offensive     Reply Quote
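A minimal sketch of the idea above, counting invalid returns as strikes alongside errors when working out how many tasks a host may be sent. This is not how the server is known to compute quotas: the `daily_quota` name, the halving per strike and the floor of one task are all assumptions chosen for illustration.

```python
def daily_quota(max_quota, errors, invalids, count_invalids=True, min_quota=1):
    """Tasks per day a host may receive after penalising recent strikes.

    Assumption: each strike halves the quota, down to min_quota.
    With count_invalids=False only errors count, as described in the post.
    """
    strikes = errors + (invalids if count_invalids else 0)
    quota = max_quota
    for _ in range(strikes):
        quota = max(min_quota, quota // 2)
    return quota


# A host with no errors but 1000 invalid returns in the period.
errors_only = daily_quota(100, errors=0, invalids=1000, count_invalids=False)
with_invalids = daily_quota(100, errors=0, invalids=1000)
```

In this toy model the host keeps its full quota when only errors count, and drops to the minimum once invalids are treated as strikes.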
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2025999 - Posted: 2 Jan 2020, 10:48:00 UTC

Alexandr Galushchenko and Richard Hartland pestered... thank you! AMD Jesus is the "correct" one (rame is already in the list) as it has many invalids, but as noted it is not using a card known to be an issue, and there is a mix of good and bad in there, so I wonder if either it was some other issue or the card was changed. I will give that one a day to verify.
ID: 2025999 · Report as offensive     Reply Quote
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2026015 - Posted: 2 Jan 2020, 22:20:32 UTC - in response to Message 2025997.  

That's interesting - if you look at the task summaries for those two robbers you will see that between them they have over 1000 "invalid" returns. Sadly, "invalid" returns do not count as "strikes against" in the same way as errors do. Perhaps it is time for invalid returns to be treated in a similar way to errors (too many in a given period and the number of tasks sent out to that host is reduced).


+1

Stephen

! ! !
ID: 2026015 · Report as offensive     Reply Quote
Profile Wiggo
Joined: 24 Jan 00
Posts: 35078
Credit: 261,360,520
RAC: 489
Australia
Message 2026026 - Posted: 2 Jan 2020, 23:17:41 UTC

Cause AMD Jesus has a Radeon VII and Vega is known good?
If you read the Stderr output (link may not last long), the 1st GPU listed is a gfx1010 (aka the RX 5700 XT, and what the w/u was run on) while the 2nd is a gfx906 (the Radeon VII), but why the older card is listed as the primary card in that system instead of the RX is a bit strange to me. ;-)

Cheers.
ID: 2026026 · Report as offensive     Reply Quote
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2026032 - Posted: 2 Jan 2020, 23:59:52 UTC - in response to Message 2026026.  

Yes, AMD Jesus has two cards, one being the Radeon VII, the other being the RX5700(XT)

The reason the Radeon VII is seen as the better card is likely due to the amount of VRAM, 16GB on the RVII vs 8GB on the RX5700.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2026032 · Report as offensive     Reply Quote
Profile Wiggo
Joined: 24 Jan 00
Posts: 35078
Credit: 261,360,520
RAC: 489
Australia
Message 2026034 - Posted: 3 Jan 2020, 0:06:36 UTC

I didn't think of that, but you're likely correct.

Cheers.
ID: 2026034 · Report as offensive     Reply Quote
Profile Wiggo
Joined: 24 Jan 00
Posts: 35078
Credit: 261,360,520
RAC: 489
Australia
Message 2026049 - Posted: 3 Jan 2020, 1:06:15 UTC

A new 1 for me.

Richard 8565733

Cheers.
ID: 2026049 · Report as offensive     Reply Quote
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2026127 - Posted: 3 Jan 2020, 12:13:12 UTC - in response to Message 2026026.  

Cause AMD Jesus has a Radeon VII and Vega is known good?
If you read the Stderr output (link may not last long), the 1st GPU listed is a gfx1010 (aka the RX 5700 XT, and what the w/u was run on) while the 2nd is a gfx906 (the Radeon VII), but why the older card is listed as the primary card in that system instead of the RX is a bit strange to me. ;-)

Cheers.


. . These days it is often the case, I have seen it on many machines, the older or lesser card gets primary listing.

Stephen

? ?
ID: 2026127 · Report as offensive     Reply Quote
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13765
Credit: 208,696,464
RAC: 304
Australia
Message 2026169 - Posted: 3 Jan 2020, 21:20:29 UTC
Last modified: 3 Jan 2020, 21:21:28 UTC

The Radeon VII outperforms the 5700 XT; it is the more powerful card (particularly for single precision work).
Grant
Darwin NT
ID: 2026169 · Report as offensive     Reply Quote


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.