Flakey AMD/ATI GPUs, including RX 5700 XT, Cross Validating, polluting the Database

catavalon21

Joined: 2 Nov 01
Posts: 13
Credit: 7,238,152
RAC: 48
United States
Message 2021399 - Posted: 2 Dec 2019, 2:36:49 UTC - in response to Message 2021395.  

Disregard my prior post. Turns out it is multiple 5700s cross validating bad results, which others have already posted is the situation.
ID: 2021399
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5531
Credit: 192,787,363
RAC: 628
Australia
Message 2021455 - Posted: 2 Dec 2019, 19:32:03 UTC

. . Well I never ....

. . How annoying, one of my completed WUs got mugged by a couple of ATI hosts that are spewing out invalids all over the place. My (probably) valid result got binned because these 2 delinquent hosts agreed with each other even though most of the results on both machines are invalidated against other hosts. I have to wonder how often this happens ... there could be a lot of such trashy results in the database. :(

https://setiathome.berkeley.edu/workunit.php?wuid=3766336632

Stephen

:(
ID: 2021455
Bluerazor

Joined: 22 May 99
Posts: 15
Credit: 3,889,427
RAC: 12
United States
Message 2021507 - Posted: 3 Dec 2019, 3:03:37 UTC - in response to Message 2021455.  

Is there any simple way to demonstrate this problem as a reproducible issue to submit to AMD? For example, is it possible to capture a "known" work unit and run it on two systems (one with RX5700 and one with an older AMD card, for example) to show that RX 5700 gets the results wrong? This would allow AMD to debug the issue, presumably. Raistmer filed a complaint with them, but they were looking for more information to reproduce the problem - which seems to relate to OpenCL on Windows specifically.

I haven't yet tried the 19.12.1 driver, and I am reluctant to do so. No previous driver has resolved the issue, and nothing specific in the new release notes leads me to believe it fixes this one.
ID: 2021507
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 12966
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2021508 - Posted: 3 Dec 2019, 3:07:37 UTC - in response to Message 2021507.  

There are multiple tools to download work units so you can run them offline on various hardware and compare the results. I'm sure Raistmer has already done so and sent the results off to AMD.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2021508
Bluerazor

Joined: 22 May 99
Posts: 15
Credit: 3,889,427
RAC: 12
United States
Message 2021592 - Posted: 3 Dec 2019, 22:30:13 UTC - in response to Message 2021508.  

There are multiple tools to download work units so you can run them offline on various hardware and compare the results. I'm sure Raistmer has already done so and sent the results off to AMD.


Actually, in the thread to AMD he stated that he did not have time for debugging (in September). I believe there was also an issue of hardware availability on first release of the RX 5700. Based on the thread it seems like nothing has been provided.
https://community.amd.com/thread/243179

(Even now for me to do this, I'd need to install my RX480 into my current PC as I have no other AMD hardware to run on). I just thought we might want to file some additional complaints along with instructions to reproduce ... I think it's clear that it is not an application problem since it is only this card's architecture failing (and because multiple other distributed computing projects have trouble with the rx 5700).

The problem is likely to become worse with the impending release of 5500 and 5600 series cards. So it's also possible that AMD will get this fixed soon.
https://www.tomshardware.com/news/report-radeon-rx-5500-xt-launches-next-week-rx-5600-xt-in-january

This is a fairly buggy card/driver. I had frequent crashes upon idle, attributable to the driver, until I found some random hint on the internet that I should set my PCI Express to v3.0 instead of Auto detect. That fixed the issue. The list of known issues is still long. I expect this to be my last AMD GPU.
ID: 2021592
Wiggo
Joined: 24 Jan 00
Posts: 23406
Credit: 261,360,520
RAC: 489
Australia
Message 2021603 - Posted: 3 Dec 2019, 23:01:02 UTC

For those of you who have these cards and the time, read this thread.

Yes, I know that this is an Nvidia driver problem, but links to the tools needed, as well as the relevant information needed for constructing a full report to AMD, are there.

You may also realise why Raistmer may not have the time to spend on it even if he had the hardware available.

Cheers.
ID: 2021603
Bluerazor

Joined: 22 May 99
Posts: 15
Credit: 3,889,427
RAC: 12
United States
Message 2021749 - Posted: 4 Dec 2019, 23:47:56 UTC - in response to Message 2021603.  

For those of you who have these cards and the time, read this thread.

Yes, I know that this is an Nvidia driver problem, but links to the tools needed, as well as the relevant information needed for constructing a full report to AMD, are there.

You may also realise why Raistmer may not have the time to spend on it even if he had the hardware available.

Cheers.


Thanks. That thread (and its linked thread) is quite extensive, but I did read through them. I don't fault Raistmer at all for this or for not spending more time on it. The RX5700 OpenCL issues on Windows have been reported from other projects/applications, and AMD should eventually address them, hopefully. Unfortunately, the cards are going to become increasingly common and they need to be "off" until fixed because they are impacting all results. I have mine off, having only tried it once per driver version after I realized the problem.

If I can figure out what the steps are to reproduce, I will submit an AMD report in the next week or so, hopefully. I do have an AMD RX 480 in my closet that was my main card prior to the RX5700 purchase; I will see how that works if I can.
ID: 2021749
Bluerazor

Joined: 22 May 99
Posts: 15
Credit: 3,889,427
RAC: 12
United States
Message 2021775 - Posted: 5 Dec 2019, 2:51:11 UTC - in response to Message 2021749.  

Well, that wasn't that hard - I ran the reference WU on my RX5700, my Ryzen 9 3900X, and on a laptop with an i7-8550U (CPU), Intel HD 620 iGPU, and GeForce MX150 GPU. All of them worked the same except the RX5700 which gave faulty results. Now, I need to get time to crack open my desktop, pull out the RX5700, and install the RX 480. That will take more time.
ID: 2021775
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 12966
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2021777 - Posted: 5 Dec 2019, 2:54:32 UTC - in response to Message 2021749.  

For those of you who have these cards and the time, read this thread.

Yes, I know that this is an Nvidia driver problem, but links to the tools needed, as well as the relevant information needed for constructing a full report to AMD, are there.

You may also realise why Raistmer may not have the time to spend on it even if he had the hardware available.

Cheers.


Thanks. That thread (and its linked thread) is quite extensive, but I did read through them. I don't fault Raistmer at all for this or for not spending more time on it. The RX5700 OpenCL issues on Windows have been reported from other projects/applications, and AMD should eventually address them, hopefully. Unfortunately, the cards are going to become increasingly common and they need to be "off" until fixed because they are impacting all results. I have mine off, having only tried it once per driver version after I realized the problem.

If I can figure out what the steps are to reproduce, I will submit an AMD report in the next week or so, hopefully. I do have an AMD RX 480 in my closet that was my main card prior to the RX5700 purchase; I will see how that works if I can.

Are you able to log in to that thread and answer the question from the AMD moderator about the steps needed to reproduce? Do you know where to point them for the source packages for oclFFT? I am unsure what they are asking for, in fact. Are they asking for the source code to compile the app, perchance? He can look here:
https://setisvn.ssl.berkeley.edu/svn/branches/sah_v7_opt/src/
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2021777
Bluerazor

Joined: 22 May 99
Posts: 15
Credit: 3,889,427
RAC: 12
United States
Message 2021779 - Posted: 5 Dec 2019, 3:10:01 UTC - in response to Message 2021777.  
Last modified: 5 Dec 2019, 3:12:14 UTC


Are you able to log in to that thread and answer the question from the AMD moderator about the steps needed to reproduce? Do you know where to point them for the source packages for oclFFT? I am unsure what they are asking for, in fact. Are they asking for the source code to compile the app, perchance? He can look here:
https://setisvn.ssl.berkeley.edu/svn/branches/sah_v7_opt/src/


No, I did not know where they were. While I am IT-skilled, I am no expert with SETI apps or OpenCL.

I'm not able to reply yet because I am still experimenting, but the steps so far are pretty much as follows, after sorting through the thread...
1) Download the Lunatics installer and the reference WU (optionally also the reference result).
2) Unzip the Lunatics installer. In the installer directory, create a Lunatics.ini file with the following contents:
[LunaticsInstaller]
TestMode=1
3) Run the installer and install the CPU and ATI_HD5 Multibeam v8 (GPU) applications. (Note: I have not tried this on a PC without BOINC installed; it likely requires BOINC to be preinstalled.) The installer will indicate the TestFiles directory where the apps are installed.
4) Copy the reference WU file into the target directory and rename it work_unit.sah.
5) Run the executable of your choice (CPU or GPU).
6) Inspect the results in stderr.txt and result.sah. A comparison will readily show that the RX5700 GPU result overflows the spike limit of 30. The same can be reproduced with any work unit.
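
The comparison in step 6 can be partly automated. Below is a minimal Python sketch, assuming that each reported spike shows up in result.sah as an opening <spike> tag and that reaching the limit of 30 counts as an overflow. Both are assumptions about the file layout and the app's behaviour, not something confirmed in this thread. The function names and the example paths are hypothetical.

```python
import re
import sys
from pathlib import Path
from typing import Dict, Tuple

SPIKE_LIMIT = 30  # spike limit quoted in step 6

# Assumption: each reported spike is written as an opening <spike> tag.
SPIKE_TAG = re.compile(r"<spike>")


def count_spikes(result_text: str) -> int:
    """Count opening <spike> tags in the text of a result file."""
    return len(SPIKE_TAG.findall(result_text))


def is_overflow(result_text: str, limit: int = SPIKE_LIMIT) -> bool:
    """Assumption: reaching the limit counts as an overflow."""
    return count_spikes(result_text) >= limit


def compare_results(results: Dict[str, str]) -> Dict[str, Tuple[int, bool]]:
    """Map each device label to (spike count, overflow flag)."""
    return {
        label: (count_spikes(text), is_overflow(text))
        for label, text in results.items()
    }


if __name__ == "__main__":
    # Hypothetical usage:
    #   python compare_spikes.py cpu=cpu/result.sah rx5700=gpu/result.sah
    texts = {}
    for arg in sys.argv[1:]:
        if "=" not in arg:
            continue  # ignore anything that is not label=path
        label, _, path = arg.partition("=")
        texts[label] = Path(path).read_text(errors="replace")
    for label, (spikes, overflow) in compare_results(texts).items():
        print(f"{label}: {spikes} spikes, overflow={overflow}")
```

Running it against the result.sah from each device gives one line per device, which is probably easier to paste into a bug report than the raw files.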

Now, I am unsure what the relationship is between the optimized apps from that test installer and the current stock apps, but I assumed from that thread that they are the same.

Also as far as replying on the AMD forum, it is possible one has to be "whitelisted" to reply / post so it may take some time to do that as well.
ID: 2021779
Bluerazor

Joined: 22 May 99
Posts: 15
Credit: 3,889,427
RAC: 12
United States
Message 2021782 - Posted: 5 Dec 2019, 3:34:53 UTC - in response to Message 2018182.  
Last modified: 5 Dec 2019, 3:56:59 UTC

AP app seems working well with the 5700XT !

https://setiathome.berkeley.edu/results.php?hostid=8772813&offset=0&show_names=0&state=0&appid=20
The RX5700 version works well in a Mac, https://setiathome.berkeley.edu/results.php?hostid=8592369&offset=120 You could try the other MB8 Windows App versions and see if they work with the RX5700XT. Just install Lunatics and then insert the Apps from Here, https://setiathome.berkeley.edu/forum_thread.php?id=79765&postid=1801541#1801541

BOINC will run the NV & Intel Apps on the AMD 5700 by just keeping ATI in the <coproc> section instead of the other names. The current Stock Mac NV App is really the Intel_gpu build due to the normal NV builds not working very well on the NV GPUs. Use the included app_info.xml from the download to set it up and just use ATI in the <coproc> section.

Make sure to set your cache very Low first; something like 0.001 will just download a couple of tasks at a time in case it doesn't work. Also, disable networking and suspend all but one task so you can test all the App versions. Or, just download and use the Benchmark package to test it offline, but the Benchmark test may not run the NV & Intel Apps on the AMD card the way Anonymous platform will.


I had a testing error while trying this initially. It looks like the Intel app also overflows the spikes, at least as best I can tell. So does the older APU app r3430.
ID: 2021782
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 12966
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2021789 - Posted: 5 Dec 2019, 5:51:37 UTC - in response to Message 2021779.  

No, I did not know where they were. While I am IT-skilled, I am no expert with SETI apps or OpenCL.


You can download the offline test bench application package, run tasks on the RX480 and the RX5700, and compare the results. The test package has Raistmer's SoG app in it, which is the app causing the issues.
http://lunatics.kwsn.info/index.php?action=downloads;sa=view;down=236

There is no difference in the code from the package r3557 app and the stock Seti SoG r3584 app.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2021789
Wiggo
Joined: 24 Jan 00
Posts: 23406
Credit: 261,360,520
RAC: 489
Australia
Message 2021929 - Posted: 6 Dec 2019, 11:41:02 UTC

I've been screwed again by a pair of RX5700's, https://setiathome.berkeley.edu/workunit.php?wuid=3772261364. :-(

Cheers.
ID: 2021929
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13248
Credit: 208,696,464
RAC: 304
Australia
Message 2021966 - Posted: 6 Dec 2019, 21:49:27 UTC - in response to Message 2021929.  

I've been screwed again by a pair of RX5700's, https://setiathome.berkeley.edu/workunit.php?wuid=3772261364. :-(
I get it about twice every 3 days or so.
Grant
Darwin NT
ID: 2021966
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2022026 - Posted: 7 Dec 2019, 6:10:00 UTC
Last modified: 7 Dec 2019, 6:13:36 UTC

This is Now a constant problem. It's way past time to set the Server to only assign One ATI/AMD Host per MB Work Unit. That would solve the problem and it's Already being done on the AP Work Units.

This is just One Day;
Task        Computer  Sent                      Time reported             Status                        Run time  CPU time  Credit  Application
8302185908  8097309   5 Dec 2019,  9:05:24 UTC  6 Dec 2019,  8:51:34 UTC  Completed, marked as invalid    389.03    375.35    0.00  SETI@home v8 Anonymous platform (NVIDIA GPU)
8302185909  8078707   5 Dec 2019,  9:05:26 UTC  5 Dec 2019,  9:23:04 UTC  Completed and validated            15.26     12.13    1.69  SETI@home v8 v8.22 (opencl_ati5_SoG_nocal) windows_intelx86
8306207516  7755250   6 Dec 2019, 13:08:08 UTC  6 Dec 2019, 13:27:18 UTC  Completed and validated            18.12     15.17    1.69  SETI@home v8 v8.22 (opencl_ati_nocal) windows_intelx86

8306038393  6813106   6 Dec 2019, 11:59:46 UTC  6 Dec 2019, 20:51:17 UTC  Completed, marked as invalid    143.36     73.23    0.00  SETI@home v8 Anonymous platform (NVIDIA GPU)
8306038394  8821706   6 Dec 2019, 11:59:49 UTC  6 Dec 2019, 12:10:09 UTC  Completed and validated            13.16     10.25    1.53  SETI@home v8 v8.22 (opencl_ati5_nocal) windows_intelx86
8307980131  8856643   7 Dec 2019,  1:42:48 UTC  7 Dec 2019,  1:58:39 UTC  Completed and validated            19.05     13.64    1.53  SETI@home v8 v8.22 (opencl_ati5_nocal) windows_intelx86

8303378220  6813106   5 Dec 2019, 17:17:59 UTC  6 Dec 2019,  2:18:18 UTC  Completed, marked as invalid    108.07     54.48    0.00  SETI@home v8 Anonymous platform (NVIDIA GPU)
8303378221  8856740   5 Dec 2019, 17:17:53 UTC  5 Dec 2019, 17:23:04 UTC  Completed and validated            12.07      9.92    1.27  SETI@home v8 v8.22 (opencl_ati5_nocal) windows_intelx86
8305278519  8859902   6 Dec 2019,  6:58:09 UTC  6 Dec 2019,  7:13:04 UTC  Completed and validated            14.16     11.05    1.27  SETI@home v8 v8.22 (opencl_ati5_SoG_nocal) windows_intelx86

8305718679  8836536   6 Dec 2019,  9:55:21 UTC  6 Dec 2019, 10:07:19 UTC  Completed and validated            11.09      9.19    1.30  SETI@home v8 v8.22 (opencl_ati_nocal) windows_intelx86
8305718680  6796479   6 Dec 2019,  9:55:22 UTC  6 Dec 2019, 14:23:06 UTC  Completed, marked as invalid    130.72    129.13    0.00  SETI@home v8 Anonymous platform (NVIDIA GPU)
8306987953  8772813   6 Dec 2019, 18:43:31 UTC  6 Dec 2019, 19:19:22 UTC  Completed and validated            15.45     12.16    1.30  SETI@home v8 v8.22 (opencl_ati5_SoG_nocal) windows_intelx86
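
For illustration, the rule proposed at the top of this post (only one ATI/AMD host per MB work unit) could be sketched in Python roughly as follows. This is not BOINC scheduler code. The Host type, the vendor labels, and may_assign are hypothetical names, and the host IDs in the usage comment are taken from the first group in the table above.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical vendor label for the hosts the rule restricts.
RESTRICTED_VENDOR = "ATI"


@dataclass(frozen=True)
class Host:
    host_id: int
    gpu_vendor: str  # hypothetical label, e.g. "ATI", "NVIDIA", "CPU"


def may_assign(candidate: Host, already_assigned: Iterable[Host]) -> bool:
    """Allow at most one ATI/AMD host among the hosts working on a work unit."""
    if candidate.gpu_vendor != RESTRICTED_VENDOR:
        return True
    return all(h.gpu_vendor != RESTRICTED_VENDOR for h in already_assigned)


# Example: the work unit already went to an NVIDIA host and one ATI host,
# so a second ATI host would be refused under this rule.
# may_assign(Host(7755250, "ATI"), [Host(8097309, "NVIDIA"), Host(8078707, "ATI")])
```

Under such a rule the third task in each group above would have gone to a non-ATI host, so two ATI results could not have validated each other.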
ID: 2022026
Tomcat雄猫

Joined: 20 Dec 14
Posts: 9
Credit: 391,588
RAC: 19
Canada
Message 2022349 - Posted: 8 Dec 2019, 12:16:18 UTC - in response to Message 2021789.  
Last modified: 8 Dec 2019, 12:17:57 UTC

No, I did not know where they were. While I am IT-skilled, I am no expert with SETI apps or OpenCL.


You can download the offline test bench application package, run tasks on the RX480 and the RX5700, and compare the results. The test package has Raistmer's SoG app in it, which is the app causing the issues.
http://lunatics.kwsn.info/index.php?action=downloads;sa=view;down=236

There is no difference in the code from the package r3557 app and the stock Seti SoG r3584 app.


I have an RX5700, an old MacBook Pro with an Intel iGPU, as well as an Nvidia 1060 6GB in my main laptop. I do not own a functional GCN-based ATI/AMD card. Will running the package and comparing the results on those devices work? I am willing to try running the package and posting the results here. However, I am not IT-skilled, at all.
ID: 2022349
wujj123456

Joined: 5 Sep 04
Posts: 40
Credit: 20,877,975
RAC: 219
China
Message 2022362 - Posted: 8 Dec 2019, 16:26:38 UTC
Last modified: 8 Dec 2019, 16:35:38 UTC

Is there something in BOINC that would automatically detect faulty computers and start reducing the work sent to them? I am looking at computers like this:
https://setiathome.berkeley.edu/results.php?hostid=8824639

The only valid results seem to be cross-validations between AMD GPUs. Limiting each WU to one AMD GPU is good in terms of not tainting the results, but there is little point in sending a WU to that computer just to waste its power in the first place...
ID: 2022362
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4135
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2022369 - Posted: 8 Dec 2019, 17:00:45 UTC - in response to Message 2022362.  

there is, but I don't know the exact thresholds for it. I think systems like these slide under the radar because they still occasionally send "Valid" (huge air quotes here) results. The validation process is automated, and it's not smart enough to know that the two matching results from 2 AMD cards are actually the incorrect ones.
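
To make the "slide under the radar" point concrete, here is a rough Python sketch of a quota rule driven by a host's invalid ratio. It is not the actual BOINC mechanism. The function names, the 0.5 threshold, and the scaling rule are all made up for illustration, since the real thresholds are not known here.

```python
def invalid_ratio(valid: int, invalid: int) -> float:
    """Fraction of a host's finished tasks that were marked invalid."""
    total = valid + invalid
    return invalid / total if total else 0.0


def adjusted_quota(base_quota: int, valid: int, invalid: int,
                   threshold: float = 0.5, min_quota: int = 1) -> int:
    """Cut the daily quota once the invalid ratio passes a (hypothetical) threshold.

    Above the threshold the quota is scaled by the share of valid results,
    using integer arithmetic, and never drops below min_quota.
    """
    if invalid_ratio(valid, invalid) <= threshold:
        return base_quota
    return max(min_quota, base_quota * valid // (valid + invalid))
```

The weakness is visible in the inputs: every cross-validated pair of bad results is counted under valid, so a faulty host keeps looking healthier than it is.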

requiring matching results from different apps would likely solve this problem. then you would never have cross validations like this.
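
A minimal Python sketch of that idea, for illustration only: a result is accepted only when it agrees with a result produced by a different app. The Result type, the app_class labels, and find_canonical are hypothetical, and the plain string comparison stands in for whatever comparison the real validator performs on the reported signals.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Result:
    host_id: int
    app_class: str  # hypothetical label, e.g. "opencl_ati", "nvidia", "cpu"
    signature: str  # hypothetical stand-in for the signals the validator compares


def find_canonical(results: Sequence[Result]) -> Optional[Result]:
    """Return the first result that agrees with a result from a different app class.

    Two matching results from the same app class do not count as agreement,
    so None is returned and another result would have to be requested.
    """
    for i, a in enumerate(results):
        for b in results[i + 1:]:
            if a.signature == b.signature and a.app_class != b.app_class:
                return a
    return None
```

With two matching opencl_ati results and one differing NVIDIA result, this returns None instead of validating the pair, and the work unit would wait for another host to break the tie.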
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2022369
Wiggo
Joined: 24 Jan 00
Posts: 23406
Credit: 261,360,520
RAC: 489
Australia
Message 2022511 - Posted: 9 Dec 2019, 18:23:00 UTC

ID: 2022511
Mr. Kevvy
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3297
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2022525 - Posted: 9 Dec 2019, 19:43:27 UTC
Last modified: 10 Dec 2019, 11:58:35 UTC

This is the first time I have checked my invalids since this thread started, and of course I have a few cross-validations: 3778213593, 3777934914, 3777934944 and 3777934950.

I also had 3775326461 but it was deleted, so there is some proactive purging going on.

Edit: All of the four above that I could originally link to have now been removed as well since I posted this.
ID: 2022525


 
©2021 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.