Flakey AMD/ATI GPUs, including RX 5700 XT, Cross Validating, polluting the Database
Author | Message |
---|---|
catavalon21 Joined: 2 Nov 01 Posts: 13 Credit: 7,238,152 RAC: 48 |
Disregard my prior post. It turns out that multiple 5700s are cross-validating bad results, as others have already posted. |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
. . Well I never .... . . How annoying, one of my completed WUs got mugged by a couple of ATI hosts that are spewing out invalids all over the place. My (probably) valid result got binned because these 2 delinquent hosts agreed with each other even though most of the results on both machines are invalidated against other hosts. I have to wonder how often this happens ... there could be a lot of such trashy results in the database. :( https://setiathome.berkeley.edu/workunit.php?wuid=3766336632 Stephen :( |
Bluerazor Joined: 22 May 99 Posts: 15 Credit: 3,889,427 RAC: 12 |
Is there any simple way to demonstrate this problem as a reproducible issue to submit to AMD? For example, is it possible to capture a "known" work unit and run it on two systems (one with an RX 5700 and one with an older AMD card) to show that the RX 5700 gets the results wrong? This would presumably allow AMD to debug the issue. Raistmer filed a complaint with them, but they were looking for more information to reproduce the problem, which seems to relate to OpenCL on Windows specifically. I haven't yet tried the 19.12.1 driver, and I am reluctant to do so. No previous driver has resolved the issue, and there is nothing specific in the new release notes that would lead me to believe it fixes this issue. |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
There are multiple tools to download work units offline so you can run them offline on various hardware to compare results. I'm sure Raistmer has already done so and sent the results off to AMD. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Bluerazor Joined: 22 May 99 Posts: 15 Credit: 3,889,427 RAC: 12 |
"There are multiple tools to download work units offline so you can run them offline on various hardware to compare results. I'm sure Raistmer has already done so and sent the results off to AMD." Actually, in the thread to AMD he stated that he did not have time for debugging (in September). I believe there was also an issue of hardware availability on first release of the RX 5700. Based on the thread it seems like nothing has been provided. https://community.amd.com/thread/243179 (Even now for me to do this, I'd need to install my RX480 into my current PC as I have no other AMD hardware to run on). I just thought we might want to file some additional complaints along with instructions to reproduce ... I think it's clear that it is not an application problem since it is only this card's architecture failing (and because multiple other distributed computing projects have trouble with the RX 5700). The problem is likely to become worse with the impending release of 5500 and 5600 series cards. So it's also possible that AMD will get this fixed soon. https://www.tomshardware.com/news/report-radeon-rx-5500-xt-launches-next-week-rx-5600-xt-in-january This is a fairly buggy card/driver. I had frequent crashes upon idle, attributable to the driver, until I found some random hint on the internet that I should set my PCI Express to v3.0 instead of Auto detect. That fixed the issue. The list of known issues is still long. I expect this to be my last AMD GPU. |
Wiggo Joined: 24 Jan 00 Posts: 36378 Credit: 261,360,520 RAC: 489 |
For those of you with these cards and have the time, read this thread. Yes I know that this is a Nvidia driver problem, but links to the tools needed as well as the relevant information needed for constructing a full report to AMD are there. You may also realise why Raistmer may not have the time to spend on it even if he had the hardware available. Cheers. |
Bluerazor Joined: 22 May 99 Posts: 15 Credit: 3,889,427 RAC: 12 |
"For those of you with these cards and have the time, read this thread." Thanks. That (and its linked thread) are quite extensive, but I did read through them. I don't fault Raistmer at all for this or for not spending more time on it. The RX5700 OpenCL issues on Windows have been reported from other projects/applications, and AMD should eventually address them, hopefully. Unfortunately the cards are going to become increasingly common, and they need to be "off" until fixed because they are impacting all results. I have mine off, having only tried it once per driver version after I realized the problem. If I can figure out what the steps are to reproduce, I will submit an AMD report in the next week or so, hopefully. I do have an AMD RX 480 in my closet that was my main card prior to the RX5700 purchase; I will see how that works if I can. |
Bluerazor Joined: 22 May 99 Posts: 15 Credit: 3,889,427 RAC: 12 |
Well, that wasn't that hard - I ran the reference WU on my RX5700, my Ryzen 9 3900X, and on a laptop with an i7-8550U (CPU), Intel HD 620 iGPU, and GeForce MX150 GPU. All of them worked the same except the RX5700 which gave faulty results. Now, I need to get time to crack open my desktop, pull out the RX5700, and install the RX 480. That will take more time. |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
"For those of you with these cards and have the time, read this thread." Are you able to log in to that thread and answer the question from the AMD moderator about the steps needed to reproduce? Do you know where to point them to the source packages for oclFFT? I am unsure what they are asking for, in fact. Asking for the source code to compile the app, perchance? He can look here: https://setisvn.ssl.berkeley.edu/svn/branches/sah_v7_opt/src/ |
Bluerazor Joined: 22 May 99 Posts: 15 Credit: 3,889,427 RAC: 12 |
No, I did not know where they were. While I am IT-skilled, I am no expert with SETI apps or OpenCL. I'm not able to reply yet because I am still experimenting, but the steps so far are pretty much as follows, after sorting through the thread: 1) Download the Lunatics installer and the reference WU (optionally also the reference result). 2) Unzip the Lunatics installer. In the installer directory, create a Lunatics.ini file containing the section header [LunaticsInstaller] followed by the line TestMode=1. 3) Run the installer and install the CPU and ATI_HD5 Multibeam v8 (GPU) applications. (Note: I have not tried this on a PC without BOINC installed; it likely requires BOINC preinstalled.) The installer will indicate the TestFiles directory where the app is installed. 4) Copy the reference WU file into the target directory and rename it to work_unit.sah. 5) Run the executable of your choice (CPU or GPU). 6) Inspect the results in stderr.txt and result.sah. Comparison will readily show that the RX5700 GPU result overflows the spike limit of 30. The same can be reproduced with any work unit. Now, I am unsure what the relationship is between the optimized apps from that test installer and the current stock apps, but I assumed from that thread that they are the same. Also, as far as replying on the AMD forum goes, it is possible one has to be "whitelisted" to reply/post, so it may take some time to do that as well. |
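The comparison in step 6 can be sketched in a few lines of Python. This is a rough illustration only, under stated assumptions: it assumes each reported spike appears in result.sah as an opening `<spike>` tag, and it treats a count at or above the limit of 30 quoted in the post as an overflow. The function names are hypothetical and are not part of any SETI@home or Lunatics tool.

```python
import re
from pathlib import Path

SPIKE_LIMIT = 30  # spike limit quoted in the post above


def count_spikes(result_text: str) -> int:
    """Count opening <spike> tags (assumed result.sah layout)."""
    return len(re.findall(r"<spike>", result_text))


def overflows(result_text: str, limit: int = SPIKE_LIMIT) -> bool:
    """Assumption: a count at or above the limit means the result overflowed."""
    return count_spikes(result_text) >= limit


def summarize(paths):
    """Return (path, spike count, overflow flag) for each result file."""
    rows = []
    for path in paths:
        text = Path(path).read_text(errors="replace")
        rows.append((str(path), count_spikes(text), overflows(text)))
    return rows
```

Calling `summarize` on the result.sah files produced by each device (for example one from the RX5700 run and one from the CPU run, copied to separate folders of your choosing) would put the spike counts side by side, which may be a convenient summary to attach to an AMD report.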
Bluerazor Joined: 22 May 99 Posts: 15 Credit: 3,889,427 RAC: 12 |
"AP app seems working well with the 5700XT!" "The RX5700 version works well in a Mac, https://setiathome.berkeley.edu/results.php?hostid=8592369&offset=120 You could try the other MB8 Windows App versions and see if they work with the RX5700XT. Just install Lunatics and then insert the Apps from Here, https://setiathome.berkeley.edu/forum_thread.php?id=79765&postid=1801541#1801541 BOINC will run the NV & Intel Apps on the AMD 5700 by just keeping ATI in the <coproc> section instead of the other names. The current Stock Mac NV App is really the Intel_gpu build due to the normal NV builds not working very well on the NV GPUs. Use the included app_info.xml from the download to set it up and just use ATI in the <coproc> section. Make sure to set your cache very Low first, something like 0.001 will just download a couple of tasks at a time in case it doesn't work. Also, disable networking and suspend all but one task so you can test all the App versions. Or, just download and use the Benchmark package to test it offline, but, the Benchmark test may not run the NV & Intel Apps on the AMD card the way Anonymous platform will." I had a testing error while trying this initially. It looks like the Intel app also overflows the spikes, at least as best I can tell. So does the older APU app r3430. |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
"No, I did not know where they were. While I am IT-skilled, I am no expert with SETI apps or OpenCL." You can download the offline test bench application package and run tasks on the RX480 and the RX5700 and compare the results. The test package has Raistmer's SoG app in it that is causing the issues. http://lunatics.kwsn.info/index.php?action=downloads;sa=view;down=236 There is no difference in the code between the package's r3557 app and the stock Seti SoG r3584 app. |
Wiggo Joined: 24 Jan 00 Posts: 36378 Credit: 261,360,520 RAC: 489 |
I've been screwed again by a pair of RX5700's, https://setiathome.berkeley.edu/workunit.php?wuid=3772261364. :-( Cheers. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13835 Credit: 208,696,464 RAC: 304 |
"I've been screwed again by a pair of RX5700's, https://setiathome.berkeley.edu/workunit.php?wuid=3772261364. :-(" I get it about twice every 3 days or so. Grant Darwin NT |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
This is Now a constant problem. It's way past time to set the Server to only assign One ATI/AMD Host per MB Work Unit. That would solve the problem and it's Already being done on the AP Work Units. This is just One Day:
Task; Computer; Sent; Time reported; Status; Run time; CPU time; Credit; Application
8302185908; 8097309; 5 Dec 2019, 9:05:24 UTC; 6 Dec 2019, 8:51:34 UTC; Completed, marked as invalid; 389.03; 375.35; 0.00; SETI@home v8 Anonymous platform (NVIDIA GPU)
8302185909; 8078707; 5 Dec 2019, 9:05:26 UTC; 5 Dec 2019, 9:23:04 UTC; Completed and validated; 15.26; 12.13; 1.69; SETI@home v8 v8.22(opencl_ati5_SoG_nocal)windows_intelx86
8306207516; 7755250; 6 Dec 2019, 13:08:08 UTC; 6 Dec 2019, 13:27:18 UTC; Completed and validated; 18.12; 15.17; 1.69; SETI@home v8 v8.22(opencl_ati_nocal)windows_intelx86
8306038393; 6813106; 6 Dec 2019, 11:59:46 UTC; 6 Dec 2019, 20:51:17 UTC; Completed, marked as invalid; 143.36; 73.23; 0.00; SETI@home v8 Anonymous platform (NVIDIA GPU)
8306038394; 8821706; 6 Dec 2019, 11:59:49 UTC; 6 Dec 2019, 12:10:09 UTC; Completed and validated; 13.16; 10.25; 1.53; SETI@home v8 v8.22(opencl_ati5_nocal)windows_intelx86
8307980131; 8856643; 7 Dec 2019, 1:42:48 UTC; 7 Dec 2019, 1:58:39 UTC; Completed and validated; 19.05; 13.64; 1.53; SETI@home v8 v8.22(opencl_ati5_nocal)windows_intelx86
8303378220; 6813106; 5 Dec 2019, 17:17:59 UTC; 6 Dec 2019, 2:18:18 UTC; Completed, marked as invalid; 108.07; 54.48; 0.00; SETI@home v8 Anonymous platform (NVIDIA GPU)
8303378221; 8856740; 5 Dec 2019, 17:17:53 UTC; 5 Dec 2019, 17:23:04 UTC; Completed and validated; 12.07; 9.92; 1.27; SETI@home v8 v8.22(opencl_ati5_nocal)windows_intelx86
8305278519; 8859902; 6 Dec 2019, 6:58:09 UTC; 6 Dec 2019, 7:13:04 UTC; Completed and validated; 14.16; 11.05; 1.27; SETI@home v8 v8.22(opencl_ati5_SoG_nocal)windows_intelx86
8305718679; 8836536; 6 Dec 2019, 9:55:21 UTC; 6 Dec 2019, 10:07:19 UTC; Completed and validated; 11.09; 9.19; 1.30; SETI@home v8 v8.22(opencl_ati_nocal)windows_intelx86
8305718680; 6796479; 6 Dec 2019, 9:55:22 UTC; 6 Dec 2019, 14:23:06 UTC; Completed, marked as invalid; 130.72; 129.13; 0.00; SETI@home v8 Anonymous platform (NVIDIA GPU)
8306987953; 8772813; 6 Dec 2019, 18:43:31 UTC; 6 Dec 2019, 19:19:22 UTC; Completed and validated; 15.45; 12.16; 1.30; SETI@home v8 v8.22(opencl_ati5_SoG_nocal)windows_intelx86 |
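The "one ATI/AMD host per MB work unit" rule proposed above amounts to a simple eligibility check at assignment time. The sketch below is only an illustration of that idea and is not BOINC scheduler code: the `Host` class, the `gpu_vendor` labels, and `can_assign` are all hypothetical names, and the host IDs are borrowed from the task list purely as sample values.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Host:
    host_id: int
    gpu_vendor: str  # hypothetical label, e.g. "ATI", "NVIDIA", "INTEL", or "" for CPU-only


def can_assign(candidate: Host, already_assigned: Iterable[Host],
               restricted_vendor: str = "ATI") -> bool:
    """Allow at most one host of the restricted vendor per work unit."""
    if candidate.gpu_vendor != restricted_vendor:
        return True  # unrestricted vendors are always eligible
    return all(h.gpu_vendor != restricted_vendor for h in already_assigned)


# Sample values: host IDs taken from the task list above.
amd_a = Host(8078707, "ATI")
amd_b = Host(7755250, "ATI")
nvidia = Host(8097309, "NVIDIA")
```

With a rule like this, `amd_b` would be refused a copy of a work unit already held by `amd_a`, so two cards of the affected family could never form a quorum on their own.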
Tomcat雄猫 Joined: 20 Dec 14 Posts: 9 Credit: 391,588 RAC: 19 |
"No, I did not know where they were. While I am IT-skilled, I am no expert with SETI apps or OpenCL." I have an RX5700, an old MacBook Pro with an Intel iGPU, as well as an Nvidia 1060 6GB in my main laptop. I do not own a functional GCN-based ATI/AMD card. Will running the package and comparing the results on those devices work? I am willing to try running the package and posting the results here. However, I am not IT-skilled, at all. |
wujj123456 Joined: 5 Sep 04 Posts: 40 Credit: 20,877,975 RAC: 219 |
Is there something in BOINC that would automatically detect faulty computers and start reducing the work sent to them? I am looking at computers like this: https://setiathome.berkeley.edu/results.php?hostid=8824639 The only valid results seem to be cross-validations between AMD GPUs. Limiting to one AMD GPU per WU is good in terms of not tainting the results, but there is little point in sending a WU to that computer to waste their power in the first place... |
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
there is, but I don't know the exact thresholds for it. I think systems like these slide under the radar because they still occasionally send "Valid" (huge air quotes here) results. The validation process is automated, and it's not smart enough to know that the two matching results from 2 AMD cards are actually the incorrect ones. requiring matching results from different apps would likely solve this problem. then you would never have cross validations like this. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
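The idea of requiring matching results from different apps can be sketched as follows. This is a simplified illustration, not the project's validator: the `Result` class, the `app_class` labels, and `find_agreeing_pair` are hypothetical, and plain string equality on `signature` stands in for whatever signal comparison the real validator performs.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Result:
    host_id: int
    app_class: str  # hypothetical label such as "opencl_ati", "cuda", or "cpu"
    signature: str  # stand-in for the signal summary being compared


def find_agreeing_pair(results: Sequence[Result]) -> Optional[Tuple[Result, Result]]:
    """Return the first pair that agrees AND comes from different app classes."""
    for a, b in combinations(results, 2):
        if a.signature == b.signature and a.app_class != b.app_class:
            return a, b
    return None  # no cross-app agreement yet, so another copy would be needed
```

Under this rule, two matching results from the same app class would not be enough to validate a work unit, so a pair of faulty cards agreeing with each other could not outvote a correct result from a different app.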
Mr. Kevvy Joined: 15 May 99 Posts: 3797 Credit: 1,114,826,392 RAC: 3,319 |
First time I ever checked my invalids since this thread started, and of course I have a few cross-validations: 3778213593, 3777934914, 3777934944 and 3777934950. I also had 3775326461 but it was deleted, so there is some proactive purging going on. Edit: All of the four above that I could originally link to have now been removed as well since I posted this. |