I am getting a lot of gpu tasks with zero (0) expected processing times.

Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1978807 - Posted: 5 Feb 2019, 23:47:12 UTC - in response to Message 1978802.  

I just got an AP task, and here are the parameters for it.
<workunit>
<name>ap_10ja19aa_B4_P0_00260_20190111_05360.wu</name>
<app_name>astropulse_v7</app_name>
<version_num>708</version_num>
<rsc_fpops_est>38074383581806.976562</rsc_fpops_est>
<rsc_fpops_bound>380743835818069.750000</rsc_fpops_bound>
<rsc_memory_bound>167772160.000000</rsc_memory_bound>
<rsc_disk_bound>62914560.000000</rsc_disk_bound>
<file_ref>
<file_name>ap_10ja19aa_B4_P0_00260_20190111_05360.wu</file_name>
<open_name>in.dat</open_name>
</file_ref>
</workunit>
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1978807 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1978808 - Posted: 5 Feb 2019, 23:52:44 UTC - in response to Message 1978807.  

OK, the ratio is the standard (default) 10:1 - so if your speed estimate is more than 10 times what you can deliver in practice, it all falls over. That gives us a datum. And I'm off to bed.
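To make that ratio concrete, here is a minimal Python sketch using the two values from the workunit posted above. It assumes the client gets both the runtime estimate and the time limit by dividing the fpops figure by one speed figure; the function names and the 1e12 speed are made up for illustration and are not taken from the BOINC source.

```python
# Values copied from the <workunit> block posted above.
RSC_FPOPS_EST = 38074383581806.976562
RSC_FPOPS_BOUND = 380743835818069.750000


def bound_to_est_ratio(fpops_est: float, fpops_bound: float) -> float:
    """How many times longer than the estimate a task may run before the limit."""
    return fpops_bound / fpops_est


def seconds_at_speed(fpops: float, flops: float) -> float:
    """Time needed for `fpops` operations at `flops` operations per second."""
    return fpops / flops


if __name__ == "__main__":
    ratio = bound_to_est_ratio(RSC_FPOPS_EST, RSC_FPOPS_BOUND)
    print(f"bound/est ratio: {ratio:.1f}")

    # Hypothetical speed figure, chosen only to keep the arithmetic readable.
    assumed_flops = 1e12
    print(f"estimate:   {seconds_at_speed(RSC_FPOPS_EST, assumed_flops):.1f} s")
    print(f"time limit: {seconds_at_speed(RSC_FPOPS_BOUND, assumed_flops):.1f} s")
```

Whatever speed is plugged in, the limit stays ten times the estimate, so a speed figure that is more than ten times too optimistic leaves no room for the task to finish.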
ID: 1978808 · Report as offensive
Profile Bill Special Project $75 donor
Volunteer tester
Avatar

Send message
Joined: 30 Nov 05
Posts: 282
Credit: 6,916,194
RAC: 60
United States
Message 1978832 - Posted: 6 Feb 2019, 1:51:42 UTC - in response to Message 1978802.  

Richard or Bob, I don't know if you can answer this question, but why is it that this is a problem only for AP7 tasks? I have no problems crunching regular S@H v8 GPU tasks. Right now all those tasks have an ETA of 50-60 minutes. At a glance of my recently validated tasks, they complete in about 40-50 minutes.
I'd need to know more about your setup, perhaps tomorrow - UK bedtime approaches...

Specifically, do you specify/run your own applications via 'anonymous platform' - an app_info.xml file?

If so, have you defined a speed estimate - a <flops> value - in that file, for either or both applications?

What are you seeing for the 'Peak Flops' performance of your GPU, in BOINC's startup Event Log?

And more generally - anyone can answer, please - what is the ratio of <rsc_fpops_est> to <rsc_fpops_bound> for AP tasks (I can look up MB tasks myself). That ratio determines what is regarded as the time limit to be exceeded for a 197 error.
No, I am not running anonymous applications, they are all stock.
As for flops, here's that and a bit more info from the startup of BOINC:
2/5/2019 7:48:28 PM | | OpenCL: AMD/ATI GPU 0: AMD Radeon(TM) Vega 8 Graphics (driver version 2766.5 (PAL,HSAIL), device version OpenCL 2.0 AMD-APP (2766.5), 7206MB, 7206MB available, 43980464 GFLOPS peak)
2/5/2019 7:48:28 PM | | [coproc] No NVIDIA library found
2/5/2019 7:48:28 PM | | [coproc] No ATI library found.
2/5/2019 7:48:28 PM | | Host name: DESKTOP-FIDJHGU
2/5/2019 7:48:28 PM | | Processor: 4 AuthenticAMD AMD Ryzen 3 2200G with Radeon Vega Graphics [Family 23 Model 17 Stepping 0]
2/5/2019 7:48:28 PM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 htt pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrandsyscall nx lm avx avx2 svm sse4a osvw skinit wdt tce topx page1gb rdtscp fsgsbase bmi1 smep
2/5/2019 7:48:28 PM | | OS: Microsoft Windows 10: Core x64 Edition, (10.00.17763.00)
2/5/2019 7:48:28 PM | | Memory: 13.93 GB physical, 16.06 GB virtual
2/5/2019 7:48:28 PM | | Disk: 465.22 GB total, 419.08 GB free
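For what it's worth, that "43980464 GFLOPS peak" figure would account for a 0:00 estimate on its own. A minimal sketch, assuming the client simply divides rsc_fpops_est by the device speed and that the speed it uses is close to the reported peak (the real client may scale the figure first); the function name is invented for the example:

```python
GFLOPS = 1e9

# From the Event Log line above and the AP workunit posted earlier in the thread.
REPORTED_PEAK_GFLOPS = 43980464
RSC_FPOPS_EST = 38074383581806.976562


def estimated_runtime_seconds(fpops_est: float, speed_gflops: float) -> float:
    """Runtime estimate under the assumption estimate = fpops_est / speed."""
    return fpops_est / (speed_gflops * GFLOPS)


if __name__ == "__main__":
    secs = estimated_runtime_seconds(RSC_FPOPS_EST, REPORTED_PEAK_GFLOPS)
    print(f"estimated runtime: {secs * 1000:.3f} ms")
```

Under those assumptions the estimate comes out at less than a millisecond, which any display rounded to whole seconds would show as zero.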

Thanks for looking into this!

Bill
Seti@home classic: 1,456 results, 1.613 years CPU time
ID: 1978832 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22200
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1978867 - Posted: 6 Feb 2019, 6:29:50 UTC


Isn't the lower limit already in place?

peak_flops = (x>0)?x:5e10



I know the lower limit isn't the problem here; it's the upper limit. I could see the upper limit being a problem as new GPUs enter the market and test that upper limit, but I guess we would know that relatively quickly, and the GPU would still work. I'm speculating, though, I have only glanced at the code casually.
Richard or Bob, I don't know if you can answer this question, but why is it that this is a problem only for AP7 tasks? I have no problems crunching regular S@H v8 GPU tasks. Right now all those tasks have an ETA of 50-60 minutes. At a glance of my recently validated tasks, they complete in about 40-50 minutes.


No, it's in the wrong place to do any good on anything other than an "undefined" GPU - such is the mess of if/then/else statements around the determination of peak_flops :-(
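The idea being discussed, a check with both a lower and an upper limit applied once after every detection branch has had its say, could be sketched like this in Python. This is not the BOINC code: the ceiling value and the function name are hypothetical, and only the 5e10 fallback comes from the snippet quoted above.

```python
import math

DEFAULT_PEAK_FLOPS = 5e10    # fallback value from the snippet quoted above
MAX_PLAUSIBLE_FLOPS = 1e15   # hypothetical ceiling, NOT taken from the BOINC source


def sane_peak_flops(raw: float) -> float:
    """Return `raw` if it looks plausible, otherwise the default.

    Intended to run last, after whichever vendor-specific branch
    produced `raw`, so that no branch can bypass the check.
    """
    if not math.isfinite(raw) or raw <= 0 or raw > MAX_PLAUSIBLE_FLOPS:
        return DEFAULT_PEAK_FLOPS
    return raw


if __name__ == "__main__":
    # 43980464 GFLOPS is the figure reported in Bill's Event Log.
    print(sane_peak_flops(43980464e9))  # rejected, falls back to the default
    print(sane_peak_flops(1.0e12))      # a made-up plausible value passes through
```

Any fixed ceiling will eventually need raising as faster GPUs appear, which is the concern raised in the quoted post.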
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1978867 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1978887 - Posted: 6 Feb 2019, 10:26:54 UTC - in response to Message 1978832.  

OK, I think I've cracked it.

The est/bound ratio for MB is 20::1, but that's still too small to make any difference - red herring.

Would you believe it - this actually dates back to the code introduced with Credit New in 2010!

Remember that we're discussing Bill's case, and Bill is using stock applications - not an app_info file.

In this case, BOINC keeps track of the actual performance of your hardware on the server. When it allocates new work to your computer, it also tells your computer what the running average speed of your previous tasks has been - and your BOINC client uses that average speed for estimates, in preference to the measured hardware speed.

So far, so good - the system works well (so well that we've stopped noticing it).

But what happens if the running average isn't available? Then BOINC falls back on the hardware assessment, and in your case, that's gone badly wrong. Which means that every AP task you run on your ATI card will fail, and the server will never be able to assess the situation properly. So you're stuck in a loop with no way out.

For MB, the server had already made its assessment before AMD f***'d up their driver - so you can go on using that indefinitely. Until SETI updates the MB application, at which point everything is reset to zero and everyone will enter the infinite loop together :-)

Worse - you could go on processing AP tasks on your CPU, which isn't affected by the faulty GPU speed estimate. But you can't selectively turn off AP processing on GPU - you can turn off AP, or you can turn off the GPU, but you can't turn off the combination.

For other watchers - the clues are all in Application details for host 8640304. Any application version where you have 'completed' 11 or more tasks will be safe - and you can see that MB qualifies, but AP doesn't.
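The fallback described above can be sketched in a few lines of Python. This is only an illustration of the behaviour as described in this post, not BOINC's actual code; the function name, parameter names and the example numbers are all hypothetical, and only the "11 or more" threshold comes from the text.

```python
from typing import Optional

MIN_COMPLETED_TASKS = 11  # "11 or more" threshold quoted in the post above


def speed_for_estimates(completed_tasks: int,
                        average_rate_flops: Optional[float],
                        hardware_peak_flops: float) -> float:
    """Prefer the server's running average; otherwise use the hardware figure."""
    if completed_tasks >= MIN_COMPLETED_TASKS and average_rate_flops:
        return average_rate_flops
    return hardware_peak_flops


if __name__ == "__main__":
    bogus_peak = 4.4e16  # made-up stand-in for a wildly inflated driver figure

    # App version with plenty of history: the bogus peak is never consulted.
    print(speed_for_estimates(50, 2.0e11, bogus_peak))

    # App version with too little history: the bogus peak is used, every task
    # then fails, and the history never grows.
    print(speed_for_estimates(2, 2.0e11, bogus_peak))
```

The second call shows the loop: failed tasks add nothing to the completed count, so the condition for switching to the running average is never met.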

So, what's to do? There's been no further response from David overnight, and he'll be asleep now. So I'm going to stock up on caffeine, and then apply his code to the ATI pathway myself. Once I've managed that (it may take some time...), there'll be another BOINC client application to test - I'll come back with notes on how to do that.
ID: 1978887 · Report as offensive
Profile Bill Special Project $75 donor
Volunteer tester
Avatar

Send message
Joined: 30 Nov 05
Posts: 282
Credit: 6,916,194
RAC: 60
United States
Message 1978889 - Posted: 6 Feb 2019, 12:11:27 UTC - in response to Message 1978887.  

Richard, glad to hear there is at least a path forward with this. I don't think it matters much now, but I did receive an AP task overnight and it does have a 0:00 ETA. If anything, that confirms the test from a few days ago didn't work.
Seti@home classic: 1,456 results, 1.613 years CPU time
ID: 1978889 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1978892 - Posted: 6 Feb 2019, 12:53:11 UTC - in response to Message 1978889.  

Well, I've got another one for you to test ;-)

Same procedure as last time: you want the win-client from

https://ci.appveyor.com/project/BOINC/boinc/builds/22166081/artifacts

This should apply the sanity-check for the coprocessor speed in the correct place: it would be most helpful if you could test it in the next 30 hours (before tomorrow evening, UK time), as we should have a conference call which David may attend.

This build will also contain the fix for the scheduling bug that Keith Myers has been wrestling with, but not yet the fix for the consequential work fetch bug. Treat it with caution for testing only.
ID: 1978892 · Report as offensive
Profile Bill Special Project $75 donor
Volunteer tester
Avatar

Send message
Joined: 30 Nov 05
Posts: 282
Credit: 6,916,194
RAC: 60
United States
Message 1978894 - Posted: 6 Feb 2019, 13:10:00 UTC - in response to Message 1978892.  

I'll give it a shot when I get home tonight. What info do you need exactly for it?

I should have suspended the AP7 task that I downloaded, but it should hopefully be there when I get back. If not, and if I don't have any new AP7 tasks, would MW@H help?

I'm just trying to get my ducks in a row here. I know it will be late for you when I get a chance to look at this.
Seti@home classic: 1,456 results, 1.613 years CPU time
ID: 1978894 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1978901 - Posted: 6 Feb 2019, 14:34:00 UTC - in response to Message 1978894.  

Thanks. I've already had one test report saying they have

06-Feb-2019 14:29:16 (low) [] OpenCL: AMD/ATI GPU 0: AMD Radeon(TM) Vega 8 Graphics (driver version 2766.5 (PAL,HSAIL), device version OpenCL 2.0 AMD-APP (2766.5), 6567MB, 6567MB available, 1000 GFLOPS peak)
so the first hurdle is crossed - I found the right place!

The next question, and the ideal test, is whether 1000 GFLOPS is close enough to the actual speed of your device to survive past 197 EXIT_TIME_LIMIT_EXCEEDED. If you snag an AP, could you let it run, please, to see what happens?

(Maybe a comparative driver version would be handy, too, so we can work out when this started going wrong)
ID: 1978901 · Report as offensive
Profile Bill Special Project $75 donor
Volunteer tester
Avatar

Send message
Joined: 30 Nov 05
Posts: 282
Credit: 6,916,194
RAC: 60
United States
Message 1978904 - Posted: 6 Feb 2019, 14:54:10 UTC - in response to Message 1978901.  

If you look at one of my earlier posts, I am running the same driver version 2766.5. Are you saying you want me to run a different driver? If so, I'm not sure what to do. I am currently running Adrenalin 2019 19.1.1. I thought that was the "driver" for the APU's graphics (or at least a means of updating the driver). What exactly would you like me to do?

As for the actual peak GFlops, is it supposed to reference single or double precision floating point? I have seen this reference for GFlops, but I have no way of verifying if it is accurate: https://www.techpowerup.com/gpu-specs/radeon-vega-8.c3042
Seti@home classic: 1,456 results, 1.613 years CPU time
ID: 1978904 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1978906 - Posted: 6 Feb 2019, 15:00:53 UTC - in response to Message 1978904.  
Last modified: 6 Feb 2019, 15:10:03 UTC

Don't worry about the driver - it was just quicker to ask than to look back in the thread. Thanks for the techpowerup link - I'll go and have a read. But the whole discussion about flops is very artificial - it depends hugely on whether the algorithm can be efficiently parallelized. There's no such thing as a right answer - just 'near enough' and consistent across the whole range of devices.

But you've given me the idea of one or two other places to look.

Edit - if anything, I'd suggest that the speeds should be compared with the

FP32 (float) performance	1,126 GFLOPS
from the review, so the hope is that AP tasks might run slightly faster than the initial estimate. But the best person to ask would be a fellow Ryzen owner, assuming they've got past the problem we're discussing here.
ID: 1978906 · Report as offensive
Profile Bill Special Project $75 donor
Volunteer tester
Avatar

Send message
Joined: 30 Nov 05
Posts: 282
Credit: 6,916,194
RAC: 60
United States
Message 1978912 - Posted: 6 Feb 2019, 15:29:00 UTC - in response to Message 1978906.  

So I just looked at a few of my MB GPU tasks that have been validated, and they are showing ~ 1,319,000 GFlops. Assuming that is float and not double, that is in line with what you have listed. Dumb question, but shouldn't peak flops be the same (give or take), regardless of what application runs it?
Seti@home classic: 1,456 results, 1.613 years CPU time
ID: 1978912 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1978920 - Posted: 6 Feb 2019, 16:02:51 UTC - in response to Message 1978912.  

Yes, you're right - Peak flops is a fixed value. I was thinking of 'effective processing speed', which will vary according to the efficiency of the application.

You'll see the effective speed listed as 'Average processing rate' or APR on the Application details page I linked earlier.

Interestingly, your Ryzen is showing an APR of 229.12 GFLOPS with the opencl_ati_100 application version - and you've completed two AP tasks since you created the machine record here on 23 Dec 2018. But you're down to "Max tasks per day: 1", which suggests that it broke at least 30 days ago - fairly soon after creation. Can you remember updating the graphics driver?
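As a back-of-envelope check on whether the patched 1000 GFLOPS figure leaves enough headroom, here is a Python sketch that treats the 229.12 GFLOPS APR as a stand-in for the real processing speed. It assumes the time limit is rsc_fpops_bound divided by the assumed speed, and that the real runtime is rsc_fpops_est divided by the effective speed; both are simplifications, rsc_fpops_est is itself only an estimate, and the function name is made up.

```python
GFLOPS = 1e9

# Values from the AP workunit posted earlier in the thread.
RSC_FPOPS_EST = 38074383581806.976562
RSC_FPOPS_BOUND = 380743835818069.750000


def survives(assumed_gflops: float, effective_gflops: float) -> bool:
    """True if the projected runtime fits inside the projected time limit."""
    time_limit = RSC_FPOPS_BOUND / (assumed_gflops * GFLOPS)
    runtime = RSC_FPOPS_EST / (effective_gflops * GFLOPS)
    return runtime <= time_limit


if __name__ == "__main__":
    apr = 229.12  # APR from the Application details page
    print("patched 1000 GFLOPS:", survives(1000, apr))
    print("reported 43980464 GFLOPS:", survives(43980464, apr))
```

On these assumptions 1000 GFLOPS is about 4.4 times the APR, inside the 10:1 margin, while the driver's reported figure is far outside it. A real task will tell for certain.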
ID: 1978920 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22200
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1978922 - Posted: 6 Feb 2019, 16:04:47 UTC

Realistically one would expect the peak flops to be the same, but there is some "strangeness" in the applications as well as the BOINC code that could do with being bottomed out.

Bill - If I were you I'd suspend all but one of the AP destined for your GPU until the first one has completed, "just to make sure". At first glance it does look as if Richard has found the right place and the value is "about right", but only time will tell.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1978922 · Report as offensive
Profile Bill Special Project $75 donor
Volunteer tester
Avatar

Send message
Joined: 30 Nov 05
Posts: 282
Credit: 6,916,194
RAC: 60
United States
Message 1979012 - Posted: 6 Feb 2019, 23:01:51 UTC - in response to Message 1978922.  

Ok, I just downloaded the latest build. I only need to replace boinc.exe, not any of the other files that are zipped, right?

I still have the same AP7 task sitting in my queue from this morning, and it still lists a 0:00 ETA. No other new AP7 tasks.
Seti@home classic: 1,456 results, 1.613 years CPU time
ID: 1979012 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1979013 - Posted: 6 Feb 2019, 23:06:27 UTC - in response to Message 1979012.  
Last modified: 6 Feb 2019, 23:07:02 UTC

I only need to replace boinc.exe, not any of the other files that are zipped, right?
Yup.

I still have the same AP7 task sitting in my queue from this morning, and it still lists a 0:00 eta. no other new AP7 tasks.
Hopefully, the estimate will change immediately when you restart with the new client - not guaranteed, please let us know. Then, see how it runs...
ID: 1979013 · Report as offensive
Profile Bill Special Project $75 donor
Volunteer tester
Avatar

Send message
Joined: 30 Nov 05
Posts: 282
Credit: 6,916,194
RAC: 60
United States
Message 1979037 - Posted: 7 Feb 2019, 0:30:37 UTC - in response to Message 1979013.  

Hopefully, the estimate will change immediately when you restart with the new client - not guaranteed, please let us know. Then, see how it runs...
Nope, still stayed 0:00 after the restart, and even when I forced it to run (by suspending everything else), it popped the standard computation error. I guess we're going to have to be patient as there are 0 AP tasks available at the moment.
Seti@home classic: 1,456 results, 1.613 years CPU time
ID: 1979037 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1979072 - Posted: 7 Feb 2019, 8:39:59 UTC - in response to Message 1979037.  

That's disappointing, and a bit worrying. Could you post the first dozen or so lines from your Event Log after startup, please? I'd be looking for something like this:

1: 06-Feb-2019 14:29:15 (low) [] cc_config.xml not found - using defaults
2: 06-Feb-2019 14:29:15 (low) [] Starting BOINC client version 7.15.0 for windows_x86_64
3: 06-Feb-2019 14:29:15 (low) [] This a development version of BOINC and may not function properly
4: 06-Feb-2019 14:29:15 (low) [] log flags: file_xfer, sched_ops, task
5: 06-Feb-2019 14:29:15 (low) [] Libraries: libcurl/7.47.1 OpenSSL/1.0.2g zlib/1.2.8
6: 06-Feb-2019 14:29:15 (low) [] Data directory: D:\BOINC
8: 06-Feb-2019 14:29:16 (low) [] OpenCL: AMD/ATI GPU 0: AMD Radeon(TM) Vega 8 Graphics (driver version 2766.5 (PAL,HSAIL), device version OpenCL 2.0 AMD-APP (2766.5), 6567MB, 6567MB available, 1000 GFLOPS peak)
9: 06-Feb-2019 14:29:16 (low) [] Version change (7.14.2 -> 7.15.0)
11: 06-Feb-2019 14:29:16 (low) [] Processor: 4 AuthenticAMD AMD Ryzen 3 2200G with Radeon Vega Graphics [Family 23 Model 17 Stepping 0]
(I've removed a couple of lines for privacy)

Note the "1000 GFLOPS peak" - that's my patch. Maybe it only comes into play when you download new work with the patch in place.
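If anyone wants to pull that figure out of a saved Event Log rather than reading it by eye, a small Python sketch along these lines should do. It assumes the line ends in the "NNN GFLOPS peak)" form shown in the excerpts in this thread; the function name is made up, and the sample strings are shortened excerpts, not complete log lines.

```python
import re
from typing import Optional

# Matches the number immediately before "GFLOPS peak" in an Event Log line.
PEAK_RE = re.compile(r"(\d+(?:\.\d+)?) GFLOPS peak")

# Shortened excerpts of the two log lines quoted in this thread.
PATCHED_LINE = "OpenCL: AMD/ATI GPU 0: ... 6567MB available, 1000 GFLOPS peak)"
UNPATCHED_LINE = "OpenCL: AMD/ATI GPU 0: ... 7206MB available, 43980464 GFLOPS peak)"


def peak_gflops(line: str) -> Optional[float]:
    """Return the reported peak in GFLOPS, or None if the line has no such field."""
    match = PEAK_RE.search(line)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    for line in (PATCHED_LINE, UNPATCHED_LINE):
        print(peak_gflops(line))
```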
ID: 1979072 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1979073 - Posted: 7 Feb 2019, 9:15:07 UTC - in response to Message 1976618.  

On 23 Jan, Rob wrote:
Yes there are a number of people suffering this just now - all appear to have "gained" stupidly high device peak FLOPS values - looking at one of your tasks I see:
Device peak FLOPS 	19,956,140.22 GFLOPS

Which is probably about 2000 times as high as it should be :-(
Rob, could you give me a guide to the origin for that statement? Any idea if it's ongoing? When did it start (first report date?)

Your workround of switching to anonymous platform and setting the value there is feasible at SETI, but doesn't work for people who don't seek help on the website, or at other BOINC projects.

I'm thinking of calling for an emergency hotfix BOINC release with a more refined version of my patch: I've got some time booked in tonight's conference call, and I'm briefing the chair in advance at 18:45 UTC tonight.

Are there any reports at all from platforms other than Windows? Since the problem seems to be in that 2766.5 Ryzen driver, it's likely to be platform specific.

There's also a conference call with project administrators next week. I'm thinking of asking project admins to check their databases beforehand for an increased error rate.
ID: 1979073 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22200
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1979077 - Posted: 7 Feb 2019, 10:58:26 UTC

Arghh-------
Now that means a bit of memory digging.
When this was first reported I had a look at a number of tasks on my list that were "pending", and at the reports for them, particularly where the wingman had timed out and I was paired with an AMD GPU. I found a few that had timed out with stupid peak_flops values, and found they had zero run-time estimates.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1979077 · Report as offensive
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.