I am getting a lot of gpu tasks with zero (0) expected processing times.

Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1976605 - Posted: 23 Jan 2019, 15:36:34 UTC

So they immediately time out/quit.

Not all of my GPU tasks are showing up with 0 processing times.

I think I read about someone else having this issue?

It seems to be specific to my Windows 10/AMD 2400G box.

Any fixes besides just letting them "run"?

Tom
A proud member of the OFA (Old Farts Association).
ID: 1976605
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22158
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1976618 - Posted: 23 Jan 2019, 17:17:05 UTC

Yes, there are a number of people suffering this just now - all appear to have "gained" stupidly high device peak FLOPS values. Looking at one of your tasks I see:
Device peak FLOPS: 19,956,140.22 GFLOPS

Which is probably about 2000 times as high as it should be :-(
One "fix" might be to manually set the peak flops value to something a bit more sensible in coproc_info.xml - here's the line to look for:
<peak_flops>8186112000000.000000</peak_flops>

That is lifted directly from my Windoze machine with a pair of GTX 1070 Ti cards, so the value shouldn't be too far from what you need, and don't you just love the six superfluous decimal places....
I know it's a real b*** a*** counting all the zeros - I've got it wrong a few times in the past :-(
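
As a rough check of the two figures above, here is a minimal C++ sketch that puts both values in the same unit and compares them. The helper names are hypothetical and are not part of BOINC; the only inputs are the GFLOPS figure from the task page and the <peak_flops> value quoted from coproc_info.xml.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper (not part of BOINC): convert a GFLOPS figure, as shown
// on the task page, to plain FLOPS, the unit used in coproc_info.xml.
constexpr double gflops_to_flops(double gflops) {
    return gflops * 1e9;
}

// Hypothetical helper: how many times larger `reported` is than `reference`.
double inflation_factor(double reported_flops, double reference_flops) {
    assert(reference_flops > 0.0);
    return reported_flops / reference_flops;
}

// Prints the comparison for the two values quoted in this post.
void print_comparison() {
    const double reported  = gflops_to_flops(19956140.22);  // from the task page
    const double reference = 8186112000000.0;               // quoted <peak_flops> value
    std::printf("reported  = %.3e FLOPS\n", reported);
    std::printf("reference = %.3e FLOPS\n", reference);
    std::printf("ratio     = %.0f\n", inflation_factor(reported, reference));
}
```

Calling print_comparison() should show a ratio of roughly 2,400, which is in line with the "about 2000 times" estimate above.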
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1976618
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1976640 - Posted: 23 Jan 2019, 20:02:38 UTC - in response to Message 1976618.  

TY, I will look for this and make the change.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1976640
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1976691 - Posted: 24 Jan 2019, 0:28:16 UTC

Nobody commented on my post wondering whether the latest change to BOINC caused this new issue. I keep wondering whether this type of failure, which wasn't common until recently, was caused by that new code change. https://setiathome.berkeley.edu/forum_thread.php?id=83758
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1976691
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22158
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1976732 - Posted: 24 Jan 2019, 6:34:07 UTC

Actually you picked up on something I'd said earlier in the same thread - had the API changed? It certainly looks that way. If it is an API change then someone will have to go for a dig, see what the changes are, and modify the code to match. This might be harder than it looks, as the API concerned is not in the "tender loving care" of either BOINC or SETI but of AMD/ATI (and I don't know how good their documentation is).
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1976732
Bill
Volunteer tester
Joined: 30 Nov 05
Posts: 282
Credit: 6,916,194
RAC: 60
United States
Message 1976759 - Posted: 24 Jan 2019, 15:10:00 UTC

Tom, is this happening for all tasks, or specific ones? I have had a similar problem with my Ryzen 3 2200G, but only for AP7 GPU tasks. I have posted about the problem over here: https://setiathome.berkeley.edu/forum_thread.php?id=83758.

Keith, I did message Raistmer, and he said he hadn't worked on the code recently. He thinks Mike may know more about what the problem may be. He's going to take a look into it and get back to me. I'll let everyone know when I get an update.
Seti@home classic: 1,456 results, 1.613 years CPU time
ID: 1976759
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22158
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1976760 - Posted: 24 Jan 2019, 15:22:49 UTC
Last modified: 24 Jan 2019, 15:28:46 UTC

Bill,
Tom's issue is specifically with MBs, and affects (just about) every MB run on his GPU - they crash out after a second or so, having exceeded their estimated run time of zero.
The AP issue may be related, but does not appear to be: those tasks run to completion and have run for a sensible duration, and it is "random" in that not every task fails. If it were the same issue I would expect all AP tasks to fail in the same manner as the MBs are failing, not some to complete and others to fail. Without a few "failed" AP records available to examine, though, it is very difficult to see exactly what errors (if any) are being reported by the AP applications (running on AMD/ATI GPUs).

Edit:
Cancel the above thought about the zero duration problem not necessarily applying to APs - I've just found a couple of AP tasks that failed with the same zero estimated duration, so it looks as if it is the AMD/ATI API that has changed. Since the estimated duration is calculated using data supplied by BOINC, and that data is well out of order, I doubt that either Raistmer or Mike (wearing their SETI hats) will find a solution.
I'll expand on this more when I get home in about an hour's time.

Note - At this time this does not appear to affect nVidia or Intel GPUs.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1976760
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1976768 - Posted: 24 Jan 2019, 16:42:39 UTC

Thanks for the insight, Rob. I had been thinking of the change by Richard in the BOINC 7.14.2 release as the cause, not that an AMD/ATI API change is the root cause of an issue that the newer BOINC amplifies with its new method of calculating peak_flops.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1976768
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22158
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1976774 - Posted: 24 Jan 2019, 17:10:24 UTC

I've further dug....
An' it ain't lookin' preddy :-(

Internal to BOINC, peak_flops is calculated from data drawn from either the card or the driver. It is actually calculated in more than one place (bad practice, but since the same set of instructions is used, of no significance).
It is mainly used in scheduling and work fetch (very much as one would expect), and also in the very first initial guess at how long a task will take.
A scaled version is transmitted to the science application as part of a whole load of data that heads that way. Now this is quite interesting, as the scaled version is cast into an integer whereas the calculated version is a double, but that shouldn't do what is being seen: the process is the same for nVidia and Intel GPUs, and any gloop in that casting would affect them in just the same manner.
BOINC outputs a scaled version to the output file when that is created.
I've not yet grabbed a copy of the stock MB/SoG or AP source code for ATI/AMD GPUs, so haven't had a prowl through there to see what it is used for.

I've already posted a possible work-around; can somebody try it and see what happens, please?
(You may have to restart BOINC, as it appears this is one of those values that is only read/calculated at start. Please post whatever happens; all information is of value at this stage.)
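
To illustrate why an inflated peak_flops could drive the first run-time guess to zero, here is a minimal C++ sketch. It assumes the initial guess is simply the task's estimated operation count divided by the device speed, which is a simplification and not the actual BOINC code; the function names are hypothetical, and the 1800-second task length is purely illustrative. The two speeds are the values quoted earlier in this thread.

```cpp
#include <cassert>

// Simplified model (an assumption, not the BOINC implementation):
// estimated seconds = estimated floating-point operations / device FLOPS.
double estimated_seconds(double fpops_est, double device_flops) {
    assert(device_flops > 0.0);
    return fpops_est / device_flops;
}

// Work size that would take `seconds` on a device running at `device_flops`.
double work_for(double seconds, double device_flops) {
    return seconds * device_flops;
}

// Whole seconds, as a progress display might show them (truncation assumed).
long whole_seconds(double seconds) {
    return static_cast<long>(seconds);
}
```

With these inputs, a task sized to take 1800 seconds at 8186112000000 FLOPS comes out at under one second at 19,956,140.22 GFLOPS, which truncates to 0 whole seconds.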
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1976774
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1976777 - Posted: 24 Jan 2019, 17:41:42 UTC

The coprocessor device capabilities are enumerated from their respective library API files: either the .DLL in Windows or the .so.1 file in Linux. The ATI/AMD cards get theirs via the gpu_opencl.cpp file in the /client sub-directory. Primarily, clGetDeviceInfo is pulled from the API library and gets used in calculating peak_flops.

The amount of RAM on the card is also used in calculating peak flops, and they seem to be doubling that on AMD cards because they are only reporting half of the actual RAM onboard. The clock frequency is also used.

#ifdef __APPLE__
    // Work around a bug in OpenCL which returns only
    // 1/2 of total global RAM size.
    // This bug applies only to ATI GPUs, not to NVIDIA
    // This has already been fixed on latest Catalyst
    // drivers, but Mac does not use Catalyst drivers.
    if (ati_opencls.size() > 0) {
        // This problem seems to be fixed in OS 10.7
        if (compareOSVersionTo(10, 7) < 0) {
            opencl_get_ati_mem_size_from_opengl(warnings);
        }
    }
#endif


// TODO: Find a better way to calculate / estimate peak_flops for future coprocessors?
prop.peak_flops = 0;
if (prop.max_compute_units) {
    prop.peak_flops = prop.max_compute_units * prop.max_clock_frequency * MEGA;
}
if (prop.peak_flops <= 0) prop.peak_flops = 45e9;

other_opencls.push_back(prop);
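
The calculation in the second excerpt can be tried in isolation. The following is a minimal, self-contained C++ sketch of it, not the BOINC source: the function name, the constant names, and the value chosen for the MEGA multiplier (1e6) are assumptions, and the inputs in the usage note are illustrative. Note also that the excerpt pushes its result onto other_opencls, so it may not be the branch that AMD cards actually take.

```cpp
#include <cassert>

// Assumed value for illustration: a multiplier that scales mega-units to
// units. Check the BOINC headers for the real definition of MEGA.
constexpr double kMegaAssumed = 1e6;

// Fallback used by the quoted excerpt when no positive value is computed.
constexpr double kFallbackPeakFlops = 45e9;

// Mirrors the quoted excerpt: compute units x clock (MHz) x MEGA,
// falling back to 45e9 when the result is not positive.
double estimate_peak_flops(int max_compute_units, int max_clock_frequency_mhz) {
    assert(max_compute_units >= 0);
    assert(max_clock_frequency_mhz >= 0);
    double peak_flops = 0.0;
    if (max_compute_units) {
        peak_flops = static_cast<double>(max_compute_units)
                   * max_clock_frequency_mhz * kMegaAssumed;
    }
    if (peak_flops <= 0.0) peak_flops = kFallbackPeakFlops;
    return peak_flops;
}
```

For example, estimate_peak_flops(11, 1250) gives 1.375e10 under these assumptions, and a device that reports zero compute units falls back to 45e9.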

Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1976777
Bill
Volunteer tester
Joined: 30 Nov 05
Posts: 282
Credit: 6,916,194
RAC: 60
United States
Message 1976778 - Posted: 24 Jan 2019, 17:45:52 UTC - in response to Message 1976774.  

I've further dug....
An' it ain't lookin' preddy :-(

Internal to BOINC, peak_flops is calculated from data drawn from either the card or the driver. It is actually calculated in more than one place (bad practice, but since the same set of instructions is used, of no significance).
It is mainly used in scheduling and work fetch (very much as one would expect), and also in the very first initial guess at how long a task will take.
A scaled version is transmitted to the science application as part of a whole load of data that heads that way. Now this is quite interesting, as the scaled version is cast into an integer whereas the calculated version is a double, but that shouldn't do what is being seen: the process is the same for nVidia and Intel GPUs, and any gloop in that casting would affect them in just the same manner.
BOINC outputs a scaled version to the output file when that is created.
I've not yet grabbed a copy of the stock MB/SoG or AP source code for ATI/AMD GPUs, so haven't had a prowl through there to see what it is used for.

I've already posted a possible work-around; can somebody try it and see what happens, please?
(You may have to restart BOINC, as it appears this is one of those values that is only read/calculated at start. Please post whatever happens; all information is of value at this stage.)

Interesting...
I will restart BOINC when I get home and see if I can download any AP GPU tasks. The problem is they are few and far between, so it may take a bit before I have something to report back on.
Seti@home classic: 1,456 results, 1.613 years CPU time
ID: 1976778
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1976779 - Posted: 24 Jan 2019, 17:50:24 UTC - in response to Message 1976778.  

I just got some on a host where I was planning to experiment with a modified client that can't process AP tasks. It was empty yesterday and now it has picked up five. I guess they will go out to someone else.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1976779
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22158
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1976781 - Posted: 24 Jan 2019, 18:12:59 UTC - in response to Message 1976777.  
Last modified: 24 Jan 2019, 19:02:56 UTC

I fell into that trap a few hours ago.
Memory size is not used in the calculation of peak_flops, but is used later to check that the GPU is actually capable of running the job in hand.
With the poor layout and commenting in this section it's an easy one to fall into :-(

What caught me was the "MEGA" multiplier. I eventually found it is a simple scaling factor to get from mega-units to units (MHz to Hz, Mbytes to bytes, etc.), so a bit of a red herring.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1976781
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22158
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1976784 - Posted: 24 Jan 2019, 18:16:18 UTC - in response to Message 1976778.  

Bill - MB running on your AMD/ATI GPU will do - APs are just too rare to work with reliably for this part of the investigation.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1976784
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1976785 - Posted: 24 Jan 2019, 18:19:50 UTC - in response to Message 1976774.  


I've already posted a possible work-around; can somebody try it and see what happens, please?
(You may have to restart BOINC, as it appears this is one of those values that is only read/calculated at start. Please post whatever happens; all information is of value at this stage.)


If you are talking about resetting the GPU flops number as described above, I did that. At the time I did it, I had no more tasks with 0 estimated times left to process. I did notice that about 2/3 of the GPU tasks I had when I posted this question had 0 estimated times. The other 1/3 had something that sounded reasonable (30+ minutes).

Would another work-around be to background my video driver?

Tom
A proud member of the OFA (Old Farts Association).
ID: 1976785
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22158
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1976790 - Posted: 24 Jan 2019, 19:01:06 UTC
Last modified: 24 Jan 2019, 19:07:04 UTC

Tom - thanks.
As a workaround, that suggests we are looking in the right area.

I doubt that doing anything like backgrounding your video driver will do anything useful :-(
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1976790
Bill
Volunteer tester
Joined: 30 Nov 05
Posts: 282
Credit: 6,916,194
RAC: 60
United States
Message 1976791 - Posted: 24 Jan 2019, 19:04:05 UTC - in response to Message 1976784.  

Bill - MB running on your AMD/ATI GPU will do - APs are just too rare to work with reliably for this part of the investigation.

Sorry for the noob question, but "MB" is just the stock seti@home v8 application, right? If that is the case, I have been running v8 GPU tasks with no problems from the start. I only have this problem with AP7 GPU tasks.
Seti@home classic: 1,456 results, 1.613 years CPU time
ID: 1976791
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22158
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1976792 - Posted: 24 Jan 2019, 19:08:56 UTC
Last modified: 24 Jan 2019, 19:09:29 UTC

Yes Bill - MB is the stock "v8" application.

My bad - I was thinking of Tom who has the same problem on MB....
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1976792
Bill
Volunteer tester
Joined: 30 Nov 05
Posts: 282
Credit: 6,916,194
RAC: 60
United States
Message 1976828 - Posted: 25 Jan 2019, 0:36:58 UTC - in response to Message 1976774.  

Internal to BOINC, peak_flops is calculated from data drawn from either the card or the driver. It is actually calculated in more than one place (bad practice, but since the same set of instructions is used, of no significance).
It is mainly used in scheduling and work fetch (very much as one would expect), and also in the very first initial guess at how long a task will take.
A scaled version is transmitted to the science application as part of a whole load of data that heads that way. Now this is quite interesting, as the scaled version is cast into an integer whereas the calculated version is a double, but that shouldn't do what is being seen: the process is the same for nVidia and Intel GPUs, and any gloop in that casting would affect them in just the same manner.
BOINC outputs a scaled version to the output file when that is created.
I've not yet grabbed a copy of the stock MB/SoG or AP source code for ATI/AMD GPUs, so haven't had a prowl through there to see what it is used for.

I've already posted a possible work-around; can somebody try it and see what happens, please?
(You may have to restart BOINC, as it appears this is one of those values that is only read/calculated at start. Please post whatever happens; all information is of value at this stage.)

My computer downloaded one AP7 GPU task this afternoon, and it still shows a completion time of 0:00. I restarted BOINC, and ran benchmarks again for giggles, and nothing changed (not that I expected it to).

I hope I'm not muddying the waters here, as it has been 15 years since I've done any real programming, but I can't get past the fact that peak_flops is sent over as a double precision value but is received as an integer. Shouldn't that either cause an error no matter what, or be converted the same way whether the GPU is AMD, nVidia or Intel? Is it possible that the data drawn from the card or driver is stored as a different data type, which causes it to act differently? I guess if it's a calculation it's irrelevant... I'm just trying to make sense of it.
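
On the double-to-integer question: in C++ a conversion from double to an integer type truncates toward zero and behaves the same way whatever GPU supplied the number, provided the value fits in the target type. The sketch below checks that for the two peak_flops values quoted earlier in this thread. It is only an illustration: the helper names are hypothetical, and the GFLOPS scaling and the 32-bit target type are assumptions, since the thread does not say what scaling or integer width BOINC uses.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// True if `value`, truncated toward zero, is representable as a 32-bit int.
// Converting an out-of-range double to an integer type is undefined
// behaviour in C++, so the range is checked before any cast.
bool fits_in_int32(double value) {
    return value > static_cast<double>(std::numeric_limits<std::int32_t>::min()) - 1.0
        && value < static_cast<double>(std::numeric_limits<std::int32_t>::max()) + 1.0;
}

// Truncating conversion, valid only when fits_in_int32(value) holds.
std::int32_t to_int32(double value) {
    assert(fits_in_int32(value));
    return static_cast<std::int32_t>(value);
}

// Assumed scaling, for illustration only: FLOPS to GFLOPS.
constexpr double to_gflops(double flops) {
    return flops / 1e9;
}
```

Under these assumptions even the inflated value fits once scaled (fits_in_int32(to_gflops(19956140.22e9)) is true), whereas the unscaled value does not, which would be consistent with the cast on its own not explaining the zero estimates.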
Seti@home classic: 1,456 results, 1.613 years CPU time
ID: 1976828
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1976842 - Posted: 25 Jan 2019, 1:52:14 UTC - in response to Message 1976828.  

You are not muddying the waters any more than they already are, Bill. The code is a definite patchwork by DA over many years. It is very fragmented and very hard to follow. Everyone who looks at the code says the same thing.

Have you tried reverting to a much older driver for the APU? I'm fairly sure you haven't, since you probably need the latest to cover both the APU and the Vega. So whatever change in the ATI/AMD driver API is fouling up the BOINC client code is probably going to be present. I think the solution has to come from new client code, and nothing is wrong with the drivers.

This really needs to be brought up as a bug for the BOINC developers.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1976842