NVidia 436.xx and later drivers can cause very long compute times especially on Arecibo VHAR work units

robertmiles
Volunteer tester

Joined: 16 Jan 12
Posts: 213
Credit: 4,117,756
RAC: 6
United States
Message 2040466 - Posted: 26 Mar 2020, 4:04:02 UTC - in response to Message 2040462.  

Such pessimism and defeatism!

If there's a bug in these R445 drivers, which it seems likely to be, then ...
... I for one will push for them to fix the bug.

I have not yet fully characterized the behavior that I'm seeing, and may not have time until the weekend.

I've asked Einstein to check whether they had such an error for Nvidia cards or not. No answer yet.
ID: 2040466
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2040850 - Posted: 27 Mar 2020, 14:51:23 UTC - in response to Message 2040462.  

Such pessimism and defeatism!


or realism. why waste the time and energy fixing something that will no longer be applicable by the time the fix can be implemented? you remember how long it took to get fixed last time? SETI is over in 4 days, why bother?

maybe they had a breakdown of communication, but I would imagine the same team that made the updates last time is still making the updates now.

or maybe the changes they made were not in line with other parts of their driver, and caused conflicts with other goals they had, so learning that the fix would no longer be needed, allows them to remove the conflicting code?

we're not likely to get clear answers, so we can only speculate.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2040850
Jacob Klein
Volunteer tester

Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 2040899 - Posted: 27 Mar 2020, 18:45:40 UTC - in response to Message 2040850.  

why waste the time and energy fixing

Why? Because if it's broken for calls made by one science app, it can be broken for calls made by other science apps.

we're not likely to get clear answers, so we can only speculate

I intend to not speculate, as you have done. I intend to get answers, and get it fixed.
ID: 2040899
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2040922 - Posted: 27 Mar 2020, 19:39:12 UTC - in response to Message 2040899.  
Last modified: 27 Mar 2020, 19:41:00 UTC

why waste the time and energy fixing

Why? Because if it's broken for calls made by one science app, it can be broken for calls made by other science apps.

we're not likely to get clear answers, so we can only speculate

I intend to not speculate, as you have done. I intend to get answers, and get it fixed.


you have no power to get it fixed. and Nvidia isn't likely to give you any answers on why it broke in the first place. they didn't tell you last time.

why would they fix something that will only affect and be verifiable using a benchmark tool created to run workunits for an EOL project? at some point the juice isn't worth the squeeze. and unless someone can prove that it's impacting another project, it's not.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2040922
robertmiles
Volunteer tester

Joined: 16 Jan 12
Posts: 213
Credit: 4,117,756
RAC: 6
United States
Message 2040926 - Posted: 27 Mar 2020, 19:51:54 UTC - in response to Message 2040922.  

why waste the time and energy fixing

Why? Because if it's broken for calls made by one science app, it can be broken for calls made by other science apps.

we're not likely to get clear answers, so we can only speculate

I intend to not speculate, as you have done. I intend to get answers, and get it fixed.


you have no power to get it fixed. and Nvidia isn't likely to give you any answers on why it broke in the first place. they didn't tell you last time.

why would they fix something that will only affect and be verifiable using a benchmark tool created to run workunits for an EOL project? at some point the juice isn't worth the squeeze. and unless someone can prove that it's impacting another project, it's not.

I've seen a message over on Einstein@Home saying that they were probably also affected. They aren't shutting down.
ID: 2040926
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2040928 - Posted: 27 Mar 2020, 20:02:23 UTC - in response to Message 2040926.  

Please link the message.

The only message I see is the same thing that Keith posted here. I searched through the messages at Einstein and could not find a verifiable post of someone having this problem at Einstein. If it had been a real problem before, you would see lots of posts about it.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2040928
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2040935 - Posted: 27 Mar 2020, 20:35:07 UTC

There was a discussion in the Problems and Bug Reports forum about progress stalling out like it does on SoG tasks in Win10. It only affected Win10; Win7 was fine. So it is likely the same issue.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2040935
Jacob Klein
Volunteer tester

Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 2040977 - Posted: 28 Mar 2020, 0:34:22 UTC

I hope you won't impede my attempts to retain optimism and work towards a fix.
Thank you.
ID: 2040977
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2040978 - Posted: 28 Mar 2020, 0:41:34 UTC - in response to Message 2040935.  
Last modified: 28 Mar 2020, 0:58:37 UTC

Can you link directly to what you're referencing? The only thing I can find is the AMD issue on RX5700 cards (like this one); there is nothing about Nvidia driver problems. These two issues were happening at about the same time, and were fixed at about the same time too. I looked through 4 pages of threads on the Problems and Bug reporting board, which reaches back to early September, before it was reported here. I really think you're confusing the Nvidia issue and the AMD issue.

Prime example right here: https://einsteinathome.org/host/12803450

This computer is using Win10 with 445 drivers with no problem. Einstein doesn't have this issue. It seems to be SETI only, and its days are numbered.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2040978
Jacob Klein
Volunteer tester

Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 2041302 - Posted: 29 Mar 2020, 12:49:45 UTC

Woohoo, today is the day I can test some drivers for this issue! :)

Here are my targets:
442.19
442.37
442.50
442.59
442.74
445.75
445.78

Be back in a few hours! ;)
ID: 2041302
Jacob Klein
Volunteer tester

Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 2041377 - Posted: 29 Mar 2020, 19:18:37 UTC

Nearing the end of my test run. So far, the behavior of my results looks like the solution was not included in R445. Will report full result set later today.
ID: 2041377
Jacob Klein
Volunteer tester

Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 2041450 - Posted: 29 Mar 2020, 23:18:27 UTC
Last modified: 29 Mar 2020, 23:52:50 UTC

My results are below.
The behavior looks like the fix did not get included in R445 drivers.
I sent a report to NVIDIA, to their Driver Feedback form here: https://forms.gle/kJ9Bqcaicvjb82SdA
If you have this problem too, please also send a report to NVIDIA using that form.

My test and result files are all located here: https://1drv.ms/f/s!AgP0NBEuAPQRp6Fr322LD1BXy6rdAg

Thank you.

442.19 - 442.74
All good.

445.75
2080:
No GPU Usage, Ran forever with no progress
1050 Ti:
No GPU Usage, Ran forever with no progress
980 Ti:
ERROR: OpenCL kernel/call 'clEnqueueMapBuffer(gpu_GPUState)' call failed (-36) in file ..\analyzeFuncs.cpp near line 1995.
980:
ERROR: OpenCL kernel/call 'clEnqueueMapBuffer(gpu_GPUState)' call failed (-36) in file ..\analyzeFuncs.cpp near line 1995.
970:
ERROR: OpenCL kernel/call 'clEnqueueMapBuffer(gpu_GPUState)' call failed (-36) in file ..\analyzeFuncs.cpp near line 1995.

445.78
2080:
No GPU Usage, Ran forever with no progress
1050 Ti:
No GPU Usage, Ran forever with no progress
980 Ti:
ERROR: OpenCL kernel/call 'clEnqueueMapBuffer(gpu_GPUState)' call failed (-36) in file ..\analyzeFuncs.cpp near line 1995.
980:
ERROR: OpenCL kernel/call 'clEnqueueMapBuffer(gpu_GPUState)' call failed (-36) in file ..\analyzeFuncs.cpp near line 1995.
970:
ERROR: OpenCL kernel/call 'clEnqueueMapBuffer(gpu_GPUState)' call failed (-36) in file ..\analyzeFuncs.cpp near line 1995.
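
For anyone reading the results above: the (-36) in those error lines is an OpenCL status code. In the Khronos cl.h header, -36 is CL_INVALID_COMMAND_QUEUE. Here is a minimal sketch of a lookup, with only a small subset of the codes; the helper name opencl_error_name is made up for illustration and is not part of the SETI app or any OpenCL binding.

```python
# Subset of OpenCL status codes, as defined in the Khronos cl.h header.
# This table is deliberately incomplete; extend it as needed.
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -12: "CL_MAP_FAILURE",
    -30: "CL_INVALID_VALUE",
    -34: "CL_INVALID_CONTEXT",
    -36: "CL_INVALID_COMMAND_QUEUE",
    -38: "CL_INVALID_MEM_OBJECT",
}


def opencl_error_name(code: int) -> str:
    """Return the symbolic name for an OpenCL status code (hypothetical helper)."""
    return OPENCL_ERRORS.get(code, f"unknown OpenCL error ({code})")


print(opencl_error_name(-36))  # CL_INVALID_COMMAND_QUEUE
```

The name only says that the command queue passed to clEnqueueMapBuffer was not considered valid; it does not by itself say why the driver reached that state.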
ID: 2041450
robertmiles
Volunteer tester

Joined: 16 Jan 12
Posts: 213
Credit: 4,117,756
RAC: 6
United States
Message 2041488 - Posted: 30 Mar 2020, 1:18:46 UTC - in response to Message 2041450.  
Last modified: 30 Mar 2020, 1:22:15 UTC

I'm trying to run similar tests with 445.

How can I identify an Arecibo VHAR workunit if the only workunits I can use for the tests are the ones I've downloaded but not finished yet?

I previously reported the problem, but without running suitable tests. Nvidia replied that they would push it up to level 2, but had not seen relevant problem reports for 445 yet.
ID: 2041488
Jacob Klein
Volunteer tester

Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 2041498 - Posted: 30 Mar 2020, 1:51:06 UTC - in response to Message 2041488.  
Last modified: 30 Mar 2020, 1:53:29 UTC

robertmiles,

If you go to my OneDrive link, you'll find some folders for "CUDA Testing" and "OpenCL Testing".
- If you download those, you can use them to do some tests.
- You may only need to use the "Ex 1" (Example 1) folders, to get the results that you want to check.
- It is set up to be able to run tests on up to 3 GPUs in the system (dev0, dev1, dev2).

Perhaps you could:
- Download and extract those folders
- Run the .cmd files for your GPU (correct dev folder - dev0 for 1 GPU)
- Look for GPU Usage
- When it is done, inspect the .txt file in the Testdatas for the result.
- Report the results here, and
- Report the results to the NVIDIA Driver Feedback link.

Regards,
Jacob
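
If inspecting the result .txt files by hand gets tedious, a small script can pull out the OpenCL error lines. This is only a sketch: it assumes the failures appear in the same format as the ERROR lines quoted earlier in this thread, and the function name find_opencl_errors is hypothetical. How you load the file is up to you; the example just parses a sample string.

```python
import re

# Pattern modeled on the error lines quoted in this thread (an assumption:
# other failures may be logged in a different format).
ERROR_RE = re.compile(
    r"ERROR: OpenCL kernel/call '(?P<call>[^']+)' call failed "
    r"\((?P<code>-?\d+)\) in file (?P<file>\S+) near line (?P<line>\d+)"
)


def find_opencl_errors(text):
    """Return (call, code, source file, line) for each matching error line."""
    return [
        (m["call"], int(m["code"]), m["file"], int(m["line"]))
        for m in ERROR_RE.finditer(text)
    ]


sample = (
    r"ERROR: OpenCL kernel/call 'clEnqueueMapBuffer(gpu_GPUState)' "
    r"call failed (-36) in file ..\analyzeFuncs.cpp near line 1995."
)
for call, code, src, line in find_opencl_errors(sample):
    print(call, code, src, line)
```

An empty list does not prove the run was good; a hung run with no GPU usage may never write an error line at all.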
ID: 2041498
robertmiles
Volunteer tester

Joined: 16 Jan 12
Posts: 213
Credit: 4,117,756
RAC: 6
United States
Message 2041506 - Posted: 30 Mar 2020, 3:15:29 UTC - in response to Message 2041498.  
Last modified: 30 Mar 2020, 3:18:02 UTC

I tried what you suggested.

The command file started, showed several lines, then appeared to freeze.

As best I can tell, the seti*.exe program is still running, but using so little CPU time that it rounds off to zero.

How do I check how much the GPU is being used?

How long do I wait before deciding that the command file will never finish?

Using a GTX 1080.
ID: 2041506
Jacob Klein
Volunteer tester

Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 2041509 - Posted: 30 Mar 2020, 3:37:15 UTC - in response to Message 2041506.  

You can use GPU-Z to monitor GPU information including GPU Usage.
The OpenCL Test should start using GPU after a couple seconds. If it hasn't started within 30 seconds, it won't start, and you'll need to close the window and kill the seti executable in Task Manager.

Hopefully you can:
- Test 442.74 (the last R440 driver), to verify it works correctly.
- Test 445.75 (the first R445 driver), to verify the problem has been reintroduced.
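
As an alternative to watching GPU-Z by eye, the 30-second check can be scripted around nvidia-smi, which ships with the NVIDIA driver. The following is a rough sketch, not a tested tool: the 10% threshold and the function names are assumptions chosen for illustration, and it presumes nvidia-smi is on the PATH.

```python
import subprocess
import time

# nvidia-smi query that prints one utilization percentage per GPU, one per line.
QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"]


def parse_utilization(output):
    """Return one integer percentage per GPU; skip non-numeric values such as [N/A]."""
    values = (line.strip() for line in output.splitlines())
    return [int(v) for v in values if v.isdigit()]


def gpu_started(samples, threshold=10):
    """True if any sample shows a GPU above the (assumed) threshold percentage."""
    return any(max(sample, default=0) > threshold for sample in samples)


def watch(seconds=30, interval=1.0):
    """Poll nvidia-smi for up to `seconds`; return True as soon as GPU load is seen."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        out = subprocess.run(QUERY, capture_output=True, text=True,
                             check=True).stdout
        if gpu_started([parse_utilization(out)]):
            return True
        time.sleep(interval)
    return False
```

Start the test .cmd file first, then call watch() from another window. If it returns False, the test has probably hung. Suspend any other BOINC project that uses the GPU beforehand, otherwise its load will be mistaken for the test's.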
ID: 2041509
robertmiles
Volunteer tester

Joined: 16 Jan 12
Posts: 213
Credit: 4,117,756
RAC: 6
United States
Message 2041517 - Posted: 30 Mar 2020, 4:35:30 UTC - in response to Message 2041509.  

How do you use GPU-Z to monitor GPU use? It showed me a lot of information about the GPU, not including whether it was being used.

GTX 1080 445.75 hangs.
GTX 1080 442.19 finishes in a few minutes, but the *-benchMB.txt has a lot of messages about files not found.

It's too late here to download another 442 version of the driver - I'll try tomorrow.
ID: 2041517
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 2041519 - Posted: 30 Mar 2020, 4:37:15 UTC - in response to Message 2041517.  

How do you use GPU-Z to monitor GPU use?
Click on the Sensors tab.
Grant
Darwin NT
ID: 2041519
robertmiles
Volunteer tester

Joined: 16 Jan 12
Posts: 213
Credit: 4,117,756
RAC: 6
United States
Message 2041607 - Posted: 30 Mar 2020, 13:56:23 UTC

GTX 1080 442.74 finishes in about 3 minutes, but the *-benchMB.txt has a lot of messages about files not found.
The GPU use was about 97%.

The sensors tab of GPU-Z made GPU use obvious AFTER I had observed it both with and without another BOINC project using the GPU.
ID: 2041607
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2041611 - Posted: 30 Mar 2020, 14:20:38 UTC - in response to Message 2041607.  

GTX 1080 442.74 finishes in about 3 minutes, but the *-benchMB.txt has a lot of messages about files not found.
What file is missing? state.sah would be worrying, result.sah would be catastrophic.
ID: 2041611


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.