Cuda 50 V8 Weirdness

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1781503 - Posted: 23 Apr 2016, 3:36:03 UTC - in response to Message 1781387.  

It's just that we still have that not-so-correct client to deal with, so it would then be better to have the finish file created later as a work-around.


Yep, works for me.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1781503
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1781505 - Posted: 23 Apr 2016, 3:41:12 UTC - in response to Message 1781358.  

@Jim, drop in test build Lunatics_x41zj_win32_cuda50.7z


Been running over a day now and it looks pretty nice! As you've probably noticed, no errors. I do still see the odd screen freeze, though, as I mentioned, the occurrence is reduced. Once you're good with the test as it stands now, I may up the priority in MBCuda.cfg to Normal; if past experience holds, this will probably eliminate the freezes.
Regards, Jim ...



Thanks Jim!
Yeah, I think we nailed it down to long-standing BOINC fussiness pretty well; thanks for the help! You can run as you prefer (whatever helps). If that system still gets the odd freeze even with Normal priority, and possibly with those pulsefind settings reduced if you like, then there could be some system or driver issues somewhere. Freeing a CPU core *might* help, especially under system RAM pressure, if only by thrashing memory a bit less. Does the system page to disk a lot?
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1781505
Profile Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1781525 - Posted: 23 Apr 2016, 4:59:54 UTC - in response to Message 1781502.  

Looks like I caught a "finish file present too long" error (task 4874874171) on my Win Vista machine this morning. It appears to have been dragged down by Windows Update while in its "Checking for Updates" phase, probably just about the time that phase completed (after running for about 2 hours).

EDIT: That was a CPU task, by the way, which ran for over 4 hours.


Interesting. Not my shutdown code involved there :-D. It's probably closer to my simplified laptop scenario, which was meant to illustrate the illusion of control by time constraints.

Yeah, that's my second one on a CPU task this month, but on different machines. Different BOINC versions, too, with today's happening on 7.2.33 while the one on April 4 was on 7.6.9. It appears the CPU app doesn't write anything to Stderr after the call to boinc_finish, but is still susceptible to outside interference, such as Windows Update hogging cycles.

By the way, I was looking at another Process Monitor log from last July's testing on a different machine and noticed that Search Indexing (SearchIndexer.exe and SearchProtocolHost.exe) was very active on the slot folders, making it another potential candidate for inducing delays. I'm not sure why I hadn't disabled it before then on that machine, but I'm pretty sure it is now!
ID: 1781525
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1781531 - Posted: 23 Apr 2016, 5:13:18 UTC - in response to Message 1781525.  

By the way, I was looking at another Process Monitor log from last July's testing on a different machine and noticed that Search Indexing (SearchIndexer.exe and SearchProtocolHost.exe) was very active on the slot folders, making it another potential candidate for inducing delays. I'm not sure why I hadn't disabled it before then on that machine, but I'm pretty sure it is now!



Very good to know, since to me it confirms the high-contention relationships.
I disable the search indexing service myself (since I usually know where to find stuff on my systems), but I know some of my friends like having that active.

I think with Windows telemetry and NSA backdoors thrown into the mix, we've got a reasonable indication that any weak logic will exhibit more failures. Probably just as well to find a way to take the lemons and make lemonade, lol.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1781531
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1781535 - Posted: 23 Apr 2016, 5:22:52 UTC - in response to Message 1781505.  
Last modified: 23 Apr 2016, 5:27:32 UTC

Thanks Jim!
Yeah, I think we nailed it down to long-standing BOINC fussiness pretty well; thanks for the help! You can run as you prefer (whatever helps). If that system still gets the odd freeze even with Normal priority, and possibly with those pulsefind settings reduced if you like, then there could be some system or driver issues somewhere. Freeing a CPU core *might* help, especially under system RAM pressure, if only by thrashing memory a bit less. Does the system page to disk a lot?

OK, back to Normal, and we'll see if that evens things out a bit. I've been keeping 1 of 4 cores free throughout all this, and I think I'll continue that way. As far as memory goes, here's a snapshot covering the time elapsed since I installed zj:

[memory usage screenshot]

I don't see any issues there; this looks pretty typical for that machine, and I don't see any reason for excessive page file activity. When I upgraded the CPU from a Core2Duo to a Core2Quad, I also bumped the RAM from 4 GB to 8 GB, as I wanted the extra headroom for this. In the event that I am thrashing the page file, it's on a 120 GB SSD, so that should be pretty fast as such things go.
Thanks again for all the help, and feel free to give a shout if I can help out by testing something.
Much appreciated, Jason.
Jim ...
ID: 1781535
Juha
Volunteer tester

Joined: 7 Mar 04
Posts: 388
Credit: 1,857,738
RAC: 0
Finland
Message 1781779 - Posted: 23 Apr 2016, 21:06:16 UTC - in response to Message 1781501.  

So what magic number do you put on how long a valid result must take to shut down?


Well, I've had some thoughts about that every now and then, though I haven't really worked it through. Instead of a fixed magic number, the client could check whether the app has been given a chance to exit: take a snapshot of the app's CPU time when the finish file is first seen and wait until the app's CPU time reaches snapshot+10 seconds.
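
For concreteness, a minimal sketch of that idea in C++ (hypothetical only: none of these names or functions exist in the real client):

// Hypothetical sketch, not actual BOINC client code: instead of a fixed
// wall-clock limit after the finish file appears, remember the app's CPU time
// at that moment and only flag an error once the app has actually received
// ~10 seconds of CPU time in which to exit.
struct FinishWatch {
    bool   finish_seen   = false;
    double cpu_at_finish = 0.0;   // app CPU time when the finish file was first seen
};

const double FINISH_CPU_GRACE = 10.0;   // seconds of CPU time to allow before giving up

// Called from a (hypothetical) periodic poll; returns true when the task
// should be treated as "finish file present too long".
bool finish_file_timed_out(FinishWatch& w, bool finish_file_exists, double app_cpu_time) {
    if (!finish_file_exists) return false;
    if (!w.finish_seen) {                    // first sighting: start the CPU-time clock
        w.finish_seen   = true;
        w.cpu_at_finish = app_cpu_time;
        return false;
    }
    // Give up only after the app has actually consumed the grace amount of CPU time
    // since the finish file appeared, so a starved app isn't penalised.
    return (app_cpu_time - w.cpu_at_finish) >= FINISH_CPU_GRACE;
}

The point being that the timeout only advances while the app is actually being scheduled, so a machine bogged down by Windows Update or indexing wouldn't trip it.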

I had one Rosetta CPU task trashed by the "finish file present too long" stuff. The CPU may have been busy with something else at the time, or the disk may have been busy, or the machine may have been swapping heavily. Or maybe even all three at the same time. A simple CPU time snapshot wouldn't catch a too-busy disk. To solve that it would need... umm, something.

At which point the word over-engineering comes to mind. There's already the maximum time check. On the one hand, it feels like a waste of resources to wait for the maximum time instead of waiting just ten seconds for the app to exit. On the other hand, just how many resources have already been wasted on "finish file present too long"?

So I don't know. Maybe getting rid of the check is the right way.


(It's as if I'm changing opinions every day...)
ID: 1781779
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1781906 - Posted: 24 Apr 2016, 2:55:55 UTC - in response to Message 1781779.  
Last modified: 24 Apr 2016, 3:01:37 UTC

At which point the word over-engineering comes to mind. There's already the maximum time check. On the one hand, it feels like a waste of resources to wait for the maximum time instead of waiting just ten seconds for the app to exit. On the other hand, just how many resources have already been wasted on "finish file present too long"?

So I don't know. Maybe getting rid of the check is the right way.


(It's as if I'm changing opinions every day...)


Continuing the musing: actually, too many or over-tight safeties that interfere with the normal operation of a mechanism are a sign of under-engineering :D. Probably it's as you said: checking the wrong things in the wrong context.

For magic hardwired numbers (if they really need to exist), which I'm guilty of using inappropriately myself, the answer is that they should be options/variables adjustable via the configuration file, with defaults chosen to cover the most common use cases, e.g. <slot_finish_timeout>300</slot_finish_timeout> if I want it.

If option creep then becomes a concern (there are lots of similar magic numbers in this code :)), the example of <slot_finish_timeout> could readily be generalised to apply to other situations in bulk, by calling it <file_operation_timeout> instead.
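
Purely as an illustration (hypothetical C++; neither the option name nor this parsing exists in the actual client), the idea is just a lookup with a sane default, so the magic number survives only as a fallback:

#include <map>
#include <string>

const int DEFAULT_SLOT_FINISH_TIMEOUT = 300;   // seconds; default chosen to cover most use cases

// config holds option name -> raw value, as parsed from the (hypothetical) XML config file.
int slot_finish_timeout(const std::map<std::string, std::string>& config) {
    auto it = config.find("slot_finish_timeout");
    if (it == config.end()) return DEFAULT_SLOT_FINISH_TIMEOUT;   // option not set
    try {
        int v = std::stoi(it->second);
        return v > 0 ? v : DEFAULT_SLOT_FINISH_TIMEOUT;           // ignore nonsense values
    } catch (...) {
        return DEFAULT_SLOT_FINISH_TIMEOUT;                       // unparsable value
    }
}

Generalising to <file_operation_timeout> would then just mean looking up a different key, or reusing the same helper for several file operations.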
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1781906
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1781988 - Posted: 24 Apr 2016, 9:27:12 UTC - in response to Message 1781779.  
Last modified: 24 Apr 2016, 9:30:57 UTC

So what magic number do you put on how long a valid result must take to shut down?


Well, I've had some thoughts about that every now and then, though I haven't really worked it through. Instead of a fixed magic number, the client could check whether the app has been given a chance to exit: take a snapshot of the app's CPU time when the finish file is first seen and wait until the app's CPU time reaches snapshot+10 seconds.

I had one Rosetta CPU task trashed by the "finish file present too long" stuff. The CPU may have been busy with something else at the time, or the disk may have been busy, or the machine may have been swapping heavily. Or maybe even all three at the same time. A simple CPU time snapshot wouldn't catch a too-busy disk. To solve that it would need... umm, something.

At which point the word over-engineering comes to mind. There's already the maximum time check. On the one hand, it feels like a waste of resources to wait for the maximum time instead of waiting just ten seconds for the app to exit. On the other hand, just how many resources have already been wasted on "finish file present too long"?

So I don't know. Maybe getting rid of the check is the right way.


Good points raised indeed... but IMHO in the wrong place. Unless someone (like Richard) deliberately puts a lot of effort into bringing these ideas to the BOINC devs, they will be "lost in the noise" here. It seems the BOINC devs don't read these boards at all. The place with better (only better) chances of being heard is the BOINC dev/project mailing lists.

More on topic: I would agree with the definitions of what it's all about, and with prioritising preservation of the science result over a "programmatically correct exit". It seems a change on BOINC's side of the code is required, though.

Regarding the additional exit timer: what about restarting from the last checkpoint on a BOINC-perceived failure, instead of raising a computation error? That would attempt to save most of the time spent on computation, while still not waiting too long to realise there are difficulties on exit (as would happen if we waited out the full time reserved for task processing).

And waiting for CPU-time progress instead of elapsed time can have some negative effects with a GPU app. Different runtimes react differently to failure (such as a driver restart). For OpenCL apps I have seen at least 3 reactions (a rough sketch of handling all three follows below):
1) The API call returns a failure code - the nicest one, because it allows the app to know about the issue in some way.
2) The API call doesn't return, with 100% core consumption - here a CPU time check can see progress.
3) The API call doesn't return, with zero CPU consumption - here the app will never make progress in CPU time until the process is terminated externally.
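
As a rough, hypothetical C++ sketch (not BOINC or SETI@home code) of what handling all three reactions implies: a CPU-time grace period covers case 2, but case 3 still needs a wall-clock cap, because the stuck call burns no CPU at all:

#include <chrono>

struct ExitWatchdog {
    double cpu_at_finish = 0.0;                    // app CPU time when the finish file appeared
    std::chrono::steady_clock::time_point wall_at_finish = std::chrono::steady_clock::now();
    double cpu_grace = 10.0;                       // seconds of CPU time to allow (covers case 2)
    double wall_cap  = 300.0;                      // hard wall-clock cap in seconds (covers case 3)
};

// Returns true once the client should stop waiting for the app to exit.
bool should_give_up(const ExitWatchdog& w, double app_cpu_now) {
    using namespace std::chrono;
    double wall_elapsed = duration<double>(steady_clock::now() - w.wall_at_finish).count();
    bool cpu_exhausted  = (app_cpu_now - w.cpu_at_finish) >= w.cpu_grace;  // app ran but never exited
    bool wall_exhausted = wall_elapsed >= w.wall_cap;                      // app stuck without using CPU
    return cpu_exhausted || wall_exhausted;
}

Whether "giving up" then means declaring a computation error or restarting from the last checkpoint is exactly the policy question raised above.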
ID: 1781988
Profile Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1788605 - Posted: 19 May 2016, 2:45:59 UTC

Since this seems to be the most recent thread to contain a significant discussion of "finish file present too long", I'll add my latest one to keep the topic alive, although I realize there's probably little (if any) new info here. This is Task 4933949457, which happens to be a CPU task. Three of my boxes shut down automatically just before noon on weekdays (to avoid peak electricity rates) and this task apparently wrote the finish file about 2 seconds before BOINC and the app got the shutdown order (after nearly 10 hours of processing). When the machine came back up 6 hours later, the task restarted at 100%, "finished" again, and then got saddled with the error message by BOINC.

The last part of the Stderr shows:

Best spike: peak=25.48348, time=33.55, d_freq=1420242980.35, chirp=-22.546, fft_len=128k
Best autocorr: peak=17.51294, time=33.55, delay=0.32082, d_freq=1420243409.99, chirp=-21.775, fft_len=128k
Best gaussian: peak=3.908507, mean=0.5687883, ChiSq=1.354028, time=86.4, d_freq=1420247658.34,
score=-1.190842, null_hyp=2.15327, chirp=10.974, fft_len=16k
Best pulse: peak=9.018718, time=82.5, period=3.487, d_freq=1420246111.86, score=0.9831, chirp=19.502, fft_len=512
Best triplet: peak=9.375947, time=89.27, period=3.139, d_freq=1420245324.24, chirp=97.01, fft_len=128


Flopcounter: 38277932676800.242000

Spike count: 4
Autocorr count: 0
Pulse count: 0
Triplet count: 1
Gaussian count: 0
Wallclock time elapsed since last restart: 35591.6 seconds

11:59:29 (3536): called boinc_finish(0)

Build features: SETI8 Non-graphics FFTW USE_SSE3 x86
CPUID: Intel(R) Pentium(R) 4 CPU 3.00GHz

Cache: L1=64K L2=2048K

CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3
ar=0.422608 NumCfft=197029 NumGauss=1118263970 NumPulse=226290096308 NumTriplet=452688784082
In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768
Restarted at 100.00 percent.

Best spike: peak=25.48348, time=33.56, d_freq=1420242980.35, chirp=-22.546, fft_len=128k
Best autocorr: peak=17.51294, time=33.56, delay=0.32082, d_freq=1420243409.99, chirp=-21.775, fft_len=128k
Best gaussian: peak=3.908507, mean=0.5687883, ChiSq=1.354028, time=86.4, d_freq=1420247658.34,
score=-1.190842, null_hyp=2.15327, chirp=10.974, fft_len=16k
Best pulse: peak=9.018718, time=82.49, period=3.487, d_freq=1420246111.86, score=0.9831, chirp=19.502, fft_len=512
Best triplet: peak=9.375947, time=89.28, period=3.139, d_freq=1420245324.24, chirp=97.01, fft_len=128


Flopcounter: 38278103594688.242000

Spike count: 4
Autocorr count: 0
Pulse count: 0
Triplet count: 1
Gaussian count: 0
Wallclock time elapsed since last restart: 3.9 seconds

18:02:18 (2168): called boinc_finish(0)

</stderr_txt>
<message>
finish file present too long
</message>
]]>

One thing I found rather curious is that the event log doesn't show the original "Computation .... finished" for the task before the shutdown. It appears that while the finish file got written, the app didn't manage to notify BOINC that it actually was finished. The log shows:
18-May-2016 11:57:41 [SETI@home] Computation for task 05my10aa.26128.18885.6.33.190_1 finished
18-May-2016 11:57:41 [SETI@home] Starting task 06se10ab.8669.7020.6.33.174_0 using setiathome_v8 version 800 (cuda50) in slot 2
18-May-2016 11:57:43 [SETI@home] Started upload of 05my10aa.26128.18885.6.33.190_1_0
18-May-2016 11:57:46 [SETI@home] Finished upload of 05my10aa.26128.18885.6.33.190_1_0
18-May-2016 11:59:31 [---] Exit requested by user
18-May-2016 18:01:51 [---] Starting BOINC client version 7.2.33 for windows_intelx86

18-May-2016 18:01:51 [---] log flags: file_xfer, sched_ops, task
18-May-2016 18:01:51 [---] Libraries: libcurl/7.25.0 OpenSSL/1.0.1 zlib/1.2.6
18-May-2016 18:01:51 [---] Data directory: C:\Documents and Settings\All Users\Application Data\BOINC
...[blah]
...[blah]
...[blah]
18-May-2016 18:02:01 Initialization completed
18-May-2016 18:02:01 [SETI@home] Restarting task 04ap10ac.26001.18885.9.36.25_1 using setiathome_v8 version 800 in slot 0
18-May-2016 18:02:01 [SETI@home] Restarting task 04ap10ac.32106.4979.12.39.190_1 using setiathome_v8 version 800 in slot 1
18-May-2016 18:02:01 [SETI@home] Restarting task 06se10ab.8669.7020.6.33.174_0 using setiathome_v8 version 800 (cuda50) in slot 2
18-May-2016 18:02:22 [SETI@home] Computation for task 04ap10ac.26001.18885.9.36.25_1 finished
18-May-2016 18:02:22 [SETI@home] Starting task 06se10ab.12164.67.11.38.200_1 using setiathome_v8 version 800 in slot 0
18-May-2016 18:02:26 [SETI@home] Started upload of 04ap10ac.26001.18885.9.36.25_1_0
18-May-2016 18:02:29 [SETI@home] Finished upload of 04ap10ac.26001.18885.9.36.25_1_0

Notice that the times for the second "called boinc_finish(0)" in the Stderr and the "Computation ..... finished" message in the event log are 4 seconds apart, whereas the first "called boinc_finish(0)" was only 2 seconds before the log shows the exit request.

So, nearly 10 hours of processing gone to waste and a task that will now have to be resent to another host whose time could be better spent on a fresh, unprocessed task. I swear, there's just got to be some better way for an app to communicate to BOINC when it's all done than the use of this "finish file" kluge. Sigh.....
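
For what it's worth, one commonly suggested alternative (a hypothetical sketch only, using standard Win32 calls; not taken from the BOINC sources) is for the client to wait on the app's process handle and read its exit code directly, instead of polling for a file:

#include <windows.h>

// hProcess: handle of the science app, as obtained from CreateProcess() when it was launched.
// Returns true if the app exited cleanly (exit code 0) within timeout_ms milliseconds.
bool wait_for_clean_exit(HANDLE hProcess, DWORD timeout_ms) {
    DWORD wait = WaitForSingleObject(hProcess, timeout_ms);
    if (wait != WAIT_OBJECT_0) return false;            // still running, or the wait failed
    DWORD exit_code = 1;
    if (!GetExitCodeProcess(hProcess, &exit_code)) return false;
    return exit_code == 0;                              // 0 is what boinc_finish(0) reports
}

That would remove the race between writing the finish file and receiving a shutdown order, at the cost of some cross-platform plumbing.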
ID: 1788605
To Infinity And Beyond - did I turn the lights off?

Joined: 10 Aug 07
Posts: 3
Credit: 10,733
RAC: 0
United Kingdom
Message 1788665 - Posted: 19 May 2016, 10:31:58 UTC - in response to Message 1780345.  

Heya Jim,
As a baseline sanity check, so we can eliminate system driver quality as an issue, could you post a screencap of DPC Latency Checker while crunching? (http://www.thesycon.de/eng/latency_check.shtml)

Additionally, any chance you have some Windows Update-associated tasks stuck running in Task Manager? Cheers!


I came across this thread while trying to get to the bottom of bigger problems on one of my machines. I tried running the DPC Latency Checker and all was OK - everything in the green, ~100 µs.

I had left DPC running in the background and went back to MSI Afterburner to see what the GPU was doing. I then detached the graph section to get a better view.

Suddenly it showed off-the-scale spikes of red, i.e. ~25,000 µs, with a few green traces (the max was 26,063 µs).
This did not go away when the graph window was closed.

I then started trying various things. I had been having problems with occasional thermal throttling, because of a combination of local temperatures and probably needing to replace my CPU fan's thermal paste, and had set BOINC to "use at most 70% of CPU time" to avoid this.

When I reverted this to 100%, the DPC Latency Checker went back to all green, without my closing the graph window.

I then tried various "use at most X% of CPU time" values and found that 70% just happens to be the sweet spot for the problem.

100 - 80: OK
70 and below: pretty constant red.

Some of the lower values take a few seconds before the spikes appear.

You seem to have to revert to 100% before changing to another value to clear the problem.

I do not yet know if this is linked to the problems I am having with display lags etc. (which go beyond GUPPI VLARs), but if anyone else has MSI Afterburner installed it would be good to know whether this is repeatable beyond my machine.

I haven't yet tried to work out which element is actually at the bottom of the spiking.

The machine is Windows 7 64-bit, with a GTX 570 on Nvidia driver 353.62, MSI Afterburner v4.2.0 and BOINC 7.6.22.

Extras:
All at 70% while running a CUDA42 task.

Suspending the GPU task does not clear the problem.
Detaching the graph while the GPU task is suspended infrequently causes spikes.
Resuming the GPU task with the graph detached does not cause spikes.

Hmmmm... does it depend on what the GPU task was doing at the time? I had difficulty repeating this as the task reached the end.
New task - GUPPI VLAR also cuda42 - same effects.

Totally suspending BOINC (but leaving it in memory) - no red spikes.

Resuming BOINC with graph detached does not cause spikes.

Closing MSI Afterburner does not stop the spikes until BOINC is returned to 100%.

I had begun to like MSI Afterburner, but I will try uninstalling it and testing with EVGA Precision X.
ID: 1788665
To Infinity And Beyond - did I turn the lights off?

Joined: 10 Aug 07
Posts: 3
Credit: 10,733
RAC: 0
United Kingdom
Message 1788666 - Posted: 19 May 2016, 10:36:17 UTC - in response to Message 1788665.  
Last modified: 19 May 2016, 10:43:24 UTC

Red herring alert: I have closed down MSI Afterburner and I'm still getting red spikes in the DPC Latency Checker when BOINC is running at 70% - weird. So MSI Afterburner just helped to trigger the effect.

P.S. This is a test account, hence the low credit total. I've suspended my main installation, which uses Lunatics, and reverted to stock to run these tests.
ID: 1788666
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1788671 - Posted: 19 May 2016, 11:39:37 UTC - in response to Message 1788666.  

Under 'ideal' conditions you shouldn't get red spikes at all, even under load. Seeing as the machine is an i7-920 on an Intel chipset, the first thing I would check is the drivers for the system chipset devices, such as the PCI Express controller and the others marked Intel(R), for reasonably recent versions and dates.

Beyond that, since there are many possible sources for the high DPC latencies, possibly not directly related to crunching (though certainly to load), it can be a challenge to track down what's going on. In my case, years ago, there were first chipset driver issues like those described, then a fairly lousy wifi adaptor and driver, and a lingering Intel RAID driver that was improved later.

LatencyMon can reveal a little more detail about which drivers or hardware specifically are generating the high DPCs, though it takes a little figuring out.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1788671
To Infinity And Beyond - did I turn the lights off?

Joined: 10 Aug 07
Posts: 3
Credit: 10,733
RAC: 0
United Kingdom
Message 1788724 - Posted: 19 May 2016, 16:01:38 UTC
Last modified: 19 May 2016, 16:06:09 UTC

I've since tried LatencyMon and it complains about DirectX (dxgkrnl.sys), NDIS.SYS, the Nvidia driver (nvlddmkm.sys), usbport.sys, etc.

I have upgraded the Nvidia driver to the latest 365.19 and am unable to replicate the red spikes at the moment - however, I am now out of VLARs.

I haven't checked the Intel chipset drivers on this machine lately - I'll give that a go.
ID: 1788724