Message boards : Number crunching : Cuda 50 V8 Weirdness
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
It's just that we still have that not-so-correct client we need to deal with, and it would then be better to have the finish file created later as a work-around.

Yep, works for me.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
@Jim, drop in test build Lunatics_x41zj_win32_cuda50.7z

Thanks Jim! Yeah, I think we nailed it down to long-standing BOINC fussiness pretty well, thanks for the help! You can run as preferred (whatever helps). If that system still gets the odd freeze even on Normal, possibly reduce those pulsefind settings; there could be some system or driver issues somewhere. Freeing a CPU core *might* help, especially if under system RAM pressure, if only by thrashing memory a bit less. Does the system page to disk a lot?
Jeff Buck · Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0
Looks like I caught a "finish file present too long" (task 4874874171) on my Win Vista machine this morning. It appears to have been dragged down by Windows Update while in its "Checking for Updates" phase, probably just about the time that phase completed (after running for about 2 hours).

Yeah, that's my second one on a CPU task this month, but on different machines. Different BOINC versions, too, with today's happening on 7.2.33 while the one on April 4 was on 7.6.9. It appears the CPU app doesn't write anything to Stderr after the call to boinc_finish, but is still susceptible to outside interference, such as Windows Update hogging cycles.

By the way, I was looking at another Process Monitor log from last July's testing on a different machine and noticed that Search Indexing (SearchIndexer.exe and SearchProtocolHost.exe) was very active on the slot folders, potentially being another candidate for inducing delays. I'm not sure why I hadn't disabled it before then for that machine, but I'm pretty sure it is now!
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
By the way, I was looking at another Process Monitor log from last July's testing on a different machine and noticed that Search Indexing (SearchIndexer.exe and SearchProtocolHost.exe) was very active on the slot folders, potentially being another candidate for inducing delays. I'm not sure why I hadn't disabled it before then for that machine, but I'm pretty sure it is now!

Very good to know, since to me it confirms the high-contention relationships. I disable the search-indexing service myself (since I usually know where to find stuff on my systems), but I know some of my friends like having it active. I think with Windows telemetry and NSA backdoors thrown into the mix, we've got a reasonable indication that any weak logic will exhibit more failures. Probably just as well to find a way to take the lemons and make lemonade, lol.
Jimbocous · Joined: 1 Apr 13 · Posts: 1853 · Credit: 268,616,081 · RAC: 1,349
Thanks Jim!

OK, back to Normal, and we'll see if that evens things out a bit. I've been keeping 1 of 4 cores free throughout all this, and I think I'll continue that way. As far as memory, here's a snapshot covering the time elapsed since I installed zj: [screenshot not preserved]

Don't see any issues there; this looks pretty typical for that machine, and I don't see any reason for excessive page file activity. When I upgraded the CPU from a Core2Duo to a Core2Quad, I also bumped the RAM from 4 GB to 8 GB, as I wanted the extra headroom for this. In the event that I am thrashing the page files, they're on a 120 GB SSD, so that should be pretty fast as such things go.

Thanks again for all the help, and feel free to give a shout if I can help out by testing something. Much appreciated, Jason.

Jim ...
Juha · Joined: 7 Mar 04 · Posts: 388 · Credit: 1,857,738 · RAC: 0
So what magic number do you put on how long the valid result must take to shut down?

Well, I've had some thoughts about that every now and then, though I haven't really worked it through. Instead of a fixed magic number, the client could check whether the app has been given a chance to exit: take a snapshot of the app's CPU time when the finish file is first seen, and wait until the app's CPU time reaches snapshot+10.

I had one Rosetta CPU task trashed by the finish-file-present-too-long stuff. The CPU may have been busy with something else at the time, or the disk may have been busy, or the machine may have been swapping heavily. Or maybe even all three at the same time. A simple CPU time snapshot wouldn't catch a too-busy disk. To solve that it would need... umm, something. At which point the word over-engineering comes to mind.

There's already the maximum time check. On one hand it feels like a waste of resources to wait for the maximum time instead of waiting just ten seconds for the app to exit. On the other hand, just how much in resources has already been wasted on "finish file present too long"? So I don't know. Maybe getting rid of the check is the right way. (It's as if I'm changing opinions every day...)
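Juha's CPU-time idea can be sketched roughly as follows. This is a hypothetical client-side loop, not real BOINC code; the `app` handle with its `cpu_time()` and `is_running()` methods is an assumption for illustration. The point is that the grace period is measured in the app's own CPU time, so a host busy with something else (like Windows Update) automatically stretches the wall-clock deadline.

```python
import time

# Hypothetical grace budget: how much of *its own* CPU time the app may
# consume after the finish file appears before we give up on it.
FINISH_GRACE_CPU_SECONDS = 10

def wait_for_exit(app, poll=1.0):
    """Wait for the app to exit after its finish file is first seen.

    Snapshot the app's CPU time, then allow it FINISH_GRACE_CPU_SECONDS
    more of CPU time to exit cleanly. If the host is starved, the app
    accrues little CPU time and the deadline stretches accordingly.
    Returns True on clean exit, False if the budget is exhausted.
    """
    snapshot = app.cpu_time()
    while app.is_running():
        if app.cpu_time() >= snapshot + FINISH_GRACE_CPU_SECONDS:
            return False  # the app had its chance on the CPU and is stuck
        time.sleep(poll)
    return True
```

As Juha notes, this still wouldn't catch a too-busy disk; the app could be blocked on I/O while accruing no CPU time at all, which would make this loop wait indefinitely rather than time out.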
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
At which point the word over-engineering comes to mind. There's already the maximum time check. On one hand it makes me feel like it would be a waste of resources to wait for maximum time instead of waiting just ten seconds for the app to exit. On the other hand, just how much resources have been wasted for the "finish file present too long"?

Continuing the musing: actually, too many or over-tight safeties that interfere with the normal operation of a mechanism are a sign of under-engineering :D Probably it's as you said, checking the wrong things in the wrong context.

As for magic hardwired numbers (if they really need to exist), which I'm guilty of using inappropriately myself, the answer is that they should be options/variables adjustable via the configuration file, with defaults chosen to cover the most common use cases, e.g. <slot_finish_timeout>300</slot_finish_timeout> if I want to. If option creep then becomes a concern (there are lots of similar magic numbers in this code :)), the example of <slot_finish_timeout> could be readily generalised to apply to other situations in bulk, by calling it <file_operation_timeout> instead.
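A minimal sketch of how such an option might be read, assuming a cc_config-style XML file and the hypothetical <slot_finish_timeout> tag from the post (neither is an existing BOINC option):

```python
import xml.etree.ElementTree as ET

# Default chosen to cover the most common use cases, per the post's suggestion.
DEFAULT_FINISH_TIMEOUT = 300  # seconds

def read_finish_timeout(xml_text):
    """Read the hypothetical <slot_finish_timeout> option from a
    cc_config-style XML document, falling back to the default when the
    option is absent or malformed."""
    try:
        root = ET.fromstring(xml_text)
        node = root.find(".//slot_finish_timeout")
        if node is not None and node.text:
            return int(node.text.strip())
    except (ET.ParseError, ValueError):
        pass  # unparseable file or non-numeric value: use the default
    return DEFAULT_FINISH_TIMEOUT
```

Generalising to a <file_operation_timeout> would just mean looking up a different (or additional) tag name in the same way.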
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
So what magic number do you put on how long the valid result must take to shut down?

Good points raised indeed... but IMHO in the wrong place. Unless someone (like Richard) deliberately puts a lot of effort into bringing these ideas to the BOINC devs, they will be "lost in the noise" here. It seems the BOINC devs don't read these boards at all. The places with better (only slightly better) chances of getting heard are BOINC's dev/project mailing lists.

More on topic: I would agree with the definitions of what it's all about, and with prioritising preservation of the science result over a "programmatically correct exit". It seems a change to BOINC's part of the code is required, though.

Regarding the additional exit timer: what about restarting from the last checkpoint in case of a BOINC-perceived failure, instead of throwing a computation error? That would attempt to save most of the time spent on computation, while still not waiting too long to recognise that there are difficulties on exit (as would happen if we waited out the full time reserved for task processing).

And waiting for CPU progress instead of elapsed time can have some negative effects with a GPU app. Different runtimes react differently to failure (like a driver restart). For OpenCL apps I have seen at least 3 reactions:
1) A failure code is returned to the API call - the nicest one, because it allows the app to know about the issue in some way.
2) No return from the API call, 100% core consumption - here a CPU time check can see progress.
3) No return from the API call, zero CPU consumption - here the app will never progress on CPU time until external process termination.
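Raistmer's restart-from-checkpoint suggestion could look roughly like this on the client side. This is a hypothetical policy function for illustration only; the `task` handle and its methods are assumptions, not real BOINC APIs:

```python
def on_finish_timeout(task):
    """Hypothetical policy for when the finish file has been present too
    long: instead of declaring a computation error, kill the stuck process
    and restart it from the last checkpoint, preserving the work done so
    far. Only if no checkpoint exists do we fall back to an error."""
    task.kill()  # the app is stuck on exit; terminate it externally
    if task.has_checkpoint():
        task.schedule_restart(from_checkpoint=True)
        return "restarted"
    return "error"  # nothing to resume from
```

This also sidesteps failure mode 3 above: an app with zero CPU consumption never makes progress on its own, so external termination followed by a checkpoint restart is the only way to recover the time already invested.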
Jeff Buck · Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0
Since this seems to be the most recent thread to contain a significant discussion of "finish file present too long", I'll add my latest one to keep the topic alive, although I realize there's probably little (if any) new info here.

This is Task 4933949457, which happens to be a CPU task. Three of my boxes shut down automatically just before noon on weekdays (to avoid peak electricity rates), and this task apparently wrote the finish file about 2 seconds before BOINC and the app got the shutdown order (after nearly 10 hours of processing). When the machine came back up 6 hours later, the task restarted at 100%, "finished" again, and then got saddled with the error message by BOINC. The last part of the Stderr shows:

Best spike: peak=25.48348, time=33.55, d_freq=1420242980.35, chirp=-22.546, fft_len=128k

One thing I found rather curious is that the event log doesn't show the original "Computation .... finished" for the task before the shutdown. It appears that while the finish file got written, the app didn't manage to notify BOINC that it actually was finished. The log shows:

18-May-2016 11:57:41 [SETI@home] Computation for task 05my10aa.26128.18885.6.33.190_1 finished

Notice that the times for the second "called boinc_finish(0)" in the Stderr and the "Computation ..... finished" message in the event log are 4 seconds apart, whereas the first "called boinc_finish(0)" was only 2 seconds before the log shows the exit request.

So, nearly 10 hours of processing gone to waste, and a task that will now have to be resent to another host whose time could be better spent on a fresh, unprocessed task. I swear, there's just got to be some better way for an app to communicate to BOINC that it's all done than this "finish file" kluge. Sigh.....
To Infinity And Beyond - did I turn the lights off? · Joined: 10 Aug 07 · Posts: 3 · Credit: 10,733 · RAC: 0
Heya Jim, I came across this thread while trying to get to the bottom of bigger problems on one of my machines, and tried running the DPC Latency Checker; all was OK - everything in the green, ~100 µs.

I had left the DPC checker running in the background and went back to MSI Afterburner to see what the GPU was doing. I then detached the graph section to get a better view. Suddenly it showed off-the-scale spikes of red, i.e. ~25,000 µs, with a few green traces (max was 26,063 µs). This did not go away when the graph window was closed. I then started trying various things.

I had been having problems with occasional thermal throttling, because of a combination of local temps and probably needing to replace my CPU fan's thermal paste, and had set BOINC to "use at most 70% of CPU time" to avoid this. On reverting this to 100%, the DPC Latency Checker went back to all green without closing the graph window. I then tried various "use at most CPU" values and found that 70% just happens to be the sweet spot for the problem:

100 - 80: OK
70 and below: pretty constant red.

Some of the lower values take a few seconds before the spikes appear. You seem to have to revert to 100% before changing to another value to clear the problem.

I do not yet know if this is linked to the problems I am having with display lags etc. (which go beyond GUPPI VLARs etc.), but if anyone else has MSI Afterburner installed, it would be good to know if this is repeatable beyond my machine. I haven't yet tried to see which element is actually at the bottom of the spiking.

Machine is Windows 7 64-bit, GTX 570 with Nvidia driver 353.62, MSI Afterburner v4.2.0 and BOINC 7.6.22.

Extras (all at 70%, while running a CUDA42 task): Suspending the GPU task does not clear the problem. Detaching the graph while the GPU task is suspended infrequently causes spikes. Resuming the GPU task with the graph detached does not cause spikes. Hmmmm... does it depend on what the GPU task was doing at the time?

Had difficulty repeating this as the task reached the end. New task - a GUPPI VLAR, also cuda42 - same effects. Totally suspending BOINC but leaving it in memory - no red spikes. Resuming BOINC with the graph detached does not cause spikes. Closing MSI Afterburner does not stop the spikes until BOINC is returned to 100%. I had begun to like MSI Afterburner, but will try uninstalling and test with EVGA Precision X.
To Infinity And Beyond - did I turn the lights off? · Joined: 10 Aug 07 · Posts: 3 · Credit: 10,733 · RAC: 0
Red herring alert. I have closed down MSI Afterburner and I'm still getting red spikes in the DPC Latency Checker when BOINC is running at 70% - weird. So MSI Afterburner just helped to trigger the effect.

P.S. This is a test a/c, hence the low credit total. I've suspended my main installation using Lunatics and reverted to stock to run these tests.
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
Under 'ideal' conditions you shouldn't get red spikes at all, even under load. Seeing as the machine is an i7-920 on an Intel chipset, the first thing I would check is the drivers for the system chipset devices, such as the PCI Express controller and others (they'll say Intel(R)), looking for reasonably recent versions and dates.

Beyond that, since there are many possible sources of high DPC latencies, possibly not directly related to crunching (though certainly to load), it can be a challenge to track down what's going on. In my case, years ago, there were first the described chipset driver issues, then a fairly lousy wifi adaptor and driver, and a lingering Intel RAID driver that was improved later. LatencyMon can reveal a little more detail about which drivers/hardware specifically are generating the high DPCs, though it takes a little figuring out.
To Infinity And Beyond - did I turn the lights off? · Joined: 10 Aug 07 · Posts: 3 · Credit: 10,733 · RAC: 0
I've since tried LatencyMon and it complains about DirectX (dxgkrnl.sys), NDIS.SYS, the Nvidia driver (nvlddmkm.sys), usbport.sys, etc. I have upgraded Nvidia to the latest 365.19 and have been unable to replicate the red spikes so far - however, I am now out of VLARs. I haven't checked the Intel chipset drivers lately on this machine - will give that a go.
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.