Strange Invalid MB Overflow tasks with truncated Stderr outputs...
Author | Message |
---|---|
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
The Windows 8 Host had over 7000 Consecutive valid tasks before the Invalid recorded yesterday. Most of those 7000 were completed by the 8800GT at around 60 per day. That means it had gone almost 4 Months without a CUDA Error/Invalid. 7.0.64 is sounding good about now. I think I will change the XP Host back... |
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"The Windows 8 Host had over 7000 Consecutive valid tasks before the Invalid recorded yesterday. Most of those 7000 were completed by the 8800GT at around 60 per day. That means it had gone almost 4 Months without a CUDA Error/Invalid." Might be the go. As I'm wading through Boinc logs (to catch up a bit), much of it looks like the old problem of treating symptoms rather than finding and eliminating root causes. Dialling back to something that works for you seems sensible. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"7.0.64 is sounding good about now. I think I will change the XP Host back..." See my example in Message 1461378. That was with 7.0.64 running stock Cuda42 under Win XP. |
Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"Could you enable the task_debug cc_config option?" Okay, I've bumped the checkpoints up to 300 seconds for now. I guess I'll add task_debug to the other 2 machines that have most recently had truncated STDERRs and just monitor them to see what happens. It seems that at least one of them should spit one out within a week or two. |
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"7.0.64 is sounding good about now. I think I will change the XP Host back..." In the code areas I'm looking at, the actual delays involved are very system dependent. The specific (yet to be identified) root causes probably go back to very early Boinc, but just manifest differently. |
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"Okay, I've bumped the checkpoints up to 300 seconds for now. I guess I'll add the task_debug to the other 2 machines that have most recently had truncated STDERRs and just kind of monitor them to see what happens. It seems that at least one of them should spit one out within a week or two." Excellent! Thanks very much :) |
Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Perhaps just as a baseline showing the extra debug messages, here's a snippet from the log for a task (3334867390) with a -9 overflow that just occurred, but with a complete STDERR. 1/14/2014 4:32:24 PM \| SETI@home \| [task] task_state=EXECUTING for 30oc13aa.26886.84409.438086664206.12.0_0 from start |
Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Just had task 3335282896 finish with a truncated STDERR. Name 31oc13aa.10747.120669.438086664206.12.255_1 Workunit 1403194608 Created 14 Jan 2014, 21:13:39 UTC Sent 14 Jan 2014, 22:28:09 UTC Received 15 Jan 2014, 2:06:45 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x0) Computer ID 6980751 Report deadline 8 Mar 2014, 12:26:47 UTC Run time 12.63 CPU time 2.33 Validate state Initial Credit 0.00 Application version SETI@home v7 v7.00 (cuda50) Stderr output <core_client_version>7.2.33</core_client_version> <![CDATA[ <stderr_txt> setiathome_CUDA: Found 4 CUDA device(s): Device 1: GeForce GTX 660, 2047 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 5 pciBusID = 24, pciSlotID = 0 Device 2: GeForce GT 640, 1023 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 2 pciBusID = 5, pciSlotID = 0 Device 3: GeForce GT 640, 1023 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 2 pciBusID = 69, pciSlotID = 0 Device 4: GeForce GTX 650, 1023 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 2 pciBusID = 88, pciSlotID = 0 In cudaAcc_initializeDevice(): Boinc passed DevPref 1 setiathome_CUDA: CUDA Device 1 specified, checking... Device 1: GeForce GTX 660 is okay SETI@home using CUDA accelerated device GeForce GTX 660 mbcuda.cfg, processpriority key detected pulsefind: blocks per SM 4 (Fermi or newer default) pulsefind: periods per launch 100 (default) Priority of process set to ABOVE_NORMAL successfully Priority of worker thread set successfully setiathome enhanced x41zc, Cuda 5.00 Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements. Work Unit Info: ............... WU true angle range is : 0.442463 Kepler GPU current clockRate = 1162 MHz re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes Thread call stack limit is: 1k </stderr_txt> ]]> This DOES NOT NECESSARILY mean that it will be marked Invalid. 
We'll have to wait and see what happens when the wingman reports. The following snippet from the log appears to contain at least one interesting tidbit. 1/14/2014 5:59:44 PM \| SETI@home \| [task] task_state=EXECUTING for 31oc13aa.10747.120669.438086664206.12.255_1 from start Note that although the task with the truncated STDERR was running in Slot 4, the task that kicked off when that one finished was assigned to Slot 16, not a normal occurrence. (Since 16 tasks were running, Slots 0 through 15 should have been the only ones in use.) Checking further down the log, I found that when the task in Slot 16 finished, the task that replaced it went back to Slot 4. Very interesting! Although I haven't a clue what it means. ;^) |
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
[Slot juggling] Could be more closely related to some of the last Boinc code commits Richard's mentioned. Indeed interesting, whether or not directly connected to the stderr truncation. Either way, both sets of anomalies all around cleanup are IMO symptoms related to a root cause. Where I (currently) believe the symptoms point is toward a misunderstanding of how operating and file systems 'work'. If you assume that things happen 'in order' as requested and 'promptly', then there is little room for weird behaviour in straightforward logic. Unfortunately, actions (including logging) based on timing on any non-real-time operating system (i.e. non-RTOS) rarely behave so linearly these days. The way around that is to use synchronisation methods (i.e. asking/requesting/acknowledging) and so establish a protocol chain. Ironically, these 'asynchronous' methods usually end up more efficient and reliable than old blocking (synchronous) techniques, despite the extra initial overhead of designing in your own primitives. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
That didn't take long. Nice job Jeff. I just found a truncated Stderr output on the Win 8 host that validated, Workunit 1402539217 Created 13 Jan 2014, 21:46:49 UTC Sent 14 Jan 2014, 1:59:57 UTC Received 14 Jan 2014, 15:00:45 UTC Validate state Valid Credit 0.75 Stderr output <core_client_version>7.2.28</core_client_version> <![CDATA[ <stderr_txt> </stderr_txt> ]]> Strange.... I'm going to install BOINC 7.2.36 on the Windows 8 Host and enable Debugging. Hopefully something good will result. |
Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"Where I (currently) believe the symptoms point is toward a misunderstanding of how operating and file systems 'work'." Is this an issue related to "write caching" for the HD? I always have that feature enabled and was under the impression that only a sudden power outage or HD failure would have an impact for a non-removable drive. |
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
A couple more notes, having had a skim over: - The format of the [task] handle exited app log message doesn't exactly match the current code, indicating some playing around going on there - Q: Why would you try to kill processes that already exited successfully? That's the first thing the current handle_exited_app() code does... which seems a rather odd thing to do. More beer needed to decide on the logic there. |
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"Where I (currently) believe the symptoms point is toward a misunderstanding of how operating and file systems 'work'." In a sense, yes, along similar but somewhat distant lines. And even then proper transaction handling should prevent the most likely problems. That makes assumptions about layers of firmware, OS and drivers being bug-free. Most likely the race conditions are in Boinc code, but the concepts are closely connected. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874 |
"[Slot juggling] Could be more closely related to some of the last Boinc code commits Richard's mentioned. Indeed interesting, whether or not directly connected to the stderr truncation. Either way, both sets of anomalies all around cleanup are IMO symptoms related to a root cause." I haven't done any exploring on the incomplete stderr.txt issue, but I have, in general running, noticed that the cuda50 app takes the 'temporary exit' route quite often, and I don't think that appears in the general message log. When that happens, BOINC Manager/View/Tasks (or whatever your tool of choice) shows the current task as 'waiting to run', and starts a new one in a new slot directory (because the old one is still occupied). Edit - despite the temporary exits, that host is showing 1784 consecutive valid tasks, and no invalids. |
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"Edit - despite the temporary exits, that host is showing 1784 consecutive valid tasks, and no invalids." Alright, yeah, we're going to have to tread carefully with respect to 'crossed symptoms'. The good news there is that similar symptoms pointing in different directions tend to indicate 'not the root of the issue'. That just means 'not my fault, I did a temp exit because something else went wacko'. [Edit:] hmm, all temp exits should effectively be logged by the client... if not, there's a breakage there (too) [Edit2:] ah, in current code he has them in task_debug |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874 |
"Edit - despite the temporary exits, that host is showing 1784 consecutive valid tasks, and no invalids." I tend to run with 'work fetch debug' activated, which dumps a huge volume of data into the log, so I could well have missed exit information. I'll tone it down for a while, and then have a search (or spot check if I see it happening). That would answer whether temp exit is logged by v7.2.37. |
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"Edit - despite the temporary exits, that host is showing 1784 consecutive valid tasks, and no invalids." I looked. As per my second edit, which crossed with your post: it'll be under task_debug in recent code. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874 |
"Edit - despite the temporary exits, that host is showing 1784 consecutive valid tasks, and no invalids." That would be why I didn't see them. OK, activated - just have to see how long Murphy can prevent her throwing any temp exits. |
Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"When that happens, BOINC Manager/View/Tasks (whatever your tool of choice) shows the current task as 'waiting to run', and starts a new one in a new slot directory (because the old one is still occupied)." I do see that happen from time to time, but in those instances once the "new" task completes, the task "waiting to run" resumes in its original slot. In this case, though, when the new task started in Slot 16, there wasn't actually any task in a "waiting to run" state, so when the task in Slot 16 finished, the next task in the queue started fresh in Slot 4. "Edit - despite the temporary exits, that host is showing 1784 consecutive valid tasks, and no invalids." Yeah, the last actual Invalid was on January 6, the example I gave in Message 1461404. |
Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
In checking overnight results, I see that WU 1403194608, referenced in Message 1464457 was successfully validated by my wingman, despite the truncated STDERR in my task. The wingman's STDERR shows a Spike count of 30, which is consistent with my observation that so far my only Invalids with a truncated STDERR have come when the wingman's -9 overflow has a Spike count of less than 30. But now here's another task, 3335606685, with a truncated STDERR but not yet reported by the wingman. Name 22oc13ab.4958.88550.438086664204.12.5_0 Workunit 1403344842 Created 15 Jan 2014, 0:57:06 UTC Sent 15 Jan 2014, 2:44:13 UTC Received 15 Jan 2014, 8:16:28 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x0) Computer ID 6980751 Report deadline 8 Mar 2014, 16:49:28 UTC Run time 6.50 CPU time 2.14 Validate state Initial Credit 0.00 Application version SETI@home v7 v7.00 (cuda50) Stderr output <core_client_version>7.2.33</core_client_version> <![CDATA[ <stderr_txt> setiathome_CUDA: Found 4 CUDA device(s): Device 1: GeForce GTX 660, 2047 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 5 pciBusID = 24, pciSlotID = 0 Device 2: GeForce GT 640, 1023 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 2 pciBusID = 5, pciSlotID = 0 Device 3: GeForce GT 640, 1023 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 2 pciBusID = 69, pciSlotID = 0 Device 4: GeForce GTX 650, 1023 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 2 pciBusID = 88, pciSlotID = 0 In cudaAcc_initializeDevice(): Boinc passed DevPref 1 setiathome_CUDA: CUDA Device 1 specified, checking... Device 1: GeForce GTX 660 is okay SETI@home using CUDA accelerated device GeForce GTX 660 mbcuda.cfg, processpriority key detected pulsefind: blocks per SM 4 (Fermi or newer default) pulsefind: periods per launch 100 (default) Priority of process set to ABOVE_NORMAL successfully Priority of worker thread set successfully setiathome enhanced x41zc, Cuda 5.00 Detected setiathome_enhanced_v7 task. 
Autocorrelations enabled, size 128k elements. Work Unit Info: ............... WU true angle range is : 0.442358 Kepler GPU current clockRate = 1162 MHz re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes Thread call stack limit is: 1k </stderr_txt> ]]> The following snippet from the log appears to be nearly identical to the last one I reported: 15-Jan-2014 00:13:11 [SETI@home] [task] task_state=EXECUTING for 22oc13ab.4958.88550.438086664204.12.5_0 from start Once again, when the task with the truncated STDERR "finished" in Slot 4, the next task started up in Slot 16, and when that task eventually finished, the next one reverted back to Slot 4. Quite curious! |
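
One way to picture the Slot 4, Slot 16, Slot 4 pattern reported above is a toy model in which a new task always gets the lowest-numbered free slot, and the slot of a task that has just finished is sometimes released late. This is only a sketch under those assumptions: `lowest_free_slot`, `simulate` and the `release_is_late` flag are hypothetical names made up for illustration, not BOINC client code, and nothing in this thread has yet established that a late release is what actually happens.

```python
def lowest_free_slot(occupied):
    """Return the smallest non-negative slot number not in `occupied`."""
    slot = 0
    while slot in occupied:
        slot += 1
    return slot


def simulate(n_running, finishing_slot, release_is_late):
    """Toy model (assumption, not BOINC code) of two consecutive task starts.

    n_running        -- number of tasks running at once (slots 0..n_running-1)
    finishing_slot   -- slot of the task that has just finished
    release_is_late  -- True if that slot still looks occupied when the
                        replacement task is started

    Returns (slot of the replacement task, slot of the task after that).
    """
    occupied = set(range(n_running))

    # Prompt cleanup: the finished task's slot is free before the next start.
    if not release_is_late:
        occupied.discard(finishing_slot)

    first = lowest_free_slot(occupied)
    occupied.add(first)

    # By the time the replacement task finishes, a late release has landed.
    if release_is_late:
        occupied.discard(finishing_slot)

    # The replacement task finishes and gives its own slot back.
    occupied.discard(first)
    second = lowest_free_slot(occupied)
    return first, second


if __name__ == "__main__":
    print("late release:  ", simulate(16, 4, release_is_late=True))
    print("prompt release:", simulate(16, 4, release_is_late=False))
```

With 16 running tasks and a late release of Slot 4, the model returns `(16, 4)`, the same sequence seen in both logs; with a prompt release it returns `(4, 4)`. That is consistent with the timing-related explanation suggested earlier in the thread, but it does not prove it.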