Strange Invalid MB Overflow tasks with truncated Stderr outputs...


TBar (Volunteer tester)
Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768 · United States
Message 1464441 - Posted: 15 Jan 2014, 0:32:00 UTC - in response to Message 1464439.  

The Windows 8 Host had over 7000 Consecutive valid tasks before the Invalid recorded yesterday. Most of those 7000 were completed by the 8800GT at around 60 per day. That means it had gone almost 4 Months without a CUDA Error/Invalid.

7.0.64 is sounding good about now. I think I will change the XP Host back...
ID: 1464441
jason_gee (Volunteer developer · Volunteer tester)
Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0 · Australia
Message 1464442 - Posted: 15 Jan 2014, 0:35:29 UTC - in response to Message 1464441.  

The Windows 8 Host had over 7000 Consecutive valid tasks before the Invalid recorded yesterday. Most of those 7000 were completed by the 8800GT at around 60 per day. That means it had gone almost 4 Months without a CUDA Error/Invalid.

7.0.64 is sounding good about now. I think I will change the XP Host back...


Might be the go. As I'm wading through Boinc logs (to catch up a bit), much of it looks like the old problem of treating symptoms rather than finding and eliminating root causes. Dialling back to something that works for you seems sensible.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1464442
Jeff Buck (Volunteer tester)
Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0 · United States
Message 1464443 - Posted: 15 Jan 2014, 0:38:07 UTC - in response to Message 1464441.  

7.0.64 is sounding good about now. I think I will change the XP Host back...

See my example in Message 1461378. That was with 7.0.64 running stock Cuda42 under Win XP.
ID: 1464443
Jeff Buck (Volunteer tester)
Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0 · United States
Message 1464444 - Posted: 15 Jan 2014, 0:46:55 UTC - in response to Message 1464436.  

Could you enable the task_debug cc_config option ?

Turned it on on one machine. Looks like the log is going to fill up with a whole lot of checkpoint messages!


If you want to keep that down, you could up the 'Write to disk at most every ...' setting.

Well, I doubled it from the default of 60 to 120, but with 16 tasks running concurrently (8 CPU and 8 GPU), that's still a lot of checkpointing. How high could that value be increased and still be "safe"?

15 seconds is nowhere near long enough to assume the OS got around to all its garbage collection, especially under heavy system contention.

In the example I gave earlier, the entire life-cycle of the task, from the time it started to execute until the time it finished uploading, totaled only 14 seconds (running on Win XP).


I've had my checkpoint period up to an hour (3600) with no ill effects apart from the added risk of having to reprocess more in the event of a power failure [or reboot etc].

At this stage we don't know whether the 'normal' log prints 'finished' before or after the garbage collection phase. From your quote of 14 seconds, I suspect it's as soon as the client receives the 'I am quitting' message, meaning the task may have 'wanted' anywhere from a few extra milliseconds to much longer. task_debug should hopefully show messages after the 'task xxxx has finished' line, illustrating the cleanup cycle.

Okay, I've bumped the checkpoints up to 300 seconds for now. I guess I'll add the task_debug to the other 2 machines that have most recently had truncated STDERRs and just kind of monitor them to see what happens. It seems that at least one of them should spit one out within a week or two.
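[For anyone following along, the two knobs discussed above live in different places: task_debug is a log flag in cc_config.xml in the BOINC data directory, while the checkpoint interval is the 'Write to disk at most every N seconds' computing preference. A minimal cc_config.xml sketch, with re-read config or a client restart needed after editing:]

```xml
<cc_config>
  <log_flags>
    <!-- log per-task state transitions (the [task] lines quoted in this thread) -->
    <task_debug>1</task_debug>
  </log_flags>
</cc_config>
```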
ID: 1464444
jason_gee (Volunteer developer · Volunteer tester)
Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0 · Australia
Message 1464445 - Posted: 15 Jan 2014, 0:47:36 UTC - in response to Message 1464443.  

7.0.64 is sounding good about now. I think I will change the XP Host back...

See my example in Message 1461378. That was with 7.0.64 running stock Cuda42 under Win XP.


In the code areas I'm looking at, the actual delays involved are very system-dependent. The specific (yet to be identified) root causes probably go back to very early Boinc, but just manifest differently.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1464445
jason_gee (Volunteer developer · Volunteer tester)
Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0 · Australia
Message 1464446 - Posted: 15 Jan 2014, 0:48:48 UTC - in response to Message 1464444.  

Okay, I've bumped the checkpoints up to 300 seconds for now. I guess I'll add the task_debug to the other 2 machines that have most recently had truncated STDERRs and just kind of monitor them to see what happens. It seems that at least one of them should spit one out within a week or two.


Excellent! Thanks very much :)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1464446
Jeff Buck (Volunteer tester)
Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0 · United States
Message 1464447 - Posted: 15 Jan 2014, 1:00:51 UTC

Perhaps just for a baseline showing the extra debug messages, here's a snippet from the log for a task (3334867390) with a -9 overflow that just occurred, but with a complete STDERR.

1/14/2014 4:32:24 PM | SETI@home | [task] task_state=EXECUTING for 30oc13aa.26886.84409.438086664206.12.0_0 from start
1/14/2014 4:32:24 PM | SETI@home | Starting task 30oc13aa.26886.84409.438086664206.12.0_0 using setiathome_v7 version 700 (cuda50) in slot 2

1/14/2014 4:32:26 PM | SETI@home | Started upload of 31oc13aa.17195.121487.438086664202.12.76_0_0
1/14/2014 4:32:29 PM | SETI@home | Finished upload of 31oc13aa.17195.121487.438086664202.12.76_0_0
1/14/2014 4:32:29 PM | SETI@home | [task] result state=FILES_UPLOADED for 31oc13aa.17195.121487.438086664202.12.76_0 from CS::update_results
1/14/2014 4:32:36 PM | SETI@home | [task] Process for 30oc13aa.26886.84409.438086664206.12.0_0 exited, exit code 0, task state 1
1/14/2014 4:32:36 PM | SETI@home | [task] task_state=EXITED for 30oc13aa.26886.84409.438086664206.12.0_0 from handle_exited_app
1/14/2014 4:32:36 PM | SETI@home | [task] result 22oc13ab.32207.90595.438086664199.12.102_0 checkpointed
1/14/2014 4:32:36 PM | SETI@home | Computation for task 30oc13aa.26886.84409.438086664206.12.0_0 finished
1/14/2014 4:32:36 PM | SETI@home | [task] result state=FILES_UPLOADING for 30oc13aa.26886.84409.438086664206.12.0_0 from CS::app_finished
1/14/2014 4:32:36 PM | SETI@home | [task] task_state=EXECUTING for 30oc13aa.26886.84409.438086664206.12.15_1 from start
1/14/2014 4:32:36 PM | SETI@home | Starting task 30oc13aa.26886.84409.438086664206.12.15_1 using setiathome_v7 version 700 (cuda50) in slot 2
1/14/2014 4:32:38 PM | SETI@home | [task] result 31oc13aa.31499.110853.227633266699.12.187_1 checkpointed
1/14/2014 4:32:38 PM | SETI@home | Started upload of 30oc13aa.26886.84409.438086664206.12.0_0_0
1/14/2014 4:32:41 PM | SETI@home | [task] result 30oc13aa.32141.97906.438086664205.12.77_1 checkpointed
1/14/2014 4:32:41 PM | SETI@home | Finished upload of 30oc13aa.26886.84409.438086664206.12.0_0_0
1/14/2014 4:32:41 PM | SETI@home | [task] result state=FILES_UPLOADED for 30oc13aa.26886.84409.438086664206.12.0_0 from CS::update_results
ID: 1464447
Jeff Buck (Volunteer tester)
Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0 · United States
Message 1464457 - Posted: 15 Jan 2014, 2:54:59 UTC

Just had task 3335282896 finish with a truncated STDERR.

Name	31oc13aa.10747.120669.438086664206.12.255_1
Workunit	1403194608
Created	14 Jan 2014, 21:13:39 UTC
Sent	14 Jan 2014, 22:28:09 UTC
Received	15 Jan 2014, 2:06:45 UTC
Server state	Over
Outcome	Success
Client state	Done
Exit status	0 (0x0)
Computer ID	6980751
Report deadline	8 Mar 2014, 12:26:47 UTC
Run time	12.63
CPU time	2.33
Validate state	Initial
Credit	0.00
Application version	SETI@home v7 v7.00 (cuda50)
Stderr output

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 4 CUDA device(s):
  Device 1: GeForce GTX 660, 2047 MiB, regsPerBlock 65536
     computeCap 3.0, multiProcs 5 
     pciBusID = 24, pciSlotID = 0
  Device 2: GeForce GT 640, 1023 MiB, regsPerBlock 65536
     computeCap 3.0, multiProcs 2 
     pciBusID = 5, pciSlotID = 0
  Device 3: GeForce GT 640, 1023 MiB, regsPerBlock 65536
     computeCap 3.0, multiProcs 2 
     pciBusID = 69, pciSlotID = 0
  Device 4: GeForce GTX 650, 1023 MiB, regsPerBlock 65536
     computeCap 3.0, multiProcs 2 
     pciBusID = 88, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device 1: GeForce GTX 660 is okay
SETI@home using CUDA accelerated device GeForce GTX 660
mbcuda.cfg, processpriority key detected
pulsefind: blocks per SM 4 (Fermi or newer default)
pulsefind: periods per launch 100 (default)
Priority of process set to ABOVE_NORMAL successfully
Priority of worker thread set successfully

setiathome enhanced x41zc, Cuda 5.00

Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.442463

Kepler GPU current clockRate = 1162 MHz

re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes
re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes
Thread call stack limit is: 1k

</stderr_txt>
]]>


This DOES NOT NECESSARILY mean that it will be marked Invalid. We'll have to wait and see what happens when the wingman reports.

The following snippet from the log appears to contain at least one interesting tidbit.

1/14/2014 5:59:44 PM | SETI@home | [task] task_state=EXECUTING for 31oc13aa.10747.120669.438086664206.12.255_1 from start
1/14/2014 5:59:44 PM | SETI@home | Starting task 31oc13aa.10747.120669.438086664206.12.255_1 using setiathome_v7 version 700 (cuda50) in slot 4

1/14/2014 5:59:46 PM | SETI@home | Started upload of 26oc13aa.21586.2930.438086664204.12.152_2_0
1/14/2014 5:59:47 PM | SETI@home | [task] result 24oc13ab.24751.13564.438086664195.12.212_1 checkpointed
1/14/2014 5:59:49 PM | SETI@home | Finished upload of 26oc13aa.21586.2930.438086664204.12.152_2_0
1/14/2014 5:59:49 PM | SETI@home | [task] result state=FILES_UPLOADED for 26oc13aa.21586.2930.438086664204.12.152_2 from CS::update_results
1/14/2014 5:59:57 PM | SETI@home | [task] Process for 31oc13aa.10747.120669.438086664206.12.255_1 exited, exit code 0, task state 1
1/14/2014 5:59:57 PM | SETI@home | [task] task_state=EXITED for 31oc13aa.10747.120669.438086664206.12.255_1 from handle_exited_app
1/14/2014 5:59:57 PM | SETI@home | Computation for task 31oc13aa.10747.120669.438086664206.12.255_1 finished
1/14/2014 5:59:57 PM | SETI@home | [task] result state=FILES_UPLOADING for 31oc13aa.10747.120669.438086664206.12.255_1 from CS::app_finished
1/14/2014 5:59:57 PM | SETI@home | [task] task_state=EXECUTING for 31oc13aa.10704.125168.438086664205.12.165_0 from start
1/14/2014 5:59:57 PM | SETI@home | Starting task 31oc13aa.10704.125168.438086664205.12.165_0 using setiathome_v7 version 700 (cuda50) in slot 16
1/14/2014 6:00:00 PM | SETI@home | Started upload of 31oc13aa.10747.120669.438086664206.12.255_1_0
1/14/2014 6:00:03 PM | SETI@home | Finished upload of 31oc13aa.10747.120669.438086664206.12.255_1_0
1/14/2014 6:00:03 PM | SETI@home | [task] result state=FILES_UPLOADED for 31oc13aa.10747.120669.438086664206.12.255_1 from CS::update_results


Note that although the task with the truncated STDERR was running in Slot 4, the task that kicked off when that one finished was assigned to Slot 16, not a normal occurrence. (Since 16 tasks were running, Slots 0 through 15 should have been the only ones in use.) Checking further down the log, I found that when the task in Slot 16 finished, the task that replaced it went back to Slot 4. Very interesting! Although I haven't a clue what it means. ;^)
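[One possible reading of the Slot 16 behaviour, sketched below in Python. This is purely a toy model of "lowest free slot" allocation, not the actual BOINC client code, and the function name is invented: if the client hands a new task the lowest-numbered slot not marked in use, then a slot whose task has exited but not yet been fully cleaned up would still look occupied, pushing the new task out to slot 16; once cleanup finally releases slot 4, the next task drops back into it.]

```python
def next_free_slot(in_use):
    """Return the lowest slot index not currently marked as in use."""
    slot = 0
    while slot in in_use:
        slot += 1
    return slot

# 16 tasks running in slots 0-15; slot 4's task has exited,
# but cleanup hasn't released the slot yet, so it still looks busy.
in_use = set(range(16))
print(next_free_slot(in_use))   # → 16, matching the log above

in_use.discard(4)               # cleanup finally frees slot 4
print(next_free_slot(in_use))   # → 4, the "reverts back to Slot 4" behaviour
```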
ID: 1464457
jason_gee (Volunteer developer · Volunteer tester)
Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0 · Australia
Message 1464460 - Posted: 15 Jan 2014, 3:23:33 UTC - in response to Message 1464457.  
Last modified: 15 Jan 2014, 3:33:36 UTC

[Slot juggling] Could be closely related to some of the last Boinc code commits Richard's mentioned. Indeed interesting, whether or not directly connected to the stderr truncation. Either way, both sets of anomalies around cleanup are IMO symptoms of a common root cause.

Where I (currently) believe the symptoms point is toward a misunderstanding of how operating and file systems 'work'.

If you assume that things happen 'in order' as requested and 'promptly', then there is little room for weird behaviour in straightforward logic. Unfortunately, actions (including logging) based on timing on any non-real-time operating system (i.e. non-RTOS) rarely behave so linearly these days. The way around that is to use synchronisation methods (i.e. asking/requesting/acknowledging) and so build up an established protocol chain. Ironically, these 'asynchronous' methods usually end up more efficient and reliable than old blocking (synchronous) techniques, despite the extra initial overhead of designing in your own primitives.
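[As a toy illustration of the synchronisation point, in Python and entirely unrelated to the actual Boinc sources: instead of assuming a worker's output is visible 'promptly' once it reports finishing, the reader waits on an explicit acknowledgement that the worker sets only after its write has actually completed.]

```python
import threading

results = []
write_acknowledged = threading.Event()   # the explicit "I really am done" signal

def worker():
    results.append("spike count: 30")    # the write we care about
    write_acknowledged.set()             # acknowledge only AFTER the write

t = threading.Thread(target=worker)
t.start()

# A small protocol chain: wait for the ack instead of guessing at timing.
write_acknowledged.wait(timeout=5.0)
assert results == ["spike count: 30"]    # guaranteed visible after the ack
t.join()
```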
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1464460
TBar (Volunteer tester)
Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768 · United States
Message 1464467 - Posted: 15 Jan 2014, 3:47:56 UTC - in response to Message 1464457.  
Last modified: 15 Jan 2014, 3:51:07 UTC

That didn't take long. Nice job Jeff.

I just found a truncated Stderr output on the Win 8 host that validated:
Workunit 1402539217
Created 13 Jan 2014, 21:46:49 UTC
Sent 14 Jan 2014, 1:59:57 UTC
Received 14 Jan 2014, 15:00:45 UTC
Validate state Valid
Credit 0.75

Stderr output

<core_client_version>7.2.28</core_client_version>
<![CDATA[
<stderr_txt>

</stderr_txt>
]]>

Strange....

I'm going to install BOINC 7.2.36 on the Windows 8 Host and enable Debugging. Hopefully something good will result.
ID: 1464467
Jeff Buck (Volunteer tester)
Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0 · United States
Message 1464468 - Posted: 15 Jan 2014, 3:49:51 UTC - in response to Message 1464460.  

Where I (currently) believe the symptoms point is toward a misunderstanding of how operating and file systems 'work'.

If you assume that things happen 'in order' as requested and 'promptly', then there is little room for weird behaviour in straightforward logic. Unfortunately, actions (including logging) based on timing on any non-real-time operating system (i.e. non-RTOS) rarely behave so linearly these days. The way around that is to use synchronisation methods (i.e. asking/requesting/acknowledging) and so build up an established protocol chain. Ironically, these 'asynchronous' methods usually end up more efficient and reliable than old blocking (synchronous) techniques, despite the extra initial overhead of designing in your own primitives.

Is this an issue related to "write caching" for the HD? I always have that feature enabled and was under the impression that only a sudden power outage or HD failure would have an impact for a non-removable drive.
ID: 1464468
jason_gee (Volunteer developer · Volunteer tester)
Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0 · Australia
Message 1464470 - Posted: 15 Jan 2014, 3:50:52 UTC

A couple more notes, having had a skim over:

- The format of the [task] handle-exited-app log message doesn't exactly match the current code, indicating some playing around going on there.
- Q: Why would you try to kill processes that already exited successfully? That's the first thing the current handle_exited_app() code does... which seems a rather odd thing to do. More beer needed to decide on the logic there.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1464470
jason_gee (Volunteer developer · Volunteer tester)
Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0 · Australia
Message 1464473 - Posted: 15 Jan 2014, 3:56:13 UTC - in response to Message 1464468.  
Last modified: 15 Jan 2014, 3:58:47 UTC

Where I (currently) believe the symptoms point is toward a misunderstanding of how operating and file systems 'work'.

If you assume that things happen 'in order' as requested and 'promptly', then there is little room for weird behaviour in straightforward logic. Unfortunately, actions (including logging) based on timing on any non-real-time operating system (i.e. non-RTOS) rarely behave so linearly these days. The way around that is to use synchronisation methods (i.e. asking/requesting/acknowledging) and so build up an established protocol chain. Ironically, these 'asynchronous' methods usually end up more efficient and reliable than old blocking (synchronous) techniques, despite the extra initial overhead of designing in your own primitives.

Is this an issue related to "write caching" for the HD? I always have that feature enabled and was under the impression that only a sudden power outage or HD failure would have an impact for a non-removable drive.


In a sense, yes, along similar but somewhat distant lines. And even then, proper transaction handling should prevent the most likely problems. That makes assumptions about layers of firmware, OS and drivers being bug-free. Most likely the race conditions are in Boinc code, but the concepts are closely connected.
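[A sketch of the write-path layers under discussion, in generic Python and not anything from the Boinc or SETI code: a stderr-style file only becomes durable once it has passed both the user-space buffer and the OS page cache, and each layer needs its own explicit push.]

```python
import os

def write_durably(path, text):
    """Write text, then push it through each caching layer explicitly."""
    with open(path, "w") as f:
        f.write(text)           # sits in the user-space stdio buffer
        f.flush()               # hand it to the OS (page cache)
        os.fsync(f.fileno())    # ask the kernel to commit it to the device
    # A drive's own write cache can still reorder below this point,
    # which is the HD "write caching" setting mentioned above.

write_durably("stderr_demo.txt", "Thread call stack limit is: 1k\n")
```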
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1464473
Richard Haselgrove (Volunteer tester)
Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874 · United Kingdom
Message 1464541 - Posted: 15 Jan 2014, 9:46:14 UTC - in response to Message 1464460.  
Last modified: 15 Jan 2014, 9:54:12 UTC

[Slot juggling] Could be closely related to some of the last Boinc code commits Richard's mentioned. Indeed interesting, whether or not directly connected to the stderr truncation. Either way, both sets of anomalies around cleanup are IMO symptoms of a common root cause.

I haven't done any exploring on the incomplete stderr.txt issue, but I have - in general running - noticed that the cuda50 app takes the 'temporary exit' route quite often - and I don't think that appears in the general message log.

When that happens, BOINC Manager/View/Tasks (whatever your tool of choice) shows the current task as 'waiting to run', and starts a new one in a new slot directory (because the old one is still occupied).

Edit - despite the temporary exits, that host is showing 1784 consecutive valid tasks, and no invalids.
ID: 1464541
jason_gee (Volunteer developer · Volunteer tester)
Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0 · Australia
Message 1464547 - Posted: 15 Jan 2014, 10:35:43 UTC - in response to Message 1464541.  
Last modified: 15 Jan 2014, 10:46:18 UTC

Edit - despite the temporary exits, that host is showing 1784 consecutive valid tasks, and no invalids.


Alright, yeah, we're going to have to tread carefully with respect to 'crossed symptoms'. The good news there is that similar symptoms pointing in different directions tend to indicate 'not the root of the issue'. That just means 'not my fault, I did a temp exit because something else went wacko'.

[Edit:] hmm, all temp exits should be being effectively logged by the client... if not, there's a breakage there (too)

[Edit2:] ah, in current code he has them in task_debug
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1464547
Richard Haselgrove (Volunteer tester)
Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874 · United Kingdom
Message 1464550 - Posted: 15 Jan 2014, 10:46:33 UTC - in response to Message 1464547.  

Edit - despite the temporary exits, that host is showing 1784 consecutive valid tasks, and no invalids.

Alright, yeah, we're going to have to tread carefully with respect to 'crossed symptoms'. The good news there is that similar symptoms pointing in different directions tend to indicate 'not the root of the issue'. That just means 'not my fault, I did a temp exit because something else went wacko'.

[Edit:] hmm, all temp exits should be being effectively logged by the client... if not, there's a breakage there (too)

I tend to run with 'work fetch debug' activated, which dumps a huge volume of data into the log - I could well have missed exit information. I'll tone it down for a while, and then have a search (or spot check if I see it happening). That would answer whether temp exit is logged by v7.2.37.
ID: 1464550
jason_gee (Volunteer developer · Volunteer tester)
Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0 · Australia
Message 1464552 - Posted: 15 Jan 2014, 10:50:52 UTC - in response to Message 1464550.  

Edit - despite the temporary exits, that host is showing 1784 consecutive valid tasks, and no invalids.

Alright, yeah, we're going to have to tread carefully with respect to 'crossed symptoms'. The good news there is that similar symptoms pointing in different directions tend to indicate 'not the root of the issue'. That just means 'not my fault, I did a temp exit because something else went wacko'.

[Edit:] hmm, all temp exits should be being effectively logged by the client... if not, there's a breakage there (too)

I tend to run with 'work fetch debug' activated, which dumps a huge volume of data into the log - I could well have missed exit information. I'll tone it down for a while, and then have a search (or spot check if I see it happening). That would answer whether temp exit is logged by v7.2.37


I looked. As per the crossing second edit: it'll be under task_debug in recent code.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1464552
Richard Haselgrove (Volunteer tester)
Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874 · United Kingdom
Message 1464559 - Posted: 15 Jan 2014, 11:20:49 UTC - in response to Message 1464552.  

Edit - despite the temporary exits, that host is showing 1784 consecutive valid tasks, and no invalids.

Alright, yeah, we're going to have to tread carefully with respect to 'crossed symptoms'. The good news there is that similar symptoms pointing in different directions tend to indicate 'not the root of the issue'. That just means 'not my fault, I did a temp exit because something else went wacko'.

[Edit:] hmm, all temp exits should be being effectively logged by the client... if not, there's a breakage there (too)

I tend to run with 'work fetch debug' activated, which dumps a huge volume of data into the log - I could well have missed exit information. I'll tone it down for a while, and then have a search (or spot check if I see it happening). That would answer whether temp exit is logged by v7.2.37

I looked. as per crossing second edit: it'll be under task_debug in recent code.

That would be why I didn't see them. OK, activated - just have to see how long Murphy can prevent her throwing any temp exits.
ID: 1464559
Jeff Buck (Volunteer tester)
Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0 · United States
Message 1464643 - Posted: 15 Jan 2014, 17:15:09 UTC - in response to Message 1464541.  

When that happens, BOINC Manager/View/Tasks (whatever your tool of choice) shows the current task as 'waiting to run', and starts a new one in a new slot directory (because the old one is still occupied).

I do see that happen from time to time, but in those instances once the "new" task completes, the task "waiting to run" resumes in its original slot. In this case, though, when the new task started in Slot 16, there wasn't actually any task in a "waiting to run" state, so when the task in Slot 16 finished, the next task in the queue started fresh in Slot 4.

Edit - despite the temporary exits, that host is showing 1784 consecutive valid tasks, and no invalids.

Yeah, the last actual Invalid was on January 6, the example I gave in Message 1461404.
ID: 1464643
Jeff Buck (Volunteer tester)
Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0 · United States
Message 1464659 - Posted: 15 Jan 2014, 17:38:06 UTC

In checking overnight results, I see that WU 1403194608, referenced in Message 1464457, was successfully validated by my wingman, despite the truncated STDERR in my task. The wingman's STDERR shows a Spike count of 30, which is consistent with my observation that, so far, my only Invalids with a truncated STDERR have come when the wingman's -9 overflow has a Spike count of less than 30.

But now here's another task, 3335606685, with a truncated STDERR but not yet reported by the wingman.

Name	22oc13ab.4958.88550.438086664204.12.5_0
Workunit	1403344842
Created	15 Jan 2014, 0:57:06 UTC
Sent	15 Jan 2014, 2:44:13 UTC
Received	15 Jan 2014, 8:16:28 UTC
Server state	Over
Outcome	Success
Client state	Done
Exit status	0 (0x0)
Computer ID	6980751
Report deadline	8 Mar 2014, 16:49:28 UTC
Run time	6.50
CPU time	2.14
Validate state	Initial
Credit	0.00
Application version	SETI@home v7 v7.00 (cuda50)
Stderr output

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 4 CUDA device(s):
  Device 1: GeForce GTX 660, 2047 MiB, regsPerBlock 65536
     computeCap 3.0, multiProcs 5 
     pciBusID = 24, pciSlotID = 0
  Device 2: GeForce GT 640, 1023 MiB, regsPerBlock 65536
     computeCap 3.0, multiProcs 2 
     pciBusID = 5, pciSlotID = 0
  Device 3: GeForce GT 640, 1023 MiB, regsPerBlock 65536
     computeCap 3.0, multiProcs 2 
     pciBusID = 69, pciSlotID = 0
  Device 4: GeForce GTX 650, 1023 MiB, regsPerBlock 65536
     computeCap 3.0, multiProcs 2 
     pciBusID = 88, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device 1: GeForce GTX 660 is okay
SETI@home using CUDA accelerated device GeForce GTX 660
mbcuda.cfg, processpriority key detected
pulsefind: blocks per SM 4 (Fermi or newer default)
pulsefind: periods per launch 100 (default)
Priority of process set to ABOVE_NORMAL successfully
Priority of worker thread set successfully

setiathome enhanced x41zc, Cuda 5.00

Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.442358

Kepler GPU current clockRate = 1162 MHz

re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes
re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes
Thread call stack limit is: 1k

</stderr_txt>
]]>


The following snippet from the log appears to be nearly identical to the last one I reported:

15-Jan-2014 00:13:11 [SETI@home] [task] task_state=EXECUTING for 22oc13ab.4958.88550.438086664204.12.5_0 from start
15-Jan-2014 00:13:11 [SETI@home] Starting task 22oc13ab.4958.88550.438086664204.12.5_0 using setiathome_v7 version 700 (cuda50) in slot 4

15-Jan-2014 00:13:14 [SETI@home] Started upload of 22oc13ab.4958.88550.438086664204.12.1_0_0
15-Jan-2014 00:13:17 [SETI@home] Finished upload of 22oc13ab.4958.88550.438086664204.12.1_0_0
15-Jan-2014 00:13:17 [SETI@home] [task] result state=FILES_UPLOADED for 22oc13ab.4958.88550.438086664204.12.1_0 from CS::update_results
15-Jan-2014 00:13:19 [SETI@home] [task] Process for 22oc13ab.4958.88550.438086664204.12.5_0 exited, exit code 0, task state 1
15-Jan-2014 00:13:19 [SETI@home] [task] task_state=EXITED for 22oc13ab.4958.88550.438086664204.12.5_0 from handle_exited_app
15-Jan-2014 00:13:19 [SETI@home] Computation for task 22oc13ab.4958.88550.438086664204.12.5_0 finished
15-Jan-2014 00:13:19 [SETI@home] [task] result state=FILES_UPLOADING for 22oc13ab.4958.88550.438086664204.12.5_0 from CS::app_finished
15-Jan-2014 00:13:19 [SETI@home] [task] task_state=EXECUTING for 24oc13ab.9157.9065.438086664198.12.60_1 from start
15-Jan-2014 00:13:19 [SETI@home] Starting task 24oc13ab.9157.9065.438086664198.12.60_1 using setiathome_v7 version 700 (cuda50) in slot 16
15-Jan-2014 00:13:21 [SETI@home] Started upload of 22oc13ab.4958.88550.438086664204.12.5_0_0
15-Jan-2014 00:13:24 [SETI@home] Finished upload of 22oc13ab.4958.88550.438086664204.12.5_0_0
15-Jan-2014 00:13:24 [SETI@home] [task] result state=FILES_UPLOADED for 22oc13ab.4958.88550.438086664204.12.5_0 from CS::update_results

Once again, when the task with the truncated STDERR "finished" in Slot 4, the next task started up in Slot 16, and when that task eventually finished, the next one reverted back to Slot 4. Quite curious!
ID: 1464659


 
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.