Strange Invalid MB Overflow tasks with truncated Stderr outputs...

Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1482802 - Posted: 28 Feb 2014, 15:51:26 UTC

I got a truncated stderr and invalid for task 3408715505. Lost all of 0.55 credits.

That's my GTX 470 host 4292666, running BOINC as a service under Windows XP.

Fred E.
Volunteer tester
Joined: 22 Jul 99
Posts: 768
Credit: 24,140,697
RAC: 0
United States
Message 1482821 - Posted: 28 Feb 2014, 16:48:34 UTC

> I got a truncated stderr and invalid for task 3408715505. Lost all of 0.55 credits.
>
> That's my GTX 470 host 4292666, running BOINC as a service under Windows XP.

I also got one - Task 3411221085. I lost even less credits!

The workunit has validated so it won't be there for long. XP 32 bit, GTX 670. I've had several of these since the thread started.
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.

Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1482857 - Posted: 28 Feb 2014, 18:23:43 UTC

I've had 5 more since my last post 3 weeks ago, spread across 3 different machines. A couple of them have been quickies, but the 3 on my top cruncher have run for 20+ minutes before each overflow, like 3406082374, which garnered 51.01 credits for my wingmen.

I find it really irritating that what appeared to be a simple fix still hasn't been implemented, but fixing production bugs just doesn't seem to be a priority for our project admins. Oh, well.

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1483041 - Posted: 1 Mar 2014, 2:28:02 UTC - in response to Message 1482857.  

> I've had 5 more since my last post 3 weeks ago, spread across 3 different machines. A couple of them have been quickies, but the 3 on my top cruncher have run for 20+ minutes before each overflow, like 3406082374, which garnered 51.01 credits for my wingmen.
>
> I find it really irritating that what appeared to be a simple fix still hasn't been implemented, but fixing production bugs just doesn't seem to be a priority for our project admins. Oh, well.


Aside from the possible improvements to the validation logic that Joe described, the 'workaround' I use on my end, which apparently serves OK for CUDA MB purposes and perhaps helps in some of Raistmer's cases too, is just that: a workaround, as opposed to a generalised fix that would apply to every project/application/OS.

Unfortunately, the 'root cause' is somewhat more insidious and relates to some design limitations in boincapi itself (which is linked into every BOINC-enabled science application, whether reliant on stderr contents or not).

Designing a comprehensive patch for that, and getting it past 'the committee', is the next challenge. But done 'properly', and only once, it should pull boincapi further toward post-2003 awareness of multithreaded operating system libraries.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1483065 - Posted: 1 Mar 2014, 4:28:02 UTC - in response to Message 1483041.  

> Aside from the possible improvements to the validation logic that Joe described, the 'workaround' I use on my end, which apparently serves OK for CUDA MB purposes and perhaps helps in some of Raistmer's cases too, is just that: a workaround, as opposed to a generalised fix that would apply to every project/application/OS.

Yes, it's Joe's recommended fix to the validation logic that I was referring to. It seems to be a simple fix, would resolve the "Instant Invalid" issue for any truncated Stderr regardless of processor type (NVIDIA, ATI, or CPU), and would only need to be applied on the server side, with no requirement for any action to be taken on the host side.

Of course, a fix to the validator code doesn't do anything to prevent the creation of truncated Stderr files in the first place, even though it should eliminate the primary damage they cause. From that perspective, it certainly appears from TBar's reports that your workaround would be equally effective at resolving the problem by eliminating the truncated Stderr output at the source. However, if I understand it correctly, it would only apply to NVIDIA tasks and, at least for now, requires manual installation and is limited to hosts running Lunatics. Under the circumstances, I'd have to vote for the validator fix. (As if any of us actually have a vote on such matters. ;^))

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1483070 - Posted: 1 Mar 2014, 5:03:48 UTC - in response to Message 1483065.  

Yep, I tend to *try* to use a holistic approach, preempting design issues that may cause future problems. I'd like to see both ends a lot more robust overall (for the sake of user experience, as opposed to 'just' the valuable science results), but I've worked out that's going to be a fairly long haul ;)

TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1483356 - Posted: 1 Mar 2014, 20:50:23 UTC - in response to Message 1483065.  
Last modified: 1 Mar 2014, 21:05:39 UTC

Yep, the commode trick doesn't work for ATI hosts. See this Instant Invalid:

Workunit 1439991717
Task        Computer  Sent                       Time reported             Status                              Run time (sec)  CPU time (sec)  Credit   Application
3412630249  5600503   28 Feb 2014, 14:44:38 UTC  1 Mar 2014, 19:30:50 UTC  Completed, validation inconclusive  2,144.32        194.31          pending  SETI@home v7 Anonymous platform (NVIDIA GPU)
3412630250  6796475   28 Feb 2014, 14:44:41 UTC  1 Mar 2014, 17:58:14 UTC  Completed, marked as invalid        1,717.16        139.16          0.00     SETI@home v7 Anonymous platform (ATI GPU)
3415058026  5613876   1 Mar 2014, 20:05:13 UTC   26 Apr 2014, 9:04:58 UTC  In progress                         ---             ---             ---      SETI@home v7 Anonymous platform (NVIDIA GPU)


My Results:
Stderr output
<core_client_version>7.2.39</core_client_version>
<![CDATA[
<stderr_txt>

</stderr_txt>
]]>

WingPerson:
Spike count: 27
Autocorr count: 0
Pulse count: 2
Triplet count: 0
Gaussian count: 1

A Truncated Stderr on an Overflow with a Spike count less than 30.
Oh, that one ran more than just a few seconds before being trashed by the server at first glance. Not to mention my 'Consecutive valid tasks' total was also trashed...
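
The arithmetic behind that observation can be sketched in a few lines of Python. This is only an illustration: the limit of 30 is inferred from the counts quoted in this thread (the wingperson's counts add up to 30), and `signal_counts` and `looks_like_overflow` are hypothetical helper names, not anything taken from the SETI@home validator.

```python
import re

# Assumption: 30 is the overflow limit, inferred from the counts quoted in
# this thread rather than taken from any project source code.
ASSUMED_SIGNAL_LIMIT = 30

def signal_counts(stderr_text: str) -> dict:
    """Collect 'Name count: N' lines into a dict such as {'Spike': 27, ...}."""
    pattern = re.compile(r"^\s*(\w+) count:\s*(\d+)\s*$", re.MULTILINE)
    return {name: int(value) for name, value in pattern.findall(stderr_text)}

def looks_like_overflow(stderr_text: str, limit: int = ASSUMED_SIGNAL_LIMIT) -> bool:
    """True when the summed signal counts reach the assumed limit."""
    counts = signal_counts(stderr_text)
    return bool(counts) and sum(counts.values()) >= limit

# The wingperson's counts as posted above.
wingperson = """Spike count: 27
Autocorr count: 0
Pulse count: 2
Triplet count: 0
Gaussian count: 1
"""

print(signal_counts(wingperson))
print(looks_like_overflow(wingperson))
```

Here 27 + 0 + 2 + 0 + 1 = 30, so the task reached the assumed limit even though no single count got to 30 on its own.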

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1483412 - Posted: 2 Mar 2014, 0:10:02 UTC - in response to Message 1483356.  
Last modified: 2 Mar 2014, 0:15:48 UTC

> A Truncated Stderr on an Overflow with a Spike count less than 30.
> Oh, that one ran more than just a few seconds before being trashed by the server at first glance. Not to mention my 'Consecutive valid tasks' total was also trashed...


There are a number of issues with the current boincapi itself, revolving around abrupt termination. Raistmer indicated to me that he wanted to wait for Berkeley to fix their boincapi (fair enough IMO). I'm doing my best to get exit handling generalised for non-CUDA and non-Windows builds as well, enough to submit. The (working) one in the CUDA builds was submitted and turfed out of the Berkeley trees 4-5 years ago, though, so I'm looking for a way they will accept/understand.

(Preferably some way that won't involve me flying to Berkeley with a stack of Operating System concepts textbooks... as a rescue package. )

Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1483491 - Posted: 2 Mar 2014, 5:17:12 UTC - in response to Message 1483412.  

Might as well add to the list. (I've had a few others in the past, but very, very infrequently).


Run time 4.09
CPU time 1.86

Stderr output
<core_client_version>7.0.64</core_client_version>
<![CDATA[
<stderr_txt>

</stderr_txt>
]]>



Run time 3.11
CPU time 1.23


Wingmate.

Stderr output
<core_client_version>7.2.39</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 1 CUDA device(s):
Device 1: GeForce GTX 560, 1024 MiB, regsPerBlock 32768
computeCap 2.1, multiProcs 7
pciBusID = 1, pciSlotID = 0
clockRate = 1620 MHz
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce GTX 560 is okay
SETI@home using CUDA accelerated device GeForce GTX 560
pulsefind: blocks per SM 4 (Fermi or newer default)
pulsefind: periods per launch 100 (default)
Priority of process set to BELOW_NORMAL (default) successfully
Priority of worker thread set successfully

setiathome enhanced x41zc, Cuda 5.00

Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is : 2.707094
re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes
re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes
Thread call stack limit is: 1k
cudaAcc_free() called...
cudaAcc_free() running...
cudaAcc_free() PulseFind freed...
cudaAcc_free() Gaussfit freed...
cudaAcc_free() AutoCorrelation freed...
cudaAcc_free() DONE.
Cuda sync'd & freed.
Preemptively acknowledging a safe Exit. ->
SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected equals the storage space allocated.

Flopcounter: 206310966.000000

Spike count: 0
Autocorr count: 0
Pulse count: 30
Triplet count: 0
Gaussian count: 0
Worker preemptively acknowledging an overflow exit.->
called boinc_finish
Exit Status: 0
boinc_exit(): requesting safe worker shutdown ->
boinc_exit(): received safe worker shutdown acknowledge ->
Cuda threadsafe ExitProcess() initiated, rval 0

</stderr_txt>
]]>
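
For comparing a pair of outputs like the two above, a rough Python sketch follows. It assumes the Stderr text has the shape shown in this thread: a `<stderr_txt>` element, with the string `result_overflow` present when the app reported an overflow. `classify_stderr` is a hypothetical helper for reading task pages, not part of BOINC or the project's validator.

```python
import re

def classify_stderr(stderr_output: str) -> str:
    """Label a task's Stderr output as 'truncated', 'overflow' or 'normal'."""
    match = re.search(r"<stderr_txt>(.*?)</stderr_txt>", stderr_output, re.DOTALL)
    body = match.group(1).strip() if match else ""
    if not body:
        # Nothing but whitespace between the tags, as in the first output above.
        return "truncated"
    if "result_overflow" in body:
        # Marker taken from the wingmate's output quoted above.
        return "overflow"
    return "normal"

mine = "<![CDATA[\n<stderr_txt>\n\n</stderr_txt>\n]]>"
wingmate = (
    "<![CDATA[\n<stderr_txt>\n"
    "SETI@Home Informational message -9 result_overflow\n"
    "Pulse count: 30\n"
    "</stderr_txt>\n]]>"
)

print(classify_stderr(mine), classify_stderr(wingmate))
```

A pair that comes out as 'truncated' against 'overflow' matches the pattern this thread is about.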
Grant
Darwin NT

Mike (Special Project $75 donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 34255
Credit: 79,922,639
RAC: 80
Germany
Message 1486723 - Posted: 9 Mar 2014, 16:47:08 UTC

This happens on -9 units sometimes.
I discovered it a few weeks ago.
It's fixed in the upcoming version I will host on my website soon.


With each crime and every kindness we birth our future.

Mike (Special Project $75 donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 34255
Credit: 79,922,639
RAC: 80
Germany
Message 1486735 - Posted: 9 Mar 2014, 17:13:13 UTC - in response to Message 1486726.  

>> This happens on -9 units sometimes.
>> I discovered it a few weeks ago.
>> It's fixed in the upcoming version I will host on my website soon.
>
> New versions both for CPU, and GPU OpenCL? I've seen it on CPU tasks too.

CPU is a different story.

Truncated stderr is a BOINC API issue.



Mike (Special Project $75 donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 34255
Credit: 79,922,639
RAC: 80
Germany
Message 1486759 - Posted: 9 Mar 2014, 17:51:42 UTC - in response to Message 1486751.  

>>>> This happens on -9 units sometimes.
>>>> I discovered it a few weeks ago.
>>>> It's fixed in the upcoming version I will host on my website soon.
>>>
>>> New versions both for CPU, and GPU OpenCL? I've seen it on CPU tasks too.
>>
>> CPU is a different story.
>>
>> Truncated stderr is a BOINC API issue.
>
> So, what you're saying is that you have fixed another issue than the one we're talking about in this thread, the "Strange Invalid MB Overflow tasks with truncated Stderr outputs", and which was the one I posted about too? That issue also only happens, as is seen by all posts in this thread, with overflowed tasks.
>
> Don't confuse this old man Mike :-)
>
> I was getting my hopes up, and then you smacked me down again. Thanks for nothing. Hehe

I'm just saying those are 2 different issues.
For OpenCL, at least the invalids should be fixed.



Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1487491 - Posted: 11 Mar 2014, 21:20:34 UTC

Both the new optimized CPU AP and OpenCL AP builds are going with commode.obj linkage. So, if this can help, it will help both apps with cases of truncation.
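
As a rough illustration of the idea behind that linkage: with commode.obj linked in, the Microsoft C runtime's `fflush` commits buffered data to disk rather than only handing it to the operating system. The Python sketch below mimics that with `flush()` followed by `os.fsync()`. It is an analogy only; `append_and_commit` is a hypothetical name, and none of this is the actual app or boincapi code.

```python
import os
import tempfile

def append_and_commit(path: str, text: str) -> None:
    """Append text to a log file and push it all the way to disk before returning."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(text)
        log.flush()             # Python's buffer -> operating system
        os.fsync(log.fileno())  # operating system cache -> disk

# Small demonstration against a throwaway file standing in for stderr.txt.
demo_path = os.path.join(tempfile.mkdtemp(), "stderr.txt")
append_and_commit(demo_path, "Spike count: 0\n")
append_and_commit(demo_path, "called boinc_finish\n")
```

Committing on every write costs some speed, but it means an abrupt exit shortly afterwards is less likely to leave the file short of what the app already wrote.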
SETI apps news
We're not gonna fight them. We're gonna transcend them.

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1488927 - Posted: 14 Mar 2014, 19:04:22 UTC - in response to Message 1488888.  

If a file is deleted and re-created within a short enough time, the OS treats the new file as the same file and doesn't change the creation time at all.

It's a compatibility measure to handle editors that update files via delete/re-create.

So, your observation per se doesn't show an issue.

Also, if stderr is OK in the slot and gets truncated LATER, then no app changes will help with this issue; only a change in the BOINC software could help.
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.