Strange Invalid MB Overflow tasks with truncated Stderr outputs...

Message boards : Number crunching : Strange Invalid MB Overflow tasks with truncated Stderr outputs...

Previous · 1 . . . 9 · 10 · 11 · 12 · 13 · 14 · Next

Profile Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1467157 - Posted: 22 Jan 2014, 4:20:11 UTC - in response to Message 1467142.  
Last modified: 22 Jan 2014, 4:24:53 UTC

I suspect it's related to this;

http://setiathome.berkeley.edu/forum_thread.php?id=73765&postid=1465130#1465130

I suppose, but v7 had been running for a month before the instant Invalids started to show up, at least on my machines. Perhaps they were still "tweaking" the validators, and finally just went one tweak too far. :-)

See post 46375 at SETI Beta and followups for some history of that final tweak. The source code change was checked in June 19. It's likely the change was implemented here during a Tuesday outage; June 25, July 2, or July 9.

Cutting off users who had changed their app_info.xml files so they were processing v7 tasks with apps unable to do the Autocorrelation search was really needed. But it would have been good if someone had realized the extent of BOINC's tendency to not include the full stderr.txt then, so more complete detection of overflow results could have been implemented.
                                                                  Joe

Ah HA! I think a light bulb just came on!

Joe, in an earlier post, you had explained (I think) how even though a task might have a truncated Stderr, it would still be validated if the Spike count equaled 30. And all of my tasks which met those conditions did indeed validate. On the other hand, most of the tasks with truncated Stderr and Spike counts of less than 30 also still validated, but some didn't, the ones that triggered this thread. According to the Raistmer post that you just referenced, his solution to the v6 apps running v7 tasks was to invalidate them "if it has no best autocorrelation result in it" and Eric quickly responded that he'd implemented that solution in Beta. But I think that what he may have actually done is invalidate any results which don't have at least one <autocorr> section rather than a <best_autocorr>. In reviewing both my Invalids and the details of the ones that TBar has posted, I've suddenly come to the realization that all of the Invalids with a Spike count of less than 30 had wingmen who had Autocorr counts of 0. Therefore the result file for the task marked Invalid would not have had any <autocorr> sections. Conversely, I think all of the tasks with truncated Stderr which were successfully validated showed an Autocorr count of at least 1 in the wingman's Stderr.

I would think this assumption should be easy enough to verify by someone who can look at (and understand - which leaves me out) that code change in the validator.

Edit: And perhaps to clarify one point, none of the overflow tasks would have the <best_autocorr> section, but they would have an <autocorr> if at least one was found.
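To illustrate the distinction (a hypothetical fragment; actual S@H v7 result files carry more fields than shown), an overflow result with a single autocorr hit would contain something like:

```xml
<result>
  <spike>...</spike>
  <autocorr>...</autocorr>
  <!-- no <best_autocorr>: the app stopped on overflow before writing it -->
</result>
```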
ID: 1467157
Profile William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1467214 - Posted: 22 Jan 2014, 7:59:34 UTC - in response to Message 1467071.  

Well, fingers crossed that's a sign that, if truncated result files are possible too through the same mechanisms, they are 'pretty rare' and being caught. Getting a well tested generalised Windows fix into seti@home's app codebases is relatively straightforward, even though a fair wait for public build(s) is likely. Getting boincapi patched for the stderr might be more problematic, since explaining the root causes of the problems can be more challenging.

Nope, explaining it is not the difficulty; getting David to pay attention and listen long enough with an open mind, THAT is the challenge!
Like most people he tends to stick to his preconceptions instead of letting himself be guided by the data. [In this context, information published by MS with regard to the caching behaviour of their OS is data.]
Most people don't like to hear 'this isn't a good idea, it doesn't work reliably enough'. It's like being told 'you have to be in first gear to start a car, it doesn't work in third' and you reply 'but if I'm really careful it does work [and I don't have to remember to switch to first at each traffic light]'. No, I don't have an equivalent for people with automatic gearboxes.
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1467214
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1467363 - Posted: 22 Jan 2014, 17:48:31 UTC
Last modified: 22 Jan 2014, 18:18:45 UTC

I've been testing Jason's test app for over 3 days now and it has been a 100% success. Not a single truncated Stderr output on either machine. But, I know how that goes. I've had 100% success testing the Beta ATI Mac App since July, and we still don't have a stock GPU Mac App at SETI. I guess I just keep running it until something happens...

It would be interesting if someone receiving these MB Instant Invalids with a complete Stderr were to run Jason's test app and see what happens. Since I'm not receiving that type of MB Invalid, it will have to be someone else.
CUDA Test App Download: http://jgopt.org/download.html
ID: 1467363
ralph

Joined: 19 Feb 12
Posts: 19
Credit: 31,993,767
RAC: 9
United States
Message 1467420 - Posted: 22 Jan 2014, 19:47:20 UTC
Last modified: 22 Jan 2014, 19:47:58 UTC

I have resisted posting here, mainly because you all outclass me in computer knowledge, but I have been active with computers since the IBM PC Jr in the early '80s.

I have 2 computers: 6426852 running Kubuntu 13.10, and 7129692 running Win 8.1, which has been productive with no invalids.

6426852 - not so much! In mid-December it started acting up, spitting out more invalids than I had seen before, and by New Year's the motherboard was dead: no lights lit and it would not boot. New MB, CPU and memory, and I was hopeful that this would cure the invalids. Didn't happen - the new parts only produced invalids at an alarming rate: it was running 8 WUs at a time and trashing most of them. I shut that box down and have only run a limited number of WUs since, just to test the waters.

Since Jan 18th there have only been three invalids. I have disconnected my video card and have a new GTX660 and a new PSU on order. (I live up in the Rocky Mountains and Denver is a 5-hour round trip - FedEx is the better choice.)

I just wanted you all to know that this problem lives outside of 8.1 and CUDA. Also, I am not the only one using Linux with this problem.
ID: 1467420
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1467442 - Posted: 22 Jan 2014, 20:28:48 UTC - in response to Message 1467157.  

I suspect it's related to this;

http://setiathome.berkeley.edu/forum_thread.php?id=73765&postid=1465130#1465130

I suppose, but v7 had been running for a month before the instant Invalids started to show up, at least on my machines. Perhaps they were still "tweaking" the validators, and finally just went one tweak too far. :-)

See post 46375 at SETI Beta and followups for some history of that final tweak. The source code change was checked in June 19. It's likely the change was implemented here during a Tuesday outage; June 25, July 2, or July 9.

Cutting off users who had changed their app_info.xml files so they were processing v7 tasks with apps unable to do the Autocorrelation search was really needed. But it would have been good if someone had realized the extent of BOINC's tendency to not include the full stderr.txt then, so more complete detection of overflow results could have been implemented.
                                                                  Joe

Ah HA! I think a light bulb just came on!

Joe, in an earlier post, you had explained (I think) how even though a task might have a truncated Stderr, it would still be validated if the Spike count equaled 30. And all of my tasks which met those conditions did indeed validate. On the other hand, most of the tasks with truncated Stderr and Spike counts of less than 30 also still validated, but some didn't, the ones that triggered this thread. According to the Raistmer post that you just referenced, his solution to the v6 apps running v7 tasks was to invalidate them "if it has no best autocorrelation result in it" and Eric quickly responded that he'd implemented that solution in Beta. But I think that what he may have actually done is invalidate any results which don't have at least one <autocorr> section rather than a <best_autocorr>. In reviewing both my Invalids and the details of the ones that TBar has posted, I've suddenly come to the realization that all of the Invalids with a Spike count of less than 30 had wingmen who had Autocorr counts of 0. Therefore the result file for the task marked Invalid would not have had any <autocorr> sections. Conversely, I think all of the tasks with truncated Stderr which were successfully validated showed an Autocorr count of at least 1 in the wingman's Stderr.

I would think this assumption should be easy enough to verify by someone who can look at (and understand - which leaves me out) that code change in the validator.

Edit: And perhaps to clarify one point, none of the overflow tasks would have the <best_autocorr> section, but they would have an <autocorr> if at least one was found.

That's a good catch. Either an <autocorr> or a <best_autocorr> sets the 'found_best_autocorr' boolean true in the validation logic. Either proves that autocorr processing has been done so a real S@H v7 app was used, and further checks are not needed. The name of the boolean is perhaps confusing but not actually incorrect, the science application will have stored a best_autocorr internally on the first autocorr search, and possibly updated it many times.
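In sketch form, that check might look like this (hypothetical names and structures; the real validator code differs):

```cpp
#include <string>
#include <vector>

// Hypothetical sketch of the check Joe describes: any <autocorr> or
// <best_autocorr> section in the result proves a real v7 app ran the task.
struct Signal { std::string type; };

bool found_best_autocorr(const std::vector<Signal>& result) {
    for (const auto& s : result)
        if (s.type == "autocorr" || s.type == "best_autocorr")
            return true;   // autocorr processing was done; no further checks
    // No autocorr evidence: either a v6 app, or an overflow result that
    // happened to contain zero autocorr hits (the case discussed above).
    return false;
}
```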
                                                                 Joe
ID: 1467442
Profile Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1467491 - Posted: 22 Jan 2014, 23:19:57 UTC - in response to Message 1467442.  

That's a good catch. Either an <autocorr> or a <best_autocorr> sets the 'found_best_autocorr' boolean true in the validation logic. Either proves that autocorr processing has been done so a real S@H v7 app was used, and further checks are not needed. The name of the boolean is perhaps confusing but not actually incorrect, the science application will have stored a best_autocorr internally on the first autocorr search, and possibly updated it many times.
                                                                 Joe

Sure, that makes sense that it would have to check both. Otherwise, if it only checked <autocorr>, every task that didn't have an autocorr hit would be marked invalid, whether or not it was an overflow.

So now the question is, how can this be fixed? It seems to me that, although the change to block v6 apps from processing v7 tasks was certainly necessary to protect the science, the collateral damage from the perhaps somewhat ham-handed approach, however unintentional it was, has gone on long enough.

First of all, is this "block" still necessary at all, now that we're nearly 8 months into v7? Are there still nefarious users out there who continue to try to run v7 tasks through v6 apps after receiving nothing but Invalid results for all this time? Could that code now be rolled back?

If that's not an option, how about this approach. Currently you say that a task is immediately marked Invalid if there's no "result_overflow" on the Stderr and it fails the 30 Spike and "found_best_autocorr" tests. Would it be possible to first verify that neither Stderr for the two tasks being validated contains "result_overflow" before applying the other tests, rather than checking each separately? That way, if either one was an overflow, at least a normal attempt could be made at validation rather than immediately marking the one Invalid, the assumption in this approach being that the Stderr missing "result_overflow" might simply be truncated. I don't really see where that approach would let any "bad actors" through. Certainly if only one Stderr contains "result_overflow", it might be a false overflow from a runaway rig, but in that case it's highly unlikely to validate anyway. This could still leave a highly unusual situation where 2 tasks with truncated Stderr might be reported for the same WU, and both would still be marked Invalid, but so far nobody's mentioned that planetary alignment occurring yet (though it may very well have happened somewhere in the universe).

I'm sure there could very well be other and better ways to fix this, too, but my point is that it's high time it gets fixed....somehow!
ID: 1467491
Philhnnss
Volunteer tester

Joined: 22 Feb 08
Posts: 63
Credit: 30,694,327
RAC: 162
United States
Message 1467507 - Posted: 23 Jan 2014, 0:12:08 UTC

Probably not the right place to ask this but can somebody give me a "SIMPLE"
reason for this error?

finish file present too long

It comes up on my 64 bit XP machine with the two 450's every so often.

Like here;

http://setiathome.berkeley.edu/result.php?resultid=3346298071
ID: 1467507
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1467532 - Posted: 23 Jan 2014, 2:42:31 UTC - in response to Message 1467507.  
Last modified: 23 Jan 2014, 2:44:54 UTC

Probably not the right place to ask this but can somebody give me a "SIMPLE"
reason for this error?

finish file present too long

It comes up on my 64 bit XP machine with the two 450's every so often.

Like here;

http://setiathome.berkeley.edu/result.php?resultid=3346298071



Well, 'SIMPLE' is of course relative; it's complex, but probably closely related to the file problems here, and Boinc's aggressive process management strategies.

In short, as simple as I can get while still being complete, Windows likes to take its time to do things in favour of desktop and foreground application performance. The Boinc client has expectations of more immediate response (10 seconds in the finished file case).

You already have set ABOVE_NORMAL priority, which should help. Since OS garbage collection is somewhat influenced by the buffered IO chain as well, you might also like to try my special commit mode test build ( http://jgopt.org/download.html )

I'd really be happy if this trial workaround (!) effectively bypasses that old 'finished file' chestnut as well, as pushing a general fix that addresses multiple long-standing issues would be easier.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1467532
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1467536 - Posted: 23 Jan 2014, 2:57:53 UTC - in response to Message 1467420.  
Last modified: 23 Jan 2014, 2:58:27 UTC

...I just wanted you all to know that this problem lives outside of 8.1 and cuda. Also I am not the only one using linux with this problem.


I expect that this will rise. That's because a lot of the user experience optimisations need to be embraced for OSes that support mobile devices etc. AFAIK that probably includes Debian/Ubuntu and Android.

Whether or not it's exactly the same mechanism, or the details of the problem differ, there is a tendency in the client and api code toward 1989/90 ANSI C code, despite liberal use of C++ and streams (later ISO standards). In a sense that means that these weird (and other) behaviours arise from using/mixing an outdated programming model, especially when dealing with multithreaded runtimes and also GPUs.

I'm not sure if the exact root causes would be the same, but similar seems likely. As I derive a generalised Windows fix for boincapi etc, I'll scout around for Linux equivalents, on the presumption that at least some distributions have moved to multithreaded runtimes and buffered IO (also seems likely with mobile taking the stage).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1467536
Philhnnss
Volunteer tester

Joined: 22 Feb 08
Posts: 63
Credit: 30,694,327
RAC: 162
United States
Message 1467537 - Posted: 23 Jan 2014, 2:57:59 UTC

Thank you sir!!! I guess I should have been more specific in my simple request.
Simpleton language would probably have been better, LOL!!!


That computer spat out a bunch of validation inconclusives that turned into
errors back on 11-28. I see I still have a few of those in queue. But
I have seen that finish file message at least 3 times lately.

If you say that program will help, I will certainly try it. Is it pretty much
self explanatory after I open it?
ID: 1467537
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1467539 - Posted: 23 Jan 2014, 3:00:58 UTC - in response to Message 1467537.  
Last modified: 23 Jan 2014, 3:02:15 UTC

If you say that program will help, I will certainly try it. Is it pretty much
self explanatory after I open it?



Should be, but sing out if needed.
- simply extract the exe file,
- put in project folder,
- update app_info.xml <file_info> and <file_ref>.
- restart Boinc client

Whether or not it works around that specific issue for your host, I don't know for sure, but there is a good chance that it *might*, and it would be good to know if it does or doesn't.
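For reference, the step-3 edit adds entries of roughly this shape to app_info.xml (the filename here is a placeholder, not the actual test build's name):

```xml
<file_info>
    <name>setiathome_test_build.exe</name>
    <executable/>
</file_info>
<app_version>
    <app_name>setiathome_v7</app_name>
    <file_ref>
        <file_name>setiathome_test_build.exe</file_name>
        <main_program/>
    </file_ref>
</app_version>
```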
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1467539
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1467544 - Posted: 23 Jan 2014, 4:09:31 UTC - in response to Message 1467491.  

...
First of all, is this "block" still necessary at all, now that we're nearly 8 months into v7? Are there still nefarious users out there who continue to try to run v7 tasks through v6 apps after receiving nothing but Invalid results for all this time? Could that code now be rolled back?

I reluctantly judge that it should be retained. Some people operate on the basis that whatever they can get away with is good, and they might well go back to a v6 app if doing so would boost their RAC.

If that's not an option, how about this approach. Currently you say that a task is immediately marked Invalid if there's no "result_overflow" on the Stderr and it fails the 30 Spike and "found_best_autocorr" tests.
...

The 30 spike test is:

if (num_spikes>=30) is_overflow=true;

Simply changing that to:

if (num_signals>=30) is_overflow=true;

would be the simplest fix for the problem; that num_signals count is already there and used in some server log entries. But that hard-coded 30 is an invitation to trouble in the future; the limit ought to be parsed from the <max_signals> element of the header. Alternatively, is_overflow could be set true if there's no best_gaussian, as I noted in an earlier post.
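A sketch of that suggestion (variable names follow the post; the surrounding validator code and XML parsing are simplified):

```cpp
#include <string>

// Hypothetical helper: read the limit from the result header's
// <max_signals> element, falling back to 30 if it's absent.
int parse_max_signals(const std::string& header) {
    const std::string open_tag = "<max_signals>";
    auto pos = header.find(open_tag);
    if (pos == std::string::npos) return 30;   // old hard-coded limit
    return std::stoi(header.substr(pos + open_tag.size()));
}

// Joe's suggested test: count all signals, not just spikes, so an overflow
// result that happens to hold fewer than 30 spikes is still recognised.
bool overflow_test(int num_signals, int max_signals) {
    return num_signals >= max_signals;
}
```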

The suspicion from other threads that BOINC is sometimes truncating result files may be true. If so the fix wouldn't be totally reliable, but both stderr and the result file would have to be truncated to have the validator think an actual overflow was full processing.
                                                                  Joe
ID: 1467544
Profile Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1467553 - Posted: 23 Jan 2014, 5:01:31 UTC - in response to Message 1467544.  

...
First of all, is this "block" still necessary at all, now that we're nearly 8 months into v7? Are there still nefarious users out there who continue to try to run v7 tasks through v6 apps after receiving nothing but Invalid results for all this time? Could that code now be rolled back?

I reluctantly judge that it should be retained. Some people operate on the basis that whatever they can get away with is good, and they might well go back to a v6 app if doing so would boost their RAC.

Sadly, you're probably right about that.

If that's not an option, how about this approach. Currently you say that a task is immediately marked Invalid if there's no "result_overflow" on the Stderr and it fails the 30 Spike and "found_best_autocorr" tests.
...

The 30 spike test is:

if (num_spikes>=30) is_overflow=true;

Simply changing that to:

if (num_signals>=30) is_overflow=true;

would be the simplest fix for the problem; that num_signals count is already there and used in some server log entries. But that hard-coded 30 is an invitation to trouble in the future; the limit ought to be parsed from the <max_signals> element of the header. Alternatively, is_overflow could be set true if there's no best_gaussian, as I noted in an earlier post.

That sounds like a perfectly reasonable and simple fix to me. Certainly simpler than engineering a fix from the "truncated Stderr" side of the equation (which may be a BOINC issue but which appears to impact all types of S@H tasks, not just the Cuda ones that we've primarily been focused on). But I guess actually implementing a fix is the hard part! :^)

The suspicion from other threads that BOINC is sometimes truncating result files may be true. If so the fix wouldn't be totally reliable, but both stderr and the result file would have to be truncated to have the validator think an actual overflow was full processing.
                                                                  Joe

Well, for purposes of trying to help troubleshoot the problem specific to this thread, I've been running BoincLogX for several days now (on 3 machines, at present), and as far as I know, it hasn't captured any truncated result files. On the other hand, now that I'm scanning all my Stderr output, I'm seeing about half a dozen truncated Stderrs every day, and they seem to occur under every OS and on CPU, NVIDIA, and ATI tasks. Heck, I even found one from my old P4 laptop which only manages to run one or two tasks a day on its CPU!
ID: 1467553
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1467567 - Posted: 23 Jan 2014, 5:23:46 UTC - in response to Message 1467553.  
Last modified: 23 Jan 2014, 5:28:32 UTC

The suspicion from other threads that BOINC is sometimes truncating result files may be true. If so the fix wouldn't be totally reliable, but both stderr and the result file would have to be truncated to have the validator think an actual overflow was full processing.
                                                                  Joe

Well, for purposes of trying to help troubleshoot the problem specific to this thread, I've been running BoincLogX for several days now (on 3 machines, at present), and as far as I know, it hasn't captured any truncated result files. On the other hand, now that I'm scanning all my Stderr output, I'm seeing about half a dozen truncated Stderrs every day, and they seem to occur under every OS and on CPU, NVIDIA, and ATI tasks. Heck, I even found one from my old P4 laptop which only manages to run one or two tasks a day on its CPU!


Yep, the order in which things happen places the chances of damaged result files at a far lower probability. Depending on the specific parameters that invoke the particular stderr symptom(s), though, that may well not be enough. Either way, app/boincapi side I'll treat the cause holistically, hoping to reduce those probabilities to at or near zero.

Why I feel these symptoms have surfaced, though they probably existed to some extent for a long time (or always), is a combination of hardware performance evolution and application improvement. For example, the act of mitigating the relatively common -12 triplet errors allows a new mole to surface to be whacked. I'd like to reach a point of heavy app (incl. boincapi) side robustification, such that we can be certain the validation logic, improved or otherwise, has less garbage to deal with. Allowing filtering of results on task pages certainly raised awareness of issues too.

In general I think that as the technology (hardware and applications) get faster, then the chances for other new weird behaviours can crop up, solely as a function of the numbers.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1467567
djmotiska

Joined: 26 Jul 01
Posts: 20
Credit: 29,378,647
RAC: 105
Finland
Message 1468106 - Posted: 24 Jan 2014, 9:59:28 UTC
Last modified: 24 Jan 2014, 10:08:36 UTC

I've experienced this same behavior for some time. I upgraded to Boinc 7.2.33 last week to get my Radeon running better. I've monitored this thread among some others and noticed I too got empty stderrs, though none of them were invalidated. However, while I was tweaking the OpenCL app I noticed one recently finished result had an empty stderr and had a look at the slot directories. The one the OpenCL app had used was empty, except there was stderr.txt!

So I guess this bug is somehow related to write caching etc as Jason has suspected.
ID: 1468106
Richard Haselgrove (Project Donor)
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1468129 - Posted: 24 Jan 2014, 11:24:08 UTC - in response to Message 1468106.  
Last modified: 24 Jan 2014, 11:30:28 UTC

I've experienced this same behavior for some time. I upgraded to Boinc 7.2.33 last week to get my Radeon running better. I've monitored this thread among some others and noticed I too got empty stderrs, though none of them were invalidated. However, while I was tweaking the OpenCL app I noticed one recently finished result had an empty stderr and had a look at the slot directories. The one the OpenCL app had used was empty, except there was stderr.txt!

So I guess this bug is somehow related to write caching etc as Jason has suspected.

If you still have that file in the slot, could you see what its creation time is, and compare it with the BOINC event log timings? I would imagine we're talking seconds at most here, so you might need to open the properties dialog for the file to get the timings in seconds. I think it would be interesting, in general terms, to get an idea how much longer BOINC should wait before doing the cleanup.

PS - just called up the properties for stderr.txt for slot 0 - which is where my cuda apps tend to run. It's saying

Created: 11 December 2013 14:02:40
Modified: 24 January 2014 11:11:03
Accessed: 24 January 2014 11:10:55

I'm going to have to think about that for a bit!

Edit - I suspended that task briefly, and let a new one start in slot 5, previously empty. That stderr has consistent dates for all three fields. I'll watch what happens when they each finish. This all with Win 7/64bit on a laptop.
ID: 1468129
djmotiska

Joined: 26 Jul 01
Posts: 20
Credit: 29,378,647
RAC: 105
Finland
Message 1468143 - Posted: 24 Jan 2014, 12:06:53 UTC - in response to Message 1468129.  

I checked the file properties and the modification time matched the result; it was actually last modified 2 seconds before the task completed. The contents of the file were intact and complete. The creation time, however, was 1 day earlier, so maybe there have been similar problems in this slot directory before, and every time a task is started in this directory the file is overwritten but not deleted.
ID: 1468143
Richard Haselgrove (Project Donor)
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1468145 - Posted: 24 Jan 2014, 12:10:50 UTC - in response to Message 1468143.  

Thanks. I think we'll need Jason, or somebody with that sort of detailed internal knowledge of Windows, to tell us whether the properties dialog displays the time when the change was committed to the write cache, or the time when the magnetic domains on the hard disk platter were flipped.
ID: 1468145
Richard Haselgrove (Project Donor)
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1468164 - Posted: 24 Jan 2014, 12:44:12 UTC
Last modified: 24 Jan 2014, 12:45:25 UTC

OK, I've just watched slot 0 while a task (the one I paused briefly) exited and a new one replaced it.

Within the time resolution available with Windows Explorer, there was no point where the slot directory was visibly empty - BOINC created the files (symlinks) for the new task immediately after the old one finished. Which is what we would want.

The file 'stderr.txt' still has a creation date of 11 December, but all the text for the exiting task was collected and transferred to client_state.xml

(and now visible at task 3349319391)
ID: 1468164
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1468167 - Posted: 24 Jan 2014, 12:49:21 UTC - in response to Message 1468145.  
Last modified: 24 Jan 2014, 13:11:24 UTC

Thanks. I think we'll need Jason, or somebody with that sort of detailed internal knowledge of Windows, to tell us whether the properties dialog displays the time when the change was committed to the write cache, or the time when the magnetic domains on the hard disk platter were flipped.


The timestamp will be 'written' (to an intermediate buffer) on flush sometime prior to commit to disk, which in turn would be one or more driver and firmware layers above where the HDD LED turns on. Below the flush (C-Runtime) level it'll pretty much look like transactions with a bunch of ones and zeros. How many layers there depends on the drivers, whether RAID etc, firmware, and hardware caching (including on the HDD itself). It could even be a network appliance of some sort, or a network share with more other layers involved, including network, encryption services etc.

The problem is that boincapi has commit to disk turned off for the file(s), while by design it expects 'old-school' commit behaviour. Buffered IO is the default when building programs on Windows (since about 2003, I think) for performance reasons, which uses multithreaded libraries and driver agents in user mode etc.

This implies Windows can commit it whenever it 'feels' like, which could be more or less immediately, or next Tuesday, possibly well after Boinc silently called TerminateProcess() on the app, effectively cancelling/corrupting the user mode driver helper threads that marshal the write.

[Edit:] while my commit mode workaround gets around these issues, it's far from an ideal solution. The 'correct' way to do things on modern Windows would be:

- Just like never kill a process in task manager GUI unless you have to, never kill a process with TerminateProcess() unless you have to ... it's the code form of the same thing... expect fallout, it's brutal.

- Use synchronisation / mutex primitives for controlling state/actions, rather than synchronising to files on the filesystem.

- Use communication mechanisms for 'asking' and 'negotiating' with OS and applications, instead of 'commanding'. Issuing imperative orders on systems stressed by your own (boinc client) doing is likely to end in tears.

You don't drive your car like this: "Car I order you to drive at 100mph then stop immediately in wet road conditions...[car points at nearest pole].. too late, self destruct..." You get the idea.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1468167