Message boards : Number crunching : Strange Invalid MB Overflow tasks with truncated Stderr outputs...
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874
Compare with a 30-spike overflow from last night: result 3336479319. That stderr should continue something like this:

re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes
re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes
Thread call stack limit is: 1k
cudaAcc_free() called...
cudaAcc_free() running...
cudaAcc_free() PulseFind freed...
cudaAcc_free() Gaussfit freed...
cudaAcc_free() AutoCorrelation freed...
cudaAcc_free() DONE.
Cuda sync'd & freed.
Preemptively acknowledging a safe Exit. ->
SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected equals the storage space allocated.
Flopcounter: 18770723446.872066
Spike count: 30
Autocorr count: 0
Pulse count: 0
Triplet count: 0
Gaussian count: 0
Worker preemptively acknowledging an overflow exit.->
called boinc_finish
Exit Status: 0
boinc_exit(): requesting safe worker shutdown ->
boinc_exit(): received safe worker shutdown acknowledge ->
Cuda threadsafe ExitProcess() initiated, rval 0

Even with my task overflowing after 4.11 seconds (and running on a shared - 2-up - GPU), it had time to write all of that, and to write a complete, validatable result file as well.

One odd thing we'll have to ask Jason about. Without knowing the code details, I'd have thought "Thread call stack limit is: 1k" was part of the initial start-up process, and "cudaAcc_free() called..." the start of the clean-up at the end. Why doesn't the truncation occur between those two lines, if it's timing related?
Josef W. Segur · Joined: 30 Oct 99 · Posts: 4504 · Credit: 1,414,761 · RAC: 0
For S@H v7, the stderr.txt DOES play a role in validation. The validator checks whether "result_overflow" is in stderr. If it isn't there, the task is assumed to have run full processing and will have a best_autocorr signal if processed by a v7 application.

Truncation of the stderr means the "result_overflow" is not there. But for an actual overflow condition, there is no best_autocorr in the result file, and the Validator concludes that the result was produced by a S@H v6 application and immediately marks it as invalid on that basis.

Improving BOINC's handling of stderr.txt is not likely to happen very quickly. Perhaps the S@H Validator logic can be modified to detect overflowed tasks from the actual uploaded file rather than relying on the reported stderr.txt content. There's a best_gaussian in the result file for a complete run (though it's full of zeroes for VHAR and VLAR tasks), but an overflowed task has no best_* signals at all.
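To spell that out, here is a minimal sketch of the decision as I've described it - my paraphrase, not the code from sah_validate.cpp, and the names are made up for illustration:

```cpp
#include <iostream>
#include <string>

// Inputs as described above: the reported stderr text, and whether the
// uploaded result file actually contains a best_autocorr signal.
bool passes_v7_check(const std::string& stderr_txt, bool has_best_autocorr) {
    bool says_overflow = stderr_txt.find("result_overflow") != std::string::npos;

    if (says_overflow)
        return true;          // genuine overflow: no best_* signals are expected

    // No overflow reported, so a full v7 run is assumed, and a v7 result file
    // should carry a best_autocorr. Without it, the result is treated as
    // coming from a v6 application and marked invalid.
    return has_best_autocorr;
}

int main() {
    // A truncated stderr loses "result_overflow", and an overflowed result
    // file has no best_autocorr, so it falls through to invalid (0).
    std::cout << passes_v7_check("", false) << '\n';                    // 0
    std::cout << passes_v7_check("-9 result_overflow", false) << '\n';  // 1
    std::cout << passes_v7_check("", true) << '\n';                     // 1
}
```

Joe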
Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0
> Which leaves us with the most likely explanation being marginal or failing user hardware. That can be tested (as, again, I think I've said before). If only someone experiencing a non-zero incidence of these failures would carry out the experiment...

I understand what you're asking, Richard, but think it might be impractical. If these Invalids were happening more frequently and predictably, say once a day on a single machine, it might be feasible to try this. But these things are completely unpredictable and might happen on any machine at any time. Perhaps if we could narrow it down somehow....

Here's what my records have shown, by host, since I first got startled by one of these Invalids with the truncated Stderr last summer:

6949656: December 2
6912878: July 31; October 16; January 4
6979886: September 17; October 1, 24; January 3
6980751: July 2, 20; August 13; September 7, 20, 25; October 18, 30; November 14, 23; December 2, 4, 14, 15, 20; January 6
7057115: August 23; September 11; October 28; December 14, 19; January 2

I only have the actual truncated Stderr files for these tasks in my archives back to September, and only started recording the counts that the wingmen were getting in late November, when I thought I was starting to see a pattern. Obviously not very frequent, and not very predictable. Also, I think, not a sign of failing or marginal hardware.
Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0
> Why doesn't the truncation occur between those two lines, if it's timing related?

In some of them it does seem to occur there, but not all. If you look at the various examples I've posted so far, I think you'll find at least 3 different truncation points, one of which I guess could be called Line 0, where there's nothing at all between the <stderr_txt> tags.
Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0
> For S@H v7, the stderr.txt DOES play a role in validation. The validator checks whether "result_overflow" is in stderr. If it isn't there, the task is assumed to have run full processing and will have a best_autocorr signal if processed by a v7 application.

Are you sure about that? I ask because I think most of mine with the truncated Stderr do, in fact, get validated even without result_overflow appearing in the Stderr. For instance, see my examples in Message 1464457 and Message 1464659, both of which were successfully validated. It's only the occasional one which gets marked Invalid.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874
> For S@H v7, the stderr.txt DOES play a role in validation. The validator checks whether "result_overflow" is in stderr. If it isn't there, the task is assumed to have run full processing and will have a best_autocorr signal if processed by a v7 application.

Thanks Joe. I was aware that stderr.txt was 'read' by the validator, and parsed for the presence of "result_overflow" - but I only knew of two uses for that data:

1) to set the 'outlier' flag, so the task runtime isn't used to update the application details averages
2) to compile the current 'overflow rate' on the science status page. Does 7.0% feel high to anyone?

Neither of those affects the validation outcome of the individual task, so I simplified them out of my reply. If it's used as an internal cross-check against the uploaded result file, to catch out v6 apps, I'll have to re-write my internal script for the next time I have to post about it...
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874
> For S@H v7, the stderr.txt DOES play a role in validation. The validator checks whether "result_overflow" is in stderr. If it isn't there, the task is assumed to have run full processing and will have a best_autocorr signal if processed by a v7 application.

Which, unfortunately, reinforces my point about needing to see the contents of the actual result files to get to the bottom of this. Pity, but I don't see how else we can do it.
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
> For S@H v7, the stderr.txt DOES play a role in validation. The validator checks whether "result_overflow" is in stderr. If it isn't there, the task is assumed to have run full processing and will have a best_autocorr signal if processed by a v7 application.

This task is Completed and Pending. Shouldn't someone be able to access the Result File? I fully expect this pending task to fail as soon as the wingman reports;
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874
> This task is Completed and Pending. Shouldn't someone be able to access the Result File?

No. In this state, we can access the data file sent out at the beginning of the task, so we can re-run the job on a known good machine, or on different hardware, or with a different application - lots of ways to generate other result files to compare with the one on the medical examiner's table. But result files, once uploaded, aren't accessible to anyone outside the lab. That has to be captured from the wild.
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
> But result files, once uploaded, aren't accessible to anyone outside the lab.

That's the someone I had in mind. I'm sure someone in the lab has heard of this thread by now.
Josef W. Segur · Joined: 30 Oct 99 · Posts: 4504 · Credit: 1,414,761 · RAC: 0
> For S@H v7, the stderr.txt DOES play a role in validation. The validator checks whether "result_overflow" is in stderr. If it isn't there, the task is assumed to have run full processing and will have a best_autocorr signal if processed by a v7 application.

I'm sure the logic I described is what's in sah_validate.cpp in the project's repository, and I think it accounts for the cases where a good overflowed result is being marked invalid. But it's certainly a puzzle why some cases do get validated. Possibilities include the "Task details" pages not actually showing all the received stderr, or the project running some validators with different code. Neither of those seems very likely.

Joe
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874
> For S@H v7, the stderr.txt DOES play a role in validation. The validator checks whether "result_overflow" is in stderr. If it isn't there, the task is assumed to have run full processing and will have a best_autocorr signal if processed by a v7 application.

Joe,

If you have the chance, I think it might be a good idea if you could run a sanity check over that code - your C++ is a good deal better than mine. Although my SVN checkout shows 16 copies of sah_validate.cpp scattered throughout the repository (including several in our optimised branches), the live one appears to be r1877, updated 19 June 2013 by Eric specifically to catch the v6/v7 cross-validation case.

I'm being confused by the two different structures, RESULT and SAH_RESULT. SAH_RESULT has three separate values which seem relevant:

bool overflow;
bool found_best_autocorr;
bool is_overflow;

overflow appears not to be used. is_overflow is populated from the result file, but only for 30+ spikes, not for 30+ signals overall.

The RESULT structure has a field 'runtime_outlier', which is populated from stderr.txt and seems to perform a similar function to 'is_overflow': the same test sets and adds a RESULT_FLAG_OVERFLOW to the RESULT.opaque field, but I can't find where, whether or how that's used.

And there I'm stuck. Over to you?
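For discussion, this is roughly how I'm reading that population step - a paraphrase only, with assumed values and a made-up helper, not the text of r1877:

```cpp
#include <iostream>

const int RESULT_FLAG_OVERFLOW = 1;   // assumed value, for illustration only

struct SAH_RESULT {
    bool overflow = false;            // appears not to be used
    bool found_best_autocorr = false; // true if the result file has a best_autocorr
    bool is_overflow = false;         // set from the result file contents
};

struct RESULT {
    bool runtime_outlier = false;     // set from stderr.txt
    int  opaque = 0;                  // RESULT_FLAG_OVERFLOW gets added in here
};

// Hypothetical helper showing what I think the two tests do.
void populate(SAH_RESULT& s, RESULT& r, int spike_count, bool stderr_says_overflow) {
    // From the result file: only a spike overflow counts, not 30+ signals overall.
    if (spike_count >= 30)
        s.is_overflow = true;

    // From stderr.txt: the same "result_overflow" test that sets the outlier
    // flag also adds the overflow flag to the opaque field.
    if (stderr_says_overflow) {
        r.runtime_outlier = true;
        r.opaque |= RESULT_FLAG_OVERFLOW;
    }
}

int main() {
    SAH_RESULT s; RESULT r;
    populate(s, r, 30, false);   // a 30-spike overflow with a truncated stderr
    std::cout << s.is_overflow << ' ' << r.runtime_outlier << '\n';   // prints: 1 0
}
```

If that reading is wrong, it's exactly the sort of thing I'd like checked.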
Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0
> Which, unfortunately, reinforces my point about needing to see the contents of the actual result files to get to the bottom of this. Pity, but I don't see how else we can do it.

I'm probably going to step way out of my depth here, but isn't there some way to identify an event trigger that could be used to run a scheduled task each time a S@H GPU task terminates, that could simply make a copy of each result file before it gets uploaded? My knowledge and use of scheduled tasks is currently limited to just a few run by TOD triggers, but it seems to me that there are a multitude of event triggers available to hook, as well.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874
> Which, unfortunately, reinforces my point about needing to see the contents of the actual result files to get to the bottom of this. Pity, but I don't see how else we can do it.

BoincLogX can do exactly that. Unfortunately, my copy stopped working when I upgraded the machine it was running on from Vista to Windows 7: I suspect a permissions or UAC problem, but I never used it enough to make it worth while finding out exactly what the problem was, and fixing it.

One thing I did find while I was running it under Vista was that you have to set a very aggressive capture interval to capture the completed file in the short interval between the last/best signals being written, and the file being deleted when uploading is complete. You have maybe 5 seconds maximum. And at that sort of capture interval, BoincLogX itself uses quite a lot of CPU time - possibly aggravating the problem we're trying to solve. But it might be worth a look again.
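If BoincLogX won't behave, even a crude home-made polling copier would do much the same job. A minimal sketch of the idea - the directory paths and the "result.sah" file name here are assumptions on my part, not checked against a live BOINC install:

```cpp
#include <chrono>
#include <filesystem>
#include <string>
#include <system_error>
#include <thread>

namespace fs = std::filesystem;

int main() {
    const fs::path slots   = "C:/ProgramData/BOINC/slots";   // assumed BOINC data dir
    const fs::path archive = "C:/ResultArchive";              // where the copies go
    fs::create_directories(archive);

    // Poll frequently: the finished result file may only exist for a few
    // seconds before it is uploaded and cleaned up.
    for (;;) {
        for (const auto& slot : fs::directory_iterator(slots)) {
            if (!slot.is_directory()) continue;
            for (const auto& f : fs::directory_iterator(slot.path())) {
                // "result.sah" is an assumption about the output file's name.
                if (f.path().filename() != "result.sah") continue;
                auto stamp = fs::last_write_time(f.path()).time_since_epoch().count();
                fs::path dest = archive / (slot.path().filename().string() + "_" +
                                           std::to_string(stamp) + ".sah");
                std::error_code ec;
                fs::copy_file(f.path(), dest, fs::copy_options::skip_existing, ec);
            }
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}
```

At a one-second interval it will cost some CPU, and it only catches files that survive long enough to be seen on a pass, but it needs nothing beyond a compiler.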
Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0
> Which, unfortunately, reinforces my point about needing to see the contents of the actual result files to get to the bottom of this. Pity, but I don't see how else we can do it.

Okay, thanks Richard. I'll take a look at it. Seems to me I tried it briefly way back last spring, shortly after I'd rejoined the project, but decided it didn't do much for me at the time. Perhaps in this specific situation, it might.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874
> - wait for v7.2.38

v7.2.38 appeared in the download directory about 20 minutes ago. I've installed it (64-bit version) on just one host so far, and it's running without having trashed anything yet: but anyone who follows the links below is running very much at their own risk. This version is untested.

http://boinc.berkeley.edu/dl/boinc_7.2.38_windows_x86_64.exe
http://boinc.berkeley.edu/dl/boinc_7.2.38_windows_intelx86.exe
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
Here's a couple of new short overflows from the 7.0.64 host. Both validated; one had a normal Stderr, the other was truncated.

http://setiathome.berkeley.edu/result.php?resultid=3337831865
http://setiathome.berkeley.edu/result.php?resultid=3337831894

Both had spike counts of 30.
Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
> ... One odd thing we'll have to ask Jason about. Without knowing the code details, I'd have thought

'Timing related' would be a tenuous description on a non-realtime multithreaded OS using buffered IO. Logically, if the issue never appears between those lines, it could simply be that the application hasn't exited yet, or 'whatever' triggers the event hasn't occurred yet.

My suspicion is that the triggering event appears after the boinc finished file is created and detected by the client - so after the files have been written and flushed, but prior to Windows IO completion and garbage collection. There are numerous reasons that cleanup can be delayed for a very long time, just one being buffered IO in flight while running under high system contention (such as a machine in full overcommit, heavy hard drive use etc.), another being desktop responsiveness optimisations unique to Windows.

For some mystical Boinc logic reason in the current code, after the client detects the exited task and goes into handle_exited_app() (or whatever), it runs a TerminateProcess() on the app, as though it doesn't believe the app actually exited. Using that is bad juju, a last resort for when every other option fails. In a sense, that skepticism is well placed, because applications go through a whole cycle of events well after leaving application control, those events being C-runtime, system and OS determined.

I have a way to test one possible mechanism, somewhat related to Matt's forced commits, but more complete: produce an x41zc build with commode.obj linked in, which disables buffered IO. If the affected systems can try that and no longer manifest the same symptoms, then we know Boinc's logic is fighting against normal OS functionality. It can have different symptoms too, due to other possible problem areas, but any behaviour change is an indicator of messing in the right area.

From my perspective, one sign for the bone-pointing scenario is that I've never seen truncated stderr under standalone bench, and indeed Claggy posted some affected-task bench reprocesses a while back. That points to this being some function of running under an active Boinc client, as opposed to an application-side issue.

Disabling buffered IO, effectively forcing a commit after every single file fflush() in the app, is of course non-ideal and not desirable, but it would be sufficient to point the bone straight at Boinc's app termination sequence (once again), or put it off the radar (iff identical symptoms reappear). With or without that particular smoking gun of a crippled test app not replicating the fault(s), there are bizarre logic things going on there. These can sometimes be signs that the author doesn't know how the OS works, and expects it to work in certain ways, perhaps unrealistically.

Another hot day today, but while the wrangling over looking at (non-)intact result files takes place, and things cool down at night, I can look at that. I would need two affected systems to run the suboptimal modified build for as long as it takes to either re-encounter the same set of symptoms, different ones, or no reappearance at all, until some level of confidence is achieved that buffered IO on termination is related. Volunteers, please step forward with links to affected hosts and name your Cuda version.
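For anyone wondering what that commit behaviour looks like in practice, here is a tiny stand-alone illustration for the Microsoft C runtime - a demo of the mechanism only, not a patch to x41zc, and the file name is made up for the demo:

```cpp
#include <stdio.h>
#include <io.h>      // _commit() - MSVC-specific

int main(void) {
    // The MSVC CRT accepts a "c" (commit) flag in the fopen mode string:
    // with it, fflush() writes the buffer through to disk rather than just
    // handing it to the system cache. Linking commode.obj flips the global
    // default so every fopen() behaves this way without code changes.
    FILE* f = fopen("stderr_demo.txt", "wc");
    if (!f) return 1;

    fprintf(f, "Spike count: 30\n");
    fprintf(f, "SETI@Home Informational message -9 result_overflow\n");
    fflush(f);               // with the commit flag, this reaches the disk

    // Belt-and-braces: force the OS to flush this file to disk regardless of
    // how the stream was opened. "Forcing commit after every fflush" amounts
    // to doing this everywhere.
    _commit(_fileno(f));

    fclose(f);
    return 0;
}
```

If the truncation stops appearing with a build behaving like that, the finger points firmly at what happens between boinc_finish and the client's TerminateProcess().

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.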
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
I have two affected machines to test with the x41zc build. Up to this point, the XP host is the one that has had most of the affected short overflows. For some reason, the Windows 8.1 host hasn't been receiving any of the short overflows, which seem to be the ones causing the most problems. The only 'Immediate' Invalid the Win8 machine has had wasn't a short overflow either, and it is still pending - Workunit 1398658616. Both machines can use CUDA 2.3 or 3.2; they are using 2.3 at present. Since downgrading the XP host to 7.0.64 from 7.2.33, all tasks have validated. I was thinking of updating the Win8 host to 7.2.38 a little later tonight. I could use any version necessary for the test.

XP - http://setiathome.berkeley.edu/show_host_detail.php?hostid=6979629
Win 8.1 - http://setiathome.berkeley.edu/show_host_detail.php?hostid=6796475

I spent some more time looking at a couple of the top Linux CUDA hosts' results. So far, I haven't found any truncated Stderr outputs. The second host just reported on the Win8 task, so I guess it will be gone tomorrow night. It would have been nice to see the result file on that task that was labeled Invalid as soon as the first Wingman reported.
Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
Thanks - that looks along the right lines to me so far. Those lines are operating system IO mechanism 'misunderstandings'.

I'll take it that a Cuda 2.3 build would serve your needs in both cases, though if you are running 'stock', the anon platform variable would need to be eliminated by running 'optimised' for some period. The app is identical, but removing variables early seems prudent.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.