Strange Invalid MB Overflow tasks with truncated Stderr outputs...

Message boards : Number crunching : Strange Invalid MB Overflow tasks with truncated Stderr outputs...
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1465125 - Posted: 16 Jan 2014, 18:35:24 UTC - in response to Message 1465112.  

Compare with a 30-spike overflow from last night: result 3336479319. That stderr should continue something like this:

re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes
re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes
Thread call stack limit is: 1k
cudaAcc_free() called...
cudaAcc_free() running...
cudaAcc_free() PulseFind freed...
cudaAcc_free() Gaussfit freed...
cudaAcc_free() AutoCorrelation freed...
cudaAcc_free() DONE.
Cuda sync'd & freed.
Preemptively acknowledging a safe Exit. ->
SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected equals the storage space allocated.

Flopcounter: 18770723446.872066

Spike count:    30
Autocorr count: 0
Pulse count:    0
Triplet count:  0
Gaussian count: 0
Worker preemptively acknowledging an overflow exit.->
called boinc_finish
Exit Status: 0
boinc_exit(): requesting safe worker shutdown ->
boinc_exit(): received safe worker shutdown acknowledge ->
Cuda threadsafe ExitProcess() initiated, rval 0

Even with my task overflowing after 4.11 seconds (and running on a shared, 2-up, GPU), it had time to write all of that, and to write a complete, validatable result file as well.

One odd thing we'll have to ask Jason about. Without knowing the code details, I'd have thought

Thread call stack limit is: 1k

was part of the initial start-up process, and

cudaAcc_free() called...

the start of the clean-up at end. Why doesn't the truncation occur between those two lines, if it's timing related?

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1465130 - Posted: 16 Jan 2014, 18:50:07 UTC

For S@H v7, the stderr.txt DOES play a role in validation. The validator checks whether "result_overflow" is in stderr. If it isn't there, the task is assumed to have run full processing and will have a best_autocorr signal if processed by a v7 application.

Truncation of the stderr means the "result_overflow" is not there. But for an actual overflow condition, there is no best_autocorr in the result file and the Validator concludes that the result was produced by a S@H v6 application and immediately marks it as invalid on that basis.

Improving BOINC's handling of stderr.txt is not likely to happen very quickly. Perhaps the S@H Validator logic can be modified to detect overflowed tasks from the actual uploaded file rather than relying on the reported stderr.txt content. There's a best_gaussian in the result file for a complete run (though it's full of zeroes for VHAR and VLAR tasks), but an overflowed task has no best_* signals at all.
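[In code terms, the check Joe describes might look something like the sketch below. This is an illustrative distillation only, with made-up names, not the actual sah_validate.cpp logic.]

```cpp
#include <cassert>

// Illustrative sketch of the validator decision described above (hypothetical
// names, not the real sah_validate.cpp code):
//   stderr_has_overflow: "result_overflow" was found in the reported stderr.txt
//   has_best_autocorr:   the uploaded result file contains a best_autocorr signal
// A result with neither is taken for v6 output and marked invalid -- which is
// exactly what happens to an overflowed task whose stderr got truncated.
bool accepted_as_v7(bool stderr_has_overflow, bool has_best_autocorr) {
    if (stderr_has_overflow) return true;  // overflow run: no best_* signals expected
    return has_best_autocorr;              // full run: best_autocorr must be present
}
```

On that model, a genuine overflow with an intact stderr passes, a full run with best_autocorr passes, but a truncated-stderr overflow (no "result_overflow" string, no best_* signals) is the one combination that fails.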
                                                                   Joe

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1465131 - Posted: 16 Jan 2014, 18:51:06 UTC - in response to Message 1465041.  

Which leaves us with the most likely explanation being marginal or failing user hardware. That can be tested (as, again, I think I've said before). If only someone experiencing a non-zero incidence of these failures would carry out the experiment...

Collect a decent batch of tasks ready to run. Set BOINC to 'Networking disabled'. Let the tasks run, and get stuck in the 'uploading' state. Copy both the WU files and the matching result files (same name, with added _n_0) from the project folder to somewhere safe. Allow the whole batch to upload and report, fetch new work, rinse and repeat.

Now, go through the whole set of reported tasks. See if you've got an example of the type of problem you're hunting down. If you have, find both the WU file and the matching result file in your 'safe place', and put both files together (but nothing else) into a standard archive format (.zip, .rar, .7z - anything you like, so long as it preserves file contents and attributes exactly). Then ask here who to send it to - people with the tag 'volunteer tester' should be equipped to take it from there, just depends who's available on the day.

I understand what you're asking, Richard, but think it might be impractical. If these Invalids were happening more frequently and predictably, say once a day on a single machine, it might be feasible to try this. But these things are completely unpredictable and might happen on any machine at any time. Perhaps if we could narrow it down somehow....

Here's what my records have shown, by host, since I first got startled by one of these Invalids with the truncated Stderr last summer:

6949656: December 2
6912878: July 31; October 16; January 4
6979886: September 17; October 1, 24; January 3
6980751: July 2, 20; August 13; September 7, 20, 25; October 18, 30; November 14, 23; December 2, 4, 14, 15, 20; January 6
7057115: August 23; September 11; October 28; December 14, 19; January 2

I only have the actual truncated Stderr files for these tasks in my archives back to September and only started recording the counts that the wingmen were getting in late November, when I thought I was starting to see a pattern.

Obviously not very frequent, and not very predictable. Also, I think, not a sign of failing or marginal hardware.

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1465136 - Posted: 16 Jan 2014, 19:01:18 UTC - in response to Message 1465125.  

Why doesn't the truncation occur between those two lines, if it's timing related?

In some of them it does seem to occur there, but not all. If you look at the various examples I've posted so far, I think you'll find at least 3 different truncation points, one of which I guess could be called Line 0, where there's nothing at all between the <stderr_txt> tags.

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1465138 - Posted: 16 Jan 2014, 19:11:11 UTC - in response to Message 1465130.  

For S@H v7, the stderr.txt DOES play a role in validation. The validator checks whether "result_overflow" is in stderr. If it isn't there, the task is assumed to have run full processing and will have a best_autocorr signal if processed by a v7 application.

Truncation of the stderr means the "result_overflow" is not there. But for an actual overflow condition, there is no best_autocorr in the result file and the Validator concludes that the result was produced by a S@H v6 application and immediately marks it as invalid on that basis.

Are you sure about that? I ask because I think most of mine with the truncated Stderr do, in fact, get validated even without result_overflow appearing in the Stderr. For instance, see my examples in Message 1464457 and Message 1464659, both of which were successfully validated. It's only the occasional one which gets marked Invalid.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1465141 - Posted: 16 Jan 2014, 19:16:12 UTC - in response to Message 1465130.  

For S@H v7, the stderr.txt DOES play a role in validation. The validator checks whether "result_overflow" is in stderr. If it isn't there, the task is assumed to have run full processing and will have a best_autocorr signal if processed by a v7 application.

Truncation of the stderr means the "result_overflow" is not there. But for an actual overflow condition, there is no best_autocorr in the result file and the Validator concludes that the result was produced by a S@H v6 application and immediately marks it as invalid on that basis.

Improving BOINC's handling of stderr.txt is not likely to happen very quickly. Perhaps the S@H Validator logic can be modified to detect overflowed tasks from the actual uploaded file rather than relying on the reported stderr.txt content. There's a best_gaussian in the result file for a complete run (though it's full of zeroes for VHAR and VLAR tasks), but an overflowed task has no best_* signals at all.
                                                                   Joe

Thanks Joe. I was aware that stderr.txt was 'read' by the validator, and parsed for the presence of "result_overflow" - but I only knew of two uses for that data.

1) to set the 'outlier' flag, so the task runtime isn't used to update the application details averages
2) to compile the current 'overflow rate' on the science status page. Does 7.0% feel high to anyone?

Neither of those two affects the validation outcome of the individual task, so I simplified them out of my reply. If it's used as an internal cross-check against the uploaded result file, to catch out v6 apps, I'll have to re-write my internal script for the next time I have to post about it...

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1465142 - Posted: 16 Jan 2014, 19:20:11 UTC - in response to Message 1465138.  

For S@H v7, the stderr.txt DOES play a role in validation. The validator checks whether "result_overflow" is in stderr. If it isn't there, the task is assumed to have run full processing and will have a best_autocorr signal if processed by a v7 application.

Truncation of the stderr means the "result_overflow" is not there. But for an actual overflow condition, there is no best_autocorr in the result file and the Validator concludes that the result was produced by a S@H v6 application and immediately marks it as invalid on that basis.

Are you sure about that? I ask because I think most of mine with the truncated Stderr do, in fact, get validated even without result_overflow appearing in the Stderr. For instance, see my examples in Message 1464457 and Message 1464659, both of which were successfully validated. It's only the occasional one which gets marked Invalid.

Which, unfortunately, reinforces my point about needing to see the contents of the actual result files to get to the bottom of this. Pity, but I don't see how else we can do it.

TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1465148 - Posted: 16 Jan 2014, 19:41:06 UTC - in response to Message 1465142.  

For S@H v7, the stderr.txt DOES play a role in validation. The validator checks whether "result_overflow" is in stderr. If it isn't there, the task is assumed to have run full processing and will have a best_autocorr signal if processed by a v7 application.

Truncation of the stderr means the "result_overflow" is not there. But for an actual overflow condition, there is no best_autocorr in the result file and the Validator concludes that the result was produced by a S@H v6 application and immediately marks it as invalid on that basis.

Are you sure about that? I ask because I think most of mine with the truncated Stderr do, in fact, get validated even without result_overflow appearing in the Stderr. For instance, see my examples in Message 1464457 and Message 1464659, both of which were successfully validated. It's only the occasional one which gets marked Invalid.

Which, unfortunately, reinforces my point about needing to see the contents of the actual result files to get to the bottom of this. Pity, but I don't see how else we can do it.

This task is Completed and Pending. Shouldn't someone be able to access the Result File?

I fully expect this pending task to fail as soon as the wingman reports:
Workunit 1402612938

Stderr output

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 1 CUDA device(s):
Device 1: GeForce 8800 GT, 511 MiB, regsPerBlock 8192
computeCap 1.1, multiProcs 14
clockRate = 1620 MHz
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce 8800 GT is okay
SETI@home using CUDA accelerated device GeForce 8800 GT
mbcuda.cfg, processpriority key detected
mbcuda.cfg, Global pfblockspersm key being used for this device
pulsefind: blocks per SM 6
mbcuda.cfg, Global pfperiodsperlaunch key being used for this device
pulsefind: periods per launch 512
Priority of process set to NORMAL successfully
Priority of worker thread set successfully

setiathome enhanced x41zc, Cuda 2.30

Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is : 0.442866
re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes
re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes

</stderr_txt>
]]>

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1465163 - Posted: 16 Jan 2014, 20:03:55 UTC - in response to Message 1465148.  

This task is Completed and Pending. Shouldn't someone be able to access the Result File?

No. In this state, we can access the data file sent out at the beginning of the task, so we can re-run the job on a known good machine, or on different hardware, or with a different application - lots of ways to generate other result files to compare with the one on the medical examiner's table.

But result files, once uploaded, aren't accessible to anyone outside the lab. That has to be captured from the wild.

TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1465167 - Posted: 16 Jan 2014, 20:10:10 UTC - in response to Message 1465163.  

But result files, once uploaded, aren't accessible to anyone outside the lab.

That's the someone I had in mind. I'm sure someone in the lab has heard of this thread by now.

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1465171 - Posted: 16 Jan 2014, 20:29:22 UTC - in response to Message 1465138.  

For S@H v7, the stderr.txt DOES play a role in validation. The validator checks whether "result_overflow" is in stderr. If it isn't there, the task is assumed to have run full processing and will have a best_autocorr signal if processed by a v7 application.

Truncation of the stderr means the "result_overflow" is not there. But for an actual overflow condition, there is no best_autocorr in the result file and the Validator concludes that the result was produced by a S@H v6 application and immediately marks it as invalid on that basis.

Are you sure about that? I ask because I think most of mine with the truncated Stderr do, in fact, get validated even without result_overflow appearing in the Stderr. For instance, see my examples in Message 1464457 and Message 1464659, both of which were successfully validated. It's only the occasional one which gets marked Invalid.

I'm sure the logic I described is what's in the sah_validate.cpp in the project's repository, and I think it accounts for the cases where a good overflowed result is being marked invalid.

But it's certainly a puzzle why some cases do get validated. Possibilities include the "Task details" pages not actually showing all the received stderr, or the project running some validators with different code. Neither of those seems very likely.
                                                                  Joe

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1465201 - Posted: 16 Jan 2014, 21:36:00 UTC - in response to Message 1465171.  

For S@H v7, the stderr.txt DOES play a role in validation. The validator checks whether "result_overflow" is in stderr. If it isn't there, the task is assumed to have run full processing and will have a best_autocorr signal if processed by a v7 application.

Truncation of the stderr means the "result_overflow" is not there. But for an actual overflow condition, there is no best_autocorr in the result file and the Validator concludes that the result was produced by a S@H v6 application and immediately marks it as invalid on that basis.

Are you sure about that? I ask because I think most of mine with the truncated Stderr do, in fact, get validated even without result_overflow appearing in the Stderr. For instance, see my examples in Message 1464457 and Message 1464659, both of which were successfully validated. It's only the occasional one which gets marked Invalid.

I'm sure the logic I described is what's in the sah_validate.cpp in the project's repository, and I think it accounts for the cases where a good overflowed result is being marked invalid.

But it's certainly a puzzle why some cases do get validated. Possibilities include the "Task details" pages not actually showing all the received stderr, or the project running some validators with different code. Neither of those seems very likely.
                                                                  Joe

Joe,

If you have the chance, I think it might be a good idea if you could run a sanity check over that code - your C++ is a good deal better than mine.

Although my SVN checkout shows 16 copies of sah_validate.cpp scattered throughout the repository (including several in our optimised branches), the live one appears to be r1877, updated 19 June 2013 by Eric specifically to catch the v6/v7 cross-validation case.

I'm being confused by the two different structures, RESULT and SAH_RESULT.

SAH_RESULT has three separate values which seem relevant:

    bool overflow;
    bool found_best_autocorr;
    bool is_overflow;

'overflow' appears not to be used.
'is_overflow' is populated from the result file, but only for 30+ spikes, not for 30+ signals overall.

The RESULT structure has a field 'runtime_outlier', which is populated from stderr.txt and seems to perform a similar function to 'is_overflow': the same test sets and adds a 'RESULT_FLAG_OVERFLOW' to the RESULT.opaque field, but I can't find where, whether, or how that's used.

And there I'm stuck. Over to you?
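[For anyone following along, the pattern Richard describes boils down to a flag bit OR'd into a scratch word. The sketch below is a hypothetical reconstruction for illustration: the field layout, the flag's value, and the helper name are all assumed, not taken from the repository.]

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reconstruction of the pattern described above: when stderr
// shows an overflow, the validator marks the task a runtime outlier and also
// ORs a flag into RESULT.opaque. Layout and flag value are assumptions.
const uint32_t RESULT_FLAG_OVERFLOW = 1;  // assumed value, illustrative only

struct RESULT {
    bool     runtime_outlier = false;  // populated from stderr.txt
    uint32_t opaque          = 0;      // server-side scratch word; flags OR'd in
};

// What the test appears to do on seeing "result_overflow" in stderr
// (hypothetical helper name):
void mark_overflow(RESULT& r) {
    r.runtime_outlier = true;
    r.opaque |= RESULT_FLAG_OVERFLOW;
}
```

Which would leave the open question exactly where Richard says it is: nothing obviously reads that bit back out of 'opaque' afterwards.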

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1465204 - Posted: 16 Jan 2014, 21:41:30 UTC - in response to Message 1465142.  

Which, unfortunately, reinforces my point about needing to see the contents of the actual result files to get to the bottom of this. Pity, but I don't see how else we can do it.

I'm probably going to step way out of my depth here, but isn't there some way to identify an event trigger that could be used to run a scheduled task each time a S@H GPU task terminates, that could simply make a copy of each result file before it gets uploaded? My knowledge and use of scheduled tasks is currently limited to just a few run by TOD triggers, but it seems to me that there are a multitude of event triggers available to hook, as well.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1465213 - Posted: 16 Jan 2014, 21:51:30 UTC - in response to Message 1465204.  

Which, unfortunately, reinforces my point about needing to see the contents of the actual result files to get to the bottom of this. Pity, but I don't see how else we can do it.

I'm probably going to step way out of my depth here, but isn't there some way to identify an event trigger that could be used to run a scheduled task each time a S@H GPU task terminates, that could simply make a copy of each result file before it gets uploaded? My knowledge and use of scheduled tasks is currently limited to just a few run by TOD triggers, but it seems to me that there are a multitude of event triggers available to hook, as well.

BoincLogX can do exactly that.

Unfortunately, my copy stopped working when I upgraded the machine it was running on from Vista to Windows 7: I suspect a permissions or UAC problem, but I never used it enough to make it worthwhile finding out exactly what the problem was, and fixing it.

One thing I did find while I was running it under Vista was that you have to set a very aggressive capture interval to capture the completed file in the short interval between the last/best signals being written, and the file being deleted when uploading is complete. You have maybe 5 seconds maximum. And at that sort of capture interval, BoincLogX itself uses quite a lot of CPU time - possibly aggravating the problem we're trying to solve. But it might be worth a look again.

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1465218 - Posted: 16 Jan 2014, 21:59:41 UTC - in response to Message 1465213.  

Which, unfortunately, reinforces my point about needing to see the contents of the actual result files to get to the bottom of this. Pity, but I don't see how else we can do it.

I'm probably going to step way out of my depth here, but isn't there some way to identify an event trigger that could be used to run a scheduled task each time a S@H GPU task terminates, that could simply make a copy of each result file before it gets uploaded? My knowledge and use of scheduled tasks is currently limited to just a few run by TOD triggers, but it seems to me that there are a multitude of event triggers available to hook, as well.

BoincLogX can do exactly that.

Unfortunately, my copy stopped working when I upgraded the machine it was running on from Vista to Windows 7: I suspect a permissions or UAC problem, but I never used it enough to make it worthwhile finding out exactly what the problem was, and fixing it.

One thing I did find while I was running it under Vista was that you have to set a very aggressive capture interval to capture the completed file in the short interval between the last/best signals being written, and the file being deleted when uploading is complete. You have maybe 5 seconds maximum. And at that sort of capture interval, BoincLogX itself uses quite a lot of CPU time - possibly aggravating the problem we're trying to solve. But it might be worth a look again.

Okay, thanks Richard. I'll take a look at it. Seems to me I tried it briefly way back last spring, shortly after I'd rejoined the project, but decided it didn't do much for me at the time. Perhaps in this specific situation, it might.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1465219 - Posted: 16 Jan 2014, 22:02:07 UTC - in response to Message 1464280.  

- wait for v7.2.38

Hopefully that will be soon.

I'll let you know as soon as I see one available for testing.

v7.2.38 appeared in the download directory about 20 minutes ago. I've installed it (64-bit version) on just one host so far, and it's running without having trashed anything yet: but anyone who follows the links below is running very much at their own risk. This version is untested.

http://boinc.berkeley.edu/dl/boinc_7.2.38_windows_x86_64.exe
http://boinc.berkeley.edu/dl/boinc_7.2.38_windows_intelx86.exe

TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1465226 - Posted: 16 Jan 2014, 22:33:57 UTC

Here are a couple of new short overflows from the 7.0.64 host. Both validated; one had a normal Stderr, the other was truncated.
http://setiathome.berkeley.edu/result.php?resultid=3337831865
http://setiathome.berkeley.edu/result.php?resultid=3337831894
Both had spikes of 30.

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1465237 - Posted: 16 Jan 2014, 23:20:52 UTC - in response to Message 1465125.  
Last modified: 16 Jan 2014, 23:57:38 UTC

... One odd thing we'll have to ask Jason about. Without knowing the code details, I'd have thought

Thread call stack limit is: 1k

was part of the initial start-up process, and

cudaAcc_free() called...

the start of the clean-up at end. Why doesn't the truncation occur between those two lines, if it's timing related?


'Timing related' would be a tenuous description on a non-realtime multithreaded OS using buffered IO. Logically, if the issue never appears between those lines, it could simply be that the application hasn't exited yet, or 'whatever' triggers the event hasn't occurred yet.

My suspicion is the triggering event appears after the boinc finished file is created and detected by the client, so after the files have been written and flushed, but prior to Windows IO completion and garbage collection. There are numerous reasons that cleanup can be delayed for a very long time, just one being buffered IO in flight while running under high system contention (such as a machine in full overcommit, heavy hard drive use etc), another being desktop responsiveness optimisations unique to Windows.

For some mystical Boinc logic reason in the current code, after the client detects the exited task and goes into handle_exited_app() (or whatever), it runs a TerminateProcess() on the app, as though it doesn't believe the app actually exited. Using that is bad juju, a last resort for when every other option fails. In a sense, that skepticism is well placed, because applications go through a whole cycle of events well after leaving application control, these events being C-runtime, system and OS determined.

I have a way to test one possible mechanism, somewhat related to Matt's forced commits, but more complete. Produce an x41zc build with commode.obj linked in, which disables buffered IO. If the affected systems can try that and no longer manifest the same symptoms, then we know Boinc's logic is fighting against normal OS functionality. Can have different symptoms too, due to other possible problem areas, but any behaviour change is an indicator of messing in the right area.

From my perspective, one sign for the bone pointing scenario is that I've never seen truncated stderr under standalone bench, and indeed Claggy posted some affected task bench reprocesses a while back. That points to things being some function of running under an active Boinc client, as opposed to an application side issue.

Disabling buffered IO, effectively forcing a commit after every single file fflush() in the app, is of course non-ideal and not desirable, but would be sufficient to point the bone straight at Boinc's app termination sequence (once again), or take it off the radar (iff identical symptoms reappear). With or without that particular smoking gun of a crippled test app not replicating the fault(s), there are bizarre logic things going on there. These can sometimes be signs that the author doesn't know how the OS works, and expects it to work in certain ways, perhaps unrealistically.
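[As a rough illustration of what "disabling buffered IO" means at the C-runtime level: the sketch below opens a file unbuffered so nothing sits in a CRT buffer if the process is torn down early. It's a portable approximation with illustrative names; the actual proposal, linking commode.obj into the MSVC build, goes further and makes flushes commit to disk.]

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Write a message with stdio buffering disabled, so a premature
// TerminateProcess() can't strand output in a C-runtime buffer.
// (Illustrative sketch only; commode.obj on MSVC additionally commits
// flushed data to disk, which this portable version does not do.)
bool write_unbuffered(const char* path, const char* msg) {
    std::FILE* f = std::fopen(path, "w");
    if (!f) return false;
    std::setvbuf(f, nullptr, _IONBF, 0);  // _IONBF: each fprintf goes straight to the OS
    std::fprintf(f, "%s", msg);           // nothing left in a user-space buffer
    return std::fclose(f) == 0;
}
```

With the CRT buffer out of the picture, any stderr truncation that still appears would have to come from later in the chain, the OS or the client's termination handling, rather than from the app's own buffered writes.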

Another hot day today, but while the wrangling over looking at (non)intact result files takes place, and things cool down at night, I can look at that. I would need two affected systems to run the suboptimal modified build for as long as it takes to either re-encounter the same set of symptoms, encounter different ones, or see no reappearance at all, until some level of confidence is achieved that buffered IO on termination is related.

Volunteers, please step forward with links to affected hosts and name your Cuda version.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1465275 - Posted: 17 Jan 2014, 2:25:24 UTC - in response to Message 1465237.  
Last modified: 17 Jan 2014, 2:36:35 UTC

I have two affected machines to test with the x41zc build. Up to this point it has been the XP host with the most affected short Overflows. For some reason, the Windows 8.1 host hasn't been receiving any of the short overflows which seem to be the ones causing the most problems. The only 'Immediate' Invalid the Win8 machine has had wasn't a short overflow either, and it is still pending - Workunit 1398658616. Both machines can use CUDA 2.3 or 3.2, they are using 2.3 at present.

Since downgrading the XP host to 7.0.64 from 7.2.33 all tasks have validated. I was thinking of Updating the Win8 host to 7.2.38 a little later tonight. I could use any version necessary for the test.
XP - http://setiathome.berkeley.edu/show_host_detail.php?hostid=6979629
Win 8.1 - http://setiathome.berkeley.edu/show_host_detail.php?hostid=6796475

I spent some more time looking at a couple of the top Linux CUDA host results. So far, I haven't found any truncated Stderr outputs.

The second host just reported on the Win8 task, so I guess it will be gone tomorrow night. It would have been nice to see the result file on that task that was labeled Invalid as soon as the first Wingman reported.

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1465278 - Posted: 17 Jan 2014, 2:31:09 UTC - in response to Message 1465275.  
Last modified: 17 Jan 2014, 2:31:37 UTC

Thanks, that looks to be along the right lines to me so far. Those lines are operating system IO mechanism 'misunderstandings'. I'll take it that a Cuda 2.3 build would serve your needs in both cases, though if you are running 'stock', the anon platform variable would need to be eliminated by running 'optimised' for some period. The app is identical, but removing variables early seems prudent.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.


 
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.