Panic Mode On (98) Server Problems?
Author | Message |
---|---|
Brent Norman Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
I'm in the same boat ... with 350.12 I suddenly stopped downloading AP GPU tasks. So I reverted to 347.88 and reinstalled Lunatics (AP worked fine then), now it's hard to find a single std_err for ANY MB GPU task that is not empty. DAMN I was thinking the same, did a Windows update change my HD cache settings somewhere along the line? Is it the driver? Although many other people use it! Maybe I should move my BOINC folder to my SSD. IDK, I'm confuzzled as to why shit happens. |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Interesting to see some of your SETI MB get hit by the blank stderr.txt bug. I've only experienced it on MilkyWay tasks that complete in under 50 seconds. I thought the short runtimes of those tasks might have some bearing on the problem. Interesting to see the problem on a standard MB GPU task that completes in the normal 20 minute range. You would have to run the beta 7.6.2 BoincManager and set some debug flags to help Richard out with analyzing this problem. I'm doing that now and have caught one invalid MW blank stderr.txt task already and still looking for more in the logs. Good luck. Keith [Edit] I am running SSDs, so that doesn't seem to help the problem. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Interesting to see some of your SETI MB get hit by the blank stderr.txt bug. I've only experienced it on MilkyWay tasks that complete in under 50 seconds. I thought the short runtimes of those tasks might have some bearing on the problem. Interesting to see the problem on a standard MB GPU task that completes in the normal 20 minute range. You would have to run the beta 7.6.2 BoincManager and set some debug flags to help Richard out with analyzing this problem. I'm doing that now and have caught one invalid MW blank stderr.txt task already and still looking for more in the logs. You have another PM, but it's bedtime on this side of the pond. I'll look to see if you've caught another one in the morning. |
TimeLord04 Joined: 9 Mar 06 Posts: 21140 Credit: 33,933,039 RAC: 23 |
Just picked up a few APs!!! :-) I hope I get more. :-) TimeLord04 Have TARDIS, will travel... Come along K-9! Join Calm Chaos |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
I attribute it to BOINC 7.2.33 which seems to be the best version I've come across. Same write caching on both drives. The OS is on an old SATA drive, the Data folder is on an older PATA drive with just other data on it. I did revert back to the first driver that works with Win 8.1 a few weeks ago, seems to work better with my GTS250 than the 337.88 version I was using earlier. I wasn't having the problem with 337.88 either though. |
Dirk Sadowski Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
08/07/2015 19:14:55 | SETI@home | Finished upload of 19ja15ab.29055.20324.438086664197.12.199_1_0 Look at this, I have selected to download other WUs if AP is not available, which obviously works for the GPU but *NOT* for the CPU. This is server side related, *NOT* on my end! If I check all (CPU, Intel iGPU, ATI GPU and NV GPU (although no ATI GPU installed)) and in my app_info.xml file are AP and SETI entries for all 3 devices, my PC (CPU and GPUs) gets AP and SETI WUs if I uncheck SETI, check AP and check 'If no work for selected applications is available, accept work from other applications'. Resource share 1000000 for SETI, 0 for Milkyway. It looks like your PC has now got AP WUs for the CPU... Did you change something in the app_info.xml file? Is there still a fine SETI entry for the CPU? |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
I attribute it to BOINC 7.2.33 which seems to be the best version I've come across. Never mind. Seems now that I'm looking I'm finding a few truncated stderr.txt. Time to go back to the Commode build. No problem, the .exe is still in my setiathome.berkeley.edu folder, I'll just change the names in my app_info file back before I get any of those Instant Invalids. Oh, well... |
Ulrich Metzner Joined: 3 Jul 02 Posts: 1256 Credit: 13,565,513 RAC: 13 |
(...) It seems it sorted itself out. The only thing I did was to select all apps and submit the changes. Then I waited a few minutes and selected AP7 only and other apps yes, if AP7 is not available. It is like sometimes in Windows: An option is selected, but doesn't work. So you unselect it and hit "ok". Then you select the option again and hit "ok" and "magically" now the option works... Aloha, Uli |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
I gave it some more thought and realized sometime between using the Commode version and the regular version I had moved the BOINC Data folder to a 2nd hard drive. Maybe having the Data folder on a different HD than the OS makes a difference? I dunno, just trying to guess why I don't seem to be having the problem anymore... Heh, I thought "truncation immunity by HDD" just seemed too easy. ;^) At least Jason's commode build appears to eliminate them for the NVIDIA GPUs, but that still doesn't help with CPU tasks or the ATI GPUs. We still need the validator fix to actually eliminate the Invalid problem. I haven't had any Invalids so far this month, but did pick up 3 truncations overnight: 4250582637, 4250582741, 4250665618. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
I gave it some more thought and realized sometime between using the Commode version and the regular version I had moved the BOINC Data folder to a 2nd hard drive. Maybe having the Data folder on a different HD than the OS makes a difference? I dunno, just trying to guess why I don't seem to be having the problem anymore... Do you reckon http://lists.ssl.berkeley.edu/pipermail/boinc_dev/2015-July/021827.html gets any closer to the mark? |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Do you reckon http://lists.ssl.berkeley.edu/pipermail/boinc_dev/2015-July/021827.html gets any closer to the mark? Race condition is of course the correct technical term for "Boincapi doesn't use threadsafe process termination practices, which therefore compromises any use of standard multithreaded runtime libraries in applications, which have been mandatory in Windows since Visual Studio 2005, and an effective method of user experience optimisation not unique to Windows [C-runtime] libraries [but in widespread adoption anywhere non-blocking IO is desirable]". "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Do you reckon http://lists.ssl.berkeley.edu/pipermail/boinc_dev/2015-July/021827.html gets any closer to the mark? Hmmm...that's very interesting. Perhaps we could get some of that same log info over here at S@h. Is the <slot_debug> option only available starting with BOINC v7.6.2? (I've been clinging to 7.2.33.) Also, we've noticed that truncation appears primarily in two flavors. One is the completely empty Stderr, such as the one TBar posted, what Keith is getting on MW, and what I usually see on my daily driver (and also used to see on my Win 8.1 box when it was active). The other is the truncation like the 3 that I posted this morning, after the line that reads "Thread call stack limit is: 1k", which is what usually happens on that box. On rare occasions, there have been other truncation locations spotted, too. Do you think the race condition could account for those non-empty truncations, as well? |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
... Do you think the race condition could account for those non-empty truncations, as well? Yes it does, bearing in mind that as soon as you invoke the term 'race condition', the result is more or less chance. [Edit:] note that it's far more a software engineering problem than a code-bug one. The code is most likely perfectly fine for single core machines using single threaded libraries, circa 2003. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Do you reckon http://lists.ssl.berkeley.edu/pipermail/boinc_dev/2015-July/021827.html gets any closer to the mark? The basic structure is there, but you have to invoke it manually. v7.4 has a nice GUI to turn things on and off, and v7.6.2 (specifically) has a little more detail which helped with the previous bug. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Do you reckon http://lists.ssl.berkeley.edu/pipermail/boinc_dev/2015-July/021827.html gets any closer to the mark? Something which Keith said, but which I left out of my final report as unconfirmed and possibly a distraction, is that at Milkyway he sees most problems with the very short tasks (Modified Fit) which run for less than a minute on his machine. The longer tasks run, the greater the chance that the first section of the file has made it to disk before it's read back into the report. There's probably a difference depending whether the sequence is "open file - write a bit - close file" or "open file - add a bit - add a bit - add a bit - add a bit - close app and file together". I don't even know whether that is down to the individual programmer (in which case, it might vary between projects, or even between applications in a single project), or whether it's determined by the BOINC API. Jason should be able to answer that. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
... Do you think the race condition could account for those non-empty truncations, as well? Which is partly why I suggested the "test before proceeding" solution, rather than any form of "predict when it'll be ready". |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
... I don't even know whether that is down to the individual programmer (in which case, it might vary between projects, or even between applications in a single project), or whether it's determined by the BOINC API. Jason should be able to answer that. Largely system dependent, on many levels, with many buffers floating between the app code and disk. Top level is the app code, which is being summarily terminated, hopefully having completed its writes. If it did, the output goes into the standard C libraries, where it sits in user-space buffers. Once in those buffers, it can be asked by app code or OS schedulers to commit to disk, anywhere from immediately to minutes or longer (this amounts to delayed writes, write-back caching at the application level, which the commode object file disables; the *nix equivalent, I believe, is opening the file in commit mode in the first place). After that you are down into kernel & driver internals. In most cases you'd get incomplete or missing contents if the calling application was killed. The problem here is that there is no fundamental control of this underlying mechanism (other than to bypass/disable it), so the old methods of forcing things simply do not work. You ask, wait/sleep patiently, then check, and follow up. That's non-blocking multithreaded programming, which works just as well as managing real live people/workers (as opposed to killing them, burning down the factory, the instant a report is overdue, after obstructing the fire exit with a broken piano). "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Do you reckon http://lists.ssl.berkeley.edu/pipermail/boinc_dev/2015-July/021827.html gets any closer to the mark? Well, initially I'll try just adding <slot_debug> to cc_config under 7.2.33 and see if anything interesting pops. I might be upgrading that xw9400 from XP to Win 7 this weekend and, if it doesn't go as smoothly as I'd like, I might have to reinstall everything else, too, in which case perhaps I'll go ahead and try v7.6.2. Something which Keith said, but which I left out of my final report as unconfirmed and possibly a distraction, is that at Milkyway he sees most problems with the very short tasks (Modified Fit) which run for less than a minute on his machine. The longer tasks run, the greater the chance that the first section of the file has made it to disk before it's read back into the report. I don't know that we're seeing that here at S@h. True, it's only the -9 overflows that are at risk for being invalidated if they get truncated, but even those aren't necessarily quickies unless the overflow is entirely due to 30 Spikes. The speediest of the 3 truncations I reported this morning took over 11 minutes (and was probably not an overflow.) The last one I had here on my daily driver, on July 4, ran for over an hour and a half (on a GT 630). |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
I should add that terminating a process cancels or completes IO writes depending on which level they are at, which is fairly arbitrary, but will tend to be in chunks. [And that Boincapi's MFILE/MIOFILE implementation expressly uses memory mapped files for buffer based performance features, so extra buffer levels and delays there... and more threads] "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Richard, to add a little more information to the discussion, I have also seen the race condition show up at MilkyWay on the longer 1.02 app. Those tasks typically run for 2-2:30 minutes on my machines. I discovered this case at the time that SETI was down and no work was available so I was heavily processing MW tasks on each card. I had also stopped accepting work for the 1.36 troublesome short tasks in an endeavor to try and figure out what was causing the truncated stderr.txt results. In that experiment, I discovered I could also produce truncated stderr.txt for the longer 1.02 app. I surmise that this was because of the heavily stressed file system that exposed the race condition. Just some more info to chew on. Thanks for the boinc-dev report submission. Cheers, Keith Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
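
For anyone following the thread who wants to see the two ideas side by side, here is a rough Python sketch of the buffer layers jason_gee describes (library buffer, then OS cache, then disk) together with the "test before proceeding" approach Richard suggests. It is only an illustration under stated assumptions, not how BOINC or any science app actually handles stderr.txt: the function names `write_log_durably` and `read_when_complete`, and the `END_MARKER` sentinel, are hypothetical and made up for this example.

```python
import os
import tempfile
import time

# Hypothetical sentinel marking a fully written log; NOT a BOINC convention.
END_MARKER = "<end_of_stderr/>"


def write_log_durably(path, lines):
    """Append lines plus the sentinel, then push them through both buffers."""
    with open(path, "a", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
        f.write(END_MARKER + "\n")
        f.flush()             # library (user-space) buffer -> operating system
        os.fsync(f.fileno())  # operating system cache -> storage device


def read_when_complete(path, timeout=5.0, poll=0.05):
    """Ask, wait, check: poll until the sentinel is present.

    Returns the file text, or None if the sentinel never shows up in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(path, encoding="utf-8") as f:
                text = f.read()
        except FileNotFoundError:
            text = ""  # writer has not created the file yet
        if text.rstrip().endswith(END_MARKER):
            return text
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        log_path = os.path.join(workdir, "stderr.txt")
        write_log_durably(log_path, ["Thread call stack limit is: 1k", "done"])
        result = read_when_complete(log_path)
        print("complete" if result is not None else "timed out")
```

The point of the reader is that it never guesses how long the writer needs. It checks for evidence that the file is whole and gives up cleanly after a deadline, which is the opposite of reading the file the instant the process is gone.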
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.