1)
Message boards :
Number crunching :
Stderr Truncations
(Message 1702888)
Posted 18 Jul 2015 by Rom Walton (BOINC) Post: 99% of the time it'll probably get the contents of stderr.txt on the first retry after the error condition is triggered. The extra four retries cover the possibility that we have to wait for various forms of anti-malware and content indexers to have had their way with the file as well. Security Essentials is MsMpEng.exe.
2)
Message boards :
Number crunching :
Stderr Truncations
(Message 1702886)
Posted 18 Jul 2015 by Rom Walton (BOINC) Post: Now, having just looked at that code block again with my non-C++ trained eyes, I'm wondering whether I've still misinterpreted what will happen if that 5-second grace period expires. Does the Sharing Violation actually end up causing that Error Code 32 to be reported, or does the normal code path simply resume, presumably still producing a truncated stderr.txt? I dunno, some expert is gonna hafta 'splain that to me! ;^)

99% of the time it'll probably get the contents of stderr.txt on the first retry after the error condition is triggered. The extra four retries cover the possibility that we have to wait for various forms of anti-malware and content indexers to have had their way with the file as well.

Special purpose software such as virus scanners and anti-malware usually has kernel mode components that make sure it gets first dibs on files after they have been released by the apps that created them. Normal Windows software hooks into the file system notification API and lets Windows tell it when it is its turn to fiddle with the file. BOINC, being a cross-platform async polling app, instead relies on checking whether it can successfully access the file.
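The polling approach described above can be sketched roughly as follows. This is a minimal illustration in Python, not BOINC's actual C++ code: the function name `read_with_retries`, the five attempts, and the one-second delay are assumptions chosen to match the "five retries within a 5-second grace period" described in the thread.

```python
import time


def read_with_retries(path, attempts=5, delay=1.0, sleep=time.sleep):
    """Poll for access to `path`, retrying while another process holds it.

    Returns the file contents, or None if every attempt failed.
    `attempts` and `delay` are illustrative values, not BOINC's real ones.
    """
    for attempt in range(attempts):
        try:
            with open(path, "r", errors="replace") as f:
                return f.read()
        except OSError:
            # On Windows a sharing violation (error 32) is raised as
            # PermissionError, a subclass of OSError. Wait, then poll again.
            if attempt < attempts - 1:
                sleep(delay)
    return None
```

If every attempt fails the caller gets `None` and has to decide what to report. That is the open question in the quoted post: whether the error is surfaced or a truncated stderr.txt is used anyway.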
3)
Message boards :
Number crunching :
Stderr Truncations
(Message 1702353)
Posted 17 Jul 2015 by Rom Walton (BOINC) Post: And also that in the Milkyway case, it's the OpenCL component of the NVidia driver/runtime suite which is active.

True, but I suspect that the OpenCL compiler just converts OpenCL code into CUDA instructions.
4)
Message boards :
Number crunching :
Stderr Truncations
(Message 1702352)
Posted 17 Jul 2015 by Rom Walton (BOINC) Post: If I could just interject one thing (without understanding much other than CUDA in that post), it would be that truncated Stderr is not unique to the NVIDIA GPUs. It happens on ATI cards and CPUs as well.

Okay, that blows that theory out of the water.
5)
Message boards :
Number crunching :
Stderr Truncations
(Message 1702345)
Posted 16 Jul 2015 by Rom Walton (BOINC) Post: I don't see the commit mode change actually fixing the problem. The writes have been delayed somewhere by something for some reason.

I have a hypothesis on this, but I don't have a way to prove or disprove it yet. Suppose that when an app calls cuInit() to initialize the CUDA/OpenCL library, it passes the current stderr/stdout handles to the CUDA kernel code so that fatal compiler errors can be trapped and written to a file for the calling app. During this process the library duplicates and internalizes the handle, thereby increasing its ref count. Normally the CUDA library assumes it can clean things up during the dllmain unload event, but because boinc_exit() calls TerminateProcess() the event is never fired. The kernel decrements the ref count of the handle after TerminateProcess() is called and the process is cleaned up, but doesn't close it down because its ref count is still greater than zero. It isn't until the CUDA kernel driver has attempted to do something that it discovers that a handle it holds is no longer valid and cleans things up on its end, thereby releasing the write lock on stderr.txt.

The CUDA library doesn't really provide a clean-up routine you are supposed to call after you are done, so there isn't a way to test this. We would need to talk to somebody at Nvidia to find out what underlying assumption the CUDA library is making with regard to cleaning up on application shutdown to know what is really going on.
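To make the hypothesis above concrete, here is a toy model in Python of a reference-counted handle. It is only a sketch of the reasoning, not Windows kernel or CUDA driver behaviour: the class `SharedHandle` and the holder names are hypothetical, and the post itself says the hypothesis is unproven.

```python
class SharedHandle:
    """Toy model: a file stays locked while any holder keeps a reference."""

    def __init__(self, name):
        self.name = name
        self.holders = []

    @property
    def ref_count(self):
        return len(self.holders)

    @property
    def locked(self):
        return self.ref_count > 0

    def duplicate(self, holder):
        # Each duplicate of the handle adds one reference.
        self.holders.append(holder)

    def release(self, holder):
        self.holders.remove(holder)


stderr_file = SharedHandle("stderr.txt")
stderr_file.duplicate("science app")   # the app opens stderr.txt
stderr_file.duplicate("GPU driver")    # hypothesised duplicate made during cuInit()

# TerminateProcess(): the app's reference goes away, the driver's does not.
stderr_file.release("science app")
locked_after_exit = stderr_file.locked             # True, file still locked

# Later the driver notices its handle is stale and lets go.
stderr_file.release("GPU driver")
locked_after_driver_cleanup = stderr_file.locked   # False, lock released
```

In this model the window between the two `release` calls is where a reader of stderr.txt would hit a sharing violation or see incomplete contents.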
6)
Message boards :
Number crunching :
Stderr Truncations
(Message 1702334)
Posted 16 Jul 2015 by Rom Walton (BOINC) Post: [Edit:] for the important technical points you make, and questions you raise, the core issues relate to the fact that multithreaded C runtimes became standard circa 2005 in the case of Windows, so it requires a mindset shift from sequential/procedural to parallel and out-of-order operation. That's proven over time to be a lot tougher than most I know expected, and for me too. Non-deterministic behaviour is a pretty big red flag for this kind of thing too.

Back even further than that. 1992 (Windows NT 3.1 October Beta) is when I had to hunker down and learn the basics of processes, threads, and thread sync mechanisms. Prior experience to that was just Windows 3.1 (16-bit cooperative multitasking). IIRC, the Microsoft CRT hadn't even been developed yet. It would come a year or two later, when vendors didn't jump on the NT bandwagon fast enough and complained about difficulties in porting their software to Windows NT.

Anyways, the difficulties are the primary reason why BOINC is not already multi-threaded. At this point it would be more trouble than it is worth. BOINC itself doesn't use much CPU time and, for the most part, isn't time sensitive in that it doesn't require millisecond response times. So going multi-threaded just adds complexity and debugging headaches. More so for platforms other than Windows.
7)
Message boards :
Number crunching :
Stderr Truncations
(Message 1702287)
Posted 16 Jul 2015 by Rom Walton (BOINC) Post: ... Notice that I said "modification" rather than "fix", because now that I believe I've actually caught on to what that code change is doing (a light bulb that flashed on about 2 A.M., of course), I'm wondering if this might just be exchanging one type of rare task failure ("instant" Invalid) for another (Error while computing - Error code 32), at least from a S@h perspective. (For MW, it probably really is a fix.)

Funny, we have been through 10 or so code reviews and security audits in the last ten years by various companies. IBM (in-house, we have to go through a full audit every time they want a new branded client), Intel (through a third party), a bank or two, an oil company, and at least one hospital. Your gripes appear to be more about aesthetics than how solid/stable something is. Don't confuse the two.
8)
Message boards :
Number crunching :
Stderr Truncations
(Message 1702054)
Posted 16 Jul 2015 by Rom Walton (BOINC) Post: [Edit:] Now that I know the reasoning, and that not much of it applies for specific applications at all, optional exit handling strategies & thread management become a lot easier to work in.

Honestly, the majority of scientific applications are single-threaded applications anyway. Most scientists are not computer scientists, so multi-threaded applications are a rarity. Things used to be pretty simple and straightforward. Windows Vista changed all that with the 5 second shutdown requirement. Now BOINC and all of its child processes have to shut down within 5 seconds, when the bulk of the child applications are themselves single-threaded. It causes quite a few contortions.

The biggest stumbling block I found related to threading was the fact that the pthread style threading libraries didn't have an equivalent of EnumThreads(). Yet the threading library used by Linux, Mac OS X, BSD, and Android is pthread. You can make the Windows threading API conform with pthread much more easily than the reverse.
9)
Message boards :
Number crunching :
Stderr Truncations
(Message 1702016)
Posted 16 Jul 2015 by Rom Walton (BOINC) Post: Only problem there, is they decided (for unknown reasons) to use an asynchronous TerminateProcess() call and put a 1 second sleep and hard crash, instead of waiting on a synchronisation primitive.

Normal methods for exiting a process involve a risk of the process not shutting down at all. Normally ExitProcess() (which exit() and _exit() end up calling) only kills the calling thread (or at least it did the last time I tested this, in 2006/2007). Even the ExitProcess docs state:
Since an application can start and use any number of threads, boinc_exit() cannot assume to know the state of any thread outside the thread that called boinc_exit(). Hence, TerminateProcess().

The one second sleep is an attempt to give the Windows thread scheduler a chance to act on all the newly terminated threads. The majority case scenario is that all threads are halted and cleaned up during TerminateProcess(). Using WaitForSingleObject() on the process handle is problematic in that its state is unknown after a TerminateProcess() call. Sleep probably does a WaitForSingleObject() against the thread handle though. In any case the overall objective is just to force the thread scheduler to clean things up. DebugBreak() is for when all else fails: it tries to capture whatever is left in a debugger that is already attached to the running process.

No idea why the entire codebase seems to be allergic to synchronisation, and likes to use magic numbers (fixed time intervals) on a non-realtime OS.

While the threading model(s) on Windows are rather rich, pthread is not. So we code to the lowest common denominator. BOINC at its core is a single-threaded application.
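The worry above, that a normal exit can hang on threads whose state is unknown while a forced termination cannot, has a loose analogue in Python. The sketch below is only an analogy, not the Win32 behaviour of ExitProcess() or TerminateProcess(): `sys.exit()` stands in for a cooperative exit and `os._exit()` for a hard one, and the names `CHILD` and `exits_within` are hypothetical.

```python
import subprocess
import sys
import textwrap

# Child script: starts a long-running worker thread, then the main thread
# exits using the method named on the command line.
CHILD = textwrap.dedent("""
    import os, sys, threading, time

    def worker():
        time.sleep(30)  # stands in for a thread whose state is unknown

    threading.Thread(target=worker).start()  # non-daemon thread
    if sys.argv[1] == "hard":
        os._exit(0)   # end the whole process immediately, no cleanup
    sys.exit(0)       # cooperative exit: interpreter waits for the worker
""")


def exits_within(mode, seconds):
    """Return True if the child process ends within `seconds`."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD, mode])
    try:
        proc.wait(timeout=seconds)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()   # reap the child that is still waiting on its thread
        proc.wait()
        return False
```

Calling `exits_within("hard", 10)` should return quickly, while `exits_within("cooperative", 2)` times out because the interpreter is still waiting for the worker thread. That is the kind of lingering process a 5 second shutdown budget cannot tolerate.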
10)
Message boards :
Number crunching :
Bad News on BOINC funding
(Message 1700064)
Posted 10 Jul 2015 by Rom Walton (BOINC) Post: You'll just have to email the PMC email list, which errm no one knows what it is. ;-)

When in doubt, send email to boinc_dev@ssl.berkeley.edu. Overall things are not in that bad a shape. A new grant request has been submitted to the NSF based on feedback from various sources. We don't expect to hear anything back from NSF until early next year. Until then, my deal with IBM/WCG gives me some room to continue some of my BOINC related work. I suspect most things will continue on auto-pilot for a while as far as general maintenance goes.
11)
Questions and Answers :
GPU applications :
Boinc V7.0.64 Bug?
(Message 1390977)
Posted 15 Jul 2013 by Rom Walton (BOINC) Post: This might be a bit simplistic, but it seems to me that if BOINC 7 recognises an RDP connection, it would work correctly if it remembered the state of the GPU at the connection time and resets it when the connection is broken.

Well, if I remember the situation correctly, before BOINC learned how to deal with remote connections, jobs were just randomly failing. It turns out the identifier a science app uses to communicate with the GPU is invalidated once an RDP connection is established. The app would usually crash at the end of its run. That is why we opted to just shut down GPU apps when a remote session was established: there was no point continuing to run an app that was eventually going to crash and invalidate the work accomplished so far. I'm actually quite surprised that you have a configuration that continues to work after an RDP connection has been established and released. ----- Rom
12)
Questions and Answers :
GPU applications :
Boinc V7.0.64 Bug?
(Message 1390968)
Posted 15 Jul 2013 by Rom Walton (BOINC) Post: I see. That would be a problem. Is this a remote thing or just the switch user option? Can you not tell the difference between a switch user and a remote connection or are they exactly the same thing? If they're not the same thing, can you tell when the original user switches back?

According to the APIs I've found so far, a switch user and remote connection are the same thing. I haven't found a way to discern the two conditions programmatically yet.
13)
Questions and Answers :
GPU applications :
Boinc V7.0.64 Bug?
(Message 1389614)
Posted 10 Jul 2013 by Rom Walton (BOINC) Post: FYI, I have reverted to BOINC V6 and it does handle RDP in a well behaved way.

It seems you had already stated some things related to the questions I had. I'm going to have to think about this a bit. The basic problem is BOINC is detecting a condition (RDP is in use but disconnected) which does not appear to be a problem for you, but causes science applications to crash on home computers when they use Fast User Switching. Fast User Switching/Remote Desktop are different names for the same technology.
14)
Message boards :
Number crunching :
BOINC and Domain Controller
(Message 967662)
Posted 2 Feb 2010 by Rom Walton (BOINC) Post: or Novell network shares

If I remember my networking history correctly, NetBIOS over TCP/IP is really just a hack. In the beginning Windows used NetBEUI, and Browse Masters/Domain Controllers (pre-Active Directory) basically provided a mechanism for replicating computer names across logical Ethernet segments. NetBEUI networks were also drop-dead easy. The only requirement was a unique computer name. NetBEUI wasn't a routable protocol. NetBEUI was also a very chatty protocol; I remember one installation where we had a thicknet backbone and 100 nodes, and network utilization at night (machines idle) was something like 15%. Thicknet was a 10 Mbit network. IPX/SPX was routable, and primarily used in Novell NetWare environments. Basically both Novell and Microsoft saw the writing on the wall with TCP/IP becoming the standard and changed directions. Microsoft created WINS as a way to migrate name resolution of computer names from a NetBIOS/NetBEUI-centric environment to the longer-term DNS name resolution scheme. I haven't tried lately, but I believe in the Active Directory/DNS world you can do away with WINS. Both the UNC spec and the SMB/CIFS spec support DNS name resolution. The computer browser lists are handled via UDP, I believe.
15)
Questions and Answers :
Windows :
Mass Uninstall Help
(Message 937182)
Posted 1 Oct 2009 by Rom Walton (BOINC) Post: Terminix, could you email me at rwalton @ ssl DOT berkeley DOT edu? Depending on how everything was installed, it might be as easy as deprecating the application within AD; then Windows will uninstall the app during the next Group Policy update. Thanks in advance.
16)
Questions and Answers :
Preferences :
Screensaver does not shut down
(Message 885713)
Posted 16 Apr 2009 by Rom Walton (BOINC) Post: Could you try 6.6.23? It contained another screensaver fix for machines that have multiple video chipsets embedded within them, which was causing the screensaver code to get confused.
17)
Questions and Answers :
Windows :
Bonic Never Suspends - I have activity set to run based on prefs
(Message 794255)
Posted 7 Aug 2008 by Rom Walton (BOINC) Post: I believe I have figured out what the problem is and have posted a new build for those who want to give it a go. Windows x86 Windows x64 Please post a message here if this fixes your keyboard and mouse activity detection issues. Thanks in advance.
18)
Message boards :
Number crunching :
Please, STOP wasting our CPU time!
(Message 506900)
Posted 22 Jan 2007 by Rom Walton (BOINC) Post: Hmmmmmmm, so I guess getting that new multi-beam receiver, which is able to see further into space, was wasted effort. Or getting seti_enhanced online so the new data could be processed was wasted effort. Sheesh. Conspiracy theories are everywhere.

[rant mode on] How can any advancement be made without somebody crying foul: they are taking away my credits, or they are taking too much time to process a workunit? And before anybody jumps down my throat about the credits crack, think about it this way. The #1 complaint people had when the enhanced app came online was that it took so long. So to cut down on the amount of time it took to crunch, Eric and crew needed to optimize it. That in turn cut down on the processing time as well as the overall advantage the 3rd party optimized apps had over the stock app in terms of credit, even though the overall efficiency of the whole system was better for it. Each time an advance is made, somebody has to claim foul play, because something doesn't work like it used to. You ask for something better, and then when they deliver, you turn around and slap them down for it later. Damned if they do, damned if they don't. [rant mode off]
19)
Message boards :
Number crunching :
General Prefs. & Crunch3r BOINC 5.9.0.32
(Message 506865)
Posted 22 Jan 2007 by Rom Walton (BOINC) Post: I need to point out for people that the 5.9 client isn't anywhere near stable. It is still a few months from even beginning the testing process. We won't actually start to increment the version of the 5.9.x clients until we think we are done with the feature set and start the stabilization process. So if you are using 5.9 right now, you are using a very unpolished build. Expect that a few things don't work or are only half implemented.
20)
Message boards :
Number crunching :
64 bit BOINC client & BOINC MANAGER for Windows XP64
(Message 465428)
Posted 25 Nov 2006 by Rom Walton (BOINC) Post: Crunch3r et al, For those using VS 2005 you'll need to add some additional code to the S@H app, which can be found here: http://boinc.berkeley.edu/app_debug_win.php#Common0xc000000d Microsoft redesigned a bunch of stuff to reduce buffer overflows and the like; one of the things they did was enforce parameter validation for the CRT functions used in writing portable applications. So when you see a 0xc000000d error code with an application that was compiled using VS 2005, that is part of the problem.
©2024 University of California