Posts by Rom Walton (BOINC)

1) Message boards : Number crunching : Stderr Truncations (Message 1702888)
Posted 18 Jul 2015 by Rom Walton (BOINC)
Post:
99% of the time it'll probably get the contents of stderr.txt on the first retry when the error condition is triggered. The extra four retries are to deal with the possibility that we have to wait for various forms of anti-malware and content indexers to have had their way with the file as well.

Special purpose software such as virus scanners and anti-malware usually have kernel mode components that make sure they get first dibs on files after they have been released by the apps that created them.

Normal Windows software hooks into the file system notification API and lets Windows tell it when it is its turn to fiddle with the file. BOINC, being a cross-platform async polling app, instead relies on checking to see if it can successfully access the file.

For those couple days that I was running with the Process Monitor logging turned on, it never required more than one re-try to be successful. However, the AV on those boxes (Security Essentials on the xw9400 and Windows Defender on the T7400) didn't seem to be taking any interest in those files. It was certainly informative, however, to see that Windows Search indexing was accounting for a significant amount of activity on the T7400 (Message 1701586). Seeing no purpose in indexing such transient files, I was pretty quick to turn the indexing off for the whole slots tree. So, another efficiency improvement resulting from the testing. :^)


Security Essentials is MsMpEng.exe.
2) Message boards : Number crunching : Stderr Truncations (Message 1702886)
Posted 18 Jul 2015 by Rom Walton (BOINC)
Post:
Now, having just looked at that code block again with my non-C++ trained eyes, I'm wondering whether I've still misinterpreted what will happen if that 5-second grace period expires. Does the Sharing Violation actually end up causing that Error Code 32 to be reported, or does the normal code path simply resume, presumably still producing a truncated stderr.txt? I dunno, some expert is gonna hafta 'splain that to me! ;^)

To my (also untrained) eye, it looks like it simply falls out of the bottom, five seconds later. So it does what it was going to do anyway, but slower.

If that's the case, then it will definitely be a significant improvement for S@h, too, even if it is more of a patch than a true fix. And to use Jason's raft analogy ("Fingers crossed this raft never gets used on the open ocean."), you may not want to launch a raft with this sort of patch, but if you're already in the middle of the ocean, it sure will be helpful to patch the pinholes any way you can, until you can get the raft back to shore and build a whole new one!


Followed up on this with clear eyeballs as promised. I have no issues with the method or logic. It'll just keep trying to open the file for up to 5 seconds (rather than simply a fixed delay). The comment doesn't really say why a magic number of 5 seconds (not 3, 9 or 42), and it only tries once per second for access, but it should be way better, as your results would suggest.


99% of the time it'll probably get the contents of stderr.txt on the first retry when the error condition is triggered. The extra four retries are to deal with the possibility that we have to wait for various forms of anti-malware and content indexers to have had their way with the file as well.

Special purpose software such as virus scanners and anti-malware usually have kernel mode components that make sure they get first dibs on files after they have been released by the apps that created them.

Normal Windows software hooks into the file system notification API and lets Windows tell it when it is its turn to fiddle with the file. BOINC, being a cross-platform async polling app, instead relies on checking to see if it can successfully access the file.
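
For anyone who wants to see the polling idea in code, here is a minimal sketch in Python rather than BOINC's C++. It is not the actual BOINC change; the function name and its parameters are made up for illustration, and it only assumes what is described above: up to five attempts, spaced one second apart, giving up quietly if the file never becomes accessible.

```python
import time


def read_when_available(path, attempts=5, delay=1.0, opener=open, sleep=time.sleep):
    """Poll until the file can be opened, then return its contents.

    Hypothetical helper, not BOINC code. Tries up to `attempts` times,
    pausing `delay` seconds between tries. Returns None if the file is
    still inaccessible after the last attempt.
    """
    for attempt in range(attempts):
        try:
            with opener(path, "r") as handle:
                return handle.read()
        except OSError:
            # Another process (virus scanner, indexer) may still hold the file.
            if attempt < attempts - 1:
                sleep(delay)
    return None
```

The `opener` and `sleep` arguments exist only so the loop can be exercised without a real locked file or real waiting.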
3) Message boards : Number crunching : Stderr Truncations (Message 1702353)
Posted 17 Jul 2015 by Rom Walton (BOINC)
Post:
And also that in the Milkyway case, it's the OpenCL component of the NVidia driver/runtime suite which is active.


True, but I suspect that the OpenCL compiler just converts OpenCL code into CUDA instructions.
4) Message boards : Number crunching : Stderr Truncations (Message 1702352)
Posted 17 Jul 2015 by Rom Walton (BOINC)
Post:
If I could just interject one thing (without understanding much other than CUDA in that post), it would be that truncated Stderr is not unique to the NVIDIA GPUs. It happens on ATI cards and CPUs as well.


Okay, that blows that theory out of the water.
5) Message boards : Number crunching : Stderr Truncations (Message 1702345)
Posted 16 Jul 2015 by Rom Walton (BOINC)
Post:
I don't see the commit mode change actually fixing the problem. The writes have been delayed somewhere by something for some reason.


I have a hypothesis on this, but I don't have a way to prove or disprove it yet.

Suppose that when an app calls cuInit() to initialize the CUDA/OpenCL library it passes the current stderr/stdout handles to the CUDA kernel code so that fatal compiler errors can be trapped/written to a file for the calling app.

During this process they duplicate and internalize the handle thereby causing it to increase its ref count.

Normally the CUDA library assumes it can clean things up during the dllmain unload event, but because boinc_exit() calls TerminateProcess() the event is never fired.

The kernel decrements the ref count of the handle after TerminateProcess() is called and the process is cleaned up, but doesn't close it down because its ref count is still greater than zero.

It isn't until the CUDA kernel driver has attempted to do something that it discovers that a handle it holds is no longer valid and cleans things up on its end thereby releasing the write lock on stderr.txt.

The CUDA library doesn't really provide a clean-up routine you are supposed to call after you are done, so there isn't a way to test this.

We would need to talk to somebody at Nvidia to find out what underlying assumption the CUDA library is making with regards to cleaning up on application shutdown to know what is really going on.
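
To make the hypothesised sequence easier to follow, here is a toy model in Python. It is purely illustrative: the class and function names are invented, nothing in it calls CUDA or Windows, and the two steps marked "assumed" are exactly the unproven parts of the hypothesis above.

```python
class FileObject:
    """Toy stand-in for a kernel file object; not a real Windows structure."""

    def __init__(self):
        self.ref_count = 0

    def add_handle(self):
        self.ref_count += 1

    def release_handle(self):
        self.ref_count -= 1

    @property
    def write_locked(self):
        # The lock on stderr.txt lasts as long as any handle remains open.
        return self.ref_count > 0


def simulate_hypothesis():
    """Replay the hypothesised sequence; return (event, ref_count, write_locked) rows."""
    stderr_file = FileObject()
    timeline = []

    def record(event):
        timeline.append((event, stderr_file.ref_count, stderr_file.write_locked))

    stderr_file.add_handle()      # the app opens stderr.txt
    record("app opens stderr.txt")
    stderr_file.add_handle()      # assumed: the GPU library duplicates the handle at init
    record("library duplicates handle")
    stderr_file.release_handle()  # process terminated, its own handle goes away
    record("process terminated")
    stderr_file.release_handle()  # assumed: the driver later notices and lets go
    record("driver cleans up")
    return timeline
```

In this model the file stays write-locked between "process terminated" and "driver cleans up", which is the window in which a reader would see a sharing violation.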
6) Message boards : Number crunching : Stderr Truncations (Message 1702334)
Posted 16 Jul 2015 by Rom Walton (BOINC)
Post:
[Edit:] for the important technical points you make, and questions you raise, the core issues relate to the fact that multithreaded C runtimes became standard circa 2005 in the case of Windows, so it requires a mindset shift from sequential/procedural to parallel and out-of-order operation. That's proven over time to be a lot tougher than most I know expected, and for me too. Non-deterministic behaviour is a pretty big red flag for this kind of thing too.


Back even further than that. 1992 (Windows NT 3.1 October Beta) is when I had to hunker down and learn the basics of processes, threads, and thread sync mechanisms. My experience prior to that was just Windows 3.1 (16-bit cooperative tasking).

IIRC, the Microsoft CRT hadn't even been developed yet. It would come a year or two later, when vendors weren't jumping on the NT bandwagon fast enough and were complaining about difficulties in porting their software to Windows NT.

Anyways, the difficulties are the primary reason why BOINC is not already multi-threaded. At this point it would be more trouble than it is worth. BOINC itself doesn't use much CPU time and, for the most part, isn't time sensitive in that it doesn't require millisecond response times. So going multi-threaded just adds complexity and debugging headaches. More so for platforms other than Windows.
7) Message boards : Number crunching : Stderr Truncations (Message 1702287)
Posted 16 Jul 2015 by Rom Walton (BOINC)
Post:
... Notice that I said "modification" rather than "fix", because now that I believe I've actually caught on to what that code change is doing (a light bulb that flashed on about 2 A.M., of course), I'm wondering if this might just be exchanging one type of rare task failure ("instant" Invalid) for another (Error while computing - Error code 32), at least from a S@h perspective. (For MW, it probably really is a fix.)


Lol, when you reach that point, it's actually a pretty unique feeling, isn't it? Kind of relief that something's done, mixed with disappointment at the particular choice of chewing gum, bits of string, and duct tape to plug the holes.

Fingers crossed this raft never gets used on the open ocean.

[E.T. calls up via Arecibo just to ask "What the heck is that thing you're driving? Impressive! Don't build spaceships though!"]

Heh, my first reaction when I thought I understood why the truncations really were fixed was quite satisfying, the second reaction, after crawling back into bed, was to wonder whether the cure might simply introduce a different disease, not exactly a sleep-inducing thought.

Now, having just looked at that code block again with my non-C++ trained eyes, I'm wondering whether I've still misinterpreted what will happen if that 5-second grace period expires. Does the Sharing Violation actually end up causing that Error Code 32 to be reported, or does the normal code path simply resume, presumably still producing a truncated stderr.txt? I dunno, some expert is gonna hafta 'splain that to me! ;^)


Boinc committee next meeting, urgent matters: "Oh no! the Windows people have discovered process monitor!"

A lot becomes clearer if you go look at the scheduler authentication code. It makes this part look solid as... well, something really really solid.

[Edit:] sometimes not being trained in something can be an advantage too. It's easier to point out that it doesn't work, lol.


Funny, we have been through 10 or so code reviews and security audits in the last ten years by various companies. IBM (in-house, we have to go through a full audit every time they want a new branded client), Intel (through a third party), a bank or two, an oil company, and at least one hospital.

Your gripes appear to be more about aesthetics than how solid/stable something is. Don't confuse the two.
8) Message boards : Number crunching : Stderr Truncations (Message 1702054)
Posted 16 Jul 2015 by Rom Walton (BOINC)
Post:
[Edit:] Now that I know the reasoning, and that not much of it applies for specific applications at all, optional exit handling strategies & thread management become a lot easier to work in.


Honestly, the majority of scientific applications are single threaded applications anyway. Most scientists are not computer scientists, so multi-threaded applications are a rarity.

Things used to be pretty simple and straightforward. Windows Vista changed all that with the 5 second shutdown requirement. Now we have to be able to have BOINC and all of its child processes shut down within 5 seconds, when the bulk of the child applications are themselves single threaded. It causes quite a few contortions.

The biggest stumbling block I found related to threading was the fact that the pthread style threading libraries didn't have an equivalent of EnumThreads(). Yet the threading library used by Linux, Mac OS X, BSD, and Android is pthread.

You can make the Windows threading API conform to pthread much more easily than the reverse.
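
As a rough illustration of what a fixed shutdown budget means for a parent with child processes, here is a small Python sketch. It is not how the BOINC client does it; `shut_down_children` and its parameters are hypothetical, and it simply asks every child to stop, waits out whatever is left of the budget, and force-kills any stragglers.

```python
import subprocess
import time


def shut_down_children(children, deadline_seconds=5.0):
    """Ask each child to exit, then force-kill any still running at the deadline.

    `children` is a list of subprocess.Popen objects. Hypothetical helper,
    not BOINC code. Returns the children's exit codes.
    """
    deadline = time.monotonic() + deadline_seconds
    for child in children:
        # Polite request: SIGTERM on POSIX (on Windows this already terminates).
        child.terminate()
    for child in children:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            child.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            child.kill()  # out of time: stop waiting and force it
            child.wait()
    return [child.returncode for child in children]
```

The single shared deadline is the point of the sketch: the budget covers all children together, so a slow child eats into the time available to the rest.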
9) Message boards : Number crunching : Stderr Truncations (Message 1702016)
Posted 16 Jul 2015 by Rom Walton (BOINC)
Post:
Only problem there, is they decided (for unknown reasons) to use an asynchronous TerminateProcess() call and put a 1 second sleep and hard crash, instead of waiting on a synchronisation primitive.


Normal methods for exiting a process involve a risk of the process not shutting down at all.

Normally ExitProcess() (which exit() and _exit() end up calling) only kills the calling thread (or at least it did the last time I tested this, in 2006/2007).

Even the ExitProcess docs state:

If one of the terminated threads in the process holds a lock and the DLL detach code in one of the loaded DLLs attempts to acquire the same lock, then calling ExitProcess results in a deadlock. In contrast, if a process terminates by calling TerminateProcess, the DLLs that the process is attached to are not notified of the process termination. Therefore, if you do not know the state of all threads in your process, it is better to call TerminateProcess than ExitProcess. Note that returning from the main function of an application results in a call to ExitProcess.


Since an application can start and use any number of threads, boinc_exit() cannot assume to know the state of any thread outside the thread that called boinc_exit().

Hence, TerminateProcess().

The one second sleep is an attempt to give the Windows thread scheduler a chance to act on all the newly terminated threads. The majority case scenario is that all threads are halted and cleaned up during TerminateProcess(). Using WaitForSingleObject() on the process handle is problematic in that its state is unknown after a TerminateProcess call. Sleep probably does a WaitForSingleObject() against the thread handle though. In any case the overall objective is just to force the thread scheduler to clean things up.

DebugBreak() is for when all else fails, try to capture whatever is left in a debugger that is already attached to the running process.

No idea why the entire codebase seems to be allergic to synchronisation, and likes to use magic numbers (fixed time intervals) on a non-realtime OS.


While the threading model(s) on Windows are rather rich, pthread is not. So we code to the lowest common denominator. BOINC at its core is a single threaded application.
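
The "normal exit might never finish" risk is not unique to Windows, and a loose analogue is easy to show in Python. This is only an analogy, not the Win32 behaviour itself: Python's sys.exit() waits for any non-daemon threads, while os._exit() ends the process immediately without running cleanup. The helper name and the child script below are made up for the demonstration.

```python
import subprocess
import sys

# Child script: starts a worker thread that the exiting code knows nothing
# about and that never finishes, then exits using the supplied call.
CHILD_TEMPLATE = """
import os, sys, threading, time

threading.Thread(target=lambda: time.sleep(3600)).start()

{exit_call}
"""


def child_exits_within(exit_call, timeout=2.0):
    """Run a child that exits via `exit_call`; report whether it ended in time."""
    script = CHILD_TEMPLATE.format(exit_call=exit_call)
    try:
        subprocess.run([sys.executable, "-c", script], timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        return False  # subprocess.run has already killed the stuck child
```

Calling `child_exits_within("sys.exit(0)")` should come back False because the interpreter waits on the stray thread, while `child_exits_within("os._exit(0)")` should come back True. The trade-off is the same one described above: the forceful exit always finishes, but nothing gets a chance to clean up.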
10) Message boards : Number crunching : Bad News on BOINC funding (Message 1700064)
Posted 10 Jul 2015 by Rom Walton (BOINC)
Post:
You'll just have to email the PMC email list, which errm no one knows what it is. ;-)


When in doubt, send email to boinc_dev@ssl.berkeley.edu.

Overall things are not in that bad of shape. A new grant request has been submitted to the NSF based on feedback from various sources. We don't expect to hear anything back from NSF until early next year.

Until then, my deal with IBM/WCG gives me some room to continue some of my BOINC related work. I suspect most things will continue on auto-pilot for a while as far as general maintenance goes.
11) Questions and Answers : GPU applications : Boinc V7.0.64 Bug? (Message 1390977)
Posted 15 Jul 2013 by Rom Walton (BOINC)
Post:
This might be a bit simplistic, but it seems to me that if BOINC 7 recognises an RDP connection, it would work correctly if it remembered the state of the GPU at the connection time and resets it when the connection is broken.

So if the GPU was disabled it would remain disabled but if it was enabled it would be re-enabled.


Well, if I remember the situation correctly, before BOINC learned how to deal with remote connections, jobs were just randomly failing.

It turns out the identifier a science app uses to communicate with the GPU is invalidated once an RDP connection is established. The app would usually crash at the end of its run. That is why we opted to just shut down GPU apps when a remote session was established: there was no point continuing to run an app that was eventually going to crash and invalidate the work accomplished so far.

I'm actually quite surprised that you have a configuration that continues to work after an RDP connection has been established and released.

----- Rom
12) Questions and Answers : GPU applications : Boinc V7.0.64 Bug? (Message 1390968)
Posted 15 Jul 2013 by Rom Walton (BOINC)
Post:
I see. That would be a problem. Is this a remote thing or just the switch user option? Can you not tell the difference between a switch user and a remote connection or are they exactly the same thing? If they're not the same thing, can you tell when the original user switches back?

I wonder what happens in BOINC V6 if another user RDPs to my work PC.

If I remember, I'll give it a test next time I'm physically in the office. Of course only one user at a time can RDP to a PC. If another user tries, the first user is thrown out. At least that's how it worked a few years ago. So all BOINC V6 has to do is recognise the disconnection, connection in the right order to still work properly. Or does it go to sleep and can't catch those steps. Hmm! I can see the problems.

Rick


According to the APIs I've found so far, a switch user and a remote connection are the same thing. I haven't found a way to discern the two conditions programmatically yet.
13) Questions and Answers : GPU applications : Boinc V7.0.64 Bug? (Message 1389614)
Posted 10 Jul 2013 by Rom Walton (BOINC)
Post:
FYI, I have reverted to BOINC V6 and it does handle RDP in a well behaved way.

Below is an extract of the log showing me connecting, disconnecting 10 minutes later and then reconnecting.

If the way that BOINC V7 handles RDP is not a bug, it's certainly not an enhancement. I guess most users don't use RDP, but I have to.

Rick


-- After connecting
10-Jul-2013 19:22:39 [---] GPUs have become unusable; disabling tasks

10-Jul-2013 19:23:02 [SETI@home] update requested by user
10-Jul-2013 19:23:05 [SETI@home] Sending scheduler request: Requested by user.
10-Jul-2013 19:23:05 [SETI@home] Reporting 2 completed tasks, not requesting new tasks
10-Jul-2013 19:23:09 [SETI@home] Scheduler request completed

-- After disconnecting - Notice the AP GPU task restarting.
10-Jul-2013 19:33:14 [---] GPUs have become usable; enabling tasks
10-Jul-2013 19:33:16 [SETI@home] Restarting task ap_23fe09aa_B5_P1_00173_20130710_30000.wu_1 using astropulse_v6 version 606

-- After reconnecting
10-Jul-2013 19:34:19 [---] GPUs have become unusable; disabling tasks


It seems you already had stated some things related to the questions I had. I'm going to have to think about this a bit.

The basic problem is BOINC is detecting a condition (RDP is in use but disconnected) which does not appear to be a problem for you, but causes science applications to crash on home computers when they use Fast User Switching.

Fast User Switching/Remote Desktop are different names for the same technology.
14) Message boards : Number crunching : BOINC and Domain Controller (Message 967662)
Posted 2 Feb 2010 by Rom Walton (BOINC)
Post:
or novell network shares


Just so long as you're not still running IPX/SPX?? :-)

IPX/SPX had one huge advantage not shared by NetBIOS over TCP/IP: it's drop-dead simple.

Our "modern" windows networks were born as an IBM product designed for networks around five nodes, and have been continually kluged to make them "scale" to the size of an enterprise. "Browsers" to cut down on broadcasts, "Master Browsers" to cut down on browser traffic, Domain Controllers to layer better security, WINS to map NetBIOS names to IP addresses, then the DNS kluge to replace WINS (and put internal and external resolution into the same pile) and finally Active Directory.

All of that while IPX just worked.


If I remember my networking history correctly NetBIOS over TCP/IP is really just a hack.

In the beginning Windows used NetBEUI, and Browse Masters/Domain Controllers (pre-Active Directory) basically provided a mechanism for replicating computer names across logical Ethernet segments. NetBEUI networks were also drop-dead easy. The only requirement was a unique computer name.

NetBEUI wasn't a routable protocol. NetBEUI was also a very chatty protocol. I remember one installation where we had a thick net backbone and 100 nodes; network utilization at night (machines idle) was something like 15%. Thick net was a 10 Mbit network.

IPX/SPX was routable, and primarily used in Novell Netware environments.

Basically both Novell and Microsoft saw the writing on the wall with TCP/IP becoming the standard and changed directions.

Microsoft created WINS as a way to migrate name resolution of computer names from a NetBIOS/NetBEUI centric environment to the longer term DNS name resolution scheme.

I haven't tried lately, but I believe in the Active Directory/DNS world, you can do away with WINS. Both the UNC spec and the SMB/CIFS spec support DNS name resolution. The computer browser lists are handled via UDP I believe.
15) Questions and Answers : Windows : Mass Uninstall Help (Message 937182)
Posted 1 Oct 2009 by Rom Walton (BOINC)
Post:
Terminix, could you email me at rwalton @ ssl DOT berkeley DOT edu?

Depending on how everything was installed it might be as easy as deprecating the application within AD, then Windows will uninstall the app during the next Group Policy update.

Thanks in advance.
16) Questions and Answers : Preferences : Screensaver does not shut down (Message 885713)
Posted 16 Apr 2009 by Rom Walton (BOINC)
Post:
Could you try 6.6.23? It contained another screensaver fix for machines that have multiple video chipsets embedded within them, which was causing the screensaver code to get confused.
17) Questions and Answers : Windows : Bonic Never Suspends - I have activity set to run based on prefs (Message 794255)
Posted 7 Aug 2008 by Rom Walton (BOINC)
Post:
I believe I have figured out what the problem is and have posted a new build for those who want to give it a go.

Windows x86
Windows x64

Please post a message here if this fixes your keyboard and mouse activity detection issues.

Thanks in advance.
18) Message boards : Number crunching : Please, STOP wasting our CPU time! (Message 506900)
Posted 22 Jan 2007 by Rom Walton (BOINC)
Post:

I must say I'm not sure that there is any actual research being done at all anymore. It's not like we see any of it or hear anything about it.

Perhaps they did finish the project when the classic SETI ended and this is just a scheme to get money. I mean, they are sending out those e-mails begging for money, like any spam e-mail company.


Hmmmmmmm, so I guess getting that new multi-beam receiver, which is able to see further into space, was wasted effort. Or getting seti_enhanced online so the new data could be processed was wasted effort.

Sheesh.

Conspiracy theories are everywhere.

[rant mode on]

How can any advancement be made without somebody crying foul, they are taking away my credits, or they are taking too much time to process a workunit?

And before anybody jumps down my throat about the credits crack, think about it this way. The #1 complaint people had when the enhanced app came online was that it took so long. So to cut down on the amount of time it took to crunch, Eric and crew needed to optimize it. That in turn cut down on the processing time as well as the overall advantage the 3rd party optimized apps had over the stock app in terms of credit, even though the overall efficiency of the whole system was better for it.

Each time an advance is made, somebody has to claim foul play, because something doesn't work like it used to.

You ask for something better, and then when they deliver, you turn around and slap them down for it later. Damned if they do, damned if they don't.

[rant mode off]
19) Message boards : Number crunching : General Prefs. & Crunch3r BOINC 5.9.0.32 (Message 506865)
Posted 22 Jan 2007 by Rom Walton (BOINC)
Post:
I need to point out for people that the 5.9 client isn't anywhere near stable.

It is still a few months from even beginning the testing process.

We won't actually start to increment the version of the 5.9.x clients until we think we are done with the feature set and start the stabilization process. So if you are using 5.9 right now, you are using a very unpolished build. Expect that a few things don't work or are only half implemented.
20) Message boards : Number crunching : 64 bit BOINC client & BOINC MANAGER for Windows XP64 (Message 465428)
Posted 25 Nov 2006 by Rom Walton (BOINC)
Post:

@ Rom, We have an error that is affecting this XP64 Client and app. Could You please Help on this?


Crunch3r et al,

For those using VS 2005 you'll need to add some additional code to the S@H app which can be found here:
http://boinc.berkeley.edu/app_debug_win.php#Common0xc000000d

Microsoft redesigned a bunch of stuff to reduce buffer overflows and the like; one of the things they did was enforce parameter validation for the CRT functions used in writing portable applications.

So when you see a 0xc000000d error code with an application that was compiled using VS2005, that is part of the problem.



