Message boards :
Number crunching :
The Saga Begins (LotsaCores 2.0)
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
> The tradeoff, of course, is whatever overhead is involved with writing each checkpoint. I don't know what that amounts to, however.

The bigger (somewhat hidden) problem there is the number of points of failure, since the BOINC codebase assumes sequential activity in a timely fashion, then kills things. IOW the tradeoff is partly the overhead described, and partly whatever level of respect was previously offered and has since been sacrificed for the sake of completely ignoring computer science principles.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Jeff Buck · Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0
> The tradeoff, of course, is whatever overhead is involved with writing each checkpoint. I don't know what that amounts to, however.

I'm thinking somewhat less abstract ;^), and more in terms of how many milliseconds are used writing each checkpoint. With Al's rescheduling runs now happening once per hour, and checkpoints at 60 seconds, about 30 seconds per task out of every "task hour" is lost to reprocessing (about 0.8%). So, if 60 checkpoints are taken in that hour, what is the additional processing cost? Decreasing the checkpoint interval to 30 seconds would cut the lost processing in half, but how much would the additional checkpointing cost offset the savings in processing (disk thrashing notwithstanding)? (Actually, with 56 cores and 6 GPUs, disk thrashing is probably already a given on that host.)
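Jeff's back-of-the-envelope figures above can be sketched as a quick calculation. This is illustrative only; the 1-hour reschedule period and the 60- and 30-second checkpoint intervals come from his post, and the "half a checkpoint interval lost on average" assumption is the usual uniform-arrival argument, not anything BOINC guarantees:

```python
# Expected work lost per "task hour" when BOINC is restarted once per
# reschedule period: each running task rolls back, on average, half a
# checkpoint interval's worth of processing.

def lost_fraction(checkpoint_s, reschedule_s):
    """Average fraction of processing redone after each restart."""
    avg_rollback = checkpoint_s / 2      # seconds of work lost per restart
    return avg_rollback / reschedule_s   # as a fraction of the period

# Hourly rescheduling with 60-second checkpoints: ~30 s lost per hour (~0.8%).
print(lost_fraction(60, 3600))
# Halving the checkpoint interval halves the expected loss.
print(lost_fraction(30, 3600))
```

The open question in the post is whether the extra 60 (or 120) checkpoint writes per hour cost more than the reprocessing they save; the log timings further down suggest they don't.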
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
> The tradeoff, of course, is whatever overhead is involved with writing each checkpoint. I don't know what that amounts to, however.

Yeah, I agree. For the most part everything's cool. The problems come with 'us weirdos', who either are naturally weirdos, or happen to be having some issue making us temporary weirdos. Then you run into a situation where 'us weirdos' are normal. That works well if the clientele you're seeking aren't weirdos. You see, it's the whole approach of cherry-picking types of users and computers that I personally take issue with. Their assumptions of time exclude new things. I like new things.
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
> what is the additional processing cost? Decreasing the checkpoint interval to 30 seconds would cut the lost processing in half, but how much would the additional checkpointing cost offset the savings in processing (disk thrashing notwithstanding)?

If a Windows update happened to be occurring, and that included a .NET framework update, on a dual-core CPU: potentially about 3-5 days lost. Obviously not directly a BOINC issue, though something that might well and truly warrant consideration, given that SETI@home was promoted as a screensaver in the first place. Another angle might be that they have a crappy HDD ... Both examples easily blow past the naive and arbitrary choice of 10 seconds hardcoded in the BOINC code. [Edit: no options exposed in cc_config AFAIK]

[Edit:] I should point out that I have no real problem with time-based constraints, only that introducing failures based on the assumption that hosts are just crunching is borderline moronic.
Jeff Buck · Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0
Okaaaay! Moving on, then. ;^)

I just dug into one of my old Process Monitor logs and, assuming that boinc_task_state.xml is where the checkpoint is written, here are the log entries for a couple in different slots:

6:40:39.6115090 PM  boinc.exe  1484  CreateFile  C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\boinc_task_state.xml  SUCCESS  Desired Access: Generic Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Overwritten
6:40:39.6117095 PM  boinc.exe  1484  CreateFile  C:\Documents and Settings\All Users\Application Data\BOINC\slots\8  SUCCESS  Desired Access: Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Open For Backup, Attributes: N, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened
6:40:39.6117489 PM  boinc.exe  1484  CloseFile  C:\Documents and Settings\All Users\Application Data\BOINC\slots\8  SUCCESS
6:40:39.6309677 PM  boinc.exe  1484  WriteFile  C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\boinc_task_state.xml  SUCCESS  Offset: 0, Length: 508
6:40:39.6310223 PM  boinc.exe  1484  CloseFile  C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\boinc_task_state.xml  SUCCESS

6:40:41.1149725 PM  boinc.exe  1484  CreateFile  C:\Documents and Settings\All Users\Application Data\BOINC\slots\6\boinc_task_state.xml  SUCCESS  Desired Access: Generic Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Overwritten
6:40:41.1151886 PM  boinc.exe  1484  CreateFile  C:\Documents and Settings\All Users\Application Data\BOINC\slots\6  SUCCESS  Desired Access: Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Open For Backup, Attributes: N, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened
6:40:41.1152160 PM  boinc.exe  1484  CloseFile  C:\Documents and Settings\All Users\Application Data\BOINC\slots\6  SUCCESS
6:40:41.1156076 PM  boinc.exe  1484  WriteFile  C:\Documents and Settings\All Users\Application Data\BOINC\slots\6\boinc_task_state.xml  SUCCESS  Offset: 0, Length: 516
6:40:41.1156731 PM  boinc.exe  1484  CloseFile  C:\Documents and Settings\All Users\Application Data\BOINC\slots\6\boinc_task_state.xml  SUCCESS

It looks like the first one took 0.0195133 seconds, while the second took 0.0007006 seconds. Of course, that's elapsed time, not necessarily processing time, but still, multiplying that out by 60 checkpoints per hour would only represent a maximum of 1.170798 seconds per task hour in the first case and 0.042036 seconds per task hour in the second. (I hope I've figured that correctly.) In any event, that doesn't seem like a significant cost to processing.
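The elapsed-time arithmetic above is easy to check mechanically. A sketch (the two timestamps come straight from the slot 8 log entries; treating the first CreateFile and last CloseFile as the bounds of one checkpoint write, and the "%I:%M:%S.%f %p" parse format, are assumptions about Process Monitor's time column; the 7th sub-second digit is truncated because Python's %f takes at most six):

```python
from datetime import datetime

FMT = "%I:%M:%S.%f %p"  # assumed layout of the log's time-of-day column

def elapsed(start, end):
    """Elapsed seconds between two same-day timestamp strings."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()

# First CreateFile and final CloseFile for slots\8\boinc_task_state.xml,
# truncated to six sub-second digits.
per_checkpoint = elapsed("6:40:39.611509 PM", "6:40:39.631022 PM")
print(per_checkpoint)       # ~0.0195 s for one checkpoint write
print(per_checkpoint * 60)  # ~1.17 s per "task hour" at 60 checkpoints/hour
```

That reproduces the post's 1.170798 s/hour figure to within the truncated digit, which supports the conclusion that the write overhead is insignificant.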
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
Yep, so assuming no-one owns systems 1% the speed of yours (storage-wise), and no-one uses their computer for anything other than posting on SETI@home NC, we should be good :D
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13736 · Credit: 208,696,464 · RAC: 304
> Might this be part of the problem, trying to actively manage the SoG client, hence causing at least some of the plummeting RAC?

As Jeff pointed out, the errors are a result of exiting BOINC to run the rescheduler; the Manager has issues when things don't get done in the time it thinks they should be done. The plummeting Credit is purely due to the current work mix of mostly Guppie and very little Arecibo (and what Arecibo work there was, was very noisy in large chunks). For a while there we had the opposite situation of no Guppie and all Arecibo, which resulted in a much higher than usual amount of Credit being available. When the work mix evens out again, and stays that way for a month or so, then you'll find out just what a "normal" RAC number is for your systems.

I'd suggest running the rescheduler based on your cache size. If your cache is good for 24 hours, run the rescheduler every 12 hours. If it's good for only 6 hours, run it every 3 hours. See how that goes. If necessary, run it more often. If there's still plenty of work for both CPU and GPU, then run it less often. That will help reduce the errors resulting from the Manager's poor behaviour.

Grant
Darwin NT
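Grant's half-the-cache rule of thumb above is trivial arithmetic, but written out (purely illustrative; the helper name is made up, and the rule itself is advice from the post, not anything enforced by BOINC or the rescheduler):

```python
def rescheduler_interval_hours(cache_hours):
    """Grant's rule of thumb: run the rescheduler at half the cache depth,
    then adjust up or down based on how much CPU/GPU work remains."""
    return cache_hours / 2

print(rescheduler_interval_hours(24))  # 24-hour cache -> run every 12 hours
print(rescheduler_interval_hours(6))   # 6-hour cache  -> run every 3 hours
```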
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874
> ... assuming that boinc_task_state.xml is where the checkpoint is written ...

Why not have a look inside it? boinc_task_state.xml is just 10 lines of text, with only time, memory, and disk usage values. state.sah is the one which contains ~180 lines of science, including the best signals found so far.
Jeff Buck · Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0
> ... assuming that boinc_task_state.xml is where the checkpoint is written ...

Well, I actually did look inside a boinc_task_state.xml and, once I saw the two checkpoint lines, made the rash assumption that they recorded the latest checkpoint. I also found, in the Process Monitor log that I was looking at, that the boinc_task_state files tended to be written in groups, spaced approximately by the checkpoint interval.

So now I'm a bit confused. In looking through the same Process Monitor log, the state.sah files seem to be referenced only sporadically, not at any specific interval, and apparently only for Read access (which, in itself, makes no sense), to wit:

6:53:13.1178805 PM  boinc.exe  1484  QueryDirectory  C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\state.sah  SUCCESS  Filter: state.sah, 1: state.sah
6:53:13.1178988 PM  boinc.exe  1484  CloseFile  C:\Documents and Settings\All Users\Application Data\BOINC\slots\2  SUCCESS
6:53:13.1180423 PM  boinc.exe  1484  CreateFile  C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\state.sah  SUCCESS  Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
6:53:13.1180629 PM  boinc.exe  1484  QueryAttributeTagFile  C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\state.sah  SUCCESS  Attributes: A, ReparseTag: 0x0
6:53:13.1180710 PM  boinc.exe  1484  SetDispositionInformationFile  C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\state.sah  SUCCESS  Delete: True
6:53:13.1180820 PM  boinc.exe  1484  CloseFile  C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\state.sah  SUCCESS

If I remember correctly from a while back, during one of our bug hunting episodes, you had me turn on a config flag which resulted, as a side effect, in lots of checkpoint messages being written to the Event Log. I'm pretty sure they always appeared in groups, one for each running task, and at approximately the specified checkpoint interval. So, I guess I'll have to do some more digging to try to understand just what "checkpointing" means in a BOINC context, 'cause something's missing from the puzzle for me. :^)

EDIT: Ah, I think I see the issue with only Read access for state.sah files showing up in the Process Monitor log. This particular log was limited to boinc.exe actions. I'll have to dig around and see if I've got one tucked away that includes science app activity, too.

EDIT2: Okay, then. The apps do, indeed, appear to write the state.sah files at approximately the checkpoint-specified intervals. A couple of examples:

6:43:36.6860645 PM  AKv8c_r2549_winx86_SSE3xjfs.exe  3280  CreateFile  C:\Documents and Settings\All Users\Application Data\BOINC\slots\12\state.sah  SUCCESS  Desired Access: Generic Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Overwritten
6:43:36.6870820 PM  AKv8c_r2549_winx86_SSE3xjfs.exe  3280  WriteFile  C:\Documents and Settings\All Users\Application Data\BOINC\slots\12\state.sah  SUCCESS  Offset: 0, Length: 3,437
6:43:36.6872363 PM  AKv8c_r2549_winx86_SSE3xjfs.exe  3280  CloseFile  C:\Documents and Settings\All Users\Application Data\BOINC\slots\12\state.sah  SUCCESS

6:43:36.7224494 PM  Lunatics_x41zc_win32_cuda50.exe  4020  CreateFile  C:\Documents and Settings\All Users\Application Data\BOINC\slots\3\state.sah  SUCCESS  Desired Access: Generic Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Overwritten
6:43:36.7235162 PM  Lunatics_x41zc_win32_cuda50.exe  4020  WriteFile  C:\Documents and Settings\All Users\Application Data\BOINC\slots\3\state.sah  SUCCESS  Offset: 0, Length: 3,632
6:43:36.7236086 PM  Lunatics_x41zc_win32_cuda50.exe  4020  CloseFile  C:\Documents and Settings\All Users\Application Data\BOINC\slots\3\state.sah  SUCCESS

I'll leave the math for the elapsed time of each to others. They look close enough to my earlier figures for the boinc_task_state files that the conclusions would be pretty much the same. The overhead is insignificant.

EDIT3: And just to be clear, both the state.sah and boinc_task_state.xml files are written at the checkpoints for each task.
Al · Joined: 3 Apr 99 · Posts: 1682 · Credit: 477,343,364 · RAC: 482
> I'd suggest running the rescheduler based on your cache size. If your cache is good for 24 hours, run the rescheduler every 12 hours. If it's good for only 6 hours, run it every 3 hours.

Well, on this machine, my cache is good for about 2 1/2 to theoretically 4 hours, depending on how far along the first chunk of 36 tasks is in processing (down from 48, as I am saving one core for each of the 12 GPU tasks running across 3 video cards; is that still considered good practice?). Taking a quick look at the tasks in progress right now, it appears that each one is averaging 2-2 1/2 hours.
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
You need to figure out what your GPU task pool turnover interval is. Since Qopt works on a one-to-one swap basis, the GPU task refreshment is what should decide how often you run the Qopt program. And since the Manager processes tasks in FIFO order, as long as you run the program before the very oldest GPU task in the turnover pool finishes, you should keep the VLARs off the GPUs. Of course, that logic only works if there is an equal number of non-VLARs in the CPU pool for each Qopt run. But that should determine the optimal average time interval for running the optimization routine. It's never going to be perfect, and you will run VLARs on the GPUs sometimes, because the mix of tasks coming from the project varies so much. Witness the last week we've had, with almost no Arecibo tasks available, or only noisy, fast-finishing work.

Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
Al · Joined: 3 Apr 99 · Posts: 1682 · Credit: 477,343,364 · RAC: 482
Looking at the current set of tasks, it looks like the 980s are running at about 45 mins/task (4 tasks concurrently), and the 1080 is running at 25-30 minutes. I saw that 3/4 of them were running VLARs when I just now looked, so I thought about backing it down to 45 minutes, but that isn't an option in Task Scheduler, only 30 or 60 mins, and you can't put in a value manually; I tried. So, back to every 30 mins and see what happens. But you're right, of course, it all depends on what the project doles out.
BilBg · Joined: 27 May 07 · Posts: 3720 · Credit: 9,385,827 · RAC: 0
> With so many concurrent tasks running on that box, every time BOINC is shut down during a rescheduling run, you run the risk of catching one or more tasks in their shutdown phase, where the finish file has already been written by the app, but before BOINC has checked its existence.

This is an idea for the author of QOpt: tell Jimbocous to just delete all the boinc_finish_called files found in BOINC-Data\slots\* before restarting BOINC.

- ALF - "Find out what you don't do well ..... then don't do it!" :)
Jeff Buck · Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0
> With so many concurrent tasks running on that box, every time BOINC is shut down during a rescheduling run, you run the risk of catching one or more tasks in their shutdown phase, where the finish file has already been written by the app, but before BOINC has checked its existence.

That's an excellent idea! I used to do that manually back when we had the Zombie tasks problem with APs. Never did set anything up to do it automatically, though.
BilBg · Joined: 27 May 07 · Posts: 3720 · Credit: 9,385,827 · RAC: 0
If you are in the BOINC Data directory this works (I put some boinc_finish_called files in empty slots to test; the command is in a .CMD file):

Del /S slots\boinc_finish_called

h:\BOINC-Data>Del /S slots\boinc_finish_called
Deleted file - h:\BOINC-Data\slots\boinc_finish_called
Deleted file - h:\BOINC-Data\slots\3\boinc_finish_called
Deleted file - h:\BOINC-Data\slots\4\boinc_finish_called
Deleted file - h:\BOINC-Data\slots\7\boinc_finish_called
Jimbocous · Joined: 1 Apr 13 · Posts: 1853 · Credit: 268,616,081 · RAC: 1,349
> With so many concurrent tasks running on that box, every time BOINC is shut down during a rescheduling run, you run the risk of catching one or more tasks in their shutdown phase, where the finish file has already been written by the app, but before BOINC has checked its existence.

Just discovered this thread thanks to Stephen. Guess I should have been here for a while, eh? Let me go back, reread, and see if I can pick up the thread here, and if something is needed I can sure do it. If I understand the issue, it's possible to do a shutdown while file clean-up is still happening? If so, would it not be better to just check for the presence of a boinc_finish_called file and, if one exists, loop until it's gone before commencing the actual client shutdown? I'm assuming these files don't stick around long. Obviously it would be easy to check on startup as well, but maybe less disruptive? I need to better understand the structure.
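The wait-before-shutdown idea above might look something like this. A sketch only: the function name, slots path, timeout, and poll interval are all placeholders, and it assumes the rescheduler can see the BOINC data directory before it stops the client:

```python
import glob
import os
import time

def wait_for_finish_files(slots_dir, timeout=30.0, poll=0.5):
    """Delay client shutdown while any slot still has a fresh finish file.

    Returns True if every boinc_finish_called file disappeared (i.e. BOINC
    noticed and handled it) before the timeout, False otherwise.
    """
    pattern = os.path.join(slots_dir, "*", "boinc_finish_called")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not glob.glob(pattern):
            return True       # no task is mid-handoff; safe to shut down
        time.sleep(poll)      # give BOINC a moment to pick the file up
    return False              # timed out; fall back to post-shutdown cleanup
```

A False return would be the cue to use the delete-on-restart cleanup discussed later in the thread instead.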
Jimbocous · Joined: 1 Apr 13 · Posts: 1853 · Credit: 268,616,081 · RAC: 1,349
> Secondly, yes, the frequency of your rescheduling could have an impact, both on your RAC and on those "finish file present too long" errors you're getting. With so many concurrent tasks running on that box, every time BOINC is shut down during a rescheduling run, you run the risk of catching one or more tasks in their shutdown phase, where the finish file has already been written by the app, but before BOINC has checked its existence. When BOINC comes back up, more than 10 seconds have passed and the error is generated.

Do we have any idea what the persistence of that file is?

Agreed. For scheduling, I've been testing at 2 hrs between runs, but I think the most effective would be somewhere around 4-6 hours. Bottom line is, if you're getting runs where nothing gets done, it's clearly being run too often. I haven't been seeing any "finish file present too long" errors at all. I would suggest trying to determine how long a WU takes to get through from receipt to completion, and setting the reschedule interval at about half that. That should be more than adequate.
Jeff Buck · Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0
> With so many concurrent tasks running on that box, every time BOINC is shut down during a rescheduling run, you run the risk of catching one or more tasks in their shutdown phase, where the finish file has already been written by the app, but before BOINC has checked its existence.

You can find a lot of discussion over the last few years in Number Crunching about the "finish file present too long" error, but here's a synopsis.

A "boinc_finish_called" file is an empty file that's written to a task's slot directory by the science application at (or near) the end of processing for the task. In the Stderr, you'll see a line near the end such as "21:11:53 (3088): called boinc_finish(0)". This file is kind of a squirrelly way to communicate to BOINC that processing has completed. BOINC repeatedly checks the slot directories for the presence of the file. Unfortunately, BOINC expects that the file will exist for no more than about 10 seconds. If it's been there longer, for some dumb|insane|idiotic reason, BOINC thinks something is wrong and essentially discards the task with the "finish file present too long" error.

There are a number of reasons why there might be a delay of more than 10 seconds between the writing of the finish file and BOINC discovering its existence. A few months back, I think I listed at least a half dozen reasons that had likely triggered the problem on my own systems. One of those is simply BOINC and/or the OS shutting down or crashing immediately after a finish file is written. Obviously, by the time BOINC restarts, more than 10 seconds have passed. What's happening recently, with the emphasis on VLAR rescheduling, etc., is that BOINC is getting shut down on some systems, like Al's, much more frequently than it ever would have in the past, thus increasing the likelihood of a task getting caught with its finish file hanging out, especially with a multi-core, multi-GPU system.

The solution is simply, while BOINC is shut down, to delete any and all "boinc_finish_called" files found in the slot directories. There's no downside that I know of to doing this. When BOINC restarts, any tasks which had previously written a finish file will be restarted, usually at or close to 100%, "finish" a second time, and simply write a new "boinc_finish_called" file for BOINC to find. BilBg's suggestion is just to include a command to automate that slot cleanup as part of the rescheduler or the front end at some point between BOINC shutdown and restart. I think it's an excellent idea.
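BilBg's one-line CMD cleanup translates directly. A cross-platform sketch of the same idea (the function name and data-directory argument are placeholders; like the Del /S command, it must only be run while BOINC is stopped, for the reasons given above, and unlike Del /S it only looks inside the individual slot subdirectories):

```python
import glob
import os

def clear_finish_files(data_dir):
    """Delete leftover boinc_finish_called files from all slot directories.

    Per the discussion above this is safe: on restart, any affected task
    re-runs its final moments from the last checkpoint and writes a fresh
    finish file for BOINC to find.
    """
    removed = []
    pattern = os.path.join(data_dir, "slots", "*", "boinc_finish_called")
    for path in glob.glob(pattern):
        os.remove(path)
        removed.append(path)   # keep a list of deletions for logging
    return removed
```

A rescheduler front end would call this between stopping and restarting the client, e.g. `clear_finish_files(r"C:\ProgramData\BOINC")` (path is an example only).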
Jimbocous · Joined: 1 Apr 13 · Posts: 1853 · Credit: 268,616,081 · RAC: 1,349
> If you are in the BOINC Data directory this works (I put some boinc_finish_called files in empty slots to test; the command is in a .CMD file):

Wish they were all that easy. Incorporated into 1.02d. Thanks!
Jimbocous · Joined: 1 Apr 13 · Posts: 1853 · Credit: 268,616,081 · RAC: 1,349
> If you are in the BOINC Data directory this works (I put some boinc_finish_called files in empty slots to test; the command is in a .CMD file):

Also a couple of other issues resolved, and PMs are out to all testers with a fresh link to the 1.02d update. Thanks for the help, and the heads-up on issues!
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.