The Saga Begins (LotsaCores 2.0)

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1819421 - Posted: 24 Sep 2016, 19:44:01 UTC - in response to Message 1819419.  

The tradeoff, of course, is whatever overhead is involved with writing each checkpoint. I don't know what that amounts to, however.


The bigger (somewhat hidden) problem there is the number of points of failure, since the Boinc codebase assumes sequential activity in a timely fashion, then kills things. IOW the tradeoff is partly the overhead described, and partly whatever level of respect was previously offered and has since been sacrificed for the sake of completely ignoring computer science principles.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1819421
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1819427 - Posted: 24 Sep 2016, 20:21:17 UTC - in response to Message 1819421.  

The tradeoff, of course, is whatever overhead is involved with writing each checkpoint. I don't know what that amounts to, however.


The bigger (somewhat hidden) problem there is the number of points of failure, since the Boinc codebase assumes sequential activity in a timely fashion, then kills things. IOW the tradeoff is partly the overhead described, and partly whatever level of respect was previously offered and has since been sacrificed for the sake of completely ignoring computer science principles.

I'm thinking somewhat less abstract ;^), and more in terms of how many milliseconds are used writing each checkpoint. With Al's rescheduling runs now happening once per hour, and checkpoints at 60 seconds, that means that about 30 seconds per task out of every "task hour" is lost to reprocessing (about 0.8%). So, if 60 checkpoints are taken in that hour, what is the additional processing cost? Decreasing the checkpoint interval to 30 seconds would cut the lost processing in half, but how much would the additional checkpointing cost offset the savings in processing (disk thrashing notwithstanding). (Actually, with 56 cores and 6 GPUs, disk thrashing is probably already a given on that host.)
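A back-of-the-envelope sketch of that arithmetic, in Python (the once-per-hour interruption matches the rescheduling runs above; the 0.02 s write cost is just an assumed placeholder until someone measures it):

# Rough model of the checkpoint tradeoff: each interruption redoes, on
# average, half a checkpoint interval of work, while every checkpoint
# costs some small write time of its own.
def lost_fraction(checkpoint_s, interrupts_per_hour=1, write_cost_s=0.02):
    reprocess = interrupts_per_hour * checkpoint_s / 2.0   # work redone
    writes = (3600.0 / checkpoint_s) * write_cost_s        # checkpoint writes
    return (reprocess + writes) / 3600.0

for interval in (30, 60, 120):
    print(f"{interval:4d} s checkpoints -> {lost_fraction(interval):.3%} of a task-hour lost")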
ID: 1819427
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1819438 - Posted: 24 Sep 2016, 20:41:19 UTC - in response to Message 1819427.  

The tradeoff, of course, is whatever overhead is involved with writing each checkpoint. I don't know what that amounts to, however.


The bigger (somewhat hidden) problem there is the number of points of failure, since the Boinc codebase assumes sequential activity in a timely fashion, then kills things. IOW the tradeoff is partly the overhead described, and partly whatever level of respect was previously offered and has since been sacrificed for the sake of completely ignoring computer science principles.

I'm thinking somewhat less abstract ;^), and more in terms of how many milliseconds are used writing each checkpoint. With Al's rescheduling runs now happening once per hour, and checkpoints at 60 seconds, that means that about 30 seconds per task out of every "task hour" is lost to reprocessing (about 0.8%). So, if 60 checkpoints are taken in that hour, what is the additional processing cost? Decreasing the checkpoint interval to 30 seconds would cut the lost processing in half, but how much would the additional checkpointing cost offset the savings in processing (disk thrashing notwithstanding). (Actually, with 56 cores and 6 GPUs, disk thrashing is probably already a given on that host.)


Yeah I agree. For the most part everything's cool. The problems come with 'us weirdos', who either are naturally weirdos, or happen to be having some issue making us temporary weirdos. Then you run into a situation where 'us weirdos' are normal. That works well if the clientele you're seeking aren't weirdos.

You see, it's the whole approach of cherry-picking types of users and computers that I personally take issue with. Their assumptions about time exclude new things. I like new things.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1819438
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1819440 - Posted: 24 Sep 2016, 21:01:14 UTC - in response to Message 1819438.  
Last modified: 24 Sep 2016, 21:13:31 UTC

what is the additional processing cost? Decreasing the checkpoint interval to 30 seconds would cut the lost processing in half, but how much would the additional checkpointing cost offset the savings in processing (disk thrashing notwithstanding).


If a Windows update happened to be occurring, and that included a .NET framework update, on a dual core CPU, potentially about 3-5 days could be lost. Obviously not directly a Boinc issue, though something that might well and truly warrant consideration, given that seti@home was promoted as a screensaver in the first place. Another angle might be that they have a crappy HDD ... Both examples easily blow past the naive and arbitrary choice of 10 seconds hardcoded in the Boinc code. [Edit: no options exposed in cc_config AFAIK]

[Edit:] I should point out I have no real problem with time based constraints, only that introducing failures based on assumptions that hosts are just crunching is borderline moronic.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1819440
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1819447 - Posted: 24 Sep 2016, 21:33:44 UTC

Okaaaay! Moving on, then. ;^)

I just dug into one of my old Process Monitor logs and, assuming that boinc_task_state.xml is where the checkpoint is written, here are the log entries for a couple in different slots:
6:40:39.6115090 PM	boinc.exe	1484	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\boinc_task_state.xml	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Overwritten
6:40:39.6117095 PM	boinc.exe	1484	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8	SUCCESS	Desired Access: Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Open For Backup, Attributes: N, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened
6:40:39.6117489 PM	boinc.exe	1484	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8	SUCCESS	
6:40:39.6309677 PM	boinc.exe	1484	WriteFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\boinc_task_state.xml	SUCCESS	Offset: 0, Length: 508
6:40:39.6310223 PM	boinc.exe	1484	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\boinc_task_state.xml	SUCCESS	

6:40:41.1149725 PM	boinc.exe	1484	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\6\boinc_task_state.xml	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Overwritten
6:40:41.1151886 PM	boinc.exe	1484	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\6	SUCCESS	Desired Access: Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Open For Backup, Attributes: N, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened
6:40:41.1152160 PM	boinc.exe	1484	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\6	SUCCESS	
6:40:41.1156076 PM	boinc.exe	1484	WriteFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\6\boinc_task_state.xml	SUCCESS	Offset: 0, Length: 516
6:40:41.1156731 PM	boinc.exe	1484	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\6\boinc_task_state.xml	SUCCESS	

It looks like the first one took 0.0195133 seconds, while the second took 0.0007006 seconds. Of course, that's elapsed time, not necessarily processing time, but still, multiplying that out by 60 checkpoints per hour would only represent a maximum of 1.170798 seconds per task hour in the first case and 0.042036 seconds per task hour in the second. (I hope I've figured that correctly.) In any event, that doesn't seem like a significant cost to processing.
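For what it's worth, the same multiplication as a quick Python check, using the two elapsed times measured above:

# Per-task-hour cost of checkpoint writes from the two ProcMon samples,
# assuming 60 checkpoints per hour at a 60-second interval.
samples = {"slot 8": 0.0195133, "slot 6": 0.0007006}  # elapsed seconds per write
for slot, elapsed in samples.items():
    per_hour = elapsed * 60
    print(f"{slot}: {per_hour:.6f} s per task-hour ({per_hour / 3600:.5%})")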
ID: 1819447
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1819452 - Posted: 24 Sep 2016, 21:45:35 UTC - in response to Message 1819447.  

Yep, so assuming no-one owns systems 1% the speed of yours (storage wise), and no-one uses their computer for anything other than posting on Seti@home NC, we should be good :D
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1819452
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1819457 - Posted: 24 Sep 2016, 22:07:50 UTC - in response to Message 1819327.  

Might this be part of the problem, trying to actively manage the SoG client, hence causing at least some of the plummeting RAC?

As Jeff pointed out, the errors are a result of exiting BOINC to run the rescheduler - the Manager has issues when things don't get done in the time it thinks they should be done.

The plummeting Credit is purely due to the recent work mix of mostly Guppie & very little Arecibo (and what Arecibo work there was, was very noisy in large chunks). For a while there we had the opposite situation of no Guppie & all Arecibo, which resulted in a much higher than usual amount of Credit being available.
When the work mix evens out again, and stays that way for a month or so, then you'll find out just what a "normal" RAC number is for your systems.


I'd suggest running the rescheduler based on your cache size. If your cache is good for 24 hours, run the rescheduler every 12 hours. If it's good for only 6 hours, run it every 3 hours.
See how that goes. If necessary, run it more often. If there's still plenty of work for both CPU & GPU then run it less often. That will help reduce the errors resulting from the Manager's poor behaviour.
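Spelled out as a trivial Python sketch (cache_hours is whatever you estimate for your own host):

# Grant's rule of thumb: run the rescheduler at half the time the cache
# is good for.
def reschedule_interval_hours(cache_hours):
    return cache_hours / 2.0

for cache in (24, 6):
    print(f"cache good for {cache} h -> run the rescheduler every {reschedule_interval_hours(cache):g} h")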
Grant
Darwin NT
ID: 1819457
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1819463 - Posted: 24 Sep 2016, 22:16:17 UTC - in response to Message 1819447.  

... assuming that boinc_task_state.xml is where the checkpoint is written ...

Why not have a look inside it?

boinc_task_state.xml is just 10 lines of text, with just time, memory, and disk usage values.

state.sah is the one which contains ~180 lines of science, including the best signals found so far.
ID: 1819463
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1819477 - Posted: 24 Sep 2016, 23:16:02 UTC - in response to Message 1819463.  
Last modified: 25 Sep 2016, 0:15:22 UTC

... assuming that boinc_task_state.xml is where the checkpoint is written ...

Why not have a look inside it?

boinc_task_state.xml is just 10 lines of text, with just time, memory, and disk usage values.

state.sah is the one which contains ~180 lines of science, including the best signals found so far.

Well, I actually did look inside a boinc_task_state.xml and, once I saw the two checkpoint lines, made the rash assumption that they recorded the latest checkpoint. I also found, in the Process Monitor log that I was looking at, that the boinc_task_state files tended to be written in groups, spaced approximately by the checkpoint interval.

So now I'm a bit confused. In looking through the same Process Monitor log, the state.sah files seem to be referenced only sporadically, not at any specific interval, and apparently only for Read access (which, in itself, makes no sense), to wit:
6:53:13.1178805 PM	boinc.exe	1484	QueryDirectory	C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\state.sah	SUCCESS	Filter: state.sah, 1: state.sah
6:53:13.1178988 PM	boinc.exe	1484	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\2	SUCCESS	
6:53:13.1180423 PM	boinc.exe	1484	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\state.sah	SUCCESS	Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
6:53:13.1180629 PM	boinc.exe	1484	QueryAttributeTagFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\state.sah	SUCCESS	Attributes: A, ReparseTag: 0x0
6:53:13.1180710 PM	boinc.exe	1484	SetDispositionInformationFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\state.sah	SUCCESS	Delete: True
6:53:13.1180820 PM	boinc.exe	1484	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\state.sah	SUCCESS

If I remember correctly from a while back, during one of our bug hunting episodes, you had me turn on a config flag which resulted, as a side effect, in lots of checkpoint messages being written to the Event Log. I'm pretty sure they always appeared in groups, one for each running task, and at approximately the specified checkpoint interval. So, I guess I'll have to do some more digging to try to understand just what "checkpointing" means in a BOINC context, 'cause something's missing from the puzzle for me. :^)

EDIT: Ah, I think I see the issue with only Read access for state.sah files showing up in the Process Monitor log. This particular log was limited to boinc.exe actions. I'll have to dig around and see if I've got one tucked away that includes science app activity, too.

EDIT2: Okay, then. The apps do, indeed, appear to write the state.sah files at approximately the checkpoint-specified intervals. A couple examples:
6:43:36.6860645 PM	AKv8c_r2549_winx86_SSE3xjfs.exe	3280	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\12\state.sah	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Overwritten
6:43:36.6870820 PM	AKv8c_r2549_winx86_SSE3xjfs.exe	3280	WriteFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\12\state.sah	SUCCESS	Offset: 0, Length: 3,437
6:43:36.6872363 PM	AKv8c_r2549_winx86_SSE3xjfs.exe	3280	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\12\state.sah	SUCCESS	
6:43:36.7224494 PM	Lunatics_x41zc_win32_cuda50.exe	4020	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\3\state.sah	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Overwritten
6:43:36.7235162 PM	Lunatics_x41zc_win32_cuda50.exe	4020	WriteFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\3\state.sah	SUCCESS	Offset: 0, Length: 3,632
6:43:36.7236086 PM	Lunatics_x41zc_win32_cuda50.exe	4020	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\3\state.sah	SUCCESS	

I'll leave the math for the elapsed time for each to others. They look close enough to my earlier figures for the boinc_task_state files that the conclusions would be pretty much the same. The overhead is insignificant.

EDIT3: And just to be clear, both the state.sah and boinc_task_state.xml files are written at the checkpoints for each task.
ID: 1819477
Al
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1819894 - Posted: 26 Sep 2016, 12:41:26 UTC - in response to Message 1819457.  

I'd suggest running the rescheduler based on your cache size. If your cache is good for 24 hours, run the rescheduler every 12 hours. If it's good for only 6 hours, run it every 3 hours.
See how that goes. If necessary, run it more often. If there's still plenty of work for both CPU & GPU then run it less often. That will help reduce the errors resulting from the Manager's poor behaviour.

Well, on this machine, my cache is good for about 2 1/2 to theoretically 4 hours, depending on how far along the first chunk of 36 tasks is in processing (down from 48, as I am saving one core for each of the 12 GPU tasks running across 3 video cards; is that still considered good practice?). Taking a quick look at the tasks in progress right now, it appears that each one is averaging 2-2 1/2 hours.

ID: 1819894
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1819927 - Posted: 26 Sep 2016, 16:27:42 UTC - in response to Message 1819894.  

You need to figure out what your GPU task pool turnover interval is. Since Qopt works on a one-to-one swap basis, the GPU task refreshment is what should decide how often you run the Qopt program. And since the Manager processes tasks in FIFO order, as long as you run the program before the very oldest GPU task in the GPU task turnover pool finishes, you should keep the VLARs off the GPUs. Of course, that logic only works if there is an equal number of non-VLARs in the CPU pool for each Qopt run. But that should determine the optimal average time interval for running the optimization routine. It's never going to be perfect, and you will sometimes run VLARs on the GPUs because the mix of tasks coming from the project varies so much. Witness the last week, when almost no Arecibo tasks were available, and what little there was consisted of noisy, fast-finishing work.
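One way to put a rough number on that, as a heavily simplified model (this is just one reading of the turnover idea; every name and figure here is hypothetical): new VLARs land at the back of the FIFO GPU queue, so Qopt has to run again before the queue ahead of them drains.

# Upper bound on the Qopt run interval: the time for the queued GPU pool
# to drain at the current completion rate.
def max_qopt_interval_minutes(queued_gpu_tasks, task_minutes, concurrent):
    return queued_gpu_tasks * task_minutes / concurrent

# e.g. 20 queued GPU tasks, 45-minute tasks, 4 running at once -> ~3.75 h
print(max_qopt_interval_minutes(20, 45, 4) / 60)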
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1819927
Al
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1819934 - Posted: 26 Sep 2016, 16:51:26 UTC - in response to Message 1819927.  

Looking at the current set of tasks, it looks like the 980s are running at about 45 mins/task (4 tasks concurrently), and the 1080 is running at 25-30 minutes. I saw that 3/4 of them were running VLARs when I just now looked, so I thought about backing it down to 45 minutes, but that isn't an option in Task Scheduler, only 30 or 60 mins, and you can't put in a value manually (I tried). So, back to every 30 mins and see what happens. But you're right of course, it all depends on what the project doles out.

ID: 1819934
BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1819966 - Posted: 26 Sep 2016, 20:24:27 UTC - in response to Message 1819360.  

With so many concurrent tasks running on that box, every time BOINC is shut down during a rescheduling run, you run the risk of catching one or more tasks in their shutdown phase, where the finish file has already been written by the app, but before BOINC has checked its existence.

This is an idea for the author of QOpt:
Tell Jimbocous to just delete all the boinc_finish_called files found in BOINC-Data\slots\* before restarting BOINC

- ALF - "Find out what you don't do well ..... then don't do it!" :)
ID: 1819966
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1819982 - Posted: 26 Sep 2016, 21:13:19 UTC - in response to Message 1819966.  

With so many concurrent tasks running on that box, every time BOINC is shut down during a rescheduling run, you run the risk of catching one or more tasks in their shutdown phase, where the finish file has already been written by the app, but before BOINC has checked its existence.

This is an idea for the author of QOpt:
Tell Jimbocous to just delete all the boinc_finish_called files found in BOINC-Data\slots\* before restarting BOINC

That's an excellent idea! I used to do that manually back when we had the Zombie tasks problem with APs. Never did set anything up to do it automatically, though.
ID: 1819982
BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1819989 - Posted: 26 Sep 2016, 21:58:59 UTC - in response to Message 1819982.  

If you are in the BOINC Data directory this works (I put some boinc_finish_called files in empty slots to test; the command is in a .CMD file):
Del /S slots\boinc_finish_called


h:\BOINC-Data>Del /S slots\boinc_finish_called
Deleted file - h:\BOINC-Data\slots\boinc_finish_called
Deleted file - h:\BOINC-Data\slots\3\boinc_finish_called
Deleted file - h:\BOINC-Data\slots\4\boinc_finish_called
Deleted file - h:\BOINC-Data\slots\7\boinc_finish_called

- ALF - "Find out what you don't do well ..... then don't do it!" :)
ID: 1819989
Jimbocous
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1820103 - Posted: 27 Sep 2016, 4:00:02 UTC - in response to Message 1819966.  

With so many concurrent tasks running on that box, every time BOINC is shut down during a rescheduling run, you run the risk of catching one or more tasks in their shutdown phase, where the finish file has already been written by the app, but before BOINC has checked its existence.

This is an idea for the author of QOpt:
Tell Jimbocous to just delete all the boinc_finish_called files found in BOINC-Data\slots\* before restarting BOINC

Just discovered this thread thanks to Stephen. Guess I should have been here for a while, eh? Let me go back, reread and see if I can pick up the thread here, and if something is needed I can sure do it. If I understand the issue, it's possible to do a shutdown while file clean-up is still happening? If so, would it not be better to just check for presence of a boinc_finish_called file and, if one exists, loop until it's gone before commencing actual client shutdown? I'm assuming these files don't stick around long. Obviously easy to check on startup as well, but maybe less disruptive?
I need to better understand the structure.
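A sketch of that loop-until-gone idea in Python (the slots path and the timeout are placeholders; adjust both for the actual host):

# Poll the slot directories and only proceed with the BOINC shutdown once
# no boinc_finish_called file remains, or a timeout expires.
import glob, os, time

SLOTS = r"C:\ProgramData\BOINC\slots"  # adjust to your BOINC data directory

def wait_for_finish_files(timeout_s=30, poll_s=1):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if not glob.glob(os.path.join(SLOTS, "*", "boinc_finish_called")):
            return True   # nothing pending; safe to shut BOINC down
        time.sleep(poll_s)
    return False          # finish files still present; caller decides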
ID: 1820103
Jimbocous
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1820111 - Posted: 27 Sep 2016, 4:35:03 UTC - in response to Message 1819360.  

Secondly, yes, the frequency of your rescheduling could have an impact, both on your RAC and on those "finish file present too long" errors you're getting. With so many concurrent tasks running on that box, every time BOINC is shut down during a rescheduling run, you run the risk of catching one or more tasks in their shutdown phase, where the finish file has already been written by the app, but before BOINC has checked its existence. When BOINC comes back up, more than 10 seconds have passed and the error is generated.

Do we have any idea what the persistence of that file is?

Personally, I would recommend a far longer interval between rescheduling runs ...

Agreed. For testing, I've been running it at 2 hrs between runs, but I think the most effective would be somewhere around 4-6 hours. Bottom line is, if you're getting runs where nothing gets done, it's clearly being run too often. Haven't been seeing any "finish file present too long" errors at all.
I would suggest trying to determine how long a wu takes to get through from receipt to completion, and set the reschedule interval at about half that. Should be more than adequate.
ID: 1820111
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1820114 - Posted: 27 Sep 2016, 4:47:48 UTC - in response to Message 1820103.  
Last modified: 27 Sep 2016, 4:52:07 UTC

With so many concurrent tasks running on that box, every time BOINC is shut down during a rescheduling run, you run the risk of catching one or more tasks in their shutdown phase, where the finish file has already been written by the app, but before BOINC has checked its existence.

This is an idea for the author of QOpt:
Tell Jimbocous to just delete all the boinc_finish_called files found in BOINC-Data\slots\* before restarting BOINC

Just discovered this thread thanks to Stephen. Guess I should have been here for a while, eh? Let me go back, reread and see if I can pick up the thread here, and if something is needed I can sure do it. If I understand the issue, it's possible to do a shutdown while file clean-up is still happening? If so, would it not be better to just check for presence of a boinc_finish_called file and, if one exists, loop until it's gone before commencing actual client shutdown? I'm assuming these files don't stick around long. Obviously easy to check on startup as well, but maybe less disruptive?
I need to better understand the structure.

You can find a lot of discussion over the last few years in Number Crunching about the "finish file present too long" error, but here's a synopsis.

A "boinc_finish_called" file is an empty file that's written to a task's slot directory by the science application at (or near) the end of processing for the task. In the Stderr, you'll see a line near the end such as "21:11:53 (3088): called boinc_finish(0)". This file is kind of a squirrelly way to communicate to BOINC that processing has completed. BOINC repeatedly checks the slot directories for the presence of the file. Unfortunately, BOINC expects that the file will exist for no more than about 10 seconds. If it's been there longer, for some dumb|insane|idiotic reason, BOINC thinks something is wrong and essentially discards the task with the "finish file present too long" error.

There are a number of reasons why there might be a delay of more than 10 seconds between the writing of the finish file and BOINC discovering its existence. A few months back, I think I listed at least a half dozen reasons that had likely triggered the problem on my own systems. One of those is simply BOINC and/or the OS shutting down or crashing immediately after a finish file is written. Obviously, by the time BOINC restarts, more than 10 seconds have passed.

What's happening recently, with the emphasis on VLAR rescheduling, etc., is that BOINC is getting shut down on some systems, like Al's, much more frequently than it ever would have in the past, thus increasing the likelihood of a task getting caught with its finish file hanging out, especially with a multi-core, multi-GPU system.

The solution is simply, while BOINC is shut down, to delete any and all "boinc_finish_called" files found in the slot directories. There's no downside that I know of to doing this. When BOINC restarts, any tasks which had previously written a finish file will be restarted, usually at or close to 100%, "finish" a second time, and simply write a new "boinc_finish_called" file for BOINC to find. BilBg's suggestion is just to include a command to automate that slot cleanup as part of the rescheduler or the front end at some point between BOINC shutdown and restart. I think it's an excellent idea.
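As a toy model (not BOINC's actual code), the behaviour described above amounts to something like this:

# The app drops an empty boinc_finish_called file in its slot directory;
# the client polls, and if the file has been sitting there longer than
# the ~10-second window by the time the client notices it, the task is
# discarded with the "finish file present too long" error.
import os, time

FINISH_GRACE_S = 10  # the hardcoded window discussed above

def check_finish_file(path):
    if not os.path.exists(path):
        return "still running"
    if time.time() - os.path.getmtime(path) > FINISH_GRACE_S:
        return "error: finish file present too long"
    return "finished normally"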
ID: 1820114
Jimbocous
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1820116 - Posted: 27 Sep 2016, 4:53:40 UTC - in response to Message 1819989.  

If you are in the BOINC Data directory this works (I put some boinc_finish_called files in empty slots to test; the command is in a .CMD file):
Del /S slots\boinc_finish_called

Wish they were all that easy. Incorporated into 1.02d.
Thanks!
ID: 1820116
Jimbocous
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1820143 - Posted: 27 Sep 2016, 6:55:22 UTC - in response to Message 1820116.  

If you are in the BOINC Data directory this works (I put some boinc_finish_called files in empty slots to test; the command is in a .CMD file):
Del /S slots\boinc_finish_called

Wish they were all that easy. Incorporated into 1.02d.
Thanks!

Also a couple of other issues resolved, and PMs are out to all testers with a fresh link to the 1.02d update. Thanks for the help, and the heads-up on issues!
ID: 1820143