Questions and Answers :
Macintosh :
Over open file limit on dual-processor G5's?
Deborah Goldsmith (Joined: 5 May 99, Posts: 9, Credit: 11,520,811, RAC: 43)
I have been seeing a problem where the BOINC main client becomes unable to communicate on the network or open files, even though other processes on the same machine have no such difficulty. Killing and restarting the client solves the problem. This has only happened on my dual-processor G5 machine. It looks like it is going over the open file limit on Darwin. This happened with the 4.43 client, and seems much worse with the new 5.2.5 client. Typical log entries below:

```
2005-11-02 00:40:04 [SETI@home] Started upload of 21my04aa.11937.4240.434644.131_2_0
2005-11-02 00:40:10 [SETI@home] Finished upload of 21my04aa.11937.4240.434644.131_2_0
2005-11-02 00:40:10 [SETI@home] Throughput 11143 bytes/sec
2005-11-02 00:51:49 [---] request_reschedule_cpus: process exited
2005-11-02 00:51:49 [Predictor @ Home] Computation for result h0012B_1_86050_2 finished
2005-11-02 00:51:49 [rosetta@home] Restarting result 1btn__abrelax_no_cst_29133_2 using rosetta version 477
2005-11-02 00:51:49 [Einstein@Home] Restarting result w1_0983.5__0983.7_0.1_T06_S4hC_1 using einstein version 12
2005-11-02 00:51:49 [Predictor @ Home] Pausing result h0012B_1_86511_3 (removed from memory)
2005-11-02 00:51:50 [---] request_reschedule_cpus: process exited
2005-11-02 00:51:51 [Predictor @ Home] Started upload of h0012B_1_86050_2_0
2005-11-02 00:51:51 [Predictor @ Home] Started upload of h0012B_1_86050_2_1
2005-11-02 00:51:56 [---] Couldn't connect to hostname [predictor.scripps.edu]
2005-11-02 00:51:56 [---] Couldn't connect to hostname [predictor.scripps.edu]
2005-11-02 00:51:56 [---] Couldn't connect to hostname [predictor.scripps.edu]
2005-11-02 00:51:56 [---] Couldn't connect to hostname [predictor.scripps.edu]
2005-11-02 00:51:56 [Predictor @ Home] Temporarily failed upload of h0012B_1_86050_2_0: system I/O
2005-11-02 00:51:56 [Predictor @ Home] Backing off 1 minutes and 0 seconds on upload of file h0012B_1_86050_2_0
2005-11-02 00:51:56 [Predictor @ Home] Temporarily failed upload of h0012B_1_86050_2_1: system I/O
2005-11-02 00:51:56 [Predictor @ Home] Backing off 1 minutes and 0 seconds on upload of file h0012B_1_86050_2_1
2005-11-02 00:51:56 [Predictor @ Home] Started upload of h0012B_1_86050_2_2
2005-11-02 00:52:01 [---] Couldn't connect to hostname [predictor.scripps.edu]
2005-11-02 00:52:01 [---] Couldn't connect to hostname [predictor.scripps.edu]
2005-11-02 00:52:01 [Predictor @ Home] Temporarily failed upload of h0012B_1_86050_2_2: system I/O
2005-11-02 00:52:01 [Predictor @ Home] Backing off 1 minutes and 0 seconds on upload of file h0012B_1_86050_2_2
2005-11-02 00:52:56 [Predictor @ Home] Started upload of h0012B_1_86050_2_0
2005-11-02 00:52:56 [---] HTTP_CURL:libcurl_exec(): Can't setup HTTP response output file blcqH0NkB
2005-11-02 00:52:56 [---] HTTP_CURL:libcurl_exec(): Can't setup HTTP response output file blcqH0NkB
2005-11-02 00:52:56 [Predictor @ Home] Couldn't start upload of h0012B_1_86050_2_1
2005-11-02 00:52:56 [Predictor @ Home] Couldn't start upload of h0012B_1_86050_2_1
2005-11-02 00:52:56 [Predictor @ Home] URL http://predictor.scripps.edu/predictor_cgi/file_upload_handler: system fopen
2005-11-02 00:52:56 [Predictor @ Home] URL http://predictor.scripps.edu/predictor_cgi/file_upload_handler: system fopen
2005-11-02 00:52:56 [Predictor @ Home] Backing off 1 minutes and 0 seconds on upload of file h0012B_1_86050_2_1
2005-11-02 00:52:56 [---] Can't open temp state file: client_state_next.xml
2005-11-02 00:52:56 [---] Can't open temp state file: client_state_next.xml
2005-11-02 00:52:56 [---] Couldn't write state file: system fopen
2005-11-02 00:52:56 [---] Couldn't write state file: system fopen
...
2005-11-02 00:53:56 [---] Can't open temp state file: client_state_next.xml
2005-11-02 00:53:56 [---] Can't open temp state file: client_state_next.xml
2005-11-02 00:53:56 [---] Couldn't write state file: system fopen
2005-11-02 00:53:56 [---] Couldn't write state file: system fopen
...
2005-11-02 00:57:56 [Predictor @ Home] Computation for result h0012B_1_86511_3 finished
md5_file: can't open projects/predictor1.scripps.edu/h0012B_1_86511_3_0
md5_file: Too many open files
md5_file: can't open projects/predictor1.scripps.edu/h0012B_1_86511_3_1
md5_file: Too many open files
md5_file: can't open projects/predictor1.scripps.edu/h0012B_1_86511_3_2
md5_file: Too many open files
2005-11-02 00:57:56 [Predictor @ Home] Computation for result h0012B_1_87919_3 finished
```

Is the BOINC main client leaking open file descriptors on dual-processor machines?
Deborah Goldsmith (Joined: 5 May 99, Posts: 9, Credit: 11,520,811, RAC: 43)
BTW the default limit is 256 unless a process calls setrlimit(2). That's an awful lot, so it seems like there must be a leak. |
Deborah Goldsmith (Joined: 5 May 99, Posts: 9, Credit: 11,520,811, RAC: 43)
Definitely leaking file descriptors. I've been tracking the process with fs_usage, and it is opening files that it is not closing. The files whose descriptors are being left open all look like this:

```
21:54:53.418 open F=100 blca2dDeR
21:54:53.419 open F=102 blczA2YLG
21:54:57.649 open F=101 blcQkFsos
21:55:00.656 open F=103 blcIalbr7
21:55:03.666 open F=104 blcxXSo8a
```

The above is a result of this command:

```
fgrep 'F=xxx' boincfsusage.txt | egrep 'open|close' | tail -1
```

Interestingly, the files are not being left around, but the descriptors are being left open.
Deborah Goldsmith (Joined: 5 May 99, Posts: 9, Credit: 11,520,811, RAC: 43)
Found the problem. It has nothing to do with dual processors, and is likely happening on Linux as well. It's in client/http_curl.C:

```
#else
    // use mkstemp on Mac & Linux due to security issues
    strcpy(outfile, "blcXXXXXX"); // a template for the mkstemp
    mkstemp(outfile);
#endif
```

From man mkstemp:

> The mkstemp() function makes the same replacement to the template and creates the template file, mode 0600, returning a file descriptor opened for reading and writing. This avoids the race between testing for a file's existence and opening it for use.

You can see from the code that the open file descriptor that's returned is thrown away. The file is left open and the descriptor is wasted. Probably the reason I'm seeing this on my dual G5 is that it's so much faster than anything else, so it's chewing through the 256-descriptor limit more quickly.

Possible fixes:

1. Use mktemp instead of mkstemp; mktemp does not open the file.
2. Retain the file descriptor, and call close() on it later once boinc_fopen is called on the name.
Deborah Goldsmith (Joined: 5 May 99, Posts: 9, Credit: 11,520,811, RAC: 43)
Fixed in boinc core client 5.2.7, but that hasn't been released for Mac OS X yet. |
Snake Doctor (Joined: 13 Jan 01, Posts: 3, Credit: 1,534,389, RAC: 0)
> Found the problem. It has nothing to do with dual processors, and is likely happening on Linux as well. It's in client/http_curl.C:

I am seeing a situation on both a dual G5 and a dual G4 running BOINC 5.2.5 where about every 2 days they will error out all the WUs in the queue. I have been unable to trace a cause for this. Could this file error cause all the WUs to error?

When this happens the system just stops all BOINC activity. I usually find it sitting with a long list of WUs shown with status "Client Error", and when I check the project stats there will usually be 50 or so WUs shown as client error there as well. Usually I have to reboot the system to clear the problem.

I have just (today) upgraded the dual G4 to BOINC 5.2.8 to see if this will help, but it is too soon to tell. Has anyone else seen all the WUs in the queue suddenly error out for no apparent reason?

Regards
Phil

We must look for intelligent life on other planets as it is becoming increasingly apparent we will not find any on our own.
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.