Over open file limit on dual-processor G5's?

Questions and Answers : Macintosh : Over open file limit on dual-processor G5's?
Deborah Goldsmith

Joined: 5 May 99
Posts: 9
Credit: 11,520,811
RAC: 43
United States
Message 185180 - Posted: 2 Nov 2005, 22:24:35 UTC

I have been seeing a problem where the BOINC main client becomes unable to communicate on the network or open files, even though other processes on the same machine have no such difficulty. Killing and restarting the client solves the problem. This has only happened on my dual-processor G5 machine. It looks like it is going over the open file limit on Darwin.

This happened with the 4.43 client, and seems much worse with the new 5.2.5 client. Typical log entries below:

2005-11-02 00:40:04 [SETI@home] Started upload of 21my04aa.11937.4240.434644.131_2_0
2005-11-02 00:40:10 [SETI@home] Finished upload of 21my04aa.11937.4240.434644.131_2_0
2005-11-02 00:40:10 [SETI@home] Throughput 11143 bytes/sec
2005-11-02 00:51:49 [---] request_reschedule_cpus: process exited
2005-11-02 00:51:49 [Predictor @ Home] Computation for result h0012B_1_86050_2 finished
2005-11-02 00:51:49 [rosetta@home] Restarting result 1btn__abrelax_no_cst_29133_2 using rosetta version 477
2005-11-02 00:51:49 [Einstein@Home] Restarting result w1_0983.5__0983.7_0.1_T06_S4hC_1 using einstein version 12
2005-11-02 00:51:49 [Predictor @ Home] Pausing result h0012B_1_86511_3 (removed from memory)
2005-11-02 00:51:50 [---] request_reschedule_cpus: process exited
2005-11-02 00:51:51 [Predictor @ Home] Started upload of h0012B_1_86050_2_0
2005-11-02 00:51:51 [Predictor @ Home] Started upload of h0012B_1_86050_2_1
2005-11-02 00:51:56 [---] Couldn't connect to hostname [predictor.scripps.edu]
2005-11-02 00:51:56 [---] Couldn't connect to hostname [predictor.scripps.edu]
2005-11-02 00:51:56 [---] Couldn't connect to hostname [predictor.scripps.edu]
2005-11-02 00:51:56 [---] Couldn't connect to hostname [predictor.scripps.edu]
2005-11-02 00:51:56 [Predictor @ Home] Temporarily failed upload of h0012B_1_86050_2_0: system I/O
2005-11-02 00:51:56 [Predictor @ Home] Backing off 1 minutes and 0 seconds on upload of file h0012B_1_86050_2_0
2005-11-02 00:51:56 [Predictor @ Home] Temporarily failed upload of h0012B_1_86050_2_1: system I/O
2005-11-02 00:51:56 [Predictor @ Home] Backing off 1 minutes and 0 seconds on upload of file h0012B_1_86050_2_1
2005-11-02 00:51:56 [Predictor @ Home] Started upload of h0012B_1_86050_2_2
2005-11-02 00:52:01 [---] Couldn't connect to hostname [predictor.scripps.edu]
2005-11-02 00:52:01 [---] Couldn't connect to hostname [predictor.scripps.edu]
2005-11-02 00:52:01 [Predictor @ Home] Temporarily failed upload of h0012B_1_86050_2_2: system I/O
2005-11-02 00:52:01 [Predictor @ Home] Backing off 1 minutes and 0 seconds on upload of file h0012B_1_86050_2_2
2005-11-02 00:52:56 [Predictor @ Home] Started upload of h0012B_1_86050_2_0
2005-11-02 00:52:56 [---] HTTP_CURL:libcurl_exec(): Can't setup HTTP response output file blcqH0NkB
2005-11-02 00:52:56 [---] HTTP_CURL:libcurl_exec(): Can't setup HTTP response output file blcqH0NkB
2005-11-02 00:52:56 [Predictor @ Home] Couldn't start upload of h0012B_1_86050_2_1
2005-11-02 00:52:56 [Predictor @ Home] Couldn't start upload of h0012B_1_86050_2_1
2005-11-02 00:52:56 [Predictor @ Home] URL http://predictor.scripps.edu/predictor_cgi/file_upload_handler: system fopen
2005-11-02 00:52:56 [Predictor @ Home] URL http://predictor.scripps.edu/predictor_cgi/file_upload_handler: system fopen
2005-11-02 00:52:56 [Predictor @ Home] Backing off 1 minutes and 0 seconds on upload of file h0012B_1_86050_2_1
2005-11-02 00:52:56 [---] Can't open temp state file: client_state_next.xml
2005-11-02 00:52:56 [---] Can't open temp state file: client_state_next.xml
2005-11-02 00:52:56 [---] Couldn't write state file: system fopen
2005-11-02 00:52:56 [---] Couldn't write state file: system fopen
...
2005-11-02 00:53:56 [---] Can't open temp state file: client_state_next.xml
2005-11-02 00:53:56 [---] Can't open temp state file: client_state_next.xml
2005-11-02 00:53:56 [---] Couldn't write state file: system fopen
2005-11-02 00:53:56 [---] Couldn't write state file: system fopen
...
2005-11-02 00:57:56 [Predictor @ Home] Computation for result h0012B_1_86511_3 finished
md5_file: can't open projects/predictor1.scripps.edu/h0012B_1_86511_3_0
md5_file: Too many open files
md5_file: can't open projects/predictor1.scripps.edu/h0012B_1_86511_3_1
md5_file: Too many open files
md5_file: can't open projects/predictor1.scripps.edu/h0012B_1_86511_3_2
md5_file: Too many open files
2005-11-02 00:57:56 [Predictor @ Home] Computation for result h0012B_1_87919_3 finished

Is the BOINC main client leaking open file descriptors on dual-processor machines?


Deborah Goldsmith

Joined: 5 May 99
Posts: 9
Credit: 11,520,811
RAC: 43
United States
Message 185185 - Posted: 2 Nov 2005, 22:29:38 UTC

BTW the default limit is 256 unless a process calls setrlimit(2). That's an awful lot, so it seems like there must be a leak.
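
To check what limit a process is actually running with, something like the following should work. It is only a sketch: the function names are made up for illustration, and it relies on nothing beyond getrlimit(2) with RLIMIT_NOFILE.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Fetch the soft and hard limits on open file descriptors for the
 * calling process. Returns 0 on success, -1 if getrlimit() fails.
 * (Hypothetical helper name, not part of BOINC.) */
int get_open_file_limits(rlim_t *soft, rlim_t *hard)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return -1;
    }
    *soft = rl.rlim_cur;   /* the limit that open() actually enforces */
    *hard = rl.rlim_max;   /* the ceiling setrlimit() may raise it to */
    return 0;
}

/* Print both limits; an unlimited value shows up as a very large number. */
void print_open_file_limits(void)
{
    rlim_t soft, hard;
    if (get_open_file_limits(&soft, &hard) == 0) {
        printf("soft=%llu hard=%llu\n",
               (unsigned long long)soft, (unsigned long long)hard);
    } else {
        perror("getrlimit");
    }
}
```

Calling print_open_file_limits() from a small main() reports the limits that process inherited from whatever launched it.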
Deborah Goldsmith

Joined: 5 May 99
Posts: 9
Credit: 11,520,811
RAC: 43
United States
Message 185454 - Posted: 3 Nov 2005, 18:06:58 UTC

Definitely leaking file descriptors. I've been tracking the process with fs_usage, and it is opening files that it is not closing. The files whose descriptors are being left open all look like this:

21:54:53.418 open F=100 blca2dDeR
21:54:53.419 open F=102 blczA2YLG
21:54:57.649 open F=101 blcQkFsos
21:55:00.656 open F=103 blcIalbr7
21:55:03.666 open F=104 blcxXSo8a

The above is a result of this command:
fgrep 'F=xxx' boincfsusage.txt | egrep 'open|close' | tail -1

Interestingly, the files are not being left around, but the descriptors are being left open.

Deborah Goldsmith

Joined: 5 May 99
Posts: 9
Credit: 11,520,811
RAC: 43
United States
Message 185461 - Posted: 3 Nov 2005, 18:30:48 UTC

Found the problem. It has nothing to do with dual processors, and is likely happening on Linux as well. It's in client/http_curl.C:

#else // use mkstemp on Mac & Linux due to security issues
strcpy(outfile, "blcXXXXXX"); // a template for the mkstemp
mkstemp(outfile);
#endif

From man mkstemp:
The mkstemp() function makes the same replacement to the template and
creates the template file, mode 0600, returning a file descriptor opened
for reading and writing. This avoids the race between testing for a
file's existence and opening it for use.

You can see from the code that the open file descriptor that's returned is thrown away. The file is left open and the descriptor is wasted. Probably the reason I'm seeing this on my dual G5 is that it's so much faster than anything else, so it's chewing through the 256-descriptor limit more quickly.
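
The effect is easy to reproduce outside BOINC. The sketch below is an illustration, not the client's code: it lowers the soft descriptor limit, then repeats the same mkstemp pattern without ever closing the result. The function name and the /tmp location are my own assumptions. It unlinks each file right away, which matches what fs_usage showed: the files disappear but the descriptors stay open.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

/* Lower the soft RLIMIT_NOFILE to soft_limit, then call mkstemp()
 * repeatedly while discarding the returned descriptor.
 * Returns the number of successful calls before the first failure,
 * or -1 if the limit could not be changed. *last_errno receives errno
 * from the failing mkstemp() call (0 if no call failed).
 * (Hypothetical helper, for demonstration only.) */
int leak_until_failure(rlim_t soft_limit, int *last_errno)
{
    struct rlimit rl;
    int calls = 0;

    *last_errno = 0;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) return -1;
    rl.rlim_cur = soft_limit;
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) return -1;

    while (calls < 100000) {              /* safety cap on the loop */
        char name[] = "/tmp/blcXXXXXX";   /* same "blc" prefix as the client */
        int fd = mkstemp(name);
        if (fd < 0) {
            *last_errno = errno;          /* expected: EMFILE */
            break;
        }
        unlink(name);   /* the file goes away, the descriptor does not */
        calls++;        /* fd is deliberately never closed: this is the leak */
    }
    return calls;
}
```

With a soft limit of 64, the loop should stop after roughly 60 calls with errno set to EMFILE ("Too many open files"), the same error md5_file reports in the log.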

Possible fixes:
1. Use mktemp instead of mkstemp; mktemp does not open the file.
2. Retain the file descriptor, and call close() on it later once boinc_fopen is called on the name.
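
Fix 2 might look roughly like the sketch below. I have not seen what the actual patch does, so treat this as an assumption: plain fopen() stands in for boinc_fopen(), and open_temp_output is a made-up name.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Create a unique temp file from name_template (which must end in
 * "XXXXXX" and is overwritten with the real name), open it as a stdio
 * stream, and release the descriptor that mkstemp() returned.
 * Returns NULL on failure. (Hypothetical helper, not BOINC code.) */
FILE *open_temp_output(char *name_template)
{
    FILE *f;
    int fd = mkstemp(name_template);   /* creates AND opens the file */
    if (fd < 0) {
        return NULL;
    }

    f = fopen(name_template, "w");     /* stand-in for boinc_fopen() */
    close(fd);                         /* mkstemp's descriptor is no longer needed */

    if (f == NULL) {
        unlink(name_template);         /* do not leave an empty file behind */
    }
    return f;
}
```

The caller does fclose() on the stream and unlink() on the name when the transfer is done, so each request ends with no descriptors left open. Another option is fdopen(fd, "w"), which wraps mkstemp's descriptor in a stream directly instead of opening the file a second time.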

Deborah Goldsmith

Joined: 5 May 99
Posts: 9
Credit: 11,520,811
RAC: 43
United States
Message 189678 - Posted: 17 Nov 2005, 2:12:09 UTC

Fixed in boinc core client 5.2.7, but that hasn't been released for Mac OS X yet.
Profile Snake Doctor
Volunteer tester

Joined: 13 Jan 01
Posts: 3
Credit: 1,534,389
RAC: 0
United States
Message 193436 - Posted: 24 Nov 2005, 16:23:42 UTC - in response to Message 185461.  


I am seeing a situation on both a dual G5 and a dual G4 running BOINC 5.2.5 where, about every 2 days, they error out all the WUs in the queue. I have been unable to trace a cause for this. Could this file error cause all the WUs to error? When this happens the system just stops all BOINC activity. I usually find it sitting with a long list of WUs shown with status "Client Error", and when I check the project stats there will usually be 50 or so WUs shown as client error there as well. Usually I have to reboot the system to clear the problem.

I have just (today) upgraded the dual G4 to BOINC 5.2.8 to see if this will help, but it is too soon to tell.

Has anyone else seen all the WUs in the queue suddenly error out for no apparent reason?

Regards
Phil


We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.