[I asked something similar before. This is a more focused version.]
What can cause a server's select() call on a TCP socket to consistently time out rather than "see" the client's close() of the socket? On the client's side, the socket is a regular, blocking socket created by socket() that successfully connects to the server and successfully completes a round-trip transaction. On the server's side, the socket is created via an accept() call, is blocking, is passed to a child server process via fork(), is closed by the top-level server, and is successfully used by the child server process in the initial transaction. When the client subsequently closes the socket, the select() call of the child server process consistently times out (after 1 minute) rather than indicating a read-ready condition on the socket. The select() call looks for read-ready conditions only: the write-ready and exception arguments are NULL.
Here's the simplified but logically equivalent select()-using code in the child server process:
int one_svc_run(
    const int           sock,
    const unsigned      timeout) 
{
    struct timeval      timeo;
    fd_set              fds;
    timeo.tv_sec = timeout;
    timeo.tv_usec = 0;
    FD_ZERO(&fds);
    FD_SET(sock, &fds);
    for (;;) {
        fd_set      readFds = fds;
        int         status = select(sock+1, &readFds, 0, 0, &timeo);
        if (status < 0)
            return errno;
        if (status == 0)
            return ETIMEDOUT;
        /* This code not reached when client closes socket */
        /* The time-out structure, "timeo", is appropriately reset here */
        ...            
    }
    ...
}
Here's the logical equivalent of the sequence of events on the client-side (error-handling not shown):
struct sockaddr_in *raddr = ...;
int sock = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
(void)bindresvport(sock, (struct sockaddr_in *)0);
connect(sock, (struct sockaddr *)raddr, sizeof(*raddr));
/* Send a message to the server and receive a reply */
(void)close(sock);
fork(), exec(), and system() are never called. The code is considerably more complex than this, but this is the sequence of relevant calls.
Could Nagle's algorithm cause the FIN packet to not be sent upon close()?
If you want the server to close the connection, you should call shutdown(cfd, SHUT_RDWR) and then close(cfd), NOT close(lfd). This leaves the lfd socket open, allowing the server to wait in accept() for the next incoming connection. The lfd should be closed only when the server terminates.
This is because the initial (listening) socket is used to wait for connections, while the second (connected) socket is used to communicate.
The close function shuts down the socket associated with socket descriptor s, and frees resources allocated to the socket. If s refers to an open TCP connection, the connection is closed. If a stream socket is closed when there is input data queued, the TCP connection is reset rather than being cleanly closed.
The most likely explanation is that you're not actually closing the client end of the connection when you think you are, probably because some other file descriptor that references the client socket is not being closed.
If your client program ever does a fork (or related calls that fork, such as system or popen), the forked child might well have a copy of the file descriptor which would cause the behavior you're seeing.
One way to test/workaround the problem is to have the client do an explicit shutdown(2) prior to closing the socket:
shutdown(sock, SHUT_RDWR);
close(sock);
If this causes the problem to go away then that is the problem -- you have another copy of the client socket file descriptor somewhere hanging around.
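To see why this test is diagnostic, here is a minimal sketch, assuming a POSIX system. It uses socketpair() and dup() as stand-ins for the TCP connection and for a stray copy of the descriptor (for example one inherited by a forked child). The names wait_readable, demo_stray_descriptor and struct stray_fd_result are hypothetical, not from the original code.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Results of each step, so a caller can inspect what happened. */
struct stray_fd_result {
    int     after_close;    /* select() result after close() of the original fd */
    int     after_shutdown; /* select() result after shutdown() on the copy */
    ssize_t eof_read;       /* read() result after the shutdown */
};

/* Hypothetical helper: wait up to timeout_ms for fd to become read-ready.
 * Returns select()'s result: 1 if ready, 0 on time-out, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    fd_set          readFds;
    struct timeval  tv;

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&readFds);
    FD_SET(fd, &readFds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    return select(fd + 1, &readFds, NULL, NULL, &tv);
}

/* Hypothetical demo: sv[0] plays the server end, sv[1] the client end. */
struct stray_fd_result demo_stray_descriptor(void)
{
    struct stray_fd_result res = { -1, -1, -1 };
    int                    sv[2];
    int                    stray;
    char                   byte;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return res;

    stray = dup(sv[1]);     /* the copy "hanging around" somewhere */
    if (stray < 0) {
        close(sv[0]);
        close(sv[1]);
        return res;
    }

    /* close() only drops one reference; the connection stays open. */
    close(sv[1]);
    res.after_close = wait_readable(sv[0], 200);

    /* shutdown() acts on the connection itself, whatever copies exist. */
    shutdown(stray, SHUT_RDWR);
    res.after_shutdown = wait_readable(sv[0], 200);
    res.eof_read = read(sv[0], &byte, 1);

    close(stray);
    close(sv[0]);
    return res;
}
```

With a stray copy open, the expectation is that after_close is 0 (select() timed out) while after_shutdown is 1 and eof_read is 0 (end-of-file), which is the same pattern the shutdown() test looks for in the real client.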
If the problem is due to children getting the socket, the best fix is probably to set the close-on-exec flag on the socket immediately after creating it:
fcntl(sock, F_SETFD, fcntl(sock, F_GETFD) | FD_CLOEXEC);
or on some systems, use the SOCK_CLOEXEC flag to the socket creation call.
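A rough sketch combining both variants, assuming a POSIX system: it uses SOCK_CLOEXEC where the headers define it and falls back to fcntl() otherwise. The helper names tcp_socket_cloexec and has_cloexec are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP socket that is closed automatically
 * across exec(). Returns the descriptor, or -1 on failure. */
int tcp_socket_cloexec(void)
{
#ifdef SOCK_CLOEXEC
    /* Set the flag atomically at creation time where supported. */
    return socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP);
#else
    int sock = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    int flags;

    if (sock < 0)
        return -1;
    flags = fcntl(sock, F_GETFD);
    if (flags < 0 || fcntl(sock, F_SETFD, flags | FD_CLOEXEC) < 0) {
        close(sock);
        return -1;
    }
    return sock;
#endif
}

/* Hypothetical helper: 1 if fd has close-on-exec set, 0 if not, -1 on error. */
int has_cloexec(int fd)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

Note that close-on-exec only affects children that call exec(); a child that forks without exec'ing still holds its copy and has to close it explicitly.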
Mystery solved.
@nos was correct in the first comment: it's a firewall problem. A shutdown() by the client isn't needed; the client does close the socket; the server does use the right timeout; and there's no bug in the code.
The problem was caused by the firewall rules on our Linux Virtual Server (LVS). A client connects to the LVS and the connection is passed to the least-loaded of several backend servers. All packets from the client pass through the LVS; all packets from the backend server go directly to the client. The firewall rules on the LVS caused the FIN packet from the client to be discarded. Thus, the backend server never saw the close() by the client.
The solution was to remove the "-m state --state NEW" options from the iptables(8) rule on the LVS system. This allows the FIN packets from the client to be forwarded to the backend server. This article has more information.
Thanks to all of you who suggested using wireshark(1).
On Linux, the select() call modifies the value of its timeout argument. From the man page:
On Linux, select() modifies timeout to reflect the amount of time not slept
So your timeo will run down to zero, and once it is zero select() will return immediately (usually with a return value of zero).
The following change may help:
for (;;) {
    struct timeval timo = timeo;
    fd_set      readFds = fds;
    int         status = select(sock+1, &readFds, 0, 0, &timo);
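For completeness, here is one way the whole loop might look with the time-out copied on every iteration. This is only a sketch under assumptions: the function name wait_for_peer_close is hypothetical, incoming data is simply discarded, and EINTR is reported as an error rather than retried.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: wait until the peer closes the connection.
 * Returns 0 on end-of-file, ETIMEDOUT if nothing arrives within
 * "timeout" seconds, or an errno value on failure. */
int wait_for_peer_close(int sock, unsigned timeout)
{
    struct timeval  timeo;
    fd_set          fds;

    assert(sock >= 0 && sock < FD_SETSIZE);
    timeo.tv_sec = timeout;
    timeo.tv_usec = 0;
    FD_ZERO(&fds);
    FD_SET(sock, &fds);

    for (;;) {
        /* Fresh copies: select() may overwrite both arguments. */
        struct timeval  timo = timeo;
        fd_set          readFds = fds;
        char            buf[256];
        ssize_t         n;
        int             status = select(sock + 1, &readFds, NULL, NULL, &timo);

        if (status < 0)
            return errno;
        if (status == 0)
            return ETIMEDOUT;

        /* Read-ready: either data arrived or the peer closed. */
        n = read(sock, buf, sizeof buf);
        if (n < 0)
            return errno;
        if (n == 0)
            return 0;       /* end-of-file: peer closed its end */
        /* Otherwise discard the data and wait again with a full time-out. */
    }
}
```

Because timo is re-initialised from timeo at the top of each pass, every call to select() gets the full time-out regardless of how much of the previous one was consumed.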