Over the course of developing a fairly large project, we've accumulated a lot of unit tests. Many of these tests start a server, connect a client to it, and then close both client and server, usually within the same process.
However, these tests randomly fail with a "Failed to bind address 127.0.0.1:(port)" error. When the test is re-run, the error usually disappears.
We initially assumed this was a problem with our tests, so we wrote a small standalone test in Clojure, which I'll post below (commented for the non-Clojure people).
(ns test
  (:import [java.net Socket ServerSocket]))

(dotimes [n 10000]                           ; Run the test ten thousand times
  (let [server (ServerSocket. 10000)         ; Start a server on port 10000
        client (Socket. "localhost" 10000)   ; Connect a client to port 10000
        p      (.getLocalPort client)]       ; Get the local port of the client
    (.close client)                          ; Close the client
    (.close server)                          ; Close the server
    (println "n = " n)                       ; Debug
    (println "p = " p)                       ; Debug
    (println "client = " client)             ; Debug
    (println "server = " server)             ; Debug
    (let [server (ServerSocket. p)]          ; Start a server on the local port of the client we just closed
      (.close server)                        ; Close the server
      (println "client = " client)           ; Debug
      (println "server = " server))))        ; Debug
The exception is thrown, at random, on the line where we start the second server. It seems that Java is holding onto the client's local port, even though the client on that port has already been closed.
So, my question: why on earth is Java doing this, and why is it so seemingly random?
EDIT: Someone suggested calling setReuseAddress(true) on the socket. I've done this, and nothing has changed; the updated code is below.
(ns test
  (:import [java.net Socket ServerSocket InetSocketAddress]))

(dotimes [n 10000]                                 ; Run the test ten thousand times
  (let [server (ServerSocket.)]                    ; Create an unbound server socket
    (. server (setReuseAddress true))              ; Set the socket to reuse the address
    (. server (bind (InetSocketAddress. 10000)))   ; Bind the socket to port 10000
    (let [client (Socket. "localhost" 10000)       ; Connect a client to port 10000
          p      (.getLocalPort client)]           ; Get the client's local port
      (.close client)                              ; Close the client
      (.close server)                              ; Close the server
      ; (. Thread (sleep 1000))                    ; A sleep for testing
      (println "n = " n)                           ; Debug
      (println "p = " p)                           ; Debug
      (println "client = " client)                 ; Debug
      (println "server = " server)                 ; Debug
      (let [server (ServerSocket.)]                ; Create another unbound server socket
        (. server (setReuseAddress true))          ; Set the socket to reuse the address
        (. server (bind (InetSocketAddress. p)))   ; Bind it to the local port of the client we just closed
        (.close server)                            ; Close the server
        (println "client = " client)               ; Debug
        (println "server = " server)))))           ; Debug
I've also noticed that a sleep of 10msec or even 100msec does not prevent the problem. 1000msec has (so far) managed to prevent it, however.
EDIT 2: Someone put me on to SO_LINGER - but I can't find a way to set that on the ServerSockets. Anyone have any ideas on that?
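(For reference, SO_LINGER is set per connection via java.net.Socket's setSoLinger(boolean, int), not on ServerSocket, so presumably it would have to go on the client socket before closing it. A rough sketch of that idea; the linger value of 0 is an arbitrary choice for illustration, and I'm not claiming this is the fix:)

(let [server (ServerSocket. 10000)          ; Server on port 10000, as in the test above
      client (Socket. "localhost" 10000)]   ; Client connected to it
  (.setSoLinger client true 0)              ; SO_LINGER on with timeout 0: close() resets the connection instead of lingering
  (.close client)                           ; Close the client
  (.close server))                          ; Close the server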
EDIT 3: Turns out that SO_LINGER is disabled by default. What else can we look at?
UPDATE: The problem has been solved for the most part, using dynamic port allocation over a range of 10,000 or so ports. However, I'd still like to see what people can come up with.
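(For the curious, the "dynamic port allocation" is nothing fancy: pick a port at random from a large range and retry on a failed bind. A minimal sketch of the idea; the function name and range here are made up for this post, not our actual code:)

(import '[java.net ServerSocket BindException])

(defn bind-random-port
  "Keep trying random ports in [base, base + span) until one binds."
  [base span]
  (loop []
    (let [port (+ base (rand-int span))
          sock (try (ServerSocket. port)
                    (catch BindException _ nil))]   ; Port already in use (or stuck in TIME_WAIT): try another
      (if sock
        sock
        (recur)))))

; e.g. (def server (bind-random-port 20000 10000)) ; a range of 10,000 ports, as mentioned above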
I'm not (too) familiar with the Clojure syntax, but you should invoke socket.setReuseAddress(true). This allows the program to reuse the port even if there are sockets in the TIME_WAIT state.
The test itself is invalid. Testing this behaviour is pointless, and has nothing to do with any required application behaviour: it is just exercising a corner condition in the TCP stack, which certainly no application should try to rely on. I would expect that opening a listening socket on a port that had just been an outbound connected port would never succeed at all due to TIME_WAIT, or at best succeed half the time due to uncertainty as to which end issued the close first.
I would remove the test. The rest of it doesn't do anything useful either.