We have been seeing inconsistent network failures when trying to set up Infinispan on EC2 (large instances) over JGroups 3.1.0-FINAL running on Amazon's 64-bit Linux AMI. An empty cache starts fine and seems to work for a time; however, once the cache is full, synchronizing a new server causes the cache to lock.
We decided to roll our own cache but are seeing approximately the same behavior. Tens of megabytes are exchanged during synchronization, but the network is not being flooded: there is a back-and-forth data -> ack conversation at the application level. It looks like some of the messages never reach the remote host.
Looking at the UNICAST trace logging, I see the following:
# my application starts a cache refresh operation 
01:02:12.003 [Incoming-1,mprewCache,i-f6a9d986] DEBUG c.m.e.q.c.l.DistributedMapManager - i-f6a9d986: from i-d2e29fa2: search:REFRESH 
01:02:12.003 [Incoming-1,mprewCache,i-f6a9d986] INFO  c.m.e.q.c.l.DistributedMapRequest - starting REFRESH from i-d2e29fa2 for map search, map-size 62373 
01:02:12.003 [Incoming-1,mprewCache,i-f6a9d986] DEBUG c.m.e.q.c.l.DistributedMapManager - i-f6a9d986: to i-d2e29fa2: search:PUT_MANY, 50 keyValues 
# transmits a block of 50 values to the remote but this never seems to get there 
01:02:12.004 [Incoming-1,mprewCache,i-f6a9d986] TRACE o.j.p.UNICAST - i-f6a9d986 --> DATA(i-d2e29fa2: #11, conn_id=10) 
# acks another window 
01:02:12.004 [Incoming-1,mprewCache,i-f6a9d986] TRACE o.j.p.UNICAST - i-f6a9d986 --> ACK(i-d2e29fa2: #4) 
# these XMITs happen over and over until 01:30:40 
01:02:12.208 [Timer-2,mprewCache,i-f6a9d986] TRACE o.j.p.UNICAST - i-f6a9d986 --> XMIT(i-d2e29fa2: #6) 
01:02:12.209 [Timer-2,mprewCache,i-f6a9d986] TRACE o.j.p.UNICAST - i-f6a9d986 --> XMIT(i-d2e29fa2: #7) 
01:02:12.209 [Timer-2,mprewCache,i-f6a9d986] TRACE o.j.p.UNICAST - i-f6a9d986 --> XMIT(i-d2e29fa2: #8) 
...
Here's our JGroups stack. We replace the PING protocol at runtime with our own EC2_PING version, which uses AWS calls to find other cluster member candidates. This is not a connection issue.
Any ideas why some of the packets are not arriving at their destination?
This has been an interesting problem to track down. It seems to affect certain EC2 instances much more than others. The problem involves large packets being sent between EC2 instances via UDP.
The cache synchronization code was sending a large (~300k) message to the remote server. FRAG2 fragments it into 4 packets of 60k (the default size) and 1 packet of 43k, which are sent to the remote box. Because of some networking limitation, the remote box only receives the last (5th) 43k packet. The 60k packets just never arrive. This seems to happen only between certain pairs of hosts; other pairs can communicate fine with large packet sizes. That it's not universal is why it took me so long to isolate and diagnose the issue.
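To make the fragment arithmetic concrete, here is a minimal sketch of how a message splits for a given fragment size. It is not JGroups code: the function name fragment_sizes is made up for illustration, "k" is assumed to mean 1000 bytes, and the 283k total is simply 4 x 60k + 43k from the description above.

```python
def fragment_sizes(message_size, frag_size):
    """Split message_size bytes into fragments of at most frag_size bytes.

    Illustrative only: this mimics the arithmetic of a fragmentation layer,
    not the actual FRAG2 implementation.
    """
    if message_size < 0 or frag_size <= 0:
        raise ValueError("message_size must be >= 0 and frag_size > 0")
    full, remainder = divmod(message_size, frag_size)
    sizes = [frag_size] * full
    if remainder:
        sizes.append(remainder)  # the last, smaller fragment
    return sizes


if __name__ == "__main__":
    # Assumed message size: 4 * 60k + 43k, matching the description above.
    print(fragment_sizes(283_000, 60_000))
    # Same message with a 16k fragment size: more, smaller packets.
    print(len(fragment_sizes(283_000, 16_000)), "fragments of at most 16k")
```

With a 60k fragment size only the final, smaller fragment falls under the size that was getting through; with 16k, every fragment does.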
I initially thought this was a UDP receive buffer size issue and tried to adjust it (sysctl -w net.core.rmem_max=10240000), but this did not help. A look at the tcpdump showed that the 60k packets were just not arriving at the remote host. Only the 43k packets were.
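If you want to check a pair of hosts without reading tcpdump output, a small UDP probe can show which datagram sizes make it across. This is a rough sketch, not something we ran at the time: the function names, the port, and the reply format are all made up for illustration. Run serve_size_echo on one host and probe_sizes on the other.

```python
import socket


def serve_size_echo(sock):
    """Reply to each datagram with its length as ASCII text.

    Runs until the socket raises an error (for example, when it is closed).
    """
    while True:
        try:
            data, addr = sock.recvfrom(65535)
        except OSError:
            return
        sock.sendto(str(len(data)).encode(), addr)


def probe_sizes(host, port, sizes, timeout=1.0):
    """Send one datagram per size and report whether a matching reply came back."""
    results = {}
    for size in sizes:
        # A fresh socket per size, so a late reply to an earlier probe
        # cannot be mistaken for the reply to this one.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.settimeout(timeout)
            try:
                s.sendto(b"x" * size, (host, port))
                reply, _ = s.recvfrom(64)
                results[size] = reply == str(size).encode()
            except OSError:
                # Timeout (no reply) or a local send error such as "message too long".
                results[size] = False
    return results
```

On the receiving host, bind a UDP socket to a port of your choice and pass it to serve_size_echo; on the sending host, call something like probe_sizes("remote-host", 7500, [16000, 32000, 43000, 60000]), where the host name and port are placeholders. A single lost datagram proves little since UDP is unreliable anyway, so repeat the probe a few times before drawing conclusions.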
The solution was to decrease the frag size to 16k (32k may have been fine, but we were being conservative). There is some internal AWS limit on packet sizes as they travel around Amazon's virtual network that filters out large UDP packets above maybe 50k. The default JGroups fragment size (60k) is too big IMO and should probably be decreased to 32k or something.
We submitted a ticket on this with Amazon and they acknowledged the issue, but the general response was that it was difficult for them to fix. By then we had tweaked the fragment sizes and things were working, so the ticket was closed. To quote from the ticket:
From: Amazon Web Services
This is an update for case XXXXXXXXX. We are currently limited to packet sizes of 32k and below on Amazon EC2 and can confirm the issues you are facing for larger packet sizes. We are investigating a solution to this limitation. Please let us know if you can keep your packet sizes below this level, or if this is severe problem blocking your ability to operate.
We are actively looking into increasing the packet size along with other platform improvements, and apologize for this inconvenience.
A couple of other comments about EC2. First, we've seen TTLs of >8 necessary for hosts in the same availability zone. If you are using multicast, make sure your TTL is set to 128 or something. We initially thought this was the problem, but ultimately it was not.
Hope this helps others.
I have nothing to add to the answer itself, but I would like to offer an alternative way of detecting the same issue.
I'm not a tcpdump expert, so I analysed the issue with debugging and logging.
In our case, a message was split into a number of smaller packets (according to the frag_size parameter of FRAG2). Some of them (not necessarily the last one) were randomly not transmitted: typically, packets 1 to 19 were transmitted correctly, 21 was transmitted, but 20 was missing.
This was followed by a large number of round-trips between the 2 instances:
The client, missing packet #20, acknowledges #19 again and asks for #20; the server sends #20, which was explicitly requested, and #21, which has not been acknowledged.
The client, still missing #20, receives #21 (but not #20), re-acknowledges #19, asks for #20 again, and so on, for anywhere from 1 to 50 seconds.
In the end, the client that is missing #20 generally completes (even though #20 was never received) without reporting anything.
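The loop described above can be reproduced with a toy model. This is only a sketch of a generic cumulative-ack scheme with retransmission, not the actual UNICAST implementation; the function name simulate, the is_dropped callback, and the round limit are all invented for illustration.

```python
def simulate(num_packets, is_dropped, max_rounds=10):
    """Toy cumulative-ack exchange.

    Each round the sender (re)transmits every packet above the last ack,
    the network drops those for which is_dropped(seq) is true, and the
    receiver acknowledges the highest contiguous sequence number it holds.
    Returns (last_acked, rounds_used).
    """
    received = set()
    acked = 0  # packets 1..acked have all been received
    for round_no in range(1, max_rounds + 1):
        for seq in range(acked + 1, num_packets + 1):
            if not is_dropped(seq):
                received.add(seq)
        while acked + 1 in received:
            acked += 1
        if acked == num_packets:
            return acked, round_no
    return acked, max_rounds  # gave up: some packet never arrived


if __name__ == "__main__":
    # Packet 20 is always lost: the ack stays at 19 however often it is resent.
    print(simulate(21, lambda seq: seq == 20))
    # Nothing is lost: everything is acknowledged in a single round.
    print(simulate(21, lambda seq: False))
```

When one packet is dropped every time, the acknowledgement never moves past #19 even though #21 keeps arriving, which is the pattern visible in the logs.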