Here's a result I can't wrap my head around, despite extensive reading of the JDK source and examination of the intrinsic routines.
I'm benchmarking clearing out a ByteBuffer, allocated with allocateDirect, using ByteBuffer.putLong(int index, long value). Based on the JDK code, this results in a single 8-byte write if the buffer is in "native byte order", or a byte swap followed by the same write if it isn't.
So I'd expect native byte order (little-endian for me) to be at least as fast as non-native. As it turns out, however, non-native is ~2x faster.
Here's my benchmark in Caliper 0.5x:
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

import com.google.caliper.Param;
import com.google.caliper.Runner;
import com.google.caliper.SimpleBenchmark;

public class ByteBufferBench extends SimpleBenchmark {

    private static final int SIZE = 2048;
    private static final int LONG_BYTES = 8;

    enum Endian {
        DEFAULT,
        SMALL,
        BIG
    }

    @Param Endian endian;

    private ByteBuffer bufferMember;

    @Override
    protected void setUp() throws Exception {
        super.setUp();
        bufferMember = ByteBuffer.allocateDirect(SIZE);
        bufferMember.order(endian == Endian.DEFAULT ? bufferMember.order() :
                (endian == Endian.SMALL ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN));
    }

    public int timeClearLong(int reps) {
        ByteBuffer buffer = bufferMember;
        while (reps-- > 0) {
            // Write one long every 8 bytes, covering the whole buffer.
            for (int i = 0; i < SIZE; i += LONG_BYTES) {
                buffer.putLong(i, reps);
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        Runner.main(ByteBufferBench.class, args);
    }
}
The results are:
benchmark type endian ns linear runtime
ClearLong DIRECT DEFAULT 64.8 =
ClearLong DIRECT SMALL 118.6 ==
ClearLong DIRECT BIG 64.8 =
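As an aside, the DEFAULT and BIG rows coming out identical is expected: per the ByteBuffer javadoc, a newly-created buffer is always big-endian regardless of the platform's native order, so leaving the order untouched is the same as setting BIG_ENDIAN explicitly. A minimal check (class name is my own):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class DefaultOrderCheck {
    public static void main(String[] args) {
        // A newly-created buffer (heap or direct) always starts big-endian,
        // independent of the platform's ByteOrder.nativeOrder().
        ByteBuffer direct = ByteBuffer.allocateDirect(16);
        System.out.println(direct.order());          // BIG_ENDIAN
        System.out.println(ByteOrder.nativeOrder()); // platform-dependent
    }
}
```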
That's consistent. If I swap putLong for putFloat, it's about 4x faster for native order. If you look at how putLong works, it's doing strictly more work in the non-native case:
private ByteBuffer putLong(long a, long x) {
    if (unaligned) {
        long y = (x);
        unsafe.putLong(a, (nativeByteOrder ? y : Bits.swap(y)));
    } else {
        Bits.putLong(a, x, bigEndian);
    }
    return this;
}
Note that unaligned is true in either case. The only difference between native and non-native byte order is the extra Bits.swap call, which should favor the native (little-endian) case.
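To make the semantics concrete: a non-native write is just a byte swap followed by a native-order write, and Long.reverseBytes is the public equivalent of the internal Bits.swap. This sketch (class name is my own) shows the two paths produce identical bytes in memory:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SwapEquivalence {
    public static void main(String[] args) {
        long v = 0x0102030405060708L;

        // Path 1: write the value with big-endian order.
        ByteBuffer big = ByteBuffer.allocateDirect(8).order(ByteOrder.BIG_ENDIAN);
        big.putLong(0, v);

        // Path 2: swap the bytes first (what Bits.swap does; the public
        // equivalent is Long.reverseBytes), then write little-endian.
        ByteBuffer little = ByteBuffer.allocateDirect(8).order(ByteOrder.LITTLE_ENDIAN);
        little.putLong(0, Long.reverseBytes(v));

        // Both buffers now hold the bytes 01 02 03 04 05 06 07 08.
        boolean same = true;
        for (int i = 0; i < 8; i++) {
            same &= big.get(i) == little.get(i);
        }
        System.out.println(same); // true
    }
}
```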
To summarize the discussion from the mechanical sympathy mailing list:
1. The anomaly described by the OP was not reproducible on my setup (JDK7u40 / Ubuntu 13.04 / i7): performance was consistent for both heap and direct buffers in all cases, with the direct buffer offering a massive performance advantage:
BYTE_ARRAY DEFAULT 211.1 ==============================
BYTE_ARRAY SMALL 199.8 ============================
BYTE_ARRAY BIG 210.5 =============================
DIRECT DEFAULT 33.8 ====
DIRECT SMALL 33.5 ====
DIRECT BIG 33.7 ====
The Bits.swap(y) call gets intrinsified into a single instruction, so it can't really account for much of a difference in overhead.
2. The above result (i.e., contradicting the OP's experience) was independently confirmed by a naive hand-rolled benchmark and by a JMH benchmark written by another participant.
This leads me to believe you are either experiencing some local issue or some sort of benchmarking-framework issue. It would be valuable if others could run the experiment and see whether they can reproduce your result.
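For reference, the swap that Bits.swap performs is exposed publicly as Long.reverseBytes, which HotSpot compiles down to a single byte-swap instruction on x86. A quick sanity check of its semantics (class name is my own):

```java
public class ReverseBytesDemo {
    public static void main(String[] args) {
        long v = 0x0102030405060708L;

        // Reversing the byte order mirrors the 8 bytes end-to-end.
        System.out.println(Long.toHexString(Long.reverseBytes(v))); // 807060504030201

        // Applying the swap twice is the identity, so the only extra cost
        // of a non-native write should be one register-level swap.
        System.out.println(Long.reverseBytes(Long.reverseBytes(v)) == v); // true
    }
}
```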