Here's a result I can't wrap my head around, despite extensive reading of the JDK source and examination of the intrinsic routines.
I'm benchmarking clearing out a ByteBuffer, allocated with allocateDirect, using ByteBuffer.putLong(int index, long value). Based on the JDK code, this results in a single 8-byte write if the buffer is in "native byte order", or a byte swap followed by the same write if it isn't.
So I'd expect native byte order (little-endian for me) to be at least as fast as non-native. As it turns out, however, non-native is ~2x faster.
Here's my benchmark in Caliper 0.5x:
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

import com.google.caliper.Param;
import com.google.caliper.Runner;
import com.google.caliper.SimpleBenchmark;

public class ByteBufferBench extends SimpleBenchmark {

    private static final int SIZE = 2048;
    private static final int LONG_BYTES = 8;

    enum Endian {
        DEFAULT,
        SMALL,
        BIG
    }

    @Param Endian endian;

    private ByteBuffer bufferMember;

    @Override
    protected void setUp() throws Exception {
        super.setUp();
        bufferMember = ByteBuffer.allocateDirect(SIZE);
        bufferMember.order(endian == Endian.DEFAULT ? bufferMember.order() :
                (endian == Endian.SMALL ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN));
    }

    public int timeClearLong(int reps) {
        ByteBuffer buffer = bufferMember;
        while (reps-- > 0) {
            // Write one long every 8 bytes, covering the whole buffer.
            for (int i = 0; i < SIZE; i += LONG_BYTES) {
                buffer.putLong(i, reps);
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        Runner.main(ByteBufferBench.class, args);
    }
}
The results are:
benchmark type endian ns linear runtime
ClearLong DIRECT DEFAULT 64.8 =
ClearLong DIRECT SMALL 118.6 ==
ClearLong DIRECT BIG 64.8 =
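As an aside, the DEFAULT and BIG rows coming out identical is expected: per the ByteBuffer javadoc, a newly-created buffer is always big-endian regardless of the platform's native order, so leaving the order untouched is the same as setting BIG_ENDIAN explicitly. A minimal check (class name is my own):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class DefaultOrderCheck {
    public static void main(String[] args) {
        // A newly-created buffer (heap or direct) always starts big-endian,
        // independent of the platform's ByteOrder.nativeOrder().
        ByteBuffer direct = ByteBuffer.allocateDirect(16);
        System.out.println(direct.order());          // BIG_ENDIAN
        System.out.println(ByteOrder.nativeOrder()); // platform-dependent
    }
}
```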
That's consistent. If I swap putLong for putFloat, it's about 4x faster for native order. If you look at how putLong works, it's doing strictly more work in the non-native case:
private ByteBuffer putLong(long a, long x) {
    if (unaligned) {
        long y = (x);
        unsafe.putLong(a, (nativeByteOrder ? y : Bits.swap(y)));
    } else {
        Bits.putLong(a, x, bigEndian);
    }
    return this;
}
Note that unaligned is true in either case. The only difference between native and non-native byte order is the extra Bits.swap call, which should favor the native (little-endian) case.
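To make the semantics concrete: a non-native write is just a byte swap followed by a native-order write, and Long.reverseBytes is the public equivalent of the internal Bits.swap. This sketch (class name is my own) shows the two paths produce identical bytes in memory:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SwapEquivalence {
    public static void main(String[] args) {
        long v = 0x0102030405060708L;

        // Path 1: write the value with big-endian order.
        ByteBuffer big = ByteBuffer.allocateDirect(8).order(ByteOrder.BIG_ENDIAN);
        big.putLong(0, v);

        // Path 2: swap the bytes first (what Bits.swap does; the public
        // equivalent is Long.reverseBytes), then write little-endian.
        ByteBuffer little = ByteBuffer.allocateDirect(8).order(ByteOrder.LITTLE_ENDIAN);
        little.putLong(0, Long.reverseBytes(v));

        // Both buffers now hold the bytes 01 02 03 04 05 06 07 08.
        boolean same = true;
        for (int i = 0; i < 8; i++) {
            same &= big.get(i) == little.get(i);
        }
        System.out.println(same); // true
    }
}
```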
To summarize the discussion from the mechanical sympathy mailing list:
1. The anomaly described by the OP was not reproducible on my setup (JDK7u40 / Ubuntu 13.04 / i7): performance was consistent for both heap and direct buffers in all cases, with the direct buffer offering a massive performance advantage:
BYTE_ARRAY DEFAULT 211.1 ==============================
BYTE_ARRAY SMALL 199.8 ============================
BYTE_ARRAY BIG 210.5 =============================
DIRECT DEFAULT 33.8 ====
DIRECT SMALL 33.5 ====
DIRECT BIG 33.7 ====
The Bits.swap(y) call gets intrinsified into a single instruction, so it can't really account for much of a difference in overhead.
2. The above result (i.e., contradicting the OP's experience) was independently confirmed by a naive hand-rolled benchmark and by a JMH benchmark written by another participant.
This leads me to believe you are either experiencing some local issue or some sort of benchmarking-framework issue. It would be valuable if others could run the experiment and see whether they can reproduce your result.
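For reference, the swap that Bits.swap performs is exposed publicly as Long.reverseBytes, which HotSpot compiles down to a single byte-swap instruction on x86. A quick sanity check of its semantics (class name is my own):

```java
public class ReverseBytesDemo {
    public static void main(String[] args) {
        long v = 0x0102030405060708L;

        // Reversing the byte order mirrors the 8 bytes end-to-end.
        System.out.println(Long.toHexString(Long.reverseBytes(v))); // 807060504030201

        // Applying the swap twice is the identity, so the only extra cost
        // of a non-native write should be one register-level swap.
        System.out.println(Long.reverseBytes(Long.reverseBytes(v)) == v); // true
    }
}
```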