This is the case:
cat sum100000000.cpp && cat sum100000000.java 
#include <cstdio>
using namespace std;
int main(){
  long N=1000000000,sum=0;
  for( long i=0; i<N; i++ ) sum+= i;
  printf("%ld\n",sum);
}
public class sum100000000 {
    public static void main(String[] args) {
        long sum=0;
        for(long i = 0; i < 1000000000; i++) sum += i;
        System.out.println(sum);
    }
}
This is the result:
time ./a.out && time java sum100000000
499999999500000000
real    0m2.675s
user    0m2.673s
sys 0m0.002s
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
499999999500000000
real    0m0.439s
user    0m0.470s
sys 0m0.027s
Not seeing anything unusual in the disassembled binary. but seems c binary is significantly slower. which I couldn't understand.
My guess is that there might be some problem with the toolchain
clang -v
Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
Target: x86_64-apple-darwin13.4.0
Thread model: posix
uname -a
Darwin MacBook-Pro.local 13.4.0 Darwin Kernel Version 13.4.0: Sun Aug 17 19:50:11 PDT 2014; root:xnu-2422.115.4~1/RELEASE_X86_64 x86_64
Add: c / cpp compiled without any special binary. This is not changing the result anyways.
gcc sum1b.cpp
clang sum1b.cpp
Add: for those concerned with llvm, really not changing anything
$ gcc sum100000000.cpp && time ./a.out
gcc sum100000000.cpp && time ./a.out
499999999500000000
real    0m2.722s
user    0m2.717s
sys 0m0.003s
Modification: O2 is a lot faster: but that looks like a cheat
$ otool -tV a.out
otool -tV a.out
a.out:
(__TEXT,__text) section
_main:
0000000100000f50    pushq   %rbp
0000000100000f51    movq    %rsp, %rbp
0000000100000f54    leaq    0x37(%rip), %rdi ## literal pool for: "%ld
"
0000000100000f5b    movabsq $0x6f05b59b5e49b00, %rsi
0000000100000f65    xorl    %eax, %eax
0000000100000f67    callq   0x100000f70 ## symbol stub for: _printf
0000000100000f6c    xorl    %eax, %eax
0000000100000f6e    popq    %rbp
0000000100000f6f    ret
I am convinced now that this is related with Optimisation, so now the question more on what did the JIT do exactly to speed up this calculation?
Large Integers Long variables can hold numbers from -9,223,372,036,854,775,808 through 9,223,372,036,854,775,807. Operations with Long are slightly slower than with Integer .
'int' runs much faster. Conclusion: Use "int" data type instead of "long" as much as you can to get better execution performance in interpreted-only mode.
The issue is most likely that you're not compiling the C version with optimization enabled. If you enable aggressive optimization, the binary produced by gcc should win. The JVM's JIT is good, but the simple fact is that the JVM has to load and then apply JIT at runtime; gcc can optimize the binary at compile-time.
Leaving off all gcc flags gives me a binary that performs fairly slowly, like yours. Using -O2 gives me a binary that just barely loses out to the Java version. Using -O3 gives me one that easily beats the Java version. (This is on my Linux Mint 16 64-bit machine with gcc 4.8.1 and Java 1.8.0_20 [e.g., Java 8 Update 20].) larsmans examined the disassembly of the -O3 version and assures me that the compiler isn't precomputing the result (my C and assembly fu is very weak these days; many thanks to larsmans for double-checking that). Interestingly, though, thanks to investigation by Mat, it looks like that's actually a byproduct of my using gcc 4.8.1; earlier and later versions of gcc seem to be willing to precompute the result. Happy accident for us, though.
Here's my pure C version [I've also updated it to account for Ajay's comment about your using a constant in the Java version, but the variable N in the C version (didn't make any real difference, but...)]:
sum.c:
#include <stdio.h>
int main(){
  long sum=0;
  long i;
  for( i=0; i<1000000000; i++ ) sum+= i;
  printf("%ld\n",sum);
}
My Java version is unchanged from yours other than that I lost track of the zeroes too easily:
sum.java:
public class sum {
    public static void main(String[] args) {
        long sum=0;
        for(long i = 0; i < 1000000000; i++) sum += i;
        System.out.println(sum);
    }
}
Results:
C binary run (compiled via gcc sum.c):
$ time ./a.out 499999999500000000 real 0m2.436s user 0m2.429s sys 0m0.004s
Java run (compiled with no special flags, run with no special runtime flags):
$ time java sum 499999999500000000 real 0m0.691s user 0m0.684s sys 0m0.020s
Java run (compiled with no special flags, run with -server -noverify, tiny improvement):
$ time java -server -noverify sum 499999999500000000 real 0m0.651s user 0m0.649s sys 0m0.016s
C binary run (compiled via gcc -O2 sum.c):
$ time ./a.out 499999999500000000 real 0m0.733s user 0m0.732s sys 0m0.000s
C binary run (compiled via gcc -O3 sum.c):
$ time ./a.out 499999999500000000 real 0m0.373s user 0m0.372s sys 0m0.000s
Here's the main result of objdump -d a.out on my -O3 version:
0000000000400470 : 400470: 66 0f 6f 1d 08 02 00 movdqa 0x208(%rip),%xmm3 # 400680 400477: 00 400478: 31 c0 xor %eax,%eax 40047a: 66 0f ef c9 pxor %xmm1,%xmm1 40047e: 66 0f 6f 05 ea 01 00 movdqa 0x1ea(%rip),%xmm0 # 400670 400485: 00 400486: eb 0c jmp 400494 400488: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1) 40048f: 00 400490: 66 0f 6f c2 movdqa %xmm2,%xmm0 400494: 66 0f 6f d0 movdqa %xmm0,%xmm2 400498: 83 c0 01 add $0x1,%eax 40049b: 66 0f d4 c8 paddq %xmm0,%xmm1 40049f: 3d 00 65 cd 1d cmp $0x1dcd6500,%eax 4004a4: 66 0f d4 d3 paddq %xmm3,%xmm2 4004a8: 75 e6 jne 400490 4004aa: 66 0f 6f e1 movdqa %xmm1,%xmm4 4004ae: be 64 06 40 00 mov $0x400664,%esi 4004b3: bf 01 00 00 00 mov $0x1,%edi 4004b8: 31 c0 xor %eax,%eax 4004ba: 66 0f 73 dc 08 psrldq $0x8,%xmm4 4004bf: 66 0f d4 cc paddq %xmm4,%xmm1 4004c3: 66 0f 7f 4c 24 e8 movdqa %xmm1,-0x18(%rsp) 4004c9: 48 8b 54 24 e8 mov -0x18(%rsp),%rdx 4004ce: e9 8d ff ff ff jmpq 400460
As I say, my assembly-fu is very weak, but I'm seeing a loop there, rather than the compiler having done the math.
And just for completeness, the main part of the result of javap -c sum:
  public static void main(java.lang.String[]);
    Code:
       0: lconst_0
       1: lstore_1
       2: lconst_0
       3: lstore_3
       4: lload_3
       5: ldc2_w        #2                  // long 1000000000l
       8: lcmp
       9: ifge          23
      12: lload_1
      13: lload_3
      14: ladd
      15: lstore_1
      16: lload_3
      17: lconst_1
      18: ladd
      19: lstore_3
      20: goto          4
      23: getstatic     #4                  // Field java/lang/System.out:Ljava/io/PrintStream;
      26: lload_1
      27: invokevirtual #5                  // Method java/io/PrintStream.println:(J)V
      30: return
It's not precomputing the result at the bytecode level; I can't say what the JIT is doing.
It makes sense to optimize java byte code when comparing with gcc optimized byte code. Accepted answer did not include -server -noverify switches. Optionally, -Xcomp could be applied but it is not recommended.
CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
java version "1.8.0_25" Java(TM) SE Runtime Environment (build 1.8.0_25-b17) Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2
sum.java with flags:
time java -server -noverify sum
499999999500000000
real    0m0.299s
user    0m0.303s
sys 0m0.004s
and sum.c with 03 flag:
time ./a.out 
499999999500000000
real    0m0.233s
user    0m0.233s
sys 0m0.000s
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With