cfor along how Why does a 4 billion-iteration Java loop take only 2 ms?

3 Answers

It looks like it was optimized away by JIT compiler. When I turn it off (-Djava.compiler=NONE), the code runs much slower:

$ javac
$ java MyClass
Used 4
$ java -Djava.compiler=NONE MyClass
Used 40409

I put OP's code inside of class MyClass.

java for loop

I'm running the following Java code on a laptop with 2.7 GHz Intel Core i7. I intended to let it measure how long it takes to finish a loop with 2^32 iterations, which I expected to be roughly 1.48 seconds (4/2.7 = 1.48).

But actually it only takes 2 milliseconds, instead of 1.48 s. I'm wondering if this is a result of any JVM optimization underneath?

public static void main(String[] args)
    long start = System.nanoTime();

    for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++){
    long finish = System.nanoTime();
    long d = (finish - start) / 1000000;

    System.out.println("Used " + d);

As already pointed out, JIT (just-in-time) compiler can optimize an empty loop in order to remove unnecessary iterations. But how?

Actually, there are two JIT compilers: C1 & C2. First, the code is compiled with the C1. C1 collects the statistics and helps the JVM to discover that in 100% cases our empty loop doesn't change anything and is useless. In this situation C2 enters the stage. When the code is get called very often, it can be optimized and compiled with the C2 using collected statistics.

As an example, I will test the next code snippet (my JDK is set to slowdebug build 9-internal):

public class Demo {
    private static void run() {
        for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {

With the following command line options:

-XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*

And there are different versions of my run method, compiled with the C1 and C2 appropriately. For me, the final variant (C2) looks something like this:


; B1: # B3 B2 <- BLOCK HEAD IS JUNK  Freq: 1
0x00000000125461b0: mov   dword ptr [rsp+0ffffffffffff7000h], eax
0x00000000125461b7: push  rbp
0x00000000125461b8: sub   rsp, 40h
0x00000000125461bc: mov   ebp, dword ptr [rdx]
0x00000000125461be: mov   rcx, rdx
0x00000000125461c1: mov   r10, 57fbc220h
0x00000000125461cb: call  indirect r10    ; *iload_1

0x00000000125461ce: cmp   ebp, 7fffffffh  ; 7fffffff => 2147483647
0x00000000125461d4: jnl   125461dbh       ; jump if not less

; B2: # B3 <- B1  Freq: 0.999999
0x00000000125461d6: mov   ebp, 7fffffffh  ; *if_icmpge

; B3: # N44 <- B1 B2  Freq: 1       
0x00000000125461db: mov   edx, 0ffffff5dh
0x0000000012837d60: nop
0x0000000012837d61: nop
0x0000000012837d62: nop
0x0000000012837d63: call  0ae86fa0h


It is a little bit messy, but If you look closely, you may notice that there is no long running loop here. There are 3 blocks: B1, B2 and B3 and the execution steps can be B1 -> B2 -> B3 or B1 -> B3. Where Freq: 1 - normalized estimated frequency of a block execution.

You consider start and finish time in nanosecond and you divide by 10^6 for calculate the latency

long d = (finish - start) / 1000000

it should be 10^9 because 1 second = 10^9 nanosecond.