Why does compiling with -Os make this function larger?

Consider this function:

long foo(long x) {
    return 5*x + 6;
}

When I compile it with x86-64 gcc 8.2 with -O3 (or -O2 or -O1), it compiles into this:

foo:
  leaq 6(%rdi,%rdi,4), %rax  # 5 bytes: 48 8d 44 bf 06
  ret                        # 1 byte:  c3

When I use -Os instead, it compiles into this:

foo:
  leaq (%rdi,%rdi,4), %rax   # 4 bytes: 48 8d 04 bf
  addq $6, %rax              # 4 bytes: 48 83 c0 06
  ret                        # 1 byte:  c3

The latter is 3 bytes longer. Isn't -Os supposed to produce the smallest possible code even if something larger would be more efficient? Why does the opposite seem to be happening here?

Godbolt: https://godbolt.org/z/jzNquk
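For reference, the sizes are easy to verify locally as well. This is just a sketch, assuming the function above is saved as foo.c and a GCC close to 8.2 targeting x86-64:

  gcc -O3 -c foo.c && objdump -d foo.o   # 6 bytes of code for foo
  gcc -Os -c foo.c && objdump -d foo.o   # 9 bytes of code for foo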

asked by Joseph Sible-Reinstate Monica

1 Answer

While -Os ("optimize for size") is expected to produce more compact code than the speed-oriented -O1, -O2, and -O3 options, there is indeed no such guarantee, as @Robert Harvey noted in the comments.

Optimizing compilation is a complex and delicate process. It consists of dozens of different optimization passes, usually executed serially: each pass transforms the program's intermediate representation and prepares the ground for the next one. A decision made in one pass can affect the optimizations down the road, and passes can interact in non-trivial ways that are hard to predict. The compiler relies on heuristics to produce good code, but in some cases, as here, those heuristics fall short.
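As an aside, GCC can report which optimizations each level enables, which gives a feel for how many heuristics are in play. A sketch using the -Q --help=optimizers option (a real option; the output differs between GCC versions, and the comparison below uses bash process substitution):

  # Compare the optimizer settings enabled at -O3 vs. -Os:
  diff <(gcc -Q --help=optimizers -O3) <(gcc -Q --help=optimizers -Os)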

In this example, things start off as expected, with -Os producing the more compact intermediate code, but that changes later on. One of the first phases GCC runs is the Expand pass, which lowers GCC's high-level tree representation, called GIMPLE, into the lower-level RTL representation. It produces RTL code along these lines (a way to dump the real RTL is sketched after the listings):

-O3:

  1. tmp1 <- x
  2. tmp2 <- tmp1 << 2
  3. tmp3 <- tmp2 + x
  4. retval <- tmp3 + 6

-Os:

  1. tmp <- x * 5
  2. tmp2 <- tmp + 6
  3. retval <- tmp2
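To see the RTL GCC actually produces at this point, rather than the paraphrase above, the -fdump-rtl-expand flag writes the post-Expand RTL to a dump file (the exact dump file name varies between GCC versions):

  gcc -O3 -fdump-rtl-expand -c foo.c   # writes a dump like foo.c.*r.expand
  gcc -Os -fdump-rtl-expand -c foo.c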

So far, so good: -Os wins. But some 15 passes later, the Combine pass runs, which attempts to merge sequences of instructions into single instructions. For the -O3 code, Combine manages to collapse the whole sequence very cleverly into the single leaq instruction seen in the final output, but for the -Os code it cannot collapse anything further. From that point on, later passes change the code very little.
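Likewise, -fdump-rtl-combine dumps the RTL after the Combine pass, so you can watch this step succeed at -O3 and stall at -Os, per the analysis above (the dump file naming again varies by version):

  gcc -O3 -fdump-rtl-combine -c foo.c   # the dump shows the sequence collapsed
  gcc -Os -fdump-rtl-combine -c foo.c   # the multiply and add remain separate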

To answer the exact question, namely why GCC generates the RTL it does during Expand at each level, and why Combine doesn't do a better job at -Os, one would have to dig into the GCC source and work out which parameters are the influential ones, as well as the decisions made by the preceding passes.

But the thing is, while GCC underperformed on this example, its heuristics may well be the best choice for the majority of other programs. It's a matter of delicate trade-offs, and not an easy job for compiler writers!

This may not answer the question fully, but hopefully it gives some useful background. If you're interested in inspecting GCC's output after each optimization pass, you can add the -da compilation flag, which produces an RTL dump for every pass, and the -dP flag, which, along with -S, annotates the generated assembly with the RTL of each instruction.
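Concretely (a sketch, again assuming the function lives in foo.c), that invocation could look like:

  # -da writes an RTL dump file after every pass; -dP (with -S) prefixes
  # each instruction in foo.s with its RTL pattern as a comment:
  gcc -Os -S -da -dP foo.c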

answered by valiano

