Tuesday, May 30, 2017

Interprocedural optimization in GCC

Compilers can do a better job optimizing a function if they can use knowledge of other functions. The obvious case is inlining, but there are many more cases. This post lists the interprocedural optimizations implemented in GCC 7.

Many of the optimizations are only relevant for large functions (small functions are inlined into the caller!) or for helping other optimization passes. This makes it hard to give relevant examples, so the examples in this post are just illustrating the principles.

Parameter passing

Parameter passing for functions where GCC can see all callers (such as functions that are local to a translating unit, or when the whole program is compiled using link-time optimization) is optimized as
  • Unused parameters are removed.
  • Parameters passed  by reference may be changed to be passed by value. For example,
    static int foo(int *m)
    {
      return *m + 1;
    }
    
    int bar(void)
    {
      int i = 1;
      return foo(&i);
    }
    
    is changed to
    static int foo(int m)
    {
      return m + 1;
    }
    
    int bar(void)
    {
      int i = 1;
      return foo(i);
    }
    
    which makes it much easier for other optimization passes to reason about the variables.
  • A structure may be split into its elements. For example,
    struct bovid
    {
      float red;
      int green;
      void *blue;
    };
    
    static void ox(struct bovid *cow)
    {
      cow->red = cow->red + cow->green;
    }
    
    int main(void)
    {
      struct bovid cow;
    
      cow.red = 7.4;
      cow.green = 6;
      cow.blue = &cow;
    
      ox(&cow);
    
      return 0;
    }
    
    is changed to
    struct bovid
    {
      float red;
      int green;
      void *blue;
    };
    
    static void ox(float *t1, int t2)
    {
      *t1 = *t1 + t2;
    }
    
    int main(void)
    {
      struct bovid cow;
    
      cow.red = 7.4;
      cow.green = 6;
      cow.blue = &cow;
    
      ox(&cow.red, cow.green);
    
      return 0;
    }

These optimizations are enabled by -fipa-sra, which is enabled by default at -Os, -O2, and -O3.

Constant propagation

Functions where all callers pass the same constant can be optimized by propagating the constant into the function. That is,
static int foo(int a, int b)
{
  if (b > 0)
    return a + b;
  else
    return a * b;
}

int bar(int m, int n)
{
  return foo(m, 7) + foo(n, 7);  
}
is optimized to
static int foo(int a)
{
  return a + 7;
}

int bar(int m, int n)
{
  return foo(m) + foo(n);
}

The constants can be propagated bitwise, which is useful for flag parameters. For example
static int foo(int a, int b)
{
  if (b & 4)
    return a & (b & 1);
  else
    return a & (b & 2);
}

int bar(int m, int n)
{
  return foo(m, 9) | foo(n, 3);
}
is optimized to
static int foo(int a, int b)
{
  return a & (b & 2);
}

int bar(int m, int n)
{
  return foo(m, 9) | foo(n, 3);
}

The constants do not need to be the same in all function calls – GCC tracks ranges of possible values and optimize as appropriate, so
static int foo(int a, int b)
{
  if (b > 0)
    return a + b;
  else
    return a * b;
}

int bar(int m, int n)
{
  return foo(m, 5) + foo(n, 7);  
}
is optimized to
static int foo(int a, int b)
{
  return a + b;
}

int bar(int m, int n)
{
  return foo(m, 5) + foo(n, 7);  
}
as both 5 and 7 are greater than 0.

These optimizations are enabled by -fipa-cp, -fipa-bit-cp, and -fipa-vrp, which are enabled by default at -Os, -O2, and -O3.

Constant propagation – cloning

It is often the case that only a few of the function calls pass constants as parameters, or that the constants are conflicting so they cannot be propagated into the called function. GCC handles this by cloning the called function to let each conflicting call get its own version. For example,
static int foo(int a, int b)
{
  if (b > 0)
    return a + b;
  else
    return a * b;
}

int bar(int m, int n)
{
  return foo(m, 5) + foo(m, n);  
}
creates one clone of foo and optimizes it using the constant 5 for the parameter b
static int foo(int a, int b)
{
  if (b > 0)
    return a + b;
  else
    return a * b;
}

static int foo_clone(int a)
{
  return a + 5;
}

int bar(int m, int n)
{
  return foo_clone(m) + foo(m, n);  
}

This optimization is enabled by -fipa-cp-clone, which is enabled by default at -O3.

Devirtualization

Devirtualization (converting calls to virtual functions to direct calls – see Jan Hubička's blog series on how devirtualization works in GCC) is helped by propagating type information in roughly the same way as the constants are propagated, and is implemented by the constant propagation pass.

This is enabled by -fipa-cp and -fdevirtualize, which are enabled by default at -Os, -O2, and -O3.

Caller-saved registers

Caller saved registers do not need to be saved if those registers are not used by the called function.

This optimization is enabled by -fipa-ra, which is enabled by default at -Os, -O2, and -O3.

Identical code folding

The “identical code folding pass” merges identical functions. The functions do not need to be identical in the source code – the merging is done halfway through the optimization pipeline so it is enough that they have the same structure after simplification (and variable names etc. does not matter).

Functions that may be used outside the compilation unit cannot be completely merged as the C and C++ standards require that functions have unique addresses. GCC solves this by adding wrappers for the exported symbols, so that
#include <stdio.h>

void foo(char *s)
{
    printf("Hello %s\n", s);
}

void bar(char *s)
{
    printf("Hello %s\n", s);
}
is generated as
.LC0:
        .string "Hello %s\n"

foo:
        mov     rsi, rdi
        xor     eax, eax
        mov     edi, OFFSET FLAT:.LC0
        jmp     printf

bar:
        jmp     foo

This optimization is enabled by -fipa-icf, which is enabled by default at -Os, -O2, and -O3.

Profile propagation

Many optimizations have different heuristics depending on how much the code is executed. The compiler estimates branch frequencies and propagates this information between functions so that, for example, a function only called from “cold” code segments is treated as a “cold” function.

This is enabled by -fipa-profile, which is enabled by default at -O and higher.

Pure, const, and noexcept

GCC analyzes functions to determine if they access memory or may throw exceptions, propagates this information throughout the compilation unit, and annotates the functions with pure, const, and noexcept attributes when possible, which helps other optimizations.

This optimization is enabled by -fipa-pure-const, which is enabled by default at -O and higher.

Global variables

It is in general hard to optimize usage of global variables, but it is easy to improve usage of global variables that cannot escape the compilation unit and that do not have the address taken. There are three optimizations done on such variables
  • Removal of global variables that are never read.
  • A global variable that is used in only one function may be changed to a local variable in that function.
  • The compiler tracks which functions modifies the variables so that loads and stores may be moved over function calls that do not touch the variable. For example, the function bar in
    static int g;
    
    void foo(void)
    {
      // Code not touching g
    }
    
    int bar(void)
    {
      g += 1;
      foo();
      g += 2;
    }
    
    is optimized to
    int bar(void)
    {
      foo();
      g += 3;
    }
    
These optimizations are enabled by -fipa-reference, which is enabled by default at -O and higher.

Pointer analysis

GCC can do interprocedural pointer analysis, which is enabled by -fipa-pta. This optimization is not enabled by default at any optimization level as it can cause excessive memory and compile-time usage on large compilation units.

5 comments:

  1. Thank you very much for your post. I have a question: which functions are local to a translation unit? I presume, static and those from anonymous namespace, but maybe some other too?

    ReplyDelete
    Replies
    1. I think those are then only one if you compile normally.

      But, for example, \(\verb!-fwhole-program!\) makes all functions local to the translation unit, unless you explicitly tell the compiler they are externally visible.

      Delete
  2. Great article! Thank you. Could you please write a little more about the last optimization, i.e. pointer analysis? What exactly a compiler can infer?

    ReplyDelete
    Replies
    1. Good idea! I'll try to do that in a week or so...

      Delete
  3. The barriers that exist between compilation units sometimes impose barriers to what would otherwise be useful optimizations. Such barriers, however, can also serve a useful purpose in cases where programs need semantics beyond what the Standard would offer; on many platforms, they made it possible to achieve such semantics without the use of compiler-specific syntax.

    Especially when doing embedded-, systems-, or other low-level programming, there are times when it is necessary to have a program behave as though it called a function the compiler knew nothing about, but which--from the compiler's point of view--might read any arbitrary collection of objects using character-type pointers and then write an arbitrary collection of objects likewise, regarding behavior as defined in all cases where some combination of actions by the function would make behavior defined. In the days before "whole program optimization" there was no need for such an intrinsic, since calls to outside functions would naturally provide the required behavior. While it would be fine to deprecate reliance upon some implicit barriers in favor of explicit ones, the Standard does not from what I can tell define any usable alternatives.

    ReplyDelete