I was elated to read last winter that the x86-64 variant of RHEL 9 will be built and optimised for a modern target, not so much because this means slightly faster execution, but because of the more efficient use of the available resources that comes with it.
Reconsidering the default build target for our code
Red Hat has chosen the newly defined x86-64-v2 microarchitecture level as the build target, a common denominator for the rather mixed capabilities of the server processors out in the field. Most of our performance-critical code (at least the parts I maintain, which are a mixture of bandwidth-limited data acquisition, bit twiddling and digital signal processing) is built using the -march=native -mtune=native compiler flags. This should result in optimal efficiency for the machine on which the code is compiled, but it has often led to issues when people (including me) ran the binaries on heterogeneous CPU clusters. It either meant explaining how to circumvent Illegal instruction crashes (by overriding CFLAGS), or removing the related flags from the build scripts altogether when the execution targets were simply too diverse. Neither approach is efficient (for me or the CPUs), so I decided to try out a compromise similar to that of RHEL 9.
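For illustration, the workaround on such a cluster typically boiled down to something like this, assuming a Make-based build that lets the command line override CFLAGS:

# build for the generic baseline so the binaries also run on the oldest nodes
make CFLAGS="-O2 -march=x86-64 -mtune=generic"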
How to optimise for x86-64-v2 with older toolchains?
GCC 11 and LLVM 12 support the three new x86-64 microarchitecture levels out of the box, so with those toolchains simply adding -march=x86-64-v2 to the compiler flags should be sufficient.
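For example (the file name is purely illustrative):

gcc-11 -O2 -march=x86-64-v2 -c fir_filter.c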
The precise set of features behind x86-64-v2 is defined in the following bit of the GCC 11 sources:
constexpr wide_int_bitmask PTA_X86_64_BASELINE = PTA_64BIT | PTA_MMX | PTA_SSE
| PTA_SSE2 | PTA_NO_SAHF | PTA_FXSR;
constexpr wide_int_bitmask PTA_X86_64_V2 = (PTA_X86_64_BASELINE
& (~PTA_NO_SAHF))
| PTA_CX16 | PTA_POPCNT | PTA_SSE3 | PTA_SSE4_1 | PTA_SSE4_2 | PTA_SSSE3;
constexpr wide_int_bitmask PTA_X86_64_V3 = PTA_X86_64_V2
| PTA_AVX | PTA_AVX2 | PTA_BMI | PTA_BMI2 | PTA_F16C | PTA_FMA | PTA_LZCNT
| PTA_MOVBE | PTA_XSAVE;
constexpr wide_int_bitmask PTA_X86_64_V4 = PTA_X86_64_V3
| PTA_AVX512F | PTA_AVX512BW | PTA_AVX512CD | PTA_AVX512DQ | PTA_AVX512VL;
Hence, GCC 11’s -march=x86-64-v2 seems to be equivalent to:
gcc -march=x86-64 -mmmx -msse -msse2 -mfxsr -msahf -mcx16 -mpopcnt -msse3 -msse4.1 -msse4.2 -mssse3
A quick cross-check with echo | gcc-11 -E -dM - <flags> confirms the equivalence at the #define level.
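For the record, here is one way to run such a cross-check with bash process substitution; if the two flag sets are indeed equivalent, no ISA feature macro (__SSE4_2__, __POPCNT__ and friends) shows up on only one side of the diff:

diff <(echo | gcc-11 -E -dM - -march=x86-64-v2 | sort) \
     <(echo | gcc-11 -E -dM - -march=x86-64 -mmmx -msse -msse2 -mfxsr -msahf \
          -mcx16 -mpopcnt -msse3 -msse4.1 -msse4.2 -mssse3 | sort)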
LLVM 12 neatly documents the per-level features in doc/UsersManual.rst of its distribution:
Several micro-architecture levels as specified by the x86-64 psABI are defined. They are cumulative in the sense that features from previous levels are implicitly included in later levels.
-march=x86-64: CMOV, CMPXCHG8B, FPU, FXSR, MMX, FXSR, SCE, SSE, SSE2
-march=x86-64-v2: (close to Nehalem) CMPXCHG16B, LAHF-SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3
-march=x86-64-v3: (close to Haswell) AVX, AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, XSAVE
-march=x86-64-v4: AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL
This matches what we derived from the GCC sources above.
Beyond x86-64-v2
The advanced vector extensions (AVX) and fused multiply-add (FMA) have been tremendously helpful in improving the performance of our digital signal processing, but are not included in x86-64-v2. This seems to be a concession to the fact that even the newest Intel Atom “microserver” CPUs support neither; I also know of a few users who are still running Sandy Bridge-based platforms (which support AVX, but not FMA). Since I am not aware of any Intel Atom servers in our environments, it seems safe to add -mavx to the CFLAGS:
CFLAGS_X86_64_V2="-march=x86-64 -mmmx -msse -msse2 -mfxsr -msahf -mcx16 -mpopcnt -msse3 -msse4.1 -msse4.2 -mssse3"
CFLAGS="${CFLAGS_X86_64_V2} -mavx" # close to Sandy Bridge
This may not seem like a large step up from x86-64-v2, but it will enable some of our hand-written AVX DSP routines by default.
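A quick way to see the effect, assuming the hand-written routines are gated on the compiler's feature-test macros, is to dump the predefined macros once more:

# expected: only "#define __AVX__ 1", i.e. AVX is available to the DSP
# routines while AVX2 and FMA remain off
echo | gcc -E -dM - ${CFLAGS_X86_64_V2} -mavx | grep -E '__(AVX|AVX2|FMA)__'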
The remaining question is: how are we going to find out when we can safely switch to x86-64-v3?
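Until there is a definitive answer, one pragmatic sketch is to survey the machines we can reach and check /proc/cpuinfo for the features that x86-64-v3 adds on top of v2 (the kernel reports LZCNT as abm):

# features x86-64-v3 adds on top of v2, as listed above;
# lzcnt shows up as "abm" in /proc/cpuinfo
required="avx avx2 bmi1 bmi2 f16c fma abm movbe xsave"
capable=yes
for feature in $required; do
    grep -qw "$feature" /proc/cpuinfo || { echo "missing: $feature"; capable=no; }
done
echo "x86-64-v3 capable: $capable"

On hosts with a sufficiently recent glibc (2.33 or newer, if I remember correctly), the dynamic loader can also report the supported levels directly: /lib64/ld-linux-x86-64.so.2 --help lists the glibc-hwcaps subdirectories, e.g. “x86-64-v3 (supported, searched)”.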