I was elated to read last winter that the x86-64 variant of RHEL 9 will be built and optimised for a modern target, not so much because this means slightly faster execution, but because of the more efficient use of the available resources that comes with it.
Reconsidering the default build target for our code
Red Hat has chosen the newly defined x86-64-v2 microarchitecture level as the build target, a common denominator for the rather mixed capabilities of the server processors out in the field. Most of our performance-critical code (at least the parts I maintain, which are a mixture of bandwidth-limited data acquisition, bit twiddling and digital signal processing) is built using the -march=native -mtune=native compiler flags. This should result in optimal efficiency for the machine on which the code is compiled, but it has often led to issues when people (including me) ran the binaries on heterogeneous CPU clusters. It either meant explaining how to circumvent Illegal instruction crashes (by overriding CFLAGS), or removing the related flags from the build scripts altogether when the execution targets were simply too diverse. Neither approach is efficient (for me or the CPUs), so I decided to try out a compromise similar to that of RHEL 9.
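For illustration, the workaround on such a cluster typically boiled down to something like this, assuming a Make-based build that lets the command line override CFLAGS:

# build for the generic baseline so the binaries also run on the oldest nodes
make CFLAGS="-O2 -march=x86-64 -mtune=generic"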
How to optimise for x86-64-v2 with older toolchains?
GCC 11 and LLVM 12 support the three new x86-64 microarchitecture levels out of the box, so with those toolchains simply adding -march=x86-64-v2 to the compiler flags should be sufficient.
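For example (the file name is purely illustrative):

gcc-11 -O2 -march=x86-64-v2 -c fir_filter.c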
The precise set of features behind x86-64-v2 is defined in the following bit of the GCC 11 sources:
constexpr wide_int_bitmask PTA_X86_64_BASELINE = PTA_64BIT | PTA_MMX | PTA_SSE
| PTA_SSE2 | PTA_NO_SAHF | PTA_FXSR;
constexpr wide_int_bitmask PTA_X86_64_V2 = (PTA_X86_64_BASELINE
& (~PTA_NO_SAHF))
| PTA_CX16 | PTA_POPCNT | PTA_SSE3 | PTA_SSE4_1 | PTA_SSE4_2 | PTA_SSSE3;
constexpr wide_int_bitmask PTA_X86_64_V3 = PTA_X86_64_V2
| PTA_AVX | PTA_AVX2 | PTA_BMI | PTA_BMI2 | PTA_F16C | PTA_FMA | PTA_LZCNT
| PTA_MOVBE | PTA_XSAVE;
constexpr wide_int_bitmask PTA_X86_64_V4 = PTA_X86_64_V3
| PTA_AVX512F | PTA_AVX512BW | PTA_AVX512CD | PTA_AVX512DQ | PTA_AVX512VL;
Hence, GCC 11’s -march=x86-64-v2 seems to be equivalent to:
gcc -march=x86-64 -mmmx -msse -msse2 -mfxsr -msahf -mcx16 -mpopcnt -msse3 -msse4.1 -msse4.2 -mssse3
A quick cross-check with echo | gcc-11 -E -dM - <flags> confirms the equivalence at the #define level.
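For the record, here is one way to run such a cross-check with bash process substitution; if the two flag sets are indeed equivalent, no ISA feature macro (__SSE4_2__, __POPCNT__ and friends) shows up on only one side of the diff:

diff <(echo | gcc-11 -E -dM - -march=x86-64-v2 | sort) \
     <(echo | gcc-11 -E -dM - -march=x86-64 -mmmx -msse -msse2 -mfxsr -msahf \
          -mcx16 -mpopcnt -msse3 -msse4.1 -msse4.2 -mssse3 | sort)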
LLVM 12 neatly documents the per-level features in doc/UsersManual.rst of its distribution:
Several micro-architecture levels as specified by the x86-64 psABI are defined. They are cumulative in the sense that features from previous levels are implicitly included in later levels.
-march=x86-64: CMOV, CMPXCHG8B, FPU, FXSR, MMX, FXSR, SCE, SSE, SSE2
-march=x86-64-v2: (close to Nehalem) CMPXCHG16B, LAHF-SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3
-march=x86-64-v3: (close to Haswell) AVX, AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, XSAVE
-march=x86-64-v4: AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL
This matches what we derived from the GCC sources above.
Beyond x86-64-v2
The advanced vector extensions (AVX) and fused multiply-add (FMA) have been tremendously helpful in improving the performance of our digital signal processing, but are not included in x86-64-v2. This seems to be a concession to the fact that even the newest Intel Atom “microserver” CPUs support neither; I also know of a few users who are still running Sandy Bridge-based platforms (which support AVX, but not FMA). Since I am not aware of any Intel Atom servers in our environments, it seems safe to add -mavx to the CFLAGS:
CFLAGS_X86_64_V2="-march=x86-64 -mmmx -msse -msse2 -mfxsr -msahf -mcx16 -mpopcnt -msse3 -msse4.1 -msse4.2 -mssse3"
CFLAGS="${CFLAGS_X86_64_V2} -mavx" # close to Sandy Bridge
This may not seem like a large step up from x86-64-v2, but it will enable some of our hand-written AVX DSP routines by default.
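A quick way to see the effect, assuming the hand-written routines are gated on the compiler's feature-test macros, is to dump the predefined macros once more:

# expected: only "#define __AVX__ 1", i.e. AVX is available to the DSP
# routines while AVX2 and FMA remain off
echo | gcc -E -dM - ${CFLAGS_X86_64_V2} -mavx | grep -E '__(AVX|AVX2|FMA)__'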
The remaining question is: how are we going to find out when we can safely switch to x86-64-v3?
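Until there is a definitive answer, one pragmatic sketch is to survey the machines we can reach and check /proc/cpuinfo for the features that x86-64-v3 adds on top of v2 (the kernel reports LZCNT as abm):

# features x86-64-v3 adds on top of v2, as listed above;
# lzcnt shows up as "abm" in /proc/cpuinfo
required="avx avx2 bmi1 bmi2 f16c fma abm movbe xsave"
capable=yes
for feature in $required; do
    grep -qw "$feature" /proc/cpuinfo || { echo "missing: $feature"; capable=no; }
done
echo "x86-64-v3 capable: $capable"

On hosts with a sufficiently recent glibc (2.33 or newer, if I remember correctly), the dynamic loader can also report the supported levels directly: /lib64/ld-linux-x86-64.so.2 --help lists the glibc-hwcaps subdirectories, e.g. “x86-64-v3 (supported, searched)”.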