performance metrics for micro-optimization


by arya dradjica

This page describes a mental model for understanding the performance of individual instructions on modern x86 hardware. While benchmarking can determine whether one implementation is faster than another, this mental model can help explain why. It should be useful for writing high-performance routines (e.g. in assembly or using intrinsics).

This model describes ILP (instruction-level parallelism) and the concrete details of how it works on modern x86 CPUs (from both Intel and AMD). There's a lot this model doesn't consider (decoding time, memory accesses, speculative execution, or hyper-threading), but those factors are less relevant when analyzing a single CPU-bound function in a hot loop. For code dominated by memory accesses or conditional branches, this model is less applicable.

An x86 CPU has registers for holding data and is capable of many kinds of operations (e.g. addition, division, bitwise XOR) on that data. It consumes machine code (which consists of encoded instructions) and executes it. An instruction specifies an operation to execute and the inputs and outputs of that operation. Operations are implemented by some circuitry (e.g. a 32-bit adder circuit) tucked away in the CPU. The crux of CPU design is to decode instructions, activate the right circuitry for every operation, and route input and output data to and from them, as quickly as physically possible.

Modern CPUs use ILP (instruction-level parallelism), which hinges on the following observation: different operations use independent bits of circuitry, so we could execute independent operations simultaneously by activating multiple bits of circuitry at a time. These CPUs analyze incoming instructions, determine which ones can be executed in parallel, and route the inputs and outputs of multiple instructions simultaneously.

In the case of x86, ILP is implemented as follows: the circuitry for every operation is grouped into 8-10 high-level execution ports. Some ports are specialized (e.g. their primary function is to interact with memory or to perform SIMD operations), but common operations like addition can be executed on many different ports. Instructions are translated into micro-operations (uops), each of which executes on a single port. For example, on Intel Haswell CPUs, an ADD instruction becomes a single uop that can execute on port 0, 1, 5, or 6. An ADD that loads one operand from memory also includes a uop for port 2 or 3; presumably, those ports support loading from memory.

Some complex uops take more than one cycle to execute; they are implemented as a sequence of stages within the executing port. A stage (almost always) takes one cycle. Since each uop occupies only one stage at a time, every stage can be working on a different uop simultaneously. Thus, a port can begin a new uop every cycle, even while previous uops are still in flight; this is known as pipelining. On Intel Haswell, POPCNT is a 3-cycle uop on port 1, but a new POPCNT (that does not depend on the output of the previous one) can begin every cycle. Presumably, its uop is split into 3 single-cycle stages.

Based on this model, the following performance metrics are used:

Latency

The time an instruction takes to execute.

In some cases, instructions which take multiple inputs and/or return multiple outputs will consume and produce them at different points during execution. Latency is really measured as the time between a particular input being consumed and a particular output being produced; when all input-output combinations have the same latency, a single value is reported by convention.

For example, integer addition and subtraction universally have 1 cycle of latency. On older CPUs, the MUL instruction would return the low half of the product after 3 cycles, but the high half a cycle later.

Throughput
Reciprocal Throughput

The number of independent instances of the same operation that can be executed per clock cycle.

On Intel Haswell chips, ADD has a throughput of 4 per cycle, because it can be executed on 4 different ports simultaneously.

Reciprocal throughput (which is more commonly used) is the reciprocal of this value, e.g. 0.25 cycles per instruction.

Port Usage

The ports used by an instruction, and how many uops are sent to each port.

Usually formatted like 3p0156 + 2p23, which indicates three uops that can each go to port 0, 1, 5, or 6, and two uops that can each go to port 2 or 3. The order of the uops and the dependencies between them are unspecified, but can sometimes be inferred from the latency and throughput.

ADD and OR tend to use the same ports (p0156 on Intel Haswell), so even though their individual throughputs are 4 per cycle, 4 ADDs and 4 ORs could not be executed simultaneously in the same cycle.