Box64 Dev

Optimizing AVX2

Note: this article is pretty technical. Basic understanding of what registers are is a strict minimum required to understand what follows.

What is AVX?

AVX (and its extension AVX2) are complex x86_64 instruction sets which extends the SSE4 instruction set.

It mandates the existence of extensions of the 16 SSE 128-bits-wide XMM registers (called XMM0 through XMM15). These extensions are 256-bits wide and are called the YMM registers (YMM0 through YMM15). To use those instructions, there is a new prefix, the VEX prefix (which has two opcodes, C4 and C5).

However, the lowest 128 bits of the YMMi register is (physically) the XMMi register. SSE opcodes do not interfere with the highest 128 bits of the YMM registers, only interacting with the XMM registers, but AVX instructions change both. When an AVX instruction only explicitly writes to the XMM part of a YMM register, it also clears the high part of the YMM register.

The Dynarec problem

To be performant, the dynarec must map native registers to x64 registers and vice-versa.

On ARM64, the equivalent for XMM registers are called NEON registers and are only 128 bits wide; there are 32 of them. To store an entire YMM register, two such registers must be used. However, since a lot of the time those high parts are zero, it would be wasteful to allocate an entire NEON register to store a zero, especially given that those registers may all be reserved already (there are 16 XMM registers, 8 x87/MMX registers, and 8 scratch registers which all use those NEON registers). This means that there would need to be a lot of swapping (saving old registers and loading new registers to/from the emulator structure), which would be slow.

To fix this, every instruction also tracks which YMM registers have their high part cleared to 0; this way, when executing an instruction which clears the high part, we don’t waste a NEON register with a zero.

Of course, this theory is fine, but the implementation was tricky since we want the dynarec to be fast, which includes the creation of the blocks. Thus, we use a two-steps approach: first we compute which YMM registers get cleared, then we propagate this information.

To propagate the information, we need to know which instructions are before the current one. Thankfully, this information is available early enough.

The propagation algorithm

Once we know the three information (which instruction numbers are preceding, which high YMM registers are set to 0 and which are removed), we can compute which YMM registers have their high part as 0, which is stored in the ymm0 variable (actually, ymm0_in and ymm0_out depending on whether we look at before or after we execute the instruction).

One last note before the formulas: in the case where an instruction has multiple predecessors with conflicting ymm0s, we need to “purge” the conflicting ymm0s, since depending on what the previous instruction was, one such YMM register may be non-zero and must therefore be allocated.

We thus have five variables per instruction:

  • ymm0_in (y_i): the set of YMM registers which have their high part set to 0 before executing the instruction
  • ymm0_out (y_o): the same set but after executing the instruction (and before any eventual purge)
  • ymm0_add (y_+): the set of YMM registers with their high part set to 0 by the instruction
  • ymm0_sub (y_-): the reverse of ymm0_add
  • purge_ymm (y_!): the set of conflicting ymm0s.

We also have two more variables which depend on the previous instructions: ymm0_union (y_\cup) the union of all the ymm0_out of the previous instructions, and ymm0_inter (y_\cap) the intersection of those ymm0_out.

Finally, we must have:

y_i &= y_\cap \\
y_! &= y_\cup \setminus y_\cap \\
y_o &= (y_i \cup y_+) \setminus y_-

Also, the first instruction has a virtual previous instruction with an empty ymm0_out (this instruction is the entry point, so the previous instruction may have non-zero high YMMs). This means that for the first instruction, y_\cap = \varnothing.

Since this requires the set of previous instructions, we need to compute the fixpoint. We can do so by ignoring all untreated instruction (with the intersection of the empty set being everything and the union of the empty set being empty), and following jumps to already treated instructions.

Notice that this cannot create an infinite loop, since every time we revisit an instruction the y_i (thus y_o as well) can only decrease and y_! can only increase: after a finite number of iterations, we have reached a stable state for every instruction.

This algorithm is implemented in the updateYmm0s function in dynarec_native.c.

Leave a Reply

Your email address will not be published. Required fields are marked *