Revisiting the dynarec

Since the last post about it, the dynarec (dynamic recompiler, a Just-In-Time recompilation of x86 code) changed a lot. It still works in four major steps, but now there are a lot more intermediary steps. So let’s see how it works now!

Note: this article will focus on the ARM version of the dynarec on box64, which is more complex and has full support for AVX2. Other architectures have some support for AVX2, and some parts of the code is architecture-independent, but the architecture-dependent parts are still only partially done.

This article will contain function names and other references to the actual implementation. Some understanding of C code is therefore required to understand this article.

Step -1: the entry point

Alternative name: the calling site

What is the dynarec? The dynarec is what makes box64 run very fast on platforms on which it is supported. To do so, when box64 encounters previously unknown code, instead of simply emulating the code one instruction at a time, it first builds a native instruction block (a dynablock) which does equivalent work, then runs that instead.

To do so, there are two steps: first, building the dynablock, second, running the dynablock.

The second step is covered by the native_prolog assembly function, which is per-architecture (it actually calls arm64_prolog, la64_prolog or rv64_prolog depending on the architecture). The call can be found in the DynaRun function, which is called during normal operation.

The arguments of this function, however, are the emulator and the block. This block is requested by the DBGetBlock function, which stands for “DynaBlock get block”. This calls the internalDBGetBlock function, which tries to find a corresponding and valid block, but falls back to the FillBlock64 function if there is none. This is the function which actually builds a block.

This article will from now on focus on the content of this function.

Step 0: the first step

After some initial setup of a “helper” block, the function calls native_pass0, which is actually defined as native_pass. This is in fact true for all four major passes: there are a lot of functions which are compiled four times, each time with different macro definitions.

What this function does is iterate over every x86 instruction, sometimes propagate some information (for example, some data is reset across barriers), then uses the per-opcode decoder using the dynarec64_00 function (also a per-pass function), and finally gathers some information (for example, whether the next instruction should be a barrier), before stepping to the next instruction.

Pass 0 is the pass during which box64 counts the number of x64 instructions, sets up the dynablock structure associated with each one, does a first count of the number of native instructions (which will change on future passes, since this number depends on other information which didn’t get propagated yet), and remembers jumps to other instructions, flags set and used, and barriers. During this pass, the set of high-YMM registers that become zero and become non-zero is also computed; see also the article on optimizing AVX.

After each instruction on pass 0, there is also a block of code deciding whether the block should be stopped at the current instruction, extended since there is a jump to code near but past the end, or simply continue.

Pass 0.1: jumps

After some checks, barriers (which can be interpreted as “this is too complex, forget optimizations”) are added at the destination of some jumps.

Pass 0.2: predecessors

The total number of predecessors is computed (sizePredecessors), then an array is filled with those predecessors (fillPredecessors). From this point on, we can know which instruction can come before each instruction.

Pass 0.3: flags

The next step is to propagate flags (updateNeed).

During pass 0, box64 marked which instructions generated flags and which instructions required flags for itself.

This step iterates over all instructions backward, and sets the required flags that need to be set for the following instructions to work. box64 therefore implements Kildall’s algorithm for this step.

Pass 0.4: cleanup

Some instructions are never executed (they have no predecessor or all predecessors are never executed themselves). This step removes the trailing non-executed instructions and resets metadata for those instructions.

Pass 0.5: `YMM0`

This step is the step computing the ymm0_in, purge_ymm and ymm0_out (updateYmm0s). For more details on what these are, see the article on optimizing AVX.

Pass 1: float optimizations

As for pass 0, this pass is implemented in native_pass1 which is implemented by the native_pass function. This pass focuses on NEON register allocations: the final values of the various floating-point and SIMD register caches are computed. From this point on, we know the instructions to generate, which immediately leads to:

Pass 2: instruction counting

This pass (implemented in native_pass2) counts the number of native instructions the dynarec will generate. It also computes the address for all of the x64 instructions.

Pass 2.5: allocation

This pass allocates the memory where the instructions will live in. It uses the AllocDynarecMap function, which uses memalign and mprotect.

Pass 3: generation

This final pass, implemented in native_pass3, fills the table allocated in pass 2.5 with the native instructions and other values stored in a table. It may also print the generated instructions, depending on the log level (the value of BOX64_DYNAREC_DUMP).

Pass 3.5: return

Once the pass 3 is done, the block is generated; a new block is generated from the “helper” block, then a few cleanups are performed, allowing for the generation of a new block. The “helper” block is then reused for the next dynablock generation.

Summary

Entry point: FillBlock64
- Native pass 0: native_pass0
  - For every instruction:
    - Instruction decoding: dynarec64_000
    - Check for end-of-block or should-extend-block
  - If the instruction leaves the dynarec world, add a jump to the epilogue jump_to_epilog
- Compute jump destinations and add some barriers
- Compute the predecessors: sizePredecessors, fillPredecessors
- Propagate the flags to compute: updateNeed
- Clean up dead code
- Propagate ymm0s: updateYmm0s
- Native pass 1: native_pass1
  - Compute final register allocations
- Native pass 2: native_pass2
  - Compute the number of native instructions
  - Compute the address of all x64 instructions
- Allocate executable memory for instructions: AllocDynarecMap
- Native pass 3: native_pass3
  - Compute final register allocations
- Copy into a fresh dynablock_t, reset the helper dynablock_t
- Return from FillBlock64