Since the last post about it, the dynarec (dynamic recompiler, a Just-In-Time recompilation of x86 code) changed a lot. It still works in four major steps, but now there are a lot more intermediary steps. So let’s see how it works now!
Note: this article will focus on the ARM version of the dynarec on box64, which is more complex and has full support for AVX2. Other architectures have some support for AVX2, and some parts of the code is architecture-independent, but the architecture-dependent parts are still only partially done.
This article will contain function names and other references to the actual implementation. Some understanding of C code is therefore required to understand this article.
Step -1: the entry point
Alternative name: the calling site
What is the dynarec? The dynarec is what makes box64
run very fast on platforms on which it is supported. To do so, when box64
encounters previously unknown code, instead of simply emulating the code one instruction at a time, it first builds a native instruction block (a dynablock
) which does equivalent work, then runs that instead.
To do so, there are two steps: first, building the dynablock
, second, running the dynablock
.
The second step is covered by the native_prolog
assembly function, which is per-architecture (it actually calls arm64_prolog
, la64_prolog
or rv64_prolog
depending on the architecture). The call can be found in the DynaRun
function, which is called during normal operation.
The arguments of this function, however, are the emulator and the block. This block is requested by the DBGetBlock
function, which stands for “DynaBlock get block”. This calls the internalDBGetBlock
function, which tries to find a corresponding and valid block, but falls back to the FillBlock64
function if there is none. This is the function which actually builds a block.
This article will from now on focus on the content of this function.
Step 0: the first step
After some initial setup of a “helper” block, the function calls native_pass0
, which is actually defined as native_pass
. This is in fact true for all four major passes: there are a lot of functions which are compiled four times, each time with different macro definitions.
What this function does is iterate over every x86 instruction, sometimes propagate some information (for example, some data is reset across barriers), then uses the per-opcode decoder using the dynarec64_00
function (also a per-pass function), and finally gathers some information (for example, whether the next instruction should be a barrier), before stepping to the next instruction.
Pass 0 is the pass during which box64
counts the number of x64 instructions, sets up the dynablock structure associated with each one, does a first count of the number of native instructions (which will change on future passes, since this number depends on other information which didn’t get propagated yet), and remembers jumps to other instructions, flags set and used, and barriers. During this pass, the set of high-YMM registers that become zero and become non-zero is also computed; see also the article on optimizing AVX.
After each instruction on pass 0, there is also a block of code deciding whether the block should be stopped at the current instruction, extended since there is a jump to code near but past the end, or simply continue.
Pass 0.1: jumps
After some checks, barriers (which can be interpreted as “this is too complex, forget optimizations”) are added at the destination of some jumps.
Pass 0.2: predecessors
The total number of predecessors is computed (sizePredecessors
), then an array is filled with those predecessors (fillPredecessors
). From this point on, we can know which instruction can come before each instruction.
Pass 0.3: flags
The next step is to propagate flags (updateNeed
).
During pass 0, box64
marked which instructions generated flags and which instructions required flags for itself.
This step iterates over all instructions backward, and sets the required flags that need to be set for the following instructions to work. box64
therefore implements Kildall’s algorithm for this step.
Pass 0.4: cleanup
Some instructions are never executed (they have no predecessor or all predecessors are never executed themselves). This step removes the trailing non-executed instructions and resets metadata for those instructions.
Pass 0.5: YMM0
This step is the step computing the ymm0_in
, purge_ymm
and ymm0_out
(updateYmm0s
). For more details on what these are, see the article on optimizing AVX.
Pass 1: float optimizations
As for pass 0, this pass is implemented in native_pass1
which is implemented by the native_pass
function. This pass focuses on NEON register allocations: the final values of the various floating-point and SIMD register caches are computed. From this point on, we know the instructions to generate, which immediately leads to:
Pass 2: instruction counting
This pass (implemented in native_pass2
) counts the number of native instructions the dynarec will generate. It also computes the address for all of the x64 instructions.
Pass 2.5: allocation
This pass allocates the memory where the instructions will live in. It uses the AllocDynarecMap
function, which uses memalign
and mprotect
.
Pass 3: generation
This final pass, implemented in native_pass3
, fills the table allocated in pass 2.5 with the native instructions and other values stored in a table. It may also print the generated instructions, depending on the log level (the value of BOX64_DYNAREC_DUMP
).
Pass 3.5: return
Once the pass 3 is done, the block is generated; a new block is generated from the “helper” block, then a few cleanups are performed, allowing for the generation of a new block. The “helper” block is then reused for the next dynablock generation.
Summary
- Entry point:
FillBlock64
- Native pass 0:
native_pass0
- For every instruction:
- Instruction decoding:
dynarec64_000
- Check for end-of-block or should-extend-block
- Instruction decoding:
- If the instruction leaves the dynarec world, add a jump to the epilogue
jump_to_epilog
- For every instruction:
- Compute jump destinations and add some barriers
- Compute the predecessors:
sizePredecessors
,fillPredecessors
- Propagate the flags to compute:
updateNeed
- Clean up dead code
- Propagate
ymm0
s:updateYmm0s
- Native pass 1:
native_pass1
- Compute final register allocations
- Native pass 2:
native_pass2
- Compute the number of native instructions
- Compute the address of all x64 instructions
- Allocate executable memory for instructions:
AllocDynarecMap
- Native pass 3:
native_pass3
- Compute final register allocations
- Copy into a fresh
dynablock_t
, reset the helperdynablock_t
- Return from
FillBlock64
- Native pass 0: