
Performances 2022

This article was initially published for the FOSDEM 2022 virtual event. The host is gone and I’m (quickly) republishing it here.


Box86 performances 

Like last year, I will present a series of benchmarks to show the efficiency (or not) of the box dynarec. This year, I benchmark both box86 and box64, to show how box86 evolved over one year of optimisation, and how box64, the new 64-bit counterpart of box86, performs.

To check the efficiency of box86 and box64, I used 4 programs: 7z, dav1d, glmark2, and openarena.

  • 7z is a widely used compression program, available on every distribution, and it comes with an integrated benchmark.
    The benchmark doesn’t seem to use much SSE code and is very CPU-centric, so it can be used to evaluate
    raw CPU power. It is considered a “worst case scenario” for both box86 and box64 (mostly CPU, no SSE, very few
    library function calls).
  • dav1d is an AV1 video decoder. It uses highly optimized SSSE3 or NEON routines. This is the “nightmare
    scenario”: hand-optimized assembly routines and almost no wrapped function calls.
  • glmark2, on the other hand, is used to measure OpenGL performance. Here the CPU is not used much, and it’s mostly
    OpenGL calls all along. This is the ideal scenario for box (mostly native library function calls).
  • openarena is an open source game based on the idTech3 engine. It includes a benchmark mode that is used here. The point of this scenario is to show some “real case usage” of box.

7z benchmark 

Method

The 7z used here is version 16.02, as found in the distribution.
The x86 build comes from Debian, and the native one from the distribution (with the exception of the Pandora, where it was
custom built). I noted that the 32-bit ARM version of 7z has higher Decompression results than the 64-bit ARM64 version, while its Compression results are lower (I suspect asm optimisations, maybe?).

The command 7z b was used; the output looks like this:

7-Zip [32] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,32 bits,4 CPUs LE)

LE
CPU Freq:   792  1486  1498  1495  1498  1492  1496  1493  1496

RAM size:    3776 MB,  # CPU hardware threads:   4
RAM usage:    882 MB,  # Benchmark threads:      4

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:       3639   333   1062   3540  |      92329   396   1989   7877
23:       3539   338   1068   3606  |      90082   394   1976   7794
24:       3585   359   1075   3855  |      87605   395   1945   7691
25:       3506   367   1090   4004  |      84190   394   1901   7493
----------------------------------  | ------------------------------
Avr:             349   1074   3751  |              395   1953   7714
Tot:             372   1513   5733

I report only the last number (5733 here) and compare, for each machine, the native and the emulated program.
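The Speed % columns in the result tables are the emulated score divided by the native one. As a minimal Python sketch of that bookkeeping (the function names are made up for this illustration, and rounding to a whole percent is an assumption about how the tables were filled in), extracting the final number and computing the ratio could look like this:

```python
def seven_zip_total(output: str) -> int:
    """Return the last number on the 'Tot:' line of a `7z b` report."""
    for line in output.splitlines():
        if line.startswith("Tot:"):
            return int(line.split()[-1])
    raise ValueError("no 'Tot:' line found in the 7z output")


def speed_percent(native: float, emulated: float) -> int:
    """Emulated score as a whole percentage of the native score (assumed rounding)."""
    return round(100 * emulated / native)


# Last two lines of the sample report shown above.
sample = """\
Avr:             349   1074   3751  |              395   1953   7714
Tot:             372   1513   5733
"""

native_score = seven_zip_total(sample)
print(native_score, speed_percent(native_score, 2912))
```

With the sample report (5733 native) and an emulated score of 2912, this reproduces the 5733 / 2912 / 51% row of the box86 table.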

Result box86

                                                         2021                          2022
Hardware description                            Native  Emulated  Speed %     Native  Emulated  Speed %
Pandora Cortex-A8 1GHz                             655       282      43%        662       263      40%
Pyra Cortex-A15 1.5GHz x2 (3)                     2510       976      39%       2441      1091      45%
Raspberry Pi 4 1.5GHz x4                          5733      2912      51%
ODroid XU4 2GHz/1.3GHz x4/4 (1)                   3906      2374      61%
ODroid N2 2GHz x4/2                                  ?         ?        ?       8981      3811      42%
Pine64 RockPro64 RK3399 2GHz/1.5GHz x2/4 (2)      6912      3367      49%
FT2000/4 2.6GHz x4 (4)                            8280      4595      55%      10596      4916      46%
D2000/8 2.3GHz x8                                    ?         ?        ?      18001      8462      47%

(1) limiting bench to 4 cores only.
(2) native was aarch64 in 2021.
(3) I suspect the x86 test triggers some thermal throttling.
(4) I used remote access in 2021. I have the physical machine in 2022.

Result box64 

                                                     2022
Hardware description                        Native  Emulated  Speed %
Raspberry Pi 4 1.5GHz x4                         ?         ?       -%
ODroid N2 2GHz x4/2                           8883      3341      38%
Pine64 RockPro64 RK3399 2GHz/1.5GHz x2/4         ?         ?       -%
FT2000/4 2.6GHz x4                            9977      4370      44%
D2000/8 2.3GHz x8                            16799      7572      45%

Conclusion

So, nothing spectacular in the 7zip case: emulation speed is more or less the same as last year. In fact, most of the optimisation work has been around x87 and SSE code. Some optimisation around conditional jumps is still needed (it has barely started), and it shows on this CPU-only benchmark. The picture is the same for 32-bit and 64-bit emulation.

dav1d benchmark

Method

dav1d is an AV1 decoder, heavily optimized for SIMD: it makes use of SSSE3 on x86 (or SSE4 if available) and
NEON on ARM. While MMX, SSE and SSE2 convert fairly well to NEON, that is not the case with SSSE3: the conversion gets more
complex, resulting in 4 or more NEON opcodes emitted per SSSE3 one (for example, pmaddubsw on XMM registers generates 10 NEON
opcodes). Also, because the functions of dav1d are hand-optimized assembler, the comparison between x86 and native will
give lower numbers, as the dynarec converts opcodes 1:1, without reordering or removing any of them. The resulting SSE->NEON
code will be less efficient than hand-optimized NEON routines. The benchmark uses the command
./dav1d -i Chimera-AV1-8bit-480x270-552kbps.ivf --muxer null, and only the resulting FPS is kept.
When testing multiple threads, add --framethreads x --tilethreads x to the command, where both x are the number of
threads. The x86 and x86_64 versions come from a Debian distribution.
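To give an idea of why a single SSSE3 opcode can cost many NEON opcodes, here is a scalar Python model of what pmaddubsw computes on two 16-byte XMM values, following the x86 instruction definition. It only illustrates the x86 semantics: it says nothing about the actual NEON sequence emitted by the dynarec, and the function names are invented for this sketch.

```python
from typing import List


def saturate_i16(value: int) -> int:
    """Clamp to the signed 16-bit range, as the instruction does."""
    return max(-32768, min(32767, value))


def as_signed_byte(byte: int) -> int:
    """Reinterpret a 0..255 byte as a signed 8-bit value."""
    return byte - 256 if byte > 127 else byte


def pmaddubsw(dst: bytes, src: bytes) -> List[int]:
    """Scalar model of PMADDUBSW xmm, xmm: returns the 8 signed 16-bit lanes."""
    if len(dst) != 16 or len(src) != 16:
        raise ValueError("both operands must be 16 bytes (one XMM register)")
    lanes = []
    for i in range(0, 16, 2):
        # Bytes of the first operand are unsigned, bytes of the second are signed.
        low = dst[i] * as_signed_byte(src[i])
        high = dst[i + 1] * as_signed_byte(src[i + 1])
        # Adjacent products are added, then saturated to 16 bits.
        lanes.append(saturate_i16(low + high))
    return lanes


print(pmaddubsw(bytes([1, 2] + [0] * 14), bytes([3, 0xFF] + [0] * 14)))
```

Each output lane mixes an unsigned operand, a signed operand, a pairwise add and a saturation, which is consistent with the conversion needing several NEON opcodes.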

Example of result:

dav1d 0.7.1-91-gba875b9 - by VideoLAN
Decoded 8929/8929 frames (100.0%) - 37.11/1784.02 fps (0.02x)

From this output, the value kept is 37.11.
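Below is a small Python sketch of how the command line and the FPS extraction could be scripted. The helper names are hypothetical; only the flags and the output line quoted above come from this article, and the regular expression assumes the result line always has that shape. Nothing is executed: the first helper only builds the argument list.

```python
import re
from typing import List, Optional


def dav1d_command(ivf_path: str, threads: Optional[int] = None) -> List[str]:
    """Build the argument list for one benchmark run (nothing is run here)."""
    cmd = ["./dav1d", "-i", ivf_path, "--muxer", "null"]
    if threads is not None:
        # The same count is passed to both thread options, as described above.
        cmd += ["--framethreads", str(threads), "--tilethreads", str(threads)]
    return cmd


def decoded_fps(output: str) -> float:
    """Return the first figure of the last 'a/b fps' pair found in the output."""
    matches = re.findall(r"-\s*(\d+(?:\.\d+)?)/\d+(?:\.\d+)?\s*fps", output)
    if not matches:
        raise ValueError("no fps figure found in the dav1d output")
    return float(matches[-1])


sample = "Decoded 8929/8929 frames (100.0%) - 37.11/1784.02 fps (0.02x)"
print(" ".join(dav1d_command("Chimera-AV1-8bit-480x270-552kbps.ivf", threads=2)))
print(decoded_fps(sample))
```

Taking the last match rather than the first is a precaution in case the captured output contains several progress lines.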

Result box86

                                                                      2021                          2022
Hardware description                                         Native  Emulated  Speed %     Native  Emulated  Speed %
Pandora Cortex-A8 1GHz (1)                                    37.11      9.43      25%      37.61      9.33      25%
Pyra Cortex-A15 1.5GHz x2 1 thread (1)                       111.51     34.90      31%     120.00     31.30      26%
Pyra Cortex-A15 1.5GHz x2 2 threads (1)                      141.94     43.97      31%     146.51     45.36      31%
Raspberry Pi 4 1.5GHz x4 1 thread                             90.52     43.18      48%
Raspberry Pi 4 1.5GHz x4 2 threads                           164.53     84.81      52%
Raspberry Pi 4 1.5GHz x4 4 threads                           189.40    103.30      55%
Pine64 RockPro64 RK3399 2GHz/1.5GHz x2/4 1 thread (1)(2)     196.57     55.80      28%
Pine64 RockPro64 RK3399 2GHz/1.5GHz x2/4 2 threads (1)(2)    281.97     58.34      21%
Pine64 RockPro64 RK3399 2GHz/1.5GHz x2/4 4 threads (1)(2)    374.00     93.85      25%
ODroid N2 2GHz x4/2 1 thread (1)(2)                               ?         ?       -%     216.26     57.99      27%
ODroid N2 2GHz x4/2 2 threads (1)(2)                              ?         ?       -%     383.34     92.40      24%
ODroid N2 2GHz x4/2 4 threads (1)(2)                              ?         ?       -%     500.63    138.80      28%
FT2000/4 2.6GHz x4 1 thread (2)                              290.10     75.74      26%     308.61     80.95      26%
FT2000/4 2.6GHz x4 2 threads (2)                             424.93    128.28      30%     562.19    163.44      29%
FT2000/4 2.6GHz x4 4 threads (2)                             462.49    146.87      32%     658.96    196.12      30%
D2000/8 2.3GHz x8 1 thread                                        ?         ?       -%     255.16     73.25      29%
D2000/8 2.3GHz x8 2 threads                                       ?         ?       -%     551.93    165.81      30%
D2000/8 2.3GHz x8 4 threads                                       ?         ?       -%     811.30    260.50      32%
D2000/8 2.3GHz x8 8 threads                                       ?         ?       -%     816.01    265.34      33%

(1) the native build of dav1d is a custom build, probably with more optimisation than the regular version from the distribution.
(2) native is aarch64 here.

Result box64

                                                               2022
Hardware description                                  Native  Emulated  Speed %
Raspberry Pi 4 1.5GHz x4 1 thread                          ?         ?       -%
Raspberry Pi 4 1.5GHz x4 2 threads                         ?         ?       -%
Raspberry Pi 4 1.5GHz x4 4 threads                         ?         ?       -%
Pine64 RockPro64 RK3399 2GHz/1.5GHz x2/4 1 thread          ?         ?       -%
Pine64 RockPro64 RK3399 2GHz/1.5GHz x2/4 2 threads         ?         ?       -%
Pine64 RockPro64 RK3399 2GHz/1.5GHz x2/4 4 threads         ?         ?       -%
ODroid N2 2GHz x4/2 1 thread                          216.26     51.67      24%
ODroid N2 2GHz x4/2 2 threads                         383.34     89.77      24%
ODroid N2 2GHz x4/2 4 threads                         500.63    137.60      27%
FT2000/4 2.6GHz x4 1 thread                           308.61         ?       -%
FT2000/4 2.6GHz x4 2 threads                          562.19         ?       -%
FT2000/4 2.6GHz x4 4 threads                          658.96         ?       -%
D2000/8 2.3GHz x8 1 thread                            290.57     65.49      23%
D2000/8 2.3GHz x8 2 threads                           636.41    152.54      24%
D2000/8 2.3GHz x8 4 threads                           872.73    239.47      27%
D2000/8 2.3GHz x8 8 threads                           935.60    246.42      26%

glmark2 benchmark

Method

glmark2 is a program to test OpenGL speed. A version targeting GLES2 hardware also exists, but because GLES2 is not
supported by box86, only the desktop OpenGL version is used here. Some distributions (like Debian) don’t include
glmark2, so the latest version from the git sources is used here (the 2020.04 version). For platforms that don’t have a full
OpenGL driver, gl4es is used. Note that glmark2 will crash when run on OpenGL 2.1 only (because it uses the
glGenerateMipmap function, which is only available on OpenGL 3.0+), so MESA_GL_VERSION_OVERRIDE=3.2 is used with
Mesa drivers that only expose a 2.1 profile, or LIBGL_GL=30 when using gl4es.
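As a rough Python sketch of the two overrides just described, the environment for a glmark2 run could be prepared as below. The function name and the three stack labels ("full-gl", "mesa-2.1", "gl4es") are invented for this example; only the two variable names and their values come from the text above.

```python
import os
from typing import Dict, Optional


def glmark2_env(gl_stack: str, base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the override needed for this GL stack."""
    env = dict(os.environ) if base is None else dict(base)
    if gl_stack == "mesa-2.1":
        # Mesa driver exposing only a 2.1 profile: advertise 3.2 instead.
        env["MESA_GL_VERSION_OVERRIDE"] = "3.2"
    elif gl_stack == "gl4es":
        # gl4es: ask it to advertise OpenGL 3.0.
        env["LIBGL_GL"] = "30"
    elif gl_stack != "full-gl":
        raise ValueError("unknown GL stack label: " + gl_stack)
    return env


print(glmark2_env("gl4es", base={}))
```

The returned dictionary can then be passed as the env argument of subprocess.run when launching glmark2, natively or through box86/box64.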

The glmark2 score is used as is, at the default windowed 800×600 resolution (cropped to 800×480 on the Pandora).
Because Mesa has evolved since last year, I start fresh and do not compare with the 2021 results.

Result box86

                                                          2022
Hardware description                             Native  Emulated  Speed %
Pandora Cortex-A8 1GHz gl4es                         85        82      96%
Pyra Cortex-A15 1.5GHz x2 gl4es                     284       280      98%
Raspberry Pi 4 V3D 4.2 Mesa                           ?         ?       -%
ODroid XU4 2GHz x4 gl4es (1)                          ?         ?       -%
Pine64 RockPro64 RK3399 2GHz/1.5GHz x2/4 Mesa         ?         ?       -%
ODroid N2 2GHz x4/2 Mesa                            733       720      98%
FT2000/4 2.6GHz x4 Radeon Mesa                     4540      4526     100%

(1) VSync cannot be disabled, lowering results.

Result box64

                                                          2022
Hardware description                             Native  Emulated  Speed %
Raspberry Pi 4 V3D 4.2 Mesa                           ?         ?       -%
Pine64 RockPro64 RK3399 2GHz/1.5GHz x2/4 Mesa         ?         ?       -%
ODroid N2 2GHz x4/2 Mesa                            732       725      99%
FT2000/4 2.6GHz x4 Radeon Mesa                     4770      4626      97%

(1) VSync cannot be disabled, lowering results.

Conclusion

When running an app that mainly relies on external libraries (that are wrapped), near-native speeds can be
achieved, with all results here being between 95% and 100% of native speed.

OpenArena benchmark

This test wasn’t present last year, but I did write an article about it.
The benchmark uses some heavy graphics settings, making it quite GPU-limited on low-end machines, so I used alternate graphics settings there (no bloom, reflection or shadows) to avoid being too GPU-limited, as that would invalidate the bench.

Result box86

                                                          2022
Hardware description                             Native  Emulated  Speed %
Pandora Cortex-A8 1GHz gl4es (1)                   20.1         ?       -%
Pyra Cortex-A15 1.5GHz x2 gl4es (1)                25.9      19.6      76%
Raspberry Pi 4 V3D 4.2 Mesa                           ?         ?       -%
Pine64 RockPro64 RK3399 2GHz/1.5GHz x2/4 Mesa         ?         ?       -%
ODroid N2 2GHz x4/2 Mesa                           40.1      34.3      86%
FT2000/4 2.6GHz x4 Radeon Mesa (2)                 95.1      68.1      72%
D2000/8 2.3GHz x8 Radeon Mesa                      82.0      60.3      74%

(1) using simplified graphics settings (no bloom, reflection or shadows), and fullscreen at the device resolution
(2) at 2560×1440 resolution

Result box64

                                                          2022
Hardware description                             Native  Emulated  Speed %
Raspberry Pi 4 V3D 4.2 Mesa                           ?         ?       -%
Pine64 RockPro64 RK3399 2GHz/1.5GHz x2/4 Mesa         ?         ?       -%
ODroid N2 2GHz x4/2 Mesa                           41.1      34.8      85%
FT2000/4 2.6GHz x4 Radeon Mesa (1)                 98.6      77.4      78%
D2000/8 2.3GHz x8 Radeon Mesa                      83.9      67.9      81%

(1) at 2560×1440 resolution

Conclusion

In a game scenario, when you have reasonably optimized code and a good amount of function calls that can be wrapped, the performance of both box86 and box64 is between 72% and 86% of the native version, which is satisfying!

Box86 speed

So in real world applications and games, things will be in between the two extremes shown above: CPU-bound code (7z, dav1d) and wrapped library calls (glmark2).

Note that SSE opcodes are converted to NEON ones, and most of the time a 1:1 conversion is possible, making
optimized SSE/SSE2 code quite efficient in box86. It’s more difficult with SSE3/SSSE3, and the converted code gets slower.
Wrapped functions run at native speed, while emulated CPU code runs at roughly half the speed it would reach in a native app
(about 40% to 50% in the 7z test above, or even lower if the app or game contains hand-optimized ASM routines). The resulting speed
will be something in between, depending on the ratio of emulated code to wrapped library code…
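One simple way to picture this “in between” is a two-part time model: a fraction of the native run time is spent in wrapped libraries (unchanged under box), and the rest is emulated code that takes longer. This model and its names are an assumption made for illustration, not something measured in this article.

```python
def blended_speed(wrapped_fraction: float, emulated_speed: float) -> float:
    """Overall speed relative to native (1.0 = native) under a two-part time model.

    wrapped_fraction: share of the native run time spent in wrapped libraries.
    emulated_speed:   speed of emulated code relative to native (0.5 = half speed).
    """
    if not 0.0 <= wrapped_fraction <= 1.0:
        raise ValueError("wrapped_fraction must be between 0 and 1")
    if emulated_speed <= 0.0:
        raise ValueError("emulated_speed must be positive")
    # Wrapped time is unchanged; emulated time is stretched by 1 / emulated_speed.
    emulated_time = wrapped_fraction + (1.0 - wrapped_fraction) / emulated_speed
    return 1.0 / emulated_time


for share in (0.0, 0.5, 0.9, 1.0):
    print(f"{share:.0%} wrapped -> {blended_speed(share, 0.5):.0%} of native")
```

With emulated code at half speed, the model goes from 50% of native with no wrapped calls up to 100% when everything is wrapped, which matches the general shape of the 7z and glmark2 results.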
