Development:InstCountCI

InstCountCI is a continuous integration tool that FEX-Emu uses to ensure that instruction implementations aren't getting worse over time.

Getting Started

Make sure to follow Development:Setting_up_FEX to get an initial build environment set up.

What you need

An Arm64 Linux device that can build FEX
- An x86-64 device using the VIXL simulator can be used as a substitute.

Additional cmake options

Pass these additional options when configuring FEX-Emu with cmake so that the tests are built.

-DBUILD_TESTS=True
-DENABLE_VIXL_DISASSEMBLER=True

Quality of life improvements

Add these cmake options to shorten iteration time and to enable debug assertions that catch problems early.

-DENABLE_LTO=False
-DCMAKE_BUILD_TYPE=RelWithDebInfo
-DENABLE_ASSERTIONS=True

Running InstCountCI

First, build the tests. This step parses all the json files inside unittests/InstructionCountCI/ and prepares them for the CI run in the next step.

ninja instcountci_test_files

Next, run the tests. This runs every instruction declared in unittests/InstructionCountCI/*.json. It is okay if this step fails: a failure means either that an instruction translation has gotten worse or, if the test crashed, that something catastrophic happened.

ninja instcountci_tests

Then take the data generated by the previous step and update the json files that are tracked by git.

ninja instcountci_update_tests

To see how the implementations have changed, run git diff and inspect the changes to the json files in unittests/InstructionCountCI/.
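The three ninja steps above can be chained in a small helper. This is only a sketch: the function name, the build directory argument and the injectable run parameter are assumptions made for illustration and are not part of FEX. It treats a failure of instcountci_tests as acceptable, as described above, and stops only if one of the other two targets fails.

```python
import subprocess

# Targets named in the steps above, in the order they are run.
# The boolean says whether a failure should stop the sequence.
STEPS = [
    ("instcountci_test_files", True),    # nothing to run if this fails
    ("instcountci_tests", False),        # allowed to fail, see text above
    ("instcountci_update_tests", True),  # rewrites the tracked json files
]


def run_instcountci(build_dir, run=subprocess.run):
    """Run the three ninja targets in build_dir; return the targets that failed."""
    failed = []
    for target, required in STEPS:
        # "ninja -C <dir>" changes into <dir> before building the target.
        result = run(["ninja", "-C", build_dir, target])
        if result.returncode != 0:
            failed.append(target)
            if required:
                break
    return failed
```

Calling run_instcountci("build") (with "build" standing in for your own build directory) and then running git diff gives the same result as the manual steps.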

Example

A minor improvement from optimizing a move instruction, taken from this pull request.

      "movd mm0, eax": {
 -      "ExpectedInstructionCount": 3,
 +      "ExpectedInstructionCount": 2,
        "Comment": "0x0f 0x6e",
        "ExpectedArm64ASM": [
 -        "ubfx x20, x4, #0, #32",
 -        "fmov s4, w20",
 +        "fmov s4, w4",
          "str d4, [x28, #752]"
        ]
      },

Reset the files with git checkout -- unittests/InstructionCountCI/*.json if the changes weren't desired.
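When many instructions change at once, the raw diff can be long. The sketch below summarizes which entries changed their ExpectedInstructionCount. It assumes you already have two mappings from instruction text to an entry shaped like the one in the example above; where that mapping sits inside the real json files is not shown here, so loading it is left out. The function name is hypothetical.

```python
def count_changes(old, new):
    """Return (instruction, old_count, new_count) for entries whose count differs.

    Both arguments map instruction text to an entry dict that holds
    "ExpectedInstructionCount". Results are ordered with the largest
    improvement (most instructions removed) first.
    """
    changes = []
    for name, entry in new.items():
        before = old.get(name, {}).get("ExpectedInstructionCount")
        after = entry.get("ExpectedInstructionCount")
        if before is not None and after is not None and before != after:
            changes.append((name, before, after))
    return sorted(changes, key=lambda change: change[2] - change[1])


# The entry from the example above, before and after the optimization.
OLD = {"movd mm0, eax": {"ExpectedInstructionCount": 3, "Comment": "0x0f 0x6e"}}
NEW = {"movd mm0, eax": {"ExpectedInstructionCount": 2, "Comment": "0x0f 0x6e"}}

for name, before, after in count_changes(OLD, NEW):
    print(f"{name}: {before} -> {after}")
```

The old version of a tracked file can be read with git show HEAD:<path> and parsed with Python's json module before being passed in.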

What classifies as an optimal translation?

Nothing automatically classifies an instruction implementation as optimal. It is left to a human reviewer to judge whether the translation is optimal for that particular instruction.

In general the Optimal tag is just a guideline. An instruction tagged as optimal might still be improvable for several reasons:
- ARM could introduce a new set of instructions that allows a better translation.
- The reviewer may have ignored something like flag generation around the instruction, since that is a systemic FEX-Emu issue.
- The reviewer may simply have made a mistake, and someone later finds a way to do it better.

The tag mainly helps humans during code audits, so they can pay less attention to implementations that are already classified as optimal. It's still good to periodically go over these implementations and see if things could be done better.
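As a rough illustration of using the tag during an audit, the sketch below lists the entries that are not marked optimal, largest instruction count first. It assumes each entry stores the tag under a key named "Optimal" with the value "Yes" when a reviewer considered it optimal; that key name and value are assumptions, since the example entry earlier on this page does not show them. The function name and the sample entries are made up for illustration.

```python
def audit_candidates(instructions, optimal_key="Optimal"):
    """Names of entries not marked optimal, largest instruction count first.

    optimal_key and the "Yes" value are assumptions about how the tag is
    stored; adjust them to match the real json files.
    """
    pending = [
        (entry.get("ExpectedInstructionCount", 0), name)
        for name, entry in instructions.items()
        if entry.get(optimal_key) != "Yes"
    ]
    # Sort by descending count, then by name so the order is stable.
    pending.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in pending]


# Made-up entries, only to show the shape of the input.
SAMPLE = {
    "example insn A": {"ExpectedInstructionCount": 2, "Optimal": "Yes"},
    "example insn B": {"ExpectedInstructionCount": 5, "Optimal": "No"},
    "example insn C": {"ExpectedInstructionCount": 3},
}

print(audit_candidates(SAMPLE))
```

Entries marked optimal are skipped here, but as noted above they still deserve a periodic second look.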

Diving deeper into the assembly

Manually run instcountci result

A useful first step might be to run the json tests directly in the code size validation program. This can be done from the build directory.

For example:

 ./Bin/CodeSizeValidation unittests/InstructionCountCI/FEXOpt/libnss.json.instcountci

Run the test through the TestHarnessRunner

While the instruction count CI is good at showing the final result, it isn't the best at showing what FEX did to get to that result. This is where the assembly test harness can come in handy.

Create the file unittests/ASM/Test.asm with the following contents:

 %ifdef CONFIG
 {
 }
 %endif
 addps xmm0, xmm1
 hlt

Recompile the asm tests with `ninja asm_files`
Run the assembly test manually now with FEX_DUMPIR=stderr FEX_DISASSEMBLE=blocks ./Bin/TestHarnessRunner -c irjit -n 1 -g ./unittests/ASM/Test.asm.bin ./unittests/ASM/Test.asm.config.bin
- This will dump both FEX's internal IR and the disassembly of the code for each instruction
- The second code block for the hlt can be ignored. It is just necessary for this test harness to run.

The resulting output will be:

 IR-post 0x10000:
       (%0) IRHeader %2, #65536, #0, #1
       (%2) CodeBlock %3, %10
               (%3 i0) BeginBlock %2(Invalid)
               %4(FPRFixed1) i128 = LoadRegister #0x0, #0xd0, FPR, FPRFixed, u8:Tmp:Size
               %5(FPRFixed0) i128 = LoadRegister #0x0, #0xc0, FPR, FPRFixed, u8:Tmp:Size
               %6(FPRFixed0) i32v4 = VFAdd u8:Tmp:RegisterSize, u8:Tmp:ElementSize, %5(FPRFixed0) i128, %4(FPRFixed1) i128
               (%7 i128) StoreRegister %6(FPRFixed0) i32v4, #0x0, #0xc0, FPR, FPRFixed, u8:Tmp:Size
               (%8 i64) InlineEntrypointOffset #0x3, u8:Tmp:RegisterSize
               (%9 i64) ExitFunction %8(Invalid)
               (%10 i0) EndBlock %2(Invalid)
 @@@@@
 [INFO] Disassemble Begin
 [INFO] adr x0, #-0x4 (addr 0xffff6fa00018)
 [INFO] str x0, [x28, #184]
 [INFO] fadd v16.4s, v16.4s, v17.4s
 [INFO] ldr x0, pc+8 (addr 0xffff6fa00030)
 [INFO] blr x0
 [INFO] unallocated (Unallocated)
 [INFO] udf #0xffff
 [INFO] unallocated (Unallocated)
 [INFO] udf #0x0
 [INFO] Disassemble End

The disassembly has some instructions at the start and end which are necessary for the JIT to run.
- InstCountCI strips this code out automatically.
When a single instruction is tested in isolation, the code block header and tail can dominate the code size.
- It's recommended to become familiar with what the header and tail look like and to ignore them in the resulting code generation.
- Currently the header is the first two instructions: adr+str.
- Currently the tail starts with the ldr+blr after the fadd and continues with some metadata afterwards.
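To make the header and tail easier to ignore, a captured dump can be filtered with a short script. This is a heuristic sketch based only on the layout described above (adr+str header, tail starting at ldr followed by blr); it is not how InstCountCI itself strips the code, the function name is hypothetical, and the layout may change in later FEX versions.

```python
def body_instructions(dump):
    """Return the instructions between the block header and tail of one dump."""
    lines = [line.strip() for line in dump.splitlines()]
    start = lines.index("[INFO] Disassemble Begin") + 1
    end = lines.index("[INFO] Disassemble End")
    prefix = "[INFO] "
    insts = [
        line[len(prefix):] if line.startswith(prefix) else line
        for line in lines[start:end]
    ]
    # Header: currently the first two instructions, adr followed by str.
    if len(insts) >= 2 and insts[0].startswith("adr ") and insts[1].startswith("str "):
        insts = insts[2:]
    # Tail: starts at the first ldr that is directly followed by blr.
    for i in range(len(insts) - 1):
        if insts[i].startswith("ldr ") and insts[i + 1].startswith("blr "):
            return insts[:i]
    return insts


# The disassembly section of the output shown above.
SAMPLE_DUMP = """
 [INFO] Disassemble Begin
 [INFO] adr x0, #-0x4 (addr 0xffff6fa00018)
 [INFO] str x0, [x28, #184]
 [INFO] fadd v16.4s, v16.4s, v17.4s
 [INFO] ldr x0, pc+8 (addr 0xffff6fa00030)
 [INFO] blr x0
 [INFO] unallocated (Unallocated)
 [INFO] udf #0xffff
 [INFO] unallocated (Unallocated)
 [INFO] udf #0x0
 [INFO] Disassemble End
"""

print(body_instructions(SAMPLE_DUMP))
```

For the addps test this leaves only the fadd. A block whose own code contains an ldr directly followed by a blr would be cut short, so treat the result as a reading aid rather than an exact count.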

Diving Deeper

It is worth looking at the man page for FEX to see additional options that can be useful.

In particular, FEX_PASSMANAGERDUMPIR provides more IR dumping options and FEX_HOSTFEATURES fakes CPU feature support.
Enabling the VIXL simulator with the cmake option -DENABLE_VIXL_SIMULATOR=True can be useful to test features your CPU doesn't support!