Development:Debugging Crash

From FEX-Emu Wiki
Revision as of 02:41, 1 December 2023 by Sonicadvance1 (talk | contribs)
Jump to navigation Jump to search

Getting Started

  • Debug an application with `gdb --args FEXInterpreter <application full path>`
  • Under GDB make sure to do `handle SIGBUS SIGILL SIG63 noprint`
    • We use signals for various things, check out Here for more information

Crash in emulated/JIT code

Walking through debugging a simple test application that is crashing.

 $ gdb --args FEXInterpreter ./sigsegv_test
 Reading symbols from FEXInterpreter...
 (gdb) r
 Starting program: /usr/bin/FEXInterpreter ./sigsegv_test
 [Thread debugging using libthread_db enabled]
 Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
 [New Thread 0x7fccb75f30 (LWP 90107)]
 
 Thread 2 "FEXInterpreter" received signal SIGSEGV, Segmentation fault.
 [Switching to Thread 0x7fccb75f30 (LWP 90107)]
 0x0000007fccfb9ec8 in ?? ()
  • Okay, we have a sigsegv. Let's double check that it is JIT code (aka, guest emulated code)
 (gdb) disas $pc,+32
 Dump of assembler code from 0x7fe26ebc20 to 0x7fe26ebc40:
 => 0x0000007fe26ebc20:  stlrb   w22, [x21]
    0x0000007fe26ebc24:  mov     x4, #0x0                        // #0
    0x0000007fe26ebc28:  ldr     x10, [x20]
    0x0000007fe26ebc2c:  add     x21, x20, #0x8
    0x0000007fe26ebc30:  mov     x22, #0x0                       // #0
    0x0000007fe26ebc34:  strb    w22, [x28, #428]
    0x0000007fe26ebc38:  mov     x22, #0x0                       // #0
    0x0000007fe26ebc3c:  strb    w22, [x28, #431]
 End of assembler dump.
 (gdb) info reg x9
 x9             0x0                 0
  • Looks like JIT code, even doing accesses to x28 which is the FEX CPU state
  • Code has no backtrace which reinforces this
  • Code is doing an atomic store, which reinforces this is FEX emulating the x86 TSO memory model
  • Now that we have checked that we are in the JIT code. Where are we in the guest side?
  • Let's dump the FEX CPU state information that is directly pointed to in x28 at all times in JIT code.
 (gdb) p/x ((FEXCore::Core::CpuStateFrame*)$x28)->State
 $3 = {rip = 0x401110, gregs = {0x416eb0, 0x7fe1e3b640, 0xffffffffffffff70, 0x0, 0x7fe1e3bf30, 0x0, 0x416eb0, 0x7fe1e3ae28, 0x0, 0x7fe1e3b640, 0x8, 0x7fe2054cc0, 0x7ff75ff48e, 0x7ff75ff48f, 0x0, 0x7fe163b000}, xmm = {{0x0, 0x0}, {0x0, 0x0}, {0xdeadbeef, 0xbad0dad1} <repeats 14 times>}, es = 0x0, cs = 0x0, ss = 0x0, ds = 0x0, gs = 0x0, fs = 0x7fe1e3b640, flags = {0x0, 0x1, 0x0,
   0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x0 <repeats 38 times>}, mm = {{0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}}, gdt = {{base = 0x0} <repeats 32 times>}, FCW = 0x37f, FTW = 0xffff}
  • Looks like our guest RIP is currently `0x401110`
    • Consult the `info proc mappings` again
 0x401000           0x402000     0x1000     0x1000 {...}/sigsegv_test
  • Yep, we are inside our test application
  • For a simple test, let's load the application in gdb-multiarch and disassemble where we are
 $ gdb-multiarch ./sigsegv_test
 Reading symbols from ./sigsegv_test...
 (gdb) set disassembly-flavor intel
 (gdb) disas 0x401110
 Dump of assembler code for function main(int, char**):
    0x0000000000401110 <+0>:     push   rbp
    0x0000000000401111 <+1>:     mov    rbp,rsp
    0x0000000000401114 <+4>:     mov    DWORD PTR [rbp-0x4],0x0
    0x000000000040111b <+11>:    mov    DWORD PTR [rbp-0x8],edi
    0x000000000040111e <+14>:    mov    QWORD PTR [rbp-0x10],rsi
    0x0000000000401122 <+18>:    mov    rax,QWORD PTR [rbp-0x10]
    0x0000000000401126 <+22>:    movsxd rcx,DWORD PTR [rbp-0x8]
    0x000000000040112a <+26>:    mov    rax,QWORD PTR [rax+rcx*8]
    0x000000000040112e <+30>:    mov    QWORD PTR [rbp-0x18],rax
    0x0000000000401132 <+34>:    mov    rax,QWORD PTR [rbp-0x18]
    0x0000000000401136 <+38>:    mov    BYTE PTR [rax],0x63
    0x0000000000401139 <+41>:    xor    eax,eax
    0x000000000040113b <+43>:    pop    rbp
    0x000000000040113c <+44>:    ret
 End of assembler dump.
 (gdb)
  • Okay, not super helpful since FEX translates instructions in to blocks, `0x401110` is just our starting address
    • It's in this code somewhere, let's change some FEX settings to get a clearer picture
  • Set block size to one instruction and disable multiblock
    • See the image in FEXConfig to the right
FEXConfig 1Inst NoMulti.png
  • Now rerun our test application and find the new RIP
 (gdb) p/x ((FEXCore::Core::CpuStateFrame*)$x28)->State.rip
 $2 = 0x401136
  • Alright, now we know the RIP is exactly at `0x401136`
  • Back in gdb-multiarch
 (gdb) disas 0x401136,+1
 Dump of assembler code from 0x401136 to 0x401137:
    0x0000000000401136 <main(int, char**)+38>:   mov    BYTE PTR [rax],0x63
  • Looks like something in main is storing 0x63 to a nullptr
  • In this simple case we can now take a look at the test application's source and find the problem.
    • We know the problem is in the first block of main()
    • We know the exact instruction that it is at
    • We know it's something storing a byte to memory
  • For more complex cases it is likely necessary to use reverse engineering tools
    • BinaryNinja, Ghidra, IDA, and Hopper are all examples of tools like this

What to do from here

Now it becomes a lot harder. You don't get a typical debugging environment or even clean backtraces.

FEX's gdbserver integration is sorely lacking so you can't even use a remote gdb server connecting to FEX right now.

If you enable thunks you can get better backtraces here. Debugging_Crash_In_Thunks

Attempting to use FEX-Emu's gdbserver implementation

Here be dragons

FEX supports gdbserver as an integration. It's implementation is significantly limited but can still be used for debugging and getting some backtraces.

  • Currently hardcodes the port to use as `8086` and if you have multiple gdbserver processes running then it will encounter problems.
  • Currently does not follow processes through fork/execve at all. No multiprocess support
    • This means you must only start the process you're caring about debugging
  • Currently starts the process paused and will wait until gdb attach before continuing
    • No way to start a FEX instance then attach at some later point
  • Ctrl-C to stop the FEX process needs to be done twice
    • Maybe with a small delay inbetween because gdb needs to fetch a bunch of data on pause
    • Known bug, unknown why broken at the moment
 FEXLoader -G -- <Application> <Args...>

Double checking if we are in JIT code

 (gdb) info reg pc
 pc             0x7fccfb9ec8        0x7fccfb9ec8
 
 (gdb) info proc mappings
 ...
 0x7fccfb9000       0x7fcdfb9000  0x1000000        0x0
  • Looks like FEX JIT mapping, we start out at 16MB but scale up to 128MB
  • Depending on version of FEX we can check the base mapping for a unique string
 (gdb) p (char*)0x7fccfb9000
 $4 = 0x7fccfb9000 "FEXJIT::Arm64JITCore::"

Getting RIP of current code block

FEX sets up an address in our CPU context to get some debug data out.

Currently this isn't exposed in a way that a debugger can see other than manually typing out gdb commands

 p/x *(uint64_t*)($x28+184) = inline block header ptr
 p/x *(uint32_t*)(*(uint64_t*)($x28+184)) = OffsetToBlockTail
 p/x *(uint64_t*)($x28+184) + *(uint32_t*)(*(uint64_t*)($x28+184))
 p/x *(unsigned long long*)((*(unsigned long long*)($x28+184) + *(unsigned int*)(*(unsigned long long*)($x28+184)))+8) = RIP of block
 disas *(unsigned long long*)($x28+184),+*(unsigned long long*)(*(unsigned long long*)($x28+184) + *(unsigned int*)(*(unsigned long long*)($x28+184))) = Disassemble this block of code.


While hard to decipher here is basically what is happening. - x28 is the CPU register that FEX keeps in the JIT for context accesses - offset 184 is the offset of the `InlineJITBlockHeader` member inside of that context. - As long as FEX is in a JIT block that offset will be valid to point to the current RIP that the block is operating on.

Getting the stack can also be very useful

 x/64wx ((FEXCore::Core::CPUState*)$x28)->gregs[FEXCore::X86State::REG_RSP]

Doing raw pointer math here means that this works even when gdb fails to find symbols for the CPUState object, which happens very frequently for some reason.