Development:Debugging Crash

Latest revision as of 02:41, 1 December 2023

Getting Started

Debug an application with `gdb --args FEXInterpreter <application full path>`
Under GDB, make sure to run `handle SIGBUS SIGILL SIG63 noprint`
- FEX uses these signals internally for various things, check out Here for more information

Crash in emulated/JIT code

Walking through debugging a simple test application that is crashing.

 $ gdb --args FEXInterpreter ./sigsegv_test
 Reading symbols from FEXInterpreter...
 (gdb) r
 Starting program: /usr/bin/FEXInterpreter ./sigsegv_test
 [Thread debugging using libthread_db enabled]
 Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
 [New Thread 0x7fccb75f30 (LWP 90107)]
 
 Thread 2 "FEXInterpreter" received signal SIGSEGV, Segmentation fault.
 [Switching to Thread 0x7fccb75f30 (LWP 90107)]
 0x0000007fccfb9ec8 in ?? ()

Okay, we have a SIGSEGV. Let's double-check that it is in JIT code (i.e. emulated guest code)

 (gdb) disas $pc,+32
 Dump of assembler code from 0x7fe26ebc20 to 0x7fe26ebc40:
 => 0x0000007fe26ebc20:  stlrb   w22, [x21]
    0x0000007fe26ebc24:  mov     x4, #0x0                        // #0
    0x0000007fe26ebc28:  ldr     x10, [x20]
    0x0000007fe26ebc2c:  add     x21, x20, #0x8
    0x0000007fe26ebc30:  mov     x22, #0x0                       // #0
    0x0000007fe26ebc34:  strb    w22, [x28, #428]
    0x0000007fe26ebc38:  mov     x22, #0x0                       // #0
    0x0000007fe26ebc3c:  strb    w22, [x28, #431]
 End of assembler dump.
 (gdb) info reg x9
 x9             0x0                 0

Looks like JIT code:
- It accesses memory through x28, which points to the FEX CPU state
- There is no backtrace, which reinforces this
- It is doing an atomic store (`stlrb`), which reinforces that this is FEX emulating the x86 TSO memory model
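
As a rough illustration of why an atomic store hints at TSO emulation: x86 orders stores more strongly than plain AArch64 stores do, so guest writes have to be emitted as ordered stores. The C++ sketch below is not FEX code, and `PublishByte` is a hypothetical name. On AArch64, compilers typically lower a one-byte release store like this to `stlrb`, the same instruction seen at the faulting PC.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper, not part of FEX: stores one byte with release ordering.
// Assumption: on AArch64, GCC and Clang typically emit `stlrb` for this store.
void PublishByte(std::atomic<uint8_t>& slot, uint8_t value) {
  slot.store(value, std::memory_order_release);
}
```

Compiling this for AArch64 and inspecting the output is a quick way to get familiar with what ordered stores look like in a disassembly.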

Now that we have checked that we are in JIT code, where are we on the guest side?
Let's dump the FEX CPU state, which x28 points to directly at all times in JIT code.

 (gdb) p/x ((FEXCore::Core::CpuStateFrame*)$x28)->State
 $3 = {rip = 0x401110, gregs = {0x416eb0, 0x7fe1e3b640, 0xffffffffffffff70, 0x0, 0x7fe1e3bf30, 0x0, 0x416eb0, 0x7fe1e3ae28, 0x0, 0x7fe1e3b640, 0x8, 0x7fe2054cc0, 0x7ff75ff48e, 0x7ff75ff48f, 0x0, 0x7fe163b000}, xmm = {{0x0, 0x0}, {0x0, 0x0}, {0xdeadbeef, 0xbad0dad1} <repeats 14 times>}, es = 0x0, cs = 0x0, ss = 0x0, ds = 0x0, gs = 0x0, fs = 0x7fe1e3b640, flags = {0x0, 0x1, 0x0,
   0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x0 <repeats 38 times>}, mm = {{0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}}, gdt = {{base = 0x0} <repeats 32 times>}, FCW = 0x37f, FTW = 0xffff}

Looks like our guest RIP is currently `0x401110`
- Consult `info proc mappings` again to see which mapping contains it

 0x401000           0x402000     0x1000     0x1000 {...}/sigsegv_test

Yep, we are inside our test application
For a simple test, let's load the application in gdb-multiarch and disassemble where we are

 $ gdb-multiarch ./sigsegv_test
 Reading symbols from ./sigsegv_test...
 (gdb) set disassembly-flavor intel
 (gdb) disas 0x401110
 Dump of assembler code for function main(int, char**):
    0x0000000000401110 <+0>:     push   rbp
    0x0000000000401111 <+1>:     mov    rbp,rsp
    0x0000000000401114 <+4>:     mov    DWORD PTR [rbp-0x4],0x0
    0x000000000040111b <+11>:    mov    DWORD PTR [rbp-0x8],edi
    0x000000000040111e <+14>:    mov    QWORD PTR [rbp-0x10],rsi
    0x0000000000401122 <+18>:    mov    rax,QWORD PTR [rbp-0x10]
    0x0000000000401126 <+22>:    movsxd rcx,DWORD PTR [rbp-0x8]
    0x000000000040112a <+26>:    mov    rax,QWORD PTR [rax+rcx*8]
    0x000000000040112e <+30>:    mov    QWORD PTR [rbp-0x18],rax
    0x0000000000401132 <+34>:    mov    rax,QWORD PTR [rbp-0x18]
    0x0000000000401136 <+38>:    mov    BYTE PTR [rax],0x63
    0x0000000000401139 <+41>:    xor    eax,eax
    0x000000000040113b <+43>:    pop    rbp
    0x000000000040113c <+44>:    ret
 End of assembler dump.
 (gdb)

Okay, not super helpful: FEX translates instructions into blocks, and `0x401110` is just the starting address of our block
- The faulting instruction is somewhere in this code, so let's change some FEX settings to get a clearer picture
Set block size to one instruction and disable multiblock
- See the image in FEXConfig to the right

Now rerun our test application and find the new RIP

 (gdb) p/x ((FEXCore::Core::CpuStateFrame*)$x28)->State.rip
 $2 = 0x401136

Alright, now we know the RIP is exactly at `0x401136`
Back in gdb-multiarch

 (gdb) disas 0x401136,+1
 Dump of assembler code from 0x401136 to 0x401137:
    0x0000000000401136 <main(int, char**)+38>:   mov    BYTE PTR [rax],0x63

Looks like something in main is storing 0x63 to a nullptr
In this simple case we can now take a look at the test application's source and find the problem.
- We know the problem is in the first block of main()
- We know the exact instruction that faults
- We know it's something storing a byte to memory
For more complex cases it is likely necessary to use reverse engineering tools
- Binary Ninja, Ghidra, IDA, and Hopper are all examples of tools like this
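
Reading the disassembly back into source gives a plausible picture of the bug. The following C++ is a hypothetical reconstruction, not the actual source of `sigsegv_test` (which this page does not show). `PickTarget` and `StoreMarker` are made-up names that split `main` into the two steps visible in the disassembly, so each can be exercised without crashing.

```cpp
// Hypothetical reconstruction of the faulting logic in sigsegv_test.
// The real source is not shown on this page; names here are illustrative.

// Mirrors <+18>..<+30>: load argv[argc] into a local pointer.
// argv[argc] is guaranteed to be a null pointer.
char* PickTarget(int argc, char** argv) {
  return argv[argc];
}

// Mirrors <+34>..<+38>: store the byte 0x63 ('c') through the pointer.
void StoreMarker(char* target) {
  *target = 0x63;
}

// A main() matching the disassembly would then look roughly like:
//   int main(int argc, char** argv) {
//     char* target = PickTarget(argc, argv);
//     StoreMarker(target);  // faults: target is nullptr
//     return 0;
//   }
```

Under this reading, the store at `0x401136` faults because `rax` holds `argv[argc]`, which is always null.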

What to do from here

Now it becomes a lot harder. You don't get a typical debugging environment or even clean backtraces.

FEX's gdbserver integration is sorely lacking so you can't even use a remote gdb server connecting to FEX right now.

If you enable thunks you can get better backtraces here; see Debugging_Crash_In_Thunks

Attempting to use FEX-Emu's gdbserver implementation

Here be dragons

FEX supports gdbserver as an integration. Its implementation is significantly limited but can still be used for debugging and getting some backtraces.

It currently hardcodes the port to `8086`, so it will encounter problems if you have multiple gdbserver processes running.
It currently does not follow processes through fork/execve at all. No multiprocess support
- This means you must start only the process you care about debugging
It currently starts the process paused and waits until gdb attaches before continuing
- There is no way to start a FEX instance and then attach at some later point
Ctrl-C to stop the FEX process needs to be done twice
- Maybe with a small delay in between, because gdb needs to fetch a bunch of data on pause
- Known bug; the cause is unknown at the moment

 FEXLoader -G -- <Application> <Args...>

Double checking if we are in JIT code

 (gdb) info reg pc
 pc             0x7fccfb9ec8        0x7fccfb9ec8
 
 (gdb) info proc mappings
 ...
 0x7fccfb9000       0x7fcdfb9000  0x1000000        0x0

Looks like a FEX JIT mapping; we start out at 16MB but scale up to 128MB
Depending on the version of FEX, we can check the base of the mapping for a unique string

 (gdb) p (char*)0x7fccfb9000
 $4 = 0x7fccfb9000 "FEXJIT::Arm64JITCore::"
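
The mapping check above is plain range arithmetic, sketched here in C++ with the addresses from this walkthrough. `Mapping`, `Contains`, and `SizeMiB` are hypothetical names for illustration, not anything provided by FEX or gdb.

```cpp
#include <cstdint>

// Hypothetical helper type: one line of `info proc mappings`, as [start, end).
struct Mapping {
  uint64_t start;
  uint64_t end;
};

// True when addr falls inside the half-open range [start, end).
constexpr bool Contains(const Mapping& m, uint64_t addr) {
  return addr >= m.start && addr < m.end;
}

// Size of the mapping in MiB (1 MiB = 1 << 20 bytes).
constexpr uint64_t SizeMiB(const Mapping& m) {
  return (m.end - m.start) >> 20;
}

// Values taken from the gdb output on this page.
constexpr Mapping kJitMapping{0x7fccfb9000, 0x7fcdfb9000};
constexpr Mapping kGuestText{0x401000, 0x402000};
```

With these values, the host PC `0x7fccfb9ec8` lies inside the 16MB JIT mapping, and the guest RIPs `0x401110` and `0x401136` lie inside the test application's mapping.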

Getting RIP of current code block

FEX sets up an address in our CPU context to get some debug data out.

Currently this isn't exposed in a way that a debugger can see other than manually typing out gdb commands

 # Inline block header pointer
 p/x *(uint64_t*)($x28+184)
 # OffsetToBlockTail
 p/x *(uint32_t*)(*(uint64_t*)($x28+184))
 # Address of the block tail
 p/x *(uint64_t*)($x28+184) + *(uint32_t*)(*(uint64_t*)($x28+184))

 # RIP of block
 p/x *(unsigned long long*)((*(unsigned long long*)($x28+184) + *(unsigned int*)(*(unsigned long long*)($x28+184)))+8)

 # Disassemble this block of code
 disas *(unsigned long long*)($x28+184),+*(unsigned long long*)(*(unsigned long long*)($x28+184) + *(unsigned int*)(*(unsigned long long*)($x28+184)))

While hard to decipher, here is basically what is happening.
- x28 is the CPU register that FEX keeps in the JIT for context accesses
- Offset 184 is the offset of the `InlineJITBlockHeader` member inside of that context
- As long as FEX is in a JIT block, that offset will be valid to point to the current RIP that the block is operating on
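
The pointer chasing in those gdb expressions may be easier to follow as ordinary code. The sketch below models only what the commands themselves read: a 32-bit offset at the start of the header, then a 64-bit value at the tail (used as the `disas` length) followed by the 64-bit RIP at tail + 8. `BlockInfo`, `ReadBlockInfo`, `ReadBlockInfoFromContext`, and `kInlineHeaderOffset` are hypothetical names, not FEX's declarations, and the offset 184 is taken from the commands above and may differ between FEX versions.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumption: offset of the inline block header pointer inside the context,
// copied from the gdb commands on this page.
constexpr std::size_t kInlineHeaderOffset = 184;

// Hypothetical view of the two values the gdb commands read from the tail.
struct BlockInfo {
  uint64_t Length;  // read at tail + 0, used as the `disas` length
  uint64_t RIP;     // read at tail + 8, guest RIP of the block
};

// Follows header -> tail the same way the gdb expressions do.
BlockInfo ReadBlockInfo(const uint8_t* header) {
  uint32_t offset_to_tail;
  std::memcpy(&offset_to_tail, header, sizeof(offset_to_tail));
  const uint8_t* tail = header + offset_to_tail;

  BlockInfo info;
  std::memcpy(&info.Length, tail, sizeof(info.Length));
  std::memcpy(&info.RIP, tail + 8, sizeof(info.RIP));
  return info;
}

// Starts from the context pointer (what x28 holds) instead of the header.
BlockInfo ReadBlockInfoFromContext(const uint8_t* context) {
  uint64_t header_address;
  std::memcpy(&header_address, context + kInlineHeaderOffset,
              sizeof(header_address));
  return ReadBlockInfo(reinterpret_cast<const uint8_t*>(
      static_cast<uintptr_t>(header_address)));
}
```

Using `memcpy` instead of casting keeps the sketch free of alignment assumptions; the gdb expressions do the equivalent reads with casts.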

Getting the stack can also be very useful

 x/64wx ((FEXCore::Core::CPUState*)$x28)->gregs[FEXCore::X86State::REG_RSP]

The block header commands above use raw pointer math, which means they work even when gdb fails to find symbols for the CPUState object, which happens very frequently for some reason.