Development:32Bit Syscall Woes

Syscall problems

We are always running as a 64bit process, currently either x86-64 or AArch64. A large number of syscalls therefore require struct rewriting. This is entirely expected and is something we just have to deal with. It is mostly a manual process of defining the 32bit structure layout and remapping the host's structure into the 32bit x86 structure.
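
As an illustration of that remapping, here is a minimal sketch for `struct iovec`. The names `GuestIovec32`, `HostIovec64` and `ToHostIovec` are made up for this example and are not FEX's own types. The host layout is written out with fixed-width fields so the sketch does not depend on system headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 32bit x86 view of `struct iovec`: both fields are 4 bytes wide.
struct GuestIovec32 {
  uint32_t iov_base; // guest pointer, always below 4GB
  uint32_t iov_len;
};
static_assert(sizeof(GuestIovec32) == 8, "32bit iovec is 8 bytes");

// Layout of `struct iovec` on a 64bit host, spelled with fixed-width fields.
struct HostIovec64 {
  uint64_t iov_base;
  uint64_t iov_len;
};
static_assert(sizeof(HostIovec64) == 16, "64bit iovec is 16 bytes");

// Widen an array of guest entries into host entries before the host syscall.
inline std::vector<HostIovec64> ToHostIovec(const GuestIovec32* guest, size_t count) {
  assert(guest != nullptr || count == 0);
  std::vector<HostIovec64> host(count);
  for (size_t i = 0; i < count; ++i) {
    host[i].iov_base = guest[i].iov_base; // zero-extended guest pointer
    host[i].iov_len = guest[i].iov_len;
  }
  return host;
}
```

Structures that the kernel writes back (for example `stat`) need the same treatment in the opposite direction, narrowing each host field into the guest layout.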

mmap and mmap2 - 90/192

32bit mmap must return pointers in the lower 32bit memory space. We can get around this by having an allocator that manually places pointers with mmap. One issue with this approach is that the kernel has more knowledge about the memory arrangement, so if any ioctl allocates memory behind our back it will desync our allocator. This means our allocator must always be allowed to fail, and must keep trying to allocate forward until it has either found a location or wrapped around. We will always end up with holes in our bitmap because ioctls will have consumed space.

Possible Fixes

MAP_32BIT
- Doesn't exist on AArch64
- Only allows allocation in the range of [0x4000'0000, 0x8000'0000) (1GB) instead of the full lower 32bits
- Not a real solution
New Flag: MAP_FULL_32BIT
- Make it available on AArch64 and x86-64 (For compatibility testing)
- Allow it to map the full 32bit memory space
- Should be fine to allow 64bit address and 64bit pgoff to allow mmap2 behaviour
New syscall: mmap_range - **Ideal solution**
- Instead of a single address, pass in a lower bound and upper bound address
- ERANGE if one of the addresses lives outside of the viable range? Or silently clamp?
- MAP_FIXED{_NOREPLACE} ignores upper range argument, places at lower?
- Bit more code work to punch it through everywhere
Do userspace bitmap allocator
- Puts the burden on FEX to use MAP_FIXED_NOREPLACE for anything that hasn't set MAP_FIXED or MAP_FIXED_NOREPLACE
- We don't have the full VMA view so it must always be pessimistic about failures and keep trying on failure
- Will always have higher overhead
- HUGETLB is an annoyance to reimplement
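
The pessimistic retry behaviour of the userspace bitmap allocator can be sketched as follows. This is only an illustration: `LowBitmapAllocator` and its `TryMap` callback are hypothetical names, a 4KB page size is assumed, and the callback stands in for a real `mmap` call with MAP_FIXED_NOREPLACE so that the logic can be shown without touching the process address space.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical page-granular allocator for the lower 4GB (assumes 4KB pages).
class LowBitmapAllocator {
public:
  static constexpr uint64_t kPageSize = 4096;
  static constexpr uint64_t kPageCount = (1ull << 32) / kPageSize;
  static constexpr uint64_t kFirstPage = 16; // assumption: skip the first 64KB

  // Stand-in for mmap(addr, len, ..., MAP_FIXED_NOREPLACE, ...). Returns true
  // only if the kernel actually placed the mapping at addr.
  using TryMap = std::function<bool(uint64_t addr, uint64_t len)>;

  LowBitmapAllocator() : used_(kPageCount, false) {}

  // Returns the chosen address, or 0 once every candidate has been tried.
  uint64_t Allocate(uint64_t len, const TryMap& try_map) {
    if (len == 0 || len > (kPageCount - kFirstPage) * kPageSize) return 0;
    const uint64_t pages = (len + kPageSize - 1) / kPageSize;
    uint64_t page = cursor_;
    for (uint64_t scanned = 0; scanned < kPageCount; ++scanned, ++page) {
      if (page + pages > kPageCount) page = kFirstPage; // wrap around
      if (!RangeFree(page, pages)) continue;            // our bitmap says busy
      if (!try_map(page * kPageSize, len)) continue;    // kernel says busy
      for (uint64_t i = 0; i < pages; ++i) used_[page + i] = true;
      cursor_ = page + pages;
      return page * kPageSize;
    }
    return 0;
  }

  void Free(uint64_t addr, uint64_t len) {
    const uint64_t pages = (len + kPageSize - 1) / kPageSize;
    assert(addr % kPageSize == 0 && addr / kPageSize + pages <= kPageCount);
    for (uint64_t i = 0; i < pages; ++i) used_[addr / kPageSize + i] = false;
  }

private:
  bool RangeFree(uint64_t page, uint64_t pages) const {
    for (uint64_t i = 0; i < pages; ++i) {
      if (used_[page + i]) return false;
    }
    return true;
  }

  std::vector<bool> used_;
  uint64_t cursor_ = kFirstPage;
};
```

A failed `try_map` is not recorded in the bitmap here, so the same hole is probed again after a wrap. That keeps the sketch simple at the cost of the extra overhead described above.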

Current implementation

Map the full [0x1'0000'0000, 0x1'0000'0000'0000) range on application startup
- This is 256TB minus 4GB of range
Once the 256TB of space is allocated, use an internal allocator for the host side allocations.
- FEX itself won't ever use any of:
  - mmap without MAP_FIXED placement
  - mremap without MREMAP_FIXED allocation
  - shmat (at all, can't control placement)
  - ioctl (at all)
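
One way such a startup reservation could look is sketched below. This is not FEX's actual code: `ReserveRange` is a hypothetical helper, a 4KB page size is assumed, and the halving strategy is just one option for stepping around mappings that already exist in the range (the stack, loaded libraries and so on).

```cpp
#include <cassert>
#include <cstdint>
#include <sys/mman.h>

#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x100000 // Linux 4.17+, generic UAPI value
#endif

// Reserve as much of [begin, end) as possible with inaccessible pages and
// return the number of bytes reserved. A range that collides with an existing
// mapping is split in half and retried until pieces are min_chunk or smaller.
inline uint64_t ReserveRange(uint64_t begin, uint64_t end, uint64_t min_chunk) {
  assert(begin < end && min_chunk >= 4096);
  const uint64_t len = end - begin;
  void* want = reinterpret_cast<void*>(begin);
  void* got = mmap(want, len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE | MAP_FIXED_NOREPLACE,
                   -1, 0);
  if (got == want) return len;
  // Kernels without MAP_FIXED_NOREPLACE treat the address as a hint only.
  if (got != MAP_FAILED) munmap(got, len);
  if (len <= min_chunk) return 0;
  const uint64_t mid = begin + ((len / 2) & ~uint64_t{4095});
  return ReserveRange(begin, mid, min_chunk) + ReserveRange(mid, end, min_chunk);
}
```

With the range from this section the call would be `ReserveRange(0x1'0000'0000, 0x1'0000'0000'0000, 4096)`. Whatever the helper could not reserve is already occupied, so the kernel cannot hand it out to a later allocation either.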

mremap - 163

Allows userspace to remap an allocation from an old size to a new size, with the possibility to move to a new address. MREMAP_FIXED allows the application to remap the address to a new fixed location. This isn't a problem. MREMAP_MAYMOVE allows the kernel to select a new memory region on resize if the previous VM location wouldn't fit. This is a problem.

Specifically, MREMAP_MAYMOVE means that we have to scan the memory region with mmap first and then, once we have found a valid location, do an mremap to that location with MREMAP_FIXED. This can be a race condition.

On move the kernel may end up moving a VM range to the 64bit range, which would break things.
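
The scan-then-remap approach described above could be sketched like this. `RemapBelow4GB` and its scan constants are hypothetical. The probe mapping holds the target range until `mremap` replaces it, which narrows the race but does not remove it: another thread using MAP_FIXED or MREMAP_FIXED on the same range in between would still clobber it.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>

#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x100000 // Linux 4.17+, generic UAPI value
#endif

// Move/resize a mapping so that the result lives entirely below 4GB.
// Returns the new address or MAP_FAILED with errno set.
inline void* RemapBelow4GB(void* old_addr, size_t old_size, size_t new_size) {
  constexpr uint64_t kFourGB = 1ull << 32;
  constexpr uint64_t kScanStart = 0x10000000; // hypothetical first candidate
  constexpr uint64_t kScanStep = 0x10000;     // hypothetical scan granularity
  assert(new_size != 0 && new_size < kFourGB - kScanStart);

  for (uint64_t candidate = kScanStart; candidate + new_size <= kFourGB;
       candidate += kScanStep) {
    void* want = reinterpret_cast<void*>(candidate);
    // Step 1: probe for a free range by placing a throwaway reservation.
    void* probe = mmap(want, new_size, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE | MAP_FIXED_NOREPLACE,
                       -1, 0);
    if (probe == MAP_FAILED) continue;
    if (probe != want) { // kernel ignored the flag and picked another spot
      munmap(probe, new_size);
      continue;
    }
    // Step 2: move the real mapping on top of the reservation.
    void* moved = mremap(old_addr, old_size, new_size,
                         MREMAP_MAYMOVE | MREMAP_FIXED, want);
    if (moved == MAP_FAILED) {
      const int saved = errno;
      munmap(probe, new_size);
      errno = saved;
    }
    return moved;
  }
  errno = ENOMEM;
  return MAP_FAILED;
}
```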

Possible fixes

With support from the previous userspace bitmap allocator
- This would allow us to find a new memory region to remap the location to
- Performance concern since we fall down the same pessimistic path that mmap needs to hit
New flag: MREMAP_FULL_32BIT
- Just like the mmap new flag, limit mremap to the full lower 32bits
- Would leave the burden of support to the kernel since it has the full VMA view
New flag: MREMAP_FIXED_NOREPLACE
- Just like mmap's MAP_FIXED_NOREPLACE but for mremap
- Makes the bitmap allocator path less of a race condition to implement.
New syscall: mremap_range
- Just like the mmap new syscall, provide an address range that is viable

Current implementation

See mmap/mmap2 current implementation

shmat - 397

Allows userspace to attach a SHM region to an address of the process. If the passed in address is nullptr then the kernel is allowed to choose the address to map the region. This can end up mapping the region to 64bit space, which will obviously break things. Passing a null address is what the docs recommend as the portable usage.

Possible fixes

With support from the previous userspace bitmap allocator
- Will allow us to find a new memory region
- Same performance concerns
- shmat behaves like MAP_FIXED_NOREPLACE, will return EINVAL if the memory region is already mapped (Unless SHM_REMAP specified)
New Flag: SHM_FULL_32BIT
- Just like the mmap flag, limits the shmat behaviour to 32bits
New Syscall: shmat_range
- Just like mmap_range, limits to address range provided

Current implementation

See mmap/mmap2 current implementation

ioctl - 54

Linux branch: https://github.com/Sonicadvance1/linux/tree/fex_ioctl32

Possible fixes

Userspace ioctl structure parsing
- Prone to failure
- Requires userspace to constantly be tracking upstream support of ioctls
- Ioctls allocating any sort of memory will likely break and can't be worked around
- Any ioctls changing will be difficult to track (doesn't happen often but it does)
- Will always be a subset of the full kernel ioctl support
- Not a good solution
Implement new compat_ioctl syscall
- Allows us to call the 32bit ioctl syscall from 64bit userspace
- Proven to work locally with some hiccups
- Tracks the regular compat_ioctl support in the kernel so it'll always be up to date
- Requires CONFIG_COMPAT in the kernel to be supported
- Might hit problems if the ioctl checks the process type instead of where the entry was from.
Expose compat_ioctl to symbols for modules to use
- Allows FEX to load a kernel module for 32bit support
- Exposes the compat_ioctl syscall interface through the module through a 64bit ioctl interface
- Might hit problems if the ioctl checks the process type instead of where the entry was from.

Possible Problems

Some ioctls use `compat_alloc_user_space` to allocate temporary data on the stack

Depending on architecture, this may or may not return a 64bit stack offset.

On x86-64 it returns a 32bit truncated stack pointer
On AArch64 it returns the full 64bit stack pointer

Users of this helper may or may not expect a 64bit pointer. With only 29 users of this helper in the kernel, this may be easy enough to resolve.

Possible Fixes

Enforce that this may return a stack that is in 64bit space
- Might cause headaches for kernel developers that are expecting compat ioctls to only ever have 32bit pointers
- From the low amount of users it may still be fine?
- sound drivers, socket drivers, usb drivers,
Allocate host stack in 32bit space
- This works around the issue by the host stack living in 32bit space
- Major problem is that now we have two stacks living in 32bit space PER THREAD.
- We don't want to take up more of the limited 32bit address space than we have to
  - Per-thread VMA regions for stack would be a viable solution, needs support in kernel
  - New mmap flag MAP_THREAD_VMA?
  - Something like this might already exist in the form of the kernel vmacache?
  - Each (emulator) thread's stack lives at the same VM location, and special care to be taken to not try and pass that region to another thread
  - Set up a mirror to another region so if we absolutely need to pull that thread's data from another thread, index + offset into the mirrored backing buffer, which is mapped for the full process VMA.
    - We don't want to remove CLONE_VM from clone, since that will cause significantly more problems
  - 32bit already runs out of memory space, doubling the number of stacks means that each thread's initial memory usage is literally double.
Allocate a couple pages to do a stack pivot before each 32bit ioctl
- As long as these don't ever try and allocate a large amount of data then a couple of pages per thread isn't terrible
- Just means every 32bit ioctl will need some number of pages allocated for the chance of a pivot.
- 65k per thread? Might be fine, haven't tested

Some ioctls check the current operating mode of the process to change behaviour

ioctls can check the task struct to see if the task is a 32bit or 64bit task. This will change behaviour since anything operating under FEX will end up being a 64bit task. The current workaround is that every ioctl32 modifies that task flag before and after the internal ioctl call.

If the ioctl is interrupted, would this stick the system into a weird state?
We probably need to pass a compat flag through somewhere?
- x86-64 `current_thread_info()->status |= TS_COMPAT;`
- AArch64 `set_thread_flag(TIF_32BIT)`
- Setting these flags outside of the 32bit trampoline handling will cause issues with interrupting syscalls

Any ioctl that allocates memory needs to be aware that if it is in a compat ioctl that it is only allocated in 32bit space

This should theoretically already be available for x32 ABI but might be partially broken
FEX will work around it by stealing all of the 64bit VMA space, so nothing can allocate in the high bits
Would be nicer to enforce that compat_ioctl memory allocations must be in the lower 32bits
- Hard to guarantee.

Struct packing has and will continue to have problems between architectures
Very specifically u64 has 4byte alignment on x86 while it will have 8 byte alignment on aarch64 (and aarch32 compat?)
- I believe something that crosses the 64byte boundary has different alignment requirements as well?
Haven't encountered any ioctls yet that cause an issue
- Once discovered, either work with upstream and project to identify a path to move forward making it compatible
- Or if it is an application that shouldn't be running on ARM anyway, put in an application profile claiming its ioctls are broken
- Only solution is ioctl translation layer at that point otherwise, if it is an application we care about.
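
The u64 alignment difference can be made concrete with a small sketch. It assumes GCC or Clang, which allow an `aligned` attribute on a typedef to lower alignment. `guest_u64`, `GuestLayout`, `NaturalLayout` and `FromGuest` are illustrative names only and do not describe any real ioctl argument.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// A 64bit integer with the 4 byte alignment that 32bit x86 gives it in structs.
typedef uint64_t __attribute__((aligned(4))) guest_u64;

// Layout a 32bit x86 compiler would produce: no padding before `value`.
struct GuestLayout {
  uint32_t flags;
  guest_u64 value;
};

// Layout a 64bit host compiler produces: 4 bytes of padding before `value`.
struct NaturalLayout {
  uint32_t flags;
  uint64_t value;
};

static_assert(offsetof(GuestLayout, value) == 4, "x86 32bit: u64 at offset 4");
static_assert(sizeof(GuestLayout) == 12, "x86 32bit: 12 byte struct");
static_assert(offsetof(NaturalLayout, value) == 8, "64bit host: u64 at offset 8");
static_assert(sizeof(NaturalLayout) == 16, "64bit host: 16 byte struct");

// Field-by-field copy is what a translation layer has to do for such structs.
inline NaturalLayout FromGuest(const GuestLayout& guest) {
  return NaturalLayout{guest.flags, guest.value};
}
```

Any ioctl argument with this shape cannot be passed through unchanged to a handler that assumes the 8 byte aligned layout.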

Allocating upper 64-bits of VA space means stack growing no longer works

Host application has stolen all the memory. It can no longer grow the host application's VA space.

MAP_GROWSUP and MAP_GROWSDOWN

IOCTL number conflicts are non-trivial to resolve in userspace

Just take a look at ioctl-number.rst, there are 123 instances of conflict in there
Trying to resolve conflicts back to the correct device? How?
- Even QEMU doesn't try to figure this out
- Even strace doesn't try to figure this out
- Turns out that resolving an ioctl command conflict is impossible from userspace
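
To see why the command value alone cannot identify a device, it helps to look at what the value encodes. The sketch below follows the generic Linux `_IOC()` bit layout used on x86 and Arm. The two "drivers" are hypothetical and only illustrate that nothing in the encoding carries device identity.

```cpp
#include <cassert>
#include <cstdint>

// Generic _IOC() layout: bits 0-7 number, bits 8-15 type,
// bits 16-29 argument size, bits 30-31 direction (1 = write, 2 = read).
struct IoctlFields {
  uint8_t nr;
  uint8_t type;
  uint16_t size;
  uint8_t dir;
};

constexpr uint32_t EncodeIoctl(uint8_t dir, uint8_t type, uint8_t nr, uint16_t size) {
  return ((uint32_t{dir} & 0x3) << 30) | ((uint32_t{size} & 0x3FFF) << 16) |
         (uint32_t{type} << 8) | uint32_t{nr};
}

constexpr IoctlFields DecodeIoctl(uint32_t cmd) {
  return IoctlFields{static_cast<uint8_t>(cmd & 0xFF),
                     static_cast<uint8_t>((cmd >> 8) & 0xFF),
                     static_cast<uint16_t>((cmd >> 16) & 0x3FFF),
                     static_cast<uint8_t>((cmd >> 30) & 0x3)};
}

// Two hypothetical drivers that both picked type 'F', number 1 and an 8 byte
// read argument. Neither constant refers to a real driver.
constexpr uint32_t kHypotheticalDriverA_Get = EncodeIoctl(2, 'F', 1, 8);
constexpr uint32_t kHypotheticalDriverB_Get = EncodeIoctl(2, 'F', 1, 8);
static_assert(kHypotheticalDriverA_Get == kHypotheticalDriverB_Get,
              "the command value alone cannot tell the two drivers apart");
```

The kernel resolves the ambiguity through the file descriptor, which selects the driver's handler. A userspace translation layer only sees the number and the argument pointer.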

Current implementation

See mmap/mmap2 current implementation

sendmsg, recvmsg, sendmmsg, recvmmsg - (socketcall) 102, 370, 372, 345, 337/417

This is easy to break and prone to breaking. If the user application ships auxiliary channel data with exactly sized packets, it can expect to read exactly that amount of data. FEX must munge the data because of pointer size differences, so we need to provide buffers to the kernel interface that are significantly larger than what the application provides. This can end up reading more data than the application expects, which breaks its assumption of exactly sized packets. This potential footgun on the application facing side turns into an even larger footgun on the FEX side.
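
The size mismatch comes from the control message header. A 32bit guest uses a 4 byte `cmsg_len` and pads to 4 bytes, while a 64bit host uses an 8 byte `cmsg_len` and pads to 8 bytes. A small sketch of the arithmetic, with hypothetical helper names that mirror what `CMSG_SPACE` computes on each side:

```cpp
#include <cassert>
#include <cstddef>

constexpr size_t AlignUp(size_t value, size_t alignment) {
  return (value + alignment - 1) & ~(alignment - 1);
}

// 32bit guest: header is cmsg_len (4) + cmsg_level (4) + cmsg_type (4).
constexpr size_t kGuestCmsgHeader = 12;
constexpr size_t GuestCmsgSpace(size_t payload) {
  return AlignUp(kGuestCmsgHeader, 4) + AlignUp(payload, 4);
}

// 64bit host: header is cmsg_len (8) + cmsg_level (4) + cmsg_type (4).
constexpr size_t kHostCmsgHeader = 16;
constexpr size_t HostCmsgSpace(size_t payload) {
  return AlignUp(kHostCmsgHeader, 8) + AlignUp(payload, 8);
}

// One file descriptor sent with SCM_RIGHTS is a 4 byte payload.
static_assert(GuestCmsgSpace(4) == 16, "guest sized its buffer for 16 bytes");
static_assert(HostCmsgSpace(4) == 24, "host kernel needs 24 bytes for the same message");
```

A guest that sized its control buffer for exactly one descriptor therefore hands over 16 bytes where the host kernel wants 24, which is why the host-side buffer has to be larger than what the application provided.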

Current Implementation

Try to patch up the data as well as we can and just eat the fact that guest applications could accidentally read too much data and lose it.

Getdents family

Behaviour changes between 32bit and 64bit and can't be directly emulated. Could we recreate whatever hashing algorithm the kernel uses for the 32-bit side and pass the result to the guest? This needs to match behaviour for an application doing getdents64 in a 64-bit process and passing that data off to a 32-bit process.

Current Implementation

getdents - returns -ENOSYS and prays nothing uses it
getdents64 - overwrites d_off with an incrementing number and hopes nothing uses it
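
The d_off overwrite for getdents64 can be sketched as a walk over the returned buffer. `RewriteDirentOffsets` is a hypothetical helper, not FEX's actual code. The field offsets follow the kernel's `linux_dirent64` record: d_ino at byte 0, d_off at byte 8, d_reclen at byte 16, d_type at byte 18, followed by the name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr size_t kOffsetDOff = 8;      // int64_t d_off
constexpr size_t kOffsetDReclen = 16;  // uint16_t d_reclen
constexpr size_t kFixedHeaderSize = 19; // bytes before d_name

// Replace d_off in every record of a getdents64 result with a running counter.
// `bytes` is the syscall's return value. Returns the next unused counter value.
inline int64_t RewriteDirentOffsets(uint8_t* buf, size_t bytes, int64_t next_off) {
  assert(buf != nullptr || bytes == 0);
  size_t pos = 0;
  while (pos + kFixedHeaderSize <= bytes) {
    uint16_t reclen;
    std::memcpy(&reclen, buf + pos + kOffsetDReclen, sizeof(reclen));
    if (reclen < kFixedHeaderSize || pos + reclen > bytes) break; // malformed
    std::memcpy(buf + pos + kOffsetDOff, &next_off, sizeof(next_off));
    ++next_off;
    pos += reclen;
  }
  return next_off;
}
```

The counter values are only meaningful to the emulator, so a guest that seeks with a stored d_off would need the emulator to translate it back. This sketch does not attempt that.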

Robust list/futex family

There is zero access to the 32bit robust list without a syscall API. A mixture of host and guest robust futexes results in pain. It can't be emulated due to the robustness of the feature. Giving an API for multiple robust list tracking wouldn't be all-encompassing since we still have a 32-bit robust list that needs to be handed to the kernel.

Possible fixes

Implement new compat syscalls for each
- Ensure they are available on AArch64
Implement a new key based robust list syscall
- Instead of just having a single robust list per process (technically two for x86-64)
- Implement a new syscall for giving the kernel multiple robust lists tied to a key
- Like pkey_alloc and pkey_free
- Ensure it has a flag for 32-bit pointer list or not
- This way the kernel will know if the passed in linked list should be tracked using 32-bit pointers or 64-bit pointers.

Current implementation

64-bit processes
- Pass the robust list through so the kernel is tracking the guest application's robust list
32-bit processes
- Lie to them and say the list is getting tracked; it won't actually work.

Get/Set sockopt

Some options rely on the level and type of socket referred to by the passed in file descriptor. Currently SOL_SOCKET is fixed up inside of FEX but there are more options that check if they are coming from a compatibility syscall. FEX will have a hard time emulating this since it can't track what every FD is, especially ones passed with SCM_RIGHTS.

Possible fixes

TBD
