|  | 1. Intel(R) MPX Overview | 
|  | ======================== | 
|  |  | 
|  | Intel(R) Memory Protection Extensions (Intel(R) MPX) is a new capability | 
|  | introduced into Intel Architecture. Intel MPX provides hardware features | 
|  | that can be used in conjunction with compiler changes to check memory | 
|  | references, for those references whose compile-time normal intentions are | 
|  | usurped at runtime due to buffer overflow or underflow. | 
|  |  | 
|  | You can tell if your CPU supports MPX by looking in /proc/cpuinfo: | 
|  |  | 
|  | cat /proc/cpuinfo  | grep ' mpx ' | 
|  |  | 
|  | For more information, please refer to Intel(R) Architecture Instruction | 
|  | Set Extensions Programming Reference, Chapter 9: Intel(R) Memory Protection | 
|  | Extensions. | 
|  |  | 
|  | Note: As of December 2014, no hardware with MPX is available but it is | 
|  | possible to use SDE (Intel(R) Software Development Emulator) instead, which | 
|  | can be downloaded from | 
|  | http://software.intel.com/en-us/articles/intel-software-development-emulator | 
|  |  | 
|  |  | 
|  | 2. How to get the advantage of MPX | 
|  | ================================== | 
|  |  | 
|  | For MPX to work, changes are required in the kernel, binutils and compiler. | 
|  | No source changes are required for applications, just a recompile. | 
|  |  | 
|  | There are a lot of moving parts of this to all work right. The following | 
|  | is how we expect the compiler, application and kernel to work together. | 
|  |  | 
|  | 1) Application developer compiles with -fmpx. The compiler will add the | 
|  | instrumentation as well as some setup code called early after the app | 
|  | starts. New instruction prefixes are noops for old CPUs. | 
|  | 2) That setup code allocates (virtual) space for the "bounds directory", | 
|  | points the "bndcfgu" register to the directory (must also set the valid | 
|  | bit) and notifies the kernel (via the new prctl(PR_MPX_ENABLE_MANAGEMENT)) | 
|  | that the app will be using MPX.  The app must be careful not to access | 
|  | the bounds tables between the time when it populates "bndcfgu" and | 
|  | when it calls the prctl().  This might be hard to guarantee if the app | 
|  | is compiled with MPX.  You can add "__attribute__((bnd_legacy))" to | 
|  | the function to disable MPX instrumentation to help guarantee this. | 
|  | Also be careful not to call out to any other code which might be | 
|  | MPX-instrumented. | 
|  | 3) The kernel detects that the CPU has MPX, allows the new prctl() to | 
|  | succeed, and notes the location of the bounds directory. Userspace is | 
|  | expected to keep the bounds directory at that locationWe note it | 
|  | instead of reading it each time because the 'xsave' operation needed | 
|  | to access the bounds directory register is an expensive operation. | 
|  | 4) If the application needs to spill bounds out of the 4 registers, it | 
|  | issues a bndstx instruction. Since the bounds directory is empty at | 
|  | this point, a bounds fault (#BR) is raised, the kernel allocates a | 
|  | bounds table (in the user address space) and makes the relevant entry | 
|  | in the bounds directory point to the new table. | 
|  | 5) If the application violates the bounds specified in the bounds registers, | 
|  | a separate kind of #BR is raised which will deliver a signal with | 
|  | information about the violation in the 'struct siginfo'. | 
|  | 6) Whenever memory is freed, we know that it can no longer contain valid | 
|  | pointers, and we attempt to free the associated space in the bounds | 
|  | tables. If an entire table becomes unused, we will attempt to free | 
|  | the table and remove the entry in the directory. | 
|  |  | 
|  | To summarize, there are essentially three things interacting here: | 
|  |  | 
|  | GCC with -fmpx: | 
|  | * enables annotation of code with MPX instructions and prefixes | 
|  | * inserts code early in the application to call in to the "gcc runtime" | 
|  | GCC MPX Runtime: | 
|  | * Checks for hardware MPX support in cpuid leaf | 
|  | * allocates virtual space for the bounds directory (malloc() essentially) | 
|  | * points the hardware BNDCFGU register at the directory | 
|  | * calls a new prctl(PR_MPX_ENABLE_MANAGEMENT) to notify the kernel to | 
|  | start managing the bounds directories | 
|  | Kernel MPX Code: | 
|  | * Checks for hardware MPX support in cpuid leaf | 
|  | * Handles #BR exceptions and sends SIGSEGV to the app when it violates | 
|  | bounds, like during a buffer overflow. | 
|  | * When bounds are spilled in to an unallocated bounds table, the kernel | 
|  | notices in the #BR exception, allocates the virtual space, then | 
|  | updates the bounds directory to point to the new table. It keeps | 
|  | special track of the memory with a VM_MPX flag. | 
|  | * Frees unused bounds tables at the time that the memory they described | 
|  | is unmapped. | 
|  |  | 
|  |  | 
|  | 3. How does MPX kernel code work | 
|  | ================================ | 
|  |  | 
|  | Handling #BR faults caused by MPX | 
|  | --------------------------------- | 
|  |  | 
|  | When MPX is enabled, there are 2 new situations that can generate | 
|  | #BR faults. | 
|  | * new bounds tables (BT) need to be allocated to save bounds. | 
|  | * bounds violation caused by MPX instructions. | 
|  |  | 
|  | We hook #BR handler to handle these two new situations. | 
|  |  | 
|  | On-demand kernel allocation of bounds tables | 
|  | -------------------------------------------- | 
|  |  | 
|  | MPX only has 4 hardware registers for storing bounds information. If | 
|  | MPX-enabled code needs more than these 4 registers, it needs to spill | 
|  | them somewhere. It has two special instructions for this which allow | 
|  | the bounds to be moved between the bounds registers and some new "bounds | 
|  | tables". | 
|  |  | 
|  | #BR exceptions are a new class of exceptions just for MPX. They are | 
|  | similar conceptually to a page fault and will be raised by the MPX | 
|  | hardware during both bounds violations or when the tables are not | 
|  | present. The kernel handles those #BR exceptions for not-present tables | 
|  | by carving the space out of the normal processes address space and then | 
|  | pointing the bounds-directory over to it. | 
|  |  | 
|  | The tables need to be accessed and controlled by userspace because | 
|  | the instructions for moving bounds in and out of them are extremely | 
|  | frequent. They potentially happen every time a register points to | 
|  | memory. Any direct kernel involvement (like a syscall) to access the | 
|  | tables would obviously destroy performance. | 
|  |  | 
|  | Why not do this in userspace? MPX does not strictly require anything in | 
|  | the kernel. It can theoretically be done completely from userspace. Here | 
|  | are a few ways this could be done. We don't think any of them are practical | 
|  | in the real-world, but here they are. | 
|  |  | 
|  | Q: Can virtual space simply be reserved for the bounds tables so that we | 
|  | never have to allocate them? | 
|  | A: MPX-enabled application will possibly create a lot of bounds tables in | 
|  | process address space to save bounds information. These tables can take | 
|  | up huge swaths of memory (as much as 80% of the memory on the system) | 
|  | even if we clean them up aggressively. In the worst-case scenario, the | 
|  | tables can be 4x the size of the data structure being tracked. IOW, a | 
|  | 1-page structure can require 4 bounds-table pages. An X-GB virtual | 
|  | area needs 4*X GB of virtual space, plus 2GB for the bounds directory. | 
|  | If we were to preallocate them for the 128TB of user virtual address | 
|  | space, we would need to reserve 512TB+2GB, which is larger than the | 
|  | entire virtual address space today. This means they can not be reserved | 
|  | ahead of time. Also, a single process's pre-popualated bounds directory | 
|  | consumes 2GB of virtual *AND* physical memory. IOW, it's completely | 
|  | infeasible to prepopulate bounds directories. | 
|  |  | 
|  | Q: Can we preallocate bounds table space at the same time memory is | 
|  | allocated which might contain pointers that might eventually need | 
|  | bounds tables? | 
|  | A: This would work if we could hook the site of each and every memory | 
|  | allocation syscall. This can be done for small, constrained applications. | 
|  | But, it isn't practical at a larger scale since a given app has no | 
|  | way of controlling how all the parts of the app might allocate memory | 
|  | (think libraries). The kernel is really the only place to intercept | 
|  | these calls. | 
|  |  | 
|  | Q: Could a bounds fault be handed to userspace and the tables allocated | 
|  | there in a signal handler intead of in the kernel? | 
|  | A: mmap() is not on the list of safe async handler functions and even | 
|  | if mmap() would work it still requires locking or nasty tricks to | 
|  | keep track of the allocation state there. | 
|  |  | 
|  | Having ruled out all of the userspace-only approaches for managing | 
|  | bounds tables that we could think of, we create them on demand in | 
|  | the kernel. | 
|  |  | 
|  | Decoding MPX instructions | 
|  | ------------------------- | 
|  |  | 
|  | If a #BR is generated due to a bounds violation caused by MPX. | 
|  | We need to decode MPX instructions to get violation address and | 
|  | set this address into extended struct siginfo. | 
|  |  | 
|  | The _sigfault feild of struct siginfo is extended as follow: | 
|  |  | 
|  | 87		/* SIGILL, SIGFPE, SIGSEGV, SIGBUS */ | 
|  | 88		struct { | 
|  | 89			void __user *_addr; /* faulting insn/memory ref. */ | 
|  | 90 #ifdef __ARCH_SI_TRAPNO | 
|  | 91			int _trapno;	/* TRAP # which caused the signal */ | 
|  | 92 #endif | 
|  | 93			short _addr_lsb; /* LSB of the reported address */ | 
|  | 94			struct { | 
|  | 95				void __user *_lower; | 
|  | 96				void __user *_upper; | 
|  | 97			} _addr_bnd; | 
|  | 98		} _sigfault; | 
|  |  | 
|  | The '_addr' field refers to violation address, and new '_addr_and' | 
|  | field refers to the upper/lower bounds when a #BR is caused. | 
|  |  | 
|  | Glibc will be also updated to support this new siginfo. So user | 
|  | can get violation address and bounds when bounds violations occur. | 
|  |  | 
|  | Cleanup unused bounds tables | 
|  | ---------------------------- | 
|  |  | 
|  | When a BNDSTX instruction attempts to save bounds to a bounds directory | 
|  | entry marked as invalid, a #BR is generated. This is an indication that | 
|  | no bounds table exists for this entry. In this case the fault handler | 
|  | will allocate a new bounds table on demand. | 
|  |  | 
|  | Since the kernel allocated those tables on-demand without userspace | 
|  | knowledge, it is also responsible for freeing them when the associated | 
|  | mappings go away. | 
|  |  | 
|  | Here, the solution for this issue is to hook do_munmap() to check | 
|  | whether one process is MPX enabled. If yes, those bounds tables covered | 
|  | in the virtual address region which is being unmapped will be freed also. | 
|  |  | 
|  | Adding new prctl commands | 
|  | ------------------------- | 
|  |  | 
|  | Two new prctl commands are added to enable and disable MPX bounds tables | 
|  | management in kernel. | 
|  |  | 
|  | 155	#define PR_MPX_ENABLE_MANAGEMENT	43 | 
|  | 156	#define PR_MPX_DISABLE_MANAGEMENT	44 | 
|  |  | 
|  | Runtime library in userspace is responsible for allocation of bounds | 
|  | directory. So kernel have to use XSAVE instruction to get the base | 
|  | of bounds directory from BNDCFG register. | 
|  |  | 
|  | But XSAVE is expected to be very expensive. In order to do performance | 
|  | optimization, we have to get the base of bounds directory and save it | 
|  | into struct mm_struct to be used in future during PR_MPX_ENABLE_MANAGEMENT | 
|  | command execution. | 
|  |  | 
|  |  | 
|  | 4. Special rules | 
|  | ================ | 
|  |  | 
|  | 1) If userspace is requesting help from the kernel to do the management | 
|  | of bounds tables, it may not create or modify entries in the bounds directory. | 
|  |  | 
|  | Certainly users can allocate bounds tables and forcibly point the bounds | 
|  | directory at them through XSAVE instruction, and then set valid bit | 
|  | of bounds entry to have this entry valid.  But, the kernel will decline | 
|  | to assist in managing these tables. | 
|  |  | 
|  | 2) Userspace may not take multiple bounds directory entries and point | 
|  | them at the same bounds table. | 
|  |  | 
|  | This is allowed architecturally.  See more information "Intel(R) Architecture | 
|  | Instruction Set Extensions Programming Reference" (9.3.4). | 
|  |  | 
|  | However, if users did this, the kernel might be fooled in to unmaping an | 
|  | in-use bounds table since it does not recognize sharing. |