SIMPLE IS BETTER: Implement software-based SMEP with Non-Execute (NX) bit in page tables to secure kernel/user virtual memory address space.

Monday, November 17, 2014

In my previous post, I talked about how to implement a software-based SMEP (Supervisor Mode Execution Protection) using virtualization (a hypervisor), for fun. In this post, I'm going to describe another way to implement software-based SMEP, this time without virtualization technology.

In modern operating systems such as Linux and Windows, all processes share the same kernel virtual address space but have separate user virtual address spaces (see below for the layout on a 32-bit Windows OS). The system achieves this by giving each process its own set of paging structures, pointed to by a translation table base register (e.g. the CR3 register on the x86/Intel MMU architecture), and switching among them.

To simplify the discussion, I'm assuming that we are working with a 64-bit Linux system on the x86_64/Intel architecture.

From here (https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt), we know that the virtual address range below belongs to user space.

0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm

We also learn that 64-bit Linux on x86_64 uses Intel IA-32e paging as shown below (with a 4KB page size as an example), in which the CR3 register points to the physical base address of a PML4 table. Each process/task has its own PML4 table.

When a task gets scheduled, the physical base address of its PML4 table is written to the CR3 register by a mov-to-cr3 instruction, so that the virtual address space is switched to that of the task/process.

Since the Linux user address space range is 0000000000000000 - 00007fffffffffff, and bits 47:39 of a virtual address select the PML4 entry, we can infer that the first 256 PML4 entries (index 0~255) map the user virtual address space of each process/task. See the picture below.
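The index arithmetic behind that inference can be checked with a few lines of C. This is only a minimal sketch: the helper names pml4_index() and pml4_index_is_user() are made up for illustration and are not taken from the Linux kernel.

```c
#include <assert.h>
#include <stdint.h>

#define PML4_SHIFT    39   /* bits 47:39 of a virtual address select the PML4 entry */
#define PML4_ENTRIES  512  /* 9 index bits -> 512 entries per PML4 table */

/* User range from Documentation/x86/x86_64/mm.txt, as quoted above. */
#define USER_VA_START 0x0000000000000000ULL
#define USER_VA_END   0x00007fffffffffffULL

/* Hypothetical helper: PML4 index that a virtual address translates through. */
static unsigned pml4_index(uint64_t va)
{
    unsigned idx = (unsigned)((va >> PML4_SHIFT) & (PML4_ENTRIES - 1));

    assert(idx < PML4_ENTRIES);
    return idx;
}

/* Hypothetical helper: does this PML4 slot cover user virtual addresses? */
static int pml4_index_is_user(unsigned idx)
{
    return idx >= pml4_index(USER_VA_START) && idx <= pml4_index(USER_VA_END);
}
```

With these definitions, the lowest user address lands in entry 0 and the highest in entry 255, while the first canonical kernel-half address, ffff800000000000, lands in entry 256.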

In each PML4 entry above, there are some bits that the processor ignores ("Ignored") and an XD (eXecute-Disable) bit, as indicated in the picture below. The XD bit controls whether instructions can be fetched from the physical pages that the entry references. If it is set, an instruction fetch from those pages triggers a #PF exception (assuming MSR IA32_EFER.NXE = 1). This is the key to implementing a software-based SMEP solution.
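As a rough illustration of the bit manipulation involved, the sketch below models a paging-structure entry as a plain 64-bit integer. XD is bit 63 of the entry and NXE is bit 11 of IA32_EFER; the function names are hypothetical, and the code only models the decision on ordinary integers, it does not touch real page tables or MSRs.

```c
#include <assert.h>
#include <stdint.h>

#define ENTRY_PRESENT (1ULL << 0)   /* P bit of a paging-structure entry */
#define ENTRY_XD      (1ULL << 63)  /* eXecute-Disable bit */
#define EFER_NXE      (1ULL << 11)  /* IA32_EFER.NXE: enables the XD bit */

/* Hypothetical helpers operating on a copy of an entry. */
static uint64_t entry_set_xd(uint64_t entry)   { return entry | ENTRY_XD; }
static uint64_t entry_clear_xd(uint64_t entry) { return entry & ~ENTRY_XD; }

/*
 * Model of the rule in the text: an instruction fetch through a present
 * entry is refused by XD only when IA32_EFER.NXE = 1 and XD is set.
 * (With NXE = 0, bit 63 is a reserved bit rather than XD, so this model
 * reports 0 for that case.)
 */
static int fetch_blocked_by_xd(uint64_t entry, uint64_t efer)
{
    assert(entry & ENTRY_PRESENT);
    return (efer & EFER_NXE) && (entry & ENTRY_XD);
}
```

Data reads and writes are not affected by XD, which is what makes the bit usable for an SMEP-like (execute-only) restriction.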

So, the solution now is:

Whenever a process enters kernel mode (CPL=0, for example through a syscall or sysenter instruction), the OS kernel sets the PML4E.XD bit in all user-space PML4 table entries (index 0 through 255; this can be optimized), and then flushes the TLB (this has a performance cost).
In this way, any attempt to fetch instructions from user virtual addresses in kernel mode will cause a #PF exception, while read/write access to user virtual memory is still allowed (for example, by the copy_to_user()/copy_from_user() functions).
The OS kernel can use some of the "Ignored" bits to record this intended behavior, which makes virtual address management easier.
Before leaving kernel mode, the OS kernel changes the PML4E.XD bit (and the "Ignored" bits it used) back to the original state.
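The steps above can be sketched as a user-space simulation on an in-memory array standing in for a PML4 table. Everything here is an assumption made for illustration: the function names, the choice of bit 52 as the "Ignored" bit that remembers an entry's original XD state, the stubbed TLB flush, and the extra flush on the exit path. It is not how any real kernel implements this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PML4_ENTRIES      512
#define USER_PML4_ENTRIES 256            /* index 0..255 map user space */
#define PML4E_PRESENT     (1ULL << 0)
#define PML4E_XD          (1ULL << 63)
#define PML4E_SW_ORIG_XD  (1ULL << 52)   /* hypothetical use of an ignored bit */

/* Stand-in for a real TLB flush (e.g. reloading CR3); only counts calls. */
static unsigned long tlb_flush_count;
static void flush_tlb_stub(void) { tlb_flush_count++; }

/* Kernel entry: make every present user PML4 entry non-executable. */
static void soft_smep_kernel_enter(uint64_t *pml4)
{
    assert(pml4 != NULL);
    for (size_t i = 0; i < USER_PML4_ENTRIES; i++) {
        if (!(pml4[i] & PML4E_PRESENT))
            continue;                        /* nothing mapped here */
        if (pml4[i] & PML4E_XD)
            pml4[i] |= PML4E_SW_ORIG_XD;     /* remember XD was already set */
        pml4[i] |= PML4E_XD;
    }
    flush_tlb_stub();
}

/* Kernel exit: restore each entry's original XD state. */
static void soft_smep_kernel_exit(uint64_t *pml4)
{
    assert(pml4 != NULL);
    for (size_t i = 0; i < USER_PML4_ENTRIES; i++) {
        if (!(pml4[i] & PML4E_PRESENT))
            continue;
        if (pml4[i] & PML4E_SW_ORIG_XD)
            pml4[i] &= ~PML4E_SW_ORIG_XD;    /* XD was original: keep it */
        else
            pml4[i] &= ~PML4E_XD;            /* XD was ours: drop it */
    }
    flush_tlb_stub();  /* assumed, so user mode does not see stale XD entries */
}
```

Kernel-half entries (index 256 and up) are never touched, so kernel code stays executable, and the two TLB flushes per kernel entry/exit are where the performance cost mentioned above comes from.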

Similarly, if we ignore the performance cost, we can even implement a software-based SMAP (Supervisor Mode Access Protection) by clearing the "Present" bit, but I'm not going to explain the details in this post.

<The End>

Update:
I didn't do enough homework before. UDEREF from PaX originally used 32-bit segmentation (and segment limits) to emulate SMEP/SMAP behavior, but thanks to a comment from someone on the PaX Team (see below), I found the description of UDEREF for 64-bit here:
https://github.com/opntr/pax-docs-mirror/blob/master/uderef-amd64.txt

10 comments:

PaX Team, 11/17/2014 9:20 AM
you should probably study PaX and its UDEREF/KERNEXEC features as all this has been implemented for years now ;)
PaX Team, 11/20/2014 8:14 AM
> not sure if this is OK

this is how UDEREF/amd64 works actually when PCID support is detected ;).