Monday, November 17, 2014

Implementing software-based SMEP with the Non-Execute (NX) bit in page tables to secure the kernel/user virtual address space.

In my previous post, I talked about how to implement software-based SMEP (Supervisor Mode Execution Protection) with virtualization/hypervisor technology, for fun. In this post, I'm going to detail another way to implement software-based SMEP, this time without virtualization technology.


In modern operating systems like Linux and Windows, all processes share the same kernel virtual address space but have separate user virtual address spaces; see below for a 32-bit Windows OS. The system achieves this by configuring separate page structures for each process, pointed to by a translation table base register (e.g., the CR3 register on the x86/Intel MMU architecture), and switching among them.



To simplify the discussion, I'm assuming that we are working with a 64-bit Linux system on the x86_64/Intel architecture.

So, from the kernel documentation (https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt), we can see that the virtual address range below belongs to user space:
0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm

We also know that 64-bit Linux on x86_64 uses Intel IA-32e paging as shown below (with 4KB pages as an example), in which the CR3 register points to the physical base address of a PML4 table. Each process/task has its own PML4 table.



When a task gets scheduled, the physical base address of its PML4 table is written to the CR3 register by a mov-to-CR3 instruction, so that the virtual address space is switched to that of the new task/process.

Since the Linux user address space range is
0000000000000000 - 00007fffffffffff, we can infer that the first 256 PML4 entries (index 0~255) ultimately point to user virtual address space for each process/task. See the picture below.


Each PML4 entry above contains some processor-"Ignored" bits and an XD (eXecute-Disable) bit, as the picture below indicates. The XD bit controls whether the referenced physical pages can be fetched for execution: if it is set, an instruction fetch triggers a #PF exception (assuming MSR IA32_EFER.NXE = 1). This is the key to implementing a software-based SMEP solution.


So, the solution is:
  1. Whenever a process enters kernel mode (CPL=0, for example through a syscall or sysenter instruction), the OS kernel sets the PML4E.XD bit for all the user-space PML4 entries (index 0 through 255; this can be optimized), and then flushes the TLB (at a performance cost).
    In this way, any attempt to fetch instructions from user virtual addresses in kernel mode will cause a #PF exception, while read/write access to user virtual memory is still allowed (for example, for the copy_to/from_user() functions).
  2. The OS kernel can use some of the "Ignored" bits to record this intended behavior, making virtual address management easier.
  3. Before leaving kernel mode, the OS kernel changes the PML4E.XD bit (and those "Ignored" bits) back to their original state.

Similarly, if we don't mind the performance cost, we could even implement a software-based SMAP (Supervisor Mode Access Prevention) by clearing the "Present" bit, but I won't explain the details in this post.

<The End>


Update:
I hadn't done enough homework beforehand. I thought UDEREF from PaX only used 32-bit segmentation (and segment limits) to emulate SMEP/SMAP behavior, but thanks to a comment below from someone on the PaX team, I found the UDEREF documentation for 64-bit here:
https://github.com/opntr/pax-docs-mirror/blob/master/uderef-amd64.txt


10 comments:

  1. you should probably study PaX and its UDEREF/KERNEXEC features as all this has been implemented for years now ;)

  1. Thanks for sharing, that would be good. I thought it only implemented SMEP with the segment/limit feature. :)

  2. a few more comments:

      1. the SMEP sort-of-equivalent is more like KERNEXEC/i386, and SMAP is more like UDEREF/i386.
      2. on amd64 the water is muddier as UDEREF implements part of KERNEXEC (the non-exec userland sub-feature).
      3. of interest may be that for about a year now UDEREF/amd64 also uses PCID/INVPCID when available (though i have yet to blog about that part ;).
      4. there's also an ARM implementation of both features that uses various paging tricks by spender (https://forums.grsecurity.net/viewtopic.php?f=7&t=3292).

  3. >>> "of interest may be that for about a year now UDEREF/amd64 also uses PCID/INVPCID when available (though i have yet to blog about that part ;)."
      This is interesting. Recently I also had the idea of using PCID (process-context ID) to separate the kernel and user virtual address spaces, for example by using different CR3 values (with different PCID fields) for the user and kernel address space base pointers, even within the same process. Not sure if this is OK. It seems ARM can use TTBR0 and TTBR1 to separate privileged and unprivileged space (I'm a newbie to ARM).

      >> "https://forums.grsecurity.net/viewtopic.php?f=7&t=3292"
      This is a great post; I read it when I started reading about the MMU architecture in ARM :)

  2. > not sure if this is OK

    this is how UDEREF/amd64 works actually when PCID support is detected ;).

  1. That's great!
      Could you share the link to that blog post about CR4.PCID? I couldn't find it with a Google search :(

    2. that's because i haven't written it yet ;), but it'll be on the grsecurity blog.

  3. I see.. let me know once you've done that. So the code patch is ready for use/testing, as you mentioned, right?

  4. OK, it seems that I found the UDEREF/amd64 with PCID/INVPCID support
      in the patch http://grsecurity.net/stable/grsecurity-3.0-3.14.24-201411150026.patch

    5. as i said earlier, this code has been in PaX for over a year now already, look for STRONGUDEREF to find most of the related code.
