SIMPLE IS BETTER

How to enable CoreOS to boot on top of iKGT (Intel Kernel Guard Technology)?

2015-06-09T23:47:00.002-07:00

Linked to here:
https://01.org/intel-kgt/blogs or https://01.org/intel-kgt/blogs/bzhu5/2015/coreos-ikgt
<END>

Intel Kernel Guard Technology is released as open source software

2015-06-09T23:38:00.001-07:00

See the official site for details: https://01.org/intel-kgt
<END>

Common security design issues in privileged hypervisors or any privileged emulators

2015-04-06T20:03:00.002-07:00

Recently I reviewed nearly 100 Xen Security Advisories (http://xenbits.xen.org/xsa/). Apart from bad security coding practices that apply to any ordinary software, I found some specific security issues that we need to take into consideration when designing privileged hypervisors or privileged emulators.

<Work In Progress>

"What, How, and Why" on Interrupt Window (or NMI Window) Exiting in Virtualization Technology

2015-04-06T19:54:00.000-07:00

Recently, one of my colleagues asked me why virtualization technology has a feature called "Interrupt Window Exiting", and how a VMM can use it. This post briefly describes its "what, how, and why".

WHAT, and HOW

"Interrupt Window Exiting" is one of the VM exit reasons (basic exit reason 7 in Intel virtualization technology). If the “interrupt-window exiting” VM-execution control is set, this VM exit can happen right after a VM entry, at the beginning of any instruction:

at which RFLAGS.IF = 1 (external interrupts are unmasked), and
on which the interruptibility state of the guest would allow delivery of an interrupt (for example, not being blocked by STI or by MOV SS).

WHY
In a typical case, the VMM wants to inject (deliver) a virtual interrupt to one of its guest VMs at some point, but the interruptibility state of that guest does NOT allow delivery of an interrupt at that moment (for example, because the guest's RFLAGS.IF = 0).

So, in order to deliver this interrupt, the VMM would need to poll the interruptibility state of the guest and deliver the interrupt once that state allows it (that is, once a window is open). This is an inefficient way to do it.

So the problem is: how does a VMM know when its guest becomes interruptible?

With this feature, a VMM can queue a virtual interrupt for its guest while the guest is not in an interruptible state. The VMM simply sets the “interrupt-window exiting” VM-execution control for that guest and relies on a VM exit to learn when the guest becomes interruptible (and, therefore, when it can inject the virtual interrupt). The VMM detects such VM exits by checking the basic exit reason: if its value is 7 (“interrupt window”), the VMM knows it is the right time to deliver the virtual interrupt to that guest.

The same applies to the "NMI-window exiting" feature in virtualization technology.
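
The flow described above can be sketched as a toy model. This is only an illustration of the decision logic, not real VMM code: the Guest class and the function names are hypothetical, and the only value taken from the post is the basic exit reason 7.

```python
from dataclasses import dataclass, field

EXIT_REASON_INTERRUPT_WINDOW = 7  # basic exit reason mentioned in the post


@dataclass
class Guest:
    """Hypothetical guest state tracked by a toy VMM."""
    rflags_if: bool = False                 # RFLAGS.IF
    blocked_by_sti_or_mov_ss: bool = False  # interruptibility-state blocking
    interrupt_window_exiting: bool = False  # VM-execution control bit
    pending: list = field(default_factory=list)    # vectors queued by the VMM
    delivered: list = field(default_factory=list)  # vectors already injected

    def interruptible(self) -> bool:
        return self.rflags_if and not self.blocked_by_sti_or_mov_ss


def request_interrupt(guest: Guest, vector: int) -> None:
    """Inject now if the window is open, otherwise queue and arm the control."""
    if guest.interruptible():
        guest.delivered.append(vector)
    else:
        guest.pending.append(vector)
        guest.interrupt_window_exiting = True


def run_guest_step(guest: Guest):
    """Return the exit reason the hardware would report, or None for no exit."""
    if guest.interrupt_window_exiting and guest.interruptible():
        return EXIT_REASON_INTERRUPT_WINDOW
    return None


def handle_vm_exit(guest: Guest, reason) -> None:
    """On an interrupt-window exit, inject one queued vector."""
    if reason != EXIT_REASON_INTERRUPT_WINDOW:
        return
    if guest.pending:
        guest.delivered.append(guest.pending.pop(0))
    # Keep the control armed only while more interrupts are queued.
    guest.interrupt_window_exiting = bool(guest.pending)
```

The point of the design is that the VMM never polls: it arms the control once and lets the hardware report the moment the guest becomes interruptible.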

Control-flow processor exceptions (single-stepping on branches) on control-flow branch instructions (jmp/call/ret)

2015-01-26T05:45:00.000-08:00

"Single-stepping on branches" is a processor hardware feature of the x86/Intel architecture. When it is enabled, the processor generates a single-step debug exception only after instructions that cause a branch. This mechanism allows a debugger to single-step on control transfers caused by branches. What does this imply for defenses against control-flow hijacking attacks (e.g. ROP or JOP)?

Control-flow Transfer Instructions
Control-flow hijacking attacks allow an attacker to overwrite a value that is loaded into the program counter (EIP) of a running program, typically redirecting execution to the attacker's own injected code or to existing ROP/JOP gadget chains in order to execute arbitrary malicious code. In general, the subverted value could be a jump target address, a function pointer, or a return address on a user-controlled stack.

Call/Jmp/Ret instructions, known as control-transfer (branch) instructions, are what control-flow hijacking attacks use to redirect CPU execution. Many software tools can perform binary analysis on those instructions, for example by dynamically instrumenting the control-flow graph (CFG) for control-flow integrity (CFI) enforcement. So it would be good if the processor hardware could generate an exception on (or after) those control-transfer instructions.

Single-stepping on Branches
In the x86/Intel processor architecture, there is a bit (Trap Flag, TF) in the EFLAGS register, as shown below. Setting it enables single-step mode for debugging; clearing it disables single-step mode.

In single-step mode, the processor generates a debug exception (#DB) after each instruction. This allows the execution state of a program to be inspected after each instruction.

However, things change under a special condition. When the BTF (single-step on branches) flag in the IA32_DEBUGCTL MSR is set, the processor treats the TF flag in the EFLAGS register as a “single-step on branches” flag rather than a “single-step on instructions” flag. This mechanism allows single-stepping the processor on taken branches. Note that the exception is a trap-class exception, which means it is generated after the branch instruction (call/ret/jmp) has executed.

So now we can make the processor generate an exception (#DB) on (strictly speaking, after) every call/jmp/ret instruction.
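
As a small illustration of how the two flags combine, the following sketch decodes the stepping mode from raw register values. The bit positions (TF is bit 8 of EFLAGS, BTF is bit 1 of IA32_DEBUGCTL) follow the Intel SDM; the function names are made up for this example, and the model ignores everything else the hardware does around #DB delivery.

```python
EFLAGS_TF = 1 << 8      # Trap Flag in EFLAGS
DEBUGCTL_BTF = 1 << 1   # BTF flag in the IA32_DEBUGCTL MSR


def single_step_mode(eflags: int, debugctl: int) -> str:
    """Return 'off', 'instructions' or 'branches' for the given register values."""
    if not (eflags & EFLAGS_TF):
        return "off"
    return "branches" if (debugctl & DEBUGCTL_BTF) else "instructions"


def raises_db_after(is_taken_branch: bool, eflags: int, debugctl: int) -> bool:
    """Would a single-step #DB trap follow an instruction of this kind?"""
    mode = single_step_mode(eflags, debugctl)
    if mode == "instructions":
        return True
    if mode == "branches":
        return is_taken_branch
    return False
```

For example, with TF and BTF both set, an ordinary `mov` produces no trap while a taken `call` does.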

Potential Usages
We might have some usages with this capability, for example:

Build a dynamic CFG (Control Flow Graph) without changes to the software binary or source code.
Detect unknown control-flow hijacking vulnerabilities by using dynamic taint analysis, e.g. when a tainted value loaded into the program counter (EIP) has been influenced by data from the untrusted inputs.
Perform software-invisible hooks for function calling (target of "call" instruction).

However, there are some limitations:

Performance overhead!!! (unless we use it in an environment where performance is not a big concern).
It cannot control jmp/ret/call individually; for example, it cannot trigger exceptions only on CALL instructions, only on RET instructions, or only on "indirect" jmp/call instructions (normally code with direct jmp/call is trusted due to W^X on the code section).
It also has no CPL (user or kernel) controls, but we can control it by toggling the EFLAGS.TF bit when crossing system call/return boundaries.
Because this #DB is controlled by an EFLAGS bit, it can easily be disabled with a "popf" instruction if the stack is controlled by an attacker :( .
It obviously requires OS kernel changes (does the OS provide a legitimate way to register a #DB handler?).

Please let me know if you have any comments.

References:
Intel IA32 architecture software development manual:
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html

How to defend against Stack Pivoting attacks on existing 32-bit x86 processor architecture?

2015-01-16T01:46:00.001-08:00

Stack Pivoting is a common technique widely used by vulnerability exploits to bypass hardware protections like NX/SMEP, or to chain ROP (Return-Oriented Programming) gadgets. However, there is NO hardware protection to defend against it (at least for now :-). This post describes a software solution to detect Stack Pivoting at run time, and also points out some limitations due to current processor architecture implementations. <Please let me know if this is NOT a new idea, or NOT doable.>

The basic idea of detecting stack pivoting is to configure an appropriate stack base/limit in the stack segment register for a specific thread (normally, a modern OS sets the base/limit to cover 0~4G in 32-bit mode). Then, if a stack pivot moves the stack address (ESP) out of the defined range, the processor generates a #SS fault (limit violation exception).

Before introducing my solution, let me briefly talk about an existing solution to detect stack pivot in Windows 8 OS.

Microsoft implements a simple protection mechanism: every function associated with manipulating virtual memory, including the often-abused VirtualProtect and VirtualAlloc, now includes a check that the stack pointer, as contained in the trap frame, falls within the range defined by the Thread Environment Block (TEB, see below picture, StackBase/StackLimit)

You can take a look at this blog for detailed descriptions. However, the blog author (Dan Rosenberg) also describes an approach to bypassing it.

Now I'm going to talk about the solution and its limitations in greater detail.

What's stack pivoting?
Please skip this section if you already know what stack pivoting is.

With stack pivoting, attackers can pivot from the real stack to a fake stack, which could be an attacker-controlled buffer such as the heap, and then control program execution. This is achieved by controlling the data pointed to by RSP (the stack pointer register), such that each ret instruction increments RSP and transfers execution to the next address chosen by the attacker.

Here are some good blog posts that briefly explain what stack pivoting is, how to pivot a stack, and how it is used in attacks (e.g. ROP).
http://neilscomputerblog.blogspot.com/2012/06/stack-pivoting.html
http://blogs.mcafee.com/mcafee-labs/emerging-stack-pivoting-exploits-bypass-common-security
http://neilscomputerblog.blogspot.com/2013/04/rop-return-oriented-programming.html

#SS (Stack Fault Exception)
In the x86/Intel processor architecture, exception vector 12 is assigned to the #SS fault. Several conditions can result in a #SS fault. One of them, according to the IA32 architecture manual, is a limit violation, described as follows:

A limit violation is detected during an operation that refers to the SS register. Operations that can cause a limit violation include stack-oriented instructions such as POP, PUSH, CALL, RET, IRET, ENTER, and LEAVE, as well as other memory references which implicitly or explicitly use the SS register (for example, MOV AX, [BP+6] or MOV AX, SS:[EAX+6]). The ENTER instruction generates this exception when there is not enough stack space for allocating local variables.

So, basically, the processor checks the stack base and limit values when executing any stack-oriented instruction. If the referenced stack address is out of range (as indicated by the base/limit values in the SS register, see picture below), a #SS fault is generated.

However, please note that this limit violation only applies to 32-bit processor mode. I will talk about this later.
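
To make the check concrete, here is a minimal sketch of the limit test, assuming an expand-up segment with a byte-granular limit and treating the stack address as an offset into the segment. The function name and the 64 KB example limit are illustrative only; expand-down stack segments use a different rule and are not modeled.

```python
FLAT_LIMIT = 0xFFFFFFFF   # the usual 0~4G setting of a modern 32-bit OS
THREAD_LIMIT = 0xFFFF     # hypothetical 64 KB per-thread stack segment


def ss_limit_violation(offset: int, size: int, limit: int) -> bool:
    """True if an access of `size` bytes at SS:offset goes past `limit`.

    An expand-up segment allows offsets 0..limit inclusive, so the last
    byte touched by the access must not exceed the limit.
    """
    return offset + size - 1 > limit
```

With the flat 4G limit a pivoted stack pointer never trips the check, which is why the idea in this post depends on shrinking the limit per thread.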

Segment Register (SS)
Every segment register, including SS, has a “visible” part and a “hidden” part (see below). The hidden part is sometimes referred to as a “descriptor cache” or a “shadow register”.

According to the IA32 architecture, when a segment selector is loaded into the visible part of a segment register, the processor also loads the hidden part of the segment register with the base address, segment limit, and access control information from the segment descriptor (see next section) pointed to by the segment selector. The information cached in the segment register (visible and hidden) allows the processor to translate addresses without taking extra bus cycles to read the base address and limit from the segment descriptor.

Segment Descriptor
A segment descriptor (see picture below) is a data structure in a GDT or LDT that provides the processor with the size and location (e.g. base/limit) of a segment, as well as access control and status information.

The segment descriptor is pointed to by the corresponding segment selector. For example, a stack segment descriptor is referenced by the SS selector, and normally the OS uses different SS selectors for the kernel and for applications.
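
For readers who want to see where base and limit actually live, the sketch below packs and unpacks the 8-byte legacy descriptor format (limit in bits 15:0 and 51:48, base in bits 39:16 and 63:56, access byte in bits 47:40, flags G, D/B, L, AVL in bits 55:52). The helper names are hypothetical; the layout is the one documented in the Intel SDM.

```python
def pack_descriptor(base: int, limit: int, access: int, flags: int) -> int:
    """Build a 64-bit segment descriptor from its fields.

    base: 32-bit, limit: 20-bit, access: 8-bit, flags: 4-bit (G, D/B, L, AVL).
    """
    assert 0 <= base <= 0xFFFFFFFF and 0 <= limit <= 0xFFFFF
    assert 0 <= access <= 0xFF and 0 <= flags <= 0xF
    return ((limit & 0xFFFF)
            | ((base & 0xFFFFFF) << 16)
            | (access << 40)
            | ((limit >> 16) << 48)
            | (flags << 52)
            | ((base >> 24) << 56))


def unpack_descriptor(desc: int):
    """Return (base, limit, access, flags) from a 64-bit descriptor."""
    limit = (desc & 0xFFFF) | (((desc >> 48) & 0xF) << 16)
    base = ((desc >> 16) & 0xFFFFFF) | (((desc >> 56) & 0xFF) << 24)
    access = (desc >> 40) & 0xFF
    flags = (desc >> 52) & 0xF
    return base, limit, access, flags


def effective_limit(limit: int, flags: int) -> int:
    """Limit in bytes: with G=1 the 20-bit limit counts 4 KB pages."""
    granular = (flags >> 3) & 1
    return ((limit << 12) | 0xFFF) if granular else limit
```

For instance, a flat user-mode data/stack descriptor (base 0, limit 0xFFFFF, access 0xF3, flags 0xC) packs to 0x00CFF3000000FFFF, and its effective limit is 0xFFFFFFFF.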

As indicated in the last section, the "hidden" part of a segment register is loaded from the corresponding segment descriptor (in the GDT residing in RAM). However, it is software's responsibility to reload the segment registers when the segment descriptor tables are modified (e.g. when the base and/or limit values are changed). If this is not done, an old segment descriptor cached in a segment register might be used after its memory-resident version (the segment descriptor in the GDT) has been modified.

So, when OS system software modifies stack base/limit in SS segment descriptor for a particular thread, it must reload the corresponding SS segment register. According to x86/Intel architecture, there are two kinds of load instructions provided for loading the segment registers:

Direct load instructions such as the MOV, POP, LSS instructions. These instructions explicitly reference the segment registers.
Implicit load instructions such as the far pointer versions of the CALL, JMP, and RET instructions, the SYSENTER and SYSEXIT instructions, and the IRET, INTn, INTO and INT3 instructions. These instructions change the contents of the SS register (and sometimes other segment registers) as an incidental part of their operation.

OS Implementation
To simplify the discussion, I take a user-mode application as the example for stack pivoting detection.

Normally, the OS allocates a unique stack for each user-mode thread. We can change the thread scheduler so that, as part of thread context switching, it modifies the stack base/limit values in the SS segment descriptor (in the GDT) pointed to by the user-mode SS selector.

When that user-mode thread starts executing in user mode after the stack switch from kernel to user, the base/limit values in RAM are automatically reloaded into the "hidden" part of the SS segment register.

Then, if an attack initiated by a stack pivot moves the user-mode stack address (ESP) out of the defined range (the base/limit in the "hidden" part of the SS segment register), the processor generates a #SS fault (limit violation exception), and anti-malware software can detect the attack.
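
The scheme above can be modeled in a few lines. This is a toy simulation under stated assumptions, not OS code: the Cpu class, its method names and the thread dictionary are invented for the example, the segment is expand-up, and the stack address is treated as an offset into the per-thread segment.

```python
class StackFault(Exception):
    """Stands in for the #SS exception (vector 12)."""


class Cpu:
    """Toy model of the user SS descriptor in the GDT and its hidden copy."""

    def __init__(self):
        self.gdt_user_ss = (0x00000000, 0xFFFFFFFF)  # (base, limit), flat default
        self.ss_hidden = self.gdt_user_ss            # descriptor cache

    def context_switch(self, thread: dict) -> None:
        # The modified scheduler rewrites the descriptor in memory only.
        self.gdt_user_ss = (thread["stack_base"], thread["stack_size"] - 1)

    def return_to_user(self) -> None:
        # The implicit SS load on the kernel-to-user transition refreshes the cache.
        self.ss_hidden = self.gdt_user_ss

    def stack_access(self, offset: int, size: int = 4) -> int:
        """Check the cached limit and return the linear address accessed."""
        base, limit = self.ss_hidden
        if offset < 0 or offset + size - 1 > limit:
            raise StackFault(hex(offset))
        return base + offset
```

Note that between `context_switch` and `return_to_user` the hidden part still holds the old flat values, which mirrors the stale-cache caveat from the Segment Descriptor section.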

Limitations

One big problem is that we cannot apply this solution to x86/Intel 64-bit processor mode. The SS (and DS/ES) segment registers are not used in 64-bit mode, and their fields (base, limit, and attributes) in the GDT segment descriptor are ignored. Address calculations that reference the ES, DS, or SS segments are treated as if the segment base were zero. So the #SS exception due to a "limit violation" cannot be generated.
Because the SS segment descriptor is located in kernel memory, an application cannot modify it directly in user mode. Hence, this solution cannot apply to user-mode threads. One example is Microsoft UMS (User-Mode Scheduling), a lightweight mechanism that applications can use to schedule their own threads. An application can switch between UMS threads in user mode without involving the system scheduler. For details, please see the link
http://msdn.microsoft.com/en-us/library/windows/desktop/dd627187(v=vs.85).aspx Note that this feature is not available on 32-bit versions of Windows :)
It requires extra changes to the thread scheduler (as part of context switching) in a 32-bit OS, but the change is minimal, as described above.
One assumption is that the thread stack is virtually contiguous in the address space, so that the base/limit checks can apply.
It cannot detect a stack pivot to other memory that is also part of the stack (still within the base/limit range).

References:
Intel IA32 architecture software development manual:
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html

Transparent ROP Detection using CPU Performance Counters: https://www.trailofbits.com/threads/2014/transparent_rop_detection_using_cpu_perfcounters.pdf

Defeating Windows 8 ROP Mitigation:
http://vulnfactory.org/blog/2011/09/21/defeating-windows-8-rop-mitigation/

Using LBR (Last Branch Record) feature to detect ret2usr (return-to-user) attack w/ MMU paging structure corruption

2014-12-15T06:49:00.002-08:00

SMEP (Supervisor Mode Execution Prevention) is a mitigation that aims to prevent the CPU from running user-mode code while in kernel mode. However, this post (Windows 8 Kernel Memory Protections Bypass) presents a generic technique for exploiting kernel vulnerabilities while bypassing SMEP. Unlike my previous post (Page Table Structure Corruption Attacks - How to Mitigate it?), which presented a mitigation for that attack, this post presents a solution to detect such a ret2usr attack caused by MMU paging structure corruption.

In recent Intel/x86 processors, the LBR (last branch record) feature has some filtering capabilities, such as CPL (current privilege level) filtering and indirect jmp/call filtering.

For instance, for a specific suspicious process or application, we can configure LBR to record only the branch addresses (like LastBranchToIP) of indirect jmp/call and ret instructions executed in kernel mode (CPL=0).

Therefore, by analyzing the LastBranchToIP addresses in the BTS (branch trace store) buffer resident in system RAM, we can tell whether or not a "ret2usr" attack occurred.

The rule is pretty simple:
Check all the LastBranchToIP addresses. If one or more of them fall in the range 0~2GB, a "ret2usr" attack occurred in the "monitored" process or application.

This works because the user-mode virtual address range is 0~2GB by default on a 32-bit Windows operating system, and that remains true even if the U/S bit of a paging-structure entry (e.g. a PTE) is corrupted by a write-what-where vulnerability so that user-mode memory is interpreted as kernel memory.
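
The rule is simple enough to state as code. The sketch below assumes the branch target addresses have already been extracted into a list of integers (how they are read from the buffer is not shown), and the 2 GB boundary is the default 32-bit Windows split described above; the names are illustrative.

```python
USER_SPACE_END = 0x80000000  # default 2G/2G split on 32-bit Windows


def find_ret2usr_targets(kernel_branch_targets):
    """Return recorded kernel-mode (CPL=0) branch targets that land in user space."""
    return [ip for ip in kernel_branch_targets if ip < USER_SPACE_END]


def ret2usr_detected(kernel_branch_targets) -> bool:
    """True if any recorded kernel-mode branch targeted a user-space address."""
    return bool(find_ret2usr_targets(kernel_branch_targets))
```

A system booted with a different user/kernel split would need a different boundary value.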

New security feature - Control Flow Guard (CFG) - available in Visual Studio 2015 Preview

2014-12-11T23:10:00.002-08:00

This blog announced that the Preview for Visual Studio 2015 includes a new, work-in-progress feature, called Control Flow Guard (CFG).

It says

"Whilst compiling and linking code, it analyzes and discovers every location that any indirect-call instruction can reach. It builds that knowledge into the binaries (in extra data structures). It also injects a check, before every indirect-call in your code, that ensures the target is one of those expected, safe, locations. If that check fails at runtime, the Operating System closes the program"

I will evaluate this, e.g. performance impact and effectiveness against JOP/ROP attacks, when I'm free, and update this post then :-)

Update:
MJ0011, "Windows 10 Control Flow Guard Internals"
http://webhard.milkgun.kr/%EC%9E%90%EB%A3%8C/POC%202014/MJ0011%20-%20Windows%2010%20Control%20Flow%20Guard%20Internals.pdf

Defending Against ret2dir Attacks (partially) with Virtualization Technology?

2014-11-21T07:53:00.001-08:00

I was excited when I recently read the paper "ret2dir: Rethinking Kernel Isolation" by Vasileios P. Kemerlis. This post introduces the idea of the ret2dir attack and how to prevent it, at least partially, with hardware virtualization technology.

The ret2dir (Return-to-Direct-Mapped-Memory) attack abuses the physmap design in the kernel virtual memory management of many Linux/Unix OSs. It can bypass SMEP/SMAP, PXN, and KERNEXEC/UDEREF.

So, what is physmap?

It is an address aliasing technique designed for performance. According to the author, "Given the existence of physmap, whenever the kernel (buddy allocator) maps a page frame to user space, it effectively creates an alias (synonym) of user content in kernel space!"

To be more specific, the key point is that a physical page in the physmap region may have two virtual addresses that translate to it, i.e. an N:1 mapping where N is 2. One virtual address is in kernel address space (page table U/S bit = 0), and the other is in user address space (U/S = 1). See the picture below from the ret2dir paper.

To make this kind of attack work, the paper presents several innovative techniques: for example, how to force physical pages allocated for user memory to appear in the physmap area, how to get PFN information through an information leak, and how to do physmap spraying (Can TSX help x-spraying exploit writing?), etc.

In addition, the author proposes a solution in the Linux kernel to mitigate the ret2dir attack: eXclusive Page Frame Ownership (XPFO).

The basic idea is straightforward: it enforces exclusive ownership of page frames by either the kernel or userland, unless sharing is explicitly requested by a kernel component (e.g., to implement zero-copy shared buffers). Whenever a page frame is allocated to userland, it is unmapped from physmap; when such a page frame is reclaimed from userland, it is put back into physmap.
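
The ownership rule can be illustrated with a toy bookkeeping model. This is not how the real patch is implemented (it works on page tables inside the kernel); the class and method names are hypothetical, and the explicit-sharing exception is left out.

```python
class XpfoModel:
    """Toy model: a page frame is mapped in physmap XOR owned by userland."""

    def __init__(self, frames):
        self.physmap = set(frames)  # frames reachable through the kernel alias
        self.user = set()           # frames currently owned by userland

    def alloc_to_user(self, pfn: int) -> None:
        # Handing a frame to userland removes its kernel alias.
        self.physmap.discard(pfn)
        self.user.add(pfn)

    def reclaim_from_user(self, pfn: int) -> None:
        # Reclaiming the frame restores the physmap mapping.
        self.user.discard(pfn)
        self.physmap.add(pfn)

    def kernel_alias_exists(self, pfn: int) -> bool:
        """ret2dir needs this to be True for a frame holding attacker data."""
        return pfn in self.physmap
```

While a frame holds user-controlled content, `kernel_alias_exists` is False, so there is no synonym in kernel space for the attack to return into.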

In a virtualization environment, however, the solution in my previous post (How to Implement a software-based SMEP with Virtualization/Hypervisor Technology) may be able to defend against ret2dir attacks. But it also has a limitation: the virtualization-based software SMEP can only stop execution of shellcode/payload in the physmap area.

In order to stop read/write access to payload from kernel space with ret2dir attacks (which could be used to do ROP attacks, e.g. use the payload as kernel stack after stack pivoting), technically, we can extend that solution to implement a virtualization-based software SMAP to defend against ret2dir attacks...

References:
ret2dir: Rethinking Kernel Isolation -- paper:
http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

and its slides:
https://www.usenix.org/sites/default/files/conference/protected-files/sec14_slides_kemerlis.pdf

OpenBSD fix to remove executable permission in direct-map pages (recently):

https://secure.freshbsd.org/commit/openbsd/52e8e9f52ef21a21a315187623fafe4800efd868

Improve Performance for Separating Kernel and User Address Space with Process-Context Identifiers (PCIDs)

2014-11-21T00:02:00.000-08:00

This post is not talking about any new idea, just about what I'm thinking..

Back in 2003, Ingo Molnar proposed the 4G/4G split on x86, with 64 GB RAM support, to separate the user-mode and kernel-mode virtual address spaces. Another post, "64GB on 32-bit systems", also discusses this. The original motivation, quoted from that post, was as follows.

The 4G/4G split feature is primarily intended for large-RAM x86 systems, which want to (or have to) get more kernel/user VM, at the expense of per-syscall TLB-flush overhead.

Obviously this applies to 32-bit OSs. For example, Linux by default uses the upper 1 GB of virtual address space for the kernel and the lower 3 GB for user space, while Windows uses a 2G/2G split by default.

At that time, security design was not a big concern. However, in 2010 (maybe earlier), the PaX team revisited the idea in "[grsec] Announcing UDEREF/amd64". This time the motivation was not the kernel/user VM size, but separating the kernel and user address spaces to mitigate many ret2usr attacks.

Before the hardware PCID (Process-Context Identifiers) feature was introduced, the TLB flush caused by switching between user and kernel page tables on every syscall or interrupt/exception incurred a significant performance cost. The latest UDEREF/amd64 implementation (see this link: http://grsecurity.net/stable/grsecurity-3.0-3.14.24-201411150026.patch; the PaX team told me that they will blog about this) uses the PCID hardware feature to reduce that cost.

So what is PCID? See the text below from the Intel SDM.

Process-context identifiers (PCIDs) are a facility by which a logical processor may cache information for multiple linear-address spaces. The processor may retain cached information when software switches to a different linear address space with a different PCID.

When a logical processor creates entries in the TLBs and paging-structure caches, it associates those entries with the current PCID. When using entries in the TLBs and paging-structure caches to translate a linear address, a logical processor uses only those entries associated with the current PCID.

However, the PCID feature is only available in 64-bit mode (Intel IA-32e mode), which means only a 64-bit OS can use it. The PCID is a 12-bit value stored in bits 11:0 of the CR3 register for each address space, as described in the SDM.
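
As a quick illustration of that encoding, the sketch below combines a page-table root address with a PCID and splits them apart again. It assumes CR4.PCIDE = 1 and a 4 KB-aligned root; the function names are invented for the example, and bit 63 of the value written to CR3 is not modeled.

```python
PCID_MASK = 0xFFF  # PCID occupies CR3 bits 11:0 when CR4.PCIDE = 1


def make_cr3(pml4_phys: int, pcid: int) -> int:
    """Combine a 4 KB-aligned PML4 physical address with a 12-bit PCID."""
    assert (pml4_phys & PCID_MASK) == 0, "page-table root must be 4 KB aligned"
    assert 0 <= pcid <= PCID_MASK, "PCID is a 12-bit value"
    return pml4_phys | pcid


def split_cr3(cr3: int):
    """Return (pml4_phys, pcid) from a CR3 value."""
    return cr3 & ~PCID_MASK, cr3 & PCID_MASK
```

A kernel/user split scheme could, for instance, give the kernel and user page tables of one process two different PCIDs so that switching between them does not discard the other's TLB entries.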

To assist OS software, a new instruction, INVPCID (Invalidate Process-Context Identifier), was also introduced to invalidate mappings in the translation lookaside buffers (TLBs) and paging-structure caches based on the process-context identifier (PCID). It is similar to INVEPT and INVVPID in Intel virtualization technology: the former invalidates information cached from the EPT paging structures, and the latter invalidates mappings in the TLBs and paging-structure caches based on the virtual-processor identifier (VPID).

There are four INVPCID types (granularities) currently defined (copied from IA32/Intel SDM):

Individual-address invalidation: If the INVPCID type is 0, the logical processor invalidates mappings—except global translations—for the linear address and PCID specified in the INVPCID descriptor. In some cases, the instruction may invalidate global translations or mappings for other linear addresses (or other PCIDs) as well.
Single-context invalidation: If the INVPCID type is 1, the logical processor invalidates all mappings—except global translations—associated with the PCID specified in the INVPCID descriptor. In some cases, the instruction may invalidate global translations or mappings for other PCIDs as well.
All-context invalidation, including global translations: If the INVPCID type is 2, the logical processor invalidates all mappings—including global translations—associated with any PCID.
All-context invalidation: If the INVPCID type is 3, the logical processor invalidates all mappings—except global translations—associated with any PCID. In some case, the instruction may invalidate global translations as well.
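
The four types can be compared side by side with a toy PCID-tagged TLB. This models only the minimum invalidation each type guarantees (as the quoted text says, real hardware may invalidate more); the Tlb class and its methods are hypothetical.

```python
class Tlb:
    """Toy TLB whose entries are tagged (pcid, linear_page, is_global)."""

    def __init__(self):
        self.entries = set()

    def fill(self, pcid: int, page: int, is_global: bool = False) -> None:
        self.entries.add((pcid, page, is_global))

    def invpcid(self, inv_type: int, pcid: int = 0, page: int = 0) -> None:
        """Drop the entries that the given INVPCID type must invalidate."""
        if inv_type == 0:    # individual address, non-global
            drop = lambda e: e[0] == pcid and e[1] == page and not e[2]
        elif inv_type == 1:  # single context, non-global
            drop = lambda e: e[0] == pcid and not e[2]
        elif inv_type == 2:  # all contexts, including globals
            drop = lambda e: True
        elif inv_type == 3:  # all contexts, non-global
            drop = lambda e: not e[2]
        else:
            raise ValueError("INVPCID type must be 0..3")
        self.entries = {e for e in self.entries if not drop(e)}
```

Types 0 and 1 leave other PCIDs untouched, which is what makes per-address-space tagging cheaper than a full flush.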

References:
IA32 Intel Software Development Manual... just searching it in Google..

4G/4G split on x86, 64 GB RAM (and more) support
http://lwn.net/Articles/39283/

64GB on 32-bit systems
http://lwn.net/Articles/39925/

[grsec] Announcing UDEREF/amd64
http://grsecurity.net/pipermail/grsecurity/2010-April/001024.html

Grsecurity patch download
http://grsecurity.net/download.php

Does Anybody Know How to Legitimately Register a PMI (PMU Performance Monitor Interrupt) Callback Handler on Windows OS?

2014-11-18T17:22:00.001-08:00

According to the IA32/Intel Software Development Manual, when a PMU (Performance Monitor Unit) counter overflows, or the LBR (Last Branch Record)/BTS (Branch Trace Store) buffer is nearly full, the processor delivers a PMI (Performance Monitor Interrupt). In the Linux kernel implementation, the PMU code (behind the perf tool) uses NMI to deliver the PMI, and we can directly change the kernel source to add our own PMI handler for a particular event.

But in Windows OS, how to register a PMI handler callback in a driver without hooking the kernel IDT table? Does anybody know about it?

I searched almost all the Driver Support Routines documented on MSDN for kernel-mode drivers, but did not find a documented kernel API for this. However, by inspecting a Windows 7 32-bit OS with WinDbg, I found something interesting.

According to the IA32 manual, the local APIC must be set up to deliver the PMI, and a software handler for the corresponding interrupt must be in place at a certain vector entry of the IDT (Interrupt Descriptor Table).

To be more specific, the Local APIC LVT (Local Vector Table) Performance Counter Register must be set up for this purpose. In xAPIC mode, the MMIO address of the LVT Performance Counter Register is the APIC base physical address + offset 0x340. When x2APIC mode is enabled, the register is accessed through the IA32_X2APIC_LVT_PMI MSR (index 0x834), which is called the x2APIC LVT Performance Monitor register.

On a Windows 7 32bit OS, I used Windbg to check the MSR IA32_APIC_BASE (0x1B) with rdmsr command:

kd> rdmsr 0x1b
msr[1b] = 00000000`fee00900

According to the layout of the IA32_APIC_BASE MSR, the APIC base physical address is 0xfee00000, and xAPIC mode is enabled on my system (bit 11, the APIC global enable, is set, while bit 10, the x2APIC enable, is clear). This means the MMIO address of the LVT Performance Counter Register is 0xfee00340.
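
The decoding of that MSR value can be reproduced with a few bit operations. The field positions (BSP flag in bit 8, x2APIC enable in bit 10, APIC global enable in bit 11, base address from bit 12 upward) follow the Intel SDM; the function and constant names are just for this example.

```python
LVT_PERF_COUNTER_OFFSET = 0x340  # xAPIC MMIO offset of the LVT Performance Counter Register


def decode_apic_base(msr: int) -> dict:
    """Split an IA32_APIC_BASE (MSR 0x1B) value into its fields."""
    return {
        "bsp": bool((msr >> 8) & 1),             # bootstrap processor flag
        "x2apic_enabled": bool((msr >> 10) & 1),
        "apic_enabled": bool((msr >> 11) & 1),   # APIC global enable
        "base": msr & ~0xFFF,                    # 4 KB-aligned physical base
    }


fields = decode_apic_base(0x00000000FEE00900)    # value from the rdmsr output
lvt_perf_mmio = fields["base"] + LVT_PERF_COUNTER_OFFSET
```

For the value read above, this gives base 0xfee00000 with the APIC enabled in xAPIC mode, hence the register address 0xfee00340.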

Then I used the !dd command to read the content of this register address. As shown below, the value is 0x000000fe.

kd> !dd [uc] fee00340
#fee00340 000000fe 00000000 000000fe 00000000
#fee00350 0001001f 00000000 0001001f 00000000
#fee00360 000004ff 00000000 000004ff 00000000
#fee00370 000000e3 00000000 000000e3 00000000
#fee00380 00000000 00000000 00000000 00000000
#fee00390 00000000 00000000 00000000 00000000
#fee003a0 00000000 00000000 00000000 00000000
#fee003b0 00000000 00000000 00000000 00000000

Given the register layout (vector in bits 7:0, delivery mode in bits 10:8), this means that by default the Windows OS kernel uses Fixed (000b) delivery mode and IDT vector 0xfe to deliver the PMI.
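
The same reading of the register can be written as a small decoder. The field positions (vector in bits 7:0, delivery mode in bits 10:8, mask in bit 16) and the delivery mode encodings are taken from the Intel SDM description of LVT entries; the function name is illustrative.

```python
LVT_DELIVERY_MODES = {
    0b000: "Fixed",
    0b010: "SMI",
    0b100: "NMI",
    0b101: "INIT",
    0b111: "ExtINT",
}


def decode_lvt(value: int) -> dict:
    """Decode the vector, delivery mode and mask bit of a local APIC LVT entry."""
    return {
        "vector": value & 0xFF,
        "delivery_mode": LVT_DELIVERY_MODES.get((value >> 8) & 0b111, "Reserved"),
        "masked": bool((value >> 16) & 1),
    }
```

Applied to the other values visible in the same dump, 0x0001001f decodes as a masked entry and 0x000004ff as an entry that uses NMI delivery mode.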

Now, let's check vector 0xfe in the IDT with the !idt command in WinDbg. The PMI ISR (Interrupt Service Routine) installed by the OS kernel is hal!HalpPerfInterrupt.

kd> !idt 0xfe
Dumping IDT: 80b95400
fe: 82a221a8 hal!HalpPerfInterrupt

Disassemble this function with the uf command, as below; an ellipsis (...) means some instructions are omitted. We can see that it retrieves the handler (callback?) from the global variable hal!HalpPerfInterruptHandler and then calls it. So my question is: how can I register this performance interrupt handler (PMI handler) in my own driver, so that my callback routine gets called whenever a PMI event occurs?

kd> uf hal!HalpPerfInterrupt
...
hal!HalpPerfInterrupt:
82a221a8 54 push esp
82a221a9 55 push ebp
82a221aa 53 push ebx
82a221ab 56 push esi
82a221ac 57 push edi
82a221ad 83ec54 sub esp,54h
82a221b0 8bec mov ebp,esp
82a221b2 894544 mov dword ptr [ebp+44h],eax
82a221b5 894d40 mov dword ptr [ebp+40h],ecx
82a221b8 89553c mov dword ptr [ebp+3Ch],edx
82a221bb f7457000000200 test dword ptr [ebp+70h],20000h
82a221c2 75bc jne hal!V86_Hpf_a (82a22180) Branch

hal!HalpPerfInterrupt+0x1c:
82a221c4 66837d6c08 cmp word ptr [ebp+6Ch],8
82a221c9 741f je hal!HalpPerfInterrupt+0x42 (82a221ea) Branch
...
hal!HalpPerfInterrupt+0x146:
82a222ee 8bcd mov ecx,ebp
82a222f0 a1e43ca282 mov eax,dword ptr [hal!HalpPerfInterruptHandler (82a23ce4)]
82a222f5 0bc0 or eax,eax
82a222f7 745b je hal!HalpPerfInterrupt+0x1ac (82a22354) Branch

hal!HalpPerfInterrupt+0x151:
82a222f9 ffd0 call eax
...

A possible solution? (It requires changing OS default settings):
As we know, the PMI vector is shared, so a PMI handler should check the IA32_PERF_GLOBAL_STATUS MSR (0x38E) to determine which event(s) triggered the PMI. However, for each PMU, during a specific time period, only one PMU driver (if there are several) should control and use it. Hence, the Windows operating system (Win7+) provides the two APIs below for PMU drivers.

HalAllocateHardwareCounters()
HalFreeHardwareCounters()

Their usage is described below; for details, please take a look at the MSDN link.

If more than one such tool is installed on a computer, the associated drivers must avoid trying to use the same hardware counters simultaneously. To avoid such resource conflicts, all drivers that use counter resources should use the HalAllocateHardwareCounters and HalFreeHardwareCounters routines to coordinate their sharing of these resources.

A counter resource is a single hardware counter, a block of contiguous counters, or a counter overflow interrupt in a PMU.

Before configuring the counters, a driver can call the HalAllocateHardwareCounters routine to acquire exclusive access to a set of counter resources. After the driver no longer needs these resources, it must free the resources by calling the HalFreeHardwareCounters routine.

Does this mean that once we successfully call HalAllocateHardwareCounters() to acquire exclusive access to PMI (e.g. counter overflow interrupt in a PMU), then we can even re-program the default Local APIC LVT Performance Counter Register?

If we can do that without triggering PatchGuard (Windows x64 OS) or causing any other compatibility issues, then we could do it as below:

Call HalAllocateHardwareCounters() to acquire exclusive access to the PMI interrupt.
Re-program the APIC LVT performance counter register by setting its Delivery Mode to NMI (100b); see its layout in the picture above. Then whenever a PMI is triggered, an NMI (non-maskable interrupt) handler gets called.
In other words, such a setting converts a PMI event into an NMI event.
Fortunately, the Windows OS kernel provides the two APIs below:
KeRegisterNmiCallback() - Registers a routine to be called whenever an NMI occurs
KeDeregisterNmiCallback()
See this MSDN link for details. It means the OS kernel allows our driver to register an NMI callback routine to handle any NMI event.

Once we have done these, I think we can control and use a particular PMU, and handle the PMI interrupt event appropriately. When the job is done, we must of course restore the APIC (xAPIC or x2APIC) LVT performance counter register to its default settings, de-register the NMI callback, and free the hardware counter resources.
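The LVT re-programming step above can be sketched in C as a pure bit manipulation on the register value. This is only a sketch: the macro and function names are my own (hypothetical), the actual register access (xAPIC MMIO or x2APIC MSR) is left out, and the layout assumed is the standard LVT one (vector in bits 0-7, delivery mode in bits 8-10, mask in bit 16).

```c
#include <stdint.h>

/* Field positions assumed from the standard local APIC LVT layout. */
#define LVT_DELIVERY_MODE_MASK  (0x7u << 8)   /* bits 10:8 */
#define LVT_DELIVERY_MODE_NMI   (0x4u << 8)   /* 100b = NMI */
#define LVT_MASKED              (1u << 16)    /* 1 = interrupt masked */

/* Hypothetical helper: returns the LVT value that delivers the PMI as an NMI. */
static uint32_t lvt_perf_counter_to_nmi(uint32_t lvt)
{
    lvt &= ~(LVT_DELIVERY_MODE_MASK | LVT_MASKED); /* drop old mode, unmask */
    lvt |= LVT_DELIVERY_MODE_NMI;                  /* deliver as NMI */
    return lvt;
}
```

For example, lvt_perf_counter_to_nmi(0x000100FE) yields 0x000004FE. A driver would save the original register value first so that it can be restored when the job is done.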

Notes:

Due to the bug described in my previous post, an NMI will crash the system on the Windows 8.1 32-bit OS. I am not sure whether Microsoft has fixed this issue in the latest version.
The Intel VTune driver on Windows might be using the PMU PMI, but I have no idea how it does so :-(
If anybody knows a good way to register a PMI interrupt handler, please let me know :) I would really appreciate it!

Page Table Structure Corruption Attacks - How to Mitigate it?

2014-11-18T05:44:00.000-08:00

On x86 and many other processor architectures (with an MMU), page tables are critical data structures for address translation. Many of the hardware-based page-level protection technologies in my previous post, like SMEP and XD/DEP, depend heavily on correct page table settings. So what if the page tables are controlled by an attacker? ...At the end of this post, I will propose an extra solution to mitigate page table structure attacks.

Recently, this post (Windows 8 Kernel Memory Protections Bypass) presented a generic technique for exploiting kernel vulnerabilities while bypassing SMEP and DEP. It requires only a single vulnerability that provides an attacker with a write-what-where primitive, which is then used to modify the page tables (the U/S and XD bit flags) and so bypass the SMEP and DEP protections.

As we all know, SMEP (Supervisor Mode Execution Prevention) is a mitigation that aims to prevent the CPU from running code from user-mode pages while in kernel mode. Internally, the processor checks the U/S bit flag in the corresponding paging-structure entries when fetching instructions for execution in kernel mode. Hence, if we can corrupt the paging structures to modify the U/S flag, we can cause user memory to be interpreted as kernel memory without any other changes.

Similarly, DEP (Data Execution Prevention) depends on the NX bit flag being set to prohibit a data page from being executed. If we can clear that flag by corrupting the paging structures, we can cause a data page to be marked as executable.
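To make the dependency on the two flags concrete, here is a small C model of the checks. It is a simplified sketch under my own assumptions: the function names are hypothetical, only a single leaf PTE is inspected (real hardware combines the flags of every paging-structure level), and NX is assumed to be enabled.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_USER  (1ULL << 2)   /* U/S: 1 = user-accessible page */
#define PTE_NX    (1ULL << 63)  /* XD/NX: 1 = instruction fetch disallowed */

/* Simplified model of a supervisor-mode instruction fetch check. */
static bool kernel_fetch_allowed(uint64_t pte, bool smep_enabled)
{
    if (pte & PTE_NX)
        return false;                 /* DEP: page is not executable */
    if (smep_enabled && (pte & PTE_USER))
        return false;                 /* SMEP: user page, supervisor fetch */
    return true;
}

/* The net effect of overwriting the PTE as the referenced post describes. */
static uint64_t corrupt_pte(uint64_t pte)
{
    return pte & ~(PTE_USER | PTE_NX);
}
```

With pte = 0x8000000012345067 (a user data page with NX set), kernel_fetch_allowed(pte, true) is false, while the same call on corrupt_pte(pte) = 0x0000000012345063 returns true.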

On a Windows 8 system, both SMEP and DEP are enabled by default, and KASLR (Kernel Address Space Layout Randomization) is also enabled. Unfortunately, the virtual address of the PTE that maps a particular virtual address (for example, a user-mode address) is fixed and easy to calculate. So how do we retrieve page table addresses?

For example, on a 32-bit PAE Windows system, the code below returns the virtual address of the PTE (not the PTE contents) for a given virtual address.

#define PT_VIRTUAL_BASE_ADDRESS 0xC0000000
#define PAGE_TABLE_SHIFT 12
#define PAGE_DIR_SHIFT 21
#define PAGE_DIR_POINTER_SHIFT 30

__inline
UINT32 PAEGetPteAtVirtualAddress(UINT32 Vaddr)
{
    return (UINT32)
        ( PT_VIRTUAL_BASE_ADDRESS +
          ((Vaddr & 0xC0000000) >> PAGE_DIR_POINTER_SHIFT) * 0x200000 +
          ((Vaddr & 0x3FE00000) >> PAGE_DIR_SHIFT) * 0x1000 +
          ((Vaddr & 0x001FF000) >> PAGE_TABLE_SHIFT) * 8
        );
}

On a 64-bit Windows system, the calculation is similar.

/* you can see this definition in Win DDK/SDK ntddk.h file */
#define PTE_BASE 0xFFFFF68000000000UI64
#define PTE_SHIFT 3
#define PTI_SHIFT 12
#define PDI_SHIFT 21
#define PPI_SHIFT 30
#define PXI_SHIFT 39
#define VIRTUAL_ADDRESS_BITS 48
#define VIRTUAL_ADDRESS_MASK ((((UINT64)1) << VIRTUAL_ADDRESS_BITS) - 1)

#define X64GetPteAddress(va) \
    (((((UINT64)(va) & VIRTUAL_ADDRESS_MASK) >> PTI_SHIFT) << PTE_SHIFT) + PTE_BASE)
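As a worked example, both calculations can be restated in portable C with <stdint.h> types. This is a sketch with hypothetical function names; the PAE version uses the fact that the three-term formula above reduces to "base + page number * 8", because the PTEs are laid out linearly from the base.

```c
#include <stdint.h>

#define PAE_PTE_BASE   0xC0000000u
#define X64_PTE_BASE   0xFFFFF68000000000ULL
#define VA_MASK_48     ((1ULL << 48) - 1)

/* 32-bit PAE: one 8-byte PTE per 4KB page, laid out linearly from the base. */
static uint32_t pae_pte_address(uint32_t va)
{
    return PAE_PTE_BASE + (va >> 12) * 8u;
}

/* x64: same idea, after dropping the sign-extension bits above bit 47. */
static uint64_t x64_pte_address(uint64_t va)
{
    return (((va & VA_MASK_48) >> 12) << 3) + X64_PTE_BASE;
}
```

For instance, pae_pte_address(0x80000000) gives 0xC0400000, and x64_pte_address(0x00007FF712345000) gives 0xFFFFF6BFFB891A28.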

Then, if there is a write-what-where kernel vulnerability, an attacker can use the calculations above to locate and corrupt the PTE for a particular virtual address of user-mode code that the attacker controls.

So now, how can we mitigate this kind of SMEP/DEP bypass?

As the author of that post said, randomizing the page table addresses themselves is not feasible, because many of the core functions of kernel memory management may rely on this mapping to locate and update paging structures.

The author also proposed two solutions to mitigate it:

One is to use a separate data segment for holding the paging structures.
This requires an extra dedicated segment register. If GS is unused in 32-bit Windows and FS is unused in 64-bit Windows, then this solution could work.
The other one is to set hardware debug breakpoints on accesses to the paging structures (or key fields of the structures).
Hardware breakpoints are a very limited resource (at most 4 are supported), and this may also cause other compatibility issues.

Now, I am proposing another solution: write-protecting the page table structures with the CR0.WP capability.

The basic idea is to map the data pages that hold the paging structures/tables themselves as Read-Only. Because the CR0.WP bit is set by default, any write access to the page table structures then makes the processor generate a #PF exception. For legitimate modifications to the page table structures, use the code sequence below:

disable_wp(); // clear CR0.WP bit.

write access to RO page structures.

enable_wp(); // set CR0.WP bit again.
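The effect of that sequence can be modeled in user-space C. This is only a model under my own assumptions: CR0 is treated as a plain value (real code must run in ring 0 and use mov-to-cr0), the helper names are hypothetical, and only supervisor-mode writes are modeled.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_WP   (1ULL << 16)  /* CR0.WP: supervisor writes honor R/W */
#define PTE_RW   (1ULL << 1)   /* R/W: 1 = writable */

/* Model of the supervisor-mode write check only (user mode is not modeled). */
static bool supervisor_write_allowed(uint64_t cr0, uint64_t pte)
{
    if (!(cr0 & CR0_WP))
        return true;              /* WP clear: R/W is ignored in ring 0 */
    return (pte & PTE_RW) != 0;   /* WP set: read-only pages fault (#PF) */
}

/* Hypothetical stand-ins for disable_wp()/enable_wp(), acting on a value. */
static uint64_t cr0_disable_wp(uint64_t cr0) { return cr0 & ~CR0_WP; }
static uint64_t cr0_enable_wp(uint64_t cr0)  { return cr0 | CR0_WP; }
```

With cr0 = 0x80050033 and a read-only PTE such as 0x12345061, the write is refused until cr0_disable_wp() is applied. Since CR0 is per logical processor, a real kernel would presumably also want to keep the window between the two calls short and uninterrupted on that processor.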

I have talked about this idea before in my previous posts; please check the details below:

Security OS Kernel Design: an idea to prevent malicious software overwriting the critical system kernel data structures

Security OS Design (cont.): Write Protection for Linux Kernel critical data structures (GDT, IDT, syscall table, task_strcture, mm_struct,...)

Update:
Some references from M$FT slides about Windows self-mapping page tables:

Implement software-based SMEP with Non-Execute (NX) bit in page tables to secure kernel/user virtual memory address space.

2014-11-17T04:10:00.002-08:00

In my previous post, I talked about how to implement a software-based SMEP (Supervisor Mode Execution Prevention) with virtualization/hypervisor for fun. In this post, I'm going to detail yet another solution to implement software-based SMEP without virtualization technology.

In modern operating systems, like Linux and Windows, all processes share the same kernel virtual address space but have separate user virtual address spaces (see below for the Windows 32-bit OS). The system achieves this by configuring separate paging structures for each process, pointed to by a translation table base register (e.g. the CR3 register on the x86/Intel MMU architecture), and switching among them.

To simplify the discussion, I'm assuming that we are working with a 64-bit Linux OS on the x86_64/Intel architecture.

So, from here (https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt), we know that the virtual address range below belongs to user space.

0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm

We also know that 64-bit Linux on x86_64 uses Intel IA-32e paging as below (with a 4KB page size as an example), in which the CR3 register points to the physical base address of a PML4 table. Each process/task has a corresponding PML4 table.

When a task gets scheduled, the physical base address of its PML4 table is written to the CR3 register by a mov-to-cr3 instruction, so that the task/process virtual address space is switched accordingly.

Since the Linux user address space range is 0000000000000000 - 00007fffffffffff, we can infer that the first 256 PML4 entries (index 0~255) eventually point to the user virtual address space of each process/task. See the picture below.
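The "first 256 entries" inference can be checked with a one-line index calculation. A minimal sketch (the function name is my own): with 4-level IA-32e paging, the PML4 index is bits 47:39 of the virtual address.

```c
#include <stdint.h>

/* PML4 index = bits 47:39 of the virtual address (IA-32e 4-level paging). */
static unsigned pml4_index(uint64_t va)
{
    return (unsigned)((va >> 39) & 0x1FF);
}
```

The highest user address 0x00007fffffffffff maps to index 255, and the first kernel-half address 0xffff800000000000 maps to index 256.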

In each PML4 entry above, there are some processor-"Ignored" bits and an XD (eXecute-Disable) bit, as the picture below indicates. The "XD" bit controls whether instructions can be fetched from the referenced physical pages. If it is set, an instruction fetch triggers a #PF exception (assuming MSR IA32_EFER.NXE = 1). This is the key point for implementing a software-based SMEP solution.

So, the solution now is:

Whenever a process enters kernel mode (CPL=0, for example, through a syscall or sysenter instruction), the OS kernel sets the PML4E.XD bit for all the user-space PML4 table entries (index 0 through 255; this can be optimized), and then flushes the TLB (at a performance cost).
In this way, any attempt to fetch instructions from user virtual address memory in kernel mode will cause a #PF exception, but read/write access to user virtual address memory is still allowed (for example, by the copy_to/from_user() functions).
The OS kernel can use some "Ignored" bits to record this intended behavior for easy virtual address management.
Before leaving kernel mode, the OS kernel changes the PML4E.XD bit (and those "Ignored" bits) back to the original state.
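The steps above can be sketched in C on an in-memory copy of a PML4 table. This is a sketch under my own assumptions, not kernel code: the function names are hypothetical, bit 52 is picked arbitrarily as the "Ignored" bit that remembers an XD bit the OS had already set, and the TLB flush is only marked by a comment.

```c
#include <stdint.h>

#define PML4E_PRESENT   (1ULL << 0)
#define PML4E_XD        (1ULL << 63)
#define PML4E_SAVED_XD  (1ULL << 52)  /* hypothetical use of an ignored bit */
#define USER_PML4_SLOTS 256           /* indexes 0..255 map user space */

/* On kernel entry: force XD on user entries, remembering the old XD value. */
static void soft_smep_enter_kernel(uint64_t *pml4)
{
    for (int i = 0; i < USER_PML4_SLOTS; i++) {
        if (!(pml4[i] & PML4E_PRESENT))
            continue;
        if (pml4[i] & PML4E_XD)
            pml4[i] |= PML4E_SAVED_XD;   /* XD was already set by the OS */
        pml4[i] |= PML4E_XD;
    }
    /* A real kernel would flush the TLB here (e.g. by reloading CR3). */
}

/* Before returning to user mode: restore the original XD state. */
static void soft_smep_leave_kernel(uint64_t *pml4)
{
    for (int i = 0; i < USER_PML4_SLOTS; i++) {
        if (!(pml4[i] & PML4E_PRESENT))
            continue;
        if (!(pml4[i] & PML4E_SAVED_XD))
            pml4[i] &= ~PML4E_XD;        /* XD was ours: remove it */
        pml4[i] &= ~PML4E_SAVED_XD;
    }
    /* TLB flush needed here as well. */
}
```

Entries 256 through 511 (the kernel half) are never touched, so kernel code stays executable.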

Similarly, if we don't consider the performance cost, we could even implement a software-based SMAP (Supervisor Mode Access Prevention) by clearing the "Present" bit, but I'm not explaining the details in this post.

<The End>

Update:
I didn't do enough homework before. Previously, UDEREF from PaX used 32-bit segmentation (and its limits) to emulate SMEP/SMAP behaviors, but thanks to someone from the PaX team commenting below, I got the UDEREF documentation for 64-bit here:
https://github.com/opntr/pax-docs-mirror/blob/master/uderef-amd64.txt

DMA Attacks Against McAfee DeepSafe

2014-11-16T04:15:00.001-08:00

Rafal Wojtczuk (from Bromium, previously Invisible Things Lab) presented DMA attacks against DeepSafe.

About DeepSafe:

http://www.mcafee.com/us/solutions/mcafee-deepsafe.aspx

The snapshots below are from: https://www.youtube.com/watch?v=RM1oBlFX5UQ

How do we know where in the physical address space the DeepSafe hypervisor is located? (from the whitepaper)

There are a few interesting technical details regarding the above hypervisor overwrite. First, malware running in OS needs to know where in physical address space Deepsafe hypervisor is located. Dumping all the physical address space via DMA and doing pattern search in it is possible, but troublesome. A more elegant approach was found – it turns out that when EPT fault occurs because OS tried to read from a physical address belonging to the hypervisor, then Deepsafe does not bother to emulate the instruction, it just skips it. Thus, the following function

mov rax, MAGICVALUE
mov rax, [rcx]
ret

Will return MAGICVALUE if memory at rcx belongs to Deepsafe, and something else (real memory content) if not. Deepsafe allocates a contiguous physical memory region of size 0x300000, so it is easy and fast to find it via scanning all the memory.

References:

BlackHat 2014 @US:

https://www.blackhat.com/us-14/archives.html#poacher-turned-gamekeeper-lessons-learned-from-eight-years-of-breaking-hypervisors

The Presentation:

https://www.blackhat.com/docs/us-14/materials/us-14-Wojtczuk-Poacher-Turned-Gamekeeper-Lessons_Learned-From-Eight-Years-Of-Breaking-Hypervisors.pdf

Whitepaper:

https://www.blackhat.com/docs/us-14/materials/us-14-Wojtczuk-Poacher-Turned-Gamekeeper-Lessons_Learned-From-Eight-Years-Of-Breaking-Hypervisors-wp.pdf

http://www.bromium.com/sites/default/files/wp-bromium-breaking-hypervisors-wojtczuk.pdf

Latest researching status of ROP/JOP attacks and defenses

2014-11-16T02:45:00.003-08:00

Control-flow hijacking, like ROP, has become a hot topic in recent years, ever since DEP (W^X enforcement) and SMEP were introduced in processor hardware. Based upon the papers that I read recently, this post gives a brief (though incomplete) introduction to the recent research status of control flow attacks and defenses.

As code injection attacks become more and more difficult, attackers have started to seek other opportunities to execute arbitrary code by entirely re-using existing executable code in the application image and/or shared libraries.

Typically, for example, those techniques without code injection could be return-to-libc, ROP (Return Oriented Programming), JOP (Jump Oriented Programming), or even SROP (Sigreturn Oriented Programming, see Framing Signals—A Return to Portable Shellcode).

Regarding techniques that defend against those control flow hijackings, here is a list (also incomplete).

Change existing compilers to re-generate the code binaries. Typical solutions include generating return-less binary code, generating control-flow-friendly binaries (with extra IDs/labels for CFG hardening), rewriting all the control flow (ret, jmp, call) instructions to go through a well-known central redirection table, etc.
Without binary changes, apply static binary instrumentation and dynamic control flow tracing. For instance, build the control flow graph, and enforce that execution exactly follows the known CFG paths.
Hardware-assisted CFI. CFI (Control-Flow Integrity) is an effective way to defend against ROP/JOP attacks. However, due to its performance cost, complete CFI enforcement is impractical. So, there are some lightweight CFI checks that use the processor's LBR (last branch record) feature to examine the control flow behaviors based upon heuristics (rather than full CFG analysis).
There are some specific solutions, like stack shadowing (to check if the target of a "ret" is a call-preceded instruction) and code section shadowing. But most of them are not generic solutions, and they come with many assumptions and limitations.

Regarding CFI-based defenses, many papers focus on the policy for checking whether the recorded control flow history looks like that of malicious software. Typically, for example:

Some solutions check the length of each ROP gadget. If it is very short (e.g. fewer than 5~6 instructions), it is suspicious. If there is a consecutive chain of gadgets with such "short" instruction sequences in a control flow, then it might be a ROP attack.
Some check whether the target of a "ret" is a call-preceded instruction (the instruction immediately preceding it is a CALL instruction). This is because normally every "ret" instruction returns to the instruction that immediately follows the corresponding "call" instruction.
Others check whether the targets of all indirect call/jmp instructions are function entry points. This is also normally true for a legitimate application, because an indirect "jmp" or "call" generally won't land in the middle of a function.
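The first two policies can be sketched in C as follows. This is a toy illustration under my own assumptions, not the logic of any particular tool: the function names and both thresholds are hypothetical, and the call-preceded check recognizes only two CALL encodings (E8 rel32 and the 2-byte register form of FF /2).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SHORT_GADGET_MAX_INSNS 6   /* hypothetical threshold */
#define SUSPICIOUS_CHAIN_LEN   8   /* hypothetical threshold */

/* Heuristic 1: flag a run of consecutive "short" instruction sequences.
 * insn_counts[i] = instructions executed between branch i and branch i+1. */
static bool looks_like_rop_chain(const unsigned *insn_counts, size_t n)
{
    size_t run = 0;
    for (size_t i = 0; i < n; i++) {
        run = (insn_counts[i] < SHORT_GADGET_MAX_INSNS) ? run + 1 : 0;
        if (run >= SUSPICIOUS_CHAIN_LEN)
            return true;
    }
    return false;
}

/* Heuristic 2: is the return target preceded by a CALL? Only two encodings
 * are recognized: E8 rel32 (5 bytes) and FF /2 with a register operand. */
static bool is_call_preceded(const uint8_t *code, size_t target)
{
    if (target >= 5 && code[target - 5] == 0xE8)
        return true;
    if (target >= 2 && code[target - 2] == 0xFF &&
        ((code[target - 1] >> 3) & 7) == 2 && (code[target - 1] >> 6) == 3)
        return true;
    return false;
}
```

Both checks look only at local properties, so a gadget that merely happens to follow a CALL, or a chain interleaved with one long gadget, passes them.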

However, just as those papers (see the links in References below) indicate, all the current CFI solutions based upon above assumptions can be bypassed with advanced ROP gadgets.

For example, in Nicholas's paper, he used call-preceded ROP gadgets and long termination gadgets that flush the attack history to bypass the checks of the well-known kBouncer and ROPecker CFI solutions.

But his attack relies on an assumption about kBouncer/ROPecker: the last branch records (LBR) can be stored in at most 16 pairs of LBR MSRs.

As a matter of fact, if appropriately configured, the last branch records can also be stored in a variable-sized, memory-resident branch trace store (BTS) buffer specified by the DS (Debug Store) save area, which is pointed to by the IA32_DS_AREA MSR. The processor doesn't restrict how many branch records can be stored in that BTS buffer, and it also lets us make the processor generate an interrupt before the number of records reaches the configured maximum (i.e. when the BTS buffer is nearly full). This means that we never have to miss any branch history records. If we don't consider the performance cost of enforcing CFI checks at run time, this could be a good way to trace all the control flow information.
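The buffer bookkeeping behind that idea might look like the sketch below. It rests on assumptions that should be checked against the SDM before use: the struct and function names are mine, the layout is the 64-bit DS save area with its four BTS fields first, and each BTS record is taken to be 24 bytes (from, to, flags). Writing IA32_DS_AREA and enabling tracing in IA32_DEBUGCTL are ring 0 operations and are not shown.

```c
#include <stdint.h>

#define BTS_RECORD_SIZE 24u  /* from, to, flags: 3 x 8 bytes in 64-bit mode */

/* BTS-related fields at the start of the 64-bit DS save area. */
struct ds_area_bts {
    uint64_t bts_buffer_base;
    uint64_t bts_index;
    uint64_t bts_absolute_maximum;
    uint64_t bts_interrupt_threshold;
};

/* Fill the DS fields for a buffer of `records` entries, asking for an
 * interrupt when only `headroom` free records remain. */
static void bts_setup(struct ds_area_bts *ds, uint64_t buf,
                      uint64_t records, uint64_t headroom)
{
    ds->bts_buffer_base         = buf;
    ds->bts_index               = buf;   /* next record goes here */
    ds->bts_absolute_maximum    = buf + records * BTS_RECORD_SIZE;
    ds->bts_interrupt_threshold = ds->bts_absolute_maximum
                                  - headroom * BTS_RECORD_SIZE;
}
```

For a 1024-record buffer at 0x100000 with a headroom of 64 records, the threshold lands at 0x105A00 and the absolute maximum at 0x106000, so the interrupt handler gets a chance to drain the buffer before records are lost.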

However, even if we can get all the control flow traces (to defend against Nicholas's "history flushing" technique), does it mean that we can completely defend against control flow attacks? Unless we can do full CFI checks against the CFG, another problem that we will always encounter is how to design a better policy that reduces "false positives" and at the same time catches all the ROP attacks at run time with an acceptable performance cost in practice.

References:

Nicholas Carlini, David Wagner, ROP is Still Dangerous: Breaking Modern Defenses
http://www.cs.berkeley.edu/~daw/papers/rop-usenix14.pdf
Enes Göktaş, Size Does Matter - Why Using Gadget-Chain Length to Prevent Code-Reuse Attacks is Hard
http://www.cs.columbia.edu/~mikepo/papers/chainlength.sec14.pdf
Out Of Control: Overcoming Control-Flow Integrity: http://www.ieee-security.org/TC/SP2014/papers/OutOfControl_c_OvercomingControl-FlowIntegrity.pdf
Technical Report TR-HGI-2014-001:
http://www.hgi.ruhr-uni-bochum.de/media/emma/veroeffentlichungen/2014/05/09/TR-HGI-2014-001_1_1.pdf

How to Implement a software-based SMEP(Supervisor Mode Execution Protection) with Virtualization/Hypervisor Technology

2014-11-12T18:35:00.001-08:00

As my previous post indicated, SMEP is a powerful security feature that is easy to deploy in a modern commodity OS. However, this feature requires hardware support from the processor. For those processors that are not SMEP-capable, this post presents a software-based solution that emulates SMEP functionality with the help of Virtualization/Hypervisor technology.

When the x86 processor's CR4.SMEP bit is set, system software executing in supervisor mode (CPL<3) cannot fetch instructions from any linear address with a translation for which the U/S flag is 1 (User) in every paging-structure entry controlling the translation. In other words, if SMEP is enabled, software operating in supervisor mode cannot fetch instructions from linear addresses that are accessible in user mode. When such an instruction fetch occurs, a SMEP-capable processor generates a #PF exception.

So, how can we implement a software-based SMEP feature?

This paper (SecVisor: A Tiny Hypervisor to Provide Lifetime Kernel Code Integrity for Commodity OSes from CyLab/CMU) presents a great idea: create two separate EPT protection memory views for guest kernel (ring 0) and user (ring 3) mode respectively, with different EPT permissions for the corresponding GPA->HPA translations, and then switch between these two EPT page table views by intercepting kernel<->user mode switches. On x86/Intel processors, the hypervisor can configure different VMCS EPTP pointers (which point to different Extended Page Tables) and switch among them at the appropriate time.

To make the discussion easier, we call these two guest memory translation tables (pointed to by two different EPTP pointers) protected memory views: one is used in guest Kernel mode, named the Kernel View; the other is for User mode, named the User View.

Besides, as that paper indicates, an identity map (GPA=HPA) is created in both EPT page tables by default, but the EPT page table entry permissions may differ for the same GPA. The latter is the key to emulating SMEP behavior; I will talk about it later.

By intercepting guest kernel/user mode switches, we can do the following in the hypervisor:

Switch to the Kernel View when the guest logical processor enters Kernel mode;
As we know, on x86 processors there are several ways to make a logical processor enter Kernel mode, for example, in Windows, interrupts/faults/traps (through the IDT) and syscall instructions. Based upon my previous project experience, some others, like task gates (used only for NMI on 32-bit OS) and call gates, are otherwise never used in Windows.
Switch to the User View when the guest logical processor leaves Kernel mode (or enters User mode);

So, the question now is: how does the hypervisor get notified whenever a kernel/user mode switch happens?

SecVisor does it as in the pictures below (snapshots from this link): in the User View, the Execute permission is removed in the EPT page tables for kernel code pages. Whenever the processor enters kernel mode and fetches the entry point instruction from a kernel code page, an EPT violation vmexit occurs and control is transferred to the hypervisor (SecVisor), so SecVisor can switch to the Kernel View by updating the EPTP pointer in the VMCS. Similarly, we can switch to the User View whenever leaving kernel mode.

Now it is easy to see how to emulate SMEP behavior.

Assume that the guest logical processor is running in Kernel mode, so EPTP points to the mapping tables of the Kernel View, and also assume that only approved code (e.g. the kernel and trusted LKM modules) has EPT execute permission in the Kernel View (see the picture below). Meanwhile, suppose there is a kernel vulnerability that malware can exploit to execute arbitrary user-mode code. When the logical processor starts to execute user-accessible code in kernel mode, an EPT violation is generated because that user-mode code is not executable in the EPT Kernel View.

When the hypervisor gets control, the following policy can be applied to check whether the execution (instruction fetch) violates SMEP semantics:

Read the current CPL value from the guest VMCS area to see if it is ZERO (kernel mode);
Get the current guest CR3 value and the faulting guest linear address (for an EPT violation due to an instruction fetch, that address is the guest RIP) from the VMCS, then walk the guest page tables to see if the U/S (accessible in user mode) flags in all the paging-structure entries are ONE.

If both conditions above are true, then we have caught a SMEP-like violation in guest kernel mode.
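The two-condition policy above can be sketched in C as a pure decision function. It is a sketch under my own assumptions: the function names are hypothetical, the guest page table walk is assumed to have been done already (its entries are passed in as an array), and the CPL is derived from the DPL field of the guest SS access rights, which is where the VMCS exposes it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PG_USER (1ULL << 2)  /* U/S flag in a paging-structure entry */

/* CPL can be derived from the DPL field (bits 6:5) of the guest SS
 * access-rights value stored in the VMCS. */
static unsigned cpl_from_ss_access_rights(uint32_t ss_ar)
{
    return (ss_ar >> 5) & 3;
}

/* entries[] holds the paging-structure entries found while walking the guest
 * page tables for the faulting RIP (e.g. PML4E, PDPTE, PDE, PTE). */
static bool smep_like_violation(unsigned cpl, const uint64_t *entries, size_t n)
{
    if (cpl != 0 || n == 0)
        return false;                 /* condition 1: must be kernel mode */
    for (size_t i = 0; i < n; i++)
        if (!(entries[i] & PG_USER))
            return false;             /* a supervisor-only level: not SMEP */
    return true;                      /* condition 2: U/S = 1 at every level */
}
```

A hypervisor's EPT violation handler would call smep_like_violation() only for instruction-fetch violations, then decide whether to inject a #PF into the guest or just log the event.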

Challenges:
However, there are many challenges to implement this software-based SMEP feature with virtualization technology.

Performance impacts.
Because, as in that paper, we create two EPT memory protection views (Kernel View and User View) and switch back and forth at run time, the hypervisor has to trap every kernel entry and exit. This introduces a significant performance cost because kernel-user mode switches are normally very frequent.

I think one way to switch EPTP pointers (Views) without a VMExit is to leverage the latest virtualization features, like Virtualization Exception (#VE) and the EPTP switching function (VMFUNC) from my previous post, and also use the IDT Shadow/Virtualization technique from another post of mine to trap every kernel/user mode switch caused by interrupt/trap/fault events. However, on those #VE/VMFUNC-capable machines, SMEP is also available :-)

For the mode switches due to syscall/sysret, you can brainstorm how to handle them without a vmexit!
In the Kernel View, we configure all the kernel code as executable in the EPT tables. When an LKM module is loaded or unloaded, we must immediately update the EPT permissions of the module memory (executable in the Kernel View, non-executable in the User View).

The author of the paper solves this by adding code to the load_module() and free_module() functions.

However, without guest kernel code changes, I think we can use a lazy solution for module loading. For example, when a newly loaded LKM module runs for the first time in Kernel mode, an EPT violation occurs; the hypervisor can then check whether it is a trusted LKM module and, if so, make that LKM code page executable in the Kernel View and remove the execute permission in the User View. But how do we update the LKM code page EPT permissions in the Kernel View when such an LKM module gets unloaded from the kernel?
Under memory pressure, will Linux page out or swap out LKM code pages to disk?
I know this is true on Windows, but I have no idea if Linux does the same thing. (Can anybody tell me?)
If it is the case on Linux, then without guest kernel hooks it is also a challenge to update the LKM code page permissions in the EPT Kernel View and User View.

Note that what I'm talking about in this post is for fun. I don't think it is worth doing all those things only to emulate a SMEP-like feature with virtualization technology :(. As a matter of fact, I have yet another solution to implement a software-based SMEP feature without Virtualization/Hypervisor. Please stay tuned for my next post.

Update:
Question: thinking of how to implement a software SMAP (Supervisor Mode Access Protection) with virtualization technology......

References:
SecVisor: A Tiny Hypervisor to Provide Lifetime Kernel Code Integrity for Commodity OSes, and its presentation link.

What does Transactional Synchronization Extensions (TSX) processor technology mean to vulnerability exploits (e.g. Brute Forcing)?

2014-11-08T06:36:00.000-08:00

Intel Transactional Synchronization Extensions (TSX) were introduced with the Haswell processor, adding hardware transactional memory support. TSX was originally designed to speed up the execution of multi-threaded software through lock elision. Every new technology has both a good side and an evil side, so how about TSX? What can we use it for in vulnerability exploits and their defenses?

According to Intel SDM, TSX provides two software interfaces for programmers:

Hardware Lock Elision (HLE) is a legacy compatible instruction set extension (comprising the XACQUIRE and XRELEASE prefixes).
Restricted Transactional Memory (RTM) is a new instruction set interface (comprising the XBEGIN, XABORT, and XEND instructions).

This post focuses on the RTM interface, which the specification describes as below.

Software uses the XBEGIN instruction to specify the start of the transactional region and the XEND instruction to specify the end of the transactional region.

The XBEGIN instruction takes an operand that provides a relative offset to the fallback instruction address if the transactional region could not be successfully executed transactionally.

A processor may abort transactional execution for many reasons. The hardware automatically detects transactional abort conditions and restarts execution from the fallback instruction address with the architectural state corresponding to that at the start of the XBEGIN instruction and the EAX register updated to describe the abort status.
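The abort status that lands in EAX can be decoded with a few bit tests. The sketch below uses constant and function names of my own (GCC and Clang expose similar _XABORT_* constants in <immintrin.h>); it only decodes a status value and does not itself execute XBEGIN, so it runs on any machine.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions of the RTM abort status returned in EAX. */
#define RTM_ABORT_EXPLICIT  (1u << 0)  /* caused by XABORT */
#define RTM_ABORT_RETRY     (1u << 1)  /* may succeed on retry */
#define RTM_ABORT_CONFLICT  (1u << 2)  /* memory conflict with another CPU */
#define RTM_ABORT_CAPACITY  (1u << 3)  /* internal buffer overflowed */
#define RTM_ABORT_DEBUG     (1u << 4)  /* debug breakpoint hit */
#define RTM_ABORT_NESTED    (1u << 5)  /* abort inside a nested transaction */

/* XABORT imm8 argument, meaningful only when RTM_ABORT_EXPLICIT is set. */
static uint8_t rtm_abort_code(uint32_t eax)
{
    return (uint8_t)(eax >> 24);
}

/* True when none of the cause bits above is set, i.e. the status does not
 * say why the transaction aborted. */
static bool rtm_abort_is_unclassified(uint32_t eax)
{
    return (eax & 0x3Fu) == 0;
}
```

For example, a status of 0x42000001 means an explicit XABORT with code 0x42, while a status of 0 carries no cause information at all.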

The interesting things here are as follows:

What reasons can cause a transactional execution abort?
What does it look like when a transactional execution abort happens?

For the 2nd question above, the TSX specification says the following:

The architecture ensures that updates performed within a transactional region that subsequently aborts execution will never become visible. Only a committed transactional execution updates architectural state. Transactional aborts never cause functional failures and only affect performance.

which means that after a TSX abort occurs, all the memory and processor register updates (made after the XBEGIN instruction) are discarded, and from the user's perspective, the memory and register states are "restored" to their states at the start of the XBEGIN instruction. This seems similar to the behavior of try/catch or setjmp/longjmp.

However, there are some differences. For instance, any fault or trap in a transactional region that must be exposed to software will be suppressed, as if the fault or trap had never occurred. If any exception is not masked, that will result in a transactional abort and it will be as if the exception had never occurred.

As a matter of fact, all the synchronous exception events (like #GP, #PF) that occur during transactional execution are suppressed as if they had never occurred, and those events are not delivered to software for handling.

Then, regarding the 1st question above, there are many reasons that might cause a TSX abort, like XABORT, CPUID, software INT, VMX instructions, I/O instructions, ring transitions (e.g. syscall), VMExit (EPT violation), etc.

So, now let's think about something about Intel TSX!

Suppose that malicious software attempts to access (e.g. write or execute) protected memory. Normally this triggers a #PF fault, so it can be detected and subsequently terminated by the OS kernel. But if the malicious software attempts the same thing inside a TSX transaction after an XBEGIN instruction, then, as we pointed out above, the #PF exception is suppressed, which means the OS kernel cannot even detect this access violation.

A similar case applies to accesses to memory protected by EPT (extended page tables) in a VMM/hypervisor. When guest software attempts to access physical memory protected with EPT, an EPT violation vmexit is normally triggered, but inside a TSX transaction no such vmexit is triggered, so the hypervisor won't detect it.

So basically, it means that malicious software can make more attempts (e.g. brute-forcing) at doing bad things before a successful attack, without being caught by the OS kernel or even the hypervisor.

But right now I don't have an idea of how to use it to do something "really bad". Maybe someone else has ideas....

UPDATE:
Interesting!!! got this....
TSX improves timing attacks against KASLR
http://labs.bromium.com/2014/10/27/tsx-improves-timing-attacks-against-kaslr/

Using LBR (Last Branch Record) Feature to Detect IDT-Shadowing-Based Malicious IDT Hooking

2014-11-08T04:53:00.004-08:00

Thanks to Yushi, who shared a presentation (ELI: Bare-Metal Performance for I/O Virtualization) with me. That hypervisor (ELI) introduces the idea of a guest IDT shadow (or IDT virtualization) design for some specific usage models. I'm going to talk a little bit about this idea.

This post first introduces the IDT shadow from that paper, then talks about guest IDT hooking with this technique, and finally explains how to detect such hooking with the processor's LBR feature.

IDT Shadow/Virtualization

See the picture below (captured from that presentation); it presents the idea of Exitless Interrupt Delivery.

The basic ideas are as follows:

Set up a guest shadow IDT table. With the help of hardware virtualization, we can easily fool the guest OS software by monitoring and trapping guest execution of the LIDT and SIDT instructions.
We maintain a shadow IDT table (and keep it in sync with the original one), then let the real guest logical processor's IDTR.base point to our shadow IDT table.
In the shadow IDT table, the hypervisor clears the Present bit or reduces IDTR.limit to trap the wanted guest interrupts/exceptions.
The former idea is exactly the same as the idea in my previous post of generating a #NP fault. The latter idea (triggering #GP) is not recommended in my opinion, because decreasing IDTR.limit will cause many false positives (please correct me if I am wrong).
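The "clear the Present bit in the shadow" variant can be sketched in C on in-memory tables. This is a sketch under my own assumptions: the struct and function names are hypothetical, the 16-byte 64-bit gate descriptor layout is used, and keeping the shadow in sync with later guest IDT updates is not shown.

```c
#include <stdint.h>
#include <string.h>

/* 64-bit mode IDT gate descriptor (16 bytes). */
struct idt_gate64 {
    uint16_t offset_low;
    uint16_t selector;
    uint8_t  ist;
    uint8_t  type_attr;     /* bit 7 = Present, bits 6:5 = DPL, 3:0 = type */
    uint16_t offset_mid;
    uint32_t offset_high;
    uint32_t reserved;
};

#define IDT_GATE_PRESENT 0x80u
#define IDT_VECTORS      256

/* Copy the guest IDT into the shadow and clear Present for the vectors we
 * want to trap; delivering such a vector then raises #NP instead. */
static void build_shadow_idt(struct idt_gate64 *shadow,
                             const struct idt_gate64 *guest,
                             const uint8_t *trap_vectors, unsigned count)
{
    memcpy(shadow, guest, IDT_VECTORS * sizeof(*shadow));
    for (unsigned i = 0; i < count; i++)
        shadow[trap_vectors[i]].type_attr &= (uint8_t)~IDT_GATE_PRESENT;
}
```

The guest's own table is left untouched, which is why the guest still reads back what it expects when SIDT and its reads of the original table are virtualized.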

In that paper (see References at the end of this post for the link of this full paper), ELI utilizes the shadow IDT to implement a high-performance guest/host interrupt delivery solution without changes to guest OS kernel.

And one of other benefits is that the guest OS kernel integrity check (like Windows x64 OS PatchGuard) cannot even detect it.

Guest IDT vector entry hooking

Another usage of shadow IDT is to hook any one of guest IDT ISRs by changing the corresponding ISR entry in the shadow IDT table.

For example, to implement an interrupt filter for a particular interrupt, we can change the IDT entry in the shadow IDT table and let it point to our own hooking ISR entry. After that, whenever that particular interrupt is triggered, the processor will firstly pass on the control to our hooking ISR routine, and we can do something (e.g. filtering), then jump to original OS kernel ISR for further handling. And under some circumstance, we can even let the control be back to our routine after original ISR handling (see the patent in the References).

However, what if this kind of guest IDT hooking is done by a malicious hypervisor? How can the guest OS detect it?

Detect IDT hooking with LBR (Last Branch Record)

Nowadays, Intel x86 processors provide a feature called Last Branch Record (LBR).

When it is enabled, the processor records a running trace of the most recent branches, interrupts, and/or exceptions it has taken in the last branch record (LBR) stack (which may be held in MSRs and/or in the Branch Trace Store (BTS) buffer in the DS save area).

To be specific, when this feature is configured by the guest OS kernel, the processor captures the trace (e.g. the LastBranchToIP address) in the LBR stack whenever an interrupt is delivered. So when the original OS ISR runs, it can check the LastBranchToIP value in the LBR stack to see whether it matches the original ISR entry. A mismatch indicates that the corresponding ISR entry has been hooked by someone else, e.g. by a malicious hypervisor.
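The check described above might look roughly like the following sketch. It works on a snapshot of the LBR stack so that it can run anywhere; in a kernel the snapshot would be filled with RDMSR before the ISR executes any branch of its own. The struct lbr_snapshot, the function name and the depth of 16 records are assumptions for illustration (the real depth and MSR layout are model specific). Note also that a hook which reaches the original ISR through a recorded jump would leave a matching To address as well, so a real check would probably have to look at the From address or the preceding record too.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LBR_DEPTH 16u   /* assumption: number of LBR records on this model */

/* Hypothetical copy of the LBR state, taken at the very start of the ISR. */
struct lbr_snapshot {
    uint32_t tos;                 /* top-of-stack index of the newest record */
    uint64_t to_ip[LBR_DEPTH];    /* LastBranchToIP values */
};

/*
 * Returns true when the newest recorded branch target is not the entry
 * point the OS originally installed for this vector.
 */
bool isr_entry_hooked(const struct lbr_snapshot *lbr, uint64_t expected_isr_entry)
{
    assert(lbr != NULL);
    uint32_t newest = lbr->tos % LBR_DEPTH;   /* defensive wrap of the index */
    return lbr->to_ip[newest] != expected_isr_entry;
}
```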

But in a hypervisor environment, there are ways to prevent the guest OS kernel from detecting IDT hooking with the LBR feature, e.g.:

Hide LBR and disable the feature by trapping the corresponding CPUID instruction (LBR capability check) and MSR read/write accesses to the LBR control MSRs.
Because the LBR stack may be stored in LBR MSRs, the hypervisor must trap accesses to those MSRs and return faked values.
The hypervisor must also trap read accesses to the Branch Trace Store (BTS) buffer in the DS save area if the guest OS configures the LBR stack to be stored in the BTS buffer as well.

References:

Ravi, et al., Patent: Secure handling of interrupted events utilizing a virtual interrupt definition table (VIDT):

http://patft.uspto.gov/netacgi/nph-Parser?Sect2=PTO1&Sect2=HITOFF&p=1&u=/netahtml/PTO/search-bool.html&r=1&f=G&l=50&d=PALL&RefSrch=yes&Query=PN/8578080

The full paper for the ELI presentation from the IBM Israel research lab: Abel Gordon, et al., ELI: Bare-Metal Performance for I/O Virtualization

http://www.mulix.org/pubs/eli/eli.pdf

Monitor Trap Flag (MTF) Usage in EPT-based Guest Physical Memory Monitoring

2014-11-06T07:02:00.002-08:00

Monitor Trap Flag (MTF) is a flag specifically designed for single-stepping in Intel's x86 hardware virtualization technology (VT-x). When MTF is set, the guest triggers a VM exit after executing each instruction (subject to NMI and other interrupt delivery boundaries). This paper presents an idea for using MTF to let a single write through while monitoring modifications to the guest virtual-to-physical mapping tables (page table entries).

That paper (SPIDER: Stealthy Binary Program Instrumentation and Debugging via Hardware Virtualization) details a solution that traps changes to guest virtual-to-physical mappings by monitoring the corresponding guest page tables. In my previous experience, monitoring page table entries (with read-only permission in the EPT PTE settings) causes a significant performance cost. In this post I am not challenging that solution, since it is not a product after all.

As we all know, EPT can be configured to monitor guest physical memory accesses through appropriate RWX permission settings. For example, for a guest data page, we can configure the corresponding EPT page table entry with !X permission; then whenever the processor fetches instructions from that guest physical page for execution (e.g. injected shellcode), an EPT violation VM exit (or #VE exception) will occur.

However, the contents of a guest physical page might be swapped out to disk by the OS under memory pressure, and that physical page might then be remapped to another guest virtual address used by another process. In this case we must restore the EPT permission to the default (e.g. RWX); otherwise many unwanted EPT violations will occur.

One solution is to monitor the guest virtual-to-physical mapping page table entries, just as the paper does. For example, we can monitor the guest PTE page (by guest physical address) with EPT read-only permission. Whenever a page remapping is required, the guest OS kernel will update the corresponding guest PTE.

Since the EPT permission for the PTE page is read-only, any change to an entry triggers an EPT violation VM exit. After the hypervisor captures this event, it records the current values of the PTEs, then temporarily makes the PTE page writable and lets the guest single-step (by enabling MTF) through the instruction that performs the write. After the single step, the hypervisor reads the new PTE values, determines which of them have been modified, and takes appropriate action based on the mapping updates. Finally, the hypervisor disables MTF and sets the PTE page back to read-only to capture future remapping events.
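The two VM-exit handlers in this flow can be modeled with a short C sketch. It is a simulation under assumptions, not code from the SPIDER paper: the struct pte_monitor and both handler names are hypothetical, and the two booleans merely stand in for the EPT write permission of the monitored page and for the MTF VM-execution control.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PTES_PER_PAGE 512

/* Hypothetical state for one monitored guest PTE page. */
struct pte_monitor {
    uint64_t *live;                   /* the guest PTE page, as seen by the hypervisor */
    uint64_t  saved[PTES_PER_PAGE];   /* snapshot taken before the guest write */
    bool      ept_writable;           /* stands in for the EPT W bit of the page */
    bool      mtf_enabled;            /* stands in for the MTF control */
};

/* EPT violation (write to the read-only PTE page): snapshot, open, single-step. */
void on_ept_write_violation(struct pte_monitor *m)
{
    assert(!m->ept_writable && !m->mtf_enabled);
    memcpy(m->saved, m->live, sizeof m->saved);
    m->ept_writable = true;
    m->mtf_enabled  = true;
}

/*
 * MTF exit after the single step: report which entries changed (up to
 * max_changed indices are stored), then close the window again.
 */
size_t on_mtf_exit(struct pte_monitor *m, size_t *changed, size_t max_changed)
{
    size_t n = 0;

    assert(m->mtf_enabled);
    for (size_t i = 0; i < PTES_PER_PAGE; i++) {
        if (m->live[i] != m->saved[i] && n < max_changed)
            changed[n++] = i;
    }
    m->mtf_enabled  = false;
    m->ept_writable = false;   /* back to read-only for the next remapping */
    return n;
}
```

Diffing a whole page per write keeps the sketch independent of instruction decoding, at the cost of comparing 512 entries on every exit.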

In a real system, the guest page table may have multiple levels, changes to page table entries may be very frequent, and the minimal EPT page granularity is 4KB (too large), so this can only be an experimental solution due to the huge performance penalty.

However, using MTF to grant a single data write and/or inspect the written content is acceptable for a data page that is written less frequently.

BitVisor - A Thin Hypervisor Built for Enforcing I/O Device Security - Storage (USB/DISK) Encryption or File Access Monitoring

2014-11-05T04:27:00.003-08:00

This post shares an idea from a paper I read recently (BitVisor: A Thin Hypervisor for Enforcing I/O Device Security). It introduces a hypervisor-based solution for enforcing storage/disk encryption of ATA devices.

As we know, direct memory access (DMA) is a feature of computer systems that allows certain hardware subsystems to access main system memory independently of the CPU.

As shown in the picture below (from Wikipedia), the processor MMU (if present) cannot directly intercept data movement between external DMA-capable devices and main memory. Even in a virtualized environment, the extended page tables (EPT or NPT) configured by the hypervisor cannot protect main memory from DMA. This is one of the reasons why IOMMU technology such as VT-d (Intel Virtualization Technology for Directed I/O) was introduced: to prevent malicious device drivers from attacking main memory, including hypervisor-owned memory.

Although the processor has no chance to intercept the data transferred between a device and main memory, it can still intercept accesses to DMA control data such as DMA descriptors, which store transfer information such as the buffer address and the size of the data.

This is because basically all DMA host controllers use specific command and control registers to configure and trigger DMA data transfers. Those registers are commonly exposed as I/O-port-based and/or MMIO-based registers: the former are accessed through I/O ports (e.g. with the IN/OUT or INS/OUTS instructions on x86), and the latter are accessed through memory-mapped I/O with generic memory access instructions (like MOV).

In x86 virtualization, both I/O port and MMIO accesses can be monitored by the hypervisor (e.g. through the I/O bitmap configuration in the VMCS and appropriate EPT page permission settings).

In BitVisor, a thin hypervisor intercepts any read/write access to the ATA DMA host controller's command-block registers and control-block register. Therefore, it is easy to obtain information necessary to enforce encryption. For example, the hypervisor can obtain the LBA and sector count by intercepting writes to these registers. Similarly, the corresponding ATA disk DMA descriptor can also be monitored and controlled by BitVisor hypervisor.

By intercepting all the information mentioned above, the hypervisor knows where the DMA buffer is (its physical address), when data transfers start and stop, and how many bytes will be transferred to or from the external device.

This paper presents a novel idea, shadow DMA descriptors (see the figure below), for safely intercepting the content of data transferred by DMA. A shadow DMA descriptor is a shadow of the DMA descriptor of the guest OS (the guest DMA descriptor). The hypervisor hands the shadow DMA descriptors to the host controller (rather than the real guest DMA descriptors). The shadow DMA descriptor specifies a memory region controlled by the hypervisor as a temporary buffer, called the shadow buffer.

When data movement starts, the DMA host controller transfers data between the shadow buffer (rather than the real guest buffer) and the device, based on the shadow DMA descriptors. After data movement is completed, the hypervisor emulates the host controller behaviors by copying data between the shadow buffer and the guest buffer that is specified in the real guest DMA descriptor.

Now that the hypervisor can fully control and intercept the DMA data content, encryption and decryption become easy. When guest software writes data to the ATA disk, the hypervisor encrypts from the guest buffer into the shadow buffer (whose content eventually goes to disk); conversely, when guest software reads data from the ATA disk, the hypervisor decrypts from the shadow buffer (filled from disk) into the guest buffer.
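The shadow descriptor and the two copy directions can be illustrated with the following C sketch. It is a simplified model under stated assumptions: struct dma_desc uses a plain pointer instead of a physical address, the function names are hypothetical, and the single-byte XOR is only a placeholder for a real cipher, not what BitVisor uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified DMA descriptor: one buffer and its length in bytes. */
struct dma_desc {
    uint8_t *addr;
    size_t   len;
};

/* Placeholder transform; XOR is its own inverse. NOT real encryption. */
static void xor_copy(uint8_t *dst, const uint8_t *src, size_t len, uint8_t key)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i] ^ key;
}

/*
 * Guest write: encrypt guest buffer -> shadow buffer and return the
 * shadow descriptor that is handed to the host controller instead.
 */
struct dma_desc shadow_for_write(const struct dma_desc *guest,
                                 uint8_t *shadow_buf, size_t shadow_cap,
                                 uint8_t key)
{
    assert(guest->len <= shadow_cap);
    xor_copy(shadow_buf, guest->addr, guest->len, key);
    return (struct dma_desc){ .addr = shadow_buf, .len = guest->len };
}

/*
 * Guest read, after the controller has filled the shadow buffer from
 * disk: decrypt shadow buffer -> guest buffer.
 */
void complete_read(const struct dma_desc *guest, const struct dma_desc *shadow,
                   uint8_t key)
{
    assert(shadow->len == guest->len);
    xor_copy(guest->addr, shadow->addr, guest->len, key);
}
```

Because the controller only ever sees the shadow descriptor, the guest buffer never holds ciphertext and the disk never receives plaintext in this model.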

On the other hand, as a nice side effect, we can use this solution to protect against device-specific DMA attacks on a platform that lacks IOMMU (e.g. VT-d) capability. For instance, the hypervisor can verify the addresses of the guest buffers specified in the guest DMA descriptors, so that an address (plus the data size) does not point into hypervisor memory regions or any other protected guest memory regions.
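The address verification mentioned above boils down to an interval overlap test. A minimal sketch, assuming the protected regions are kept in a simple array (struct region and dma_target_allowed are hypothetical names, not BitVisor's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A protected physical range [base, base + size). */
struct region {
    uint64_t base;
    uint64_t size;
};

/*
 * Returns true only if [addr, addr + len) is non-empty, does not wrap
 * around the address space, and overlaps none of the protected regions.
 */
bool dma_target_allowed(uint64_t addr, uint64_t len,
                        const struct region *prot, size_t n)
{
    if (len == 0 || addr > UINT64_MAX - len)
        return false;                       /* empty or wrapping transfer */

    for (size_t i = 0; i < n; i++) {
        /* Protected regions themselves are assumed not to wrap. */
        assert(prot[i].size <= UINT64_MAX - prot[i].base);
        if (addr < prot[i].base + prot[i].size && prot[i].base < addr + len)
            return false;                   /* ranges intersect */
    }
    return true;
}
```

Checking the end address as well as the start matters: a transfer that begins just below a protected region can still run into it.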

As we can see, the hypervisor can capture the event when DMA starts. The end of a DMA transfer, however, is usually signaled by a hardware interrupt, and BitVisor cannot identify the ATA device that issues hardware interrupts. Instead, BitVisor captures I/O accesses to the status registers, because device drivers usually read the status registers to check whether a DMA transfer has finished successfully, and write registers to acknowledge interrupts. As an alternative, the solution in my previous post can solve this issue by monitoring the ATA disk's external interrupt.

To download the source code of the latest BitVisor, please go to the official site http://www.bitvisor.org/

See the snapshot below (and another one from BitVisor Summit 2012): BitVisor uses the ring 3 layer in VMX root mode to host various services.

References:
TreVisor: OS-Independent Software-Based Full Disk Encryption & Secure Against Main Memory Attacks
http://www1.cs.fau.de/filepool/projects/trevisor/trevisor.pdf

OSb: OSv on BitVisor
http://www.slideshare.net/yushiomote/osb-osv-on-bitvisor

Takahiro Shinagawa: Introduction to the BitVisor and Comparison with Xen
http://www.slideshare.net/xen_com_mgr/xs-japan-2008-bitvisor-english

Dependable Cloud Computing
http://www.slideshare.net/kazuhikokato/121127-37898979

Kernel Memory Protection by an Insertable Hypervisor which has VM Introspection and Stealth Breakpoints (IWSEC2014)
http://www.slideshare.net/suzaki/international-workshop-on-security-iwsec2014

A Hypervisor IPS based on Hardware Assisted Virtualization Technology
http://www.slideshare.net/ffri/bh-usa08murakami

XEN PVH Virtualization Mode - "What Color Is Your Xen?"

2014-11-04T21:39:00.001-08:00

In my previous post, Why smaller code size with XEN on ARM?, one of the reasons I gave is that XEN on x86 must support different guest working modes for backward compatibility, due to historical limitations of x86 virtualization technology (e.g. the first version of VT-x had no hardware-assisted paging support). This post just shares some useful information and links on a new XEN virtualization mode (PVH) that I read about recently.

Before the PVH virtualization mode was introduced (by Mukesh Rathor @Oracle in 2012), Xen (on x86) supported several virtualization modes, such as PV, HVM, HVM with PV drivers, and PVHVM, depending on the guest domain/OS type and the hardware capability. This made the XEN design pretty complicated. I think we wouldn't do it like that if the XEN on x86 project were launched in recent years (instead of 10 years ago). This is why XEN on ARM can do better in this area.

Here are some great posts that explain why PVH mode is much better than any of the previous virtualization modes on the latest x86 processors and platforms.

What Color Is Your Xen?

http://www.brendangregg.com/blog/2014-05-07/what-color-is-your-xen.html

The Paravirtualization Spectrum, part 1: The Ends of the Spectrum

https://blog.xenproject.org/2012/10/23/the-paravirtualization-spectrum-part-1-the-ends-of-the-spectrum/

The Paravirtualization Spectrum, Part 2: From poles to a spectrum

https://blog.xenproject.org/2012/10/31/the-paravirtualization-spectrum-part-2-from-poles-to-a-spectrum/

At a glance, the picture below (from What Color Is Your Xen?) gives a straightforward illustration of the differences among all those virtualization modes.

Unikernels: Library Operating Systems for the Cloud (OSv)

2014-11-03T22:29:00.003-08:00

Unlike a general-purpose, commercial operating system (like Windows or Ubuntu), OSv (http://osv.io/ from cloudius-systems) is a single-purpose operating system. It is also a kind of library operating system designed for the cloud, running on top of different hypervisors, e.g. XEN, KVM, VMware. So what does OSv look like?

By taking a quick look at these slides (http://www.slideshare.net/dmarti1111/o-sv-usenix-atc-2014), we can learn about the two features below.

A general-purpose OS has kernel mode (ring 0) and user mode (ring 3), but OSv only has code running in ring 0; it has no code running in user mode (ring 3). This is one of the most significant differences.
The other main difference is that OSv has only a single address space (multiple threads are allowed, though) and runs on a hypervisor as a single virtual appliance serving a single-purpose cloud service.

So, each OSv holds one specific application (with multiple threads) on top of it, and the OSv itself runs as a guest OS on top of hypervisor. Application and resource isolation is guaranteed by the hypervisor.

Since an OSv instance has a single address space, it needs only one CR3 value. Process and address space switching (in the scheduler) is no longer required, and hence no TLB flush overhead is introduced. This can also reduce the memory footprint of the "kernel" component and leave more memory for the application.

Regarding address translation overheads, e.g. GVA->GPA->HPA, using larger pages (2MB, or even 1GB) in both the guest page tables and the EPT tables can reduce the translation overhead considerably.
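To put a rough number on this, the worst-case cost of a nested (two-dimensional) page walk can be counted as below. The sketch assumes cold TLBs and paging-structure caches, so it gives an upper bound rather than a measurement; the function name is made up for this example. Each guest page-table entry lives at a guest physical address that itself needs a full EPT walk before it can be read, and the final guest physical address needs one more EPT walk.

```c
#include <assert.h>

/*
 * Memory references for one GVA -> GPA -> HPA translation with
 * guest_levels guest page-table levels and ept_levels EPT levels:
 *   guest_levels * (ept_levels + 1)   walk EPT, then read each guest entry
 *   + ept_levels                      translate the final GPA
 * which equals (guest_levels + 1) * (ept_levels + 1) - 1.
 */
unsigned nested_walk_refs(unsigned guest_levels, unsigned ept_levels)
{
    assert(guest_levels >= 1 && ept_levels >= 1);
    return (guest_levels + 1) * (ept_levels + 1) - 1;
}
```

With 4-level paging on both sides, 4KB pages walk all 4 levels, 2MB pages stop after 3 and 1GB pages after 2, so nested_walk_refs(4, 4), nested_walk_refs(3, 3) and nested_walk_refs(2, 2) give 24, 15 and 8 references respectively.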

Note that, as the slides point out, syscalls (user/kernel switches) are no longer required. Traditional syscalls are converted to plain function calls within kernel mode. Although this can significantly reduce the performance cost, it also causes a new issue: application ABI compatibility. This means that in order to deploy an application on top of OSv, its source code must be modified and recompiled.

There are many challenges; you can check http://osv.io/ for more details. But anyway, if OSv is the right direction for a cloud OS (and succeeds in getting a rich set of applications supported) in the future, it will definitely compete directly with other solutions like Docker.

Personally, I like this product: the simpler it is, the more I love it :-).

Also, see these slides (OSb: OSv on BitVisor): someone is working on enabling OSv on top of BitVisor, cool!

Some other references about Library OS:

----------------------------------------------------

Unikernels: Rise of the Virtual Library Operating System
http://queue.acm.org/detail.cfm?id=2566628

XPDS14: OSv - A Modern Semi-POSIX LibraryOS - Glauber Costa
http://www.xenproject.org/help/presentations-and-videos/video/xpds14v-osv.html

Rethinking the Library OS from the Top Down:

http://research.microsoft.com/pubs/141071/asplos2011-drawbridge.pdf

OSv on bhyve:

http://bhyvecon.org/osv_on_bhyve.pdf

Problems arise when supporting EFI + GRUB2 + Xen with Multiboot2 boot specification

2014-11-03T18:46:00.001-08:00

Previously I wrote a post discussing the limitations of the Multiboot boot specification; today I saw that the XEN hypervisor has similar problems.

See the links below: Daniel Kiper (from Oracle) has some proposals to solve the XEN/Multiboot2 issue on EFI/Grub2 platforms.
http://lists.xen.org/archives/html/xen-devel/2014-05/msg02928.html
https://lists.gnu.org/archive/html/grub-devel/2014-06/msg00016.html
http://www.slideshare.net/xen_com_mgr/xen-in-efiworld20140801finaldk

To summarize it, the problems are:

Grub2 calls ExitBootServices() before jumping to the entry point of XEN, which means all EFI Boot Services are terminated by then.
The Multiboot2 specification doesn't define a 64-bit entry point and its initial machine state, which means that even when both XEN and Grub2/EFI run in a 64-bit environment, Grub2 must switch the processor to 32-bit mode during the handover, and XEN must also have a stub that switches the processor back to 64-bit mode.
XEN requires some information from Grub2, e.g. EFI tables/functions, ACPI, memory map, VGA, EDD data, etc. Hence, some extra Multiboot2 tags must be introduced to pass that information along. But this also requires changes to be upstreamed to Grub2.

Previously, in our own proprietary VMM, we worked around this issue by booting the guest Linux OS with the "noefi" flag in the vmlinuz command-line options (see this link for Linux parameters). We did it the same way as the tboot project, because we just get the EFI System Descriptor Table from Grub2 and then boot the Linux guest OS as usual, as on a legacy platform (e.g. with the legacy e820 memory map format). However, XEN should NOT do it this way.

Debugging Bug Check (BSOD) 0x101 CLOCK_WATCHDOG_TIMEOUT in a Hypervisor/VMM Environment

2014-11-03T00:16:00.002-08:00

I was planning to write a post on debugging Bug Check 0x101 (CLOCK_WATCHDOG_TIMEOUT) in Windows, but I happened to find the blog post Debugging a CLOCK_WATCHDOG_TIMEOUT Bugcheck from the MSFT debugger team, which explains it in great detail. However, the issue we met is slightly different from what the MSFT team was debugging: we are working in a virtualization/hypervisor environment, and Windows (7+) is running as the primary guest OS.

Basically, according to MSFT, a bugcheck 0x101 occurs when the clock interrupt (its IRQL is #28) has not been processed by each processor within a timeout. The clock interrupt is quite high in the IRQL table for x86; however, the Inter-Processor Interrupt (IPI, its IRQL is #29) is even higher.

However, in our case, things are a little bit different, but I won't go into the details here.

In uniprocessor mode, the system hangs when the issue occurs. In SMP mode, the system shows the 0x101 BSOD right after being stuck for a very short time, but sometimes the system also hangs.

The root cause I eventually found is an infinite loop in the hypervisor. When this loop happens on the BSP (the CPU runs endlessly in VMX root mode), the symptom is a guest Windows hang without the 0x101 CLOCK_WATCHDOG_TIMEOUT BSOD; but when such a loop happens on any of the APs (Application Processors), the clock watchdog timeout Blue Screen of Death occurs very soon.

This is because when one of the processors runs in VMX root mode endlessly, the clock interrupt (IRQL #28) is not processed by that processor within the timeout; the BSP then initiates a 0x101 BSOD and sends IPIs to the other processors, which start dumping the system state and put themselves into a shutdown state.

Security OS Design (cont.): Write Protection for Linux Kernel critical data structures (GDT, IDT, syscall table, task_struct, mm_struct,...)

2014-11-02T22:07:00.003-08:00

Continuing from the previous post, let me review what must be changed in the Linux kernel in order to prevent buffer overrun/overflow attacks from modifying critical kernel data structures such as the GDT, IDT, task_struct, mm_struct, etc.

Some kernel data structures never change at runtime once the operating system has completed their initialization, for example the GDT and IDT, the system call table, or the SSDT (pointed to by nt!KeServiceDescriptorTable; see this link for SSDT hooking in Windows). Note that on Linux, some GDT entries are still updated by the kernel at runtime.

For those data structures, we can directly configure them with Read-Only memory permission in page table entries.

However, there are many kernel data structures, like task_struct, mm_struct, and the GDT, which must be configured with the Read-Write attribute because they change very frequently during OS runtime.

But the good thing is that those data are only changed by the kernel itself. Basically, system drivers and other LKMs must not change them, and we can even consider any change made to them by those LKMs illegitimate and undesired.

So, with this assumption, we can now look at what we would have to do in the existing Linux kernel or in a new operating system written from scratch.

Kernel virtual memory management subsystem:
First of all, add a new type of memory allocation to support Read-Only memory allocation (with kmalloc() or even vmalloc()), for example by adding a new flag GFP_ROMEM.

This means that the kernel's internal memory management subsystem (e.g. the Linux slab allocator) must be extended to group RO memory chunks together in one or more RO pages (4KB or 2MB in size), and traditional RW memory chunks in other RW pages (4KB or 2MB in size). This might greatly increase the complexity of the memory management design.

Memory allocation for kernel itself and drivers (or any LKMs)
Once we add a new type GFP_ROMEM, we must define the rules to use it.

Rule #1: for data structures that will only be modified by the kernel itself, we must use this new type for memory allocation in the kernel (e.g. in the scheduler). Drivers (and other LKMs) are not allowed to use this new type; we can use a static code analysis tool to enforce this.

Rule #2: since these data structures now have the RO attribute and the CR0.WP bit is set by default, the kernel must clear CR0.WP before writing to them. So the code logic is as below:

disable_wp();  /* clear the CR0.WP bit */

/* ... write to the RO data fields ... */

enable_wp();   /* set the CR0.WP bit again */

At the same time, legitimate drivers (LKMs) are not allowed to change the CR0.WP bit (enforced by code scanning).
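The "open a write window, write, close the window" pattern can be tried out in user space with mprotect() as a stand-in for toggling CR0.WP. This is only an analogy under clear assumptions: the helper names ro_page_alloc and ro_page_write are invented for this sketch, it targets POSIX systems that provide MAP_ANONYMOUS, and unlike CR0.WP (which is per CPU and affects all read-only pages at once) mprotect() changes a single mapping for every thread of the process.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static size_t page_size(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    assert(psz > 0);
    return (size_t)psz;
}

/* Allocate one zero-filled page and leave it read-only. NULL on failure. */
void *ro_page_alloc(void)
{
    size_t psz = page_size();
    void *p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    if (mprotect(p, psz, PROT_READ) != 0) {
        munmap(p, psz);
        return NULL;
    }
    return p;
}

/* Write len bytes at offset off inside the page. 0 on success, -1 on error. */
int ro_page_write(void *page, size_t off, const void *src, size_t len)
{
    size_t psz = page_size();

    assert(off <= psz && len <= psz - off);
    if (mprotect(page, psz, PROT_READ | PROT_WRITE) != 0)  /* ~ disable_wp() */
        return -1;
    memcpy((char *)page + off, src, len);
    return mprotect(page, psz, PROT_READ);                 /* ~ enable_wp() */
}
```

Any store to the page outside ro_page_write() faults, which mirrors the goal of the scheme: only the designated writer path may touch the protected data.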

With this solution, we can prevent many buffer overflow attacks, such as a driver bug that causes arbitrary kernel memory overwrites. ROP (JOP) attacks might bypass this solution, but this security design is not intended to address such specific attacks.

Problems:

This requires tremendous changes to the existing Linux kernel. But it would be fine if we plan to write our own operating system from scratch.
Performance impact: many extra cycles are spent clearing and setting the CR0.WP bit. But this can be optimized, and the real impact might not be so big.
We need to consider interrupts or NMIs arriving between disable_wp() and enable_wp(). This is just an implementation consideration and can be solved very easily.

Any other big issues?

[Update]:
The memory Protection Keys feature can provide a similar kind of protection for key data structures. With this feature enabled, each process also has a protection key value associated with it. On a memory access, the hardware checks that the current process's protection key matches the value associated with the memory block being accessed; if not, an exception occurs.

See the wikipedia page for details: http://en.wikipedia.org/wiki/Memory_protection#Protection_keys