SIMPLE IS BETTER: How to defend against Stack Pivoting attacks on existing 32-bit x86 processor architecture?

Stack pivoting is a common technique used by vulnerability exploits to bypass hardware protections like NX/SMEP, or to chain ROP (Return-Oriented Programming) gadgets. However, there is NO hardware protection that defends against it (at least for now :-). This blog describes a software solution that detects stack pivoting at run time, and I will also point out some limitations that come from current processor architecture implementations. <Please let me know if this is NOT a new idea, or NOT doable.>

The basic idea for detecting stack pivoting is to configure an appropriate stack base/limit in the stack segment register (SS) for each thread (normally, a modern OS sets the base/limit to cover 0~4G in 32-bit mode). Then, if a stack pivot moves the stack address (ESP) outside the defined range, the processor generates a #SS fault (limit violation exception) on the next stack access.

Before introducing my solution, let me briefly talk about an existing solution that detects stack pivoting in Windows 8.

Microsoft implements a simple protection mechanism: every function associated with manipulating virtual memory, including the often-abused VirtualProtect and VirtualAlloc, now includes a check that the stack pointer, as contained in the trap frame, falls within the range defined by the Thread Environment Block (TEB; see the StackBase/StackLimit fields in the picture below).
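
To make the idea concrete, here is a minimal sketch of that kind of range check. It is not Microsoft's code: the struct and function names are made up for illustration, and the only assumption is that the thread's stack occupies the addresses from StackLimit (low) up to, but not including, StackBase (high).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two TEB fields the check relies on. */
struct thread_stack_bounds {
    uintptr_t stack_base;   /* highest address of the stack (exclusive) */
    uintptr_t stack_limit;  /* lowest address of the stack (inclusive)  */
};

/* Returns true when the saved stack pointer lies inside the thread's stack. */
static bool sp_within_thread_stack(const struct thread_stack_bounds *b,
                                   uintptr_t sp)
{
    return sp >= b->stack_limit && sp < b->stack_base;
}
```

A stack pointer that has been pivoted to a heap buffer fails this test, but the check only runs when one of the guarded functions is called, which is what leaves room for a bypass.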

You can take a look at the blog post "Defeating Windows 8 ROP Mitigation" (listed in the references) for a detailed description. However, its author (Dan Rosenberg) also describes an approach to bypassing the check.

Now I'm going to talk about the solution and its limitations in greater detail.

What's stack pivoting?
Please skip this section if you already know what stack pivoting is.

With stack pivoting, attackers pivot from the real stack to a fake stack, which can be any attacker-controlled buffer (for example, one on the heap), and from there they control program execution. This works because the attacker controls the data pointed to by the stack pointer register (ESP, or RSP in 64-bit mode): each ret instruction increments the stack pointer and transfers execution to the next address chosen by the attacker.
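
The mechanism can be sketched with a toy machine. This is only an illustration: the struct and function names are invented, memory is an array of 32-bit words, and the stack pointer is an index into that array rather than a real address.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy machine: flat word memory plus a stack pointer that indexes into it. */
struct toy_cpu {
    const uint32_t *mem;
    size_t sp;      /* index of the word currently on top of the stack */
    uint32_t ip;    /* address most recently "returned" to             */
};

/* ret: pop the word on top of the stack into ip, then move sp up one word. */
static void toy_ret(struct toy_cpu *cpu)
{
    cpu->ip = cpu->mem[cpu->sp];
    cpu->sp += 1;
}

/* The pivot: the effect of a gadget such as "xchg eax, esp; ret",
   which points the stack pointer at an attacker-chosen buffer. */
static void toy_pivot(struct toy_cpu *cpu, size_t fake_stack)
{
    cpu->sp = fake_stack;
}
```

After toy_pivot, every toy_ret reads its target from the fake stack, so the sequence of addresses in that buffer decides what runs next. That is exactly the property ROP chains depend on.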

Here are some good blogs that briefly explain what stack pivoting is, how to pivot a stack, and how it is used in attacks (e.g. ROP).
http://neilscomputerblog.blogspot.com/2012/06/stack-pivoting.html
http://blogs.mcafee.com/mcafee-labs/emerging-stack-pivoting-exploits-bypass-common-security
http://neilscomputerblog.blogspot.com/2013/04/rop-return-oriented-programming.html

#SS (Stack Fault Exception)
In the x86/Intel processor architecture, exception vector 12 is assigned to the #SS fault. Several conditions can raise a #SS fault. One of them, according to the IA32 architecture manual, is a limit violation, described as follows:

A limit violation is detected during an operation that refers to the SS register. Operations that can cause a limit violation include stack-oriented instructions such as POP, PUSH, CALL, RET, IRET, ENTER, and LEAVE, as well as other memory references which implicitly or explicitly use the SS register (for example, MOV AX, [BP+6] or MOV AX, SS:[EAX+6]). The ENTER instruction generates this exception when there is not enough stack space for allocating local variables.

So, basically, the processor checks the stack base and limit values whenever it executes a stack-oriented instruction. If the referenced stack address is out of range (as indicated by the base/limit values in the SS register, see picture below), a #SS fault is generated.
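
As a rough model of that check, the sketch below tests whether an access of a given size fits inside an expand-up segment, where valid offsets run from 0 to the limit inclusive. The function name is made up, and the model deliberately ignores expand-down segments (where the valid range is the other side of the limit) and the other conditions that can raise #SS.

```c
#include <stdbool.h>
#include <stdint.h>

/* Expand-up segment: offsets 0..limit (inclusive) are valid.
   Returns true when all bytes [offset, offset + size - 1] are in range. */
static bool ss_access_ok(uint32_t limit, uint32_t offset, uint32_t size)
{
    if (size == 0)
        return true;
    /* Written to avoid overflow in offset + size - 1. */
    return offset <= limit && (size - 1) <= limit - offset;
}
```

With a limit of 0xFFFF, a 4-byte pop at offset 0xFFFC passes, while the same pop at offset 0xFFFD crosses the limit; on real hardware that second access is the one that would fault.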

However, please note that this limit violation only applies to 32-bit processor mode. I will talk about this later.

Segment Register (SS)
Every segment register, including SS, has a “visible” part and a “hidden” part (see below). The hidden part is sometimes referred to as a “descriptor cache” or a “shadow register”.

According to the IA32 architecture, when a segment selector is loaded into the visible part of a segment register, the processor also loads the hidden part of the segment register with the base address, segment limit, and access control information from the segment descriptor (see next section) pointed to by the segment selector. The information cached in the segment register (visible and hidden) allows the processor to translate addresses without taking extra bus cycles to read the base address and limit from the segment descriptor.

Segment Descriptor
A segment descriptor (see picture below) is a data structure in a GDT or LDT that provides the processor with the size and location (e.g. base/limit) of a segment, as well as access control and status information.
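
For readers without the picture, the base and limit are scattered across the 8-byte descriptor. The sketch below pulls them back out, following the field layout in the Intel manual: limit bits 15:0 in bits 0-15, base bits 23:0 in bits 16-39, limit bits 19:16 in bits 48-51, the granularity (G) flag in bit 55, and base bits 31:24 in bits 56-63. The function names are my own, and the limit calculation assumes an expand-up segment.

```c
#include <stdint.h>

/* Reassemble the 32-bit base address from its two pieces. */
static uint32_t desc_base(uint64_t d)
{
    uint32_t low  = (uint32_t)((d >> 16) & 0xFFFFFFu);  /* base 23:0  */
    uint32_t high = (uint32_t)((d >> 56) & 0xFFu);      /* base 31:24 */
    return low | (high << 24);
}

/* Reassemble the 20-bit limit and scale it to bytes when G = 1. */
static uint32_t desc_limit(uint64_t d)
{
    uint32_t raw = (uint32_t)(d & 0xFFFFu)                  /* limit 15:0  */
                 | ((uint32_t)((d >> 48) & 0xFu) << 16);    /* limit 19:16 */
    uint32_t granular = (uint32_t)((d >> 55) & 1u);
    /* With G = 1 the limit counts 4 KiB units; the low 12 bits are all ones. */
    return granular ? ((raw << 12) | 0xFFFu) : raw;
}
```

For example, the flat user data descriptor 0x00CFF3000000FFFF decodes to base 0 and limit 0xFFFFFFFF, which is the 0~4G range mentioned earlier.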

The segment descriptor is pointed to by the corresponding segment selector. For example, a stack segment descriptor is referenced by the SS selector, and normally the OS uses different SS selectors for the kernel and for applications.

As indicated in the last section, the "hidden" part of a segment register is loaded from the corresponding segment descriptor (in the GDT, which resides in RAM). However, it is software's responsibility to reload the segment registers when the segment descriptor tables are modified (e.g. when the base and/or limit values are changed). If this is not done, an old segment descriptor cached in a segment register might still be used after its memory-resident version (the segment descriptor in the GDT) has been modified.
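
The stale-cache behavior is easy to model in software. The sketch below is a simplification under my own assumptions: the descriptor is reduced to a base and a limit, only one table (the GDT) exists, and the type and function names are invented.

```c
#include <stdint.h>

/* Simplified descriptor: only the fields that matter for the limit check. */
struct descriptor {
    uint32_t base;
    uint32_t limit;
};

struct segment_register {
    uint16_t selector;          /* "visible" part                        */
    struct descriptor hidden;   /* "hidden" part: cached descriptor copy */
};

/* Model of a segment load: copy the descriptor into the hidden part.
   Selector bits 15..3 are the table index (bits 2..0 are TI and RPL). */
static void load_segment(struct segment_register *sr,
                         const struct descriptor *gdt, uint16_t selector)
{
    sr->selector = selector;
    sr->hidden = gdt[selector >> 3];
}
```

If the table entry is changed after load_segment has run, sr->hidden keeps the old base and limit until load_segment is called again. In this model that second call plays the role of the reload that software is responsible for.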

So, when the OS modifies the stack base/limit in the SS segment descriptor for a particular thread, it must reload the corresponding SS segment register. According to the x86/Intel architecture, two kinds of load instructions are provided for loading the segment registers:

Direct load instructions such as the MOV, POP, LSS instructions. These instructions explicitly reference the segment registers.
Implicit load instructions such as the far pointer versions of the CALL, JMP, and RET instructions, the SYSENTER and SYSEXIT instructions, and the IRET, INTn, INTO and INT3 instructions. These instructions change the contents of the SS register (and sometimes other segment registers) as an incidental part of their operation.

OS Implementation
To simplify the discussion, I take a user mode application as the example for stack pivoting detection.

Normally, the OS allocates a unique stack space for each user mode thread. We can change the thread scheduler so that, as part of thread context switching, it modifies the stack base/limit values in the SS segment descriptor (in the GDT) pointed to by the user mode SS selector.

When that user mode thread starts to execute in user mode after the stack is switched from kernel to user, the base/limit values in RAM are automatically reloaded into the "hidden" part of the SS segment register.

Then, if an attack initiated by a stack pivot moves the user mode stack address (ESP) outside the defined range (the base/limit in the "hidden" part of the SS segment register), the processor generates a #SS fault (limit violation exception), and anti-malware software can use that fault to detect the attack.
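
A context-switch hook along these lines might look like the sketch below. Everything named here is hypothetical (the struct, the function names, the idea of passing the GDT as an array), and it is not taken from any real kernel. It builds a page-granular, expand-up, DPL 3 writable data descriptor (access byte 0xF3, flags G=1 and D/B=1) that covers the incoming thread's stack, and writes it into the GDT slot used by the user mode SS selector. It assumes the stack is contiguous, and it does not address how a nonzero SS base interacts with code that expects SS and DS to share the same base.

```c
#include <stdint.h>

/* Hypothetical per-thread record of where the user stack lives. */
struct thread_stack {
    uint32_t low_address;   /* lowest address of the stack */
    uint32_t size_bytes;    /* size of the stack in bytes   */
};

/* Build an expand-up user data descriptor covering [base, base + size). */
static uint64_t make_user_stack_descriptor(uint32_t base, uint32_t size_bytes)
{
    uint64_t pages = ((uint64_t)size_bytes + 0xFFFu) >> 12; /* round up to 4 KiB */
    if (pages == 0)
        pages = 1;                                /* never encode an empty range */
    uint32_t limit = (uint32_t)(pages - 1);       /* limit field is pages - 1    */

    return (uint64_t)(limit & 0xFFFFu)                  /* limit 15:0       */
         | ((uint64_t)(base & 0xFFFFFFu) << 16)         /* base 23:0        */
         | ((uint64_t)0xF3u << 40)                      /* access byte      */
         | ((uint64_t)((limit >> 16) & 0xFu) << 48)     /* limit 19:16      */
         | ((uint64_t)0xCu << 52)                       /* flags: G, D/B    */
         | ((uint64_t)(base >> 24) << 56);              /* base 31:24       */
}

/* Called by the (hypothetical) scheduler before returning to user mode. */
static void on_switch_to_thread(uint64_t *gdt, unsigned user_ss_index,
                                const struct thread_stack *next)
{
    gdt[user_ss_index] =
        make_user_stack_descriptor(next->low_address, next->size_bytes);
}
```

The hook only updates the in-memory descriptor. The new values take effect once SS is reloaded on the way back to user mode, as described above.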

Limitations

One of the big problems is that we cannot apply this solution to x86/Intel 64-bit processor mode. This is because the SS (and DS/ES) segment registers are not used in 64-bit mode: their fields (base, limit, and attribute) in the segment descriptor in the GDT are ignored. Address calculations that reference the ES, DS, or SS segments are treated as if the segment base is zero. So the #SS exception due to a "limit violation" cannot be generated.
Because the SS segment descriptor is located in kernel memory space, an application cannot modify it directly from user mode. Hence, this solution cannot be applied to user mode threads. One example is Microsoft UMS (User-Mode Scheduling), a lightweight mechanism that applications can use to schedule their own threads. An application can switch between UMS threads in user mode without involving the system scheduler. For details, please see the link
http://msdn.microsoft.com/en-us/library/windows/desktop/dd627187(v=vs.85).aspx Note that this feature is not available on 32-bit versions of Windows :)
It requires extra changes to the thread scheduler (as part of context switching) in a 32-bit OS, but the change is minimal; please see above.
It assumes that the thread stack is virtually contiguous in the address space, so that the base/limit checks can apply.
It cannot detect a stack pivot to another memory location that is also part of the stack (that is, still within the base/limit range).

References:
Intel IA32 architecture software development manual:
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html

Transparent ROP Detection using CPU Performance Counters:
https://www.trailofbits.com/threads/2014/transparent_rop_detection_using_cpu_perfcounters.pdf

Defeating Windows 8 ROP Mitigation:
http://vulnfactory.org/blog/2011/09/21/defeating-windows-8-rop-mitigation/

Friday, January 16, 2015