Untitled

SEGMENTED STACKS FOR LLVM
=========================

This describes the implementation details for my GSoC proposal, on implementing segmented stacks for LLVM. I will primarily be working on the X86 platform in the duration of my GSoC work.


Blocks
------

Stack space will allocate in fixed blocks. Blocks are linked doubly to form a list. Blocks are never freed, and are reused when possible. Each frame is contained completely inside _one_ block. The converse is not true.

Each block has a header containing, apart from the next and previous pointers, the value of the stack pointer and the instruction pointer before this new block was created. The previous stack pointer needs to be stored so that it can be correctly restored once the block is no longer a part of the stack. The IP needs to be stored since the "real" return address on the stack (i.e. the one that will be popped) will be hijacked to perform cleanup (mentioned later).

Block sizes need to be a constant power of two, so that checking for available space is cheap.


Prologue & Cleanup
------------------

Since every stack frame must completely fit inside one block, the current block needs to have enough space to hold the current function's stack frame. This will be done in the prologue (X86FrameLowering::emitPrologue). If the function uses a constant amount of stack space, that value (known at compile-time) will be used. Otherwise some worst-case upper limit (8 KiB, say) is chosen as the "size" of the stack-frame.

If the current block does not have enough space, control branches to a stub of code (a naked function, setup_block) which

 1. Allocates a new block if required (may not always be needed, since we're keeping never de-allocating old blocks).
 2. Copies everything addressed SP relative to the new block. Adjusts the stack pointer.
 3. Saves the real return address in the block header. Modifies the return address on the stack to point to a cleanup routine, destroy_block.
 4. "Returns" to the original function prologue. Since we can't really use the stack to communicate the return address, one of the callee-clobbered registers will be used.

The destroy_block stub restores the stack pointer and the instruction pointer from the current block header. Since the block sizes are constant and a power of two, it is always possible to tell the header and the space left in the current block from the current stack pointer.


Link Time Optimizations
-----------------------

Checking if there is enough space on the current block and making the conditional jump is expensive, and should be avoided whenever possible. One paper on this topic (Whole-Program Optimization for Time and Space Efficient Threads) talks about using the call graph to reason about the (upper-bound on) stack space required by various functions. For instance, if the functions F, G and H (each requiring f, g and h bytes of stack space, respectively) is dominated by the function A (which requires a bytes of stack space) in the call graph, we only need to allocate (a + MAX(f, g, h)) bytes of stack space in A's prologue and then eliminate the check in F, G and H.

This can implemented by dividing the PEI pass into an analyze and an action pass. The analyze pass just calls calculateCallsInformation for each MachineFunction. After this, if we are compiling with split-stacks (i.e. the target supports it, and it is enabled), we run a CallGraphSCCPass (InferGlobalStackSpace, say) which implements the algorithm mentioned in the paper. The actual insertion of the prologue and epilogue code is delegated to a third pass. This (third) pass uses the data collected by InferGlobalStackSpace to insert appropriately modified prologues (the epilogue remains the same).


Issues
------

Dividing up the stack into a list of smaller blocks violates one important constraint: stack addresses are no longer continuous between function invocations. This can affect a number of things, all of which I've tried to address:


Unwinding
~~~~~~~~~

Unwinding should not be a problem if the .eh_frame sections are emitted carefully. Correct DWARF info can be emitted as follows:

Since we know the offset of base of the stack frame from the stack pointer (or we are maintaining a frame pointer), we can always say whether the concerned frame is the first frame in the block. If not, all the previous register values can be computed as usual. Otherwise, we can add an extra indirection, involving looking up the stack pointer saved in this block's header.


Frame Pointers
~~~~~~~~~~~~~~

Nothing changes if the function call does not have to create a new block, even if the function maintains a frame pointer. So all the changes can be localized in setup_block.

However, in case the frame pointer is being maintained for the function, and a new block is being allocated, the value at %rbp will also have to be copied to the new block, and %rbp will have to be changed to point to this new location. This is so that the FP relative indexing works. %rbp will then be set to its "correct" value by the epilogue itself. We'll probably need a new routine (setup_block_fp) for this situation. Since we're resetting the stack pointer in destroy_block, the epilogue is free to clobber %rsp with the "wrong" value of %rbp.


Stack Space For Block Setup
~~~~~~~~~~~~~~~~~~~~~~~~~~~

setup_block and setup_block_fp will be written to use the minimum amount of stack space possible, and /that/ stack space will have to be factored into the check in the function prologue. So every function prologue checks for the space it needs _plus_ the space required to run setup_block (or setup_block_fp) once, by a callee.


Red Zone
~~~~~~~~

The red zone can be usable only if there actually _is_ 128 bytes of stack space left in the current block. One way the red-zone can be profitably used is to always add 128 bytes to the stack space required by a function. Assuming setup_block is written carefully (and needs < 128 bytes of stack space), this should also take care of the previous point.


Varargs
~~~~~~~

When we create a new block, and are copying all the data that is addressed SP relative, we will not copy the arguments for a vararg function. We will copy the va_list structure, however, and since the fields point (correctly) to locations on the previous block anyways, accessing the arguments should work fine.


Interoperability
~~~~~~~~~~~~~~~~

We can't safely make calls from code using segmented-stacks to code expecting a vanilla contiguous stack. This can be fixed by the linker by padding calls into non-segmented-stack code from segmented-stack code with allocation and de-allocation of a 64 k block. For the linker to be able to do this, the compiler needs to emit a dummy section into object files compiled with segmented stacks. The linker will then use as a flag. Of course, no such thing needs to be done when linking bc files, since the native code has not been generated yet.

This is exactly the approach mentioned in [1]. However, this approach is suboptimal, and to get full benefits of segmented stacks, all code needs to be compiled with segmented stacks.

[1] http://gcc.gnu.org/wiki/SplitStacks