\documentclass[11pt]{article}
\usepackage[T1]{fontenc}
\usepackage{amsmath, amsfonts, amssymb, amsthm, url, algorithmicx, algpseudocode, lmodern, color, graphicx}
\providecommand{\floor}[1]{\left \lfloor #1 \right \rfloor }
\providecommand{\NOTE}[1]{ \begin{center} \fbox{\parbox{\textwidth}{\textbf{\textit{#1}}}} \end{center} }
\newcommand{\emod}{\ \mathrm{mod}\ }
\newcommand{\ralloc}{\emph{ralloc} }

\title{Ralloc -- An Efficient One-Fits-All General Memory Allocator}
\author{Ticki\thanks{Thanks to Thomas Levy for helping with the implementation of \ralloc.}}
\date{\today}

\begin{document}

%%% DISCLAIMER %%%
\begin{titlepage} \centering \huge\bfseries The following document is an incomplete draft. \end{titlepage}

%%% START OF DOCUMENT %%%
\maketitle

\begin{abstract}
We present \ralloc, a novel memory allocation algorithm. In contrast to many other memory allocators, which are based on queues of similarly sized blocks, \ralloc is built on a data structure holding blocks of arbitrary size, eliminating internal fragmentation while preserving high performance. Furthermore, we show how this structure can be thread cached to enable high performance under massive multi-threading.
\end{abstract}

\section{Introduction}

Memory allocators are a core component of modern software. The heap is crucial for storing dynamically sized objects, which naturally implies the need for high performance. Unfortunately, modern memory allocators tend to be complex simply to achieve good performance. Complexity has its downsides, from maintainability to security. The core algorithm of \ralloc is relatively simple, and a basic implementation requires only 1000 CLOC.

This project was originally born out of the need for a general-purpose userspace memory allocator for Redox -- The Rust Operating-System\cite{redoxweb}, an OS in which all the core primitives are written in the Rust programming language\cite{rustweb}.
This programming language utilizes extensive static analysis methods, which we use throughout our implementation.

The primary concern of a good allocator is the trade-off between memory usage and performance. \ralloc performs well in terms of memory usage, and demonstrably so in performance as well.

The approach \ralloc takes is radically different from popular memory allocators such as slab\cite{bonwick94}, slob\cite{mackall05}, and buddy allocation\cite{knowlton65}. In particular, we present a data structure which keeps track of the longest possible continuous free segments. We represent this efficiently in a tree-like structure, allowing us to query for blocks satisfying some condition.

%%% CORE %%%
\section{Blocks}

`Blocks' are the core primitive of \ralloc. Strictly speaking, they are just fat pointers\footnote{i.e. a pointer equipped with a size.} with some additional invariants (disjointness and validity), which are maintained through the affine type system of Rust. A block represents some segment of memory and supports certain operations which preserve the invariants specified above.

\begin{description}
\item [split(block, position)] Split the block at some position.
\item [merge(block\textsubscript{1}, block\textsubscript{2})] Merge two blocks.
\item [size(block)] Get the size of a block.
\item [align(align, block)] Split the block to align it to some specified alignment.
\end{description}

\subsection{Memory alignment}

The last of these operations is of particular interest. All alignment padding functions satisfy $$ \mathtt{aligner}(b) \equiv -b \pmod A $$ The simplest one is $$ \mathtt{aligner}(b) = -b \emod A $$ Since $ \mathtt{aligner}(b) < A $, it tends to leave small, unusable blocks behind, which increases external fragmentation.
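As a concrete illustration, the simple padding function can be sketched in Rust as follows (the function name is our own for illustration; the alignment $A$ is assumed to be a power of two, as is conventional for allocators):

```rust
// A minimal sketch of the simple alignment-padding function: for a block
// starting at address `b`, the padding `(-b) mod A` moves the start up to
// the next multiple of the alignment `A`. The name `aligner` and the
// address-based model are illustrative assumptions, not ralloc's API.
fn aligner(b: usize, a: usize) -> usize {
    // (-b) mod a, computed with wrapping unsigned arithmetic; this is
    // exact when `a` is a power of two (so that `a` divides 2^64).
    b.wrapping_neg() % a
}

fn main() {
    // A block starting at address 13 with 8-byte alignment needs 3 bytes
    // of padding; the resulting 3-byte remainder block is too small to
    // serve most requests, i.e. it becomes external fragmentation.
    assert_eq!(aligner(13, 8), 3);
    // An already aligned block needs no padding.
    assert_eq!(aligner(16, 8), 0);
    println!("ok");
}
```
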
For this reason, we use a slightly more complex algorithm: $$ \mathtt{aligner}(b) = \begin{cases} b & \mbox{if } A \mid b \\ \floor{\frac{\mathtt{size}(b)}{2}} - (\floor{\frac{\mathtt{size}(b)}{2}} \emod A) & \mbox{if } \mathtt{size}(b) \geq 2A \\ -b \emod A & \mbox{otherwise} \end{cases} $$ By observation, this tends to reduce external fragmentation by an order of magnitude.

\section{Memory bookkeeping}

\NOTE{This section was revised in multiple turns and is thus sketchy. Rewrite pending.}

In contrast to other interval data structures, we are not concerned with overlaps, but rather with querying for blocks of some size. \ralloc is based on a data structure in which every block is maximized such that it marks the longest possible continuous segment; in other words, we merge adjacent blocks. The list of blocks is stored in a data structure resembling the Skip Lists first described in \cite{pugh89}, with $ p = \frac{1}{2} $.

\subsection{Efficiently querying blocks of some size or more}

A linear search through the list for a fitting block is rather slow, especially in the case of high fragmentation. To deal with this problem, we track the biggest skipped block. We call this block ``the fat block of a shortcut''. To update this value, we merely recurse over the levels. Note that the average skip will have $ p^{-1} = 2 $ children\footnote{We use the term ``child'' due to the analogy to binary trees.}.

\subsection{Insertion}

To preserve the invariant that every block is ``as long as possible'', i.e. never adjacent to a neighbouring block, we need to handle the case of adjacent insertion. Let $ b $ be the block in question; then

\begin{description} \item [Double adjacent] If $ b $ has adjacent blocks $ a, c \in L $, remove $ c $ and set $$ a \gets \mathtt{merge}(\mathtt{merge}(a, b), c). $$ \item [Left adjacent] Else if $ a \in L $ is left adjacent to $ b $, set $$ a \gets \mathtt{merge}(a, b). 
$$ \item [Right adjacent] Else if $ c \in L $ is right adjacent to $ b $, set $$ c \gets \mathtt{merge}(b, c). $$ \item [No adjacent] Continue with conventional Skip List insertion. \end{description}

A simple trick can improve performance by avoiding backtracking to update the fat blocks in certain cases. When the pointer is overshot, set $$ f \gets \max(f, \mathtt{size}(b)) $$ where $ f $ is the size of the biggest skipped node, and $ b $ is the block we are inserting. Assume that merging was not possible; in that case, we do not need backtracking. Consider the configuration,

\begin{figure} \caption{An example configuration.} \NOTE{This diagram is totally useless right now. Improve it.} \begin{center} \scalebox{1.5}{ $ x \xrightarrow{a \xrightarrow{b \xrightarrow{c \longrightarrow d} e} f \xrightarrow{g \xrightarrow{h} i} j} y $ } \end{center} \end{figure}

% TODO: Use algorithmic to specify the algorithm.

\subsection{Cache efficiency}

Cache efficiency is a major concern for the performance of pointer-based structures. In trees and linked lists, where each node contains a pointer to its children or successor, cache misses can damage performance severely. To avoid this, we use a SLOB arena, which is a block we allocate and break down into a linked list. This ensures that the nodes are spread across only a few cache lines.

\subsection{Metacircular allocation}

The structure maintains its own buffer. Despite the complications this leads to, it turns out to be faster than simply allocating the buffer through the source allocator (which we will explain in the next section). The main complication of this approach is that replacing the internal buffer itself needs allocation, and can thus result in unbounded recursion. To solve this, we maintain the arena such that it holds $ n $ more blocks than needed whenever reallocation is inactive ($ n $ is implementation defined). Whenever reallocation is triggered (i.e. 
when the invariant is broken), a flag is set, which disables the maintenance of said invariant; after the reallocation is over, we reset the flag.

\section{Avoiding locks}

Locking a global lock can hurt the performance of an allocator severely. \cite{manghwani05} analyze two different thread-caching methods. Unfortunately, neither of them is suited for an arbitrary-size allocator, so we modify LLAlloc's approach.

\ralloc's ``source allocator'' is generic, and we exploit that for thread caching. In particular, we associate a local allocator with each thread. All \ralloc bookkeeping is done in this allocator. When the first \ralloc call of a thread is made, the allocator is initialized and fueled by memory requested from the global allocator (shared across threads and hidden behind a lock). Then a thread destructor is set: when the thread exits, the memory of the local allocator is freed to the global allocator. Whenever the local allocator is unable to serve the requested memory, the global allocator is locked and used as the source allocator.

\subsection{Local-to-global memory trimming}

Two new problems arise when introducing this form of thread caching:

\begin{enumerate}
\item There is increased external fragmentation in the case of cross-thread frees, due to one thread getting the allocated buffer and another getting the adjacent blocks.
\item If one thread allocates a large buffer and later frees it, it ends up keeping memory inaccessible to other threads, which could otherwise make use of it.
\end{enumerate}

To counter these problems, we implement memory trimming from the local to the global allocator. What this essentially means is that a memory trimming routine, which frees blocks to the global allocator until some condition is met, is triggered when the allocator hits some memory limit or some fragmentation limit.
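A minimal sketch of such a trimming routine, modelling blocks simply by their sizes (the names and the byte-count model are illustrative assumptions, not \ralloc's actual API; the concrete trigger conditions follow below):

```rust
// A self-contained sketch of local-to-global memory trimming. Blocks are
// modelled as their sizes in bytes; pushing a size onto `freed` stands in
// for handing the block back to the global allocator.

/// Pop blocks from the local allocator until the total number of bytes
/// held locally drops below `target`, returning the freed block sizes.
fn trim(local_blocks: &mut Vec<usize>, target: usize) -> Vec<usize> {
    let mut freed = Vec::new();
    while local_blocks.iter().sum::<usize>() >= target {
        match local_blocks.pop() {
            Some(block) => freed.push(block), // "free to global allocator"
            None => break,
        }
    }
    freed
}

fn main() {
    // 960 bytes held locally; trimming to a 300-byte target frees the
    // last two blocks (512 and 256), leaving 192 bytes.
    let mut local = vec![64, 128, 256, 512];
    let freed = trim(&mut local, 300);
    assert_eq!(freed, vec![512, 256]);
    assert_eq!(local.iter().sum::<usize>(), 192);
    println!("ok");
}
```
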
We define a local allocator to have ``high fragmentation'' whenever\footnote{This is a rather simplistic measure, and will be improved in the future.} $$ t < cb $$ with $ t $ being the total number of bytes contained in the allocator, $ b $ being the number of blocks, and $ c $ some constant. When this condition is met, or $ t > l $ for some constant $ l $, we free local blocks until $ t < s $ for some constant $ s $.

\section{OS memory trimming}

For long-running processes, memory peaks can be problematic if the memory is never returned to the operating system. To avoid retaining the maximal memory usage, we trim the global allocator similarly to how we trim the local allocators, with the exception of not measuring fragmentation. When OS memory trimming is triggered, we pop the last block and, if it is big enough to be worth trimming, we free it to the system. Since blocks in the memory bookkeeper represent the longest possible continuous segments, we can memory trim at most one block.

\section{Canonicalization}

Both in the case of system source allocation (for which we use BRK) and global source allocation, we are interested in reserving more memory than actually needed, to avoid the overhead of repeated calls. Every request to the source allocator is canonicalized by $$ f_s(x) = x + \max(k_1, \min(k_2 x, k_3)) $$ To explain our reasoning: we essentially set a minimum extra space ($ k_1 $), and then define the excessive space as a multiple ($ k_2 $) of the requested space, bounded by some maximum extra space ($ k_3 $). BRK calls are especially expensive (they can even trigger remapping of the process), so they need even stronger canonicalization: we apply the same function, but with stronger constants.

\section{Partial frees}

An interesting implication of our approach is the possibility of ``partial frees'', in which \emph{a part} of a buffer is deallocated. This is not allowed in ``classical allocators'', but is useful in some scenarios.
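To make the idea concrete, a partial free can be sketched as splitting a block into the freed sub-range and up to two kept pieces (the representation, names, and three-way split here are illustrative assumptions, not \ralloc's actual API):

```rust
// A minimal sketch of a "partial free": deallocating a sub-range of a
// buffer. Blocks are modelled as (start address, size) pairs.

type Block = (usize, usize); // (start address, size in bytes)

/// Split `block` so that the sub-range of length `len` starting at
/// `offset` can be returned to the allocator. Yields the freed block and
/// the (up to two) pieces the caller keeps.
fn partial_free(block: Block, offset: usize, len: usize) -> (Block, Vec<Block>) {
    let (start, size) = block;
    assert!(offset + len <= size, "sub-range must lie inside the block");
    let freed = (start + offset, len);
    let mut kept = Vec::new();
    if offset > 0 {
        kept.push((start, offset)); // piece before the freed range
    }
    if offset + len < size {
        kept.push((start + offset + len, size - offset - len)); // piece after
    }
    (freed, kept)
}

fn main() {
    // Free the middle 100 bytes of a 400-byte buffer starting at 1000.
    let (freed, kept) = partial_free((1000, 400), 150, 100);
    assert_eq!(freed, (1150, 100));
    assert_eq!(kept, vec![(1000, 150), (1250, 150)]);
    println!("ok");
}
```

Because the bookkeeper stores arbitrary-size blocks, the freed sub-range can be inserted back like any other block, merging with neighbours as usual.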
\section{Adding libc malloc compatibility}

libc's allocation API diverges from Rust's in one major way: in libc, the only information provided upon reallocation and free is the pointer to the buffer. In Rust, however, the size of the allocation is provided as well\cite{crichton15}. This leads to interesting possibilities, mainly concerning the reduction (or even complete removal -- as in \ralloc's case) of block metadata. The drawback is incompatibility.

To solve this incompatibility, when the allocation is done through the libc API, we encode the buffer size into a metadata block preceding the buffer. This metadata block is unaligned, the block's original alignment being unknown. Before the metadata block, we place an alignment padding block, which serves to align the allocated buffer. When free or reallocation is invoked, we can obtain the size of the buffer by simply reading this metadata block. This allows us to translate the libc API calls into Rust-like allocation API calls.

%%% IMPLEMENTATION/PRACTICAL CONSIDERATION %%%
\section{Implementation}

\ralloc is a part of the Redox OS ecosystem and is thus, like all other components, written in the Rust programming language. Rust puts great emphasis on multithreading, and so does \ralloc. Rust deploys a great deal of static analysis, which we naturally use to improve the correctness of \ralloc. The feature we use the most is affine types, which themselves are a form of substructural types where passing a value ``moves'' it out of scope. This allows us to effectively reason about issues such as use-after-free and illegal aliasing.

Our implementation is written as semi-literate programming, and more than half of the lines of code are API documentation or comments. This is a deliberate choice: it forces the programmer to think about and explain why the algorithm works, hopefully uncovering bugs in the process.

\subsection{Debugging}

Our implementation deploys extensive debug checks.
Every invariant is carefully checked in ``debugging mode''; this allows for efficiently uncovering subtle bugs. Invariants which are relied on for safety guarantees are checked through assertions, which are tested even in non-debug mode. Disassembling shows that the vast majority of these are optimized out or have no impact on performance\footnote{i.e. no observable difference is seen.}. For a downstream user who wishes to debug their program, we make configurable API calls to the debugger, informing it about freed and uninitialized memory.

\subsection{Logging}

Everything is (optionally) logged to a configurable target. The log is separated into multiple ``log levels'', an idea used in many modern logging systems.

\subsection{Configuration}

Configuration is handled by linking to a shim library, which provides constants and functions for binding and configuring \ralloc. This approach saves \ralloc from the problems of \texttt{mallopt}\footnote{The problems are mainly caused by global configuration, in which querying a parameter requires locking a global lock.}. The downside is that \ralloc is no longer configurable at runtime. This can be fixed by replacing the default shim implementation with a custom one.

\section{Measuring performance}

\section{Limitations}

%%% CLOSING WORDS %%%
\section{Future work}

We are interested in investigating storing the original thread of an allocated buffer, in order to be able to return the buffer to that thread and reduce external fragmentation. The thread would be represented by a pointer to some concurrent channel, which is used to pass over the block in the case of cross-thread drops. Naturally, to avoid alignment issues, we place this pointer after the buffer. This has the positive side effect of revealing buffer overruns by damaging the pointer and likely producing a segmentation fault.

\section{Thanks to}

Special thanks to Thomas Levy, who helped implement and develop \ralloc.
%%% BIBLIOGRAPHY %%%
\begin{thebibliography}{9}
\bibitem{redoxweb} \url{http://redox-os.org}, retrieved \today.
\bibitem{rustweb} \url{http://rust-lang.org}, retrieved \today.
\bibitem{bonwick94} Jeff Bonwick, \emph{The Slab Allocator: An Object-Caching Kernel Memory Allocator}, 1994.
\bibitem{mackall05} Matt Mackall, ``slob: introduce the SLOB allocator'', Linux kernel mailing list, 2005.
\bibitem{knowlton65} Kenneth C. Knowlton, \emph{A Fast Storage Allocator}, 1965.
\bibitem{pugh89} William Pugh, \emph{Skip Lists: A Probabilistic Alternative to Balanced Trees}, 1989.
\bibitem{manghwani05} Rahul Manghwani and Tao He, \emph{Scalable Memory Allocation}, 2005.
\bibitem{crichton15} Alex Crichton, ``RFC: Allow changing the default allocator'', 2015.
\end{thebibliography}

\end{document}