\documentclass[11pt]{article}
\usepackage[T1]{fontenc}
\usepackage{amsmath, amsfonts, amssymb, amsthm, url, algorithmicx,
algpseudocode, lmodern, color, graphicx}
\providecommand{\floor}[1]{\left \lfloor #1 \right \rfloor }
\providecommand{\NOTE}[1]{
\begin{center}
\fbox{\parbox{\textwidth}{\textbf{\textit{#1}}}}
\end{center}
}
\newcommand{\emod}{\mathbin{\mathrm{mod}}}
\newcommand{\ralloc}{\emph{ralloc} }
\title{Ralloc -- An Efficient One-Fits-All General Memory Allocator}
\author{Ticki\thanks{Thanks to Thomas Levy for helping with the implementation of \ralloc.}}
\date{\today}
\begin{document}
%%% DISCLAIMER %%%
\begin{titlepage}
\centering \huge\bfseries The following document is an incomplete draft.
\end{titlepage}
%%% START OF DOCUMENT %%%
\maketitle
\begin{abstract}
We present \ralloc, a novel memory allocation algorithm. In contrast to
many other memory allocators, which are based on queues of similarly
sized blocks, \ralloc is built on a data structure holding blocks of
arbitrary size, eliminating internal fragmentation while preserving
high performance. Furthermore, we show how this design can be thread
cached to sustain high performance under massive multi-threading.
\end{abstract}
\section{Introduction}
Memory allocators are a core component of modern software. The heap is
crucial for storing dynamically sized objects, which naturally implies
the need for high performance.
Unfortunately, modern memory allocators tend to be complex simply to
achieve good performance. Complexity has its downsides, from
maintainability to security. The core algorithm of \ralloc is relatively
simple, and a basic implementation requires only 1000 CLOC.
This project was originally born out of the need for a general-purpose
userspace memory allocator for Redox -- The Rust Operating
System\cite{redoxweb}, an OS in which all the core primitives are
written in the Rust programming language\cite{rustweb}. This programming
language utilizes extensive static analysis methods, which we use throughout
our implementation.
The primary concern of a good allocator is the trade-off between memory
usage and performance. \ralloc performs well in terms of memory usage,
and demonstrably so in performance as well.
The approach \ralloc takes is radically different from popular memory
allocators such as slab\cite{bonwick94}, slob\cite{mackall05}, and buddy
allocation\cite{knowlton65}. In particular, we present a data structure
which keeps track of the longest possible continuous free segments. We
represent this efficiently in a tree-like structure, allowing us to query
for blocks satisfying some condition.
%%% CORE %%%
\section{Blocks}
`Blocks' are the core primitive of \ralloc. Strictly speaking, they are just
fat pointers\footnote{i.e. a pointer equipped with a size.} with some
additional invariants (disjointness and validity), which are maintained
through the affine type system of Rust.
A block represents some segment of memory and supports certain operations,
listed below, which preserve the invariants specified above.
\begin{description}
\item [split(block, position)] Split the block at some position.
\item [merge(block\textsubscript{1}, block\textsubscript{2})] Merge two
adjacent blocks.
\item [size(block)] Get the size of a block.
\item [align(align, block)] Split the block to align to some specified
alignment.
\end{description}
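To make this interface concrete, below is a minimal Rust sketch of such a
block primitive. The names and representation are illustrative, not
\ralloc's actual definitions, and error handling is simplified.
\begin{verbatim}
/// A block: a pointer equipped with a size (a fat pointer). Owning a
/// `Block` means exclusive access to the underlying memory, so Rust's
/// affine types maintain the disjointness invariant.
pub struct Block {
    ptr: *mut u8,
    size: usize,
}

impl Block {
    /// Get the size of the block.
    pub fn size(&self) -> usize {
        self.size
    }

    /// Split the block at `pos`, consuming it and yielding two
    /// disjoint blocks covering the same segment.
    pub fn split(self, pos: usize) -> (Block, Block) {
        assert!(pos <= self.size, "split position out of bounds");
        (
            Block { ptr: self.ptr, size: pos },
            Block {
                ptr: unsafe { self.ptr.add(pos) },
                size: self.size - pos,
            },
        )
    }

    /// Merge two blocks, failing if they are not adjacent.
    pub fn merge(self, right: Block) -> Result<Block, (Block, Block)> {
        if unsafe { self.ptr.add(self.size) } == right.ptr {
            Ok(Block { ptr: self.ptr, size: self.size + right.size })
        } else {
            Err((self, right))
        }
    }
}
\end{verbatim}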
\subsection{Memory alignment}
The last operation, alignment, is of particular interest. Let $ b $ denote
the start address of the block. All alignment padding functions then
satisfy
$$
\mathtt{aligner}(b) \equiv -b \pmod A
$$
The simplest one is
$$
\mathtt{aligner}(b) = -b \emod A
$$
Since $ \mathtt{aligner}(b) < A $, it tends to leave small, unusable blocks
behind, which increases external fragmentation.
For this reason, we use a slightly more complex algorithm:
$$
\mathtt{aligner}(b) = \begin{cases}
0 & \mbox{if } A \mid b \\
\floor{\frac{\mathtt{size}(b)}{2}} - \left(\left(b + \floor{\frac{\mathtt{size}(b)}{2}}\right) \emod A\right) & \mbox{if } \mathtt{size}(b) \geq 2A \\
-b \emod A & \mbox{else}
\end{cases}
$$
By splitting large blocks near their middle, rounded so that the second
part starts at an aligned address, both remainders stay usable. By
observation, this tends to reduce external fragmentation by an order of
magnitude.
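A Rust sketch of this computation, assuming, as above, that $ b $ is the
block's start address and that the alignment is a power of two:
\begin{verbatim}
/// Padding to insert before address `b` so the result is
/// `align`-aligned; prefers splitting large blocks near the middle so
/// both remainders stay usable. `align` must be a power of two.
fn aligner(b: usize, size: usize, align: usize) -> usize {
    // -b mod A: the distance up to the next multiple of `align`.
    let simple = b.wrapping_neg() % align;
    if simple == 0 {
        // Already aligned: no padding needed.
        0
    } else if size >= 2 * align {
        // Split near the middle, rounded down so that the second
        // part starts at an aligned address.
        let mid = size / 2;
        mid - (b + mid) % align
    } else {
        simple
    }
}

fn main() {
    let (b, size, align) = (5, 64, 8);
    let pad = aligner(b, size, align);
    assert_eq!((b + pad) % align, 0); // second part is aligned
}
\end{verbatim}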
\section{Memory bookkeeping}
\NOTE{This section was revised in multiple turns and is thus sketchy. Rewrite pending.}
In contrast to other interval data structures, we are not concerned with
overlaps, but rather with querying blocks of some size. \ralloc is based on a
data structure in which every block is maximized such that it marks the
longest possible continuous segment; in other words, we merge adjacent
blocks.
The list of blocks is stored in a data structure that resembles skip lists as
first described in \cite{pugh89}, with $ p = \frac{1}{2} $.
\subsection{Efficiently querying blocks of some size or more}
Linear search through the list for a fitting block is rather slow,
especially in the case of high fragmentation. To deal with this problem, we
track the biggest skipped block. We call this block ``the fat block of a
shortcut''.
To update this value, one merely has to recurse over the levels. Note that
the average shortcut will have $ p^{-1} = 2 $ children\footnote{We use the term
``child'' due to the analogy to binary trees.}.
\subsection{Insertion}
To preserve the invariant that every block is ``as long as possible'', i.e.
is never adjacent to a neighbouring block, we need to handle the case of
adjacent insertion.
Let $ b $ be the block in question and $ L $ the list; then (see the sketch
after this list):
\begin{description}
\item [Double adjacent] If $ b $ has adjacent blocks $ (a, c) \in L $,
remove $ c $ and set $$ a \gets \mathtt{merge}(\mathtt{merge}(a, b), c). $$
\item [Left adjacent] Else if $ a \in L $ is left adjacent to $ b $, then
set $$ a \gets \mathtt{merge}(a, b). $$
\item [Right adjacent] Else if $ c \in L $ is right adjacent to $ b $, then
set $$ c \gets \mathtt{merge}(b, c). $$
\item [No adjacent] Continue conventional skip list insertion.
\end{description}
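A Rust sketch of these cases, representing blocks as illustrative
$(\mathtt{start}, \mathtt{size})$ pairs in an address-sorted vector rather
than \ralloc's actual skip list:
\begin{verbatim}
/// Insert block `b` into the address-sorted free list `free`, merging
/// with adjacent blocks so every entry stays "as long as possible".
fn insert(free: &mut Vec<(usize, usize)>, b: (usize, usize)) {
    // a is left adjacent if it ends where b starts; c is right
    // adjacent if it starts where b ends.
    let left = free.iter().position(|&(s, z)| s + z == b.0);
    let right = free.iter().position(|&(s, _)| s == b.0 + b.1);
    match (left, right) {
        // Double adjacent: a, b and c collapse into a single block.
        (Some(l), Some(r)) => {
            let grown = b.1 + free[r].1;
            free[l].1 += grown;
            free.remove(r);
        }
        // Left adjacent: extend a over b.
        (Some(l), None) => free[l].1 += b.1,
        // Right adjacent: extend c backwards over b.
        (None, Some(r)) => {
            free[r].0 = b.0;
            free[r].1 += b.1;
        }
        // No adjacent block: conventional (sorted) insertion.
        (None, None) => {
            let i = free.partition_point(|&(s, _)| s < b.0);
            free.insert(i, b);
        }
    }
}
\end{verbatim}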
A simple trick can improve performance by avoiding backtracking to update
the fat blocks in certain cases. When the pointer is overshot, set
$$
f \gets \max(f, \mathrm{size}(b))
$$
where $ f $ is the size of the biggest skipped node, and $ b $ is the block
we're inserting. Assuming that merging was not possible, no backtracking
is needed in that case.
Consider the configuration,
\begin{figure}
\caption{An example configuration.}
\NOTE{This diagram is totally useless right now. Improve it.}
\begin{center}
\scalebox{1.5}{
$ x \xrightarrow{a \xrightarrow{b \xrightarrow{c \longrightarrow d} e} f \xrightarrow{g \xrightarrow{h} i} j} y $
}
\end{center}
\end{figure}
% TODO: Use algorithmic to specify the algorithm.
\subsection{Cache efficiency}
Cache efficiency is a major concern for performance in pointer-based
structures. In trees and linked lists, where each node contains a
pointer to its children or successor, cache misses can damage performance
severely.
To avoid this, we use a SLOB arena: a block we allocate and break
down into a linked list of node slots. This ensures that the nodes are
spread across only a few cache lines, as sketched below.
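A minimal sketch of the arena idea, using indices in place of pointers for
brevity (the names are illustrative):
\begin{verbatim}
/// Carve one contiguous allocation into a free list of node slots, so
/// the bookkeeping nodes share as few cache lines as possible.
struct Slot {
    next: Option<usize>, // index-based link within the arena
}

fn carve(len: usize) -> Vec<Slot> {
    (0..len)
        .map(|i| Slot { next: if i + 1 < len { Some(i + 1) } else { None } })
        .collect()
}

fn main() {
    // One allocation backs all eight nodes; neighbouring slots sit on
    // the same cache lines instead of being scattered over the heap.
    let arena = carve(8);
    assert_eq!(arena[0].next, Some(1));
    assert_eq!(arena[7].next, None);
}
\end{verbatim}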
\subsection{Metacircular allocation}
The structure maintains its own buffer. Despite leading to complications,
this turns out to be faster than simply allocating the buffer through the
source allocator (which we will explain in the next section).
The main complication of this approach is the fact that replacing
the internal buffer itself requires allocation and can thus result in
unbounded recursion. To solve this, we maintain the invariant that the
arena holds $ n $ more blocks than needed whenever reallocation is inactive
($ n $ is implementation-defined).
Whenever reallocation is triggered (i.e. when the invariant is broken), a
flag is set, which disables the maintenance of said invariant; after the
reallocation is over, we reset the flag.
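A sketch of this recursion guard in Rust; \texttt{Bookkeeper} and the
counter-based bookkeeping are illustrative simplifications:
\begin{verbatim}
use std::cell::Cell;

/// The structure keeps `extra` spare node slots in reserve; while it
/// is replacing its own buffer, maintenance of that invariant is
/// disabled, so refilling cannot recurse.
struct Bookkeeper {
    spare: Cell<usize>,    // spare slots currently held
    extra: usize,          // n: required spare slots
    refilling: Cell<bool>, // flag: buffer replacement in progress
}

impl Bookkeeper {
    fn maintain(&self) {
        // The invariant is suspended during a refill.
        if self.refilling.get() || self.spare.get() >= self.extra {
            return;
        }
        self.refilling.set(true);
        // ... grow the internal buffer here; any allocation made while
        // the flag is set is served from the spare slots instead ...
        self.spare.set(self.extra);
        self.refilling.set(false);
    }
}
\end{verbatim}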
\section{Avoiding locks}
Contention on a global lock can hurt the performance of an allocator severely.
\cite{manghwani05} analyze two different thread-caching methods.
Unfortunately, neither of them is suited for an arbitrary-size allocator, so
we modify LLAlloc's approach.
\ralloc's ``source allocator'' is generic, and this is exploited for thread
caching. In particular, we associate a local allocator with each thread.
All \ralloc bookkeeping is done in this allocator.
When the first \ralloc call of a thread is made, the allocator is
initialized and fueled by memory requested from the global allocator
(shared across threads and hidden behind a lock). Then a thread destructor
is set: when the thread exits, the memory of the local allocator is freed
to the global allocator.
Whenever the local allocator is unable to serve the requested memory, the
global allocator is locked and used as the source allocator.
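A Rust sketch of this two-level scheme; \texttt{LocalPool}, the
$(\mathtt{start}, \mathtt{size})$ pairs, and the first-fit search are
illustrative stand-ins for \ralloc's allocators, and the \texttt{Drop}
implementation models the thread destructor:
\begin{verbatim}
use std::sync::Mutex;

/// Global allocator: shared across threads, hidden behind a lock.
static GLOBAL: Mutex<Vec<(usize, usize)>> = Mutex::new(Vec::new());

/// Per-thread allocator: accessed without any locking.
struct LocalPool {
    free: Vec<(usize, usize)>,
}

impl LocalPool {
    fn alloc(&mut self, size: usize) -> Option<(usize, usize)> {
        // Serve locally when possible...
        if let Some(i) = self.free.iter().position(|&(_, z)| z >= size) {
            return Some(self.free.remove(i));
        }
        // ...and fall back to locking the global allocator.
        let mut global = GLOBAL.lock().unwrap();
        let i = global.iter().position(|&(_, z)| z >= size)?;
        Some(global.remove(i))
    }
}

impl Drop for LocalPool {
    /// On thread exit, all locally cached memory is returned to the
    /// global allocator.
    fn drop(&mut self) {
        GLOBAL.lock().unwrap().append(&mut self.free);
    }
}
\end{verbatim}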
\subsection{Local-to-global memory trimming}
Two new problems arise when introducing this form of thread caching:
\begin{enumerate}
\item There is increased external fragmentation in the case of
cross-thread frees, due to one thread getting the allocated buffer
and another getting the adjacent blocks.
\item If one thread allocates a large buffer and then later frees it,
it ends up keeping inaccessible memory, which could be used by
other threads.
\end{enumerate}
To counter these problems, we implement memory trimming from the local to
the global allocator. What this essentially means is that the memory
trimming routine, which frees blocks to the global allocator until some
condition is met, is triggered when the allocator hits some memory limit or
some fragmentation limit.
We define a local allocator to have ``high fragmentation''
whenever\footnote{This is a rather simplistic measure, and will be improved
in the future.}
$$
t < cb
$$
with $ t $ being the total number of bytes contained in the allocator, $ b $
being the number of blocks, and $ c $ some constant.
When this condition is met, or $ t > l $ for some constant $ l $, we free
local blocks until $ t < s $ for some constant $ s $, as sketched below.
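In Rust, with illustrative constants standing in for $ c $, $ l $ and $ s $:
\begin{verbatim}
const C: usize = 64;      // c: minimum healthy average block size
const L: usize = 1 << 20; // l: local memory limit (bytes)
const S: usize = 1 << 18; // s: trim target (bytes)

/// Trigger: high fragmentation (t < c*b) or too much memory (t > l).
fn should_trim(t: usize, b: usize) -> bool {
    t < C * b || t > L
}

/// Free local blocks to the global allocator until t < s.
fn trim(free: &mut Vec<(usize, usize)>, t: &mut usize) {
    while *t >= S {
        match free.pop() {
            // ... the popped block would be handed to the global
            // allocator here ...
            Some((_start, size)) => *t -= size,
            None => break,
        }
    }
}
\end{verbatim}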
\section{OS memory trimming}
For long-running processes, memory peaks can be problematic if the memory is
never returned to the operating system.
To avoid the case where the maximal memory usage is retained, we trim the
global allocator similarly to how we trim the local allocator, with the
exception of not measuring fragmentation.
When OS memory trimming is triggered, we pop the last block and, if it
is big enough to be worth trimming, we free it to the system. Since
blocks in the memory bookkeeper represent the longest possible continuous
segments, we can trim at most one block.
\section{Canonicalization}
Both in the case of system source allocation (for which we use BRK) and of
global source allocation, we are interested in reserving more than actually
needed, to avoid the overhead of repeated calls.
Every request to the source allocator is canonicalized by
$$
f_s(x) = x + \max(k_1, \min(k_2 x, k_3))
$$
That is, we set a minimum extra space ($ k_1 $), and then define the
excessive space as a multiple ($ k_2 $) of the requested space, bounded by
some maximum extra space ($ k_3 $).
BRK calls are especially expensive (they can even trigger remapping of the
process), so they warrant even stronger canonicalization: we apply the same
function, but with larger constants.
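For instance, in Rust with illustrative constants (the real $ k_i $ differ
between the global and the BRK source):
\begin{verbatim}
/// Canonicalize a request: f(x) = x + max(k1, min(k2 * x, k3)).
fn canonicalize(x: usize, k1: usize, k2: usize, k3: usize) -> usize {
    x + k1.max((k2 * x).min(k3))
}

fn main() {
    // Small request: dominated by the minimum extra space k1.
    assert_eq!(canonicalize(64, 256, 2, 65536), 64 + 256);
    // Medium request: the extra space grows with the factor k2.
    assert_eq!(canonicalize(1024, 256, 2, 65536), 1024 + 2048);
    // Huge request: the extra space is capped at k3.
    assert_eq!(canonicalize(1 << 20, 256, 2, 65536), (1 << 20) + 65536);
}
\end{verbatim}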
\section{Partial frees}
An interesting implication of our approach is the possibility of
``partial frees'', in which \emph{a part} of a buffer is deallocated. This
is not allowed in ``classical allocators'', but is useful in some scenarios.
\section{Adding libc malloc compatibility}
libc's allocation API diverges from Rust's in one major way: in libc, the
only information provided upon reallocation and free is the pointer to the
buffer. In Rust, however, the size of the allocation is provided as
well\cite{crichton15}.
This leads to interesting possibilities, mainly concerning the reduction
(or, as in \ralloc's case, even complete removal) of block metadata.
The drawback is incompatibility. To solve this, we encode
the buffer size into a metadata block preceding the buffer whenever the
allocation is done through the libc API.
This metadata block is unaligned, since the block's original alignment is
unknown. Before the metadata block, we place an alignment padding block,
which serves to align the allocated buffer.
When free or reallocation is invoked, we can obtain the size of the buffer
by simply reading this metadata block. This allows us to translate the libc
API calls into Rust-like allocation API calls.
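A sketch of such a shim, here layered over Rust's sized allocation API
rather than \ralloc itself; the alignment padding block and failure
handling are omitted for brevity:
\begin{verbatim}
use std::alloc::{alloc, dealloc, Layout};

const HEADER: usize = std::mem::size_of::<usize>();

/// libc-style malloc: store the payload size just before the buffer.
unsafe fn c_malloc(size: usize) -> *mut u8 {
    let layout = Layout::from_size_align(HEADER + size, HEADER).unwrap();
    let ptr = alloc(layout);
    (ptr as *mut usize).write(size);
    ptr.add(HEADER)
}

/// libc-style free: recover the size from the metadata block, then
/// translate into a sized (Rust-like) deallocation.
unsafe fn c_free(ptr: *mut u8) {
    let base = ptr.sub(HEADER);
    let size = (base as *const usize).read();
    let layout = Layout::from_size_align(HEADER + size, HEADER).unwrap();
    dealloc(base, layout);
}
\end{verbatim}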
%%% IMPLEMENTATION/PRACTICAL CONSIDERATION %%%
\section{Implementation}
\ralloc is a part of the Redox OS ecosystem and thus is, like all other
components, written in the Rust programming language. Rust puts great
emphasis on multithreading, and so does \ralloc.
Rust deploys a great deal of static analysis, which we naturally use to
improve the correctness of \ralloc. The feature we use the most is affine
types, which are a form of substructural types where passing a value
``moves'' it out of scope. This allows us to effectively reason about
issues such as use-after-free and illegal aliasing.
Our implementation is written in a semi-literate style, and more than
half of the lines of code are API documentation or comments. This is a
deliberate choice: it forces the programmer to think about and explain
why the algorithm works, hopefully uncovering bugs in the process.
\subsection{Debugging}
Our implementation deploys extensive debug checks. Every invariant is
carefully checked in ``debugging mode''; this allows for efficiently
uncovering subtle bugs.
Invariants which are relied on for safety guarantees are checked through
assertions, which are tested even in non-debug mode. Disassembly shows
that the vast majority of these are optimized out or have no impact on
performance\footnote{i.e. no observable difference is seen.}.
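For illustration, the two tiers of checks look roughly as follows in Rust,
where \texttt{check\_invariants} is a hypothetical stand-in for the full
invariant check:
\begin{verbatim}
fn split_at(block_size: usize, pos: usize) {
    // Safety-relevant invariant: kept even in release builds, since
    // memory safety depends on it.
    assert!(pos <= block_size, "split position out of bounds");
    // Expensive consistency check: compiled out of release builds.
    debug_assert!(check_invariants(), "bookkeeper invariants broken");
    // ... perform the split ...
}

fn check_invariants() -> bool {
    // Stand-in for the real checks (sortedness, maximal merging,
    // correct fat blocks); illustrative only.
    true
}
\end{verbatim}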
For a downstream user who wishes to debug their program, we make
configurable API calls to the debugger, informing it about freed and
uninitialized memory.
\subsection{Logging}
Everything is (optionally) logged to a configurable target. The log is
separated into multiple ``log levels'', an idea used in many modern
logging systems.
\subsection{Configuration}
Configuration is handled by linking to a shim library, which provides
constants and functions for binding and configuring \ralloc. This approach
saves \ralloc from the problems of \texttt{mallopt}\footnote{The problems are
mainly caused by global configuration, in which querying a parameter
requires locking a global lock.}.
The downside is that \ralloc is no longer configurable at runtime.
This can be fixed by replacing the default shim implementation with a
custom one.
\section{Measuring performance}
\section{Limitations}
%%% CLOSING WORDS %%%
\section{Future work}
We are interested in investigating storing the original thread of an
allocated buffer, in order to return the buffer to it and thereby reduce
external fragmentation. The thread would be represented by a pointer to
some concurrent channel, which is used to pass over the block in case of
cross-thread drops. Naturally, to avoid alignment issues, we place this
pointer after the buffer. This has the positive side-effect of revealing
buffer overruns by damaging the pointer and likely producing a segmentation
fault.
\section{Thanks to}
Special thanks to Thomas Levy, who helped implement and develop
\ralloc.
%%% BIBLIOGRAPHY %%%
\begin{thebibliography}{9}
\bibitem{redoxweb}
\url{http://redox-os.org},
retrieved \today
\bibitem{rustweb}
\url{http://rust-lang.org},
retrieved \today
\bibitem{bonwick94}
Jeff Bonwick,
\emph{The Slab Allocator: An Object-Caching Kernel Memory Allocator},
1994
\bibitem{mackall05}
Matt Mackall,
``slob: introduce the SLOB allocator'',
Linux Kernel Mailing List,
2005
\bibitem{knowlton65}
Kenneth C. Knowlton,
\emph{A Fast Storage Allocator},
1965
\bibitem{pugh89}
William Pugh,
\emph{Skip Lists: A Probabilistic Alternative to Balanced Trees},
1989
\bibitem{manghwani05}
Rahul Manghwani and Tao He,
\emph{Scalable Memory Allocation},
2005
\bibitem{crichton15}
Alex Crichton,
``RFC: Allow changing the default allocator'',
2015
\end{thebibliography}
\end{document}