
\documentclass[11pt]{article}
\usepackage[T1]{fontenc}
\usepackage{amsmath, amsfonts, amssymb, amsthm, url, algorithmicx,
algpseudocode, lmodern, color, graphicx}
\providecommand{\floor}[1]{\left \lfloor #1 \right \rfloor }
\providecommand{\NOTE}[1]{
\begin{center}
\fbox{\parbox{\textwidth}{\textbf{\textit{#1}}}}
\end{center}
}
\newcommand{\emod}{\ \mathrm{mod}\ }
\newcommand{\ralloc}{\emph{ralloc} }
\title{Ralloc -- An Efficient One-Fits-All General Memory Allocator}
\author{Ticki\thanks{Thanks to Thomas Levy for helping with the implementation of \ralloc.}}
\date{\today}
\begin{document}
%%% DISCLAIMER %%%
\begin{titlepage}
\centering \huge\bfseries The following document is an incomplete draft.
\end{titlepage}
%%% START OF DOCUMENT %%%
\maketitle
\begin{abstract}
We present \ralloc, a novel memory allocation algorithm. Contrary to
many other memory allocators, which are based on queues of similarly
sized blocks, \ralloc is built on a data structure holding blocks of
arbitrary size, eliminating internal fragmentation while preserving
high performance. Furthermore, we show how this approach can be thread
cached to enable high performance under massive multi-threading.
\end{abstract}
\section{Introduction}
Memory allocators are a core component of modern software. The heap is
crucial for storing dynamically sized objects, which naturally implies the
need for high performance.
Unfortunately, modern memory allocators tend to be complex simply to
achieve good performance. Complexity has its downsides, from
maintainability to security. The core algorithm of \ralloc is relatively
simple, and a basic implementation requires only around 1000 lines of code.
This project was originally born out of the need for a general-purpose
userspace memory allocator for Redox -- The Rust
Operating-System\cite{redoxweb}, an OS in which all the core primitives are
written in the Rust programming language\cite{rustweb}. This programming
language utilizes extensive static analysis methods, which we use throughout
our implementation.
The primary concern of a good allocator is the trade-off between memory
usage and performance. \ralloc performs well in terms of memory usage,
and demonstrably well in terms of performance.

The approach \ralloc takes is radically different from popular memory
allocators such as slab\cite{bonwick94}, slob\cite{mackall05}, and buddy
allocation\cite{knowlton65}. In particular, we present a data structure
which keeps track of the longest possible continuous free segments. We
represent this efficiently in a tree-like structure, allowing us to query
for blocks satisfying some condition.
%%% CORE %%%
\section{Blocks}
`Blocks' are the core primitive of \ralloc. Strictly speaking, they are just
fat pointers\footnote{i.e. a pointer equipped with a size.} with some
additional invariants (disjointness and validity), which are maintained
through the affine type system of Rust.
A block represents some segment of memory and supports certain operations,
all of which preserve the invariants specified above:
\begin{description}
\item [split(block, position)] Split the block at some position.
\item [merge(block\textsubscript{1}, block\textsubscript{2})] Merge two
blocks.
\item [size(block)] Get the size of a block.
\item [align(align, block)] Split the block to align to some specified
alignment.
\end{description}
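To make this concrete, the following is a minimal sketch of such a block
type in Rust. The names and signatures are illustrative rather than the
exact implementation; note how \texttt{split} and \texttt{merge} consume
their arguments, so a block cannot be used again once it has been split or
merged away.
\begin{verbatim}
/// A sketch of the block primitive: a fat pointer, i.e. a pointer
/// equipped with a size. Illustrative, not the exact implementation.
struct Block {
    ptr: *mut u8,
    size: usize,
}

impl Block {
    /// Get the size of the block.
    fn size(&self) -> usize {
        self.size
    }

    /// Split the block at `pos`, consuming it and returning both halves.
    fn split(self, pos: usize) -> (Block, Block) {
        assert!(pos <= self.size);
        (
            Block { ptr: self.ptr, size: pos },
            Block { ptr: unsafe { self.ptr.add(pos) }, size: self.size - pos },
        )
    }

    /// Merge two blocks; the caller guarantees that `other` starts
    /// exactly where `self` ends (the adjacency invariant).
    fn merge(self, other: Block) -> Block {
        assert_eq!(unsafe { self.ptr.add(self.size) } as usize,
                   other.ptr as usize);
        Block { ptr: self.ptr, size: self.size + other.size }
    }
}
\end{verbatim}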
\subsection{Memory alignment}
The last operation, align, is of particular interest. All alignment
padding functions satisfy
$$
\mathtt{aligner}(b) \equiv -b \pmod A
$$
The simplest one is
$$
\mathtt{aligner}(b) = -b \emod A
$$
Since $ \mathtt{aligner}(b) < A $, it tends to leave small, unusable blocks
behind, which increases external fragmentation.
For this reason, we use a slightly more complex algorithm:
$$
\mathtt{aligner}(b) = \begin{cases}
b & \mbox{if } b \mid A \\
\floor{\frac{\mathtt{size}(b)}{2}} - (\floor{\frac{\mathtt{size}(b)}{2}} \emod A) & \mbox{if } \mathtt{size}(b) \geq 2A \\
-b \emod A & \mbox{else}
\end{cases}
$$
By observation, this tends to reduce external fragmentation by an order of
magnitude.
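As an illustration, the padding computation can be sketched as follows in
Rust, covering the last two cases of the definition above (with the block's
start address and size given explicitly); it is illustrative, not the exact
implementation.
\begin{verbatim}
/// How far into the block starting at `addr` with length `size` we
/// should split to satisfy alignment `align`. A sketch of the last two
/// cases of the formula above.
fn aligner(addr: usize, size: usize, align: usize) -> usize {
    if size >= 2 * align {
        // Split near the middle, rounded down to a multiple of `align`,
        // so that neither remaining part is uselessly small.
        let half = size / 2;
        half - half % align
    } else {
        // The simple rule, (-addr) mod align: pad just enough to reach
        // the next multiple of `align`; may leave a tiny block behind.
        (align - addr % align) % align
    }
}
\end{verbatim}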
\section{Memory bookkeeping}
\NOTE{This section was revised in multiple turns and is thus sketchy. Rewrite pending.}
Contrary to other interval data structures, we are not concerned with
overlaps, but rather with querying for blocks of some size. \ralloc is based
on a data structure in which every block is maximized such that it marks the
longest possible continuous segment; in other words, we merge adjacent
blocks.
The list of blocks is stored in a data structure resembling the Skip Lists
first described in \cite{pugh89}, with $ p = \frac{1}{2} $.
\subsection{Efficiently querying blocks of some size or more}
A linear search through the list for a fitting block is rather slow,
especially in the case of high fragmentation. To deal with this problem, we
track the biggest skipped block. We call this block ``the fat block of a
shortcut''.
To update this value, one merely has to recurse over the levels. Note that
the average shortcut will have $ p^{-1} = 2 $ children\footnote{We use the
term ``child'' due to the analogy to binary trees.}.
\subsection{Insertion}
To preserve the invariant that every block is ``as long as possible'', i.e.
never adjacent to a neighbouring block, we need to handle the case of
adjacent insertion. Let $ b $ be the block in question and $ L $ the list of
free blocks; the cases are also sketched in code after the list.
\begin{description}
\item [Double adjacent] If $ b $ has adjacent blocks $ (a, c) \in L $,
then remove $ c $ and set $$ a \gets \mathtt{merge}(a, b, c). $$
\item [Left adjacent] Else if $ a \in L $ is left adjacent to $ b $, then
set $$ a \gets \mathtt{merge}(a, b). $$
\item [Right adjacent] Else if $ c \in L $ is right adjacent to $ b $, then
set $$ c \gets \mathtt{merge}(b, c). $$
\item [No adjacent] Continue with conventional Skip List insertion.
\end{description}
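A sketch of the adjacency handling, using the block type sketched earlier,
a plain ordered \texttt{Vec} in place of the Skip List, and a hypothetical
helper \texttt{left\_adjacent\_to}; illustrative, not the exact
implementation.
\begin{verbatim}
impl Block {
    /// True when `self` ends exactly where `other` begins.
    fn left_adjacent_to(&self, other: &Block) -> bool {
        unsafe { self.ptr.add(self.size) } as usize == other.ptr as usize
    }
}

/// Insert `b` while keeping every block "as long as possible".
fn insert(list: &mut Vec<Block>, b: Block) {
    // Blocks ending where `b` starts, or starting where `b` ends.
    let left = list.iter().position(|a| a.left_adjacent_to(&b));
    let right = list.iter().position(|c| b.left_adjacent_to(c));

    match (left, right) {
        // Double adjacent: remove `c`, then grow `a` over `b` and `c`.
        (Some(i), Some(j)) => {
            let c = list.remove(j);
            let i = if j < i { i - 1 } else { i }; // removal may shift `a`
            let a = list.remove(i);
            list.insert(i, a.merge(b).merge(c));
        }
        // Left adjacent: grow `a` over `b`.
        (Some(i), None) => {
            let a = list.remove(i);
            list.insert(i, a.merge(b));
        }
        // Right adjacent: grow `b` over `c`, keeping `c`'s position.
        (None, Some(j)) => {
            let c = list.remove(j);
            list.insert(j, b.merge(c));
        }
        // No adjacent block: conventional ordered insertion.
        (None, None) => {
            let at = list
                .iter()
                .position(|x| x.ptr as usize > b.ptr as usize)
                .unwrap_or(list.len());
            list.insert(at, b);
        }
    }
}
\end{verbatim}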
A simple trick can improve performance by avoiding backtracking to update
the fat blocks in certain cases. When the pointer is overshot, set
$$
f \gets \max(f, \mathtt{size}(b))
$$
where $ f $ is the size of the biggest skipped node and $ b $ is the block
we are inserting. If merging was not possible, no backtracking is needed.
Consider the configuration in the figure.
\begin{figure}
\caption{An example configuration.}
\NOTE{This diagram is totally useless right now. Improve it.}
\begin{center}
\scalebox{1.5}{
$ x \xrightarrow{a \xrightarrow{b \xrightarrow{c \longrightarrow d} e} f \xrightarrow{g \xrightarrow{h} i} j} y $
}
\end{center}
\end{figure}
% TODO: Use algorithmic to specify the algorithm.
\subsection{Cache efficiency}
Cache efficiency is a major performance concern in pointer-based
structures. In the case of trees and linked lists, where each node contains
a pointer to its children or successor, cache misses can damage performance
severely.
To avoid this, we use a SLOB arena: a block which we allocate and break
down into a linked list of nodes. This ensures that the nodes are confined
to only a few cache lines.
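A sketch of the idea in Rust: one contiguous chunk is carved into node
slots which are threaded onto a free list. The representation is
simplified and illustrative.
\begin{verbatim}
/// A node slot in the arena: either occupied by bookkeeping data or a
/// link in the free list of unused slots.
struct Slot {
    next_free: Option<usize>, // index of the next free slot
    // ... the bookkeeping payload lives here in the real structure ...
}

/// A SLOB-style arena: one contiguous allocation broken down into
/// slots, so the nodes end up on only a few cache lines.
struct Arena {
    slots: Vec<Slot>,
    free_head: Option<usize>,
}

impl Arena {
    fn new(capacity: usize) -> Arena {
        // Thread every slot onto the free list.
        let slots: Vec<Slot> = (0..capacity)
            .map(|i| Slot {
                next_free: if i + 1 < capacity { Some(i + 1) } else { None },
            })
            .collect();
        Arena {
            slots,
            free_head: if capacity > 0 { Some(0) } else { None },
        }
    }

    /// Take a slot from the free list, if any.
    fn alloc(&mut self) -> Option<usize> {
        let head = self.free_head?;
        self.free_head = self.slots[head].next_free;
        Some(head)
    }

    /// Return a slot to the free list.
    fn free(&mut self, i: usize) {
        self.slots[i].next_free = self.free_head;
        self.free_head = Some(i);
    }
}
\end{verbatim}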
\subsection{Metacircular allocation}
The structure maintains its own buffer. Despite the complications this
leads to, it turns out to be faster than simply allocating the buffer
through the source allocator (which we explain in the next section).
The main complication of this approach is that replacing the internal
buffer itself requires allocation, and can thus result in unbounded
recursion. To solve this, we maintain the invariant that the arena holds
$ n $ more blocks than needed whenever reallocation is inactive ($ n $ is
implementation defined).
Whenever reallocation is triggered (i.e. when the invariant is broken), a
flag is set, which disables the maintenance of said invariant; after the
reallocation is over, we reset the flag.
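A sketch of the reserve-and-flag logic, with a simple counter of free node
slots standing in for the arena and an illustrative value of $ n $.
\begin{verbatim}
/// Implementation-defined number of spare node slots (the constant `n`
/// above); the value here is purely illustrative.
const SPARE_NODES: usize = 8;

struct Bookkeeper {
    free_nodes: usize,  // free node slots currently held by the arena
    reallocating: bool, // set while the arena itself is being grown
}

impl Bookkeeper {
    /// Grow the internal arena. In the real allocator this itself
    /// allocates, which is why the flag below is needed.
    fn grow_arena(&mut self, extra: usize) {
        self.free_nodes += extra;
    }

    /// Restore the invariant: keep at least SPARE_NODES free slots,
    /// unless the arena is already being grown.
    fn maintain(&mut self) {
        if self.reallocating {
            return; // invariant deliberately suspended during the refill
        }
        if self.free_nodes < SPARE_NODES {
            self.reallocating = true; // growing may recurse into maintain()
            self.grow_arena(SPARE_NODES - self.free_nodes);
            self.reallocating = false;
        }
    }
}
\end{verbatim}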
\section{Avoiding locks}
Locking a global lock can hurt the performance of an allocator severely.
\cite{manghwani05} analyze two different thread-caching methods.
Unfortunately, neither of them is suited for an arbitrary-size allocator, so
we modify LLAlloc's approach.
\ralloc's ``source allocator'' is generic, and this is exploited for thread
caching. In particular, we associate a local allocator with each thread.
All \ralloc bookkeeping is done in this allocator.
When the first \ralloc call of a thread is made, the allocator is
initialized and fueled by memory requested from the global allocator
(shared across threads and hidden behind a lock). A thread destructor is
then set: when the thread exits, the memory of the local allocator is freed
to the global allocator.
Whenever the local allocator is unable to serve the requested memory, the
global allocator is locked and used as the source allocator.
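A sketch of the scheme in Rust, with a simplified stand-in for the
bookkeeping structure; refueling the local allocator and registering the
thread destructor are only indicated by comments.
\begin{verbatim}
use std::cell::RefCell;
use std::sync::Mutex;

/// Simplified stand-in for the bookkeeping structure described above.
#[derive(Default)]
struct Allocator {
    free_bytes: usize,
}

impl Allocator {
    /// Try to reserve `size` bytes from this allocator's own blocks,
    /// returning whether it succeeded. (Stands in for returning a block.)
    fn try_alloc(&mut self, size: usize) -> bool {
        if self.free_bytes >= size {
            self.free_bytes -= size;
            true
        } else {
            false
        }
    }
}

/// The global allocator, shared across threads and hidden behind a lock.
static GLOBAL: Mutex<Allocator> = Mutex::new(Allocator { free_bytes: 1 << 20 });

thread_local! {
    // The per-thread local allocator; all bookkeeping happens here.
    static LOCAL: RefCell<Allocator> = RefCell::new(Allocator::default());
}

fn alloc(size: usize) {
    LOCAL.with(|local| {
        if local.borrow_mut().try_alloc(size) {
            return; // served entirely by the thread-local allocator
        }
        // The local allocator cannot serve the request: lock the global
        // allocator and use it as the source allocator. (Here the real
        // allocator refuels the local allocator, and a thread destructor
        // returns its memory to the global one when the thread exits.)
        let served = GLOBAL.lock().unwrap().try_alloc(size);
        assert!(served, "out of memory");
    });
}
\end{verbatim}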
\subsection{Local-to-global memory trimming}
Two new problems arise when introducing this form of thread caching:
\begin{enumerate}
\item There is increased external fragmentation in the case of
cross-thread frees, due to one thread getting the allocated buffer
and another getting the adjacent blocks.
\item If one thread allocates a large buffer and later frees it, the
thread ends up keeping memory that is inaccessible to, but could be
used by, other threads.
\end{enumerate}
To counter these problems, we implement memory trimming from the local to
the global allocator: a trimming routine, which frees blocks to the global
allocator until some condition is met, is triggered when the local
allocator hits some memory limit or some fragmentation limit.
We define a local allocator to have ``high fragmentation''
whenever\footnote{This is a rather simplistic measure, and will be improved
in the future.}
$$
t < cb
$$
with $ t $ being the total number of bytes contained in the allocator, $ b $
the number of blocks, and $ c $ some constant.
When this condition is met, or $ t > l $ for some constant $ l $, we free
local blocks until $ t < s $ for some constant $ s $.
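A sketch of the trigger and the trimming loop; the constants are purely
illustrative, and handing a block back to the global allocator is only a
placeholder.
\begin{verbatim}
// Illustrative, implementation-defined constants.
const C: usize = 64;      // bytes-per-block fragmentation threshold (c)
const L: usize = 4 << 20; // local memory limit before trimming (l)
const S: usize = 1 << 20; // trim until this much memory is left (s)

struct LocalAllocator {
    total_bytes: usize, // t: total bytes held by this local allocator
    blocks: usize,      // b: number of blocks held
}

impl LocalAllocator {
    /// "High fragmentation": many blocks relative to the bytes they hold.
    fn high_fragmentation(&self) -> bool {
        self.total_bytes < C * self.blocks
    }

    /// Trim local memory back to the global allocator when either the
    /// fragmentation condition or the memory limit is hit.
    fn maybe_trim(&mut self) {
        if self.high_fragmentation() || self.total_bytes > L {
            while self.total_bytes > S && self.blocks > 0 {
                self.free_one_block_to_global();
            }
        }
    }

    /// Placeholder: hand one block back to the global allocator and
    /// shrink the counters accordingly.
    fn free_one_block_to_global(&mut self) {
        self.blocks -= 1;
        self.total_bytes = self.total_bytes.saturating_sub(C);
    }
}
\end{verbatim}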
\section{OS memory trimming}
For long-running processes, memory peaks can be problematic if the memory
is never returned to the operating system.
To avoid the case where the maximal memory usage is retained, we trim the
global allocator similarly to how we trim the local allocator, with the
exception of not measuring fragmentation.
When OS memory trimming is triggered, we pop the last block and, if it
is big enough to be worth trimming, free it to the system. Since
blocks in the memory bookkeeper represent the longest possible continuous
segments, we can memory trim at most one block.
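A sketch of the check; the threshold is illustrative, and the actual
release of memory (shrinking the program break) is only indicated by a
comment.
\begin{verbatim}
/// Illustrative threshold below which returning memory to the OS is not
/// considered worth the cost of the system call.
const OS_TRIM_THRESHOLD: usize = 1 << 20;

/// Pop the last (highest) block and, if it is big enough to be worth
/// trimming, return it to the operating system. Returns the number of
/// bytes that would be released; at most one block can ever be trimmed,
/// since blocks are maximal.
fn os_trim(last_block_size: Option<usize>) -> usize {
    match last_block_size {
        Some(size) if size >= OS_TRIM_THRESHOLD => {
            // Here the real allocator would shrink the program break
            // (brk) by `size` bytes.
            size
        }
        _ => 0,
    }
}
\end{verbatim}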
\section{Canonicalization}
Both in the case of system source allocation (for which we use
\texttt{brk}) and in the case of global source allocation, we are
interested in reserving more than actually needed, to avoid the overhead of
repeated calls.
Every request to the source allocator is therefore canonicalized by
$$
f_s(x) = x + \max(k_1, \min(k_2 x, k_3))
$$
In other words, we set a minimum extra space ($ k_1 $), and then define the
excess space as a multiple ($ k_2 $) of the requested space, bounded by some
maximum extra space ($ k_3 $).
\texttt{brk} calls are especially expensive (they can even trigger
remapping of the process), so they need even stronger canonicalization: we
apply the same function, but with stronger constants.
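A sketch of the canonicalization function, with purely illustrative
constants: one set for the global source allocator and a stronger one for
\texttt{brk}.
\begin{verbatim}
/// Round a request up so that the source allocator is asked for extra
/// space, bounding the extra between a minimum and a maximum.
/// k1 = minimum extra, k2 = multiplier, k3 = maximum extra.
fn canonicalize(x: usize, k1: usize, k2: usize, k3: usize) -> usize {
    x + k1.max((k2 * x).min(k3))
}

// Illustrative constants, not the ones used by ralloc.
fn canonicalize_global(x: usize) -> usize {
    canonicalize(x, 64, 2, 8192)
}

fn canonicalize_brk(x: usize) -> usize {
    // brk calls are more expensive, so reserve more aggressively.
    canonicalize(x, 4096, 4, 1 << 20)
}
\end{verbatim}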
\section{Partial frees}
An interesting implication of our approach is the possibility of
``partial frees'', in which \emph{a part} of a buffer is deallocated. This
is not allowed in ``classical allocators'', but is useful in some scenarios.
\section{Adding libc malloc compatibility}
libc's allocation API diverges from Rust's in one major way: in libc, the
only information provided upon reallocation and free is the pointer to the
buffer. In Rust, however, the size of the allocation is provided as
well\cite{crichton15}.
This leads to interesting possibilities, mainly concerning the reduction
(or, as in \ralloc's case, the complete removal) of per-block metadata.
The drawback is incompatibility. To solve this, we encode the buffer size
into a metadata precursor of the buffer whenever the allocation is done
through the libc API.
This metadata block is unaligned, since the block's original alignment is
unknown. Before the metadata block, we place an alignment padding block,
which serves to align the allocated buffer.
When free or reallocation is invoked, we can obtain the size of the buffer
by simply reading this metadata block. This allows us to translate libc
API calls into Rust-like allocation API calls.
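A sketch of the translation layer in Rust, using the standard library's
sized allocation API as a stand-in for the Rust-side interface; the
alignment padding block is omitted, so this is illustrative only.
\begin{verbatim}
use std::alloc::{alloc as alloc_sized, dealloc as free_sized, Layout};
use std::mem::size_of;
use std::ptr;

/// libc-style malloc: allocate extra room and store the total size just
/// in front of the buffer handed to the caller.
unsafe fn c_malloc(size: usize) -> *mut u8 {
    let header = size_of::<usize>();
    let layout = Layout::from_size_align(size + header, header).unwrap();
    unsafe {
        let base = alloc_sized(layout);
        if base.is_null() {
            return ptr::null_mut();
        }
        // Write the metadata precursor: the total size of the allocation.
        (base as *mut usize).write(size + header);
        base.add(header)
    }
}

/// libc-style free: read the size back from the metadata precursor and
/// translate the call into a sized (Rust-like) deallocation.
unsafe fn c_free(p: *mut u8) {
    if p.is_null() {
        return;
    }
    let header = size_of::<usize>();
    unsafe {
        let base = p.sub(header);
        let total = (base as *mut usize).read();
        free_sized(base, Layout::from_size_align(total, header).unwrap());
    }
}
\end{verbatim}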
%%% IMPLEMENTATION/PRACTICAL CONSIDERATION %%%
\section{Implementation}
\ralloc is a part of the Redox OS ecosystem and is thus, like all other
components, written in the Rust programming language. Rust puts great
emphasis on multithreading, and so does \ralloc.
Rust deploys a great deal of static analysis, which we naturally use to
improve the correctness of \ralloc. The feature we use the most is affine
types, which are a form of substructural types where passing a value
``moves'' it out of scope. This allows us to effectively reason about
things such as use-after-free and illegal aliasing.
Our implementation is written in a semi-literate style, and more than
half of the lines of code are API documentation or comments. This is a
deliberate choice: it forces the programmer to think about and explain why
the algorithm works, hopefully uncovering bugs in the process.
\subsection{Debugging}
Our implementation deploys extensive debug checks. Every invariant is
carefully checked in ``debugging mode'', which allows subtle bugs to be
uncovered efficiently.
Invariants which are relied on for safety guarantees are checked through
assertions, which are tested even in non-debug mode. Disassembly shows
that the vast majority of these are optimized out or have no impact on
performance\footnote{i.e. no observable difference is seen.}.
For downstream users who wish to debug their program, we make
configurable API calls to the debugger, informing it about free and
uninitialized memory.
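As an illustration of the two kinds of checks, with a simplified block
type and invariant; the names are illustrative.
\begin{verbatim}
struct Block {
    ptr: *mut u8,
    size: usize,
}

/// Check that a list of free blocks is sorted, non-overlapping and
/// non-adjacent (i.e. every block is "as long as possible").
fn check_invariants(blocks: &[Block]) {
    for pair in blocks.windows(2) {
        let end = pair[0].ptr as usize + pair[0].size;
        // Exhaustive consistency check, compiled only into debug builds.
        debug_assert!(end < pair[1].ptr as usize,
                      "blocks overlap or are adjacent");
    }
    // A guarantee relied on for safety is checked with a plain assert!,
    // which stays in release builds as well.
    assert!(blocks.iter().all(|b| !b.ptr.is_null()),
            "null block in the free list");
}
\end{verbatim}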
\subsection{Logging}
Everything is (optionally) logged to a configurable target. The log is
separated into multiple ``log levels'', an idea which is used in many
modern logging systems.
\subsection{Configuration}
Configuration is handled by linking to a shim library, which provides
constants and functions for binding and configuring \ralloc. This approach
spares \ralloc the problems of \texttt{mallopt}\footnote{The problems are
mainly caused by global configuration, in which querying a parameter
requires locking a global lock.}.
The downside is that \ralloc is no longer configurable at runtime.
This can be fixed by replacing the default shim implementation with a
custom one.
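A sketch of what such a shim could look like; the symbol names and values
are hypothetical and do not correspond to the actual shim interface.
\begin{verbatim}
//! A hypothetical shim crate: ralloc links against these symbols and
//! calls them for its configuration, so no global lock is needed.

/// Memory limit of a local allocator before trimming is triggered
/// (hypothetical name and value).
#[no_mangle]
pub extern "C" fn shim_local_memtrim_limit() -> usize {
    4 << 20
}

/// Where log output should be sent (hypothetical name; e.g. a file
/// descriptor, here standard error).
#[no_mangle]
pub extern "C" fn shim_log_target() -> i32 {
    2
}
\end{verbatim}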
\section{Measuring performance}
\section{Limitations}
%%% CLOSING WORDS %%%
\section{Future work}
We are interested in investigating storing the original thread of an
allocated buffer, so that the buffer can be returned to it, reducing
external fragmentation. The thread would be represented by a pointer to
some concurrent channel, which is used to pass over the block in case of
cross-thread drops. Naturally, to avoid alignment issues, we place this
pointer after the buffer. This has the positive side-effect of revealing
buffer overruns by damaging the pointer and likely producing a segmentation
fault.
\section{Thanks to}
Special thanks to Thomas Levy, who helped implement and develop
\ralloc.
%%% BIBLIOGRAPHY %%%
\begin{thebibliography}{9}
\bibitem{redoxweb}
\url{http://redox-os.org},
retrieved \today
\bibitem{rustweb}
\url{http://rust-lang.org},
retrieved \today
\bibitem{bonwick94}
Jeff Bonwick,
\emph{The Slab Allocator: An Object-Caching Kernel Memory Allocator},
1994
\bibitem{mackall05}
Matt Mackall,
``slob: introduce the SLOB allocator'',
Linux Kernel Mailing List,
2005
\bibitem{knowlton65}
Kenneth C. Knowlton,
\emph{A Fast Storage Allocator},
1965
\bibitem{pugh89}
William Pugh,
\emph{Skip Lists: A Probabilistic Alternative to Balanced Trees},
1989
\bibitem{manghwani05}
Rahul Manghwani and Tao He,
\emph{Scalable Memory Allocation},
2005
\bibitem{crichton15}
Alex Crichton,
``RFC: Allow changing the default allocator'',
2015
\end{thebibliography}
\end{document}