In this series of three posts, we'll take a deep dive into the pertinent details of the two Android libc allocators, followed by some thoughts on exploitation in light of those allocators.
All of the information I’ll impart is the result of our own research into the allocators in question, including a thorough code review of the implementations of those allocators. That said, much of the information is available online in one form or another. I’ve yet to encounter a concise but in-depth description of both allocators and the relevant exploitation techniques. Hopefully that’s what this presentation will provide.
It’s 2018. The days of trivially exploitable stack buffer overflows are over. Modern exploitable vulnerabilities fall into a few meager classes; we’ll focus on two of them.
Even at this late stage in the game, memory corruption bugs are still a thing. Chief among these is the good old buffer overflow. Stack cookies have largely neutered the exploitability of stack-based memory corruptions, so most modern memory corruption vulnerabilities are in objects and buffers on the heap.
In addition to these heap-based memory corruption vulnerabilities, we have use-after-free vulnerabilities.
This class of bugs is all about heap objects coupled with bad memory management practices.
Together these two classes make up a very large portion of the exploitable bugs we find in modern software.
What these classes of bugs have in common is that they both occur mostly in heap objects. Understanding how the heap works is a critical, often overlooked, step in crafting reliable exploits for these kinds of vulnerabilities.
Other prevalent classes of bugs are type confusions and race conditions. We’re not going to focus on those here, because they are not necessarily heap-related.
When we talk about the ‘heap’, what we usually mean is any and all memory objects which are managed using the libc malloc/free interface. This very simple interface lets us allocate so-called “dynamic memory” for our use, and free it when we are done using it. When we approach the task of exploiting a heap-overflow or a use-after-free, it’s not enough to know the semantics of this interface. We need to know what is happening under the hood.
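As a refresher before we go under the hood, the interface itself is tiny. A minimal usage sketch (the function name `demo` is just for illustration):

```c
#include <stdlib.h>
#include <string.h>

/* Minimal use of the libc heap interface: request a chunk, use it, free it. */
int demo(void) {
    char *buf = malloc(32);      /* ask for at least 32 usable bytes */
    if (buf == NULL)
        return -1;
    strcpy(buf, "hello heap");   /* the chunk is ours until we free it */
    free(buf);                   /* hand the chunk back to the allocator */
    return 0;
}
```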
Android uses its own libc implementation, called bionic. When the Android developers came to implement these heap functions, they wisely chose to use an existing, battle tested implementation instead of rolling their own.
The dynamic memory allocator implementation they chose is called dlmalloc. It’s named after its author, Doug Lea. Doug started writing this allocator way back in 1987. It has received many updates and improvements over the years, and was last updated in 2012.
When you call malloc, dlmalloc does a bunch of stuff behind the scenes, and will eventually return a pointer to a block of contiguous memory which you can use in your program. This block is called a ‘chunk’, and is guaranteed to be at least as big as the size you requested.
These chunks don’t come from nowhere. When dlmalloc needs memory to use for chunks, it requests an allocation from the operating system. Each such system allocation is called a ‘segment’.
Segments are the base unit of allocation from the OS. dlmalloc keeps a linked list of segments it has allocated from the system, with the pointers stored in the segment’s footer. The most recently allocated segment is the ‘current’ segment. When it needs more system memory, dlmalloc first tries to extend the current segment using sbrk, falling back to mmap-ing a new segment if that doesn’t work. Segments can be of different sizes, but are always a multiple of the page size. Segments are not guaranteed to be adjacent to one another in memory, and, in fact, are allocated at random addresses when system-wide ASLR is enabled, as it is on Android. If a new segment happens to be contiguous to an existing segment, the two segments are consolidated into a single larger segment.
The current segment contains the ‘top chunk’, which is the chunk of free space available for immediate allocation of chunks. Here’s an example ‘current’ segment, with in use (allocated) chunks in light green and free (unallocated) chunks in blue.
When dlmalloc needs to allocate a new chunk for a malloc call, it will check if the top chunk is big enough to contain the new chunk, and will carve the new chunk from within the top chunk by splitting it. The first half of the top chunk becomes the new chunk to be returned, and the second half becomes the new ‘top chunk’. If the ‘top chunk’ is not large enough to contain the new chunk, a new segment is allocated from the operating system, and the new chunk is allocated from that new segment.
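The split itself is just size arithmetic. Here's a schematic sketch (sizes only, no real memory management): the request is carved off the front of the top chunk, and the remainder becomes the new top.

```c
#include <stddef.h>

/* Schematic model of splitting the top chunk. Not dlmalloc's actual code:
 * the real implementation also adjusts chunk headers and flag bits. */
struct top_split {
    size_t user_chunk;   /* size handed out to satisfy the request */
    size_t new_top;      /* what remains as the new top chunk      */
};

static struct top_split split_top(size_t top_size, size_t request) {
    struct top_split r = {0, 0};
    if (request <= top_size) {   /* otherwise a new segment is needed */
        r.user_chunk = request;
        r.new_top = top_size - request;
    }
    return r;
}
```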
Each chunk has two pointers’ worth of metadata: in 32-bit processes this is 8 bytes. This metadata sits directly before the pointer returned by malloc, i.e. inline before the usable memory. The minimum amount of actual usable memory returned by malloc is two pointers wide.
Chunks of different sizes can be allocated one after the other in the segments. Each chunk marks its size and whether it is in use or not, via the C_INUSE flag. It also marks whether the previous chunk in the segment is in use, with the P_INUSE flag, and the previous chunk’s size. Because the metadata contains the size of the previous chunk, we can easily walk backwards through the chunks in a segment.
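To make this concrete, here's a sketch of the chunk header layout in the spirit of dlmalloc's `malloc_chunk`, with the size and the two flag bits packed into one word. The exact field names here are illustrative; consult the dlmalloc source for the real definitions.

```c
#include <stddef.h>

/* Sketch of a dlmalloc-style chunk header. The fd/bk links overlay the
 * user data area and are only meaningful while the chunk is free. */
struct malloc_chunk {
    size_t prev_foot;           /* size of the previous chunk (if free)     */
    size_t head;                /* this chunk's size plus the two flag bits */
    struct malloc_chunk *fd;    /* forward link -- only used while free     */
    struct malloc_chunk *bk;    /* backward link -- only used while free    */
};

#define PINUSE_BIT ((size_t)1)  /* P_INUSE: previous chunk is in use */
#define CINUSE_BIT ((size_t)2)  /* C_INUSE: this chunk is in use     */

static size_t chunk_size(const struct malloc_chunk *c) {
    return c->head & ~(PINUSE_BIT | CINUSE_BIT);
}

static int chunk_in_use(const struct malloc_chunk *c) {
    return (c->head & CINUSE_BIT) != 0;
}
```

Because chunk sizes are always multiples of the alignment, the low bits of the size word are free to hold the flags.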
When you call free on a given chunk, the first thing that happens is that dlmalloc checks to see if the preceding chunk is in use. If the preceding chunk is free, dlmalloc will consolidate the two chunks into one larger free chunk.
This means that it is impossible for two consecutive chunks in a segment to both be free. The chunks immediately before and after a free chunk are both guaranteed to be in use.
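The backward-consolidation step can be sketched as pure size arithmetic (the real code also fixes up headers, flags, and bin membership):

```c
#include <stddef.h>

/* Schematic of backward consolidation on free(): if the chunk before the
 * one being freed is itself free, the two merge into one larger chunk.
 * This is why two adjacent free chunks can never coexist in a segment. */
static size_t consolidated_size(size_t prev_size, int prev_in_use,
                                size_t cur_size) {
    return prev_in_use ? cur_size : prev_size + cur_size;
}
```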
Simple, right? Obviously what we’ve described is a pretty naïve allocator implementation. There’s a little more to it. Specifically, what we’ve described is a system which never reuses freed memory, as it always allocates from the ‘top chunk’. So how do we efficiently reuse freed memory?
We need some bins.
Bins are used to keep a record of recently freed chunks which can be reused. There are two types of bins: ‘small’ and ‘tree’. Small bins are used for chunks smaller than 0x100 bytes. Each small bin contains chunks of the same size. Tree bins are used for larger chunks, and contain chunks of a given range of sizes. Small bins are implemented as simple doubly-linked lists, and tree bins are implemented as bitwise digital trees (aka ‘tries’), keyed on chunk size. There are 32 small bins and 32 tree bins.
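Since the small bins cover sizes below 0x100 at one size per bin, mapping a chunk size to its bin is a simple shift. This sketch assumes dlmalloc's 32-bit configuration (8-byte granularity, 32 small bins); the macro names are illustrative:

```c
#include <stddef.h>

/* Sketch of dlmalloc-style small bin indexing: 32 bins, one exact chunk
 * size per bin, 8-byte granularity (so 32 * 8 = 0x100 covers them all). */
#define NSMALLBINS     32u
#define SMALLBIN_SHIFT 3u
#define MAX_SMALL_SIZE 0x100u   /* chunks below this live in small bins */

static unsigned small_bin_index(size_t chunk_size) {
    return (unsigned)(chunk_size >> SMALLBIN_SHIFT);
}
```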
When a chunk is freed, it undergoes consolidation if needed, and then the consolidated chunk is added to the appropriate bin for its size. The list and tree node pointers are stored within the actual chunk data, which is safe to use for metadata as it is ‘free’. This is where the minimum size for a chunk comes from: we need space for previous and next pointers in the free chunk’s data.
Here’s an example showing a few segments with some in use and free chunks. The 0x18 bin points to the first of the free chunks of size 0x18, and the rest of them are chained together in a doubly-linked-list.
Note that small bins contain chunks of exactly one size. Tree bins contain ranges of chunk sizes.
dlmalloc is a best-fit allocator. It will always try to find the free chunk with the smallest size greater than or equal to the requested size.
During allocation, before looking at the ‘top chunk’, dlmalloc will first try to find a free chunk in the bins. It first tries to find a chunk which matches the exact size of the allocation request, and then moves upwards through the non-empty bins till it finds the smallest chunk which is larger than the request. If a larger chunk is used, it will be split, and the remainder will be added to the relevant bin to possibly be used for future allocations. Only if no chunk exists in the bins to satisfy the allocation request will the ‘top chunk’ be used.
Note that the bins are first in, first out (FIFO): chunks are reused in the order that they were freed. This can be an important factor in exploitation.
After looking in the bins for an exact size match, but before going to the ‘top chunk’, dlmalloc will try to see if the ‘designated victim’ is large enough to contain the allocation request.
The ‘designated victim’ is the preferred chunk for servicing small requests that don’t have an exact fit. It is the chunk which was most recently split off. It doesn’t sit in any bin. Having the ‘designated victim’ helps to localize allocations to a given memory segment, which can be useful when considering how CPU caches work. Small allocations which don’t have an exact fit in the bins will be split off from this chunk.
So for a small allocation (a request size smaller than 0x100 bytes), this is the flow:
- We first calculate the exact size including metadata and padding
- We then look for an exact match in the small bins
- If that fails, we next see if the ‘designated victim’ is large enough to allocate from
- If the ‘designated victim’ is too small, we then look for a ‘best fit’ in the small bins larger than our request size
- If that fails, we look in the tree bins for a ‘best fit’ match
- Finally, if all else has failed, we look at the top chunk, potentially causing more memory to be allocated from the system.
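The steps above can be modeled as a simple dispatcher. This is a schematic sketch only: the inputs are boolean predicates standing in for the bin-bitmap and chunk-header checks the real allocator performs.

```c
/* Schematic (non-allocating) model of the small-request flow. The order
 * of the checks mirrors the bullet list above. */
enum alloc_path {
    FROM_EXACT_SMALLBIN,   /* exact-size small bin hit             */
    FROM_DV,               /* carved from the designated victim    */
    FROM_LARGER_SMALLBIN,  /* best fit in a larger small bin       */
    FROM_TREEBIN,          /* best fit in the tree bins            */
    FROM_TOP               /* split from the top chunk (last resort) */
};

static enum alloc_path pick_path(int exact_smallbin_hit,
                                 int dv_big_enough,
                                 int larger_smallbin_hit,
                                 int treebin_hit) {
    if (exact_smallbin_hit)  return FROM_EXACT_SMALLBIN;
    if (dv_big_enough)       return FROM_DV;
    if (larger_smallbin_hit) return FROM_LARGER_SMALLBIN;
    if (treebin_hit)         return FROM_TREEBIN;
    return FROM_TOP;
}
```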
Larger allocations are a little simpler. We just try to allocate from the tree bins before attempting the ‘designated victim’ and then the top chunk.
There are no bins for so-called very large allocations, meaning anything larger than the MMAP_THRESHOLD, which is 64KB on Android. These allocations don’t come from the segments. Instead, each such allocation is mmap-ed directly.
So that’s dlmalloc in a nutshell. Hopefully I’ve covered all the salient points. There are a couple of things we should note before moving on.
While dlmalloc takes some steps to reduce heap fragmentation, particularly the reuse of freed chunks via the size-keyed bins, it is still common for smaller free chunks to become trapped between larger in-use chunks that remain allocated for long stretches of the application’s lifetime.
dlmalloc is not thread safe. At all. Both malloc and free touch process-global data structures and the inline metadata between chunks and inside free chunks. Remember that dlmalloc was designed long before the Age of Parallelism, before every application was multithreaded, before hyper-threading and multi-core processors. To make dlmalloc usable in multi-threaded processes, Doug Lea chose the simplest possible fix: the big lock.
Every single malloc or free call locks a global mutex on entry and unlocks it at function exit. This makes dlmalloc usable with threads, but has a major performance impact: essentially all allocator operations are serialized. This is fine for lightly multi-threaded processes, but can be a significant drag on more complex applications.
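The big-lock pattern looks roughly like this. The wrappers here are illustrative (they delegate to libc's malloc/free to stand in for dlmalloc's internal work); bionic's actual symbols differ.

```c
#include <pthread.h>
#include <stdlib.h>

/* Sketch of the "big lock": every allocator entry point serializes on a
 * single global mutex, so all heap operations happen one at a time. */
static pthread_mutex_t g_heap_lock = PTHREAD_MUTEX_INITIALIZER;

static void *locked_malloc(size_t n) {
    pthread_mutex_lock(&g_heap_lock);
    void *p = malloc(n);        /* stands in for the allocator's real work */
    pthread_mutex_unlock(&g_heap_lock);
    return p;
}

static void locked_free(void *p) {
    pthread_mutex_lock(&g_heap_lock);
    free(p);
    pthread_mutex_unlock(&g_heap_lock);
}
```

With every call funneled through one mutex, two threads can never corrupt the bins or chunk headers concurrently, at the cost of zero allocator parallelism.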
The poor multithreading performance of dlmalloc is one of the main reasons that the bionic developers decided to switch to a more modern heap implementation.
That wraps up the discussion of dlmalloc. Read the next post in this series to find out about jemalloc, the more modern Android libc allocator.