A Tale of Two Mallocs: On Android libc Allocators – Part 2 – jemalloc

In the first post of this series, we discussed why it is important to understand the inner workings of the libc heap allocator, and did a deep dive into the original Android libc allocator: dlmalloc. In this post, we’ll examine the allocator which replaced dlmalloc as Android’s allocator.

The new allocator selected by the bionic developers is called jemalloc, named after its implementer, Jason Evans. Jason started implementing jemalloc in 2005, and it was then added to FreeBSD’s libc to become that platform’s default allocator. In 2007, the Mozilla Firefox project adopted the stand-alone version of jemalloc as its primary allocator. Since 2009, jemalloc has been used extensively in Facebook’s backend servers, and it is currently maintained by a team at Facebook.

In May 2014, jemalloc was added to the bionic source tree for Android 5.0.0. dlmalloc continued to be the default, but it was possible to select the new heap implementation using a board config flag. In July 2014, jemalloc was made the default.

In an ideal world, every vendor of an Android phone would have made the transition to jemalloc with their Android 5.0.0 ROMs. Unfortunately, many vendors chose to remain with the tried-and-tested dlmalloc heap until later versions of Android; in fact, I’ve seen dlmalloc being used on both Lollipop and Marshmallow devices. Since there is no clear line in time separating the two implementations, you cannot really know a priori whether a given Android 5 or 6 device is using dlmalloc or jemalloc.

jemalloc was designed from the ground up to be highly-performant in symmetric-multi-processing environments. It has many features which are geared towards increasing efficiency and locality in multi-threaded apps, while reducing overall fragmentation.

The first important concept in jemalloc’s implementation is the arena. Each thread is assigned to a given arena, and it allocates and frees only from that arena. Each arena is completely separate from other arenas and, most importantly, each has its own mutexes guarding its data structures. This means that allocator operations can actually proceed in parallel, so long as the threads involved are assigned to different arenas.

In general jemalloc usage, there should be slightly more arenas than there are hardware cores. For some reason, on Android, this is not the case: instead there are exactly two arenas.

Threads are assigned to arenas in a round-robin fashion, which should ensure that the arenas have a more or less equal number of threads.
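The round-robin assignment can be sketched as follows. This is a minimal illustration, not bionic’s actual code; the function name and the use of a plain counter are assumptions (a real implementation would guard or atomically update the counter):

```c
#include <assert.h>

/* Android uses exactly two arenas, as described above. */
#define NARENAS 2

static unsigned next_arena = 0;

/* Hypothetical sketch: each newly created thread takes the next arena
   in sequence, so consecutive threads alternate between the arenas. */
static unsigned assign_arena(void) {
    unsigned a = next_arena;
    next_arena = (next_arena + 1) % NARENAS;
    return a;
}
```

With two arenas, threads 0, 2, 4, … land in arena 0 and threads 1, 3, 5, … in arena 1, keeping the arena populations roughly balanced.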

In jemalloc, memory is allocated from the operating system using mmap. Each mmap operation allocates a chunk. jemalloc chunks roughly correlate to dlmalloc segments. Chunks are all the same size: 256kB on Android versions up to 7.0.0; from 7.0.0, chunks are 512kB for 32-bit processes and 2MB for 64-bit processes. Each chunk belongs to a specific arena. There is a chunk header containing metadata for the chunk, specifically including a pagemap which defines which pages are associated with which runs.

For any jemalloc-managed address, the relevant chunk header can easily be found by simply rounding the address down to a multiple of the chunk size. This means that we have O(1) lookups of metadata in most situations.
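Because chunks are allocated at chunk-aligned addresses, the rounding is just a bit mask. A minimal sketch, assuming the pre-7.0.0 chunk size of 256kB (the function name is illustrative, not jemalloc’s):

```c
#include <assert.h>
#include <stdint.h>

/* 256kB chunks, as on Android versions before 7.0.0. */
#define CHUNK_SIZE ((uintptr_t)256 * 1024)

/* Round an address down to its chunk boundary: since chunks are
   CHUNK_SIZE-aligned, masking off the low bits yields the chunk
   start, which is where the chunk header (and pagemap) lives. */
static uintptr_t chunk_base(uintptr_t addr) {
    return addr & ~(CHUNK_SIZE - 1);
}
```

This is why metadata lookup is O(1): no search is needed, only arithmetic on the pointer itself.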

A run is an area of contiguous memory, located in a chunk. Each run contains a fixed number of regions of a specific size. Different size classes have different numbers of regions. Runs are always a multiple of the page size. Run metadata is stored in the chunk header for the chunk which contains them. In other words, the metadata is out-of-band. Each run has a bitmap which indicates the state of each region in the run. A region can either be in-use or free.

Regions are the smallest unit of the jemalloc system. These are analogous to dlmalloc chunks, except that regions do not carry any metadata at all. Instead each region belongs to a run of regions of the same size. The run stores the metadata for all its regions in the chunk’s header. The region address is the return value from a malloc call, and should be the argument to free.

jemalloc is, at its core, a bucket allocator. Each arena has a set of logical bins, each of which services a specific size class. Allocations are made from the bin with the smallest size large enough to fit the allocation request. On Android, there are 39 bins. By having a carefully selected and limited list of bin sizes, with small steps between them, fragmentation can be decreased.
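The bucket lookup amounts to finding the smallest size class that fits the request. A minimal sketch with a handful of made-up size classes (the real Android table has 39 carefully chosen bins; these values and names are purely illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative small size classes, NOT the real Android bin table. */
static const size_t bin_sizes[] = {8, 16, 32, 48, 64, 80, 96, 112, 128};
#define NBINS (sizeof(bin_sizes) / sizeof(bin_sizes[0]))

/* Return the index of the smallest bin large enough to hold `request`,
   or -1 if the request exceeds all small size classes. */
static int bin_for_size(size_t request) {
    for (size_t i = 0; i < NBINS; i++)
        if (request <= bin_sizes[i])
            return (int)i;
    return -1;
}
```

Note that a 17-byte request is served from the 32-byte bin: the wasted 15 bytes are the price of having a small, fixed set of size classes, which in exchange keeps fragmentation low.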

Note that in dlmalloc, bins are used only as free lists. In jemalloc, bins are used for ALL allocations.

Bin metadata is stored in the arena structure. Each run is associated with a specific bin, and each bin has a ‘current’ run which points to the non-full run from which it is currently allocating. Each of an arena’s 39 bins thus has its own metadata address, size class and current run pointer.

If a run becomes full during an allocation, jemalloc will check whether the arena has any other non-full runs for this bin. If so, the one with the lowest address will be selected and set as the ‘current’ run. If no non-full runs are available in this bin, a new run will be created in either an existing chunk or a new chunk, and that run will be set as the ‘current’ run of the bin.

Arenas keep track of their non-full runs and available chunk space using a set of red-black trees. Finding a non-full run or available space for a new run is thus at most an O(log n) operation.

jemalloc reduces lock contention in a few ways, thereby improving multi-threaded performance. Firstly, each arena has its own locks, so operations on different arenas do not contend for locks. Secondly, the critical sections are very short: a lock only needs to be held when allocating new runs to a given bin, or when flipping the in-use bit of a region in a run. These mechanisms already make jemalloc significantly more thread-friendly than dlmalloc. However, Jason didn’t stop there. He also implemented thread-specific caches.

For each thread and each bin there is a tcache: a list of recently freed regions for that specific bin and thread.

When allocating, jemalloc first looks for a region in the tcache for the required size’s bin, before going to the ‘current’ run for that bin. If the tcache holds a region, that region is used.

When freeing a region, jemalloc pushes the region onto the tcache for the relevant bin. The tcache is LIFO.
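The tcache behaves like a small stack of region pointers. A minimal sketch of the LIFO behaviour, assuming a fixed capacity of 8 as for small bins (struct and function names are invented for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Small-bin tcache capacity, per the flush discussion below. */
#define TCACHE_CAP 8

/* A per-thread, per-bin cache of recently freed regions. */
typedef struct {
    void *regions[TCACHE_CAP];
    int count;
} tcache_bin_t;

/* On free(): push the region for fast reuse. Returns 0 when the cache
   is full (a real implementation would flush instead of failing). */
static int tcache_push(tcache_bin_t *tc, void *region) {
    if (tc->count == TCACHE_CAP)
        return 0;
    tc->regions[tc->count++] = region;
    return 1;
}

/* On malloc(): pop the most recently freed region, if any. */
static void *tcache_pop(tcache_bin_t *tc) {
    return tc->count ? tc->regions[--tc->count] : NULL;
}
```

Because allocation pops the most recently pushed entry, a free immediately followed by a same-sized malloc on the same thread returns the very same region.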

Regions which are currently held in tcaches do not have their in-use bits cleared. Instead they are considered by the greater jemalloc system to be in-use. This saves on locking, as it is only necessary to grab a lock when updating global data structures. Thread-specific data structures are by definition safe from other threads, and thus in many cases jemalloc allocations will not grab locks at all.

If jemalloc tries to allocate a region of a given size, and the thread’s tcache for that bin size is empty, a prefill event will occur. When prefilling, jemalloc will lock the arena mutex, ‘allocate’ a number of regions for this bin from the ‘current’ run, mark their bits in the run’s bitmap as in-use, push these regions onto the thread’s tcache, and release the lock. This ensures that there is always a ‘sane’ number of regions in a tcache, and significantly improves locality, as a given thread will allocate regions of the same size from mostly contiguous memory.

Each tcache has a maximum number of regions which it can contain. For small bins this is 8, and for larger bins this is 20. When we reach this maximum, a flush event occurs. At a flush event, jemalloc takes the oldest half of the tcache’s regions and really frees them. That is, it grabs the lock and marks the regions’ bits as free. At this point they are free to be allocated by other threads.
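Since new frees are pushed on top of the tcache, the oldest regions sit at the bottom of the stack, and a flush drops them while sliding the newest half down. A minimal sketch under those assumptions (here the regions are just integers, and "really freeing" is modelled as discarding the entries):

```c
#include <assert.h>
#include <string.h>

/* Small-bin tcache capacity, per the text above. */
#define TCACHE_CAP 8

typedef struct {
    int regions[TCACHE_CAP]; /* index 0 = oldest, top = newest */
    int count;
} tcache_bin_t;

/* Flush event: really free the oldest half (the bottom of the stack)
   and keep the newest half, shifted down to the bottom. */
static void tcache_flush(tcache_bin_t *tc) {
    int keep = tc->count / 2;
    memmove(tc->regions, tc->regions + tc->count - keep,
            (size_t)keep * sizeof(int));
    tc->count = keep;
}
```

After flushing a full small-bin tcache of 8 entries, the 4 oldest regions have been returned to the run’s bitmap and only the 4 most recent remain cached.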

In addition, jemalloc implements a ‘garbage collection’ mechanism. Essentially, jemalloc counts each allocation and free event. When that count reaches a certain threshold, a so-called ‘hard event’ occurs. Each ‘hard event’, jemalloc looks at a specific bin across all threads, and clears out three-quarters of the regions from the tcaches for that bin. During the next ‘hard event’ the next bin will be targeted for cleanup. This is another way that regions can be removed from tcaches and returned to general availability.
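The two key parameters of a ‘hard event’ are which bin it targets and how much it evicts. A minimal sketch of both, with invented helper names (the source states that each event targets the next bin in sequence and clears out three-quarters of that bin’s cached regions, i.e. keeps the newest quarter):

```c
#include <assert.h>

/* Keep one quarter of the cached regions, evict the other
   three-quarters back to general availability. */
static int gc_keep_count(int cached) {
    return cached / 4;
}

/* Each hard event targets the next bin, wrapping around so that all
   39 bins are cleaned in turn. */
static unsigned gc_next_bin(unsigned bin, unsigned nbins) {
    return (bin + 1) % nbins;
}
```

So a full small-bin tcache of 8 regions is trimmed to 2 by a hard event, and after bin 38 the cycle wraps back to bin 0.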

So when allocating on jemalloc, we observe the following flow.

  • We first calculate the bin for our request size
  • We then look in the tcache of the current thread for the calculated bin
  • If the tcache is empty, we prefill from the bin’s ‘current’ run
  • When the current run is exhausted, we prefill from the non-full run with the lowest address
  • If there are not enough regions in the existing non-full runs, a new run will be allocated in a chunk which has available space
  • If no space is available in a chunk, a new chunk is allocated from the system, and a new run is allocated in that chunk and is used to prefill the tcache.

So now we’ve covered the essential details of jemalloc.

Let’s compare some of the important properties of dlmalloc and jemalloc.

  • dlmalloc is a best-fit allocator, while jemalloc is a bucket allocator
  • dlmalloc uses in-line metadata, while jemalloc keeps its metadata out-of-band
  • user allocations in dlmalloc are called chunks; in jemalloc they’re called regions
  • dlmalloc allocates variable-sized segments from the system, while jemalloc allocates fixed-size chunks
  • jemalloc always allocates from fixed-size regions, while dlmalloc chunks can be of arbitrary 8-byte-aligned sizes
  • in dlmalloc, adjacent allocations are usually not the same size; in jemalloc they are
  • dlmalloc has only the one big lock, while jemalloc has fine-grained mutexes which reduce lock contention
  • jemalloc has thread-specific free lists (tcaches) to further improve multithreading performance
  • jemalloc has a garbage-collection mechanism which helps to clean up tcaches
  • recently freed chunks or regions are reused in a FIFO fashion in dlmalloc, but LIFO in jemalloc

We estimate that the current distribution of devices in use is about 70% jemalloc and 30% dlmalloc. This is largely because most people update their phones relatively frequently, skewing the distribution towards the more modern jemalloc-based systems. Even though the bulk of devices are on jemalloc, it is still necessary to exploit dlmalloc devices in order to get good real-world coverage; for example, certain geographical regions are more likely to have dlmalloc-based phones.