Operating Systems 2020W: Tutorial 9
In this tutorial, we will be playing with eBPF in more detail than you have seen previously. Parts of this tutorial may be slightly confusing. That is okay. You are learning a completely new programming paradigm in one tutorial and you are not expected to master it. The primary goal here is to understand the difference between kernel modules and BPF programs, and be able to explain at a high level how they work.
You may wish to look at the BPF Compiler Collection (BCC) documentation for further information:
Also, here's a dated, but still insightful introductory article on eBPF:
To get started, we will first examine the source code for 3000shellwatch.py and bpf_program.c, the userspace and kernelspace components of our BPF program respectively. You can download a zip file containing all the code needed for this tutorial here.
Getting Started
Open 3000shellwatch.py and bpf_program.c. Try to get an idea of the following (if you are stuck, check the documentation linked above):
- What is a tracepoint? What is a kprobe? What is a uprobe? What are their similarities and differences? bpf_program.c contains examples of all three.
- How do we pass events from kernelspace to userspace?
- Do you notice any big differences between BPF programs and kernel modules? What things have you seen in kernel modules that are missing in the BPF program?
- BPF programs are supposed to be completely production-safe due to the BPF verifier. Do you think it is possible to cause a kernel panic from a BPF program? Make a guess now; you will have a chance to test it later.
- Run a familiar trace command of your choice from one of the previous tutorials, but append the -v flag to the end of the command. trace will now output the source code for the BPF program it generated. How does this output compare with the handwritten BPF program in bpf_program.c? Do you think it would be possible to write one long trace command to do the same thing as 3000shellwatch.py?
- Note that we never run a C compiler on bpf_program.c directly; instead, the python code compiles then loads the C program. Why not just compile the program in advance?
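As a hint for the question about passing events to userspace: a perf buffer delivers each event as a raw block of bytes laid out like the C struct that the BPF program submitted, and the userspace side reinterprets those bytes as a structure. The sketch below imitates that step with Python's ctypes module alone, so it runs without BCC or root. SyscallEvent and decode_event are hypothetical names chosen for illustration; the field layout is copied from struct syscall_event in bpf_program.c, and the "kernel side" is simulated by serializing a structure by hand. BCC's event() method is assumed to do roughly the same thing with a structure it derives from the C definition.

```python
import ctypes


class SyscallEvent(ctypes.Structure):
    # Mirrors struct syscall_event in bpf_program.c:
    #     int syscall; long ret;
    _fields_ = [
        ('syscall', ctypes.c_int),
        ('ret', ctypes.c_long),
    ]


def decode_event(raw):
    """Reinterpret a raw byte record as a SyscallEvent (hypothetical helper)."""
    return SyscallEvent.from_buffer_copy(raw)


# Simulate the bytes a BPF program might submit for syscall number 0
# (read on x86-64) returning 42. In the real tool these bytes come
# from the kernel through the perf buffer.
raw = bytes(SyscallEvent(syscall=0, ret=42))

event = decode_event(raw)
print(event.syscall, event.ret)
```

Because both sides must agree on the struct layout, changing a field in the C struct changes what the Python callback sees.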
Playing with 3000shellwatch
- Run sudo apt install python3-bcc in your terminal on the class VM. This installs a necessary dependency. (If you are running elsewhere, make sure to add python3 support for BCC when installing from source.)
- Open two terminals. In one terminal, run ./3000shell (from Tutorial 3), and in another terminal run sudo ./3000shellwatch.py -p `pidof 3000shell`. Run several commands in 3000shell and observe the output of 3000shellwatch.py. What system calls is 3000shell generating according to 3000shellwatch.py? Compare this output with that of strace.
- Let's try to make 3000shellwatch.py crash the kernel. Modify the tracepoint on lines 69-95 to do something dangerous like dereferencing a NULL pointer. Run your BPF program. What is all that output? Did you crash your kernel or did something else happen?
- Can you use kernelspace helper functions in BPF programs? Try including a header file and calling a kernel function like copy_to_user. What happens when you try to run your program?
- Could you write a kernel module that does everything our BPF program does? How hard would this be?
- Optional: See if you can implement your own tracepoint in 3000shellwatch. To do this, you need to do the following:
  - Examine the list of available tracepoints with sudo tplist. You can search for specific strings by providing it an optional argument. When you find one that looks promising, you can view it in detail by passing the -v flag. If you're looking for an easy suggestion, try syscalls:sys_enter_write.
  - Add your own struct definition that will contain data from the event.
  - Add your own perf event buffer that will pass event data to userspace.
  - Using the same syntax as the existing tracepoint (lines 69-95), add in the definition of your tracepoint.
  - Using the same syntax as the perf buffers in the userspace python script (e.g. lines 25-29), attach your perf buffer so that it will produce output.
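The optional steps above can be sketched end to end as follows. This is one possible shape under stated assumptions, not the reference solution: it uses the suggested syscalls:sys_enter_write tracepoint, it passes the program to BCC inline with BPF(text=...) instead of editing bpf_program.c, and the names write_event, write_events, on_write and main are hypothetical. The PID filter is inlined and relies on the same -DFILTER_PID flag that 3000shellwatch.py passes.

```python
# Sketch only: struct, map, and function names here are hypothetical.
BPF_TEXT = r"""
/* Step 1: a struct that carries data from the event */
struct write_event
{
    u32 fd;
    u64 count;
};

/* Step 2: a perf event buffer that passes events to userspace */
BPF_PERF_OUTPUT(write_events);

/* Step 3: the tracepoint itself */
TRACEPOINT_PROBE(syscalls, sys_enter_write)
{
    /* Only look at the process given by -DFILTER_PID */
    if ((bpf_get_current_pid_tgid() >> 32) != FILTER_PID)
        return 0;

    struct write_event event = {};
    event.fd = args->fd;
    event.count = args->count;

    write_events.perf_submit(args, &event, sizeof(event));
    return 0;
}
"""


def main(pid):
    """Load the program and print write(2) calls made by pid (needs root)."""
    # Imported here so this file can be read and tested without BCC installed.
    from bcc import BPF

    bpf = BPF(text=BPF_TEXT, cflags=[f'-DFILTER_PID={pid}'])

    # Step 4: attach a callback to the perf buffer
    def on_write(cpu, data, size):
        event = bpf['write_events'].event(data)
        print(f'write(fd={event.fd}, count={event.count})')

    bpf['write_events'].open_perf_buffer(on_write)
    try:
        while True:
            bpf.perf_buffer_poll()
    except KeyboardInterrupt:
        pass
```

To try it, call main() with the PID of a running 3000shell from a script run with sudo. The field names fd and count are taken from the tracepoint's format; confirm them on your system with sudo tplist -v syscalls:sys_enter_write.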
Code
3000shellwatch.py
#!/usr/bin/env python3
import os, sys
import time
import argparse

from utils import syscall_name, syscall_ret, signal_name
from bcc import BPF

# Parse arguments
parser = argparse.ArgumentParser()
parser.add_argument('-p', '--pid', type=int, required=True,
        help='PID of 3000shell.')
args = parser.parse_args()

# Set BPF program flags
flags = []
flags.append(f'-I{os.path.realpath(os.path.dirname(__file__))}')
flags.append(f'-DFILTER_PID={args.pid}')

# Load BPF program
bpf = BPF(src_file='bpf_program.c', cflags=flags)

# Define a hook for syscall_events perf buffer
def syscall_events(cpu, data, size):
    event = bpf['syscall_events'].event(data)
    print(f'syscall {syscall_name(event.syscall):<16s} = {syscall_ret(event.ret):>8s}')
bpf['syscall_events'].open_perf_buffer(syscall_events)

# Define a hook for signal_deliver_events perf buffer
def signal_deliver_events(cpu, data, size):
    event = bpf['signal_deliver_events'].event(data)
    print(f'3000shell received {signal_name(event.signal)} from pid {event.sending_pid}')
bpf['signal_deliver_events'].open_perf_buffer(signal_deliver_events)

# Define a hook for fgets_events perf buffer
def fgets_events(cpu, data, size):
    event = bpf['fgets_events'].event(data)
    print(f'user wrote: \"{event.str.decode("utf-8").strip()}\"')
bpf['fgets_events'].open_perf_buffer(fgets_events)

# Attach uprobes
bpf.attach_uprobe(name='c', sym='fgets', fn_name='uprobe_fgets')
bpf.attach_uretprobe(name='c', sym='fgets', fn_name='uretprobe_fgets')

if __name__ == '__main__':
    print(f'Tracing pid {args.pid}, ctrl-c to exit...', file=sys.stderr)
    try:
        while True:
            bpf.perf_buffer_poll(30)
            time.sleep(0.1)
    except KeyboardInterrupt:
        print()
        print('Here is the distribution of read lengths:')
        bpf['readlens'].print_linear_hist('read lengths:')
        print(file=sys.stderr)
        print('Goodbye BPF world!', file=sys.stderr)
utils.py
from errno import errorcode
from bcc.syscall import syscall_name as _syscall_name
# Patch errorcode to add kernel-only errors
errorcode[512] = 'ERESTARTSYS'
errorcode[513] = 'ERESTARTNOINTR'
errorcode[514] = 'ERESTARTNOHAND'
errorcode[515] = 'ENOIOCTLCMD'
errorcode[516] = 'ERESTART_RESTARTBLOCK'
errorcode[517] = 'EPROBE_DEFER'
errorcode[518] = 'EOPENSTALE'
errorcode[521] = 'EBADHANDLE'
errorcode[522] = 'ENOTSYNC'
errorcode[523] = 'EBADCOOKIE'
errorcode[524] = 'ENOTSUPP'
errorcode[525] = 'ETOOSMALL'
errorcode[526] = 'ESERVERFAULT'
errorcode[527] = 'EBADTYPE'
errorcode[528] = 'EJUKEBOX'
errorcode[529] = 'EIOCBQUEUED'
signals = {
    1: 'SIGHUP',
    2: 'SIGINT',
    3: 'SIGQUIT',
    4: 'SIGILL',
    5: 'SIGTRAP',
    6: 'SIGABRT',
    7: 'SIGBUS',
    8: 'SIGFPE',
    9: 'SIGKILL',
    10: 'SIGUSR1',
    11: 'SIGSEGV',
    12: 'SIGUSR2',
    13: 'SIGPIPE',
    14: 'SIGALRM',
    15: 'SIGTERM',
    16: 'SIGSTKFLT',
    17: 'SIGCHLD',
    18: 'SIGCONT',
    19: 'SIGSTOP',
    20: 'SIGTSTP',
    21: 'SIGTTIN',
    22: 'SIGTTOU',
    23: 'SIGURG',
    24: 'SIGXCPU',
    25: 'SIGXFSZ',
    26: 'SIGVTALRM',
    27: 'SIGPROF',
    28: 'SIGWINCH',
    29: 'SIGIO',
    30: 'SIGPWR',
    31: 'SIGSYS',
}

def syscall_name(num):
    return _syscall_name(num).decode('utf-8')

def syscall_ret(code):
    # Non-negative values are successful results; negative values are -errno
    try:
        return str(code) if code >= 0 else '-' + errorcode[-code]
    except KeyError:
        return str(code)

def signal_name(sig):
    try:
        return signals[sig]
    except KeyError:
        return 'UNKNOWN SIGNAL'
bpf_program.c
#include <uapi/asm/unistd_64.h>
#include <linux/sched.h>
#include <linux/signal_types.h>
/* Type definitions below this line --------------------------------- */
#define MAX_STRLEN 512
/* This struct will contain useful information about system calls.
 * We will use it to pass data between system call tracepoints and
 * to return useful information back to userspace. */
struct syscall_event
{
    int syscall;
    long ret;
};

struct fgets_event
{
    void *bufptr;
    char str[MAX_STRLEN];
};

struct signal_deliver_event
{
    void *ksigptr;
    int sending_pid;
    int signal;
};
/* Map definitions below this line ---------------------------------- */
/* This is a perf event buffer. Perf event buffers allow us
* to submit events to userspace. Our userspace program will
* read submitted events at regular intervals. */
BPF_PERF_OUTPUT(syscall_events);
BPF_PERF_OUTPUT(fgets_events);
BPF_PERF_OUTPUT(signal_deliver_events);
/* This is used to store intermediate values
* between entry and exit points. For example, storing the argument
* to fgets and printing it on return. */
BPF_ARRAY(fgets_intermediate, struct fgets_event, 1);
BPF_ARRAY(signal_deliver_intermediate, struct signal_deliver_event, 1);
/* This is used to keep track of value distributions.
* We can use the data to draw fancy histograms in userspace. */
BPF_HISTOGRAM(readlens, long, 10240);
/* Helper functions below this line --------------------------------- */
/* This is a simple filter() function that allows
 * us to look at the specified process, ignoring others.
 * The upper 32 bits of bpf_get_current_pid_tgid() hold the
 * thread group ID, which is what userspace calls the PID.
 * Return 0 on pass, 1 on fail */
static int filter()
{
    u32 pid = (bpf_get_current_pid_tgid() >> 32);
    if (pid == FILTER_PID)
        return 0;
    return 1;
}
/* BPF programs below this line ------------------------------------- */
/* This is a tracepoint. Tracepoints represent a stable API for accessing
 * various events within the kernel. This one keeps track of every time
 * we return from a system call. You can see all tracepoints on the system
 * using the "tplist" bcc tool. */
TRACEPOINT_PROBE(raw_syscalls, sys_exit)
{
    /* Filter PID */
    if (filter())
        return 0;

    if (args->id < 0)
        return 0;

    /* Build the event on the BPF stack */
    struct syscall_event event = {};

    /* Store what we know about the system call */
    event.ret = args->ret;
    event.syscall = (int)args->id;

    /* If we are in a read(2) call, let's keep track of a histogram of lengths */
    if (args->id == __NR_read && args->ret >= 0)
        readlens.increment(args->ret);

    /* This is how we submit an event to userspace. */
    syscall_events.perf_submit(args, &event, sizeof(event));

    return 0;
}

/* Part 1 of the get_signal kprobe */
int kprobe__get_signal(struct pt_regs *ctx, struct ksignal *ksig)
{
    /* Filter PID */
    if (filter())
        return 0;

    int zero = 0;
    struct signal_deliver_event *event = signal_deliver_intermediate.lookup(&zero);
    if (!event)
        return -1;

    event->ksigptr = ksig;

    return 0;
}

/* Part 2 of the get_signal kprobe */
int kretprobe__get_signal(struct pt_regs *ctx)
{
    /* Filter PID */
    if (filter())
        return 0;

    int zero = 0;
    struct signal_deliver_event *event = signal_deliver_intermediate.lookup(&zero);
    if (!event)
        return -1;

    struct ksignal *ksig = (struct ksignal *)event->ksigptr;
    if (!ksig)
        return -2;

    event->signal = ksig->info.si_signo;
    event->sending_pid = ksig->info.si_pid;

    signal_deliver_events.perf_submit(ctx, event, sizeof(*event));

    return 0;
}

/* Part 1 of the fgets uprobe */
int uprobe_fgets(struct pt_regs *ctx)
{
    /* Filter PID */
    if (filter())
        return 0;

    int zero = 0;
    struct fgets_event *event = fgets_intermediate.lookup(&zero);
    if (!event)
        return -1;

    /* Store location of parameter 1 (the buffer) */
    event->bufptr = (void *)PT_REGS_PARM1(ctx);

    return 0;
}

/* Part 2 of the fgets uprobe */
int uretprobe_fgets(struct pt_regs *ctx)
{
    /* Filter PID */
    if (filter())
        return 0;

    int zero = 0;
    struct fgets_event *event = fgets_intermediate.lookup(&zero);
    if (!event)
        return -1;

    /* Read the buffer into event.str */
    bpf_probe_read_str(event->str, sizeof(event->str), event->bufptr);

    fgets_events.perf_submit(ctx, event, sizeof(*event));

    return 0;
}
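
The filter() helper in bpf_program.c shifts the result of bpf_get_current_pid_tgid() right by 32 bits. That helper returns a single 64-bit value with the thread group ID (the PID as userspace sees it) in the upper half and the kernel's per-thread ID in the lower half. The short sketch below illustrates the arithmetic in plain Python; pack_pid_tgid, tgid_of and tid_of are hypothetical names used only for this illustration, and the IDs are made-up values.

```python
FILTER_PID = 1234  # stands in for the -DFILTER_PID value


def pack_pid_tgid(tgid, tid):
    """Combine two 32-bit IDs the way the kernel helper lays them out."""
    return (tgid << 32) | tid


def tgid_of(pid_tgid):
    """Upper 32 bits: thread group ID, i.e. the userspace PID."""
    return pid_tgid >> 32


def tid_of(pid_tgid):
    """Lower 32 bits: the kernel's ID for the individual thread."""
    return pid_tgid & 0xFFFFFFFF


# A second thread (tid 1240) inside process 1234 still matches the filter,
# because only the upper half is compared.
value = pack_pid_tgid(1234, 1240)
print(tgid_of(value) == FILTER_PID, tid_of(value))
```

Comparing the upper half means the filter matches every thread of the traced process, not just its main thread.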