Operating Systems 2022F: Tutorial 7
In this tutorial, you will use a variety of bpftrace scripts in order to observe 3000shell and other running programs.
The Linux Kernel Tracepoint API
A tracepoint is a hook placed at an important point in the Linux kernel's code. Tools such as eBPF can attach to tracepoints, which lets userspace programs debug the kernel, measure its performance, or diagnose problems within it. Tracepoints are defined in a number of header files in the Linux kernel via various macros.
ptrace, eBPF, and bpftrace
Processes are normally isolated from each other: the code and data of one process cannot be accessed by another. However, in past tutorials we've used tools to observe process behaviour: strace, ltrace, and gdb. It turns out these programs use a special system call, ptrace, to observe and control another process.
The ptrace system call is designed for debugging and is very intrusive: running ptrace on a process can change how that process behaves even when you don't want it to. As a result, you don't want to use ptrace on anything in a production environment.
But what if you need to observe what is happening in a production system? How can you do so safely? Linux has a technology called eBPF for doing exactly this. We will discuss how eBPF works later. In this tutorial, though, we will just use some scripts that make use of bpftrace, a relatively easy-to-use yet powerful front end to eBPF.
Because eBPF can monitor and even change almost anything in a running system, only root can use eBPF. In contrast, ptrace-based tools like strace can be used safely by anyone, because an unprivileged user can only trace their own processes and each trace affects only one process at a time.
Getting set up (Running the right kernel)
bpftrace and BCC are already installed on the class virtual machines. If you are using a different version of Linux, you'll need to install both yourself. You may be able to use a binary package; however, to get the latest version you may have to build from source. Note that bpftrace's functionality is closely tied to the version of the running Linux kernel, so scripts that work on one system may not work on another.
To minimize problems, you should use the COMP 3000 VM if at all possible.
Checking bpftrace
If you are running the right kernel, the following command should produce a long list:
sudo bpftrace -l
(How can you keep this command from filling your screen?)
If this command stops with an error, you aren't running the right kernel.
bpftrace scripts
The bpftrace scripts we want to run are all in
/usr/local/share/bpftrace/tools/
You can run them by typing
sudo bpftrace <file>
So, to run bashreadline.bt, type
sudo bpftrace /usr/local/share/bpftrace/tools/bashreadline.bt
Copy/paste, up arrow, and Ctrl-K and Ctrl-Y are your friends! Also, you can cd to this directory and run them from there.
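To save some typing, the steps above can be sketched in Python. This is only a convenience sketch: list_bt_scripts() and bpftrace_command() are hypothetical helper names made up for this example, and the directory is the one given in this tutorial. The code only builds the command line; it does not run bpftrace itself.

```python
from pathlib import Path

# Directory named in this tutorial; adjust it if your tools live elsewhere.
TOOLS_DIR = Path("/usr/local/share/bpftrace/tools")


def list_bt_scripts(tools_dir=TOOLS_DIR):
    """Return the sorted names of the .bt scripts found in tools_dir."""
    tools_dir = Path(tools_dir)
    if not tools_dir.is_dir():
        return []
    return sorted(p.name for p in tools_dir.glob("*.bt"))


def bpftrace_command(script, tools_dir=TOOLS_DIR):
    """Build the argument list for 'sudo bpftrace <file>' (hypothetical helper)."""
    return ["sudo", "bpftrace", str(Path(tools_dir) / script)]


if __name__ == "__main__":
    # Print one ready-to-paste command per script.
    for name in list_bt_scripts():
        print(" ".join(bpftrace_command(name)))
```

The list returned by bpftrace_command() could be passed to subprocess.run() if you would rather start a script from Python than from the shell.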
Tasks & Questions
Part A: Examining Predefined Linux Kernel Tracepoints
The purpose of the following questions and tasks is to help you understand how predefined Linux kernel tracepoints work.
- Run "/usr/share/bcc/tools/tplist". What does the output represent?
- Inspect the implementation of syscount.bt and killsnoop.bt using your favourite text editor.
- Are any of the lines output by tplist used in the implementation of syscount.bt? If so, which line(s)?
- tplist uses Python's built-in file IO functions to open a file. Which file does tplist open? Why?
- A common characteristic of bpftrace tools is their use of the args struct. For example, in killsnoop.bt, args->pid is used to retrieve the pid argument of the kill system call, that is, the PID of the process being signalled. Where can we find the fields of the args struct? Hint: Explore the directory structure you found in the previous question.
- The Linux kernel code that partially implements the tracepoint for task creation can be found here:
- What do you think each line does?
- How do you think this kernel space code affects the args struct that is available from userspace?
- Using the trace_write_syscall() function as a reference, finish the trace_open_syscall() function to make it trace the sys_open system call. Make use of every field of the args struct in your implementation.
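A tracepoint name has two parts, a category and an event. The probe in trace_sys_write.py, TRACEPOINT_PROBE(syscalls, sys_enter_write), names both. The sketch below splits names written as category:event or tracepoint:category:event and groups them by category. The function names are made up for this example, the sample names are just illustrative input, and the code assumes every line it is given is a tracepoint name (it does not recognise other probe types).

```python
def split_tracepoint(name):
    """Split a tracepoint name into (category, event).

    Accepts 'category:event' and 'tracepoint:category:event'.
    """
    parts = name.strip().split(":")
    if len(parts) == 3 and parts[0] == "tracepoint":
        parts = parts[1:]
    if len(parts) != 2 or not all(parts):
        raise ValueError("not a tracepoint name: %r" % name)
    return parts[0], parts[1]


def group_by_category(lines):
    """Map each category to the list of its events, in input order."""
    groups = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        category, event = split_tracepoint(line)
        groups.setdefault(category, []).append(event)
    return groups


if __name__ == "__main__":
    # Sample input for illustration only.
    sample = [
        "syscalls:sys_enter_write",
        "syscalls:sys_enter_kill",
        "tracepoint:sched:sched_process_fork",
    ]
    for category, events in group_by_category(sample).items():
        print(category, events)
```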
Part B: Playing with bpftrace tools
- Run opensnoop.bt:
  - What files are being opened when you run ls? Are these what you expect?
  - What files are being opened when you run top? What directory are most of them in? Why?
  - Compare the files being opened by bash and static-sh. Is there a significant difference? Why?
  - What files does 3000shell open? Can you change what files it opens by changing how you create the 3000shell binary?
 
- Run execsnoop.bt:
  - Does every typed command result in one execve system call?
  - If you type in a non-existent command, does it produce any execve's in bash? What about 3000shell?
  - When you ssh into the VM you should see many execve calls. Where are they coming from?
 
- Run killsnoop.bt:
  - Are kill system calls happening when you do nothing? Who is sending them?
  - When you interrupt a command using Ctrl-C or Ctrl-S, is a kill system call generated?
  - If you hit Ctrl-C at a bash prompt, does it generate a kill system call?
  - Are all signals being sent via kill system calls? How do you know?
  - What is a 0 signal used for? Do you see any process sending these signals when you log in or out?
 
- Run syscount.bt:
  - What programs are generating lots of system calls?
  - What calls are the most common? Use /usr/include/x86_64-linux-gnu/asm/unistd_64.h for the names associated with system call numbers on the class VM.
 
- Run bashreadline.bt:
  - What does this program do?
  - Look at the code for this script. How do you think it works?
  - If bash is already running, this script won't report the first command that you type. Why?
 
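For the syscount.bt question above, looking up system call numbers by hand gets tedious. The following is a minimal sketch of a lookup table built from the header file mentioned there. It assumes the header consists of lines of the form "#define __NR_name number"; the function names are made up for this example.

```python
import re

# Matches lines such as: #define __NR_write 1
DEFINE_RE = re.compile(r"^#define\s+__NR_(\w+)\s+(\d+)\s*$")


def parse_syscall_table(header_text):
    """Return a dict mapping system call number -> name."""
    table = {}
    for line in header_text.splitlines():
        match = DEFINE_RE.match(line)
        if match:
            table[int(match.group(2))] = match.group(1)
    return table


def load_syscall_table(path="/usr/include/x86_64-linux-gnu/asm/unistd_64.h"):
    """Read the header from disk (path as given in this tutorial) and parse it."""
    with open(path) as header:
        return parse_syscall_table(header.read())


if __name__ == "__main__":
    # A short excerpt in the assumed format, used here instead of the real file.
    sample = "#define __NR_read 0\n#define __NR_write 1\n#define __NR_open 2\n"
    table = parse_syscall_table(sample)
    print(table[1])
```

On the class VM, load_syscall_table() with no arguments should read the real header, after which table.get(n, "unknown") gives the name for system call number n.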
Code
trace_sys_write.py
#!/usr/bin/python3
#
# trace_sys_write.py
#
# Demonstrates stateful sys_write system call recording along with the
# associated PID of the userspace process that requested the sys_write system call.
# 
# Copyright (c) 2022 Huzaifa Patel.
#
# Author(s):
#   Huzaifa Patel <huzaifa.patel@carleton.ca>
from __future__ import print_function
from bcc import BPF
import time
def trace_write_syscall():
    # Load BPF program
    trace_point = BPF(text="""
    TRACEPOINT_PROBE(syscalls, sys_enter_write) {
        bpf_trace_printk("sys_write(fd: %lu, buf: %lx, count: %lu)", args->fd, args->buf, args->count);
        return 0;
    }
    """, cflags=["-Wno-macro-redefined", "-Wno-return-type"])
    # Header
    print("%-11s %-20s %-8s %s\n" % ("TIME", "PROGRAM", "PID", "FUNCTION CALL"))
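    # Read events from the kernel trace pipe and print one line per event
    while True:
        try:
            (task, pid, cpu, flags, ts, msg) = trace_point.trace_fields()
        except ValueError:
            # Skip trace lines that could not be split into fields
            continue
        # Slow the output down to at most one event per second
        time.sleep(1)
        print("%-3f %-20s %-8d %s" % (ts, task.decode('utf-8'), pid, msg.decode('utf-8')))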
# TODO - Part A: Question 7
# WRITE YOUR CODE IN THE FUNCTION BELOW:
def trace_open_syscall():
    print("")
# trace_write_syscall() loops forever, so comment out the call to it in main() if you want to test trace_open_syscall()
def main():
    trace_write_syscall()
    trace_open_syscall()
main()
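The message printed by trace_write_syscall() is a plain string, so any further processing has to parse it first. The sketch below turns one such message into a dictionary. It assumes the exact format string used in the bpf_trace_printk() call above; parse_write_msg() is a hypothetical helper, not part of BCC.

```python
import re

# Mirrors the format string "sys_write(fd: %lu, buf: %lx, count: %lu)"
WRITE_RE = re.compile(
    r"sys_write\(fd: (\d+), buf: ([0-9a-fA-F]+), count: (\d+)\)"
)


def parse_write_msg(msg):
    """Parse one message into a dict, or return None if it does not match."""
    match = WRITE_RE.search(msg)
    if match is None:
        return None
    return {
        "fd": int(match.group(1)),
        "buf": int(match.group(2), 16),  # buffer address, printed in hex
        "count": int(match.group(3)),
    }


if __name__ == "__main__":
    # A made-up message in the assumed format.
    print(parse_write_msg("sys_write(fd: 1, buf: 7ffc1234abcd, count: 12)"))
```

In the tracing loop, this could be applied to msg.decode('utf-8'), for example to keep only writes to a particular file descriptor.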