Operating Systems 2022F: Tutorial 7
This tutorial is still being developed.
In this tutorial, you will use a variety of bpftrace scripts in order to observe 3000shell and other running programs.
The Linux Kernel Tracepoint API
A tracepoint is a hook placed at an important point in the Linux kernel's code. Tools such as eBPF can attach to tracepoints, allowing userspace programs to debug, measure performance, or diagnose problems within the kernel. Tracepoints are defined in a number of kernel header files via various macros.
ptrace, eBPF, and bpftrace
Processes are normally isolated from each other, in that code and data in one cannot be accessed by another. However, in the past tutorials we've used tools to observe process behaviour: strace, ltrace, and gdb. It turns out these programs use a special system call, ptrace, to observe and control another process.
The ptrace system call is designed for debugging and is very intrusive: running ptrace on a process can change how that process behaves even when you don't want it to. As a result, you don't want to use ptrace on anything in a production environment.
But what if you need to observe what is happening in a production system? How can you do so safely? Linux has a technology called eBPF for doing exactly this. We will discuss how eBPF works in more detail later. In this tutorial, though, we will just use some scripts that make use of bpftrace, a relatively easy-to-use yet powerful front end to eBPF.
Because eBPF can monitor and even change almost anything in a running system, only root can use eBPF. In contrast, ptrace-based tools like strace can be used safely by anyone because they can only affect one process at a time.
Getting set up (Running the right kernel)
bpftrace is already installed on the class virtual machines. If you are using a different version of Linux, you'll need to install bpftrace yourself. You may be able to use a binary package; however, bpftrace's functionality is very closely tied to the running version of the Linux kernel, thus scripts that work on one system may not work on another. To minimize problems, then, you should use the COMP 3000 VM if at all possible.
Having said this, there is one problem with the original class VMs: they are running the wrong kernel! The Ubuntu kvm kernel for some reason doesn't have all of the eBPF-related configuration options enabled even though the generic Ubuntu kernel does. So, you'll need to change the kernel that you're running in your VM.
There are two solutions:
- Make a new VM with an updated image COMP3000A-2021F-2. This VM has the right kernel installed. (If you do this, please delete your old VM.)
- Follow the steps below to install the right kernel and make it the default.
Installing a new kernel in the (old) class VM
Run the following commands (everything after a # is a comment):
 sudo -i                      #  become root
 apt update                   #  update the package database
 apt -y dist-upgrade          #  upgrade all packages
 apt install linux-virtual    #  install linux-virtual, which will
                              #    install a generic linux kernel
 apt clean; apt -y autoremove #  clean up
 grub-reboot 1\>4             #  on next reboot, select the 2nd
                              #    menu item and then the 5th one
 reboot                       #  reboot!
Note that the grub-reboot command assumes that you have two kvm kernels installed plus the new generic one. Your installed packages should look something like this:
 student@comp3000:~$ dpkg --list | grep "linux-image-5" | col2
 linux-image-5.11.0-1015-kvm
 linux-image-5.11.0-1017-kvm
 linux-image-5.11.0-37-generic
If it doesn't, you'll need to change the 4 in the 1\>4 to the right number. Every kernel generates two menu entries, the second of which is a recovery image; thus, the number we want is the kernel's base-0 index times two. In other words, to run each of the above kernels we would put in 0, 2, or 4, respectively.
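The index arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name grub_reboot_target is made up here, and it assumes (as the text does) that the kernels appear in the submenu in the listed order, with one recovery entry after each.

```python
def grub_reboot_target(kernels, wanted, submenu=1):
    """Return the grub-reboot argument that selects `wanted`.

    Assumptions taken from the tutorial text: the kernels live under
    top-level menu item `submenu` (base-0), and every kernel contributes
    two entries (normal, then recovery) in the same order as `kernels`.
    """
    index = kernels.index(wanted)      # base-0 position of the kernel
    return f"{submenu}>{index * 2}"    # two menu entries per earlier kernel


kernels = [
    "linux-image-5.11.0-1015-kvm",
    "linux-image-5.11.0-1017-kvm",
    "linux-image-5.11.0-37-generic",
]
target = grub_reboot_target(kernels, "linux-image-5.11.0-37-generic")
print(target)  # 1>4
```

On the command line the > still has to be escaped or quoted (as in 1\>4) so that the shell does not treat it as a redirection.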
(Note this is all easier if you can see the grub menu when the kernel boots. We will later show you how to navigate this menu; for now the above should get you running the correct kernel.)
Checking bpftrace
If you are running the right kernel, the following command should produce a long list:
sudo bpftrace -l
(How can you keep this command from filling your screen?)
If this command stops with an error, you aren't running the right kernel.
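The list is long because it contains every probe the kernel exposes. Each line starts with the probe type followed by a colon, so one way to get an overview is to count the lines per type. The sketch below does that in Python; the function name and the three sample lines are illustrative only, and the real output depends on the running kernel.

```python
from collections import Counter


def count_probe_types(listing):
    """Count probes by the text before the first ':' on each line."""
    counts = Counter()
    for line in listing.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or unexpected lines
        probe_type = line.split(":", 1)[0]
        counts[probe_type] += 1
    return counts


# Hypothetical sample in the style of `bpftrace -l` output.
sample = """\
tracepoint:syscalls:sys_enter_write
tracepoint:syscalls:sys_enter_read
kprobe:vfs_read
"""
counts = count_probe_types(sample)
print(dict(counts))  # {'tracepoint': 2, 'kprobe': 1}
```

To apply it to the real list, the output of sudo bpftrace -l could be captured with subprocess.run(..., capture_output=True, text=True) and its stdout passed to count_probe_types().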
bpftrace scripts
The bpftrace scripts we want to run are all in
/usr/local/share/bpftrace/tools/
You can run them by typing
sudo bpftrace <file>
So, to run bashreadline.bt, type
sudo bpftrace /usr/local/share/bpftrace/tools/bashreadline.bt
Copy/paste, up arrow, and Ctrl-K and Ctrl-Y are your friends! Also, you can cd to this directory and run them from there.
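If you would rather not retype the path each time, a small wrapper can build the command for you. This is a minimal sketch, not part of the class materials: the names TOOLS_DIR, bpftrace_command, and run_tool are made up for illustration, and the directory is the one given above.

```python
import os
import subprocess

TOOLS_DIR = "/usr/local/share/bpftrace/tools"  # directory named in the text


def bpftrace_command(script):
    """Return the argv list for running one of the bundled scripts as root."""
    return ["sudo", "bpftrace", os.path.join(TOOLS_DIR, script)]


def run_tool(script):
    """Run the script and return its exit status."""
    return subprocess.run(bpftrace_command(script)).returncode


cmd = bpftrace_command("bashreadline.bt")
print(" ".join(cmd))
```

Calling run_tool("opensnoop.bt") would then start that script; nothing is executed until run_tool() is called.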
Tasks & Questions
Part A: Examining Predefined Linux Kernel Tracepoints
The purpose of the following questions and tasks is to help you understand how predefined Linux kernel tracepoints work.
- Run "/usr/share/bcc/tools/tplist". What does the output represent?
- Inspect the implementation of syscount.bt and killsnoop.bt using your favourite text editor.
- Are any of the lines output by tplist used in the implementation of syscount.bt? If so, which line(s)?
- tplist uses Python's built-in file I/O functions to open a file. Which file does tplist open? Why?
- A common characteristic of bpftrace tools is their use of the args struct. For example, in killsnoop.bt, args->pid is used to retrieve the pid of the process that initiated the kill system call. Where can we find the fields of the args struct?
- The Linux kernel code that partially implements the tracepoint for task creation can be found here:
- What do you think each line does?
- How do you think this kernel space code affects the args struct that is available from userspace?
- Using the trace_write_syscall() function as a reference, finish the function trace_read_syscall() to make it trace the sys_read system call. Make use of every field of the args struct in your implementation.
Part B: Playing with bpftrace tools
- Run opensnoop.bt:
  - What files are being opened when you run ls? Are these what you expect?
  - What files are being opened when you run top? What directory are most of them in? Why?
  - Compare the files being opened by bash and static-sh. Is there a significant difference? Why?
  - What files does 3000shell open? Can you change what files it opens by changing how you create the 3000shell binary?
 
- Run execsnoop.bt:
  - Does every typed command result in one execve system call?
  - If you type in a non-existent command, does it produce any execve's in bash? What about 3000shell?
  - When you ssh into the VM you should see many execve calls. Where are they coming from?
 
- Run killsnoop.bt:
  - Are kill system calls happening when you do nothing? Who is sending them?
  - When you interrupt a command using Ctrl-C or Ctrl-S, is a kill system call generated?
  - If you hit Ctrl-C at a bash prompt, does it generate a kill system call?
  - Are all signals being sent via kill system calls? How do you know?
  - What is a 0 signal used for? Do you see any process sending these signals when you log in or out?
 
- Run syscount.bt:
  - What programs are generating lots of system calls?
  - What calls are the most common? Use /usr/include/x86_64-linux-gnu/asm/unistd_64.h for the names associated with system call numbers on the class VM.
 
- Run bashreadline.bt:
  - What does this program do?
  - Look at the code for this script. How do you think it works?
  - If bash is already running, this script won't report the first command that you type. Why?
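
For the syscount.bt question above, looking up system call numbers by hand gets tedious. The header defines each number on a line of the form "#define __NR_name number", so a lookup table can be built with a short script. The following is a sketch under that assumption; the function name and the sample text are illustrative, and the sample only imitates the first few lines of the header.

```python
import re

# Matches lines such as "#define __NR_write 1".
_NR_LINE = re.compile(r"^#define\s+__NR_(\w+)\s+(\d+)\s*$")


def parse_syscall_table(header_text):
    """Map syscall number -> name from the text of unistd_64.h."""
    table = {}
    for line in header_text.splitlines():
        match = _NR_LINE.match(line)
        if match:
            name, number = match.group(1), int(match.group(2))
            table[number] = name
    return table


# Sample lines in the style of the x86_64 header (not the full file).
sample = """\
#ifndef _ASM_X86_UNISTD_64_H
#define _ASM_X86_UNISTD_64_H
#define __NR_read 0
#define __NR_write 1
#define __NR_open 2
#define __NR_close 3
"""
table = parse_syscall_table(sample)
print(table[1])  # write
```

On the class VM, the real table could be built by reading /usr/include/x86_64-linux-gnu/asm/unistd_64.h with open() and passing its contents to parse_syscall_table().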
 
Code
trace_sys_write.py
#!/usr/bin/python3
#
# trace_sys_write.py
#
# Demonstrates stateful sys_write system call recording along with the
# associated PID of the userspace process that requested the sys_write system call.
# 
# Copyright (c) 2022 Huzaifa Patel.
#
# Author(s):
#   Huzaifa Patel <huzaifa.patel@carleton.ca>
from __future__ import print_function
from bcc import BPF
import time
def trace_write_syscall():
    # Load BPF program
    trace_point = BPF(text="""
    TRACEPOINT_PROBE(syscalls, sys_enter_write) {
        bpf_trace_printk("sys_write(fd: %lu, buf: %lx, count: %lu)", args->fd, args->buf, args->count);
        return 0;
    }
    """, cflags=["-Wno-macro-redefined", "-Wno-return-type"])
    # Header
    print("%-11s %-20s %-8s %s\n" % ("TIME", "PROGRAM", "PID", "FUNCTION CALL"))
    # Format Output
    while 1:
        try:
            (task, pid, cpu, flags, ts, msg) = trace_point.trace_fields()
        except ValueError:
            continue
        time.sleep(1)
        print("%-3f %-20s %-8d %s" % (ts, task.decode('utf-8'), pid, msg.decode('utf-8')))
# TODO
# WRITE YOUR CODE IN THE FUNCTION BELOW:
def trace_read_syscall():
    print("")
def main():
    trace_write_syscall()
    trace_read_syscall()
main()
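
The messages printed by trace_write_syscall() follow the fixed format string passed to bpf_trace_printk(), so they can be split back into their fields if you want to post-process the output. The sketch below is separate from the tutorial script and needs neither bcc nor root; the function name parse_write_message and the sample values are made up for illustration.

```python
import re

# Mirrors the format string used in trace_sys_write.py:
#   "sys_write(fd: %lu, buf: %lx, count: %lu)"
_WRITE_MSG = re.compile(
    r"sys_write\(fd: (\d+), buf: ([0-9a-f]+), count: (\d+)\)"
)


def parse_write_message(msg):
    """Return (fd, buf_address, count), or None if msg has another shape."""
    match = _WRITE_MSG.fullmatch(msg.strip())
    if match is None:
        return None
    fd, buf, count = match.groups()
    return int(fd), int(buf, 16), int(count)  # buf is printed in hex


# Hypothetical message; the values are invented for this example.
parsed = parse_write_message("sys_write(fd: 1, buf: 7ffc1234abcd, count: 42)")
print(parsed)
```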