Watching activity in the kernel with the bpftrace tool
Programming Snapshot – bpftrace
Who is constantly creating the new processes that are paralyzing the system? Which process opens the most files and how many bytes is it reading or writing? Mike Schilli pokes inside the kernel to answer these questions with bpftrace and its code probes.
If you are tasked with discovering the cause of a performance problem on a Linux system that has slowed down to a crawl, you will typically turn to tools such as iostat, top, or mpstat to see exactly what is throwing a spanner in the works [1]. Not enough RAM? Lame hard disk? CPU overloaded? Or is network throughput the bottleneck?
Although a tool like top shows you the running processes, it cannot detect short-lived instances that start and end again immediately. Periodically polling the process list only makes sense for visualizing long-running processes.
Fortunately, the Linux kernel already contains thousands of test probes known as Kprobes and tracepoints. Users can inject code, log events, or create statistics there. One totally hot tool for doing this is bpftrace. With simple one-liners, it injects scripts into the kernel that determine, in real time, metrics such as the number of bytes heading off into the network or onto the hard drive, or that list which processes open or close which files.
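As a foretaste, the following minimal sketch shows how such a script could catch the short-lived processes that top misses. It assumes that the kernel exposes the syscalls:sys_enter_execve tracepoint and that the installed bpftrace version accepts the args->filename notation (newer releases also allow args.filename); the output layout is purely illustrative.

```bpftrace
#!/usr/bin/bpftrace

// Fires on entry to the execve() system call, that is, whenever
// a process is about to start a new program.
tracepoint:syscalls:sys_enter_execve
{
    // pid and comm describe the calling process (often a shell);
    // args->filename is the path of the program to be started.
    printf("%-8d %-16s %s\n", pid, comm, str(args->filename));
}
```

Run with sudo, the script prints one line per program start until you press Ctrl+C, no matter how briefly the new process lives.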
BPF stands for Berkeley Packet Filter and testifies to the origin of the corresponding tool from the BSD world as a tracing tool for network packets. The practice of scattering probes throughout the code that are usually dormant, but run small snippets of code when triggered, proved so practical that it soon entered the Linux world as eBPF.
Once there, it lost its ties to network packets and conquered wide areas of the kernel code as a generic tracing concept. Good naming is hard work that engineers often shy away from, so the author of eBPF changed the name of his now popular work back to BPF. Of course, that complicates things for authors writing tutorials like this one, who are hard pressed to find an explanation as to why the BPF name has lost all meaning with respect to the product as it is today.
The approach of distributing dynamically deployable probes in kernel code came from the Sun world. For a long time, Solaris was the only operating system that allowed administrators to use DTrace to activate small pieces of D language code at strategic points, such as the system call entry point, and fire off counters or timers for performance analysis.
BPF on newer Linux kernels works in a similar way to DTrace, but has been rewritten (also for patent reasons). It executes instructions assigned to the probes in the BPF language, either in an interpreter or via a JIT compiler in native code, directly inside the kernel.
Status: Improving
The bpftrace programming language is very reminiscent of scripting with Unix veteran Awk, but it's still incomplete, and programmers sometimes struggle to complete even the simplest of tasks.
The bpftrace parser (implemented via the Unix veterans Lex and Yacc) is in a sorry state that doesn't even come close to the functionality of Awk – but maybe it will at some point. Netflix engineer Brendan Gregg and some open source friends are working on fixing it. Brendan's book on BPF [2] will be published in December 2019 (a preview is already available).
Back to the task at hand: How do you enable a probe in the kernel that outputs a message each time any userspace program calls the open() function to open a file? With such a probe, you can monitor in real time which processes are opening which files. Turns out this is really easy to do. Listing 1 [3] shows the program code; Figure 1 shows the program output.
Listing 1
sys-open.bt
01 #!/usr/bin/bpftrace
02
03 interval:s:5
04 {
05   exit();
06 }
07
08 kprobe:do_sys_open
09 {
10   printf("%s %s\n", comm, str(arg1));
11 }
Compact Code
The actual work starts in line 8 with the definition of the kprobe:do_sys_open probe; the following block contains instructions to be executed when the probe triggers. When triggering it, the kernel tells the probe which file the open() system call wants to open. In the block, the printf() instruction outputs the command name of the triggering process, stored in the comm variable, along with the argument arg1, which carries the name of the file to be opened. Note that bpftrace numbers probe arguments starting at arg0, so arg1 is the second argument of do_sys_open(). Because printf() expects a string, but BPF saves arg1 as a character pointer, the standard str() function converts the pointer appropriately.
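The same idea can be sketched with a tracepoint instead of a Kprobe. Tracepoints are a more stable kernel interface and name their arguments, which avoids counting argument positions. The sketch below is an assumption, not part of Listing 1: It relies on the syscalls:sys_enter_openat tracepoint, which current C libraries typically end up using for open() calls, and on the args->filename notation.

```bpftrace
#!/usr/bin/bpftrace

// Tracepoint variant of Listing 1: triggers on entry to openat().
tracepoint:syscalls:sys_enter_openat
{
    // args->filename is a pointer to the path in user memory;
    // str() copies it into a printable string.
    printf("%s %s\n", comm, str(args->filename));
}
```

Unlike Listing 1, this version has no interval probe, so it keeps running until you stop it with Ctrl+C.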
The code for the interval:s:5 event starting in line 3 is just an optional feature that cancels the program after five seconds. The event defines an interval of five seconds at which bpftrace jumps into the code block. The call to exit(), which shuts down the program, occurs here as soon as the block has been accessed for the first time. Tracing tools often use intervals like this to output consolidated statistics every few seconds. Once bpftrace has been installed on an Ubuntu system like this:
$ sudo apt-get update
$ sudo apt-get install bpftrace
all you need to do is run Listing 1 with sudo. It launches in the blink of an eye and keeps showing you which processes on the system are currently attempting to open which files. Before you get too excited, however, please note that bpftrace only works on relatively new kernels. Its creators recommend at least version 4.9, and preferably a series 5 kernel.
It is a very powerful tool. Astonished users will rub their eyes in amazement thinking about what just happened behind the scenes during the inconspicuous call: bpftrace activated the do_sys_open kprobe in the kernel and translated the printf() statement into BPF bytecode. It then installed the compiled code on the probe, causing it to display a message every time the kernel passes the probe. When the bpftrace call terminates, it deactivates the probe in the kernel and removes the injected code.
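The interval probe from Listing 1 can also do what tracing tools typically use it for: printing consolidated statistics every few seconds. The following is a minimal sketch along those lines, not the author's code; the map name @opens is made up for illustration.

```bpftrace
#!/usr/bin/bpftrace

// Count open calls per command name instead of printing each one.
kprobe:do_sys_open
{
    @opens[comm] = count();
}

// Every five seconds, print the counters and start over.
interval:s:5
{
    print(@opens);
    clear(@opens);
}
```

Instead of one line per open() call, this variant emits a compact table of command names and counts every five seconds, which makes the busiest file openers easy to spot.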
Full Tilt While Idle
How does this work inside the kernel? It would obviously be devastating for kernel performance if the kernel had to check at every probe whether it is currently active, only to carry on normally in almost 100 percent of the cases because the probe is inactive. Of the thousands of possible probes, only very few, if any, are active at any given time.
Instead, the BPF technology, just like DTrace under Solaris, uses a trick: Normally, when the probe is inactive, a 5-byte no-op instruction sits at its place in the code, which the processor skips with practically no impact at run time. If the user activates the probe, for example by calling bpftrace, BPF replaces the no-op instruction in the kernel with a jump to the interpreter that executes the desired code.
No doubt, the CPU will consume time when executing the BPF instructions, which will slow down the kernel a bit. But since the processor stays in kernel mode and doesn't have to switch to user space every time, the probe can quickly refresh the desired statistics – then the flow continues with the actual kernel code.
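A sketch of what such in-kernel statistics might look like: The script below sums the bytes each process reads, entirely inside the kernel, and only hands the totals to user space at the end. It assumes the syscalls:sys_exit_read tracepoint, whose ret field holds the return value of read(); the map name @read_bytes is hypothetical.

```bpftrace
#!/usr/bin/bpftrace

// Triggers when read() returns; the filter skips errors and
// end-of-file results (ret <= 0).
tracepoint:syscalls:sys_exit_read
/args->ret > 0/
{
    // Add the number of bytes read to the caller's running total.
    @read_bytes[comm] = sum(args->ret);
}
```

When you stop the script with Ctrl+C, bpftrace prints the accumulated map, one total per command name.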
However, if the infiltrated code were to block the kernel, the result would be devastating: The entire system would stop, which is tantamount to a computer crash. That's why BPF verifies the code before it is introduced and only inserts it if the analysis shows that it will terminate relatively quickly. This is why the bpftrace language does not offer for loops or similar constructs for which it cannot predict with certainty whether they will stop running in the foreseeable future.
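Where a fixed number of repetitions is enough, bpftrace offers unroll() as a substitute: The block is copied a constant number of times at compile time, so the verifier never sees a loop. The sketch below assumes a bpftrace version that supports unroll(); the message text is purely illustrative.

```bpftrace
#!/usr/bin/bpftrace

kprobe:do_sys_open
{
    $i = 1;
    // Expanded into three copies of the block at compile time,
    // so the resulting BPF program contains no loop.
    unroll (3) {
        printf("%s: pass %d of 3\n", comm, $i);
        $i = $i + 1;
    }
}
```

Because the repetition count must be a constant, the run time of the probe remains predictable.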