How do I trace a system call in Linux?


linux kernel system-calls

11

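How would I follow a system call from the trap into the kernel, through how the arguments are passed, how the system call is located in the kernel, how the kernel actually processes it, and finally how control returns to user space and state is restored?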

- luminous12

3 Answers

3

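Using ftrace is actually relatively easy. Steven "Mr. ftrace" Rostedt wrote a classic article on it, and there is a second part as well.

Jan-Simon Möller of the Linux Foundation made a free video, and many other good introductory articles can be found with search terms such as "ftrace tutorial" or "ftrace example".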

- Jonathan Ben-Avraham

2

You can use the -f and -ff options. For example:

strace -f -e trace=process bash -c 'ls; :'

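-f: Trace child processes as they are created by currently traced processes as a result of the fork(2) system call.

-ff: If the -o filename option is in effect, each process's trace is written to filename.pid, where pid is the numeric process ID of that process. This is incompatible with -c, because no per-process counts are kept.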
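
As a small illustration of the -ff and -o combination described above, here is a sketch of a wrapper that writes one trace file per process. The function name trace_per_process and the /tmp/trace prefix are made up for this example; only the strace options themselves come from the answer.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption, not part of strace):
# run a command under strace and write one trace file per process.
# $1 is the output prefix; the remaining arguments are the command to trace.
trace_per_process() {
  prefix=$1
  shift
  # -ff with -o writes "$prefix.<pid>" for every traced process.
  # -e trace=process limits the output to process-management calls,
  # as in the example command above.
  strace -ff -o "$prefix" -e trace=process "$@"
}

# Example use (needs strace installed and permission to ptrace):
#   trace_per_process /tmp/trace bash -c 'ls; :'
#   ls /tmp/trace.*
```

Each resulting file should hold the calls of exactly one PID, which is usually easier to read than the interleaved output of plain -f.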

- Rahul Tripathi

Note: "process" here refers to the kernel's notion of a process, which user space usually calls a "thread". - o11c


- Ciro Santilli OurBigBook.com · Accepted Answer

SystemTap

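This is the most powerful method I've found so far. It can even show the call arguments; see: "Does ftrace allow capture of system call arguments to the Linux kernel, or only function names?"

Usage: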

sudo apt-get install systemtap
sudo stap -e 'probe syscall.mkdir { printf("%s[%d] -> %s(%s)\n", execname(), pid(), name, argstr) }'

Then, in another terminal:

sudo rm -rf /tmp/a /tmp/b
mkdir /tmp/a
mkdir /tmp/b

Sample output:

mkdir[4590] -> mkdir("/tmp/a", 0777)
mkdir[4593] -> mkdir("/tmp/b", 0777)

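Documentation: https://sourceware.org/systemtap/documentation.html

It appears to be based on kprobes: https://sourceware.org/systemtap/archpaper.pdf

See also: "How to trace just system call events with ftrace without showing any other functions in the Linux kernel?"

Tested on Ubuntu 18.04, Linux kernel 4.15.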

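ltrace -S shows both system calls and library calls

This awesome tool therefore gives even more visibility into what an executable is doing.

For example, I used it to analyze which system calls dlopen makes: https://unix.stackexchange.com/questions/226524/what-system-call-is-used-to-load-libraries-in-linux/462710#462710

ftrace minimal runnable example

ftrace was mentioned at https://dev59.com/a10a5IYBdhLWcg3w48FP#29840482, but here is a minimal runnable example.

Run with sudo: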

#!/bin/sh
set -eux

d=debug/tracing

mkdir -p debug
if ! mountpoint -q debug; then
  mount -t debugfs nodev debug
fi

# Stop tracing.
echo 0 > "${d}/tracing_on"

# Clear previous traces.
echo > "${d}/trace"

# Find the tracer name.
cat "${d}/available_tracers"

# Disable tracing functions, show only system call events.
echo nop > "${d}/current_tracer"

# Find the event name.
grep mkdir "${d}/available_events"

# Enable tracing mkdir.
# Both statements below seem to do the exact same thing,
# just with different interfaces.
# https://www.kernel.org/doc/html/v4.18/trace/events.html
echo sys_enter_mkdir > "${d}/set_event"
# echo 1 > "${d}/events/syscalls/sys_enter_mkdir/enable"

# Start tracing.
echo 1 > "${d}/tracing_on"

# Generate two mkdir calls by two different processes.
rm -rf /tmp/a /tmp/b
mkdir /tmp/a
mkdir /tmp/b

# View the trace.
cat "${d}/trace"

# Stop tracing.
echo 0 > "${d}/tracing_on"

umount debug

Sample output:

# tracer: nop
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
           mkdir-5619  [005] .... 10249.262531: sys_mkdir(pathname: 7fff93cbfcb0, mode: 1ff)
           mkdir-5620  [003] .... 10249.264613: sys_mkdir(pathname: 7ffcdc91ecb0, mode: 1ff)
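
In the sample output above, the syscall event prints its arguments as raw hexadecimal: pathname is the user-space pointer rather than the string, and mode 1ff is the same value that SystemTap displayed as 0777. The following sketch converts such lines into "PID octal-mode" pairs; decode_mkdir_modes is a hypothetical helper name, and the parsing assumes the exact line layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: read ftrace sys_enter_mkdir lines on stdin and
# print "<pid> <mode in octal>" for each of them.
decode_mkdir_modes() {
  # Pull out the PID (after the last '-' of TASK-PID) and the hex mode.
  sed -n 's/^ *[^ ]*-\([0-9][0-9]*\) .*mode: \([0-9a-fA-F]*\)).*$/\1 \2/p' |
  while read -r pid mode; do
    # printf accepts C-style hex constants, so 0x1ff is printed as 0777.
    printf '%s 0%o\n' "$pid" "0x${mode}"
  done
}

# Feed it the first line of the sample output above.
printf '%s\n' \
  '           mkdir-5619  [005] .... 10249.262531: sys_mkdir(pathname: 7fff93cbfcb0, mode: 1ff)' |
  decode_mkdir_modes
# prints: 5619 0777
```

On a live system the same function could read from the trace file directly, for example `decode_mkdir_modes < debug/tracing/trace` when run as root.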

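One cool thing about this method is that it shows the function calls of all processes on the system at once, although you can also filter for PIDs of interest with set_ftrace_pid.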
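
A note on PID filtering: as far as I understand the kernel's ftrace documentation, set_ftrace_pid restricts the function tracers, while trace events such as sys_enter_mkdir are filtered through a separate file, set_event_pid (assumed here to be present, which should hold for kernels from about 4.4 on). The sketch below takes the tracing directory as a parameter; trace_event_for_pid is a hypothetical name, not a kernel or distribution tool.

```shell
#!/bin/sh
# Hypothetical helper: restrict a syscall trace event to one PID.
#   $1 = tracing directory (debug/tracing in the script above)
#   $2 = event name under events/syscalls, e.g. sys_enter_mkdir
#   $3 = PID whose events should be recorded
# Assumes the directory contains set_event_pid.
trace_event_for_pid() {
  tracing_dir=$1
  event=$2
  pid=$3
  # Only tasks listed here will generate events.
  echo "$pid" > "${tracing_dir}/set_event_pid"
  # Same per-event switch as the commented-out line in the script above.
  echo 1 > "${tracing_dir}/events/syscalls/${event}/enable"
}

# Example use as root, with 1234 standing in for a real PID:
#   trace_event_for_pid debug/tracing sys_enter_mkdir 1234
```

Passing the directory in makes it possible to try the function against a scratch directory before pointing it at a real debugfs mount.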

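Documentation: https://www.kernel.org/doc/html/v4.18/trace/index.html

Tested on Ubuntu 18.04, Linux kernel 4.15.

GDB step debug the Linux kernel

Depending on the level of internal detail you need, this is an option: "How to debug the Linux kernel with GDB and QEMU?"

strace minimal runnable example

Here is a minimal runnable strace example: "How should strace be used?" It uses a freestanding hello world, which makes it very clear how everything works.

More information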

https://en.pingcap.com/blog/how-to-trace-linux-system-calls-in-production-with-minimal-impact-on-performance might be worth a read. It mentions:
```
perf top -F 49 -e raw_syscalls:sys_enter --sort comm,dso --show-nr-samples
```
and the BPF-based traceloop (https://github.com/kinvolk/traceloop), which the article claims is a very fast method:
```
sudo -E ./traceloop cgroups --dump-on-exit /sys/fs/cgroup/system.slice/sshd.service
```