Linux application profiling

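How can I record the performance of an application on a Linux machine? I do not have an IDE.
Ideally, I need an application that can attach to a process and log periodic snapshots of:
- memory usage
- number of threads
- CPU usage

6 Answers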

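"Ideally, I need an application that can attach to a process and log periodic snapshots of: memory usage, number of threads, CPU usage"
Well, to collect this kind of information about a process, you do not actually need a profiler on Linux.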
  1. You can use top in batch mode. It runs in batch mode either until it is killed or until N iterations have completed:

    top -b -p `pidof a.out`
    

    or

    top -b -p `pidof a.out` -n 100
    

    and you will get this:

    $ top -b -p `pidof a.out`
    
    top - 10:31:50 up 12 days, 19:08,  5 users,  load average: 0.02, 0.01, 0.02
    Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
    Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:  16330584k total,  2335024k used, 13995560k free,   241348k buffers
    Swap:  4194296k total,        0k used,  4194296k free,  1631880k cached
    
      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    24402 SK        20   0 98.7m 1056  860 S 43.9  0.0   0:11.87 a.out
    
    
    top - 10:31:53 up 12 days, 19:08,  5 users,  load average: 0.02, 0.01, 0.02
    Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
    Cpu(s):  0.9%us,  3.7%sy,  0.0%ni, 95.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:  16330584k total,  2335148k used, 13995436k free,   241348k buffers
    Swap:  4194296k total,        0k used,  4194296k free,  1631880k cached
    
      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    24402 SK        20   0 98.7m 1072  860 S 19.0  0.0   0:12.44 a.out
    
  2. You can use ps (for instance in a shell script)

    ps --format pid,pcpu,cputime,etime,size,vsz,cmd -p `pidof a.out`
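
For the three metrics named in the question, the same numbers can also be read straight from /proc and logged in a loop. The following is a minimal sketch rather than a polished tool: the function names `snapshot` and `log_snapshots` are made up for illustration, and it assumes the Linux /proc layout described in proc(5).

```shell
#!/bin/sh
# Print one snapshot line for a PID: epoch seconds, RSS in kB, thread count,
# and cumulative CPU ticks (utime + stime), all read from /proc.
# (Function names here are illustrative, not part of any standard tool.)
snapshot() {
    pid=$1
    [ -r "/proc/$pid/status" ] || return 1
    rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status")
    threads=$(awk '/^Threads:/ {print $2}' "/proc/$pid/status")
    stat=$(cat "/proc/$pid/stat") || return 1
    # Drop everything up to the last ") " so a command name containing
    # spaces or parentheses cannot shift the fields.
    rest=${stat##*) }
    # $rest starts at field 3 (state); utime and stime are fields 14 and 15,
    # which makes them the 12th and 13th words here.
    set -- $rest
    cpu_ticks=$(( ${12} + ${13} ))
    echo "$(date +%s) ${rss_kb:-0} ${threads:-0} $cpu_ticks"
}

# Log COUNT snapshots of PID, INTERVAL seconds apart.
log_snapshots() {
    pid=$1 interval=$2 count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        snapshot "$pid" || break
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, `log_snapshots "$(pidof a.out)" 5 120 >> a.out.log` would append one line every 5 seconds for roughly 10 minutes. CPU usage over an interval is the difference in ticks between two lines, divided by `getconf CLK_TCK` times the elapsed seconds.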
    

    I need some means of recording the performance of an application on a Linux machine

    In order to do this you need to use perf if your Linux kernel is newer than 2.6.32, or OProfile if it is older. Neither program requires you to instrument your program (as Gprof does). However, to get a correct call graph in perf you need to build your program with -fno-omit-frame-pointer. For example: g++ -fno-omit-frame-pointer -O2 main.cpp.

As for Linux perf:
  1. To record performance data:

    perf record -p `pidof a.out`
    

    or to record for 10 seconds:

    perf record -p `pidof a.out` sleep 10
    

    or to record with a call graph:

    perf record -g -p `pidof a.out`
    
  2. To analyze the recorded data

    perf report --stdio
    perf report --stdio --sort=dso -g none
    perf report --stdio -g none
    perf report --stdio -g
    

    On RHEL 6.3 it is allowed to read /boot/System.map-2.6.32-279.el6.x86_64, so I usually add --kallsyms=/boot/System.map-2.6.32-279.el6.x86_64 when running perf report:

    perf report --stdio -g --kallsyms=/boot/System.map-2.6.32-279.el6.x86_64
    

    Here I wrote some more information on using Linux `perf`:

    First of all, this is a tutorial about Linux profiling with perf.


    You can see a "live" analysis of your application with perf top:

     sudo perf top -p `pidof a.out` -K
    

 

Or you can record the performance data of a running application and analyze it afterwards, using the same perf record -p and perf report --stdio commands listed above.

Or you can record the performance data of an application by launching it in this way and waiting for it to exit, and then analyze it:

perf record ./a.out

This is an example of profiling a test program.

The test program is in the file main.cpp (main.cpp is at the bottom of the answer).

I compile it in this way:

g++ -m64 -fno-omit-frame-pointer -g main.cpp -L.  -ltcmalloc_minimal -o my_test

I use libtcmalloc_minimal.so because it is compiled with -fno-omit-frame-pointer, while libc malloc seems to be compiled without this option. Then I run my test program:

./my_test 100000000

Then I record the performance data of the running process:

perf record -g  -p `pidof my_test` -o ./my_test.perf.data sleep 30

Then I analyze the load per module:

perf report --stdio -g none --sort comm,dso -i ./my_test.perf.data

# Overhead  Command                 Shared Object
# ........  .......  ............................
#
    70.06%  my_test  my_test
    28.33%  my_test  libtcmalloc_minimal.so.0.1.0
     1.61%  my_test  [kernel.kallsyms]

Then the load per function is analyzed:

perf report --stdio -g none -i ./my_test.perf.data | c++filt

# Overhead  Command                 Shared Object                       Symbol
# ........  .......  ............................  ...........................
#
    29.30%  my_test  my_test                       [.] f2(long)
    29.14%  my_test  my_test                       [.] f1(long)
    15.17%  my_test  libtcmalloc_minimal.so.0.1.0  [.] operator new(unsigned long)
    13.16%  my_test  libtcmalloc_minimal.so.0.1.0  [.] operator delete(void*)
     9.44%  my_test  my_test                       [.] process_request(long)
     1.01%  my_test  my_test                       [.] operator delete(void*)@plt
     0.97%  my_test  my_test                       [.] operator new(unsigned long)@plt
     0.20%  my_test  my_test                       [.] main
     0.19%  my_test  [kernel.kallsyms]             [k] apic_timer_interrupt
     0.16%  my_test  [kernel.kallsyms]             [k] _spin_lock
     0.13%  my_test  [kernel.kallsyms]             [k] native_write_msr_safe

     and so on ...

Then the call chains are analyzed:

perf report --stdio -g graph -i ./my_test.perf.data | c++filt

# Overhead  Command                 Shared Object                       Symbol
# ........  .......  ............................  ...........................
#
    29.30%  my_test  my_test                       [.] f2(long)
            |
            --- f2(long)
               |
                --29.01%-- process_request(long)
                          main
                          __libc_start_main

    29.14%  my_test  my_test                       [.] f1(long)
            |
            --- f1(long)
               |
               |--15.05%-- process_request(long)
               |          main
               |          __libc_start_main
               |
                --13.79%-- f2(long)
                          process_request(long)
                          main
                          __libc_start_main

    15.17%  my_test  libtcmalloc_minimal.so.0.1.0  [.] operator new(unsigned long)
            |
            --- operator new(unsigned long)
               |
               |--11.44%-- f1(long)
               |          |
               |          |--5.75%-- process_request(long)
               |          |          main
               |          |          __libc_start_main
               |          |
               |           --5.69%-- f2(long)
               |                     process_request(long)
               |                     main
               |                     __libc_start_main
               |
                --3.01%-- process_request(long)
                          main
                          __libc_start_main

    13.16%  my_test  libtcmalloc_minimal.so.0.1.0  [.] operator delete(void*)
            |
            --- operator delete(void*)
               |
               |--9.13%-- f1(long)
               |          |
               |          |--4.63%-- f2(long)
               |          |          process_request(long)
               |          |          main
               |          |          __libc_start_main
               |          |
               |           --4.51%-- process_request(long)
               |                     main
               |                     __libc_start_main
               |
               |--3.05%-- process_request(long)
               |          main
               |          __libc_start_main
               |
                --0.80%-- f2(long)
                          process_request(long)
                          main
                          __libc_start_main

     9.44%  my_test  my_test                       [.] process_request(long)
            |
            --- process_request(long)
               |
                --9.39%-- main
                          __libc_start_main

     1.01%  my_test  my_test                       [.] operator delete(void*)@plt
            |
            --- operator delete(void*)@plt

     0.97%  my_test  my_test                       [.] operator new(unsigned long)@plt
            |
            --- operator new(unsigned long)@plt

     0.20%  my_test  my_test                       [.] main
     0.19%  my_test  [kernel.kallsyms]             [k] apic_timer_interrupt
     0.16%  my_test  [kernel.kallsyms]             [k] _spin_lock
     and so on ...

So, at this point you know where your program spends its time.

And this is main.cpp for the test:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

time_t f1(time_t time_value)
{
  for (int j = 0; j < 10; ++j) {
    ++time_value;
    if (j%5 == 0) {
      double *p = new double;
      delete p;
    }
  }
  return time_value;
}

time_t f2(time_t time_value)
{
  for (int j = 0; j < 40; ++j) {
    ++time_value;
  }
  time_value = f1(time_value);
  return time_value;
}

time_t process_request(time_t time_value)
{
  for (int j = 0; j < 10; ++j) {
    int *p = new int;
    delete p;
    for (int m = 0; m < 10; ++m) {
      ++time_value;
    }
  }
  for (int i = 0; i < 10; ++i) {
    time_value = f1(time_value);
    time_value = f2(time_value);
  }
  return time_value;
}

int main(int argc, char* argv2[])
{
  int number_loops = argc > 1 ? atoi(argv2[1]) : 1;
  time_t time_value = time(0);
  printf("number loops %d\n", number_loops);
  printf("time_value: %ld\n", time_value);

  for (int i = 0; i < number_loops; ++i) {
    time_value = process_request(time_value);
  }
  printf("time_value: %ld\n", time_value);
  return 0;
}

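perf record -p $(ps aux | grep '[H]elloWord' | awk '{print $2}') Note the square-bracket trick around the first letter of your application's name, so that grep returns only the application's PID and not the PID of grep itself. - kroiz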

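If you are looking for ways to speed up the program, you need stackshots. A simple way to get them is to use the pstack utility, or lsstack if you can get it. You can do better than Gprof. If you want to use an official profiling tool, you want something that samples the call stack on wall-clock time and presents line-level cost, such as OProfile or RotateRight Zoom.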
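
One rough way to act on the stackshot idea from a shell is sketched below. It assumes gdb is installed and uses it as a stand-in for pstack; the function names `take_stackshots` and `frame_counts` are hypothetical. Note that it tallies frame lines rather than distinct samples, so a recursive function is over-counted.

```shell
#!/bin/sh
# Take COUNT stack samples of PID, INTERVAL seconds apart, using gdb in
# batch mode. Attaching usually requires the same user or root.
take_stackshots() {
    pid=$1 count=$2 interval=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        gdb -p "$pid" -batch -ex 'thread apply all bt' 2>/dev/null
        i=$((i + 1))
        sleep "$interval"
    done
}

# Read gdb backtraces on stdin and print "hits function", most frequent
# first. Frame lines look like "#1  0x... in f1 (...)" or "#0  f1 (...)".
frame_counts() {
    awk '/^#[0-9]+ / {
             name = ($3 == "in") ? $4 : $2
             hits[name]++
         }
         END { for (n in hits) print hits[n], n }' | sort -rn
}
```

A possible invocation is `take_stackshots "$(pidof my_test)" 10 1 | frame_counts | head`. Functions that show up in most of the samples are the ones worth looking at first.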

You can use the Valgrind tools to record data and save it to a file, then analyze it with an appropriate GUI tool such as KCacheGrind.

Here is a usage example:

valgrind --tool=callgrind --dump-instr=yes --simulate-cache=yes your_program

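It will generate a file named callgrind.out.xxx, where xxx is the PID of the program.

Unlike Gprof, Valgrind works with many different languages, including Java, with some limitations.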
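
Since the question rules out an IDE, the callgrind.out.xxx file can also be read in a terminal with callgrind_annotate, which ships with Valgrind. The helper below is only a sketch: the function names `latest_callgrind_out` and `annotate_latest` are invented for illustration, and the 40-line cutoff is an arbitrary choice.

```shell
#!/bin/sh
# Print the most recently modified callgrind.out.* file in DIR
# (default: current directory), or fail if there is none.
latest_callgrind_out() {
    dir=${1:-.}
    newest=$(ls -t "$dir"/callgrind.out.* 2>/dev/null | head -n 1)
    [ -n "$newest" ] && echo "$newest"
}

# Show the start of a plain-text summary of the newest profile in DIR.
annotate_latest() {
    out=$(latest_callgrind_out "$1") || return 1
    callgrind_annotate "$out" | head -n 40
}
```

After a valgrind run, `annotate_latest .` would show the top of the report for the newest profile in the current directory.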

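You can take a look at Gprof. You need to compile your code with the -pg option to instrument it. After that, you can run the program and use Gprof to view the results.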
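
The three steps described here (compile with -pg, run, view with Gprof) might look like this in a shell. It is a sketch under assumptions: the function name `profile_with_gprof` and the file names prof_bin and gprof_report.txt are arbitrary, and it assumes g++ and gprof are installed.

```shell
#!/bin/sh
# Build SRC with gprof instrumentation, run it once with any remaining
# arguments, and write the gprof report to gprof_report.txt in the
# current directory. (Output file names are arbitrary choices.)
profile_with_gprof() {
    src=$1
    shift
    g++ -pg -O2 "$src" -o ./prof_bin || return 1
    # The instrumented binary writes gmon.out when it exits normally.
    ./prof_bin "$@" > /dev/null || return 1
    gprof ./prof_bin gmon.out > gprof_report.txt
}
```

With the test program from the earlier answer, this could be called as `profile_with_gprof main.cpp 1000000`.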


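You can also try cpuprofiler.com. It gets the information you would normally get from the top command, and the CPU usage data can even be viewed remotely from a web browser.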

The link is broken. The domain is no longer valid. - Sam Sirry
