Multithreading performance issue (futex): Python 2.6 vs 2.7

Question

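I have a simple Monte-Carlo Pi calculation program. I tried running it on two different boxes (same hardware, slightly different kernel versions). On one of them I see a significant performance drop (it takes twice as long). Without threads, performance is about the same. Profiling the program shows that the slower one spends more time per futex call.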

Is this related to any kernel parameters?
Can CPU flags affect futex performance? /proc/cpuinfo shows that the CPU flags differ slightly.
Is this related to the Python version?

Linux(3.10.0-123.20.1 (Red Hat 4.4.7-16)) Python 2.6.6

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
99.69   53.229549           5  10792796   5385605 futex

Profile Output
============== 
256 function calls in 26.189 CPU seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   39   26.186    0.671   26.186    0.671 :0(acquire)

Linux(3.10.0-514.26.2 (Red Hat 4.8.5-11)) Python 2.7.5

 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.69   94.281979           8  11620358   5646413 futex

Profile Output
==============
259 function calls in 53.448 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 38   53.445    1.406   53.445    1.406 :0(acquire)
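As a quick check on the two strace summaries above, the per-call futex cost can be recomputed from the reported totals. This is only arithmetic on the figures already shown; the helper name `usecs_per_call` is made up for illustration.

```python
from __future__ import division  # keeps the ratios correct on Python 2 as well

def usecs_per_call(seconds, calls):
    """Average time per syscall, in microseconds."""
    return seconds / calls * 1e6

# Totals copied from the futex rows of the two strace summaries.
old_us = usecs_per_call(53.229549, 10792796)   # Python 2.6.6, kernel 3.10.0-123
new_us = usecs_per_call(94.281979, 11620358)   # Python 2.7.5, kernel 3.10.0-514

print("2.6.6: %.2f usecs per futex call" % old_us)
print("2.7.5: %.2f usecs per futex call" % new_us)
print("per-call cost ratio:  %.2f" % (new_us / old_us))
print("call count ratio:     %.2f" % (11620358 / 10792796))
```

The number of futex calls differs by only a few percent between the two runs, so most of the extra futex time comes from the higher cost per call rather than from more calls.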

Test program

import random
import math
import time
import threading
import sys
import profile

def find_pi(tid, n):
    t0 = time.time()
    in_circle = 0
    for i in range(n):
        x = random.random()
        y = random.random()

        dist = math.sqrt(pow(x, 2) + pow(y, 2))
        if dist < 1:
            in_circle += 1

    pi = 4.0 * (float(in_circle)/float(n))
    print 'Pi=%s - thread(%s) time=%.3f sec' % (pi, tid, time.time() - t0)
    return pi

def main():
    if len(sys.argv) > 1:
        n = int(sys.argv[1])
    else:
        n = 6000000

    t0 = time.time()
    threads = []
    num_threads = 5
    print 'n =', n
    for tid in range(num_threads):
        t = threading.Thread(target=find_pi, args=(tid, n))
        threads.append(t)
        t.start()

    for t in threads:
        t.join()

#main()
profile.run('main()')
#profile.run('find_pi(1, 6000000)')

- Anoop

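The problem is that the GIL prevents pure-Python threads from executing at the same time. If you want better performance, use multiprocessing instead. - Jean-François Fabre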
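A minimal sketch of the multiprocessing approach suggested in the comment above, written for Python 3 and assuming the same point-in-circle test as `find_pi`. The names `count_in_circle` and `estimate_pi` are illustrative, not part of the original program. It sidesteps the GIL; it does not explain the futex difference the question asks about.

```python
import random
from multiprocessing import Pool

def count_in_circle(n, seed=None):
    """Count random points of the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)  # private generator per call, no shared state
    hits = 0
    for _ in range(n):
        x = rng.random()
        y = rng.random()
        if x * x + y * y < 1.0:
            hits += 1
    return hits

def estimate_pi(n_per_worker, num_workers):
    """Sample in separate processes, then combine the counts into one estimate."""
    with Pool(processes=num_workers) as pool:
        hits = pool.map(count_in_circle, [n_per_worker] * num_workers)
    return 4.0 * sum(hits) / (n_per_worker * num_workers)

if __name__ == "__main__":
    # Five workers, mirroring num_threads = 5 in the test program.
    print("Pi ~= %.5f" % estimate_pi(1000000, 5))
```

Each worker process has its own interpreter and its own GIL, so the sampling loops can run on separate cores.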

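Here I am comparing the same Python program running on two identical pieces of hardware with slightly different kernels. One of them takes twice as long, and that time appears to be spent in futex. - Anoop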

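The goal is not to optimize the Python program, but to understand why futex performance degrades between the two Python/kernel versions. - Anoop

@cdarke - I ran it multiple times, and every run on 2.7 took twice as long. - Anoop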

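Your test does not seem fair, because you used Python 2.6 on one box and 2.7 on the other. Would the results be the same if you used the same Python version in both cases? - Oleg Kuralenko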

3 Answers

- soundstripe · Answer 1

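It seems very likely that this is caused by changes in the kernel code between the two versions. There was a bug in the kernel's futex code that caused some processes to deadlock, and fixing that bug may have cost some performance. The CentOS changelog for 3.10.0-514 mentions a number of changes to [kernel] futex.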

- changhwan · Answer 2

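I am not familiar with kernels or CPU flags, so I cannot tell you how CPU flags or kernel flags would affect the result.

So this does not answer all of your questions. Out of curiosity, I tested your code on CentOS 7.4.1708 (Linux 3.10.0-693.2.2.el7.x86_64 x86_64) with different Python versions (2.6.6, 2.7.5, 3.6.3).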

Python version 2.6.6

Profile Output
==============
256 function calls in 19.838 CPU seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    39   19.019    0.488   19.019    0.488 :0(acquire)
    18    0.000    0.000    0.000    0.000 :0(allocate_lock)
    13    0.000    0.000    0.000    0.000 :0(append)
...

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.98    6.319220          55    114693      2293 futex
  1.03    0.068830           1     55485           madvise
  0.10    0.006869          95        72           munmap
...

Python version 2.7.5

Profile Output
==============
247 function calls in 23.293 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    34   22.717    0.668   22.717    0.668 :0(acquire)
    18    0.047    0.003    0.047    0.003 :0(allocate_lock)
    13    0.000    0.000    0.000    0.000 :0(append)
...

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.54    7.360687         196     37613       667 futex
  0.04    0.002798           4       629       492 open
  0.01    0.000918           4       235       203 stat
...

Python version 3.6.3

Profile Output
==============
213 function calls in 17.818 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5    0.000    0.000    0.000    0.000 :0(__enter__)
     5    0.000    0.000    0.000    0.000 :0(__exit__)
    25   15.923    0.637   15.923    0.637 :0(acquire)
...

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 83.71    0.032639         244       134        38 futex 
  1.90    0.000742           1       849           clock_gettime
  1.74    0.000680           4       160           mmap
...

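I ran it several times and got nearly the same results each time, so I picked one run at random. Python 2.6.6 was slightly faster than 2.7.5, and 3.6.3 was slightly faster than 2.6.6.

The strace results were almost the same for 2.6.6 and 2.7.5, but very different for 3.6.3.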
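One plausible contributor to the much lower futex count on 3.6.3, offered here as an assumption rather than something measured in this thread: since Python 3.2 the GIL is handed over on a time-based switch interval, whereas Python 2 asks threads to release it after a fixed number of bytecode "ticks". The sketch below only reports which policy the running interpreter uses; `gil_switch_policy` is an illustrative name.

```python
import sys

def gil_switch_policy():
    """Report how the running interpreter decides when to hand over the GIL."""
    if hasattr(sys, "getswitchinterval"):
        # Python 3.2+: interval in seconds between forced GIL hand-overs.
        return ("time-based", sys.getswitchinterval())
    # Python 2: number of bytecode ticks between periodic checks.
    return ("tick-based", sys.getcheckinterval())

kind, value = gil_switch_policy()
print("GIL switching is %s (value: %s)" % (kind, value))
```

Running this under each of the three interpreters would show whether they differ in this respect on a given machine.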

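So, regarding your questions:

Is this related to any kernel parameters?

Can CPU flags affect futex performance? /proc/cpuinfo shows that the CPU flags differ slightly.

I don't know.

Is this related to the Python version?

Yes.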

- Michał Zaborowski · Answer 3

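I don't think you can get a strict answer.

Futex is a kernel-level facility. Here is the man page.

In short: threads are scheduled by the kernel, and if a high-priority thread is blocked by a low-priority one, you get what is called priority inversion. So the observed slowdown may be caused by kernel flags. Another issue is getting the time: that requires asking the kernel for the real-time value.

On the other hand, you only start one thread, so that should not be the problem. Your threads do not interfere with each other, so there should be no locking issues of that kind. I do see acquire being called, but the time spent suggests it is the join() at the end, waiting for the threads.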
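The point about join() can be illustrated with a small sketch. `profile.run('main()')` instruments only the thread that calls it, so what the main thread records is mostly the time it spends blocked waiting for the workers; in Python 2 that wait goes through a lock acquire, which would be consistent with the `:0(acquire)` rows in the profiles. The sketch below is written for Python 3 and uses `time.sleep` as a stand-in for the real work; `worker` and `timed_join` are illustrative names.

```python
import threading
import time

def worker(seconds):
    time.sleep(seconds)  # stand-in for the sampling loop in find_pi

def timed_join(seconds, num_threads=5):
    """Return (time spent starting threads, time spent blocked in join)."""
    threads = [threading.Thread(target=worker, args=(seconds,))
               for _ in range(num_threads)]
    t0 = time.time()
    for t in threads:
        t.start()
    start_time = time.time() - t0

    t1 = time.time()
    for t in threads:
        t.join()  # the main thread blocks here until each worker finishes
    join_time = time.time() - t1
    return start_time, join_time

if __name__ == "__main__":
    started, joined = timed_join(0.5)
    print("starting: %.3f s, waiting in join: %.3f s" % (started, joined))
```

Nearly all of the main thread's wall time lands in the join loop, which is why a main-thread profile says little about what the worker threads themselves are doing.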

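Could you run the test, say, 50 times and provide statistics? That would take about an hour, but a single one-minute test can be affected by almost anything.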
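A minimal sketch of how such repeated runs could be collected and summarized, assuming the workload is wrapped in a zero-argument callable. The names `time_runs` and `summarize` are illustrative; the code sticks to the standard library and avoids the `statistics` module so that it also runs on Python 2.

```python
import math
import time

def time_runs(func, repeats):
    """Call func() `repeats` times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(repeats):
        t0 = time.time()
        func()
        durations.append(time.time() - t0)
    return durations

def summarize(durations):
    """Return min, mean, sample standard deviation and max of the durations."""
    n = len(durations)
    mean = sum(durations) / float(n)
    if n > 1:
        variance = sum((d - mean) ** 2 for d in durations) / (n - 1)
    else:
        variance = 0.0
    return {
        "n": n,
        "min": min(durations),
        "mean": mean,
        "stdev": math.sqrt(variance),
        "max": max(durations),
    }

if __name__ == "__main__":
    # Replace the lambda with a call to main() from the test program.
    stats = summarize(time_runs(lambda: sum(range(100000)), 50))
    print(stats)
```

Comparing the mean and standard deviation from both machines would show whether the factor of two is stable or within run-to-run noise.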

By the way, you are missing these imports:

import random
import math