Program compiled with GCC 4.5 crashes, but works fine when compiled with GCC 4.4


I recently tried to compile and install ns-2, a network simulator based on C++ and Tcl.

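After some minor changes to the source code (don't worry, they are not what causes the crash), I was able to compile it with the latest gcc, version 4.5.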

But when I run the binary, it fails with the following error:

$ bin/ns
*** buffer overflow detected ***: bin/ns terminated

The same code runs fine when compiled with an earlier version of gcc, so I assume the failure is caused by some enhancement in gcc 4.5.

How can I fix this? Compiling with gcc 4.4 is of course an option, but I would like to know what is going wrong :)

Update:

Here is the full stack trace, followed by the backtrace from gdb:

$ bin/ns
*** buffer overflow detected ***: bin/ns terminated
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x37)[0x7f01824ac1d7]
/lib/x86_64-linux-gnu/libc.so.6(+0xfd0f0)[0x7f01824ab0f0]
bin/ns[0x8d5b5a]
bin/ns[0x8d56de]
bin/ns[0x841077]
bin/ns[0x842b19]
bin/ns(Tcl_EvalEx+0x16)[0x843256]
bin/ns(Tcl_Eval+0x1d)[0x84327d]
bin/ns(Tcl_GlobalEval+0x2b)[0x84391b]
bin/ns(_ZN3Tcl4evalEPc+0x27)[0x83352b]
bin/ns(_ZN3Tcl5evalcEPKc+0xdd)[0x8334e9]
bin/ns(_ZN11EmbeddedTcl4loadEv+0x24)[0x834712]
bin/ns(Tcl_AppInit+0xb2)[0x8331a5]
bin/ns(Tcl_Main+0x1d0)[0x8ad6a0]
bin/ns(nslibmain+0x25)[0x8330c5]
bin/ns(main+0x20)[0x833254]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xff)[0x7f01823cceff]
bin/ns[0x5bc1a9]

In GDB, with debugging symbols enabled:
(gdb) bt
#0  0x00007ffff6970d05 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff6974ab6 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ffff69a9d7b in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007ffff6a3b1d7 in __fortify_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007ffff6a3a0f0 in __chk_fail () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00000000008d5b5a in strcpy (interp=0xd2dda0, optionIndex=<value optimized out>, objc=<value optimized out>, objv=0x7fffffffdad0)
    at /usr/include/bits/string3.h:105
#6  TraceVariableObjCmd (interp=0xd2dda0, optionIndex=<value optimized out>, objc=<value optimized out>, objv=0x7fffffffdad0)
    at /media/Linux/ns-allinone-2.35-RC7/tcl8.5.8/unix/../generic/tclTrace.c:912
#7  0x00000000008d56de in Tcl_TraceObjCmd (dummy=<value optimized out>, interp=0xd2dda0, objc=<value optimized out>, objv=0xd2ec00)
    at /media/Linux/ns-allinone-2.35-RC7/tcl8.5.8/unix/../generic/tclTrace.c:293
#8  0x0000000000841077 in TclEvalObjvInternal (interp=0xd2dda0, objc=5, objv=0xd2ec00,
    command=0x7ffff7f680fe "trace variable defaultRNG w { abort \"cannot update defaultRNG once assigned\"; }\n\n\nClass RandomVariable/TraceDriven -superclass RandomVariable\n\nRandomVariable/TraceDriven instproc init {} {\n$self instv"..., length=80, flags=0)
    at /media/Linux/ns-allinone-2.35-RC7/tcl8.5.8/unix/../generic/tclBasic.c:3689
#9  0x0000000000842b19 in TclEvalEx (interp=0xd2dda0,
    script=0x7ffff7f52010 "\n\n\n\n\n\nproc warn {msg} {\nglobal warned_\nif {![info exists warned_($msg)]} {\nputs stderr \"warning: $msg\"\nset warned_($msg) 1\n}\n}\n\nif {[info commands debug] == \"\"} {\nproc debug args {\nwarn {Script debugg"..., numBytes=422209, flags=<value optimized out>, line=4141,
    clNextOuter=<value optimized out>,
    outerScript=0x7ffff7f52010 "\n\n\n\n\n\nproc warn {msg} {\nglobal warned_\nif {![info exists warned_($msg)]} {\nputs stderr \"warning: $msg\"\nset warned_($msg) 1\n}\n}\n\nif {[info commands debug] == \"\"} {\nproc debug args {\nwarn {Script debugg"...)
    at /media/Linux/ns-allinone-2.35-RC7/tcl8.5.8/unix/../generic/tclBasic.c:4386
#10 0x0000000000843256 in Tcl_EvalEx (interp=<value optimized out>, script=<value optimized out>, numBytes=<value optimized out>,
    flags=<value optimized out>) at /media/Linux/ns-allinone-2.35-RC7/tcl8.5.8/unix/../generic/tclBasic.c:4043
#11 0x000000000084327d in Tcl_Eval (interp=0xd2dda0, script=<value optimized out>)
    at /media/Linux/ns-allinone-2.35-RC7/tcl8.5.8/unix/../generic/tclBasic.c:4955
#12 0x000000000084391b in Tcl_GlobalEval (interp=0xd2dda0, command=<value optimized out>)
    at /media/Linux/ns-allinone-2.35-RC7/tcl8.5.8/unix/../generic/tclBasic.c:6005
#13 0x000000000083352b in Tcl::eval(char*) ()
#14 0x00000000008334e9 in Tcl::evalc(char const*) ()
#15 0x0000000000834712 in EmbeddedTcl::load() ()
#16 0x00000000008331a5 in Tcl_AppInit ()
#17 0x00000000008ad6a0 in Tcl_Main (argc=<value optimized out>, argv=0x7fffffffe1d0, appInitProc=0x8330f3 <Tcl_AppInit>)
    at /media/Linux/ns-allinone-2.35-RC7/tcl8.5.8/unix/../generic/tclMain.c:418
#18 0x00000000008330c5 in nslibmain ()
#19 0x0000000000833254 in main ()    
3 Answers

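Famous last words: "Don't worry, my changes didn't break anything." How do we know that?
Still, if the code works under 4.4 and crashes under 4.5, there is a good chance that the problem lies in the code itself and that the newer compiler merely exposes it.
GCC has adopted some aggressive optimizations related to detecting and eliminating integer overflow. If that is what is happening here, you will have to find the affected code in ns-2 and get it fixed, either by the ns-2 developers or by yourself.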
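The thread does not identify any such code in ns-2, so the following C sketch only illustrates the general pattern: an overflow test that relies on signed wraparound is undefined behavior and may be optimized away, whereas comparing against the limits before adding is well defined. The helper name add_overflows is hypothetical.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Fragile pattern (shown only as a comment): it relies on signed
 * overflow wrapping around, which is undefined behavior in C, so an
 * optimizer may assume it never happens and fold the test to false:
 *
 *     if (b > 0 && a + b < a) { ... handle overflow ... }
 */

/* Well-defined check: compare against the limits before adding. */
static bool add_overflows(int a, int b)
{
    return (b > 0 && a > INT_MAX - b) ||
           (b < 0 && a < INT_MIN - b);
}
```

For example, add_overflows(INT_MAX, 1) is true and add_overflows(1, 2) is false; a + b should only be evaluated when it returns false.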
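You should try running the program under a debugger, so that you get control when the buffer overflow is detected and can see where in the code it happens. If core dumps are disabled (through your shell's core-size limit or an equivalent setting), consider enabling them and check whether a core dump is produced when the program terminates. That should give you a starting point.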
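If changing the shell setting is inconvenient, a process can raise its own core-file limit. The sketch below uses the POSIX getrlimit()/setrlimit() calls with RLIMIT_CORE; enable_core_dumps is a hypothetical helper, and calling it early in the program's startup is an assumption about where you would hook it in. Where the core file actually lands still depends on the system's core-dump configuration.

```c
#include <sys/resource.h>

/*
 * Raise this process's soft core-file size limit to its hard limit so
 * that an abort (such as the one triggered via __chk_fail) can leave a
 * core dump. An unprivileged process may raise its soft limit up to,
 * but not beyond, the hard limit.
 * Returns 0 on success, -1 on failure (errno is set).
 */
static int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Once a core file exists, gdb bin/ns followed by the core file's name loads it so you can run bt on the crashed state.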
Further thoughts:
  • When you compiled the code, how stringent were the warning flags used? Can you recompile with more warnings enabled?

    One technique that often works with Autotools-configured programs, if you can find no other way to pass special options to the C or C++ compiler, is:

    ./configure --prefix=/opt/ns CC="gcc -Wall -Wextra" CXX="g++ -Wall -Wextra"
    

    (I also use this technique to specify 32-bit vs 64-bit builds, adding -m32 or -m64.)

    Warning: if the code was not written to compile cleanly under these options, the first compilation with them can be traumatic. Still, there is a decent chance that one of the many warnings points at the source of your problem. Realistically, though, you will probably see 50 unrelated warnings for every relevant one (or worse), and fixing every warning spotted this way might still not cure the problem. If the code already compiles cleanly with stringent warnings, you are faced with enabling more exotic warnings instead. But if you can get the compiler to help diagnose the problem it is triggering, you should certainly do so; that is much simpler than finding the problem unaided.

  • Also, make sure you are producing a debuggable program (compile with -g), even if you keep optimization enabled.

  • Also, consider compiling with optimization off and see whether the program still crashes. If the program does not crash without optimization and does with optimization, you have some useful information. It won't make it easier to find the cause, but you know it is (probably) related to the optimizer. Or it might just be that the bug moves when not optimized and doesn't fail fatally.
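One detail makes the optimization experiment easy to misread. The "*** buffer overflow detected ***" abort and the __fortify_fail/__chk_fail frames in the backtrace come from glibc's _FORTIFY_SOURCE checks, and glibc only enables those checks in optimized compilations. A build at -O0 may therefore stop crashing simply because the check is gone, not because the bug is. The small C sketch below (the function name is hypothetical) reports at compile time whether those checks are likely active for a translation unit; GCC defines __OPTIMIZE__ in any optimizing compilation. Some distributions' GCC packages define _FORTIFY_SOURCE by default, which may be why a plain build hits the check without any explicit flag; whether that applies to the asker's system is an assumption.

```c
/*
 * glibc applies its _FORTIFY_SOURCE checks (the code path that ends in
 * __chk_fail and "*** buffer overflow detected ***") only when the
 * translation unit is compiled with optimization, i.e. when GCC
 * defines __OPTIMIZE__. Returns 1 if both conditions hold, else 0.
 */
static int fortify_likely_active(void)
{
#if defined(_FORTIFY_SOURCE) && _FORTIFY_SOURCE > 0 && defined(__OPTIMIZE__)
    return 1;
#else
    return 0;
#endif
}
```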

The extended stack trace is interesting:

#5  0x00000000008d5b5a in strcpy (interp=0xd2dda0, optionIndex=<value optimized out>,
                                  objc=<value optimized out>, objv=0x7fffffffdad0)
    at /usr/include/bits/string3.h:105
#6  TraceVariableObjCmd (interp=0xd2dda0, optionIndex=<value optimized out>,
                         objc=<value optimized out>, objv=0x7fffffffdad0)
    at /media/Linux/ns-allinone-2.35-RC7/tcl8.5.8/unix/../generic/tclTrace.c:912

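Those are not the normal arguments to strcpy(); it normally takes just two. I can't immediately think of a legitimate situation in which you would copy a string onto the pointer to a Tcl interpreter's master structure. So, to investigate further, I would look very carefully at lines 900-920 or so of tclTrace.c, and especially at line 912. This might just be an artifact of the way the optimizer intermingles the object code, or it might be a genuine problem.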
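The odd argument list is consistent with inlining. When _FORTIFY_SOURCE is in effect, glibc's headers (bits/string3.h in this trace) replace strcpy() with a forced-inline wrapper, so frame #5 is most likely that inlined wrapper, and GDB shows the enclosing function's (TraceVariableObjCmd's) arguments for it. A simplified C sketch of what the wrapper amounts to follows; fortified_strcpy is a hypothetical name, while __builtin___strcpy_chk and __builtin_object_size are real GCC builtins. At fortify level 2, glibc passes 1 as the second argument of __builtin_object_size, which asks for the size of the closest enclosing subobject (for example, a struct member) rather than the size of the whole allocation.

```c
#include <string.h>

/*
 * Simplified sketch (not glibc's actual source) of the fortified
 * strcpy() wrapper. Because it is always inlined, the compiler can
 * often see the caller's destination buffer and compute its size.
 * __builtin_object_size returns (size_t)-1 when the size is unknown,
 * which disables the check; otherwise the copy goes through the
 * checking variant, which ends in __chk_fail if src does not fit.
 */
static inline __attribute__((always_inline))
char *fortified_strcpy(char *dest, const char *src)
{
    /* Mode 1: size of the closest enclosing subobject (the mode used
       at _FORTIFY_SOURCE=2); mode 0 would use the whole object. */
    return __builtin___strcpy_chk(dest, src, __builtin_object_size(dest, 1));
}
```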

I found the tcl8.5.8 source code; line 912 of tclTrace.c is the strcpy() in the following code:

    if ((enum traceOptions) optionIndex == TRACE_ADD) {
        CombinedTraceVarInfo *ctvarPtr;

        ctvarPtr = (CombinedTraceVarInfo *) ckalloc((unsigned)
                (sizeof(CombinedTraceVarInfo) + length + 1
                - sizeof(ctvarPtr->traceCmdInfo.command)));
        ctvarPtr->traceCmdInfo.flags = flags;
        if (objv[0] == NULL) {
            ctvarPtr->traceCmdInfo.flags |= TCL_TRACE_OLD_STYLE;
        }
        ctvarPtr->traceCmdInfo.length = length;
        flags |= TCL_TRACE_UNSETS | TCL_TRACE_RESULT_OBJECT;
        strcpy(ctvarPtr->traceCmdInfo.command, command);       // Line 912
        ctvarPtr->traceInfo.traceProc = TraceVarProc;
        ctvarPtr->traceInfo.clientData = (ClientData)
                &ctvarPtr->traceCmdInfo;
        ctvarPtr->traceInfo.flags = flags;
        name = Tcl_GetString(objv[3]);
        if (TraceVarEx(interp,name,NULL,(VarTrace*)ctvarPtr) != TCL_OK) {
            ckfree((char *) ctvarPtr);
            return TCL_ERROR;
        }
    } else {

So, judging from the GDB output and the stack trace, two variables are passed to strcpy(), one of which was allocated locally, on the heap.
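One hypothesis worth checking (an assumption, not something established in this thread): the ckalloc() call in the excerpt subtracts sizeof(ctvarPtr->traceCmdInfo.command), which suggests that command is declared as a small fixed-size array at the end of the structure and is deliberately overrun, the pre-C99 "struct hack". Under _FORTIFY_SOURCE=2 the fortified strcpy() may check against the declared size of that member rather than the size of the allocation, so copying a longer command (like the one in frame #8) would abort even though the memory really is there. The C sketch below reproduces the pattern with hypothetical names (cmd_record, make_record) and an assumed array size of 4, and copies with memcpy(), which glibc's fortify headers check against the whole object instead. Strictly speaking, the struct hack itself is outside ISO C, though compilers have long tolerated it.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record using the "struct hack": command is declared
   with a small fixed size, but each allocation reserves room for the
   whole string. It must stay the last member. */
struct cmd_record {
    size_t length;
    char command[4];
};

/* Allocate a record big enough for cmd, using the same size arithmetic
   as the ckalloc() call in tclTrace.c. Returns NULL if malloc fails. */
static struct cmd_record *make_record(const char *cmd)
{
    size_t length = strlen(cmd);
    struct cmd_record *rec =
        malloc(sizeof *rec + length + 1 - sizeof rec->command);

    if (rec == NULL)
        return NULL;
    rec->length = length;
    /* strcpy(rec->command, cmd) can trip _FORTIFY_SOURCE=2 if the
       compiler sizes the destination as the 4-byte member. memcpy()
       copies the same bytes, including the terminating NUL. */
    memcpy(rec->command, cmd, length + 1);
    return rec;
}
```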
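I suggest compiling Tcl on its own, separately from the copy embedded in the ns-2 sources, and seeing whether you can isolate the bug that way. The code in question is related to tracing Tcl variables: trace add varname ...
Assuming that shows no problem, I would then consider getting GCC 4.6 and seeing whether the same problem occurs when ns-2 is compiled with it instead of GCC 4.5.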

Valgrind

Since you are running on Linux, you should be able to use Valgrind. It is very good at finding memory misuse. To get the most out of it, use a debug build of ns-2.


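"Buffer overflow detected": you are writing to memory that was not allocated for that purpose. The code generated by GCC 4.4 apparently does not trigger the problem (or it does, but without crashing, and it just silently produces wrong results that go unnoticed), whereas the code generated by GCC 4.5 detects the overflow and stops the program. The only real fix is to find the source of the problem and fix the code.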


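Is fixing the code the only solution? No. The solution already proposed, compiling with GCC 4.4, putting your hands over your ears and singing "lalalala", sounds perfectly valid to me :) - Jeff Parker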
@AProgrammer, strangely enough... ... :) - Jeff Parker
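The "Hear No Evil, See No Evil, Speak No Evil" (also known as "lalalala") solution only works in the short term. At the very least, the ns-2 developers should be warned about the problem; they may already be aware of it. Sooner or later (probably sooner), GCC 4.4 will no longer be maintained, and then you will be forced onto GCC 4.5 with software that does not run. So yes, the ostrich strategy works for now, but it is only a stopgap. Finding the underlying problem and fixing it is much better, if more painful. - Jonathan Leffler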
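@Jonathan, I think you missed Jeff's smiley ;) I'm afraid the people who upvoted his comment without upvoting our answers took it seriously too. :( - AProgrammer
I must admit I was trying to play devil's advocate a little. I hope people upvoted it because it is funny, not because they endorse the suggested course of action. - Jeff Parker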


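It could be any number of things. It could be a GCC bug. It could be a Tcl bug (as one of the Tcl developers I hope not, but I can't rule it out: Tcl often assumes there is no guard code around its structures, and Tcl is definitely C89 code). It could be a bug in ns2. For all I know, it could even be a bug somewhere else (ns2 is built on Tcl, so it can load external code libraries, and a problem there is quite possible).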
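On the C89 point: C99 added flexible array members to support the variable-length-tail layout that C89 code has to fake with a small fixed-size trailing array. A minimal C99 sketch with hypothetical names follows. Because the array has no declared size, there is no short declared bound for the compiler to treat as the destination size. Tcl 8.5, being C89 code, cannot rely on this feature.

```c
#include <stdlib.h>
#include <string.h>

/* C99 flexible array member: command has no declared size, and its
   storage is whatever the allocation provides past the fixed part. */
struct cmd_record_c99 {
    size_t length;
    char command[];   /* must be the last member */
};

/* Allocate the fixed part plus the string and its NUL.
   Returns NULL if malloc fails. */
static struct cmd_record_c99 *make_record_c99(const char *cmd)
{
    size_t length = strlen(cmd);
    struct cmd_record_c99 *rec = malloc(sizeof *rec + length + 1);

    if (rec == NULL)
        return NULL;
    rec->length = length;
    strcpy(rec->command, cmd);   /* fits: the allocation was sized for it */
    return rec;
}
```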

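Unfortunately, we can't tell which of these it is from the information posted. Do you know which library the call stack was in when the program crashed? That doesn't guarantee it is where the real problem lies, but it is at least the place to start looking for the bug...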


Here is the stack trace, in case it helps. It points to Tcl... https://docs.google.com/document/d/1Rp5dsXLMifS72Ccy8W6TAJg6-bxr49wCf4GVi-1xO_A/edit?hl=en_US - AIB

Page content provided by Stack Overflow.