Java process becomes unresponsive, but recovers after a thread dump with jstack -F

3

I have a Java process with a strange problem: it hangs once or twice a day and only recovers after I run:

jstack -F ${PID}

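While the Java process is hung, trying to take a thread dump with jcmd fails with AttachNotSupportedException. The only thing that works is jstack -F, run from a JDK whose version matches the JRE that started the process.
My only theory is that the OS scheduler is not giving the Java process any CPU time, and that running jstack -F somehow forces it to run again?
Any feedback is welcome.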
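Update-1
It happened again today. I first checked memory usage (99.1%) and then ran jmap -heap; once jmap had finished, the process returned to normal. The jmap -heap output (a heap summary) is below.
Regards,
Cristi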
jmap -heap 7703
Attaching to process ID 7703, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.162-b12

using thread-local object allocation.
Parallel GC with 2 thread(s)

Heap Configuration:
   MinHeapFreeRatio         = 0
   MaxHeapFreeRatio         = 100
   MaxHeapSize              = 536870912 (512.0MB)
   NewSize                  = 89128960 (85.0MB)
   MaxNewSize               = 178782208 (170.5MB)
   OldSize                  = 179306496 (171.0MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 143130624 (136.5MB)
   used     = 73244792 (69.85167694091797MB)
   free     = 69885832 (66.64832305908203MB)
   51.1733897003062% used
From Space:
   capacity = 17825792 (17.0MB)
   used     = 8176960 (7.79815673828125MB)
   free     = 9648832 (9.20184326171875MB)
   45.871510225183826% used
To Space:
   capacity = 17825792 (17.0MB)
   used     = 0 (0.0MB)
   free     = 17825792 (17.0MB)
   0.0% used
PS Old Generation
   capacity = 243269632 (232.0MB)
   used     = 23534032 (22.443801879882812MB)
   free     = 219735600 (209.5561981201172MB)
   9.674052534432247% used

25964 interned Strings occupying 2759784 bytes.
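
To put the numbers above in perspective, the used bytes of each space can be summed and compared with MaxHeapSize. The sketch below is a hypothetical helper (heap_summary is not a JDK tool). It reads jmap -heap output on stdin and assumes the JDK 8 layout shown above, where every space prints a "used = <bytes>" line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize `jmap -heap` output read from stdin.
# Adds up every "used = <bytes>" line and compares it with MaxHeapSize.
heap_summary() {
  awk '
    $1 == "MaxHeapSize" && $2 == "=" { max = $3 }    # configured heap limit
    $1 == "used"        && $2 == "=" { used += $3 }  # one line per space
    END {
      if (max > 0)
        printf "used=%d max=%d pct=%.1f\n", used, max, 100 * used / max
    }'
}

# Usage (assumes the output was saved to a file named jmap-heap.txt):
#   heap_summary < jmap-heap.txt
```

For the output above this prints used=104955784 max=536870912 pct=19.5, roughly 100 MB of a 512 MB heap, so the 99.1% memory figure is probably not Java heap usage.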

Update-2

After enabling GC logging, this is the tail of the GC log at the moment the process froze.

2020-09-02T06:51:11.286+0000: 86020.549: Total time for which application threads were stopped: 0.0001978 seconds, Stopping threads took: 0.0000666 seconds
2020-09-02T06:51:11.286+0000: 86020.550: Application time: 0.0000610 seconds
2020-09-02T06:51:11.286+0000: 86020.550: Total time for which application threads were stopped: 0.0001793 seconds, Stopping threads took: 0.0000589 seconds
2020-09-02T06:51:11.287+0000: 86020.550: Application time: 0.0003371 seconds
2020-09-02T06:51:11.287+0000: 86020.550: Total time for which application threads were stopped: 0.0001749 seconds, Stopping threads took: 0.0000283 seconds
2020-09-02T06:51:11.287+0000: 86020.550: Application time: 0.0001277 seconds
2020-09-02T06:51:11.287+0000: 86020.550: Total time for which application threads were stopped: 0.0001554 seconds, Stopping threads took: 0.0000364 seconds
2020-09-02T06:51:11.287+0000: 86020.551: Application time: 0.0000400 seconds
2020-09-02T06:51:11.287+0000: 86020.551: Total time for which application threads were stopped: 0.0001082 seconds, Stopping threads took: 0.0000158 seconds
2020-09-02T06:51:11.288+0000: 86020.552: Application time: 0.0010649 seconds
2020-09-02T06:51:11.288+0000: 86020.552: Total time for which application threads were stopped: 0.0001945 seconds, Stopping threads took: 0.0000571 seconds
2020-09-02T06:51:11.289+0000: 86020.552: Application time: 0.0001078 seconds
2020-09-02T06:51:11.289+0000: 86020.552: Total time for which application threads were stopped: 0.0001852 seconds, Stopping threads took: 0.0000336 seconds
2020-09-02T06:51:11.289+0000: 86020.552: Application time: 0.0000366 seconds
2020-09-02T06:51:11.289+0000: 86020.552: Total time for which application threads were stopped: 0.0000910 seconds, Stopping threads took: 0.0000180 seconds
2020-09-02T06:51:11.289+0000: 86020.552: Application time: 0.0000412 seconds
2020-09-02T06:51:11.289+0000: 86020.553: Total time for which application threads were
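
One way to read a log like this is to pull out every "Total time for which application threads were stopped" entry and look at the longest one. The following is only a sketch: max_safepoint_pause is a made-up name, and it assumes the JDK 8 log line format shown above. Incomplete lines, such as the last one in the excerpt, are skipped.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a JDK 8 GC log on stdin and report how many
# safepoint pauses were logged and the longest one, in milliseconds.
max_safepoint_pause() {
  awk '
    /Total time for which application threads were stopped:/ {
      sub(/.*threads were stopped: /, "")   # keep "0.0001793 seconds, ..."
      pause = $1 + 0                         # after sub(), $1 is the number
      n++
      if (pause > max) max = pause
    }
    END { printf "pauses=%d max_ms=%.4f\n", n, max * 1000 }'
}

# Usage (assumes the GC log file is named gc.log):
#   max_safepoint_pause < gc.log
```

Every complete entry in the excerpt above is well under one millisecond, so the log itself does not record a long stop-the-world pause before the freeze.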

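If I send kill -SIGCONT ${PID}, the process seems to resume, which suggests the kernel may have sent SIGSTOP to the process, possibly because of resource exhaustion. I do see high load on the machine where this happens. - Cristi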
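
The stopped-process theory can be checked directly the next time the hang occurs, before sending any signal, by reading the State field of /proc/<pid>/status. This is a minimal sketch assuming a Linux /proc filesystem; proc_state and is_stopped are hypothetical helper names, and they take the status file path as an argument so they can also be run against a saved copy.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting a process state on Linux.

# Print the one-letter state code from a /proc/<pid>/status style file,
# for example "S" for sleeping or "T" for stopped.
proc_state() {
  awk '$1 == "State:" { print $2; exit }' "$1"
}

# Succeed only if the state code is "T" (stopped by a signal; on older
# kernels such as 2.6.32 a ptrace stop is also reported as "T").
is_stopped() {
  [ "$(proc_state "$1")" = "T" ]
}

# Usage during a hang (PID is assumed to hold the Java process ID):
#   is_stopped "/proc/${PID}/status" && echo "process ${PID} is stopped"
```

If the state is T while the process is hung, something stopped it, which would be consistent with kill -SIGCONT bringing it back.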
1
At first this looked like the issue described in https://dev59.com/m5Lea4cB1Zd3GeqPwgDb, but that bug appears to be fixed already in the kernel version I am running:
[root@hostname /]# rpm -q --changelog kernel-2.6.32-754.29.1.el6.x86_64 | grep 'get_futex_key_refs'
- [kernel] futex: Ensure get_futex_key_refs() always implies a barrier (Larry Woodman) [1167405]
- Cristi
2
We have been fighting a similar issue, and it was indeed a kernel bug. Upgrading to Linux 4.x helped. - apangin
1
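One possible cause is a full GC running at that moment. You may want to switch the collector to CMS, take a thread dump when the hang occurs, and check whether some code is creating too many objects, or whether there is a memory leak with objects that never get collected. - Ravi Yadav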
1 Answer

1
