Regular expression for parsing Apache error log files

3
I need a regular expression to parse Apache error log files in a Java program, for example:
[Thu Sep 27 12:08:18 2012] [error] [client 151.10.158.10] File does not exist: /srv/www/htdocs/pad/favicon.ico
[Thu Oct 04 17:02:42 2012] [error] [client 151.10.1.10] File does not exist: > /srv/www/htdocs/pad/favicon.ico
[Wed Oct 17 10:16:40 2012] [error] [client 151.10.14.60] File does not exist: /srv/www/htdocs/pad/sites/all/modules/fckeditor/fckeditor/editor/userfiles, referer: http://pad.sta.uniroma1.it/sites/all/modules/fckeditor/fckeditor/editor/fckeditor.html?InstanceName=edit-body&Toolbar=DrupalFull

I have tried several solutions (some of them previously posted on Stack Overflow), and the one that seems to work best is:

^(\[[\w:\s]+\]) (\[[\w]+\]) (\[[\w\d.\s]+\])?([\w\s/.(")-]+[\-:]) ([\w/\s]+)$

However, it fails to match the following string:
[Thu May 17 22:41:54 2012] [error] [client 118.238.211.206] Invalid URI in request GET :81/phpmyadmin/scripts/setup.php HTTP/1.1

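How can I fix it?

EDIT: I checked all the proposed solutions. Although they increase the number of matched lines, they still fail to handle the following cases: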

[Fri Jul 15 00:24:41 2011] [error] [client 219.12.35.141] script '/srv/www/htdocs/pad2/scripts/setup.php' not found or unable to stat
[Mon May 28 18:43:25 2012] [error] [client 88.110.28.25] Invalid URI in request GET HTTP/1.1 HTTP/1.1

Note that it is fine to receive everything inside the square brackets, including the client keyword, as a single group.

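Can you explain which data you want to receive? All the [...] groups and the rest of the line? - Sergii Lagutin
Specifically, I am interested in the information encoded in the first three [...] groups, plus a fourth group containing all the remaining text. - Umberto
5 Answers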

3

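To capture the first, second, and third bracketed groups:

Find the longest string that starts with [ and ends with ], with no other ] in between: \[[^\]]+\]

Capture the rest of the line with .*, which matches from the current position to the end of the line.

So your complete solution is: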

^(\[[^\]]+\]) (\[[^\]]+\]) (\[[^\]]+\]) (.*)$

Regex demo
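
Since the question targets Java, here is a minimal sketch of how this pattern might be applied there. The class name ApacheErrorLogParser and the parse helper are hypothetical, chosen only for illustration; the pattern is the one given above, with backslashes doubled for a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ApacheErrorLogParser {
    // Three bracketed fields separated by single spaces, then the rest of the line.
    static final Pattern LINE = Pattern.compile(
            "^(\\[[^\\]]+\\]) (\\[[^\\]]+\\]) (\\[[^\\]]+\\]) (.*)$");

    /** Returns {timestamp, level, client, message}, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        // Sample lines taken from the question.
        String[] samples = {
            "[Thu May 17 22:41:54 2012] [error] [client 118.238.211.206] Invalid URI in request GET :81/phpmyadmin/scripts/setup.php HTTP/1.1",
            "[Fri Jul 15 00:24:41 2011] [error] [client 219.12.35.141] script '/srv/www/htdocs/pad2/scripts/setup.php' not found or unable to stat",
            "[Mon May 28 18:43:25 2012] [error] [client 88.110.28.25] Invalid URI in request GET HTTP/1.1 HTTP/1.1"
        };
        for (String sample : samples) {
            String[] fields = parse(sample);
            System.out.println(fields == null ? "NO MATCH: " + sample : String.join(" | ", fields));
        }
    }
}
```

Because the pattern requires all three bracketed fields, a line without a third bracketed field would not match, and parse returns null for it.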


Hey, your post showed up in the low-quality posts queue. You may want to add details and an explanation to improve the answer. - Unihedron

0
// Three sample log lines joined into one string
$a  = "[Thu May 17 22:41:54 2012] [error] [client 118.238.211.206] Invalid URI in request GET :81/phpmyadmin/scripts/setup.php HTTP/1.1\n";
$a .= "[Thu May 17 22:41:54 2012] [error] [client 118.238.211.206] Invalid URI in request GET :81/phpmyadmin/scripts/setup.php HTTP/1.1\n";
$a .= "[Thu May 17 22:41:54 2012] [error] [client 118.238.211.206] Invalid URI in request GET :81/phpmyadmin/scripts/setup.php HTTP/1.1\n";
// Capture the three bracketed fields, the text up to the colon, and the remainder
preg_match_all("/(\[.*\])\s+(\[.*\])\s+(\[.*\])\s+([a-zA-Z0-9\s]+:)\s*(.*)/", $a, $m);
var_dump($m);

Try this. Output:

array (size=6)
  0 => 
    array (size=3)
      0 => string '[Thu May 17 22:41:54 2012] [error] [client 118.238.211.206] Invalid URI in request GET :81/phpmyadmin/scripts/setup.php HTTP/1.1' (length=128)
      1 => string '[Thu May 17 22:41:54 2012] [error] [client 118.238.211.206] Invalid URI in request GET :81/phpmyadmin/scripts/setup.php HTTP/1.1' (length=128)
      2 => string '[Thu May 17 22:41:54 2012] [error] [client 118.238.211.206] Invalid URI in request GET : 81/phpmyadmin/scripts/setup.php HTTP/1.1' (length=129)
  1 => 
    array (size=3)
      0 => string '[Thu May 17 22:41:54 2012]' (length=26)
      1 => string '[Thu May 17 22:41:54 2012]' (length=26)
      2 => string '[Thu May 17 22:41:54 2012]' (length=26)
  2 => 
    array (size=3)
      0 => string '[error]' (length=7)
      1 => string '[error]' (length=7)
      2 => string '[error]' (length=7)
  3 => 
    array (size=3)
      0 => string '[client 118.238.211.206]' (length=24)
      1 => string '[client 118.238.211.206]' (length=24)
      2 => string '[client 118.238.211.206]' (length=24)
  4 => 
    array (size=3)
      0 => string 'Invalid URI in request GET :' (length=28)
      1 => string 'Invalid URI in request GET :' (length=28)
      2 => string 'Invalid URI in request GET :' (length=28)
  5 => 
    array (size=3)
      0 => string '81/phpmyadmin/scripts/setup.php HTTP/1.1' (length=40)
      1 => string '81/phpmyadmin/scripts/setup.php HTTP/1.1' (length=40)
      2 => string '81/phpmyadmin/scripts/setup.php HTTP/1.1' (length=40)

0
The following regex will match all of the error formats mentioned above.
^(\[[\w:\s]+\]) (\[[\w]+\]) (\[[\w\d.\s]+\])?([\w\s\/.(")-]+[\-:])\s*>?\s*([\w\/\s.]+)(?:\s*,(\s*\w+:)\s*([\w\/.=?:&-]+))?$

Demo
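
As a rough illustration of how the groups of this regex could be read from Java, the sketch below runs it against the referer line from the question. The class name RefererCapture and the match helper are hypothetical; the assumption here is that group 5 holds the path and the optional groups 6 and 7 hold the trailing key and its value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RefererCapture {
    // The regex from this answer, escaped for a Java string literal.
    static final Pattern LINE = Pattern.compile(
            "^(\\[[\\w:\\s]+\\]) (\\[[\\w]+\\]) (\\[[\\w\\d.\\s]+\\])?"
            + "([\\w\\s\\/.(\")-]+[\\-:])\\s*>?\\s*([\\w\\/\\s.]+)"
            + "(?:\\s*,(\\s*\\w+:)\\s*([\\w\\/.=?:&-]+))?$");

    /** Returns a matcher positioned on a full-line match, or null if there is none. */
    static Matcher match(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? m : null;
    }

    public static void main(String[] args) {
        String line = "[Wed Oct 17 10:16:40 2012] [error] [client 151.10.14.60] File does not exist: "
                + "/srv/www/htdocs/pad/sites/all/modules/fckeditor/fckeditor/editor/userfiles, referer: "
                + "http://pad.sta.uniroma1.it/sites/all/modules/fckeditor/fckeditor/editor/fckeditor.html"
                + "?InstanceName=edit-body&Toolbar=DrupalFull";
        Matcher m = match(line);
        if (m != null) {
            System.out.println("path:    " + m.group(5));
            // Groups 6 and 7 are null when the line has no trailing ", key: value" part.
            if (m.group(7) != null) {
                System.out.println(m.group(6).trim() + " " + m.group(7));
            }
        }
    }
}
```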


0

There is no space after the colon in "GET :81".

This one works:

^(\[[\w:\s]+\]) (\[[\w]+\]) (\[[\w\d.\s]+\])?([\w\s\/.(")-]+[\-:])\s?([\w\/\s.]+)

Example: http://regex101.com/r/xO1wG2/2
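
To make the difference concrete, the following sketch compares the pattern from the question with the one from this answer on the "GET :81" line. The class name SeparatorComparison and the found helper are hypothetical, used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SeparatorComparison {
    // Pattern from the question: requires exactly one space after the ':' or '-' separator.
    static final Pattern ORIGINAL = Pattern.compile(
            "^(\\[[\\w:\\s]+\\]) (\\[[\\w]+\\]) (\\[[\\w\\d.\\s]+\\])?"
            + "([\\w\\s/.(\")-]+[\\-:]) ([\\w/\\s]+)$");

    // Pattern from this answer: the whitespace after the separator is optional (\s?),
    // and '.' is allowed in the last group.
    static final Pattern RELAXED = Pattern.compile(
            "^(\\[[\\w:\\s]+\\]) (\\[[\\w]+\\]) (\\[[\\w\\d.\\s]+\\])?"
            + "([\\w\\s\\/.(\")-]+[\\-:])\\s?([\\w\\/\\s.]+)");

    /** Reports whether the pattern finds a match anywhere in the line. */
    static boolean found(Pattern p, String line) {
        return p.matcher(line).find();
    }

    public static void main(String[] args) {
        String line = "[Thu May 17 22:41:54 2012] [error] [client 118.238.211.206] "
                + "Invalid URI in request GET :81/phpmyadmin/scripts/setup.php HTTP/1.1";
        System.out.println("original: " + found(ORIGINAL, line));
        System.out.println("relaxed:  " + found(RELAXED, line));
        Matcher m = RELAXED.matcher(line);
        if (m.find()) {
            System.out.println("last group: " + m.group(5));
        }
    }
}
```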


0

The last part of your regex seems to be incorrect. This simplified regex should work:

^(\[[\w:\s]+\]) (\[[\w]+\]) (\[[\w\d.\s]+\]) ([\s\w/.(")-]+[-:])(.+)$

Regex demo

