Extracting a specific pattern from lines with sed, awk, or perl

Question

perl sed awk grep nawk

7

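I need to extract the text that sits between specific patterns within a line. Can this be done with sed?

Suppose I have a file with the following contents: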

There are many who dare not kill themselves for [/fear/] of what the neighbors will say.

Advice is what we ask for when we already know the /* answer */ but wish we didn’t.

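In both cases I have to scan the line for the first occurrence of the opening pattern, ' [/ ' or '/* ' respectively, and store the text that follows until the exit pattern, ' /] ' or ' */ '.

In short, I need to extract fear and answer. If possible, the solution should extend to multiple lines, that is, to the case where the exit pattern occurs on a different line.

Any kind of suggestion or algorithm is welcome. Thanks in advance for your replies.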

- Gil

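I'm not sure whether this can be done with sed; by the way, I wouldn't mind a perl script either. - Gil

Regarding sed, see my question: no simple way has been proposed so far, but something can be done. - Lev Levitsky

@LevLevitsky Interesting! One read isn't enough, I'll definitely have to study it more carefully. Thanks for adding the link :) - Gil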

3 Answers

1

A quick and dirty way using awk (gensub requires GNU awk)

awk 'NF{ for (i=1;i<=NF;i++) if($i ~ /^\[\//) { print gensub (/^..(.*)..$/,"\\1","g",$i); } else if ($i ~ /^\/\*/) print $(i+1);next}1' input_file

Test:

$ cat file
There are many who dare not kill themselves for [/fear/] of what the neighbors will say.

Advice is what we ask for when we already know the /* answer */ but wish we didn't.
$ awk 'NF{ for (i=1;i<=NF;i++) if($i ~ /^\[\//) { print gensub (/^..(.*)..$/,"\\1","g",$i); } else if ($i ~ /^\/\*/) print $(i+1);next}1' file
fear

answer

- jaypal singh

1

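Single-line matching

If you really want to do this in sed, you can extract the delimited patterns relatively easily as long as both delimiters are on the same line.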

# Using GNU sed. Escape a whole lot more if your sed doesn't handle
# the -r flag.
sed -rn 's![^*/]*(/\*?.*/).*!\1!p' /tmp/foo
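Note that the substitution above keeps the delimiters in its output. As a minimal sketch, assuming a sed that accepts -E (current GNU and BSD versions do), the delimiters can be removed in a second pass. The sample file and the helper names with_delims and strip_delims are hypothetical, and the second step is an addition, not part of the original answer.

```shell
# Hypothetical sample input, written to a temporary file.
tmp=$(mktemp)
printf '%s\n' \
  'for [/fear/] of what the neighbors will say.' \
  'know the /* answer */ but wish we did not.' > "$tmp"

# Step 1: the substitution from the answer, using -E in place of -r.
# It prints the match together with its delimiters.
with_delims() { sed -En 's![^*/]*(/\*?.*/).*!\1!p' "$1"; }

# Step 2 (an assumption, not from the answer): drop the leading "/" or "/*"
# and the trailing "/" or "*/", along with adjacent spaces.
strip_delims() { sed -E 's!^/\*? *!!; s! *\*?/$!!'; }

with_delims "$tmp"                  # prints /fear/ and /* answer */
with_delims "$tmp" | strip_delims   # prints fear and answer
```

Because the .* in step 1 is greedy, a line holding two delimited blocks would be captured from the first opening slash to the last closing slash, so this sketch is only suitable for one block per line.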

Multi-line matching

If you want to perform multi-line matching with sed, things get a little trickier. It can certainly be done, though.

# Multi-line matching of delimiters with GNU sed.
sed -rn ':loop
         /\/[^\/]/ { 
             N
             s![^*/]+(/\*?.*\*?/).*!\1!p
             T loop
         }' /tmp/foo

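The trick is to look for the start delimiter, then keep appending lines in a loop until the end delimiter is found.

This works well as long as there really is an end delimiter. Otherwise, the contents of the file keep being appended to the pattern space until sed finds an end delimiter or reaches the end of the file. This can cause problems with some versions of sed, or with really large files, where the size of the pattern space gets out of hand.

For more information, see GNU sed's Limitations and Non-limitations.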

- Todd A. Jacobs

- TLP · Accepted Answer

use strict;
use warnings;

while (<DATA>) {
    while (m#/(\*?)(.*?)\1/#g) {
        print "$2\n";
    }
}


__DATA__
There are many who dare not kill themselves for [/fear/] of what the neighbors will say.
Advice is what we ask for when we already know the /* answer */ but wish we didn’t.

As a one-liner:

perl -nlwe 'while (m#/(\*?)(.*?)\1/#g) { print $2 }' input.txt

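The inner while loop, thanks to the /g modifier, iterates over all matches in the line. The backreference \1 ensures that we only match identical open/close tags.

If you need to match blocks that span multiple lines, you need to slurp the input: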

use strict;
use warnings;

$/ = undef;
while (<DATA>) {
    while (m#/(\*?)(.*?)\1/#sg) {
        print "$2\n";
    }
}

__DATA__
    There are many who dare not kill themselves for [/fear/] of what the neighbors will say. /* foofer */ 
    Advice is what we ask for when we already know the /* answer */ but wish we didn’t.
foo bar /
baz 
baaz / fooz

As a one-liner:

perl -0777 -nlwe 'while (m#/(\*?)(.*?)\1/#sg) { print $2 }' input.txt

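The -0777 switch and $/ = undef cause file slurping, meaning that the whole file is read into a single scalar. I also added the /s modifier so that the wildcard . matches newlines.

Explanation of the regex: m#/(\*?)(.*?)\1/#sg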

m#              # a simple m//, but with # as delimiter instead of slash
    /(\*?)      # slash followed by optional *
        (.*?)   # shortest possible string of wildcard characters
    \1/         # backref to optional *, followed by slash
#sg             # s modifier to make . match \n, and g modifier

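The "magic" is that the backreference requires a star * before the closing slash only when a star was found after the opening slash.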