Using sed to extract multiple occurrences on one line when the delimiter is unknown

Question

9

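I have a large text file with probabilities embedded in sentences. I want to extract only those probabilities and the text that precedes them. For example: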

Input:

not interesting
foo is 1 in 1,200 and test is 1 in 3.4 not interesting
something else is 1 in 2.5, things are 1 in 10
also not interesting

Desired output:

foo is 1/1,200
and test is 1/3.4
something else is 1/2.5,
things are 1/10

What I have so far:

$ sed -nr ':a s|(.*) 1 in ([0-9.,]+)|\1 1/\2\n|;tx;by; :x h;ba; :y g;/^$/d; p' input

foo is 1/1,200
 and test is 1/3.4
 not interesting
something else is 1/2.5,
 things are 1/10

something else is 1/2.5,
 things are 1/10

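This repeatedly splits the line at each match and tries to print only when the line contained a match. My problem seems to be that the hold space is not cleared once a line is finished.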

The underlying problem is that sed cannot do non-greedy matching, and my delimiter can be any character.
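
To make the greedy behaviour concrete, here is a small sketch using one line of the sample input. It assumes a sed that accepts -E for extended regular expressions; the variable names line, greedy, and global are illustrative only. With a leading (.*), only the last occurrence is rewritten, whereas dropping the prefix group and adding the g flag rewrites every occurrence.

```shell
line='foo is 1 in 1,200 and test is 1 in 3.4 not interesting'

# Greedy (.*) swallows everything up to the LAST "1 in N",
# so only that final occurrence is rewritten.
greedy=$(printf '%s\n' "$line" | sed -E 's|(.*) 1 in ([0-9.,]+)|\1 1/\2|')

# Without the leading (.*), the g flag rewrites each occurrence in turn.
global=$(printf '%s\n' "$line" | sed -E 's| 1 in ([0-9.,]+)| 1/\1|g')

printf '%s\n' "$greedy" "$global"
```

The first printed line still contains "1 in 1,200"; the second has both probabilities converted, which is the starting point the sed answers below build on.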

I suppose a solution in another language would be fine too, but by now I am curious whether this is possible in sed at all.

- phiresky

3 Answers

4

This might work for you (GNU sed):

sed -r 's/([0-9]) in ([0-9]\S*\s*)/\1\/\2\n/;/[0-9]\/[0-9]/P;D' file

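This replaces a digit, followed by " in ", followed by a token that begins with a digit and any whitespace after it, with the first digit, a "/", the second token, and a newline. If the first line of the pattern space then contains a digit, a "/", and a digit, it is printed. That first line is then deleted, and the cycle repeats on whatever remains in the pattern space.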
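
The walkthrough above may be easier to follow with the same three commands spread over separate, commented lines. This is only a sketch: it assumes GNU sed (for \S, \s, and \n in the replacement, and for # comments inside the script), uses -E as the newer spelling of -r, and the shell variables input, script, and out are illustrative names.

```shell
# Sample input from the question.
input='not interesting
foo is 1 in 1,200 and test is 1 in 3.4 not interesting
something else is 1 in 2.5, things are 1 in 10
also not interesting'

script='
# Rewrite the first "N in M" and put a newline right after it.
s/([0-9]) in ([0-9]\S*\s*)/\1\/\2\n/
# P prints up to the first newline, but only if that part holds "N/M".
/[0-9]\/[0-9]/P
# D drops what P looked at and reruns the script on the remainder
# without reading a new input line; with no newline left it acts like d.
D
'

out=$(printf '%s\n' "$input" | sed -E "$script")
printf '%s\n' "$out"
```

Because \s* sits inside the second capture group, a printed line can keep a trailing space; piping the result through a further substitution that strips trailing whitespace would remove it.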

- potong

2

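Yes, sed can do it, although it is not the best tool for the job. My attempt searches for every "number in number" pattern and appends a newline after each one. It then deletes the trailing text (the part with no newline after it), strips the leading spaces, and prints: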

sed -nr '/([0-9]+) in ([0-9,.]+)/ { s//\1\/\2\n/g; s/\n[ ]*/\n/g; s/\n[^\n]*$//; p }' file

That yields:

foo is 1/1,200
and test is 1/3.4
something else is 1/2.5,
things are 1/10

- Birei

Thanks, this is great and looks like exactly what I wanted. - phiresky

1

I don't think this solution produces the required output. - potong

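Right, the x in y -> x/y conversion was missing. It did solve the part I was stuck on; I just added another sed call after it. I have changed the accepted answer to the other one. - phiresky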

@potong: Thanks. Fixed, using \1 and \2 instead of &. - Birei

- Ed Morton - SO stop bullying · Accepted Answer

sed is for simple substitutions on individual lines; for anything more interesting, use awk:

$ cat tst.awk
{
    while ( match($0,/\s*([^0-9]+)([0-9]+)[^0-9]+([0-9,.]+)/,a) ) {
        print a[1] a[2] "/" a[3]
        $0 = substr($0,RSTART+RLENGTH)
    }
}
$ awk -f tst.awk file
foo is 1/1,200
and test is 1/3.4
something else is 1/2.5,
things are 1/10

The above uses GNU awk for the third argument to match() and for \s, which is shorthand for [[:space:]].
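
Where GNU awk is not available, the same loop can be sketched with only the two-argument match() plus RSTART and RLENGTH. This is a variation rather than the answer's own script: it assumes the separator between the two numbers is always the literal " in ", and the names input, prog, out, prefix, odds, and rest are illustrative.

```shell
input='not interesting
foo is 1 in 1,200 and test is 1 in 3.4 not interesting
something else is 1 in 2.5, things are 1 in 10
also not interesting'

prog='
{
    # Consume the line one "N in M" match at a time, left to right.
    while (match($0, /[0-9]+ in [0-9,.]+/)) {
        prefix = substr($0, 1, RSTART - 1)     # text before the match
        odds   = substr($0, RSTART, RLENGTH)   # for example "1 in 1,200"
        rest   = substr($0, RSTART + RLENGTH)  # unprocessed remainder
        sub(/^[ \t]+/, "", prefix)             # drop leading blanks
        sub(/ in /, "/", odds)                 # "1 in 1,200" becomes "1/1,200"
        print prefix odds
        $0 = rest
    }
}'

out=$(printf '%s\n' "$input" | awk "$prog")
printf '%s\n' "$out"
```

Splitting each match into prefix, odds, and rest with substr() stands in for the capture-group array that GNU awk fills in through the third argument.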