Can sed regex simulate lookbehind and lookahead?

Question


regex sed awk regex-negation regex-lookarounds

9

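I am trying to write a sed script that captures all "naked" URLs in a text file and replaces them with <a href=[URL]>[URL]</a>. By "naked" I mean a URL that is not wrapped inside an anchor tag.

My initial thought was to match URLs that have neither " nor > in front of them and neither < nor " after them. However, I am having difficulty expressing the concept of "not preceded or followed by", because as far as I know sed has no lookahead or lookbehind.

Sample input: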

[Beginning of File]http://foo.bar arbitrary text
http://test.com other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! http://yahoo.com[End of File]

Sample desired output:

[Beginning of File]<a href="http://foo.bar">http://foo.bar</a> arbitrary text
<a href="http://test.com">http://test.com</a> other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! <a href="http://yahoo.com">http://yahoo.com</a>[End of File]

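Note that the third line is unmodified because it is already inside an <a href> tag. On the other hand, the first and second lines are both modified. Finally, note that all non-URL text is unmodified.

Ultimately, what I am trying to do is: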

sed s/[^>"](http:\/\/[^\s]\+)/<a href="\1">\1<\/a>/g 2-7-2013

I started by verifying that the following correctly matches and removes URLs:

sed 's/http:\/\/[^\s]\+//g'

I then tried this, but it fails to match URLs that sit at the very beginning of the file/input:

sed 's/[^\>"]http:\/\/[^\s]\+//g'
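The failure can be reproduced in isolation. The sketch below assumes GNU sed (for `\+`); the helper name `strip_urls` is made up for illustration, and `[^[:space:]]` stands in for `[^\s]`, since a backslash inside a POSIX bracket expression is normally literal. The point it shows: `[^>"]` must consume one real character, so it cannot match when the URL is the first thing on the line.

```shell
# Hypothetical helper: delete URLs that are preceded by a character
# other than > or ". Assumes GNU sed; [^[:space:]] replaces [^\s].
strip_urls() {
  sed 's/[^>"]http:\/\/[^[:space:]]\+//g'
}

# URL at the start of the line: nothing precedes it, so no match.
printf '%s\n' 'http://test.com other text' | strip_urls

# URL after a space: the space and the URL are both consumed.
printf '%s\n' 'see http://test.com other text' | strip_urls
```

The second command also removes the character in front of the URL, which is why any sed workaround needs to capture that character and put it back.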

Is there a way to work around this in sed, either by simulating lookbehind/lookahead or by explicitly matching the beginning and end of the file?

- merlin2011

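Why are you using [^\>"]? - SwiftMango

I am looking for a URL that is not preceded by a quote or a greater-than sign. - merlin2011

Please update your question to show some representative sample input and the expected output - that matters more to us than what you have tried (though that is useful too). - Ed Morton

@EdMorton, note that the question has been updated with sample input and output. - merlin2011

2 Answers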

2

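Your command has some obvious problems.

You did not escape the parenthesis "("

The odd thing about sed regex is that, unlike Perl regex, many symbols are "literal" by default, and you have to escape them to make them "functional". Try: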

s/\([^>"]\?\)\(http:\/\/[^\s]\+\)/\1<a href="\2">\2<\/a>/g

- SwiftMango

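To clarify, I am trying to match URLs that do not have a double quote or a greater-than sign in front of them. - merlin2011

The given solution will not match http://google.com at the beginning of the file or the beginning of the input. - merlin2011

@merlin2011 I see what you mean. sed does not support lookahead/lookbehind; I just made an edit. The question mark makes it optional. - SwiftMango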

4

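Regarding the odd \(, one option is to use sed -r, so that you don't need to escape the (. (I even have an alias named rsed.) - darque

@texasbruce, when you make it optional, it now matches URLs inside <a href=, which is not the intent. - merlin2011

By the way, you can use the "-E" flag to get "modern" regular expressions. Then you don't need to escape the parentheses. - abalter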
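For what it's worth, one common way to approximate a negative lookbehind in sed is to alternate between the start of the line and a "permitted" preceding character, capture whichever matched, and put it back in the replacement. The following is only a sketch under assumptions: it uses the -E flag mentioned above, the function name `linkify` is made up, and the URL character class is borrowed from the awk answer below rather than from the question, so it may not cover every URL format.

```shell
# Hypothetical helper: wrap "naked" http:// URLs in anchor tags.
# (^|[^">]) emulates "not preceded by a quote or >": either the URL
# starts the line, or the preceding character is captured in \1 and
# written back. The URL class [[:alnum:]._]+ is an assumption taken
# from the awk answer's urlRe.
linkify() {
  sed -E 's#(^|[^">])(http://[[:alnum:]._]+)#\1<a href="\2">\2</a>#g'
}

printf '%s\n' \
  '[Beginning of File]http://foo.bar arbitrary text' \
  'http://test.com other text' \
  '<a href="http://foobar.com">http://foobar.com</a>' \
  'Nearing end of file!!! http://yahoo.com[End of File]' | linkify
```

Because sed processes input line by line, ^ covers both the start of the file and the start of every other line. No lookahead is needed here: the restricted URL class simply stops before characters such as <, " or [.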


- Ed Morton - SO stop bullying · Accepted Answer

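sed is an excellent tool for simple substitutions on a single line; for any other text manipulation problem, just use awk.

Check the definition I'm using in the BEGIN section below for a regexp that matches URLs. It works for your sample, but I don't know whether it captures all possible URL formats. Even if it doesn't, it may be adequate for your needs.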

$ cat file
[Beginning of File]http://foo.bar arbitrary text
http://test.com other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! http://yahoo.com[End of File]
$
$ awk -f tst.awk file
[Beginning of File]<a href="http://foo.bar">http://foo.bar</a> arbitrary text
<a href="http://test.com">http://test.com</a> other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! <a href="http://yahoo.com">http://yahoo.com</a>[End of File]
$
$ cat tst.awk
BEGIN{ urlRe="http:[/][/][[:alnum:]._]+" }
{
    head = ""
    tail = $0
    while ( match(tail,urlRe) ) {
       url  = substr(tail,RSTART,RLENGTH)
       href = "href=\"" url "\""

       if (index(tail,href) == (RSTART - 6) ) {
          # this url is inside href="url" so skip processing it and the next url match.
          count = 2
       }

       if (! (count && count--)) {
          url = "<a " href ">" url "</a>"
       }

       head = head substr(tail,1,RSTART-1) url
       tail = substr(tail,RSTART+RLENGTH)
    }

    print head tail
}