How to use sed/grep to extract text between two words?

Question

213

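I am trying to output a string that contains everything between two words of a string:

input:

"Here is a String"

output:

"is a"

using:

sed -n '/Here/,/String/p'

This includes the endpoints, but I don't want to include them.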

- user1190650

10

What should the result be if the input is Here is a Here String? Or I Hereby Dub Thee Sir Stringy? - ghoti

6

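FYI. Your command means: print everything between the line that contains the word "Here" and the line that contains the word "String", which is not what you want. - Hai Vu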
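To illustrate the point about line ranges, here is a minimal sketch. The helper name print_range and the sample lines are made up for illustration; the sed command is the one from the question.

```shell
# Hypothetical wrapper around the command from the question.
print_range() {
  sed -n '/Here/,/String/p'
}

# The range selects whole LINES, from the first line containing "Here"
# through the next line containing "String"; nothing inside a line is trimmed.
printf '%s\n' 'before' 'Here is line 2' 'middle' 'a String ends' 'after' | print_range
# Here is line 2
# middle
# a String ends

# With a single input line the whole line is printed, endpoints included.
echo 'Here is a String' | print_range
# Here is a String
```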

1

The other common sed FAQ is "how can I extract text between particular lines"; that is https://dev59.com/iXLYa4cB1Zd3GeqPWE5b - tripleee

14 Answers

155

sed -e 's/Here\(.*\)String/\1/'

- Brian Campbell

2

Thanks! What if I wanted to find everything between "one is" and "String" in "Here is a one is a String"? (sed -e 's/one is\(.*\)String/\1/'?) - user1190650

9

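That would work if you want to see the "Here is a" as well. You can test it out: echo "Here is a one is a String" | sed -e 's/one is\(.*\)String/\1/'. If you just want the part between "one is" and "String", then you need to make the regex match the whole line: sed -e 's/.*one is\(.*\)String.*/\1/'. In sed, s/pattern/replacement/ means "substitute 'replacement' for whatever matches 'pattern' on each line". It only changes the part that matches "pattern", so if you want it to replace the whole line, you need to make "pattern" match the whole line. - Brian Campbell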
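A short sketch of the difference between the two substitutions, using the sample string from the comments. The variable names are only for illustration, and the brackets in the output are there to make the surrounding spaces visible.

```shell
line='Here is a one is a String'

# The pattern matches only "one is a String", so the text before it survives.
partial=$(echo "$line" | sed -e 's/one is\(.*\)String/\1/')

# The pattern matches the whole line, so only the captured group remains.
whole=$(echo "$line" | sed -e 's/.*one is\(.*\)String.*/\1/')

printf '[%s]\n' "$partial" "$whole"
# [Here is a  a ]
# [ a ]
```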

9

This breaks when the input is Here is a String Here is a String. - Jay D

1

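It would be great to see a solution for this case: "Here is a String, Here is 1 a String, Here is 2 a String". The output should pick up only the first substring between Here and String. - Jay D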

1

@JayD sed does not support non-greedy matching; see this question for some recommended alternatives. - Brian Campbell

97

The accepted answer does not remove text that may appear before Here or after String. This will:

sed -e 's/.*Here\(.*\)String.*/\1/'

The main difference is the addition of .* immediately before Here and immediately after String.

- wheeler

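Your answer is promising. One issue though: if there are multiple String occurrences on the same line, how can I extract only up to the first one? Thanks. - Dr. Mian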

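@MianAsbatAhmad You would want to make the * quantifier between Here and String non-greedy (or lazy). However, according to this Stackoverflow question, the type of regex used by sed does not support lazy quantifiers (a ? immediately after .*). Usually, to emulate a lazy quantifier you would match everything except the token you don't want to match, but in this case it is not a single token; it is the whole string String. - wheeler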
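One way to approximate the lazy match in plain sed is to cut the line at the first end marker before stripping the start marker. This is only a sketch: the helper name first_between is hypothetical, and it assumes that String does not occur before the Here you are interested in. Lines without the markers pass through unchanged.

```shell
# Hypothetical helper: text between the markers, stopping at the FIRST "String".
first_between() {
  # 1) delete the first "String" and everything after it
  # 2) delete everything up to and including the last remaining "Here"
  sed -e 's/String.*//' -e 's/.*Here//'
}

echo 'Here is a String, and Here is another String.' | first_between
# prints " is a " (with the surrounding spaces)
```

The order matters: running the two substitutions the other way round would start from the last Here on the line instead.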

Thanks, I got the answer using awk: https://stackoverflow.com/questions/51041463/how-to-extract-line-portion-on-the-basis-of-start-substring-and-end-substring-us/51047792#51047792 - Dr. Mian

1

Unfortunately, this does not work if the string contains line breaks. - WitaloBenicio

@wheeler Replace . with [\s\S] instead of removing the newlines. - sreekanth balu
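For input that spans several lines, another option is to read the whole input first and then search it as one string. The following is a minimal sketch with awk, not taken from any of the answers; the helper name between_multiline is hypothetical, and it returns only the first occurrence. Note that awk interprets backslash escapes in values passed with -v.

```shell
# Hypothetical helper: first text between two literal markers, across line breaks.
# usage: between_multiline START END < input
between_multiline() {
  awk -v start="$1" -v end="$2" '
    # Collect all input lines into one buffer, joined by newlines.
    { buf = buf (NR > 1 ? "\n" : "") $0 }
    END {
      i = index(buf, start)                 # first start marker
      if (i == 0) exit
      rest = substr(buf, i + length(start))
      j = index(rest, end)                  # first end marker after it
      if (j == 0) exit
      print substr(rest, 1, j - 1)
    }'
}

printf 'Here is\na String\n' | between_multiline Here String
# prints " is" and "a " on two lines
```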

47

You can strip the string in Bash alone:

$ foo="Here is a String"
$ foo=${foo##*Here }
$ echo "$foo"
is a String
$ foo=${foo%% String*}
$ echo "$foo"
is a
$
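The expansions above differ in how much they strip: # and % remove the shortest match, ## and %% the longest. A small sketch of what that means when the markers repeat (the sample string and variable names are only for illustration):

```shell
s='Here is a String, and Here is another String.'

# '#' strips the shortest prefix matching '*Here ', i.e. up to the FIRST "Here ".
first=${s#*Here }
# '%%' strips the longest suffix matching ' String*', i.e. from the FIRST " String".
first=${first%% String*}
echo "$first"    # is a

# '##' strips the longest prefix, i.e. up to the LAST "Here ".
last=${s##*Here }
last=${last%% String*}
echo "$last"     # is another
```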

If you have a GNU grep that includes PCRE support, you can use zero-width assertions:

$ echo "Here is a String" | grep -Po '(?<=(Here )).*(?= String)'
is a

- ghoti

Why is this method so slow? Stripping a large HTML page this way takes about 10 seconds. - Adam Johns

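@AdamJohns, which method? The PCRE one? PCRE is fairly complex, but 10 seconds seems extreme. If you're concerned, I suggest asking a question that includes example code and seeing what the experts say. - ghoti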

1

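I think it was slow for me because the source of a very large HTML file was being held in a variable. When I wrote the contents to a file and parsed the file instead, it was significantly faster. - Adam Johns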

This should be the accepted answer, since it uses pure Bash. - Akito

33

If you have a long file with many multi-line occurrences, it is useful to print the line numbers first:

cat -n file | sed -n '/Here/,/String/p'

- alemol

5

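Thanks! This is the only solution that worked in my case (a multi-line text file rather than a single string without line breaks). Obviously, to get the output without line numbers, the -n option to cat has to be omitted. - Jeffrey Lebowski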

2

In that case, cat can be omitted entirely; sed knows how to read a file or standard input. - tripleee
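If the line numbers are still wanted without cat, one possibility is awk's range pattern, which selects lines the same way as the sed address range. This is a sketch rather than part of the answer; range_with_numbers is a hypothetical helper name.

```shell
# Hypothetical helper: print each line of the Here..String range with its line number.
# Reads the files given as arguments, or standard input if there are none.
range_with_numbers() {
  awk '/Here/,/String/ { printf "%d\t%s\n", NR, $0 }' "$@"
}

printf '%s\n' 'x' 'Here' 'mid' 'String' 'y' | range_with_numbers
# 2	Here
# 3	mid
# 4	String
```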

30

Through GNU awk,

$ echo "Here is a string" | awk -v FS="(Here|string)" '{print $2}'
 is a

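grep with the -P (Perl regex) option supports \K, which discards the previously matched characters. In our case, the previously matched string is Here, so it is dropped from the final output.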

$ echo "Here is a string" | grep -oP 'Here\K.*(?=string)'
 is a 
$ echo "Here is a string" | grep -oP 'Here\K(?:(?!string).)*'
 is a

If you want the output to be is a without the surrounding spaces, you can try the following:

$ echo "Here is a string" | grep -oP 'Here\s*\K.*(?=\s+string)'
is a
$ echo "Here is a string" | grep -oP 'Here\s*\K(?:(?!\s+string).)*'
is a

- Avinash Raj

This does not work for: echo "Here is a string dfdsf Here is a string" | awk -v FS="(Here|string)" '{print $2}'. It returns only is a, whereas it should return is a is a. @Avinash Raj - alper
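With that field separator, the text between each Here and the following string lands in the even-numbered fields, so one way to get every occurrence is to loop over them. This is a sketch that assumes the markers strictly alternate on the line (Here, string, Here, string, ...); all_between is a hypothetical helper name.

```shell
# Hypothetical helper: print every even-numbered field, one per line.
all_between() {
  awk -v FS="(Here|string)" '{ for (i = 2; i <= NF; i += 2) print $i }'
}

echo "Here is a string dfdsf Here is a string" | all_between
# prints " is a " twice, one per line (with the surrounding spaces)
```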

12

To understand the sed command, we have to build it step by step. Here is the original text:

user@linux:~$ echo "Here is a String"
Here is a String
user@linux:~$

Let's try to remove the string Here with the s (substitute) command in sed:

user@linux:~$ echo "Here is a String" | sed 's/Here //'
is a String
user@linux:~$

At this point, I believe you would be able to remove String as well.

user@linux:~$ echo "Here is a String" | sed 's/String//'
Here is a
user@linux:~$

But this is not the output you want.

To combine the two sed commands, use the -e option.

user@linux:~$ echo "Here is a String" | sed -e 's/Here //' -e 's/String//'
is a
user@linux:~$

Hope this helps.

- user9013730

Thank you very much for the explanation; it really helped me understand what it is actually doing. - Kosz

10

You can use two s commands.

$ echo "Here is a String" | sed 's/.*Here//; s/String.*//'
 is a

This also works for:

$ echo "Here is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
 is a

$ echo "Here is a StringHere is a StringHere is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
 is a

- Ivan

9

This might work for you (GNU sed):

sed '/Here/!d;s//&\n/;s/.*\n//;:a;/String/bb;$!{n;ba};:b;s//\n&/;P;D' file

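This prints each piece of text found between the two markers (in this example Here and String) on its own line, and preserves any newlines within the text.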

- potong

8

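All of the solutions above share a deficiency: they go wrong when the last search string is repeated elsewhere in the string. I found it best to write a bash function.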

    function str_str {
      local str
      str="${1#*${2}}"
      str="${str%%$3*}"
      echo -n "$str"
    }

    # test it ...
    mystr="this is a string"
    str_str "$mystr" "this " " string"

- Gary Dean

- anishsane · Accepted Answer

GNU grep can also support positive and negative look-ahead and look-behind. For your case, the command would be:

echo "Here is a string" | grep -o -P '(?<=Here).*(?=string)'

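If there are multiple occurrences of Here and string, you can choose whether to match from the first Here to the last string, or to match each pair separately. In regex terms, this is called a greedy match (first case) or a non-greedy match (second case).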

$ echo 'Here is a string, and Here is another string.' | grep -oP '(?<=Here).*(?=string)' # Greedy match
 is a string, and Here is another 
$ echo 'Here is a string, and Here is another string.' | grep -oP '(?<=Here).*?(?=string)' # Non-greedy match (Notice the '?' after '*' in .*)
 is a 
 is another