Replace text between two strings or patterns with awk

Question

4

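I asked some questions here before, but I am new and do not know why the post was locked or deleted. I am working with a WordPress database that holds around 60,000 "posts", and in the "post_content" column I want to remove the HTML <p> tags together with the text between them. My post content looks like this: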

<p style="text-align: left;"><span style="color: #fffff;">
An entire paragraph of text around 200 words
</span></p>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="309" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>

The p tag is the same everywhere and appears only once per post, but the color may differ in some posts.

The desired output should look like this:

[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="309" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>

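I want to remove all the text inside the paragraph tags, so the text I want removed is "An entire paragraph of text around 200 words". That text is different in every post; the only constants are the opening and closing tags.

According to the previous question, the command to use, suggested by user "PS.", is: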

awk '/<p/,/<\/p>/{next} {print $0}' inputfile

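I ran it on the .sql file after exporting the database, but when I checked the database afterwards the text was still there.

Any help is greatly appreciated.

Update: this question has been solved by Ed Morton.

Using GNU awk for multi-char RS: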
awk -v RS='</p>\\s*' -v ORS= '{sub(/<p.*/,"")} 1' file

- d.ariel

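@JamesBrown Update: I did some research and found out how to dump the output back into an sql file, so I ran this command: awk '/<p/,/<\/p>/{next} {print $0}' test.sql > test_awk.sql The problem is that when I do this, no posts are left in the database. Everything in the "wp_posts" table and in "wp_options" was deleted (after running that awk command it looks like this: http://prntscr.com/d0uc1g) - d.ariel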
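A likely explanation for the emptied tables, sketched under the assumption that the dump keeps each INSERT statement on a single long line (common for mysqldump output, but not confirmed in this thread): range patterns in awk and sed select whole lines, so every line that contains the tags is dropped in full. The three-line dump below is hypothetical. Note also that awk lets a range open and close on the same line, whereas sed only starts looking for the end address on the following line, so sed keeps deleting until a later line contains </p or the file ends.

```shell
# Hypothetical dump: one statement per line, tags embedded in one INSERT.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
CREATE TABLE wp_posts (post_content longtext);
INSERT INTO wp_posts VALUES ('<p style="x">unwanted text</p>[keep_me]');
CREATE TABLE wp_options (option_value longtext);
EOF

# awk: the range starts and ends on the INSERT line, so that whole
# line is skipped, including the [keep_me] part we wanted to keep.
awk_out=$(awk '/<p/,/<\/p>/{next} {print $0}' "$tmp")

# sed: the end address is only tested from the NEXT line onward, so
# the deletion runs through to the end of this file.
sed_out=$(sed '/<p/,/<\/p/d' "$tmp")

printf 'awk output:\n%s\n\nsed output:\n%s\n' "$awk_out" "$sed_out"
rm -f "$tmp"
```

In this sketch awk loses the whole INSERT statement and sed additionally loses the wp_options line, which is consistent with the missing posts and tables reported in the comments.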

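Per the comments below it, your previous question was down-voted for being poor quality (see [ask]) and was deleted after going a month without an answer. This is still a poor-quality question because it lacks necessary information, such as the expected output for the given sample input, so it may not fare well this time either. Please post a [mcve]. - Ed Morton

@EdMorton Sorry, I have updated the question and provided the expected output. The code I posted in the question does produce the wrong output. Thanks. - d.ariel

Thanks for adding that. Now that you have added the expected output, it is very different from what you asked about. Deleting everything between <p...> and </p> is trivial, but keeping tags like <span...> and </span> while deleting everything else is harder. Since there is nothing left between them to operate on, though, why would you keep them? Your input shows only one <p>...</p> pair in the whole file. Is that truly representative of your real data? If not, please show multiple occurrences in the example, because that is harder to handle than one. - Ed Morton

OK, that tells us what is in each "post", but since we do not work in your domain we do not know what a "post" is, and you have not told us how many "posts" there are in each input file. We only know what you tell us, so please be clear and concise and do not assume we know anything about your domain. - Ed Morton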

3 Answers

1

You can try the following sed command -

sed '/<p/,/<\/p/d' kk.txt

The slash in </p has to be escaped.

- VIPIN KUMAR

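Hey, thanks for the reply. I tried the command on a copy of the database .sql file, then imported the database into mysql and checked it through phpmyadmin. The paragraph text still shows. When I run the exact command you posted it looks like it is searching and replacing, but when I check as described, the text is still there: "This text is still there" - d.ariel

Update: OK, I used the sed command like this: sed -i '/<p/,/<\/p/d' test_sed.sql The result now is that after running the command there are no posts and many tables are missing. I checked by importing the database after running sed and then viewing it in phpmyadmin. Am I still doing something wrong? It looks like the code you sent does escape the slash. - d.ariel

@d.ariel Yes, your regex is too "greedy". You need to match the opening <p> tag, followed by any number of characters that do not include the closing tag, followed by </p>. Otherwise "Too<p>This Thing</p>something that should not be removed<p>That thing</p> Little" is reduced to "Too Little", because the whole <p>...</p> span is matched. - Michael - sqlbot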
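The greedy behaviour described in this comment can be reproduced on a single line. The sketch below is illustrative only: the sample string is shortened from the comment, and the awk loop is one possible way to emulate a non-greedy match in POSIX awk (which has no lazy quantifiers), not a command taken from the thread.

```shell
# Shortened sample: two <p> blocks with text to keep in between.
line='Too<p>This Thing</p>KEEP<p>That thing</p> Little'

# Greedy: .* stretches from the first <p to the LAST </p> on the line.
greedy=$(printf '%s\n' "$line" | sed 's/<p.*<\/p>//')

# Non-greedy emulation: find an opening <p> or <p ...> tag, cut only
# up to the FIRST </p> after it, then repeat for the next tag.
lazy=$(printf '%s\n' "$line" | awk '{
    while (match($0, /<p( [^>]*)?>/)) {
        head = substr($0, 1, RSTART - 1)
        rest = substr($0, RSTART + RLENGTH)
        stop = index(rest, "</p>")
        if (stop == 0) break              # unclosed tag: leave it alone
        $0 = head substr(rest, stop + 4)  # 4 = length of "</p>"
    }
    print
}')

printf 'greedy: %s\nlazy:   %s\n' "$greedy" "$lazy"
```

The greedy version swallows the text between the two blocks, while the loop removes each block separately and keeps it.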

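@Michael-sqlbot Should I use awk or sed? I get the "greedy" regex problem with either command. The exact text looks like this: "Text I want removed." The paragraphs all start with the same opening tag. Could you give me a better regex? - d.ariel

@d.ariel - Can you share some data from your current input file in which the unwanted text is still present? - VIPIN KUMAR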

0

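I would not use awk, sed, or perl. Regular expressions are hard to get right, as you have discovered. There is an old joke:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski, 1997

I would not even dump the data and edit the dump file. That is hard too.

A simpler solution is to use MySQL's built-in XPath functions to operate on each post directly in the database. I tested the following solution by querying a version of your sample post content and stripping out the <p> tag (and everything inside it).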

mysql> SELECT post_content, 
       UpdateXml(post_content, '/p', '') AS post_content_without_p 
       FROM posts\G

The output, showing the content before and after, is:

*************************** 1. row ***************************
          post_content: <p style="text-align: left;"><span style="color: #fffff;">
An entire paragraph of text around 200 words
</span></p>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="309" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>

post_content_without_p: 
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="309" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
1 row in set (0.00 sec)

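The UpdateXml() function is one of MySQL's documented built-in functions. It takes three arguments:

The column or expression to read, which should contain XML (HTML works here only to the extent that it is well-formed XML).
An XPath expression that selects which part of the XML to match.
A replacement string to substitute for the matched XML.

Once you have confirmed that the query does what you want, you can update the table contents without exporting and restoring: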

mysql> UPDATE posts SET post_content = UpdateXml(post_content, '/p', '');

Always make a backup before attempting changes like this! Alternatively, copy the data to another database while you experiment.

- Bill Karwin

- Ed Morton · Accepted Answer

Using GNU awk for multi-char RS:

awk -v RS='</p>\\s*' -v ORS= '{sub(/<p.*/,"")} 1' file

This works whether the file contains just one <p...</p> pair or several, e.g.:

$ cat file
<p style="text-align: left;"><span style="color: #fffff;">
First entire paragraph of text around 200 words
</span></p>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="309" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
<p style="text-align: left;"><span style="color: #fffff;">
Second entire paragraph of text around 200 words
</span></p>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="309" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>

.

$ awk -v RS='</p>\\s*' -v ORS= '{sub(/<p.*/,"")} 1' file
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="309" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="309" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>

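The above is obviously fragile and will fail if, for example, <p appears inside [Text_between_brackets]. Specifying more of the <p... line in the sub() call makes it more robust, e.g. you can/should try this instead: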

awk -v RS='</p>\\s*' -v ORS= '{sub(/<p style="text-align: left;"><span style="color: .*/,"")} 1' file
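Multi-character RS is a GNU awk extension. Where only a POSIX awk is available, the same idea can be sketched by reading the whole file into one string and cutting from each specific opening tag through the first </p> after it, plus any trailing whitespace. This is a hedged alternative rather than the answer's own command; the variable names and the shortened sample file are illustrative, and it assumes the file fits comfortably in memory.

```shell
# Shortened version of the sample file with two paragraph blocks.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<p style="text-align: left;"><span style="color: #fffff;">
First paragraph
</span></p>
[Text_between_brackets]
<iframe src="http://somewebsite.com"></iframe>
<p style="text-align: left;"><span style="color: #fffff;">
Second paragraph
</span></p>
[Text_between_brackets]
EOF

stripped=$(awk '
    { buf = buf $0 "\n" }                  # collect the whole file
    END {
        opener = "<p style=\"text-align: left;\"><span style=\"color: "
        while ((s = index(buf, opener)) > 0) {
            rest = substr(buf, s)
            e = index(rest, "</p>")        # first closing tag after opener
            if (e == 0) break              # unclosed block: stop editing
            tail = substr(rest, e + 4)
            sub(/^[ \t\n]+/, "", tail)     # plays the role of \s* in RS
            buf = substr(buf, 1, s - 1) tail
        }
        printf "%s", buf
    }' "$tmp")

printf '%s\n' "$stripped"
rm -f "$tmp"
```

Matching the opener as a literal string with index() also avoids accidental hits on other tags that merely begin with <p.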