Remove all lines except the last one that start with the same string

Question

3

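I'm using awk to process a file and filter out the specific lines I'm interested in. From the resulting output, I'd like to remove every line except the last one among those that start with the same string.

Here is an example of what is generated: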

this is a line
duplicate remove me
duplicate this should go too
another unrelated line
duplicate but keep me
example remove this line
example but keep this one
more unrelated text

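Lines 2, 3 and 5 all start with duplicate. Lines 2 and 3 should be removed, and line 5 should be kept because it is the last line starting with duplicate.

The same goes for line 6: it starts with example, and so does line 7, so only line 7 should be kept because it is the last line starting with example.

Given the example above, I want to produce the following output: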

this is a line
another unrelated line
duplicate but keep me
example but keep this one
more unrelated text

How can I achieve this?

I tried the following, but it doesn't work correctly:

awk -f initialProcessing.awk largeFile | awk '{currentMatch=$1; line=$0; getline; nextMatch=$1; if (currentMatch != nextMatch) {print line}}' -

- Jack Greenhill

Your example is unclear. - maazza

2 Answers

2

You can use an (associative) array that always keeps the last occurrence:

awk '{last[$1]=$0;} END{for (i in last) print last[i];}' file
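One caveat worth noting: awk does not specify the order in which `for (i in last)` visits the keys, so the one-liner above may print the surviving lines in a different order than they had in the input. The following is a minimal sketch of an order-preserving variant, not part of the original answer. The function name `keep_last` and the arrays `line`, `key` and `last` are names chosen for illustration. It assumes the whole input fits in memory, and it reads standard input when no file is given, so it also works at the end of a pipe.

```shell
# Hypothetical helper: keep only the last line for each first word,
# printing the survivors in their original input order.
keep_last() {
    awk '
        {
            line[NR] = $0      # remember every line by its number
            key[NR]  = $1      # remember the first word of that line
            last[$1] = NR      # overwrite: ends up as the last line number for this word
        }
        END {
            for (i = 1; i <= NR; i++)
                if (last[key[i]] == i)   # line i is the final occurrence of its first word
                    print line[i]
        }
    ' "$@"
}
```

It could be used as `awk -f initialProcessing.awk largeFile | keep_last`, or as `keep_last file`.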

- Andras Deak -- Слава Україні

- fedorqui · Accepted Answer

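Why not read the file from the end and print only the first line that contains "duplicate"? That way you don't have to worry about what has or hasn't been printed, or keep track of line counts and the like. For example: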

tac file | awk '/duplicate/ {if (f) next; f=1}1' | tac

The first time duplicate is seen, the flag f is set. From the second time on, that flag causes the line to be skipped.
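Note that `/duplicate/` matches the word anywhere in the line, not only at the start. As a rough sketch of the same flag technique restricted to the first word, the command could be spelled out as below. The function name `keep_last_of` and the variables `w` and `found` are illustrative names, not from the answer, and `tac` is assumed to be available (it is part of GNU coreutils).

```shell
# Hypothetical helper: keep_last_of WORD [FILE...]
# Drops every line whose first word is WORD, except the last such line.
keep_last_of() {
    word=$1
    shift
    tac "$@" | awk -v w="$word" '
        $1 == w {              # line whose first word is the target
            if (found) next    # one was already kept (the last in original order): skip
            found = 1          # first hit in reversed input: keep it and set the flag
        }
        { print }              # print everything that was not skipped
    ' | tac
}
```

For instance, `keep_last_of duplicate file` only touches the lines starting with duplicate and leaves the example lines alone.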

If you want a general solution in which only the last line for each first word is printed, use an array:

tac file | awk '!seen[$1]++' | tac

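This keeps track of the first words seen so far. They are stored in the array seen[], so the expression !seen[$1]++ is true only the first time a given $1 appears; from the second time on it evaluates to false and the line is not printed.

Test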
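To make the idiom easier to read, here is a sketch of what `!seen[$1]++` does when written out in long form. The helper names `first_seen` and `last_seen` are made up for illustration, and `tac` is assumed to be available.

```shell
# Hypothetical helper: print a line only the first time its first word is seen.
first_seen() {
    awk '{
        if (seen[$1] == 0)   # an unset array entry compares equal to 0
            print            # so this fires only on the first occurrence
        seen[$1]++           # count this occurrence
    }' "$@"
}

# Hypothetical helper: reverse the input, keep first occurrences, reverse back.
# The first occurrence in reversed order is the last one in the original order.
last_seen() {
    tac "$@" | first_seen | tac
}
```

`first_seen` on its own keeps the first line per word; wrapping it between two `tac` calls is what turns it into "keep the last".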

$ tac a | awk '!seen[$1]++' | tac
this is a line
another unrelated line
duplicate but keep me
example but keep this one
more unrelated text