Remove a specific duplicate line from an XML file in place

Question

3

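I've been reading posts on Stack about removing duplicate lines. There are perl, awk, and sed solutions, but none is as specific as what I need, and I'm stuck.

I'd like a quick bash/shell/perl command that removes the duplicate <upath> line from this XML, comparing case-insensitively.

Input XML: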

  <package>
    <id>1523456789</id>
    <models>
      <model type="A">
        <start>2016-04-20</start>      <------ Duplicate line to keep 
        <end>2017-04-20</end>          <------ Duplicate line to keep
      </model>
      <model type="B">                 
        <start>2016-04-20</start>      <------ Duplicate line to keep
        <end>2017-04-20</end>          <------ Duplicate line to keep
      </model>
    </models>
    <userinterface>
      <upath>/Example/Dir/Here</upath>
      <upath>/Example/Dir/Here2</upath>
      <upath>/example/dir/here</upath>   <------ Duplicate line to REMOVE
    </userinterface>
  </package>

So far I've managed to find the duplicated line, but I don't know how to delete it. Here is the code:

grep -H path *.[Xx][Mm][Ll] | sort | uniq -id

Which gives:

test.xml:          <upath>/example/dir/here</upath>

How do I remove that line now?

Running either the perl or the awk version below also removes the repeated <start> and <end> date lines.

perl -i.bak -ne 'print unless $seen{lc($_)}++' test.xml
awk '!a[tolower($0)]++' test.xml > test.xml.new

- dlite922

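How is "<start>…" a duplicate of "<end>…"? Your final awk solution should work fine. - William Pursell

@John, some of the XML files here are in uppercase. - dlite922

@WilliamPursell <start>2016-04-20</start> and <end>2017-04-20</end> each appear twice in the file. - ThisSuitIsBlackNot

I want to remove the duplicate <upath> tags case-insensitively. - dlite922

"I want to remove the duplicate tags case-insensitively from this XML case." Do you mean you want some code that will just remove the damned lines, no matter what they mean to me or whose funeral I'm on my way to? - Borodin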

5 Answers

2

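If you're parsing XML, you really should use a parser. There are several options, but don't use regular expressions, because they make for very brittle code, for all the reasons you've been finding.

See: Parsing XML with regular expressions.

The long and short of it: XML is a contextual language, and regular expressions aren't. XML also allows perfectly valid variations that are semantically identical but that a regex can't handle.

For example: unary tags, variable indentation, tags at different places in the path, and line feeds.

I can format your source XML in a number of different ways, all of them valid XML that says the same thing, and each one will break regex-based parsing. That's something to avoid, because one day your script will suddenly break as the result of a change that is entirely within the XML spec.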
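To make that concrete, here is a minimal sketch (the sample data and the `dedupe` helper name are made up for illustration, not taken from the question). It runs the line-based awk filter from another answer on this page over two renderings of the same elements: one `<upath>` per line, and both on a single line. Both are valid XML with the same meaning, but only the first gets deduplicated.

```shell
#!/bin/sh
# Line-based filter: drop a line if it contains <upath> and its
# lower-cased first field has been seen before.
dedupe() { awk '!(/<upath>/ && seen[tolower($1)]++)'; }

# Rendering 1: one <upath> element per line.
one_per_line=$(printf '%s\n' \
  '<userinterface>' \
  '  <upath>/Example/Dir/Here</upath>' \
  '  <upath>/example/dir/here</upath>' \
  '</userinterface>' | dedupe)

# Rendering 2: the same elements on a single line (still valid XML).
same_line=$(printf '%s\n' \
  '<userinterface>' \
  '  <upath>/Example/Dir/Here</upath> <upath>/example/dir/here</upath>' \
  '</userinterface>' | dedupe)

printf 'one element per line:\n%s\n\n' "$one_per_line"
printf 'both elements on one line:\n%s\n' "$same_line"
```

In the second rendering awk sees a single line whose first field is new, so the duplicate element survives untouched.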

That's why you should use a parser:

I like XML::Twig, which is a perl module. You can do what you want like this:

#!/usr/bin/env perl
use strict;
use warnings;

use XML::Twig; 

my %seen; 

#a subroutine to process any "upath" tags. 
sub process_upath {
   my ( $twig, $upath ) = @_; 
   my $text = lc $upath -> trimmed_text;
   $upath -> delete if $seen{$text}++; 
}

#instantiate the parser, and configure what to 'handle'. 
my $twig = XML::Twig -> new ( twig_handlers => { 'upath' => \&process_upath } );
   #parse from our data block - but you'd probably use a file handle here. 
   $twig -> parse ( \*DATA );
   #set output formatting
   $twig -> set_pretty_print ( 'indented_a' );
   #print to STDOUT.
   $twig -> print;

__DATA__
  <package>
    <id>1523456789</id>
    <models>
      <model type="A">
        <start>2016-04-20</start>   
        <end>2017-04-20</end>    
      </model>
      <model type="B">                 
        <start>2016-04-20</start>     
        <end>2017-04-20</end>        
      </model>
    </models>
    <userinterface>
      <upath>/Example/Dir/Here</upath>
      <upath>/Example/Dir/Here2</upath>
      <upath>/example/dir/here</upath>   
    </userinterface>
  </package>

This is the long form, to illustrate the concept, and it outputs:

<package>
  <id>1523456789</id>
  <models>
    <model type="A">
      <start>2016-04-20</start>
      <end>2017-04-20</end>
    </model>
    <model type="B">
      <start>2016-04-20</start>
      <end>2017-04-20</end>
    </model>
  </models>
  <userinterface>
    <upath>/Example/Dir/Here</upath>
    <upath>/Example/Dir/Here2</upath>
  </userinterface>
</package>

Using the parsefile_inplace method, it can be cut down considerably.

- Sobrique

Thanks, but that's overkill. Ed Morton fixed my awk for this. - dlite922

2

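Awk is a bad idea, because it doesn't take context into account. XML is a contextual language, so this kind of solution will always be brittle and liable to break on perfectly valid changes to the input. See https://dev59.com/X3I-5IYBdhLWcg3wq6do#1732454. - Sobrique

That's fine, I'm only doing this once or twice until I fix a bug in the application. Otherwise I'd have to edit the XML by hand. - dlite922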

1

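If you only want to drop consecutive duplicate lines, you can store the previous line and compare it with the current one. To ignore case, apply tolower() to both sides of the comparison.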

awk '{ if (tolower(prev) != tolower($0)) print; prev = $0 }'
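A quick way to see what the previous-line comparison does and does not catch is to run it on two tiny inputs. This is only a sketch: the `dedupe_adjacent` name and the sample lines are invented for the demo, and the `NR == 1` guard is an addition so that an empty first line is not swallowed.

```shell
#!/bin/sh
# Drop a line only when it equals the previous line, ignoring case.
# NR == 1 always prints, so an empty first line is not lost.
dedupe_adjacent() {
  awk 'NR == 1 || tolower(prev) != tolower($0) { print } { prev = $0 }'
}

# Duplicates directly after one another: the second one is dropped.
adjacent=$(printf '%s\n' \
  '<upath>/A</upath>' '<upath>/a</upath>' '<upath>/B</upath>' | dedupe_adjacent)

# Duplicates separated by another line: nothing is dropped.
separated=$(printf '%s\n' \
  '<upath>/A</upath>' '<upath>/B</upath>' '<upath>/a</upath>' | dedupe_adjacent)

printf 'adjacent duplicates:\n%s\n\n' "$adjacent"
printf 'separated duplicates:\n%s\n' "$separated"
```

The second input comes back unchanged, which is the limitation of this approach: it only helps when the duplicates sit on neighbouring lines.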

- fejese

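Unfortunately the lines aren't always one right after the other; sometimes there's a different third line in between. I'll take that into account and update the question. - dlite922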

0

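It looks like you're working with XML. Would you like to parse it?

Hey, I had never parsed with Perl before, but there are tutorials and such… It isn't very straightforward. From reading the XML::SAX::ParserFactory and XML::SAX::Base docs, I came up with the code you see at the bottom of this answer.

The question has been updated and no longer has adjacent lines; previously this answer said:

OK, I see that across the whole file you have two <start> tags with matching dates and two <end> tags with matching dates, but they're in different sections. If all the duplicate lines were effectively adjacent, as in your example, you would only need the uniq command from GNU Coreutils or an equivalent. That command can ignore case with the right LC_COLLATE environment setting, but honestly I had a hard time finding an example of, or documentation on, how to use LC_COLLATE to ignore case.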
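For the adjacent case, the `-i` option that the question's own pipeline already passes to uniq should be enough, without touching LC_COLLATE. A minimal sketch with made-up sample lines (it assumes a uniq that supports `-i`, as GNU Coreutils does):

```shell
#!/bin/sh
# uniq only collapses *adjacent* lines; -i makes the comparison
# case-insensitive and keeps the first line of each run.
collapsed=$(printf '%s\n' \
  '<upath>/Example/Dir/Here</upath>' \
  '<upath>/example/dir/here</upath>' \
  '<upath>/Example/Dir/Here2</upath>' | uniq -i)

printf '%s\n' "$collapsed"
```

Note that uniq compares whole lines, so two duplicates with different indentation or trailing spaces would not be collapsed.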

Moving on to using a parser:

#!/usr/bin/perl
use XML::SAX;

my $parser = XML::SAX::ParserFactory->parser(
    Handler => TestXMLDeduplication->new()
);

my $ret_ref = $parser->parse_file(\*TestXMLDeduplication::DATA);
close(TestXMLDeduplication::DATA);

print "\n\nDuplicates skipped: ", $ret_ref->{skipped}, "\n";
print "Duplicates cut: ", $ret_ref->{cut}, "\n";

package TestXMLDeduplication;
use base qw(XML::SAX::Base);

my $inUserinterface;
my $inUpath;
my $upathSeen;
my $defaultOut;
my $currentOut;
my $buffer;
my %seen;
my %ret;

sub new {
    # Ideally STDOUT would be an argument
    my $type = shift;
    #open $defaultOut, '>&', STDOUT or die "Opening STDOUT failed: $!";
    $defaultOut = *STDOUT;
    $currentOut = $defaultOut;
    return bless {}, $type;
}

sub start_document {
    %ret = ();
    $inUserinterface = 0;
    $inUpath = 0;
    $upathSeen = 0;
}

sub end_document {
    return \%ret;
}

sub start_element {
    my ($self, $element) = @_;

    if ('userinterface' eq $element->{Name}) {
      $inUserinterface++;
      %seen = ();
    }
    if ('upath' eq $element->{Name}) {
      $buffer = q{};
      undef $currentOut;
      open($currentOut, '>>', \$buffer) or die "Opening buffer failed: $!";
      $inUpath++;
    }

    print $currentOut '<', $element->{Name};
    print $currentOut attributes($element->{Attributes});
    print $currentOut '>';
}

sub end_element {
    my ($self, $element) = @_;

    print $currentOut '</', $element->{Name};
    print $currentOut '>';

    if ('userinterface' eq $element->{Name}) {
      $inUserinterface--;
    }

    if ('upath' eq $element->{Name}) {
      close($currentOut);
      $currentOut = $defaultOut;
      # Check if what's in upath was seen (lower-cased)
      if ($inUserinterface && $inUpath) {
        if (!exists $seen{lc($buffer)}) {
          print $currentOut $buffer;
        } else {
          $ret{skipped}++;
          $ret{cut} .= $buffer;
        }
        $seen{lc($buffer)} = 1;
      }
      $inUpath--;
    }
}

sub characters {
    # Note that this also captures indentation and newlines between tags etc.
    my ($self, $characters) = @_;

    print $currentOut $characters->{Data};
}

sub attributes {
    my ($attributesRef) = @_;
    my %attributes = %$attributesRef;

    foreach my $a (values %attributes) {
      my $v = $a->{Value};
      # See also XML::Quote
      $v =~ s/&/&amp;/g;
      $v =~ s/</&lt;/g;
      $v =~ s/>/&gt;/g;
      $v =~ s/"/&quot;/g;
      print $currentOut ' ', $a->{Name}, '="', $v, '"';
    }
}

__DATA__
  <package>
    <id>1523456789</id>
    <models>
      <model type="A">
        <start>2016-04-20</start>   
        <end>2017-04-20</end>    
      </model>
      <model type="B">                 
        <start>2016-04-20</start>     
        <end>2017-04-20</end>        
      </model>
    </models>
    <userinterface>
      <upath>/Example/Dir/Here</upath>
      <upath>/Example/Dir/Here2</upath>
      <upath>/example/dir/here</upath>   
    </userinterface>
    <userinterface>
      <upath>/Example/Dir/<b>Here</b></upath> <upath>/Example/Dir/Here2</upath>
      <upath>/example/dir/<b>here</b></upath>   
    </userinterface>
  </package>

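This program no longer works line by line. Instead it looks for upath tags inside userinterface tags and removes them if they are duplicates within their parent group. Surrounding indentation and newlines are preserved. If there is a upath tag inside a upath tag, things could get a bit strange.

It looks like this: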

$ perl saxEG.pl
<package>
    <id>1523456789</id>
    <models>
      <model type="A">
        <start>2016-04-20</start>
        <end>2017-04-20</end>
      </model>
      <model type="B">
        <start>2016-04-20</start>
        <end>2017-04-20</end>
      </model>
    </models>
    <userinterface>
      <upath>/Example/Dir/Here</upath>
      <upath>/Example/Dir/Here2</upath>

    </userinterface>
    <userinterface>
      <upath>/Example/Dir/<b>Here</b></upath> <upath>/Example/Dir/Here2</upath>

    </userinterface>
  </package>
Duplicates skipped: 2
Duplicates cut: <upath>/example/dir/here</upath><upath>/example/dir/<b>here</b></upath>

- dlamblin

0

$ awk '!(/<upath>/ && seen[tolower($1)]++)' file
  <package>
    <id>1523456789</id>
    <models>
      <model type="A">
        <start>2016-04-20</start>      <------ Duplicate line to keep
        <end>2017-04-20</end>          <------ Duplicate line to keep
      </model>
      <model type="B">
        <start>2016-04-20</start>      <------ Duplicate line to keep
        <end>2017-04-20</end>          <------ Duplicate line to keep
      </model>
    </models>
    <userinterface>
      <upath>/Example/Dir/Here</upath>
      <upath>/Example/Dir/Here2</upath>
    </userinterface>
  </package>

- Ed Morton

1

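Hahaha, drive-by downvoter... heh. I don't know who it was, but I got you, buddy! This is the answer I was looking for. Yes, I could write a program for this, but in my case it's just a patch until the next release of my software. Manually editing these XML files every day to get them through my application is a pain. I need a one-liner to put in cron until I find the bug and fix it. Yes, I could use XMLStarlet and Perl XML to /program/ it, but why write a program when a one-liner does the job?! - dlite922

Bonus points if you can do this in place. I need to apply it to a directory of XML files without a for loop and temp files, e.g.: for afile in *.xml; do awk '...' $afile > to $afile.tmp && mv $afile.tmp $afile. It seems like a guru would come up with a better way to awk a file in place. - dlite922

With GNU awk, just add the "-i inplace" flag. Otherwise, a temp file is the way to go. Also consider find ... -print0 | xargs -0 instead of a for loop. - Ed Morton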
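For anyone without GNU awk, the temp-file route can be wrapped in a small function so the loop stays readable. This is a sketch under assumptions: the `dedupe_upath_inplace` name, the scratch directory and the sample files are all invented for the demo, and it relies on a `mktemp` that accepts a template argument (GNU and BSD versions do).

```shell
#!/bin/sh
# Hypothetical helper: rewrite one file "in place" through a temp file
# created next to it, replacing the original only if awk succeeded.
dedupe_upath_inplace() {
  file=$1
  tmp=$(mktemp "${file}.XXXXXX") || return 1
  if awk '!(/<upath>/ && seen[tolower($1)]++)' "$file" > "$tmp"; then
    mv "$tmp" "$file"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Demo in a scratch directory so no real files are touched.
workdir=$(mktemp -d) || exit 1
trap 'rm -rf "$workdir"' EXIT
printf '%s\n' '<upath>/Example/Dir/Here</upath>' \
              '<upath>/example/dir/here</upath>' > "$workdir/a.xml"
printf '%s\n' '<upath>/One</upath>' '<upath>/Two</upath>' > "$workdir/b.XML"

# Same case-insensitive extension glob as in the question.
for f in "$workdir"/*.[Xx][Mm][Ll]; do
  dedupe_upath_inplace "$f"
done

cat "$workdir/a.xml" "$workdir/b.XML"
```

One caveat: mktemp creates the temp file with owner-only permissions, and mv carries those over, so the rewritten file may end up with a different mode than the original.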

- Rany Albeg Wein · Accepted Answer

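The following script takes an XML file as its first argument, parses the XML tree with xmlstarlet (invoked as xml in the script), and uses an associative array (requires Bash 4) to store the unique <upath> node values.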

#!/bin/bash

input_file=$1
# XPath to retrieve <upath> node value.
xpath_upath_value='//package/userinterface/upath/text()'
# XPath to print XML tree excluding  <userinterface> part.
xpath_exclude_userinterface_tree='//package/*[not(self::userinterface)]'
# Associative array to help us remove duplicated <upath> node values.
declare -A arr

print_userinterface_no_dup() { 
    printf '%s\n' "<userinterface>"
    printf '<upath>%s</upath>\n' "${arr[@]}"
    printf '%s\n' "</userinterface>"
}

# Iterate over each <upath> node value, lower-case it and use it as a key in the associative array.
while read -r upath; do
    key="${upath,,}"
    # We can remove this 'if' statement and simply arr[$key]="$upath"
    # if it doesn't matter whether we remove <upath>foo</upath> or <upath>FOO</upath>
    if [[ ! "${arr[$key]}" ]]; then
        arr[$key]="$upath"
    fi
done < <(xml sel -t -m "$xpath_upath_value" -c \. -n "$input_file")

printf '%s\n' "<package>"

# Print XML tree excluding <userinterface> part.
xml sel -t -m "$xpath_exclude_userinterface_tree" -c \. "$input_file"

# Print <userinterface> tree without duplicates.
print_userinterface_no_dup

printf '%s\n' "</package>"

Test (the script is named sof):

$ ./sof xml_file
<package>
    <id>1523456789</id>
    <models>
      <model type="A">
        <start>2016-04-20</start>
        <end>2017-04-20</end>
      </model>
      <model type="B">                 
        <start>2016-04-20</start>
        <end>2017-04-20</end>
      </model>
    </models>
    <userinterface>
        <upath>/Example/Dir/Here2</upath>
        <upath>/Example/Dir/Here</upath>
    </userinterface>
</package>

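If my comments don't make the code clear enough for you, please ask and I'll answer and edit this solution accordingly.

My xmlstarlet version is 1.6.1, compiled against libxml2 2.9.2 and libxslt 1.1.28.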
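One thing visible in the test output above: Here2 is printed before Here, because Bash associative arrays are not iterated in insertion order. If the original order matters, one option is to swap the associative array for an awk filter, which keeps the first spelling of each value in document order. A rough sketch, where `upath_values` is a hard-coded stand-in for the xmlstarlet call so it runs without xmlstarlet installed:

```shell
#!/bin/sh
# Stand-in for the output of:
#   xml sel -t -m '//package/userinterface/upath/text()' -c . -n "$input_file"
# (one <upath> value per line; hard-coded here as an assumption).
upath_values() {
  printf '%s\n' '/Example/Dir/Here' '/Example/Dir/Here2' '/example/dir/here'
}

# Keep the first spelling of each value, in document order, ignoring case.
unique_values=$(upath_values | awk '!seen[tolower($0)]++')

print_userinterface_no_dup() {
  printf '%s\n' '<userinterface>'
  printf '%s\n' "$unique_values" | while IFS= read -r upath; do
    printf '  <upath>%s</upath>\n' "$upath"
  done
  printf '%s\n' '</userinterface>'
}

userinterface_block=$(print_userinterface_no_dup)
printf '%s\n' "$userinterface_block"
```

Like the script above, this does not re-escape characters such as & or < in the values, so it is only suitable for plain paths like the ones in the question.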