sed: replace spaces inside quotes with underscores

Question

4

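I have some input (for example, the output of ifconfig run0 scan on OpenBSD) in which the fields are separated by spaces, but some fields contain spaces themselves (luckily, such fields are always enclosed in quotes).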

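I need to distinguish the spaces inside quotes from the separator spaces. The idea is to replace the spaces inside quotes with underscores.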

Sample data:

%cat /tmp/ifconfig_scan | fgrep nwid | cut -f3
nwid Websense chan 6 bssid 00:22:7f:xx:xx:xx 59dB 54M short_preamble,short_slottime
nwid ZyXEL chan 8 bssid cc:5d:4e:xx:xx:xx 5dB 54M privacy,short_slottime
nwid "myTouch 4G Hotspot" chan 11 bssid d8:b3:77:xx:xx:xx 49dB 54M privacy,short_slottime

Since I have not yet replaced the spaces inside quotes with underscores, the result is not what I want:

%cat /tmp/ifconfig_scan | fgrep nwid | cut -f3 |\
    cut -s -d ' ' -f 2,4,6,7,8 | sort -n -k4
"myTouch Hotspot" 11 bssid d8:b3:77:xx:xx:xx
ZyXEL 8 cc:5d:4e:xx:xx:xx 5dB 54M
Websense 6 00:22:7f:xx:xx:xx 59dB 54M

- cnst

2

Try AWK; it may be a better fit for this than sed. https://dev59.com/N0_Ta4cB1Zd3GeqPDLMC - Ricardo Ortega Magaña

Yes, I think I will have to use awk. But I would still like the spaces inside quotes to be replaced with underscores in the final processing. - cnst

1

See the SUB section at the following link: http://www.staff.science.uu.nl/~oostr102/docs/nawk/nawk_92.html You can combine the two links I gave you to solve your problem. - Ricardo Ortega Magaña

@cnst: Perl is a better fit here than awk or sed, and it is also more extensible. - Steve

5 Answers

4

Try this:

awk -F'"' '{for(i=2;i<=NF;i++)if(i%2==0)gsub(" ","_",$i);}1' OFS="\"" file

It also works with multiple quoted sections on the same line:

echo '"first part" foo "2nd part" bar "the 3rd part comes" baz'| awk -F'"' '{for(i=2;i<=NF;i++)if(i%2==0)gsub(" ","_",$i);}1' OFS="\"" 
"first_part" foo "2nd_part" bar "the_3rd_part_comes" baz

Edit: an alternative form:

awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' file
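
To make the effect on the question's data concrete, here is a minimal sketch that runs the alternative form above in front of the cut and sort steps. The function names underscore_quoted and scan_sample are made up for illustration, and the here-document stands in for /tmp/ifconfig_scan after the fgrep and cut -f3 steps; none of this is part of the original answer.

```sh
# Hypothetical helper: replace spaces inside double-quoted sections with "_".
# Splitting on '"' puts the quoted text into the even-numbered fields.
underscore_quoted() {
    awk 'BEGIN { FS = OFS = "\"" } { for (i = 2; i < NF; i += 2) gsub(" ", "_", $i) } 1'
}

# Sample lines from the question (already reduced to the third tab field).
scan_sample() {
    cat <<'EOF'
nwid Websense chan 6 bssid 00:22:7f:xx:xx:xx 59dB 54M short_preamble,short_slottime
nwid ZyXEL chan 8 bssid cc:5d:4e:xx:xx:xx 5dB 54M privacy,short_slottime
nwid "myTouch 4G Hotspot" chan 11 bssid d8:b3:77:xx:xx:xx 49dB 54M privacy,short_slottime
EOF
}

# Same cut/sort stages as in the question, now fed with normalized names.
scan_sample | underscore_quoted | cut -s -d ' ' -f 2,4,6,7,8 | sort -n -k4
```

With the quoted name collapsed into a single field, every line has the same column layout, so the fourth output column is the signal strength and the lines come out ordered 5dB, 49dB, 59dB.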

- Kent

Oh, it does not work in my tcsh:

cat /tmp/ifconfig_scan | fgrep nwid | cut -f3 | awk -F'"' '{for(i=2;i<=NF;i++)if(i%2==0)gsub(" ","_",$i);}1' OFS="\"" | cut -s -d ' ' -f 2,4,6,7,8 | sort -n -k4

It returns: Unmatched ". - cnst

OK, this works fine in tcsh (I only changed some double quotes to single quotes):

cat /tmp/ifconfig_scan | fgrep nwid | cut -f3 | awk -F'"' '{for(i=2;i<=NF;i++)if(i%2==0)gsub(" ","_",$i);}1' OFS='"' | cut -s -d ' ' -f 2,4,6,7,8 | sort -n -k4

- cnst
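
The Unmatched ". error appears to come from csh-style quoting: in csh and tcsh a backslash does not normally escape a double quote inside a double-quoted string, so OFS="\"" is parsed differently than in Bourne-style shells. A sketch that sidesteps the problem by keeping every double quote inside single quotes follows. The function name underscore_quoted_portable is made up, and the function wrapper itself is Bourne-shell syntax used only to keep the example self-contained; the awk command line inside it is the part that should carry over to tcsh.

```sh
# Hypothetical helper: no backslash-escaped double quotes anywhere,
# so the awk command line does not depend on how the shell treats \" .
underscore_quoted_portable() {
    awk -F'"' -v OFS='"' '{ for (i = 2; i < NF; i += 2) gsub(" ", "_", $i) } 1'
}

printf '%s\n' 'nwid "myTouch 4G Hotspot" chan 11' | underscore_quoted_portable
```

This should print nwid "myTouch_4G_Hotspot" chan 11.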

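I don't agree with using awk/sed for this task, but that doesn't mean it can't be done. If you are going to use awk, you can drop the if statement. Just use i+=2 and i<NF. - Steve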

1

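Your approach is good, but change i++ to i+=2, and remove the if(i%2==0) and the superfluous semicolon after gsub(). Also, if you want FS and OFS to have the same value, assign them both in the BEGIN section, like this: BEGIN{FS=OFS="\""}. - Ed Morton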

4

Another awk command worth trying:

awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\"

To drop the quotes:

awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=
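
How these one-liners work may not be obvious at first glance: with RS set to a double quote, awk reads the text between quotes as records, so the even-numbered records are the quoted parts, and gsub(FS,"_") uses the default FS (a single space) as the pattern to replace. The sketch below makes the records visible; the function names show_records and underscore_quoted_rs and the sample input are made up for illustration. One caveat that seems worth checking on your own awk: ORS is also written after the last record, so the variant that keeps the quotes likely leaves one stray " at the very end of the output.

```sh
# Print each record with its number to show how RS='"' splits the input.
show_records() {
    awk '{ printf "%d:[%s]\n", NR, $0 }' RS=\"
}

# The one-liner from above, wrapped in a (hypothetical) function.
underscore_quoted_rs() {
    awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\"
}

# No trailing newline here, so the last record prints cleanly.
printf '%s' 'a "b c" d "e f" g' | show_records

# Only the first output line is shown, to ignore the trailing ORS.
printf '%s\n' 'a "b c" d "e f" g' | underscore_quoted_rs | head -n 1
```

The first command should list five records, with b c and e f as records 2 and 4; the second should print a "b_c" d "e_f" g.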

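Following @steve's earlier tests, I ran further tests on a test file three times the size. I modified the sed statement slightly so that non-GNU sed can handle it as well. I included awk (bwk), gawk3, gawk4 and mawk: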

$ for i in {1..1500000}; do echo 'a b "c d e" f g "h i" j k l "m n o "p q r" s t" u v "w x" y z' ; done > test
$ time perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge' test >/dev/null

real    0m27.802s
user    0m27.588s
sys 0m0.177s
$ time awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null

real    0m6.565s
user    0m6.500s
sys 0m0.059s
$ time gawk3 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null

real    0m21.486s
user    0m18.326s
sys 0m2.658s
$ time gawk4 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null

real    0m14.270s
user    0m14.173s
sys 0m0.083s
$ time mawk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null

real    0m4.251s
user    0m4.193s
sys 0m0.053s
$ time awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null

real    0m13.229s
user    0m13.141s
sys 0m0.075s
$ time gawk3 '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null

real    0m33.965s
user    0m26.822s
sys 0m7.108s
$ time gawk4 '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null

real    0m15.437s
user    0m15.328s
sys 0m0.087s
$ time mawk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null

real    0m4.002s
user    0m3.948s
sys 0m0.051s
$ time sed -e :a -e 's/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' test > /dev/null

real    5m14.008s
user    5m13.082s
sys 0m0.580s
$ time gsed -e :a -e 's/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' test > /dev/null

real    4m11.026s
user    4m10.318s
sys 0m0.463s

mawk gives the fastest results...

- Scrutinizer

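Nice! Both work well, and they seem to be the shortest solutions among the answers, even shorter than @Steve's shortest perl code (though less readable). I need to drop sed and learn awk! - cnst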

Added some extra tests, building on @steve's tests. - Scrutinizer

2

You would be better off using perl. The code is more readable and easier to maintain:

perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge'

With your input, the result is:

a b "c_d_e" f g "h_i"

Explanation:

-p            # enable printing
-e            # the following expression...

s             # begin a substitution

:             # the first substitution delimiter

"[^"]*"      # match a double quote followed by anything not a double quote any
              # number of times followed by a double quote

:             # the second substitution delimiter

($x=$&)=~s/ /_/g;      # copy the pattern match ($&) into a variable ($x), then
                       # replace every space in $x with an underscore. The
                       # variable $x is needed because the match variable $&
                       # and capture groups are read-only.

$x            # return $x as the replacement.

:             # the last delimiter

g             # apply the outer substitution globally (to every quoted string)
e             # evaluate the replacement as a Perl expression

Some tests:

for i in {1..500000}; do echo 'a b "c d e" f g "h i" j k l "m n o "p q r" s t" u v "w x" y z' >> test; done

time perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge' test >/dev/null

real    0m8.301s
user    0m8.273s
sys     0m0.020s

time awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null

real    0m4.967s
user    0m4.924s
sys     0m0.036s

time awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null

real    0m4.336s
user    0m4.244s
sys     0m0.056s

time sed ':a;s/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' test >/dev/null

real    2m26.101s
user    2m25.925s
sys     0m0.100s

- Steve

1

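Sorry, but I have to disagree that this code is more readable than anything else. Obviously others will see it differently, but to me it is completely incomprehensible, and I honestly tried. Would you mind adding an explanation of what it does? - Ed Morton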

1

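@EdMorton: No problem, glad to help. 1) =~ just means "run this regex match against this variable". 2) Perl's e flag is like sed's e flag. In the parent substitution, the replacement value is a second (child) substitution. Perl doesn't expect that by default, hence the e flag. - Steve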

1

@EdMorton: 3) It means: make $x the replacement of the parent substitution. - Steve

1

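@EdMorton: In case you're interested, I've posted some interesting timing data. I still prefer perl, because I think it describes what is actually happening better. But in a time-critical pipeline, after seeing these results, I would sacrifice readability and choose awk. - Steve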

1

OK, I compiled mawk, gawk3, gawk4 and GNU sed, added them to my system, and ran some further tests. I added the results to the end of my post. - Scrutinizer

1

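This is not really an answer; I'm posting the awk equivalent of @steve's perl code to help anyone who is interested in it (and to help me remember it in the future):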

@steve posted the following:

perl -pe 's:"[^\"]*":($x=$&)=~s/ /_/g;$x:ge'

From reading @steve's explanation, the most direct awk equivalent of that perl code (not the preferred awk solution; see @Kent's answer for that) would be, using GNU awk:

gawk '{
   head = ""
   while ( match($0,"\"[^\"]*\"") ) {
      head = head substr($0,1,RSTART-1) gensub(/ /,"_","g",substr($0,RSTART,RLENGTH))
      $0 = substr($0,RSTART+RLENGTH)
   }
   print head $0
}'

We get there by starting with a POSIX awk solution that uses more variables:

awk '{
   head = ""
   tail = $0
   while ( match(tail,"\"[^\"]*\"") ) {
      x = substr(tail,RSTART,RLENGTH)
      gsub(/ /,"_",x)
      head = head substr(tail,1,RSTART-1) x
      tail = substr(tail,RSTART+RLENGTH)
   }
   print head tail
}'

Then use GNU awk's gensub() function to save a line:

gawk '{
   head = ""
   tail = $0
   while ( match(tail,"\"[^\"]*\"") ) {
      x = gensub(/ /,"_","g",substr(tail,RSTART,RLENGTH))
      head = head substr(tail,1,RSTART-1) x
      tail = substr(tail,RSTART+RLENGTH)
   }
   print head tail
}'

Then get rid of the variable x:

gawk '{
   head = ""
   tail = $0
   while ( match(tail,"\"[^\"]*\"") ) {
      head = head substr(tail,1,RSTART-1) gensub(/ /,"_","g",substr(tail,RSTART,RLENGTH))
      tail = substr(tail,RSTART+RLENGTH)
   }
   print head tail
}'

If you don't need $0, NF, etc. after the loop, you can also get rid of the variable "tail":

gawk '{
   head = ""
   while ( match($0,"\"[^\"]*\"") ) {
      head = head substr($0,1,RSTART-1) gensub(/ /,"_","g",substr($0,RSTART,RLENGTH))
      $0 = substr($0,RSTART+RLENGTH)
   }
   print head $0
}'

- Ed Morton

- Joseph Quinsey · Accepted Answer

5

If you want to do it with sed only (though I don't recommend it), try the following:

echo 'a b "c d e" f g "h i"' |\
sed ':a;s/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta'
a b "c_d_e" f g "h_i"

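Translation:

Start at the beginning of the line.
Look for zero or more repetitions of the pattern junk"junk", where junk contains no quotes, followed by junk"junk space.
Replace that final space with _.
If successful, jump back to the beginning.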
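
The steps above can be traced by running the substitution once, without the loop. Because the pattern is greedy, each pass converts the last remaining space that sits inside a quoted section, so the replacement proceeds from right to left. A small sketch follows; the function names one_pass and all_passes are made up for illustration, and the -e :a -e '...' spelling is used so that non-GNU sed accepts the label as well.

```sh
# One application of the substitution, without the :a ... ta loop.
one_pass() {
    sed 's/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/'
}

# The full loop: repeat until no space is left inside quotes.
all_passes() {
    sed -e :a -e 's/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta'
}

sample='a b "c d e" f g "h i"'
printf '%s\n' "$sample" | one_pass               # a b "c d e" f g "h_i"
printf '%s\n' "$sample" | one_pass | one_pass    # a b "c d_e" f g "h_i"
printf '%s\n' "$sample" | all_passes             # a b "c_d_e" f g "h_i"
```

The spaces outside the quotes are never touched, because a match has to end just after an odd number of quotes.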

- Joseph Quinsey

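Great, it actually works! :-) Even with the old sed on OpenBSD 4.6, which still has no "-E" option! But why do the parentheses have to be escaped? (I tried using "(" instead of "\(", and it stopped working.) Also, why doesn't the second "[]" have to include a space, i.e. why is it "[^"]*" rather than "[^" ]*"? How does it know not to be greedy? Apart from that, the regular expression itself is very clear! :) So ":a" is the label "a", and "ta" is a jump to "a"? Does jumping mean rewinding over the line to which the search/replace is applied? Ingenious! I'll have to add this to my arsenal. :-) - cnst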

@cnst The substitution works backwards. To see each step (GNU sed), put an l0 command after the substitution command, e.g.: :a;s/.../.../;l0;ta - potong

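@potong, the other option you suggested is nice. l0 doesn't work in my sed, but plain l, e.g. ;l;ta, seems to work very well, and it does show that the match is greedy and works backwards. Would it be better to avoid spaces in that case as well? - cnst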