Hashing with sha1sum and awk

Question

4

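I have a pipe-delimited file with about 20 columns. I want to hash only the first column (an account number, say) with sha1sum and return the remaining columns unchanged.

What is the best way to do this with awk or sed?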

Accountid|Time|Category|.....
8238438|20140101021301|sub1|...
3432323|20140101041903|sub2|...
9342342|20140101050303|sub1|...

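Above is an example of the text file showing just 3 columns. Only the first column should have the hash function applied. The result should look like this: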

Accountid|Time|Category|.....
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...

- user1189851


2 Answers

2

Here is an executable awk script that does what you want:

#!/usr/bin/awk -f

BEGIN { FS=OFS="|" }

FNR != 1 { $1 = encodeData( $1 ) }

47

function encodeData( fld ) {
    cmd = sprintf( "echo %s | sha1sum", fld )
    cmd | getline output
    close( cmd )
    split( output, arr, " " )
    return arr[1]
    }

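Here is a breakdown of how it works:

- Set the input and output field separators to |
- When the line is not the first line (the header), reassign $1 to the encoded value
- When 47 is true (always), print the whole line

Here is a breakdown of the encodeData function:

- Build a cmd that feeds the data to sha1sum
- Pipe it into getline
- Close the cmd
- On my system sha1sum emits extra information after the hash, so I discard it with split
- Return the first field of the sha1sum output

With your data, I get the following: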

Accountid|Time|Category|.....
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...

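Run it by invoking awk.script data (or ./awk.script data if you are using bash).

Edit by EdMorton: Sorry for editing your script, but the script above is the right approach and only needs a few tweaks to be more robust, and making them was much easier than trying to describe them in a comment: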

$ cat tst.awk
BEGIN { FS=OFS="|" }

NR==1 { for (i=1; i<=NF; i++) f[$i] = i; next }
{ $(f["Accountid"]) = encodeData($(f["Accountid"])); print }

function encodeData( fld,       cmd, output ) {
    cmd = "echo \047" fld "\047 | sha1sum"
    if ( (cmd | getline output) > 0 ) {
        sub(/ .*/,"",output)
    }
    else {
        print "failed to hash " fld | "cat>&2"
        output = fld
    }
    close( cmd )
    return output
}
$ awk -f tst.awk file
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...

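The f[] array decouples your script from hard-coding the number of the field to be hashed, the additional function arguments make those variables local so they start out empty on every call, the if around getline means that a failure won't return the value from a previous successful call (see http://awk.info/?tip/getline), and the rest is probably more a matter of style/preference plus a slight performance improvement.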
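The point about checking getline can be illustrated with a small sketch. It assumes a POSIX shell and awk; the function name stale_getline_demo is hypothetical and used only for this illustration. When the command produces no output, getline returns 0 and leaves the variable untouched, so an unchecked call would silently reuse the previous value.

```shell
#!/bin/sh
# Sketch: an unchecked getline keeps the previous value when nothing is read.
# stale_getline_demo is a hypothetical name used only for this illustration.
stale_getline_demo() {
  awk 'BEGIN {
    cmd = "echo first"
    cmd | getline out            # out now holds "first"
    close(cmd)

    rc = ("true" | getline out)  # "true" prints nothing, so getline returns 0
    close("true")
    print rc, out                # out still holds the earlier value
  }'
}

stale_getline_demo
```

Testing the return value, as the edited script does with > 0, is what lets the caller notice that no hash came back.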

- n0741337

1

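Your script was the right approach and very close, but it needed a few changes to be more robust, so I edited your answer to show them rather than posting my own. Hope you don't mind. - Ed Morton

@EdMorton Try it with foo"bar. And remember there's also foo'bar, so just using single quotes doesn't solve it. Then there are special characters such as dollar signs. - Wintermute

@EdMorton Works on my machine. Why do you think I did it that way? :P Anyway, now the only thing left to do is gsub(/'/, "'\\''", fld). - Wintermute

@EdMorton - I don't mind at all, that is how SO is built. I agree with most of the changes to encodeData and admit to taking shortcuts to get an answer out before lunch. However, I find the f array a bit heavy-handed, since a column's header could change as well as its position (as a later support issue). Either way, the field reassignment could be done inside the function to cut down on repetition when calling encodeData. - n0741337

The f[] array is probably overkill when you're only operating on the first field, since a key field is unlikely to move to another position. When you need to operate on multiple fields, though, it saves endless rework: I find that columns get added to CSVs, which shifts the field numbers while the header names stay the same, so I always try to do it this way. And yes, doing the reassignment inside the function makes sense. - Ed Morton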
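The quoting concern raised in these comments could be handled along the following lines. This is a minimal sketch, assuming a POSIX shell, awk, and sha1sum from GNU coreutils; hash_first_field and shquote are hypothetical names, and printf '%s\n' is swapped in for echo so that backslashes in the value are not interpreted. \047 is the octal escape for a single quote, which keeps the awk program free of literal single quotes.

```shell
#!/bin/sh
# Sketch: hash field 1 of a pipe-delimited file, quoting the value for the shell.
# hash_first_field and shquote are hypothetical names, not part of awk.
hash_first_field() {
  awk '
    BEGIN { FS = OFS = "|" }

    # Wrap s in single quotes; each embedded quote becomes quote, backslash-quote, quote.
    function shquote(s) {
        gsub(/\047/, "\047\\\047\047", s)
        return "\047" s "\047"
    }

    # Return the SHA-1 of s plus a trailing newline (what echo would have sent).
    function sha1(s,    cmd, out) {
        cmd = "printf \047%s\\n\047 " shquote(s) " | sha1sum"
        if ((cmd | getline out) > 0)
            sub(/ .*/, "", out)   # keep only the hex digest
        else
            out = s               # fall back to the unhashed value
        close(cmd)
        return out
    }

    NR > 1 { $1 = sha1($1) }      # leave the header line alone
    { print }
  ' "$@"
}
```

Called as hash_first_field file, it should cope with values such as foo'bar, foo"bar, and $HOME, because nothing inside single quotes is expanded by the shell.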
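The multi-field case described here might look like the sketch below. It assumes a POSIX shell, awk, and GNU sha1sum; hash_columns and the cols variable are hypothetical names introduced for illustration, and, unlike the edited script above, this version prints the header line unchanged.

```shell
#!/bin/sh
# Sketch: hash every column whose header name appears in a comma-separated list.
# hash_columns and cols are hypothetical names used only for illustration.
hash_columns() {
  cols=$1; shift
  awk -v cols="$cols" '
    BEGIN { FS = OFS = "|"; n = split(cols, want, ",") }

    # Return the SHA-1 of s plus a trailing newline; q is the shell-quoted copy.
    function sha1(s,    q, cmd, out) {
        q = s
        gsub(/\047/, "\047\\\047\047", q)
        cmd = "printf \047%s\\n\047 \047" q "\047 | sha1sum"
        if ((cmd | getline out) > 0) sub(/ .*/, "", out)
        else out = s
        close(cmd)
        return out
    }

    # Header line: map each column name to its position, then print it as is.
    NR == 1 { for (i = 1; i <= NF; i++) f[$i] = i; print; next }

    {
        for (k = 1; k <= n; k++)
            if (want[k] in f) $(f[want[k]]) = sha1($(f[want[k]]))
        print
    }
  ' "$@"
}
```

For example, hash_columns Accountid,Category file would hash those two columns wherever they sit in the header, so adding a column later does not require renumbering fields.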


- Wintermute · Accepted Answer

"Best Way™" is up for debate. One way to do it with awk is:

awk -F'|' 'BEGIN { OFS=FS } NR == 1 { print } NR != 1 { gsub(/'\''/, "'\'\\\\\'\''", $1); command = ("echo '\''" $1 "'\'' | sha1sum -b | cut -d\\  -f 1"); command | getline hash; close(command); $1 = hash; print }' filename

That is:

BEGIN {
  OFS = FS          # set output field separator to field separator; we will use
                    # it because we meddle with the fields.
}
NR == 1 {           # first line: just print headers.
  print
}
NR != 1 {           # from there on do the hash/replace
  # this constructs a shell command (and runs it) that echoes the field
  # (singly-quoted to prevent surprises) through sha1sum -b, cuts out the hash
  # and gets it back into awk with getline (into the variable hash)
  # the gsub bit is to prevent the shell from barfing if there's an apostrophe
  # in one of the fields.
  gsub(/'/, "'\\''", $1);
  command = ("echo '" $1 "' | sha1sum -b | cut -d\\  -f 1")
  command | getline hash
  close(command)

  # then replace the field and print the result.
  $1 = hash
  print
}

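You'll notice differences between the shell command at the top and the awk code at the bottom; they are entirely due to shell expansion. Because I wrapped the awk code in single quotes in the shell command (double quotes are out of the question there, what with $1 and so on), and because the code itself contains single quotes, making it work inline leads to a nightmare of backslashes. My advice is therefore to put the awk code into a file, say foo.awk, and run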

awk -F'|' -f foo.awk filename

instead.