PowerShell - Convert a log file to CSV

3

I have log files like this...

2009-12-18T08:25:22.983Z     1         174 dns:0-apr-credit-cards-uk.pedez.co.uk P http://0-apr-credit-cards-uk.pedez.co.uk/ text/dns #170 20091218082522021+89 sha1:AIDBQOKOYI7OPLVSWEBTIAFVV7SRMLMF - -
2009-12-18T08:25:22.984Z     1           5 dns:0-60racing.co.uk P http://0-60racing.co.uk/ text/dns #116 20091218082522037+52 sha1:WMII7OOKYQ42G6XPITMHJSMLQFLGCGMG - -
2009-12-18T08:25:23.066Z     1          79 dns:0-addiction.metapress.com.wam.leeds.ac.uk P http://0-addiction.metapress.com.wam.leeds.ac.uk/ text/dns #042 20091218082522076+20 sha1:NSUQN6TBIECAP5VG6TZJ5AVY34ANIC7R - -
...plus millions of other records

I need to convert these files to CSV format...

"2009-12-18T08:25:22.983Z","1","174","dns:0-apr-credit-cards-uk.pedez.co.uk","P","http://0-apr-credit-cards-uk.pedez.co.uk/","text/dns","#170","20091218082522021+89","sha1:AIDBQOKOYI7OPLVSWEBTIAFVV7SRMLMF","-","-"
"2009-12-18T08:25:22.984Z","1","5","dns:0-60racing.co.uk","P","http://0-60racing.co.uk/","text/dns","#116","20091218082522037+52","sha1:WMII7OOKYQ42G6XPITMHJSMLQFLGCGMG","-","-"
"2009-12-18T08:25:23.066Z","1","79","dns:0-addiction.metapress.com.wam.leeds.ac.uk","P","http://0-addiction.metapress.com.wam.leeds.ac.uk/","text/dns","#042","20091218082522076+20","sha1:NSUQN6TBIECAP5VG6TZJ5AVY34ANIC7R","-","-"

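The field separator can be one or more space characters, and the file mixes fixed-width and variable-width fields. This tends to confuse most of the CSV parsers I have found.

Ultimately I want to bcp these files into SQL Server, but bcp only lets you specify a single character as the field delimiter (i.e. ' '), which breaks on the fixed-width fields.

So far I am using PowerShell: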

gc -ReadCount 10 -TotalCount 200 .\crawl_sample.log | foreach { ([regex]'([\S]*)\s+').matches($_) } | foreach {$_.Groups[1].Value}

This returns a stream of fields:

2009-12-18T08:25:22.983Z
1
174
dns:0-apr-credit-cards-uk.pedez.co.uk
P
http://0-apr-credit-cards-uk.pedez.co.uk/
text/dns
#170
20091218082522021+89
sha1:AIDBQOKOYI7OPLVSWEBTIAFVV7SRMLMF
-
-
2009-12-18T08:25:22.984Z
1
5
dns:0-60racing.co.uk
P
http://0-60racing.co.uk/
text/dns
#116
20091218082522037+52
sha1:WMII7OOKYQ42G6XPITMHJSMLQFLGCGMG
-

But how do I turn that output into CSV format?

You may want to look at my FOSS CSV processing tool, http://code.google.com/p/csvfix. I think it can do what you want, but only as a multi-stage process. - anon
1 Answer

4
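Answering my own question...
measure-command {
    # " +" matches runs of spaces only, so line breaks are left alone
    $q = [regex]" +"
    # Join all lines into one string, replace each run of spaces with a comma, write the result out
    $q.Replace( ([string]::join([environment]::newline, (Get-Content -ReadCount 1 .\crawl_sample2.log))), "," ) > crawl_sample2.csv
}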

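And it's fast!

Observations:

  • I had been using \s+ as the regex separator, and that was swallowing the newlines; the script above uses " +" instead
  • Get-Content -ReadCount 1 streams single-line arrays into the regex
  • The resulting string is then redirected to a new file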
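
To illustrate the first observation, here is a minimal sketch comparing the two patterns on a made-up two-line string (the sample text and variable names are illustrative only, not taken from the real log):

```powershell
# Two sample records joined with a newline, the way [string]::Join would produce them
$text = "a   b  c" + [Environment]::NewLine + "d e    f"

# ' +' collapses runs of spaces only, so the line break survives
$spacesOnly = [regex]::Replace($text, ' +', ',')

# '\s+' also matches the CR/LF characters, so both records collapse into one line
$anyWhitespace = [regex]::Replace($text, '\s+', ',')

$spacesOnly
$anyWhitespace
```

With `\s+` the record boundary is lost, which is why the separator has to be restricted to spaces once the lines have been joined into a single string.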

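Update

This script works, but it uses a huge amount of RAM on large files. So how can I do the same thing without using 8 GB of RAM and swap?

I think this is caused by the join buffering all the data again... Any ideas?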
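
One possible way around the buffering, as a sketch rather than a measured solution: process the file line by line through .NET's `File.ReadLines` and a `StreamWriter`, so that only one line is held at a time. The function name `Convert-SpacesToCommas` and its parameters are hypothetical, and this assumes PowerShell 3.0 or later (for .NET 4's `ReadLines`).

```powershell
# Hypothetical helper: converts a log file one line at a time,
# so the whole file is never joined into a single string.
function Convert-SpacesToCommas {
    param(
        [string]$InputPath,   # full path to the space-delimited log
        [string]$OutputPath   # full path of the file to create
    )
    $writer = New-Object System.IO.StreamWriter $OutputPath
    try {
        # File.ReadLines enumerates lazily instead of loading every line up front
        foreach ($line in [System.IO.File]::ReadLines($InputPath)) {
            $writer.WriteLine(($line -replace ' +', ','))
        }
    }
    finally {
        # Flush and close the output even if a read fails
        $writer.Dispose()
    }
}
```

Note that .NET methods resolve relative paths against the process working directory, not the current PowerShell location, so passing full paths (for example via `Resolve-Path`) is the safer choice.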

Update 2

OK, found a better solution...

Get-Content -readcount 100 -totalcount 100000 .\crawl.log | 
    ForEach-Object { $_ } |
       foreach { $_ -replace " +", "," } > .\crawl.csv

A very handy PowerShell guide: PowerShell regular expressions
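
The scripts above emit unquoted, comma-separated fields, whereas the target format in the question wraps every field in double quotes. A minimal sketch of one way to get the quoted form follows; the helper name `ConvertTo-QuotedCsvLine` is hypothetical, and it assumes fields never contain spaces themselves.

```powershell
# Hypothetical helper: turns one space-delimited log line into a quoted CSV record
function ConvertTo-QuotedCsvLine {
    param([string]$Line)
    # Split on runs of spaces; drop empty strings caused by leading or trailing spaces
    $fields = @($Line -split ' +' | Where-Object { $_ -ne '' })
    # Double any embedded quotes, then wrap each field in quotes
    $quoted = $fields | ForEach-Object { '"' + ($_ -replace '"', '""') + '"' }
    $quoted -join ','
}

# Possible usage, streaming in batches as in Update 2 (file names from the post):
# Get-Content -ReadCount 100 .\crawl.log |
#     ForEach-Object { $_ | ForEach-Object { ConvertTo-QuotedCsvLine $_ } } > .\crawl.csv
```

Calling a function per line will likely be slower than a bare `-replace`, so this is a trade of speed for the exact output format.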


1
Any better solutions or improvements to the script are welcome! - Guy
1
You can simplify this by dropping the intermediate ForEach-Object, because -replace operates on string arrays, e.g. 'a b','c d','e f' -replace ' +',','. Try this: gc crawl.log -read 100 -total 100000 | %{$_ -replace ' +',','} > crawl.csv - Keith Hill
Given how -replace works, it can be even simpler: (gc crawl.log ...) -replace ' +', ',' > crawl.csv (see my article on chaining operators, http://www.leporelo.eu/blog.aspx?id=powershell-tips-and-tricks-3-chain-of-operators ) - stej
