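What's the best (most efficient) way to parse a tab-delimited file in Ruby?
The Ruby CSV library lets you specify the field delimiter. Ruby 1.9 uses FasterCSV. You could do something like this: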
require "csv"
parsed_file = CSV.read("path-to-file.csv", col_sep: "\t")
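Since the question asks about efficiency, here is a minimal sketch of the streaming variant, assuming the file has a header row. CSV.foreach reads one row at a time rather than loading the whole file as CSV.read does. The sample data and the column names (name, city) are made up for illustration.

```ruby
require "csv"
require "tempfile"

# Hypothetical sample data; the column names are invented for this sketch.
sample = "name\tcity\nAda\tLondon\nLinus\tHelsinki\n"

rows = []
Tempfile.create(["sample", ".tsv"]) do |file|
  file.write(sample)
  file.flush

  # CSV.foreach yields one row at a time instead of reading the whole file.
  # headers: true turns each row into a CSV::Row keyed by the header line.
  CSV.foreach(file.path, col_sep: "\t", headers: true) do |row|
    rows << row.to_h
  end
end

rows.each { |row| puts "#{row['name']} lives in #{row['city']}" }
```

For small files CSV.read is simpler; the streaming form mainly helps when the file does not fit comfortably in memory.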
require 'csv'
line = 'boogie\ttime\tis "now"'
begin
  line = CSV.parse_line(line, col_sep: "\t")
  puts "parsed correctly"
rescue CSV::MalformedCSVError
  puts "failed to parse line"
end

begin
  line = CSV.parse_line(line, col_sep: "\t", quote_char: "Ƃ")
  puts "parsed correctly with random quote char"
rescue CSV::MalformedCSVError
  puts "failed to parse line with random quote char"
end
#Output:
# failed to parse line
# parsed correctly with random quote char
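If you want to use the CSV library, you can use a random quote character that you don't expect to see in your file (the example shows this), but you could also use a simpler approach, such as the StrictTsv class shown below, to get the same effect without having to worry about field quoting.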
# The main parse method is mostly borrowed from a tweet by @JEG2
class StrictTsv
  attr_reader :filepath

  def initialize(filepath)
    @filepath = filepath
  end

  # Yields each data row as a Hash keyed by the header fields
  def parse
    File.open(filepath) do |f|
      headers = f.gets.strip.split("\t")
      f.each do |line|
        fields = Hash[headers.zip(line.split("\t"))]
        yield fields
      end
    end
  end
end

# Example Usage
tsv = StrictTsv.new("your_file.tsv")
tsv.parse do |row|
  puts row['named field']
end
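Whether to use the CSV library or the stricter format depends on who is sending you the file and whether they intend to adhere to the strict TSV standard.
Details about the TSV standard can be found at http://en.wikipedia.org/wiki/Tab-separated_values.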
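\d performs so poorly in the CSV parser. - a2f0
line = 'boogie\ttime\tis "now"' results in a string with double-escaped tabs, so I thought the failure might be caused by that, but actually my test was written incorrectly. To get the intended test string, use line = "boogie\ttime\tis \"now\"" or "boogie\ttime\tis " + '"now"'. You can check with puts: the former prints boogie\ttime\tis "now", while the latter two print boogie	time	is "now" (the tabs are hard to show here, but they will appear in your console). Thanks for the thorough answer. - Aaron
CSV.parse("foo,bar,and \"baz\" quotes") and CSV.parse("foo\tbar\tand \"baz\" quotes", col_sep: "\t"). It looks like quotes are only valid when they surround the entire column content, which lets you include the column separator character. The following two parse fine: CSV.parse("foo\tbar\t\"and baz\tquotes\"", col_sep: "\t") and CSV.parse("foo,bar,\"and baz,quotes\""). - Aaron
There are actually two different kinds of TSV files.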
TSV files that are actually CSV files with the delimiter set to Tab. This is something you'll get when you e.g. save an Excel spreadsheet as "UTF-16 Unicode Text". Such files use CSV quoting rules, which means that fields may contain tabs and newlines as long as they are quoted, and literal double quotes are written twice. The easiest way to parse everything correctly is to use the csv gem:
require 'csv'
parsed = CSV.read("file.tsv", col_sep: "\t")
TSV files conforming to the IANA standard. Tabs and newlines are not allowed in field values, and there is no quoting whatsoever. This is something you will get when you e.g. select a whole Excel spreadsheet and paste it into a text file (beware: it will get messed up if some cells do contain tabs or newlines). Such TSV files can be easily parsed line by line with a simple line.chomp.split("\t", -1) (note the -1, which prevents split from removing empty trailing fields, and chomp rather than rstrip, which would also strip trailing tabs). If you want to use the csv gem, simply set quote_char to nil:
require 'csv'
parsed = CSV.read("file.tsv", col_sep: "\t", quote_char: nil)
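The line-by-line approach for strict TSV can be sketched as follows. The helper name parse_tsv_line is hypothetical and not part of any library. It uses chomp rather than rstrip, because rstrip treats tabs as whitespace and would drop empty fields at the end of a record.

```ruby
# Hypothetical helper: split one line of strict (IANA-style) TSV into fields.
def parse_tsv_line(line)
  # chomp removes only the line terminator ("\n" or "\r\n").
  # The -1 limit tells split to keep trailing empty strings.
  line.chomp.split("\t", -1)
end

record = "a\t\tb\t\t\n"

kept    = parse_tsv_line(record)          # trailing empty fields survive
dropped = record.rstrip.split("\t", -1)   # rstrip already ate the trailing tabs

p kept     # => ["a", "", "b", "", ""]
p dropped  # => ["a", "", "b"]
```

This sketch does not decode escape sequences such as \t or \n inside field values, which some TSV producers emit.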
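quote_char: nil results in an undefined method 'encode' for nil:NilClass (NoMethodError) error. Another SO thread suggests using "\0" or liberal_parsing: true, which worked better for me, but both may still fail to handle IANA TSV that contains escaped characters: https://stackoverflow.com/a/41644206/2960236 - wu-lee
split("\t", -1). - Fravadona
I like mmmries' answer. However, I hate the way Ruby drops any empty values when splitting. It also doesn't strip the newline at the end of each line.
Also, I had a file that could contain newlines within fields. So I rewrote his "parse" as follows: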
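def parse
  File.open(filepath) do |f|
    headers = f.gets.strip.split("\t")
    f.each do |line|
      myline = line
      # Keep appending physical lines until the record has enough tabs,
      # stopping at end of file instead of appending nil
      while myline.count("\t") < headers.count - 1 && (more = f.gets)
        myline += more
      end
      fields = Hash[headers.zip(myline.chomp.split("\t", headers.count))]
      yield fields
    end
  end
end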
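This joins lines as needed to build a complete row of data, and always returns the full set of fields (without possible nil entries at the end).
Replace " with some other character that you don't expect to appear in the file: parsed_file = CSV.read("path-to-file.csv", col_sep: "\t", quote_char: '}'). Source: documentation. - Aaron Gray
The liberal_parsing option: parsed_file = CSV.read("path-to-file.csv", col_sep: "\t", liberal_parsing: true), see the documentation. - Aaron Gray
Set quote_char to nil: parsed_file = CSV.read("path-to-file.csv", col_sep: "\t", quote_char: nil). This is more robust and elegant than using a character you think won't appear in the file. - michau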
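To make the options discussed in these comments concrete, here is a small sketch that runs one line with stray quotes through the default parser and through liberal_parsing: true, and then parses a line whose field is fully quoted. The sample strings come from the comments above; the variable names are arbitrary. It assumes a csv version that supports liberal_parsing (Ruby 2.4 or later). quote_char: nil is left out because, as noted above, it raises on some versions.

```ruby
require "csv"

stray  = "foo\tbar\tand \"baz\" quotes"    # quotes inside an unquoted field
quoted = "foo\tbar\t\"and baz\tquotes\""   # quotes wrap the whole field

# The default parser follows CSV quoting rules and rejects the stray quotes.
strict_result =
  begin
    CSV.parse_line(stray, col_sep: "\t")
  rescue CSV::MalformedCSVError
    :malformed
  end

# liberal_parsing accepts double quotes inside an unquoted field.
liberal_result = CSV.parse_line(stray, col_sep: "\t", liberal_parsing: true)

# A fully quoted field may contain the separator itself.
quoted_result = CSV.parse_line(quoted, col_sep: "\t")

p strict_result
p liberal_result
p quoted_result
```

Which option fits depends on the producer of the file: liberal_parsing suits files that were never meant to be quoted, while the default rules suit tab-delimited exports that really do use CSV quoting.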