Can I import a CSV file and automatically infer the delimiter?

Question

54

I want to import two kinds of CSV files: some use ";" as the delimiter, others use ",". So far I have been switching between the following two lines:

reader=csv.reader(f,delimiter=';')

or

reader=csv.reader(f,delimiter=',')

Is it possible to leave the delimiter unspecified and have the program work out the correct one?

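The solutions below (from Blender and sharth) seem to work for comma-separated files (generated by LibreOffice) but not for semicolon-separated files generated by MS Office. Here are the first lines of one semicolon-separated file: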

ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes
1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document

- rom

Hi, there is also a more general discussion (not specific to Python) at the following link: https://dev59.com/0nE85IYBdhLWcg3waCtT - Lorenzo

6 Answers

13

Suppose a project has to handle both comma-delimited and pipe-delimited CSV files, all of them well-formed. I tried the following (as described at https://docs.python.org/2/library/csv.html#csv.Sniffer):

dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=',|')

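However, on a |-delimited file this raised a "Could not determine delimiter" exception. It seemed reasonable to guess that the sniffing heuristic works best when every line contains the same number of delimiters (not counting anything inside quotes). So instead of reading only the first 1024 bytes of the file, I tried reading the first two lines in full: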

temp_lines = csvfile.readline() + '\n' + csvfile.readline()
dialect = csv.Sniffer().sniff(temp_lines, delimiters=',|')

So far this has worked well for me.
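The same idea can be sketched for Python 3 with a fallback for the case where the sniffer gives up. This is only an illustration, not the answer author's code: the helper names `sniff_delimiter` and `read_rows`, the candidate string `",;|"`, and the default `","` are all assumptions chosen for the example.

```python
import csv
import io

def sniff_delimiter(sample, candidates=",;|", default=","):
    """Return the delimiter csv.Sniffer detects in `sample`, or `default`."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # csv.Sniffer raises csv.Error when it cannot settle on a delimiter
        return default

def read_rows(text, candidates=",;|"):
    """Parse CSV text, sniffing the delimiter from its first two lines only."""
    lines = text.splitlines(keepends=True)
    delimiter = sniff_delimiter("".join(lines[:2]), candidates)
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
```

Falling back to a default silently may or may not be what you want; re-raising the `csv.Error` is the safer choice when a wrong guess would corrupt the import.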

- Andrew Basile

2

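This was very helpful! I had trouble with my data because one of the "fixed" values is a number containing a comma, so sniffing failed. Limiting it to the first two lines really helped. - mauve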

Great, this works for my pipe-delimited "csv" files. Thanks :) - 3isenHeim

9

To solve this, I wrote a function that reads the first line of the file (the header) and detects the delimiter.

def detectDelimiter(csvFile):
    with open(csvFile, 'r') as myCsvfile:
        header=myCsvfile.readline()
        if header.find(";")!=-1:
            return ";"
        if header.find(",")!=-1:
            return ","
    #default delimiter (MS Office export)
    return ";"

- rom

13

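Your function does not work if the delimiter is part of a value, even when it is escaped or quoted. For example, a line like "Hi Peter;", "How are you?", "Bye John!" returns ; as the delimiter, which is wrong. - tashuhka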

7

If you are using DictReader, you can do this:

#!/usr/bin/env python
import csv

def parse(filename):
    # Python 3 needs text mode here; on Python 2 use open(filename, 'rb')
    with open(filename, 'r', newline='') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')
        csvfile.seek(0)
        reader = csv.DictReader(csvfile, dialect=dialect)

        for line in reader:
            print(line['ReleveAnnee'])

I used this with Python 3.5 and it worked well.

- Vladir Parrado Cruz

1

I used it with Python 2.7. - alvaro562003

2

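I don't think there is a perfectly general solution to this (one of the reasons I use , as a delimiter is that some of my data fields need to contain ;...). A simple heuristic is to read the first line (or more), count how many , and ; characters it contains (possibly ignoring anything inside quotes, if whatever creates your .csv files quotes entries properly and consistently), and guess that the more frequent of the two is the correct delimiter.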
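A minimal sketch of that counting heuristic might look like the following. The function names `count_outside_quotes` and `guess_delimiter` are made up for this example, and it assumes fields are quoted with `"` and that a literal quote inside a field is written as a doubled `""`.

```python
def count_outside_quotes(line, char, quote='"'):
    """Count occurrences of `char` in `line` that are not inside quotes."""
    count = 0
    in_quotes = False
    for c in line:
        if c == quote:
            # A doubled quote ("") toggles twice, so the state is preserved
            in_quotes = not in_quotes
        elif c == char and not in_quotes:
            count += 1
    return count

def guess_delimiter(lines, candidates=(",", ";")):
    """Return the candidate seen most often outside quotes, or None."""
    totals = {c: sum(count_outside_quotes(line, c) for line in lines)
              for c in candidates}
    best = max(candidates, key=lambda c: totals[c])
    return best if totals[best] > 0 else None
```

On a tie the first entry in `candidates` wins, so the order of that tuple acts as a preference. Because quoted content is skipped, a line such as "Hi Peter;", "How are you?", "Bye John!" is counted as comma-delimited.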

- twalberg

1

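If csv.Sniffer doesn't meet your needs, then following @twalberg's idea, here are two possible implementations for identifying the correct delimiter. They do not just check the usual comma, semicolon, and pipe; they try to identify any unusual delimiter in CSV-like files.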

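The naive approach

The following code reads the first 10 lines of the CSV file, collects every non-alphanumeric character, and counts how often each one occurs.

The code rests entirely on a frequency argument (the law of large numbers, loosely applied): the most common non-alphanumeric character should usually be the delimiter.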

import re
from collections import Counter
from itertools import islice

def delimiter_detector(file_path):
    with open(file_path, 'r') as file:
        # Not loading the whole CSV into memory, just the first 10 rows
        # (readline() never raises StopIteration, so islice is used instead)
        sample_data = "".join(islice(file, 10))

    # Skip line terminators so a newline is never reported as the delimiter
    non_alnum_chars = re.findall(r'[^a-zA-Z0-9\r\n]', sample_data)
    delimiters_frequency = Counter(non_alnum_chars)
    if len(delimiters_frequency) == 0:
        return None

    # Find and return the most common delimiter
    most_common_delimiter = delimiters_frequency.most_common(1)[0][0]
    return most_common_delimiter

print(delimiter_detector('test.csv'))

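Of course, this approach fails if, say, there are 5 columns separated by "|" (4 per row) but each of the following 9 rows also contains 5 or more "." characters: {'|': 10*4, '.': 9*5}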

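A more mature approach

A better approach, therefore, is to first count all the special characters in the header/first row, and then do the same for each subsequent row.

Once the special characters in the first row are known, check which of them appear in the remaining rows with the same count as in the header most often.

Continuing the example above: even in the worst case, where the header row contains 4 "|" and 4 "." so that both look like possible delimiters, checking the next n rows will usually show that "|": 4 is the count that recurs most often, while the counts for "." and other special characters vary.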

import re
from collections import Counter
from itertools import islice

def frequency_counter(sample_data):
    non_alnum_chars = re.findall(r'[^a-zA-Z0-9]', sample_data)
    return dict(Counter(non_alnum_chars))

def delimiter_detector(file_path):
    with open(file_path, 'r') as file:
        # Not loading the whole CSV into memory, just the first 10 rows
        possible_delimiters = [
            frequency_counter(line.rstrip('\r\n')) for line in islice(file, 10)
        ]

    if len(possible_delimiters) == 0:
        return None

    header_row = possible_delimiters[0]
    potential_delimiters = []
    # A character is a potential delimiter each time its count in a data row
    # equals its count in the header row
    for data_row in possible_delimiters[1:]:
        for data_row_delim in data_row:
            if header_row.get(data_row_delim) == data_row[data_row_delim]:
                potential_delimiters.append(data_row_delim)

    # No character matched the header counts (e.g. a one-line file)
    if not potential_delimiters:
        return None

    # identify the most common potential delimiter
    most_common_delimiter = Counter(potential_delimiters).most_common(1)
    return most_common_delimiter[0][0]

print(delimiter_detector('test.csv'))

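This approach works on a file like the following, where the first, naive approach fails once enough data rows like these are sampled: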

c1|c2|c3|c4|c5
a.|b.|c.|d.|e.
a.|b.|c.|d.|e.

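The only case in which it fails is when another special character appears in the header row and also appears exactly the same number of times in every other row.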

c.1|c.2|c.3|c.4|c.5
a.|b.|c.|d.|e.
a.|b.|c.|d.|e.

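In this case, both . and | are valid delimiter candidates. However, if even one row has a count that differs from the header's, the latter approach will identify the correct delimiter. So it may be worth increasing the number of rows checked.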

c.1|c.2|c.3|c.4|c.5
a.|b.|c.|d.|e.
a.|b.|c.|d.|e.
a.|b.|c.|d..|e.
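To see the effect of the sample size on the rows above, here is a compact in-memory variant of the same idea. It is a sketch rather than the answer's exact code: `consistent_delimiter` and its `max_rows` parameter are hypothetical names, and it takes a list of lines instead of a file path.

```python
import re
from collections import Counter

def consistent_delimiter(lines, max_rows=10):
    """Pick the character whose per-row count most often matches the header's."""
    rows = [Counter(re.findall(r'[^a-zA-Z0-9]', line))
            for line in lines[:max_rows]]
    if len(rows) < 2:
        return None  # nothing to compare the header against

    header, votes = rows[0], Counter()
    for row in rows[1:]:
        for char, count in row.items():
            # One vote each time a data row agrees with the header count
            if header.get(char) == count:
                votes[char] += 1
    return votes.most_common(1)[0][0] if votes else None
```

With the four rows shown above, "|" collects three votes and "." only two, so "|" is chosen. With `max_rows=3` the two characters tie, which is exactly the ambiguous case described earlier.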

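Some additional things to consider

The approach above assumes that the CSV file has a header row. If it does not, the method treats the first data row as the header, which can lead to the wrong delimiter being detected.

Another thing that often causes me trouble is file encoding. Especially when working on Windows, Excel likes to save files in Windows-1250 encoding. So make sure the correct encoding is specified when reading the file. To detect the encoding, you can use the chardet library.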
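If installing chardet is not an option, a cruder standard-library alternative is to try a short list of candidate encodings in order. This is a different technique from chardet's statistical detection, and the helper name `read_text_with_fallback` and the candidate list are assumptions for the sake of the example.

```python
def read_text_with_fallback(path, encodings=("utf-8", "cp1250")):
    """Decode the file with the first encoding in `encodings` that succeeds.

    Returns a (text, encoding_used) tuple.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode " + path)
```

Order matters here: single-byte encodings such as cp1250 accept almost any byte sequence, so they should come after stricter ones like UTF-8, and this approach cannot tell two single-byte code pages apart.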

- valq

- Bill Lynch · Accepted Answer

The csv module seems to recommend using the csv sniffer for this problem.

They give the following example, which I have adapted to your case.

with open('example.csv', 'rb') as csvfile:  # python 3: 'r',newline=""
    dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=";,")
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    # ... process CSV file contents here ...

Let's try it out.

[9:13am][wlynch@watermelon /tmp] cat example 
#!/usr/bin/env python
import csv

def parse(filename):
    with open(filename, 'rb') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect)

        for line in reader:
            print line

def main():
    print 'Comma Version:'
    parse('comma_separated.csv')

    print
    print 'Semicolon Version:'
    parse('semicolon_separated.csv')

    print
    print 'An example from the question (kingdom.csv)'
    parse('kingdom.csv')

if __name__ == '__main__':
    main()

And our sample inputs:

[9:13am][wlynch@watermelon /tmp] cat comma_separated.csv 
test,box,foo
round,the,bend

[9:13am][wlynch@watermelon /tmp] cat semicolon_separated.csv 
round;the;bend
who;are;you

[9:22am][wlynch@watermelon /tmp] cat kingdom.csv 
ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes
1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document

If we run the example program:

[9:14am][wlynch@watermelon /tmp] ./example 
Comma Version:
['test', 'box', 'foo']
['round', 'the', 'bend']

Semicolon Version:
['round', 'the', 'bend']
['who', 'are', 'you']

An example from the question (kingdom.csv)
['ReleveAnnee', 'ReleveMois', 'NoOrdre', 'TitreRMC', 'AdopCSRegleVote', 'AdopCSAbs', 'AdoptCSContre', 'NoCELEX', 'ProposAnnee', 'ProposChrono', 'ProposOrigine', 'NoUniqueAnnee', 'NoUniqueType', 'NoUniqueChrono', 'PropoSplittee', 'Suite2LecturePE', 'Council PATH', 'Notes']
['1999', '1', '1', '1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC', 'U', '', '', '31999D0083', '1998', '577', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document']
['1999', '1', '2', '1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes', 'U', '', '', '31999D0081', '1998', '184', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document']

For what it's worth, this is the Python version I am using:

[9:20am][wlynch@watermelon /tmp] python -V
Python 2.7.2