CSV Parsing Options with .NET

Question

c# .net parsing

14

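I'm researching options for parsing delimited files (e.g. CSV, tab-separated, etc.) with the MS stack in general, and .NET specifically. I've ruled out SSIS because I already know it won't meet my needs.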

So my options appear to be: Regex, TextFieldParser, and OLEDB.

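I have two criteria I must meet. First, given the file below, which contains two logical "rows" of data, the parser must produce two rows, each consisting of three strings (or columns). The third string of each row must preserve its newlines! In other words, the parser must recognize when a line "continues" onto the next physical line because of an "unclosed" text qualifier:

101, Bob, "Keeps his house clean.
Needs to work on laundry."
102, Amy, "Brilliant.
Driven.
Diligent."

The second criterion is that the delimiter and the text qualifier must both be configurable. Here are two strings, taken from different files, that I must be able to parse: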

var first = @"""This"",""Is,A,Record"",""That """"Cannot"""", they say,"","""",,""be"",rightly,""parsed"",at all";
var second = @"~This~|~Is|A|Record~|~ThatCannot~|~be~|~parsed~|at all";

A correct parse of the string "first" would be:

This
Is,A,Record
That "Cannot", they say,
_
_
be
rightly
parsed
at all

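The underscores '_' simply mean that a blank (empty) value was captured; I don't want a literal underscore to appear.

One important assumption can be made about the flat files to be parsed: each file will have a fixed number of columns.

Now for a dive into the technical options.

Regex

First, many responders comment that regex "is not the best way" to achieve the goal. I did, however, find a commenter who offered an excellent CSV regex: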

var regex = @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))";
Regex.Split(first, regex).Dump(); // Dump() is LINQPad's output helper

The results, applied to the string "first", are quite good:

"This"
"Is,A,Record"
"That ""Cannot"", they say,"
""
_
"be"
rightly
"parsed"
at all

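It would be nice if the quotes were cleaned up, but I can easily handle that in a post-processing step. Otherwise, this approach can parse both sample strings, "first" and "second", provided the regex is modified for tildes and pipes accordingly. Excellent!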
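As a rough sketch of the two adjustments mentioned above (swapping in a different delimiter and qualifier, and cleaning up the qualifiers afterwards), the same lookahead can be built from two characters. The class name `QualifiedSplitter` and its `Split` method are hypothetical names chosen for illustration. The sketch assumes a qualifier is escaped by doubling it, and that the qualifier is not a character with special meaning inside a regex character class, such as `]`.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class QualifiedSplitter
{
    // Splits one logical record on `delimiter`, skipping delimiters that sit
    // inside a qualified section, then strips the surrounding qualifiers.
    public static string[] Split(string record, char delimiter, char qualifier)
    {
        string d = Regex.Escape(delimiter.ToString());
        string q = Regex.Escape(qualifier.ToString());

        // Same shape as the CSV regex in the question: a delimiter counts only
        // when an even number of qualifiers follows it.
        string pattern =
            d + "(?=(?:[^" + q + "]*" + q + "[^" + q + "]*" + q + ")*(?![^" + q + "]*" + q + "))";

        string single = qualifier.ToString();
        string doubled = single + single;

        return Regex.Split(record, pattern)
            .Select(f => f.Length >= 2 && f[0] == qualifier && f[f.Length - 1] == qualifier
                // Post-processing: drop the outer qualifiers, collapse doubled ones.
                ? f.Substring(1, f.Length - 2).Replace(doubled, single)
                : f)
            .ToArray();
    }
}
```

Calling `QualifiedSplitter.Split(first, ',', '"')` on the sample string should give the nine fields listed earlier, and `QualifiedSplitter.Split(second, '|', '~')` covers the tilde/pipe variant. It still needs a complete logical record as input.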

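But the real problem is the multi-line criterion. Before a regex can be applied to a string, I must read the full logical "line" from the file. Unfortunately, I don't know how many physical lines to read to complete the logical line unless I already have a regex / state machine.

So this becomes a "chicken and egg" problem. My best option would be to read the entire file into memory as one giant string and let the regex sort out the multiple lines (I didn't check whether the regex above can handle that). If I've got a 10 GB file, this could be a bit precarious.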
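One possible way around the chicken-and-egg problem, sketched here under assumptions rather than offered as a tested solution, is a very small state machine that only tracks whether a qualifier is currently open. It reads physical lines one at a time and joins them until the qualifiers balance, so only one logical record is held in memory. `LogicalRecordReader` is a hypothetical name. The sketch assumes qualifiers are escaped by doubling (a doubled qualifier toggles the state twice and cancels out), and it normalizes embedded line breaks to `\n`.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of any library.
public static class LogicalRecordReader
{
    // Yields one logical record at a time. A physical line is joined with the
    // next one whenever it leaves a text qualifier open.
    public static IEnumerable<string> ReadRecords(TextReader reader, char qualifier)
    {
        var record = new StringBuilder();
        bool open = false;
        string line;

        while ((line = reader.ReadLine()) != null)
        {
            foreach (char c in line)
            {
                if (c == qualifier)
                {
                    open = !open;
                }
            }

            record.Append(line);

            if (open)
            {
                // ReadLine strips the line terminator; put a newline back so the
                // field keeps its line break.
                record.Append('\n');
                continue;
            }

            yield return record.ToString();
            record.Clear();
        }

        // Input ended while a qualifier was still open: return what was read.
        if (record.Length > 0)
        {
            yield return record.ToString();
        }
    }
}
```

Each string it yields is a complete logical record, which can then be handed to a regex split. Wrapping a `FileStream` in a `StreamReader` and passing that in avoids loading a large file as a single string.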

On to the next option.

TextFieldParser

Three lines of code make the problem with this option apparent:

var reader = new Microsoft.VisualBasic.FileIO.TextFieldParser(stream);
reader.Delimiters = new string[] { @"|" };
reader.HasFieldsEnclosedInQuotes = true;

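The Delimiters configuration looks good. However, "HasFieldsEnclosedInQuotes" is "game over". I'm stunned that the delimiters are arbitrarily configurable, yet I have no qualifier option other than quotes. Remember, I need the text qualifier to be configurable. So unless someone knows a TextFieldParser configuration trick, this is a dead end.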

OLEDB

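A colleague tells me this option has two major failings. First, it performs terribly on large files (e.g. 10 GB). Second, so I'm told, it guesses the data types of the input data rather than letting you specify them. Not good.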

HELP

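So I'd like to know which facts I got wrong (if any), and which other options I missed. Perhaps someone knows a way to coerce TextFieldParser into using an arbitrary delimiter. And perhaps OLEDB has resolved the stated issues (or never had them?).

What do you think?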

- Brent Arias

Have you tried the options listed at https://dev59.com/T0XRa4cB1Zd3GeqPvNt9? - TrueWill

I agree with @Appleman1234, FileHelpers should be what you need. - Kane

Does FileHelpers (http://www.filehelpers.com/) meet your requirements? - Appleman1234

Possible duplicate of Reading CSV files in C#. - Dour High Arch

3 Answers

4

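I wrote this a while back as a lightweight, standalone CSV parser. I believe it meets all of your requirements. Give it a try, with the knowledge that it probably isn't bulletproof.

If it does work for you, feel free to change the namespace and use it without restriction.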

namespace NFC.Portability
{
    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.IO;
    using System.Linq;
    using System.Text;

    /// <summary>
    /// Loads and reads a file with comma-separated values into a tabular format.
    /// </summary>
    /// <remarks>
    /// Parsing assumes that the first line will always contain headers and that values will be double-quoted to escape double quotes and commas.
    /// </remarks>
    public unsafe class CsvReader
    {
        private const char SEGMENT_DELIMITER = ',';
        private const char DOUBLE_QUOTE = '"';
        private const char CARRIAGE_RETURN = '\r';
        private const char NEW_LINE = '\n';

        private DataTable _table = new DataTable();

        /// <summary>
        /// Gets the data contained by the instance in a tabular format.
        /// </summary>
        public DataTable Table
        {
            get
            {
                // validation logic could be added here to ensure that the object isn't in an invalid state

                return _table;
            }
        }

        /// <summary>
        /// Creates a new instance of <c>CsvReader</c>.
        /// </summary>
        /// <param name="path">The fully-qualified path to the file from which the instance will be populated.</param>
        public CsvReader( string path )
        {
            if( path == null )
            {
                throw new ArgumentNullException( "path" );
            }

            FileStream fs = new FileStream( path, FileMode.Open );
            Read( fs );
        }

        /// <summary>
        /// Creates a new instance of <c>CsvReader</c>.
        /// </summary>
        /// <param name="stream">The stream from which the instance will be populated.</param>
        public CsvReader( Stream stream )
        {
            if( stream == null )
            {
                throw new ArgumentNullException( "stream" );
            }

            Read( stream );
        }

        /// <summary>
        /// Creates a new instance of <c>CsvReader</c>.
        /// </summary>
        /// <param name="bytes">The array of bytes from which the instance will be populated.</param>
        public CsvReader( byte[] bytes )
        {
            if( bytes == null )
            {
                throw new ArgumentNullException( "bytes" );
            }

            MemoryStream ms = new MemoryStream();
            ms.Write( bytes, 0, bytes.Length );
            ms.Position = 0;

            Read( ms );
        }

        private void Read( Stream s )
        {
            string lines;

            using( StreamReader sr = new StreamReader( s ) )
            {
                lines = sr.ReadToEnd();
            }

            if( string.IsNullOrWhiteSpace( lines ) )
            {
                throw new InvalidOperationException( "Data source cannot be empty." );
            }

            bool inQuotes = false;
            int lineNumber = 0;
            StringBuilder buffer = new StringBuilder( 128 );
            List<string> values = new List<string>();

            Action endSegment = () =>
            {
                values.Add( buffer.ToString() );
                buffer.Clear();
            };

            Action endLine = () =>
            {
                if( lineNumber == 0 )
                {
                    CreateColumns( values );
                }
                else
                {
                    CreateRow( values );
                }

                values.Clear();
                lineNumber++;
            };

            fixed( char* pStart = lines )
            {
                char* pChar = pStart;
                char* pEnd = pStart + lines.Length;

                while( pChar < pEnd ) // leave null terminator out
                {
                    if( *pChar == DOUBLE_QUOTE )
                    {
                        if( inQuotes )
                        {
                            if( Peek( pChar, pEnd ) == SEGMENT_DELIMITER )
                            {
                                endSegment();
                                pChar++;
                            }
                            else if( !ApproachingNewLine( pChar, pEnd ) )
                            {
                                buffer.Append( DOUBLE_QUOTE );
                            }
                        }

                        inQuotes = !inQuotes;
                    }
                    else if( *pChar == SEGMENT_DELIMITER )
                    {
                        if( !inQuotes )
                        {
                            endSegment();
                        }
                        else
                        {
                            buffer.Append( SEGMENT_DELIMITER );
                        }
                    }
                    else if( AtNewLine( pChar, pEnd ) )
                    {
                        if( !inQuotes )
                        {
                            endSegment();
                            endLine();
                            //pChar++;
                        }
                        else
                        {
                            buffer.Append( *pChar );
                        }
                    }
                    else
                    {
                        buffer.Append( *pChar );
                    }

                    pChar++;
                }
            }

            // append trailing values at the end of the file, including a final
            // single-column line that has no terminating newline
            if( values.Count > 0 || buffer.Length > 0 )
            {
                endSegment();
                endLine();
            }
        }

        /// <summary>
        /// Returns the next character in the sequence but does not advance the pointer. Checks bounds.
        /// </summary>
        /// <param name="pChar">Pointer to current character.</param>
        /// <param name="pEnd">End of range to check.</param>
        /// <returns>
        /// Returns the next character in the sequence, or char.MinValue if range is exceeded.
        /// </returns>
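        private char Peek( char* pChar, char* pEnd )
        {
            // pEnd points one past the last character, so the next character
            // is in range only when pChar + 1 < pEnd
            if( pChar + 1 < pEnd )
            {
                return *( pChar + 1 );
            }

            return char.MinValue;
        }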

        /// <summary>
        /// Determines if the current character represents a newline. This includes lookahead for two character newline delimiters.
        /// </summary>
        /// <param name="pChar"></param>
        /// <param name="pEnd"></param>
        /// <returns></returns>
        private bool AtNewLine( char* pChar, char* pEnd )
        {
            if( *pChar == NEW_LINE )
            {
                return true;
            }

            if( *pChar == CARRIAGE_RETURN && Peek( pChar, pEnd ) == NEW_LINE )
            {
                return true;
            }

            return false;
        }

        /// <summary>
        /// Determines if the next character represents a newline, or the start of a newline.
        /// </summary>
        /// <param name="pChar"></param>
        /// <param name="pEnd"></param>
        /// <returns></returns>
        private bool ApproachingNewLine( char* pChar, char* pEnd )
        {
            if( Peek( pChar, pEnd ) == CARRIAGE_RETURN || Peek( pChar, pEnd ) == NEW_LINE )
            {
                // technically this cheats a little to avoid a two char peek by only checking for a carriage return or new line, not both in sequence
                return true;
            }

            return false;
        }

        private void CreateColumns( List<string> columns )
        {
            foreach( string column in columns )
            {
                DataColumn dc = new DataColumn( column );
                _table.Columns.Add( dc );
            }
        }

        private void CreateRow( List<string> values )
        {
            if( values.All( o => string.IsNullOrWhiteSpace( o ) ) )
            {
                return; // ignore rows which have no content
            }

            DataRow dr = _table.NewRow();
            _table.Rows.Add( dr );

            for( int i = 0; i < values.Count; i++ )
            {
                dr[i] = values[i];
            }
        }
    }
}
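The reader above hard-codes the comma and the double quote, and it loads the whole input with ReadToEnd. As a hedged sketch of the same character-by-character idea with a configurable delimiter and qualifier, reading incrementally from a TextReader and needing no unsafe code, something like the following might work. `DelimitedParser` is a hypothetical name and this is not a hardened implementation. It assumes qualifiers are escaped by doubling, skips blank lines, and keeps whitespace around fields as-is.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of any library.
public static class DelimitedParser
{
    // Yields one logical row at a time as an array of field strings.
    public static IEnumerable<string[]> Parse(TextReader reader, char delimiter, char qualifier)
    {
        var field = new StringBuilder();
        var fields = new List<string>();
        bool inQualifier = false;
        bool hasContent = false; // true once the current row has seen any character
        int next;

        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;

            if (inQualifier)
            {
                if (c != qualifier)
                {
                    field.Append(c); // delimiters and newlines are kept verbatim
                }
                else if (reader.Peek() == qualifier)
                {
                    field.Append(c); // doubled qualifier means one literal qualifier
                    reader.Read();
                }
                else
                {
                    inQualifier = false;
                }
            }
            else if (c == qualifier)
            {
                inQualifier = true;
                hasContent = true;
            }
            else if (c == delimiter)
            {
                fields.Add(field.ToString());
                field.Clear();
                hasContent = true;
            }
            else if (c == '\r' || c == '\n')
            {
                if (c == '\r' && reader.Peek() == '\n')
                {
                    reader.Read(); // treat \r\n as a single line break
                }

                if (hasContent)
                {
                    fields.Add(field.ToString());
                    yield return fields.ToArray();
                }

                field.Clear();
                fields.Clear();
                hasContent = false;
            }
            else
            {
                field.Append(c);
                hasContent = true;
            }
        }

        // Last row when the input does not end with a line break.
        if (hasContent)
        {
            fields.Add(field.ToString());
            yield return fields.ToArray();
        }
    }
}
```

For example, `DelimitedParser.Parse(new StringReader(second), '|', '~')` would handle the tilde/pipe sample from the question. One caveat: the sketch relies on `TextReader.Peek`, which behaves predictably for `StringReader` and file-backed `StreamReader` but can return -1 early on some non-seekable streams.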

- Tim M.

1

Take a look at the code I posted to this question:

https://dev59.com/iHI_5IYBdhLWcg3wEOrq#1544743

It covers most of your requirements, and it wouldn't take much work to update it to support alternate delimiters or text qualifiers.

- Joel Coehoorn

- Dour High Arch · Accepted Answer

6

Have you tried searching for an existing .NET CSV parser? This one claims to handle multi-line records significantly faster than OLEDB.

- Dour High Arch

FastCSV is a fairly popular library. - Joel Coehoorn

1

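Yes, I did some investigating, which is why I mentioned three options. The trouble is that there are many other options, and determining whether they meet my criteria is a very slow process. I was hoping someone already knew the right choice. - Brent Arias

I picked https://www.nuget.org/packages/CsvHelper/ purely based on its download count on NuGet (search link: https://www.nuget.org/packages?q=csv). - xhafan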