Removing newline characters from a CSV file

Question

5

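I want to remove the newline characters inside the field data of a CSV file. The same question has been asked by several people on SO and elsewhere, but the solutions offered are all in scripting languages. I am looking for a solution in a programming language such as Python or Spark (not limited to these two), because my files are very large.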

Previous questions on the same topic:

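I have a CSV file of about 1 GB and want to remove the newline characters inside the field data. The schema of the CSV file changes dynamically, so I cannot hard-code it. The newline does not always come right before a comma; it can appear anywhere inside a field. Sample data: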

playerID,yearID,gameNum,gameName,teamName,lgID,GP,startingPos
gomezle01,1933,1,Cricket,Team1,NYA,AL,1
ferreri01,1933,2,Hockey,"This is 
Team2",BOS,AL,1
gehrilo01,1933,3,"Game name is 
Cricket" 
,Team3,NYA,AL,1
gehrich01,1933,4,Hockey,"Here it is 
Team4",DET,AL,1
dykesji01,1933,5,"Game name is 
Hockey"
,"Team name 
Team5",CHA,AL,1

Expected output:

playerID,yearID,gameNum,gameName,teamName,lgID,GP,startingPos
gomezle01,1933,1,Cricket,Team1,NYA,AL,1
ferreri01,1933,2,Hockey,"This is Team2",BOS,AL,1
gehrilo01,1933,3,"Game name is Cricket" ,Team3,NYA,AL,1
gehrich01,1933,4,Hockey,"Here it is Team4",DET,AL,1
dykesji01,1933,5,"Game name is Hockey","Team name Team5",CHA,AL,1

The newline character can appear in the data of any field.

Edit: screenshot as per the code:

- data_addict

I see that most of your strings contain a newline character. In Python, try replacing the newline while iterating over the lines: a = "Here it is \n Team4"; print(a); b = a.replace('\n', ''); print(b) - Sunnysinh Solanki

@SunnysinhSolanki, I tried the replace function, but it did not work here. - data_addict

5 Answers

1

You can use the re, pandas, and io modules as follows:

import re
import io
import numpy as np
import pandas as pd

with open('data.csv', 'r') as f:
    data = f.read()
# Drop the whitespace and line break that follow a double quote, then parse.
df = pd.read_csv(io.StringIO(re.sub(r'"\s*\n', '"', data)))

for col in df.columns:  # Replace line breaks in every text column
    if df[col].dtype == np.object_:
        df[col] = df[col].str.replace('\n', '')

In [78]: df
Out[78]:
    playerID   yearID  gameNum  gameName              teamName          lgID  GP  startingPos
0   gomezle01  1933    1        Cricket               Team1             NYA   AL  1
1   ferreri01  1933    2        Hockey                This is Team2     BOS   AL  1
2   gehrilo01  1933    3        Game name is Cricket  Team3             NYA   AL  1
3   gehrich01  1933    4        Hockey                Here it is Team4  DET   AL  1
4   dykesji01  1933    5        Game name is Hockey   Team name Team5   CHA   AL  1

If you want to write this DataFrame to an output CSV file, use the following command:

df.to_csv('./output.csv', index=False)  # index=False leaves out the row-index column

- O.Suleiman

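I cannot hard-code the column names, because the schema changes dynamically and the newline can appear in any column. Can this be applied to all columns without naming them? - data_addict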

@user805, you can do that with a for loop. I have updated my code, and it should now remove all the line breaks from your strings. - O.Suleiman

0

Here is a basic preprocessor that does some simple cleanup before the CSV file is read.

import csv

def simple_sanitize(data):
    # Glue a line that starts with ',' onto the line before it.
    result = []
    for i, a in enumerate(data):
        if i + 1 != len(data) and data[i + 1].startswith(','):
            a = a.replace('\n', '')
            result.append(a + data[i + 1])
        elif not a.startswith(','):
            result.append(a)
    return result

with open('test.csv', 'r') as f:
    data = f.readlines()
sdata = simple_sanitize(data)

with open('out.csv', 'w') as f:
    for row in sdata:
        f.write(row)

# csv.reader keeps line breaks inside quoted fields, so strip them per value.
with open('out.csv', 'r') as f:
    result = [[val.replace('\n', '') for val in line] for line in csv.reader(f)]

print(result)

Result:

[['playerID', 'yearID', 'gameNum', 'gameName', 'teamName', 'lgID', 'GP', 'startingPos'], 
['gomezle01', '1933', '1', 'Cricket', 'Team1', 'NYA', 'AL', '1'], 
['ferreri01', '1933', '2', 'Hockey', 'This is Team2', 'BOS', 'AL', '1'], 
['gehrilo01', '1933', '3', 'Game name is Cricket ', 'Team3', 'NYA', 'AL', '1'], 
['gehrich01', '1933', '4', 'Hockey', 'Here it is Team4', 'DET', 'AL', '1'], 
['dykesji01', '1933', '5', 'Game name is Hockey', 'Team name Team5', 'CHA', 'AL', '1']]

- Reck

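Look at your output. Do you think it is correct? Does every array have 8 strings? - Ramesh Maharjan

Yes, I noticed it just now. - Reck

@Reck, do you have any other possible solution? The output of the code you provided is not as expected. - data_addict

Another possible solution is to sanitize the CSV before using csv.reader. The sanitizing is the tricky part here. - Reck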
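One way the sanitizing step mentioned above could look is sketched below. It is not code from any of the answers: `join_broken_lines` is a hypothetical helper that joins physical lines until the record has an even number of double quotes and at least as many fields as the header. It assumes the header line itself is intact and that a break never falls inside the last unquoted field of a record.

```python
import csv

def join_broken_lines(lines):
    """Yield one text line per logical CSV record (hypothetical helper)."""
    lines = iter(lines)
    header = next(lines).rstrip("\r\n")
    n_fields = len(next(csv.reader([header])))  # assumes the header is intact
    yield header
    buf = ""
    for line in lines:
        buf += line.rstrip("\r\n")
        if buf.count('"') % 2 == 1:
            continue  # odd quote count: the break fell inside a quoted field
        if len(next(csv.reader([buf]), [])) < n_fields:
            continue  # too few fields: the break fell between fields
        yield buf
        buf = ""

# A few lines taken from the sample data in the question.
sample = [
    'playerID,yearID,gameNum,gameName,teamName,lgID,GP,startingPos\n',
    'gomezle01,1933,1,Cricket,Team1,NYA,AL,1\n',
    'ferreri01,1933,2,Hockey,"This is \n',
    'Team2",BOS,AL,1\n',
    'gehrilo01,1933,3,"Game name is \n',
    'Cricket" \n',
    ',Team3,NYA,AL,1\n',
]
fixed = list(join_broken_lines(sample))
for record in fixed:
    print(record)
```

Because the helper only needs an iterable of lines, an open file object can be passed in directly, so the file is processed line by line rather than loaded whole.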

0

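The code is a bit messy, but here is something that does what you want. It works for newlines both inside a field and before a comma. If you have further requirements, it can be adjusted: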

import csv

with open('data.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    actual_rows = [next(reader)]
    length = len(actual_rows[0])
    real_row = []
    for row in reader:
        if len(row) < length:
            if real_row:
                real_row[-1] += row[0]
                real_row += row[1:]
            else:
                real_row = row
        else:
            real_row = row
        if len(real_row) == length:
            real_row = map(lambda s: s.replace('\n', ' '), real_row)
            # store real_row or use it as needed
            actual_rows.append(list(real_row))
            real_row = []

    print(actual_rows)

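I am storing the corrected rows in actual_rows, but if you do not want to load everything into memory, just use the real_row variable at the point marked in the loop.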
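The memory-friendly variant described above could be sketched as a generator that hands out each repaired row as soon as it is complete. `repaired_rows` is a hypothetical name, and this version differs slightly from the answer's code: it skips blank lines, always appends a short row to the pending one, and replaces newlines with an empty string rather than a space.

```python
import csv
import io

def repaired_rows(reader):
    """Yield complete rows one at a time (hypothetical helper)."""
    header = next(reader)
    yield header
    width = len(header)
    pending = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if pending:
            # Continuation: glue its first cell onto the last pending cell.
            pending[-1] += row[0]
            pending.extend(row[1:])
        else:
            pending = row
        if len(pending) >= width:
            yield [cell.replace("\n", "") for cell in pending]
            pending = []

# Two broken records from the question's sample data.
sample = (
    'playerID,yearID,gameNum,gameName,teamName,lgID,GP,startingPos\n'
    'gehrilo01,1933,3,"Game name is \nCricket" \n,Team3,NYA,AL,1\n'
    'dykesji01,1933,5,"Game name is \nHockey"\n,"Team name \nTeam5",CHA,AL,1\n'
)
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for row in repaired_rows(csv.reader(io.StringIO(sample))):
    writer.writerow(row)  # each row is written as soon as it is complete
print(out.getvalue())
```

With real files, the two StringIO objects would be replaced by file objects opened for reading and writing, so only one logical row is held in memory at a time.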

- damores

Is there any possible solution that handles a newline at any position (not only before a comma)? - data_addict

0

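The basic idea in this solution is to use the grouper recipe to get fixed-length chunks (with length equal to the number of columns in the first row). Since it does not read the whole file at once, it should not use excessive memory on large files.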

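$ cat a.py
import csv
import itertools as it
import operator as op

def grouper(iterable, n):
    # Collect items into fixed-length tuples of size n (itertools recipe).
    return it.zip_longest(*[iter(iterable)] * n)

with open('in.csv') as inf, open('out.csv', 'w', newline='') as outf:
    r, w = csv.reader(inf), csv.writer(outf)
    hdr = next(r)
    w.writerow(hdr)
    # Flatten all cells, strip embedded newlines, drop empty cells,
    # then regroup into rows as wide as the header.
    cells = map(op.methodcaller('replace', '\n', ''), it.chain.from_iterable(r))
    for row in grouper(filter(bool, cells), len(hdr)):
        w.writerow(row)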

$ python3 a.py
$ cat out.csv
playerID,yearID,gameNum,gameName,teamName,lgID,GP,startingPos
gomezle01,1933,1,Cricket,Team1,NYA,AL,1
ferreri01,1933,2,Hockey,This is Team2,BOS,AL,1
gehrilo01,1933,3,Game name is Cricket ,Team3,NYA,AL,1
gehrich01,1933,4,Hockey,Here it is Team4,DET,AL,1
dykesji01,1933,5,Game name is Hockey,Team name Team5,CHA,AL,1

One assumption made here is that the input CSV contains no empty cells.
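To see why that assumption matters, here is a small sketch of the grouper recipe on plain lists (the cell values are made up for illustration). The answer filters out empty strings because a continuation line such as `,Team3,...` produces an empty first cell; the same filter also drops a cell that is empty on purpose, which shifts everything after it.

```python
import itertools as it

def grouper(iterable, n):
    # Same recipe as in the answer: n references to one shared iterator.
    return it.zip_longest(*[iter(iterable)] * n)

# Six cells regroup cleanly into two rows of three.
cells = ["a", "b", "c", "d", "e", "f"]
rows = list(grouper(cells, 3))
print(rows)  # [('a', 'b', 'c'), ('d', 'e', 'f')]

# A genuinely empty cell is removed by filter(bool, ...), so later cells
# slide left and the last row is padded with None.
cells_with_gap = ["a", "", "c", "d", "e", "f"]
shifted = list(grouper(filter(bool, cells_with_gap), 3))
print(shifted)  # [('a', 'c', 'd'), ('e', 'f', None)]
```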

- apnkpr

- Ramesh Maharjan · Accepted Answer

If you are using pyspark, I suggest reading the file with sparkContext's wholeTextFiles function, because your file has to be read as one whole text in order to be parsed properly.

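After reading it with wholeTextFiles, you should parse the text by replacing the line breaks with commas and doing some additional formatting, so that the whole text can be broken into groups of eight strings.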

import re
rdd = sc.wholeTextFiles("path to your csv file")\
    .map(lambda x: re.sub(r'(?!(([^"]*"){2})*[^"]*$),', ' ', x[1].replace("\r\n", ",").replace(",,", ",")).split(","))\
    .flatMap(lambda x: [x[k:k+8] for k in range(0, len(x), 8)])

You should get output like the following:

[u'playerID', u'yearID', u'gameNum', u'gameName', u'teamName', u'lgID', u'GP', u'startingPos']
[u'gomezle01', u'1933', u'1', u'Cricket', u'Team1', u'NYA', u'AL', u'1']
[u'ferreri01', u'1933', u'2', u'Hockey', u'"This is Team2"', u'BOS', u'AL', u'1']
[u'gehrilo01', u'1933', u'3', u'"Game name is Cricket"', u'Team3', u'NYA', u'AL', u'1']
[u'gehrich01', u'1933', u'4', u'Hockey', u'"Here it is Team4"', u'DET', u'AL', u'1']
[u'dykesji01', u'1933', u'5', u'"Game name is Hockey"', u'"Team name Team5"', u'CHA', u'AL', u'1']
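The string handling in that snippet can be tried without Spark. The sketch below reuses the same regular expression on a made-up three-column text; `QUOTED_COMMA`, the sample text, and deriving `width` from the header (instead of the fixed 8) are assumptions for illustration. Note that each comma triggers a scan of the rest of the text, so the cost may grow quickly on large inputs.

```python
import re

# Same pattern as in the answer: it matches a comma that has an odd number
# of double quotes after it, i.e. a comma inside a quoted field.
QUOTED_COMMA = re.compile(r'(?!(([^"]*"){2})*[^"]*$),')

text = 'id,name,team\r\n1,"This is\r\nTeam2",BOS'
width = len(text.split("\r\n", 1)[0].split(","))  # column count from the header

flat = text.replace("\r\n", ",")      # every line break becomes a comma
masked = QUOTED_COMMA.sub(" ", flat)  # commas inside quotes become spaces
cells = masked.split(",")
rows = [cells[k:k + width] for k in range(0, len(cells), width)]
print(rows)  # [['id', 'name', 'team'], ['1', '"This is Team2"', 'BOS']]
```

A side effect worth knowing: any comma that legitimately belongs inside a quoted field is also turned into a space.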

If you want to convert every array row of the rdd into a single string per row, you can add the following:

.map(lambda x: ", ".join(x))

You should get:

playerID, yearID, gameNum, gameName, teamName, lgID, GP, startingPos
gomezle01, 1933, 1, Cricket, Team1, NYA, AL, 1
ferreri01, 1933, 2, Hockey, "This is Team2", BOS, AL, 1
gehrilo01, 1933, 3, "Game name is Cricket", Team3, NYA, AL, 1
gehrich01, 1933, 4, Hockey, "Here it is Team4", DET, AL, 1
dykesji01, 1933, 5, "Game name is Hockey", "Team name Team5", CHA, AL, 1