Replacing punctuation in a data frame based on a punctuation list

Question


7

Using Canopy and Pandas, I have a data frame a, defined as follows:

a=pd.read_csv('text.txt')

df=pd.DataFrame(a)

df.columns=["test"]

text.txt is a single-column file containing a list of strings made up of text, numbers, and punctuation.

Suppose the data frame (df) looks like this:

test

%hgh&12

abc123!!!

porkyfries

I would like my result to be:

test

hgh12

abc123

porkyfries

My effort so far:

import pandas as pd
from string import punctuation  # import the punctuation list from Python itself

a = pd.read_csv('text.txt')
df = pd.DataFrame(a)
df.columns = ["test"]  # define the dataframe

for p in list(punctuation):
    df2 = df.test.str.replace(p, '')
    df2 = pd.DataFrame(df2)

df2

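The commands above basically just return the same dataset. Any pointers are appreciated.

Edit: The reason I am using Pandas is that the data is large, spanning about 1M rows, and in future the code will be applied to lists 30M rows long. Long story short, I need to clean data in large datasets very efficiently.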

- BernardL

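Is all of the data text, or are there numbers as well? For example, if someone entered 3.14 as a string, do you really want to strip the period? - philshem

@philshem Yes, for this particular case. The data is large and may span millions of rows, which is why I am disregarding anything with punctuation and the like. For example, "Paracetemol 50mg 10% Discount" should return just "Paracetemol". Similarly, the misspelling "Actife@4d" should return "Actife4d", as a first-level filter. - BernardL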

3 Answers

5

It is easier to use replace with the right regex:

In [41]:

import pandas as pd
pd.set_option('display.notebook_repr_html', False)
df = pd.DataFrame({'text':['test','%hgh&12','abc123!!!','porkyfries']})
df
Out[41]:
         text
0        test
1     %hgh&12
2   abc123!!!
3  porkyfries

[4 rows x 1 columns]

Use a regex pattern that matches any character that is neither a word character (\w) nor whitespace (\s).

In [49]:

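# regex=True is needed on pandas 2.0+, where str.replace matches literally by default
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)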
df
Out[49]:
         text
0        test
1       hgh12
2      abc123
3  porkyfries

[4 rows x 1 columns]

- EdChum

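Hmm, that makes sense. But I may need to figure out why the & symbol was not removed. - BernardL

@user3288092 I just read the docs: strip removes leading and trailing characters, hence the error. You should use replace. - EdChum

What about Unicode punctuation, such as the various dashes? https://www.cs.tut.fi/~jkorpela/dashes.html#unidash - philshem

@user3288092 I have updated my answer after working out what I think is the correct regex pattern; it works for this limited sample data. - EdChum

@EdChum Hi, thanks for your answer, I think that approach can work. However, I would like to store the values I need to keep replacing in a list and use that list every time I run the command. The list will keep growing as more filter conditions are added. The approach has to be efficient and easy to update. - BernardL

@EdChum This works for now! Thanks - BernardL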
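
The list-driven variant asked about in the comments above could be sketched as follows. This is only an illustration, not something posted in the thread: the list name to_remove and the helper build_pattern are hypothetical. Each entry is passed through re.escape so that regex metacharacters are matched literally, and the entries are joined with | so the whole list is applied in one vectorized pass instead of one pass per entry.

```python
import re
import pandas as pd

# Hypothetical, user-maintained list of substrings to strip; extend as needed.
to_remove = ['%', '&', '!', '@', '...']

def build_pattern(items):
    """Join the items into one alternation, escaping regex metacharacters."""
    # Longer items first, so a multi-character entry is tried before a shorter one.
    ordered = sorted(items, key=len, reverse=True)
    return '|'.join(re.escape(s) for s in ordered)

df = pd.DataFrame({'test': ['%hgh&12', 'abc123!!!', 'porkyfries', 'Actife@4d']})
pattern = build_pattern(to_remove)

# Assign the result back to the column; regex=True makes pandas treat
# the pattern as a regular expression rather than a literal string.
df['test'] = df['test'].str.replace(pattern, '', regex=True)
print(df)
```

Growing the filter then only means appending to to_remove and rebuilding the pattern. Note that the loop in the question rebuilt df2 from the untouched df on every pass, so only the last character's replacement survived; a single combined pattern avoids that problem.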

1

str.translate is generally considered the cleanest and fastest way to remove punctuation (source)

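import string
# Python 3 form; translate(None, chars) only works on Python 2 byte strings
table = str.maketrans('', '', string.punctuation.replace('"', ''))
text = text.translate(table)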

Removing the punctuation from 'a' before loading it into pandas may be more efficient.

- philshem

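It returned an error saying that DataFrame has no attribute 'translate'. Sorry, I should also have mentioned that the data is large, which is why I am trying to do this in Pandas. - BernardL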
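
The error in the comment above arises because translate is a string method, not a DataFrame method. A minimal sketch of the same idea applied to a pandas column is shown below, assuming Python 3 and a column named test as in the question. It uses Series.str.translate with a table built by str.maketrans, and it strips all ASCII punctuation in string.punctuation (unlike the answer's snippet, it does not keep the double quote).

```python
import string
import pandas as pd

# Translation table mapping every ASCII punctuation character to None (deleted).
table = str.maketrans('', '', string.punctuation)

df = pd.DataFrame({'test': ['%hgh&12', 'abc123!!!', 'porkyfries']})

# Series.str.translate applies the table to each string in the column.
df['test'] = df['test'].str.translate(table)
print(df)
```

Whether this is faster than a regex replace on a given dataset is worth measuring rather than assuming.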


- Aakash Saxena · Accepted Answer

If you need to remove punctuation from a text column in a data frame:

In:

import re
import string
rem = string.punctuation
pattern = r"[{}]".format(rem)

pattern

Out:

'[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]'

In:

import pandas as pd
df = pd.DataFrame({'text':['book...regh', 'book...', 'boo,', 'book. ', 'ball, ', 'ballnroll"', '"rope"', 'rick % ']})
df

Out:

        text
0  book...regh
1      book...
2         boo,
3       book. 
4       ball, 
5   ballnroll"
6       "rope"
7      rick %

In:

# regex=True is needed on pandas 2.0+, where str.replace matches literally by default
df['text'] = df['text'].str.replace(pattern, '', regex=True)
df

You can replace the matched pattern with any character you like, e.g. replace(pattern, '$').

Out:

        text
0   bookregh
1       book
2        boo
3      book 
4      ball 
5  ballnroll
6       rope
7     rick
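
One caveat about the pattern above: string.punctuation contains a backslash directly followed by ], so inside the character class the regex engine reads \] as an escaped bracket and a literal backslash is not matched. The sketch below, which is an addition and not part of the answer, compares the pattern as written with a variant that passes the characters through re.escape first. The variable names naive and escaped are illustrative only.

```python
import re
import string
import pandas as pd

# Pattern as built in the answer: the "\]" pair is read as an escaped "]".
naive = r"[{}]".format(string.punctuation)
# Variant with every character escaped, so the backslash itself is in the class.
escaped = "[{}]".format(re.escape(string.punctuation))

s = pd.Series(['a\\b', 'x]y', 'rick % '])

naive_out = s.str.replace(naive, '', regex=True)
escaped_out = s.str.replace(escaped, '', regex=True)

print(naive_out.tolist())    # the backslash in the first string survives
print(escaped_out.tolist())  # the backslash is removed as well
```

For data that never contains backslashes the two patterns behave the same, so the escaped form mainly matters as a safeguard.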