Removing Unicode from text in pandas

Question

5

For a single string, the code below removes the Unicode characters and the newline/carriage-return characters:

t = "We've\xe5\xcabeen invited to attend TEDxTeen, an independently organized TED event focused on encouraging youth to find \x89\xdb\xcfsimply irresistible\x89\xdb\x9d solutions to the complex issues we face every day.,"

t2 = t.decode('unicode_escape').encode('ascii', 'ignore').strip()
import sys
sys.stdout.write(t2.strip('\n\r'))

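But when I try to write a function in pandas to apply this to every cell of a column, it either fails with an AttributeError or I get a warning that a value is trying to be set on a copy of a slice from a DataFrame.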

def clean_text(row):
    row= row["text"].decode('unicode_escape').encode('ascii', 'ignore')#.strip()
    import sys
    sys.stdout.write(row.strip('\n\r'))
    return row

Applied to my DataFrame:

df["text"] = df.apply(clean_text, axis=1)

How can I apply this code to every element of a Series?

- user2476665

1

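If you remove all Unicode characters, you'll end up with an empty string... - Jongware

Then how do I keep the text but get rid of characters like \xE5\xCA and \x89\xBD\x9D? - user2476665

Could you provide a small example DataFrame or Series for which this fails? - Scott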

3 Answers

8

Actually, I can't reproduce your error: the following code runs on my machine without any error or warning.

df = pd.DataFrame([t,t,t],columns = ['text'])
df["text"] = df.apply(clean_text, axis=1)

In case it helps, I think a more "pandas" way to handle this kind of problem is to use a regular expression with one of the DataFrame.str methods, for example:

df["text"] = df["text"].str.replace(r'[^\x00-\x7F]', '', regex=True)
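
As a minimal, self-contained sketch of the regex approach, assuming Python 3 and a recent pandas release (where str.replace only treats the pattern as a regular expression when regex=True is passed), something like the following should work. The sample DataFrame is hypothetical and only stands in for the data in the question.

```python
import pandas as pd

# Hypothetical sample data: the non-ASCII characters stand in for the
# stray characters shown in the question.
df = pd.DataFrame(
    {
        "text": [
            "We've\xe5\xcabeen invited",
            "\x89\xdb\xcfsimply irresistible\x89\xdb\x9d\n",
        ]
    }
)

# Drop every character outside the ASCII range, then trim surrounding
# whitespace (including newlines and carriage returns).
df["text"] = (
    df["text"]
    .str.replace(r"[^\x00-\x7F]", "", regex=True)
    .str.strip()
)

print(df["text"].tolist())
```

Note that this removes every non-ASCII character, including legitimate accented letters, so it is only appropriate when the text is expected to be plain ASCII.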

- maxymoo

Python 2 or Python 3? The OP didn't specify, and I think the default assumption on Stack Overflow is still Python 2 unless stated otherwise. - GreenAsJade

2

Something like this, where column_to_convert is the column you want to convert:

series = df['column_to_convert']
df["text"] =  [s.encode('ascii', 'ignore').strip()
               for s in series.str.decode('unicode_escape')]

- Alexander

- Anzel · Accepted Answer

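The problem seems to be that you are trying to access and modify row['text'] and then return the row itself inside the applied function. When you call apply on a DataFrame, the function is applied to each Series, so changing it to the following should help: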

import pandas as pd

df = pd.DataFrame([t for _ in range(5)], columns=['text'])

df 
                                                text
0  We've������been invited to attend TEDxTeen, an ind...
1  We've������been invited to attend TEDxTeen, an ind...
2  We've������been invited to attend TEDxTeen, an ind...
3  We've������been invited to attend TEDxTeen, an ind...
4  We've������been invited to attend TEDxTeen, an ind...

def clean_text(row):
    # return the list of decoded cell in the Series instead 
    return [r.decode('unicode_escape').encode('ascii', 'ignore') for r in row]

df['text'] = df.apply(clean_text)

df
                                                text
0  We'vebeen invited to attend TEDxTeen, an indep...
1  We'vebeen invited to attend TEDxTeen, an indep...
2  We'vebeen invited to attend TEDxTeen, an indep...
3  We'vebeen invited to attend TEDxTeen, an indep...
4  We'vebeen invited to attend TEDxTeen, an indep...
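
To illustrate what "applied to each Series" means, here is a small sketch (assuming Python 3; the two-column DataFrame is hypothetical) showing what the function receives under the default axis=0 compared with axis=1:

```python
import pandas as pd

# Hypothetical two-column frame, used only to show what apply passes in.
df = pd.DataFrame({"text": ["a\xe5b", "c\xcad"], "n": [1, 2]})

# Default (axis=0): the function is called once per column, and each
# column arrives as a Series whose .name is the column label.
col_names = df.apply(lambda col: col.name)

# axis=1: the function is called once per row, and each row arrives as a
# Series whose .name is the row's index label.
row_names = df.apply(lambda row: row.name, axis=1)

print(list(col_names))
print(list(row_names))
```

With axis=0 the function sees whole columns, which is why the version above iterates over the cells of the Series it is given.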

Alternatively, you can use a lambda as shown below and apply it directly to the text column only:

df['text'] = df['text'].apply(lambda x: x.decode('unicode_escape').\
                                          encode('ascii', 'ignore').\
                                          strip())
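
The snippets in this thread call decode on each cell, which only works on Python 2 byte strings. As a hedged sketch of a Python 3 equivalent, assuming the cells are ordinary str values (so the unicode_escape step is not needed), the vectorized Series.str.encode and Series.str.decode accessors can do the same round trip. The sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample data with non-ASCII characters and trailing whitespace.
df = pd.DataFrame({"text": ["We've\xe5\xcabeen invited ", "caf\xe9\r\n"]})

# Encode to ASCII, silently dropping anything that cannot be represented,
# then decode back to str and trim surrounding whitespace.
df["text"] = (
    df["text"]
    .str.encode("ascii", errors="ignore")
    .str.decode("ascii")
    .str.strip()
)

print(df["text"].tolist())
```

This avoids both the row-wise apply and the explicit lambda, at the cost of discarding every non-ASCII character.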