Removing stop words from a pandas DataFrame

3
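I have the script below, and in the last line I am trying to remove stop words from a column named 'response'.
The problem is that instead of "A bit annoyed" becoming "bit annoyed", individual letters are being stripped out, because 'a' is a stop word.
Can anyone advise?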
   import pandas as pd
   from textblob import TextBlob
   import numpy as np
   import os
   import nltk
   nltk.download('stopwords')
   from nltk.corpus import stopwords
   stop = stopwords.words('english')

   path = 'Desktop/fanbase2.csv'
   df = pd.read_csv(path, delimiter=',', header='infer', encoding = "ISO-8859-1")
   #remove punctuation
   df['response'] = df.response.str.replace(r"[^\w\s]", "", regex=True)
   #make it all lower case
   df['response'] = df.response.apply(lambda x: x.lower())
   #Handle strange character in source
   df['response'] = df.response.str.replace("‰Ûª", "''")

   df['response'] = df['response'].apply(lambda x: [item for item in x if item not in stop])
1 Answer

10
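In the list comprehension (the last line), you check each word against the stop words and keep it if it is not one of them. But you are passing it a string, so the comprehension iterates over individual characters rather than words. You need to split the string for the list comprehension to work.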
df = pd.DataFrame({'response':['This is one type of response!', 'Though i like this one more', 'and yet what is that?']})

df['response'] = df.response.str.replace(r"[^\w\s]", "", regex=True).str.lower()

df['response'] = df['response'].apply(lambda x: [item for item in x.split() if item not in stop])


0    [one, type, response]
1      [though, like, one]
2                    [yet]
If you want the response back as a string, change the last line to:

df['response'] = df['response'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop]))

0    one type response
1      though like one
2                  yet
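
To see why the original line dropped letters, here is a minimal pure-Python sketch of the difference between iterating over a string and iterating over its split words. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical names used only for illustration; the set is a small hand-picked stand-in for `stopwords.words('english')` so the snippet runs without downloading NLTK data.

```python
# Hypothetical, hand-picked stop words; a stand-in for stopwords.words('english').
STOP_WORDS = {"a", "i", "is", "of", "this", "and", "what", "that", "more"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Split text on whitespace, drop stop words, and join the rest back up."""
    return " ".join(word for word in text.split() if word not in stop_words)


sentence = "a bit annoyed"

# Iterating over the string itself yields single characters,
# so every character that happens to be a stop word ("a", "i") is dropped.
by_character = [ch for ch in sentence if ch not in STOP_WORDS]

# Splitting first yields whole words, so only the word "a" is dropped.
by_word = remove_stop_words(sentence)

print(by_character)
print(by_word)
```

A helper like this could presumably be passed straight to `df['response'].apply(remove_stop_words)`. Using a set rather than a list for the stop words makes each membership check constant time, which may matter on large columns.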

1
Thanks, that solved it perfectly! Sorry for such a silly question, but how does .split() know to split on spaces without being told explicitly? - kikee1222
1
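The default separator for split is whitespace. If your string is delimited by something else you need to specify that delimiter, but that rarely happens in sentences :) - Vaishali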
2
Thank you so much for all your help! :) :) :) - kikee1222
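
On the `.split()` question raised in the comments, a short sketch of how the default differs from an explicit separator. The sample string is made up for illustration.

```python
text = "a  bit\tannoyed\n"

# No argument: split on runs of any whitespace (spaces, tabs, newlines)
# and discard empty strings.
default_split = text.split()

# Explicit single space: split at every space character only,
# so the double space produces an empty string and the tab is left alone.
space_split = text.split(" ")

# Any other delimiter has to be passed explicitly.
comma_split = "a,bit,annoyed".split(",")

print(default_split)
print(space_split)
print(comma_split)
```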
