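I have the following script. In the last line I am trying to remove stopwords from a column named 'response'.
The problem is that instead of turning "A bit annoyed" into "bit annoyed", individual letters are being removed from the text, because 'a' is a stopword.
Can anyone give me some advice?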
import pandas as pd
from textblob import TextBlob
import numpy as np
import os
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
path = 'Desktop/fanbase2.csv'
df = pd.read_csv(path, delimiter=',', header='infer', encoding = "ISO-8859-1")
#remove punctuation
df['response'] = df.response.str.replace("[^\w\s]", "")
#make it all lower case
df['response'] = df.response.apply(lambda x: x.lower())
#Handle strange character in source
df['response'] = df.response.str.replace("‰Ûª", "''")
df['response'] = df['response'].apply(lambda x: [item for item in x if item not in stop])
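A likely cause is that the last line iterates over `x`, which is a string, so `item` is a single character rather than a word. Every character that happens to be a one-letter stopword (such as 'a' or 'i') is then dropped. Below is a minimal sketch of a word-level version. The helper name `remove_stopwords` and the small `STOP` set are hypothetical, used here only so the example runs without downloading the NLTK corpus; in the script above you would pass `stop` from `stopwords.words('english')` instead.

```python
# Hypothetical stand-in for stopwords.words('english'); a tiny subset for illustration.
STOP = {"a", "an", "the", "is", "i", "am"}


def remove_stopwords(text, stop=STOP):
    """Drop stopwords from a string, comparing whole words rather than characters."""
    # Iterating over a str yields single characters, so split into words first.
    words = text.split()
    kept = [word for word in words if word not in stop]
    # Join back into a string so the result has the same type as the input.
    return " ".join(kept)


# What the original lambda effectively does: filter character by character.
char_level = "".join(ch for ch in "a bit annoyed" if ch not in STOP)

# Word-level filtering keeps the remaining words intact.
word_level = remove_stopwords("a bit annoyed")

print(repr(char_level))
print(repr(word_level))
```

Applied to the DataFrame, this would look something like `df['response'] = df['response'].apply(lambda x: remove_stopwords(x, stop))`. If you would rather keep a list of tokens per row, return `kept` directly instead of joining it.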