Python - NLTK splits off punctuation

I'm still fairly new to Python, and I want to use NLTK to remove stopwords from a file. The code works, but it splits punctuation off into separate tokens, so if my text is a tweet containing a mention (@user), I end up with "@ user". Later I need to do a word frequency count, and I need mentions and hashtags to stay intact. My code:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import codecs

arquivo = open('newfile.txt', encoding="utf8")
# build the stopword set and open the output file once, not on every line
stop_word = set(stopwords.words("portuguese"))
fp = codecs.open("stopwords.txt", "a", "utf-8")
linha = arquivo.readline()
while linha:
    word_tokens = word_tokenize(linha)
    # keep only tokens that are not Portuguese stopwords
    filtered_sentence = [w for w in word_tokens if w not in stop_word]
    for words in filtered_sentence:
        fp.write(words + " ")
    fp.write("\n")
    linha = arquivo.readline()
fp.close()
arquivo.close()
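
To make the issue concrete, here is a minimal reproduction of the splitting behavior described above (the sample tweet text is made up):

from nltk.tokenize import word_tokenize

# word_tokenize treats "@" and "#" as standalone punctuation,
# so mentions and hashtags are broken into two tokens:
print(word_tokenize("@user olá #python"))
# ['@', 'user', 'olá', '#', 'python']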

Edit: Not sure if this is the best way, but this is how I fixed it (it also needs import string):

import string  # for string.punctuation

for words in filtered_sentence:
    fp.write(words)
    # only add a space after non-punctuation tokens,
    # so "@" gets glued back onto the following word
    if words not in string.punctuation:
        fp.write(" ")
fp.write("\n")
1 Answer

Instead of word_tokenize, you can use the Twitter-aware tokenizer that NLTK provides:

from nltk.tokenize import TweetTokenizer

...
tknzr = TweetTokenizer()
...
word_tokens = tknzr.tokenize(linha)
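
As a minimal sketch of how this slots into the question's loop, assuming the same newfile.txt input (the Counter-based frequency count at the end is my addition for the word count the question mentions, not part of the answer):

from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()  # keeps "@user" and "#tag" as single tokens
stop_word = set(stopwords.words("portuguese"))
freq = Counter()

with open('newfile.txt', encoding="utf8") as arquivo:
    for linha in arquivo:
        word_tokens = tknzr.tokenize(linha)
        # drop Portuguese stopwords, keep everything else
        filtered_sentence = [w for w in word_tokens if w not in stop_word]
        freq.update(filtered_sentence)

print(freq.most_common(10))  # mentions and hashtags survive intact

Note that TweetTokenizer also accepts options such as strip_handles=True, which removes mentions entirely; the default (strip_handles=False) is what you want here.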
