Removing words from a file

Question

3

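I'm trying to process a plain text file and remove from it the words listed in a second file (a stop-word file), in which the words are separated by newline characters ("\n").

At the moment I convert both files to lists so that I can compare the elements of each list. I have a function that runs, but it does not remove every word I specify in the stop-word file. Any help is much appreciated.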

def elimstops(file_str): #takes as input a string for the stopwords file location
  stop_f = open(file_str, 'r')
  stopw = stop_f.read()
  stopw = stopw.split('\n')
  text_file = open('sample.txt') #Opens the file whose stop words will be eliminated
  prime = text_file.read()
  prime = prime.split(' ') #Splits the string into a list separated by a space
  tot_str = "" #total string
  i = 0
  while i < (len(stopw)):
    if stopw[i] in prime:
      prime.remove(stopw[i]) #removes the stopword from the text
    else:
      pass
    i += 1
  # Creates a new string from the compilation of list elements 
  # with the stop words removed
  for v in prime:
    tot_str = tot_str + str(v) + " " 
  return tot_str

- user1765792

3 Answers

0

I think your problem is in these lines:

    if stopw[i] in prime:
      prime.remove(stopw[i]) #removes the stopword from the text

This removes only the first occurrence of stopw[i] from prime. To fix that, loop until no occurrence is left:

    while stopw[i] in prime:
      prime.remove(stopw[i]) #removes the stopword from the text

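However, this will run very slowly, because both `in prime` and `prime.remove` have to iterate over prime. That means you end up with running time quadratic in the length of the text. If you use a generator as F.J. suggests, the running time is linear in the length of the text, which is much better.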
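
To make the difference concrete, here is a small sketch of how `list.remove` behaves. The word list is made up for illustration and is not taken from the question's files.

```python
# list.remove deletes only the first matching element.
words = ["the", "cat", "and", "the", "dog"]
words.remove("the")
after_single_remove = list(words)  # snapshot: one "the" is still present

# Repeating the removal until no match is left clears every occurrence,
# but each pass rescans the list, which is where the quadratic cost comes from.
while "the" in words:
    words.remove("the")

print(after_single_remove)  # ['cat', 'and', 'the', 'dog']
print(words)                # ['cat', 'and', 'dog']
```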

- Sam Mussmann

0

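I don't know Python, but here is a general approach that runs in O(n) + O(m) time, i.e. linear.

1: Add all the words from the stop-word file to a map (hash table).
2: Read your plain text file and add its words to a list. While doing step 2, check whether the word just read is in the map; if it is, skip it, otherwise add it to the list.

At the end, the list contains all the words you need, that is, everything except the words you wanted to remove.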
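
The two steps above could be sketched in Python roughly as follows. The function name `filter_words` and its string parameters are hypothetical, chosen for illustration; a `set` stands in for the map, and `str.split()` with no argument splits on any whitespace, which differs slightly from the question's `split(' ')`.

```python
def filter_words(text, stop_text):
    """Return the words of `text` that do not appear in `stop_text`."""
    # Step 1: put every stop word into a set (hash-based, fast lookups).
    stop_words = set(stop_text.split())

    # Step 2: walk the text once, keeping only words not in the set.
    kept = []
    for word in text.split():
        if word not in stop_words:
            kept.append(word)
    return kept


print(filter_words("the cat and the dog", "the\nand\n"))  # ['cat', 'dog']
```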

- Adrian

- Andrew Clark · Accepted Answer

Here is an alternative solution using a generator expression:

tot_str = ' '.join(word for word in prime if word not in stopw)

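To make this more efficient, convert stopw to a set with stopw = set(stopw), so that each membership test no longer scans the whole list.

If your sample.txt is not just space-separated words but ordinary sentences with punctuation, splitting on spaces leaves the punctuation attached to the words. To get around this, you can use the re module to split the string on whitespace or punctuation: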

import re
prime = re.split(r'\W+', text_file.read())
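
Putting these suggestions together, a complete version might look like the sketch below. The function name `remove_stop_words` and its two path parameters are assumptions made for illustration, not the asker's actual interface. Two details are worth noting: `re.split(r'\W+', ...)` can yield empty strings at the start or end of the text, so they are filtered out, and the comparison is case-sensitive.

```python
import re


def remove_stop_words(text_path, stop_path):
    """Return the text at `text_path` with the words in `stop_path` removed."""
    # Read the stop words into a set for fast membership tests.
    with open(stop_path) as stop_file:
        stop_words = set(stop_file.read().split())

    # Split the text on runs of non-word characters (spaces, punctuation).
    with open(text_path) as text_file:
        words = re.split(r'\W+', text_file.read())

    # Drop empty strings produced by the split, then drop stop words.
    return ' '.join(w for w in words if w and w not in stop_words)
```

Note that punctuation is discarded by the split, so the result is a plain space-separated string of the remaining words.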