How to remove a list of words from a list of strings

Question

How to remove a list of words from a list of strings

python regex list-comprehension stop-words

11

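I apologize if the question is a bit confusing. It is similar to this question

I think the question above is close to what I want, but it is in Clojure.

There is another question

I need something like the '[br]' handling in that question, except that I have a whole list of strings that need to be searched for and removed.

I hope I am making myself clear.

I think this is caused by the immutability of strings in Python.

I have a list of noise words that need to be removed from a list of strings.

If I use a list comprehension, I end up searching the same string again and again. As a result, only "of" gets removed and "the" does not. So my modified list looks like this: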

places = ['New York', 'the New York City', 'at Moscow']  # and many more

noise_words_list = ['of', 'the', 'in', 'for', 'at']

for place in places:
    stuff = [place.replace(w, "").strip() for w in noise_words_list if place.startswith(w)]

I would like to know what I am doing wrong.

- prabhu

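You have not made yourself clear. State your question clearly here, and if you think it is necessary, put links to similar questions and answers below. - Humphrey Bogart

4 Answers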

11

Here is my attempt. It uses a regular expression.

import re
pattern = re.compile(r"(of|the|in|for|at)\W", re.I)  # raw string keeps \W intact
phrases = ['of New York', 'of the New York']
list(map(lambda phrase: pattern.sub("", phrase), phrases))  # ['New York', 'New York']

Without the lambda:

[pattern.sub("", phrase) for phrase in phrases]

Update

Fixed the bug pointed out by gnibbler (thanks!):

pattern = re.compile("\\b(of|the|in|for|at)\\W", re.I)
phrases = ['of New York', 'of the New York', 'Spain has rain']
[pattern.sub("", phrase) for phrase in phrases] # ['New York', 'New York', 'Spain has rain']

@prabhu: The change above avoids cutting the trailing "in" off "Spain". To verify, run both versions of the regex against the phrase "Spain has rain".
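A minimal sketch of that check, runnable on its own. The names loose_pattern and bounded_pattern are introduced here for illustration only and do not appear in the answer; the two patterns are the ones shown before and after the update.

```python
import re

# First version from the answer: nothing stops a match in the middle of a word.
loose_pattern = re.compile(r"(of|the|in|for|at)\W", re.I)
# Updated version: \b requires the noise word to start at a word boundary.
bounded_pattern = re.compile(r"\b(of|the|in|for|at)\W", re.I)

phrase = 'Spain has rain'
loose_result = loose_pattern.sub("", phrase)
bounded_result = bounded_pattern.sub("", phrase)

print(loose_result)    # Spahas rain  ("in " was cut out of "Spain ")
print(bounded_result)  # Spain has rain
```

The final "in" of "rain" survives in both versions only because the pattern insists on a trailing non-word character, and there is none at the end of the string.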

- Manoj Govindan

Thanks. It works that way. I now understand the concept of lambda more clearly, since I got a chance to implement it. - prabhu

1

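This does not work properly for the phrase "Spain has rain". It is easy to fix, though. - John La Rooy

@Gnibbler: Thanks for pointing that out. I am changing my answer accordingly. - Manoj Govindan

I added the word "max" to the pattern; in some cases it removes the word and in others it does not. This is strange. Someone should test it to see whether they get the same result. - almost a beginner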

4

>>> import re
>>> noise_words_list = ['of', 'the', 'in', 'for', 'at']
>>> phrases = ['of New York', 'of the New York']
>>> noise_re = re.compile('\\b(%s)\\W'%('|'.join(map(re.escape,noise_words_list))),re.I)
>>> [noise_re.sub('',p) for p in phrases]
['New York', 'New York']

- John La Rooy

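Wow! That is a really cool way of doing it, although I had to work hard to figure it out. :-) - prabhu

This does not seem to catch every instance of a word. For example, "of New York of" becomes "New York of". - Namey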

1

@Namey, you could use something like '\\W?\\b(%s)\\W?'. Without a comprehensive set of test cases, it becomes a bit of a whack-a-mole game. - John La Rooy
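One possible way to handle a noise word at the end of the string, sketched here as an alternative to the pattern suggested in the comment: match whole words with \b on both sides and tidy the whitespace afterwards. The helper name strip_noise_words is hypothetical, and this variant has only been checked against the few phrases shown.

```python
import re

noise_words_list = ['of', 'the', 'in', 'for', 'at']

# \b on both sides matches whole words only, including one at the very end.
noise_re = re.compile(r'\b(?:%s)\b' % '|'.join(map(re.escape, noise_words_list)), re.I)

def strip_noise_words(phrase):
    """Remove whole-word noise words, then collapse the leftover whitespace."""
    without_noise = noise_re.sub('', phrase)
    return ' '.join(without_noise.split())

phrases = ['of New York of', 'Spain has rain', 'at Moscow']
cleaned = [strip_noise_words(p) for p in phrases]
print(cleaned)  # ['New York', 'Spain has rain', 'Moscow']
```

Because the pattern no longer consumes the character after the word, the split/join step is what removes the stray spaces.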

1

Since you wanted to know what you are doing wrong, this line:

stuff = [place.replace(w, "").strip() for w in noise_words_list if place.startswith(w)]

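runs once per place and then iterates over the noise words. First it checks for "of": your place (e.g. "of the New York") is tested to see whether it starts with "of". It is transformed (the replace and strip calls) and added to the result list. The crucial point is that this result is never examined again. Each word you iterate over in the comprehension can only add a new result to the list, always computed from the original place. So the next word is "the", and your place ("of the New York") does not start with "the", so no new result is added.

I assume the result you ultimately want is built from your place variables. A simpler procedural version that is easier to read and understand would be (untested):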
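To make that behaviour concrete, here is a small sketch that runs the comprehension from the question on a single place. The sample value 'of the New York' is borrowed from the phrases used in the other answers; everything else is the question's own code.

```python
noise_words_list = ['of', 'the', 'in', 'for', 'at']
place = 'of the New York'

# Every noise word is tested against the ORIGINAL place, never against
# the string produced by an earlier replacement.
stuff = [place.replace(w, "").strip() for w in noise_words_list if place.startswith(w)]
print(stuff)  # ['the New York']

# A place that starts with no noise word yields an empty list.
untouched = ['New York'.replace(w, "").strip() for w in noise_words_list
             if 'New York'.startswith(w)]
print(untouched)  # []
```

Only "of" passes the startswith test, so the list holds one partially cleaned string and "the" is never removed.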

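# places and noise_words_list are the lists from the question
results = []
for place in places:
    for word in noise_words_list:
        if place.startswith(word):
            # rebind place so later words are checked against the cleaned string
            place = place.replace(word, "").strip()
    results.append(place)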

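Note that replace() removes the word from anywhere in the string, even where it occurs as a mere substring. You can avoid this by using a regular expression with a pattern along the lines of ^the\b.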
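A rough sketch of that idea, assuming only leading noise words should be stripped. The helper strip_leading_noise and the combined pattern are illustrative and not part of the answer; the loop repeats the substitution so that stacked prefixes such as "of the" are both removed.

```python
import re

noise_words_list = ['of', 'the', 'in', 'for', 'at']

# ^ anchors the match at the start of the string; \b makes sure the noise
# word ends at a word boundary, so "then" or "Athens" are left alone.
leading_noise_re = re.compile(
    r'^(?:%s)\b\s*' % '|'.join(map(re.escape, noise_words_list)), re.I)

def strip_leading_noise(place):
    """Strip leading noise words repeatedly until nothing changes."""
    previous = None
    while previous != place:
        previous = place
        place = leading_noise_re.sub('', place)
    return place

places = ['of the New York', 'at Moscow', 'then Athens']
print([strip_leading_noise(p) for p in places])
# ['New York', 'Moscow', 'then Athens']
```

For comparison, 'then Athens'.replace('the', '') would give 'n Athens', which is the substring problem described above.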

- wds


- Tony Veijalainen · Accepted Answer

Without using regular expressions, you could do it like this:

places = ['of New York', 'of the New York']

noise_words_set = {'of', 'the', 'at', 'for', 'in'}
stuff = [' '.join(w for w in place.split() if w.lower() not in noise_words_set)
         for place in places
         ]
print(stuff)  # ['New York', 'New York']