Converting a string to a list of words?

93

I'm trying to convert a string to a list of words using Python. I want to take something like the following:

string = 'This is a string, with words!'

Then convert it to something like this:

list = ['This', 'is', 'a', 'string', 'with', 'words']

Notice the omission of punctuation and spaces. What would be the fastest way of going about this?

- rectangletangle

15 Answers

106

Try this:

import re

mystr = 'This is a string, with words!'
wordList = re.sub(r"[^\w]", " ", mystr).split()

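How it works:

From the docs:

re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn't found, string is returned unchanged. repl can be a string or a function.

In our case:

The pattern is any non-word character.

[\w] means any word character. For ASCII text it is equivalent to the character set [a-zA-Z0-9_]: a to z, A to Z, 0 to 9 and underscore. In Python 3, with a str pattern, \w also matches letters and digits from other scripts unless the re.ASCII flag is set.

So we match every non-word character and replace it with a space.

Then split() splits the string on whitespace and returns the pieces as a list.

So 'hello-world' becomes 'hello world' with re.sub, and then ['hello', 'world'] after split().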
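The two steps can be traced separately. This is only a minimal sketch: the names `spaced`, `words`, `unicode_words` and `ascii_words` are mine, and the last two lines are there just to illustrate how `\w` behaves with and without `re.ASCII` in Python 3.

```python
import re

mystr = 'This is a string, with words!'

# Step 1: replace every non-word character with a space.
spaced = re.sub(r"[^\w]", " ", mystr)
print(repr(spaced))   # 'This is a string  with words '

# Step 2: split() with no arguments splits on runs of whitespace
# and never returns empty strings.
words = spaced.split()
print(words)          # ['This', 'is', 'a', 'string', 'with', 'words']

# "na\u00efve caf\u00e9" is "naive cafe" written with accented letters.
sample = "na\u00efve caf\u00e9"
unicode_words = re.sub(r"[^\w]", " ", sample).split()                # accented letters kept
ascii_words = re.sub(r"[^\w]", " ", sample, flags=re.ASCII).split()  # accented letters dropped
print(unicode_words, ascii_words)
```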

Let me know if anything is unclear.

- Bryan

Remember to handle apostrophes and hyphens, since they aren't included in \w. - Brōtsyorfuzthrāx

2

You may also want to handle typographic apostrophes and non-breaking hyphens. - Brōtsyorfuzthrāx

string.split() is easier. - Ege

38

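Doing this properly is quite complex. For your research, it is known as word tokenization. You should look at NLTK if you want to see what others have done, rather than starting from scratch: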

>>> import nltk
>>> paragraph = u"Hi, this is my first sentence. And this is my second."
>>> sentences = nltk.sent_tokenize(paragraph)
>>> for sentence in sentences:
...     nltk.word_tokenize(sentence)
[u'Hi', u',', u'this', u'is', u'my', u'first', u'sentence', u'.']
[u'And', u'this', u'is', u'my', u'second', u'.']
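
Note that word_tokenize returns punctuation marks as tokens of their own. To get the list the question asks for, they can be filtered out afterwards. A small sketch, with the token list copied from the output above so that it runs without NLTK installed; the names `tokens` and `words` are mine:

```python
# Tokens copied from the word_tokenize output shown above.
tokens = ['Hi', ',', 'this', 'is', 'my', 'first', 'sentence', '.']

# Keep a token only if it contains at least one letter or digit.
words = [t for t in tokens if any(ch.isalnum() for ch in t)]
print(words)  # ['Hi', 'this', 'is', 'my', 'first', 'sentence']
```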

- Tim McNamara

21

The simplest way:

>>> import re
>>> string = 'This is a string, with words!'
>>> re.findall(r'\w+', string)
['This', 'is', 'a', 'string', 'with', 'words']
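
On the "fastest" part of the question: I have not benchmarked this carefully, but a rough comparison against the re.sub(...).split() approach can be made with timeit. This is a sketch only; the helper names `via_sub_split` and `via_findall` and the test input are made up for illustration, and the timings will vary by machine and Python version.

```python
import re
import timeit

text = 'This is a string, with words! ' * 1000

# Precompile the patterns so the loop measures matching, not compilation.
non_word = re.compile(r'[^\w]')
word = re.compile(r'\w+')

def via_sub_split(s):
    """Replace non-word characters with spaces, then split on whitespace."""
    return non_word.sub(' ', s).split()

def via_findall(s):
    """Collect every run of word characters directly."""
    return word.findall(s)

# Both approaches should agree before they are compared for speed.
assert via_sub_split(text) == via_findall(text)

for fn in (via_sub_split, via_findall):
    seconds = timeit.timeit(lambda: fn(text), number=100)
    print(f'{fn.__name__}: {seconds:.4f} s for 100 runs')
```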

- JBernardo

15

Using string.punctuation for completeness:

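import re
import string

s = 'This is a string, with words!'
# re.escape keeps characters such as ], \, ^ and - literal inside the character class
x = re.sub('[' + re.escape(string.punctuation) + ']', '', s).split()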

This handles newlines as well, since split() with no arguments splits on any whitespace.
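
One thing to watch for: deleting punctuation outright glues hyphenated words together. The sketch below shows that with str.translate, which is a regex-free alternative and not what the code above uses; the sample string and the names `delete_table` and `space_table` are mine.

```python
import string

s = 'A custom-built string,\nwith words!'

# With a third argument, str.maketrans maps each listed character to None,
# so translate() deletes all ASCII punctuation.
delete_table = str.maketrans('', '', string.punctuation)
deleted = s.translate(delete_table).split()
print(deleted)  # ['A', 'custombuilt', 'string', 'with', 'words']

# Mapping punctuation to spaces instead keeps the two halves apart.
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
spaced = s.translate(space_table).split()
print(spaced)   # ['A', 'custom', 'built', 'string', 'with', 'words']
```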

- mtrw

9

Well, you could use

import re
list = re.sub(r'[.!,;?]', ' ', string).split()

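Note that list is the name of a built-in type and string is the name of a standard-library module, so you probably don't want to use them as variable names.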
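
To see why the name matters, here is a small sketch of what happens once list has been rebound. The function `demo_shadowing` is a made-up wrapper so that the shadowing stays local to it.

```python
import re

def demo_shadowing():
    """Rebind the name `list` locally, then try to use it as the built-in type."""
    text = 'This is a string, with words!'
    list = re.sub(r'[.!,;?]', ' ', text).split()  # shadows the built-in list type
    try:
        list('abc')  # the name now refers to a list object, not the type
    except TypeError as exc:
        return str(exc)

print(demo_shadowing())  # 'list' object is not callable
```

Picking names such as text and words avoids the problem.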

- Cameron

6

Inspired by @mtrw's answer, but improved to strip punctuation only at word boundaries:

import re
import string

def extract_words(s):
    # Strip punctuation only from the start and end of each whitespace-separated token
    pattern = '^[{0}]+|[{0}]+$'.format(re.escape(string.punctuation))
    return [re.sub(pattern, '', w) for w in s.split()]

>>> text = 'This is a string, with words!'
>>> extract_words(text)
['This', 'is', 'a', 'string', 'with', 'words']

>>> text = '''I'm a custom-built sentence with "tricky" words like https://stackoverflow.com/.'''
>>> extract_words(text)
["I'm", 'a', 'custom-built', 'sentence', 'with', 'tricky', 'words', 'like', 'https://stackoverflow.com']

- Paulo Freitas

4

Personally, I think this is slightly cleaner than the answers provided.

import re

def split_to_words(sentence):
    return list(filter(lambda w: len(w) > 0, re.split(r'\W+', sentence)))  # Use sentence.lower(), if needed

- Akhil Cherian Verghese

3

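A regular expression for words would give you the most control. You would want to think carefully about how to deal with words that contain dashes or apostrophes, like "I'm".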
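
As one possible starting point, and only under my own assumption about what counts as a word, the pattern below treats a word as runs of \w characters joined by a single apostrophe or hyphen. It also accepts the typographic apostrophe (U+2019) and the non-breaking hyphen (U+2011). The name `WORD` and the sample sentence are made up for illustration.

```python
import re

# A word is a run of \w characters, optionally followed by more runs that are
# each joined by one apostrophe or hyphen: ASCII ' and -, U+2019, U+2011.
WORD = re.compile(r"\w+(?:['\u2019\-\u2011]\w+)*")

found = WORD.findall("I'm a custom-built sentence, isn\u2019t it?")
print(found)
```

Leading or trailing apostrophes and hyphens are left out of the match, so quoted words come back without their quotes.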

- tofutim

1

This way, you remove every character that is not a letter of the alphabet:

def wordsToList(strn):
    L = strn.split()
    cleanL = []
    abc = 'abcdefghijklmnopqrstuvwxyz'
    ABC = abc.upper()
    letters = abc + ABC
    for e in L:
        word = ''
        for c in e:
            if c in letters:
                word += c
        if word != '':
            cleanL.append(word)
    return cleanL

s = 'She loves you, yea yea yea! '
L = wordsToList(s)
print(L)  # ['She', 'loves', 'you', 'yea', 'yea', 'yea']

I'm not sure whether this is a fast, optimal or even correct way to program it.
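
For comparison, the same filtering can be written more compactly. This is a sketch that I believe behaves the same as the loop above for any input; `words_to_list` is a hypothetical name, and string.ascii_letters stands in for the hand-built alphabet.

```python
import string

def words_to_list(text):
    """Keep only ASCII letters in each whitespace-separated chunk."""
    letters = set(string.ascii_letters)  # set lookups are fast
    cleaned = (''.join(c for c in chunk if c in letters) for chunk in text.split())
    # Drop chunks that contained no letters at all, e.g. '--' or '123'.
    return [word for word in cleaned if word]

print(words_to_list('She loves you, yea yea yea! '))
# ['She', 'loves', 'you', 'yea', 'yea', 'yea']
```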

- BenyaR

- gilgamar · Accepted Answer

I think this is the simplest way for anyone else stumbling on this post, given the late response:

>>> string = 'This is a string, with words!'
>>> string.split()
['This', 'is', 'a', 'string,', 'with', 'words!']
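
As the output shows, the punctuation stays attached to the words. If it needs to go, one option that still avoids regular expressions is to strip it from each token afterwards. A minimal sketch, which is my addition and not part of the answer above:

```python
import string

text = 'This is a string, with words!'

# str.strip(chars) removes any of the given characters from both ends of a token.
words = [token.strip(string.punctuation) for token in text.split()]
# Drop tokens that consisted only of punctuation.
words = [w for w in words if w]
print(words)  # ['This', 'is', 'a', 'string', 'with', 'words']
```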