How do I split a string into a list of words?

Question

638

How do I split a sentence and store each word in a list? For example:

"these are words"   ⟶   ["these", "are", "words"]

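To split on other delimiters, see "Split a string by a delimiter in Python".

To split a string into individual characters, see "How do I split a string into a list of characters?".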

- Thanx

5

As written, the code prints the full list of words once for each word in the list. I think you meant to use print(word) on the last line. - tgray

10 Answers

479

To split the string text on any run of consecutive whitespace:

words = text.split()

To split the string text on a custom delimiter such as ",":

words = text.split(",")

The words variable will be a list containing the pieces of text that lie between the delimiters.
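One detail worth seeing in action: with an explicit delimiter, split() keeps any whitespace around each field and returns empty strings for adjacent delimiters. The sketch below uses a made-up sample string, and the cleanup step is one possible way to tidy the result, not part of the answer above.

```python
text = "apples, pears ,plums,,  figs "

# Splitting on "," keeps surrounding whitespace and empty fields.
raw = text.split(",")
print(raw)

# Illustrative cleanup: strip each field and drop the ones left empty.
words = [field.strip() for field in raw if field.strip()]
print(words)
```

If the fields never contain stray spaces or empty entries, the plain text.split(",") call is enough.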

- zalew

93

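Use str.split():

Return a list of the words in the string, using sep as the delimiter string ... If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.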

>>> line = "a sentence with a few words"
>>> line.split()
['a', 'sentence', 'with', 'a', 'few', 'words']

- gimel

@warvariuc - should have linked to https://docs.python.org/2/library/stdtypes.html#str.split - gimel

2

How about splitting the word "sentence" into "s", "e", "n", "t", ...? - curiouscheese

1

@xkderhaka See https://dev59.com/6W445IYBdhLWcg3wOnnW. But keep in mind that Stack Overflow is not a discussion forum. - Karl Knechtel

61

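Depending on what you plan to do with your sentence-as-a-list, you may want to look at the Natural Language Toolkit (NLTK). It deals heavily with text processing and evaluation, and you can also use it to solve your problem.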

import nltk
words = nltk.word_tokenize(raw_sentence)

This has the added benefit of splitting out punctuation.

Example:

>>> import nltk
>>> s = "The fox's foot grazed the sleeping dog, waking it."
>>> words = nltk.word_tokenize(s)
>>> words
['The', 'fox', "'s", 'foot', 'grazed', 'the', 'sleeping', 'dog', ',', 
'waking', 'it', '.']

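This lets you filter out any punctuation you don't want and keep only the words.

Note that the other solutions using string.split() are better if you don't plan on doing any complex manipulation of the sentence.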
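The filtering step mentioned above could look roughly like this. To keep the sketch runnable without NLTK or its tokenizer data, the token list is hard-coded from the example output shown earlier; is_punctuation is a hypothetical helper name, and it only recognises ASCII punctuation.

```python
import string

# Tokens copied from the nltk.word_tokenize output shown above, hard-coded
# so this runs without NLTK installed.
tokens = ['The', 'fox', "'s", 'foot', 'grazed', 'the', 'sleeping', 'dog', ',',
          'waking', 'it', '.']

def is_punctuation(token):
    """Return True if every character in the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)

# Keep only tokens that contain at least one non-punctuation character.
words_only = [t for t in tokens if not is_punctuation(t)]
print(words_only)
```

Note that "'s" survives the filter because it contains a letter; whether it counts as a word depends on what you do next with the list.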

[Edited]

- tgray

6

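split() relies on whitespace as the separator, so it will not separate hyphenated words, and phrases joined by long dashes will not be split either. Any punctuation that is not surrounded by spaces will also be handled incorrectly. For any real-world text parsing (such as this comment), your nltk suggestion is much better than split(). - hobs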

4

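Potentially useful, although I wouldn't describe this as splitting into "words". By any plain English definition, the comma "," and the possessive "'s" are not words. Normally, if you wanted to split the sentence above into "words" in a punctuation-aware way, you'd want to strip the comma and keep "fox's" as a single word. - Mark Amery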

1

Python 2.7+ as of April 2016. - AnneTheAgile

38

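How about this algorithm? Split the text on whitespace, then trim punctuation. This carefully removes punctuation from the edges of words without harming apostrophes inside words such as we're.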

>>> text
"'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'"

>>> text.split()
["'Oh,", 'you', "can't", 'help', "that,'", 'said', 'the', 'Cat:', "'we're", 'all', 'mad', 'here.', "I'm", 'mad.', "You're", "mad.'"]

>>> import string
>>> [word.strip(string.punctuation) for word in text.split()]
['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here', "I'm", 'mad', "You're", 'mad']
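One edge case of the split-then-trim approach: a token made entirely of punctuation (a lone dash, for instance) strips down to an empty string. A minimal sketch that also drops those leftovers, using a hypothetical function name and a made-up sample sentence:

```python
import string

def split_and_trim(text):
    """Split on whitespace, trim edge punctuation, drop tokens left empty."""
    trimmed = (word.strip(string.punctuation) for word in text.split())
    return [word for word in trimmed if word]

sample = "Wait - you can't be serious... 'really'?"
print(split_and_trim(sample))
```

Apostrophes inside words such as can't are untouched, because strip() only removes characters from the two ends of each token.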

- Colonel Panic

4

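Some English words do contain trailing punctuation. For example, the trailing dots in "e.g." and "Mrs.", and the trailing apostrophe in the possessive "frogs'" (as in "frogs' legs"), are part of the word but will be stripped by this algorithm. Abbreviations can be handled roughly correctly by detecting dot-separated initialisms and using a dictionary of special cases (such as "Mr." and "Mrs."). Distinguishing possessive apostrophes from single quotes is much harder, since it requires parsing the grammar of the sentence that contains the word. - Mark Amery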

2

@MarkAmery You're right. It has also occurred to me that some punctuation marks, such as the dash, can separate words without spaces. - Colonel Panic

17

"I want my Python function to split a sentence (input) and store each word in a list." The str.split() method does this: it takes a string and splits it into a list.

>>> the_string = "this is a sentence"
>>> words = the_string.split(" ")
>>> print(words)
['this', 'is', 'a', 'sentence']
>>> type(words)
<type 'list'> # or <class 'list'> in Python 3.0
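Passing " " explicitly behaves differently from calling split() with no argument once the input contains repeated or trailing spaces. A small illustration with a made-up input string:

```python
the_string = "this  is a sentence "

# sep=" " splits at every single space, so a doubled space or a trailing
# space produces empty strings in the result.
by_space = the_string.split(" ")
print(by_space)

# No argument: a run of whitespace counts as one separator and the ends are trimmed.
by_whitespace = the_string.split()
print(by_whitespace)
```

For splitting ordinary sentences into words, the no-argument form is usually the safer choice.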

- dbr

16

If you want all the characters of a word or sentence in a list, do this:

print(list("word"))
#  ['w', 'o', 'r', 'd']


print(list("some sentence"))
#  ['s', 'o', 'm', 'e', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e']

- BlackBeard

This answer belongs on https://dev59.com/6W445IYBdhLWcg3wOnnW instead, although it is probably a duplicate of answers already there. - Karl Knechtel

15

shlex has a .split() function. It differs from str.split() in that it does not preserve quotes and treats a quoted phrase as a single word:

>>> import shlex
>>> shlex.split("sudo echo 'foo && bar'")
['sudo', 'echo', 'foo && bar']

Note: it works well for Unix-like command-line strings. It is not suitable for natural-language processing.
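Because shlex.split() follows shell quoting rules, an unbalanced quote raises ValueError. If the input might not be a well-formed command line, one option is to fall back to plain whitespace splitting; split_command below is a hypothetical wrapper, not part of shlex.

```python
import shlex

def split_command(line):
    """Split like a POSIX shell; fall back to whitespace splitting on bad quoting."""
    try:
        return shlex.split(line)
    except ValueError:
        # Raised for unbalanced quotes, e.g. the apostrophe in "It's good."
        return line.split()

print(split_command("sudo echo 'foo && bar'"))
print(split_command("It's good."))
```

The fallback hides quoting errors, which is only appropriate when a best-effort split is acceptable.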

- Tarwin

1

Use with caution, especially for natural-language processing. It fails on strings containing a single unmatched quote (for example "It's good.") with ValueError: No closing quotation. - Igor

2

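If you want to split a string into a list of words and the string contains punctuation, it is probably a good idea to strip the punctuation. For example, str.split() splits the following string as: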

s = "Hi, these are words; these're, also, words."
words = s.split()
# ['Hi,', 'these', 'are', 'words;', "these're,", 'also,', 'words.']

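Here the tokens "Hi,", "words;", "also," and so on still have punctuation attached. Python has a built-in string module with an attribute that holds the ASCII punctuation characters (string.punctuation). One way to get rid of the punctuation is simply to strip it from each word: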

import string
words = [w.strip(string.punctuation) for w in s.split()]
# ['Hi', 'these', 'are', 'words', "these're", 'also', 'words']

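Another way is to build a translation table that maps every punctuation character to be removed, then apply it to the whole string before splitting: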

table = str.maketrans('', '', string.punctuation)
words = s.translate(table).split() 
# ['Hi', 'these', 'are', 'words', 'thesere', 'also', 'words']

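This does not handle words like these're, so nltk.word_tokenize can be used for that case, as tgray suggested. Just filter out the tokens that consist entirely of punctuation.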

import nltk
words = [w for w in nltk.word_tokenize(s) if w not in string.punctuation]
# ['Hi', 'these', 'are', 'words', 'these', "'re", 'also', 'words']

- cottontail

1

Split into words without harming apostrophes inside words. See input_1 (which contains Moore's law) and input_2 below.

import re

# A word is a run of word characters that may contain inner apostrophes,
# or a single word character.
WORD_MATCHER = re.compile(r"\w[\w']*\w|\w")

def split_into_words(line):
    return WORD_MATCHER.findall(line)

#Example 1

input_1 = "computational power (see Moore's law) and "
split_into_words(input_1)

# output 
['computational', 'power', 'see', "Moore's", 'law', 'and']

#Example 2

input_2 = """Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad."""

split_into_words(input_2)
#output
['Oh',
 'you',
 "can't",
 'help',
 'that',
 'said',
 'the',
 'Cat',
 "we're",
 'all',
 'mad',
 'here',
 "I'm",
 'mad',
 "You're",
 'mad']
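Two behaviours of this pattern are easy to miss: a trailing possessive apostrophe is dropped, and hyphenated words are split in two. The sketch below shows both on a made-up sample, alongside a hypothetical variant (WORD_HYPHEN, not part of the answer above) that also allows hyphens inside a word.

```python
import re

# Pattern from the answer above: word characters with inner apostrophes.
WORD = re.compile(r"\w[\w']*\w|\w")
# Hypothetical variant that also keeps hyphens inside a word.
WORD_HYPHEN = re.compile(r"\w[\w'-]*\w|\w")

sample = "the frogs' well-known legs"
print(WORD.findall(sample))
print(WORD_HYPHEN.findall(sample))
```

Which behaviour is right depends on whether hyphenated compounds should count as one word in your application.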

- thrinadhn

- nstehr · Accepted Answer

Given a string sentence, this stores each word in a list called words:

words = sentence.split()