Word tokenization of a list of sentences in Python

16
I currently have a file that contains a list that looks like this:
example = ['Mary had a little lamb' , 
           'Jack went up the hill' , 
           'Jill followed suit' ,    
           'i woke up suddenly' ,
           'it was a really bad dream...']

"example"是这样的句子列表,我希望输出看起来像这样: mod_example = ["'Mary' 'had' 'a' 'little' 'lamb'" , 'Jack' 'went' 'up' 'the' 'hill' ....] 等等。 我需要将句子分开,每个单词进行标记化,这样我就可以使用for循环逐个比较mod_example中的句子与参考句子的每个单词。
我尝试了这个方法:
for sentence in example:
    text3 = sentence.split()
    print text3 

and got the following output:
['it', 'was', 'a', 'really', 'bad', 'dream...']

How do I do this for all the sentences? It keeps overwriting the previous result. And yes, please also mention whether my approach is right. The output should stay a list of sentences, with the words tokenized. Thanks.
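A minimal sketch of the intended flow, appending each tokenized sentence to a list instead of overwriting it and then comparing word by word (the reference sentence below is made up for illustration):

# Minimal sketch: collect each tokenized sentence, then compare against a made-up reference.
example = ['Mary had a little lamb',
           'Jack went up the hill',
           'Jill followed suit',
           'i woke up suddenly',
           'it was a really bad dream...']

reference = 'Jack went up the hill'.split()   # placeholder reference sentence

mod_example = []
for sentence in example:
    mod_example.append(sentence.split())      # keep one token list per sentence

for tokens in mod_example:
    for word in tokens:
        if word in reference:
            print(word)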

Could you explain in more detail what you mean by "I want to compare each word of a mod_example sentence (using a for loop) with a reference sentence"? - embert
The quotes are there to show that each sentence is still a separate entity. So I want the words tokenized, but not over the whole text at once. For example, I don't want ['mary' 'had' 'a' 'little' 'lamb' 'jack' 'went' 'up' 'the' 'hill'] and so on; it should still be a list in which every sentence has its words tokenized. - Hypothetical Ninja
The question as asked has nothing to do with tokenizing the input; it is purely about how to apply the existing code to a list of inputs, repeating it for each element. - undefined
8 Answers

31
You can use the word tokenizer in NLTK (http://nltk.org/api/nltk.tokenize.html) together with a list comprehension, see http://docs.python.org/2/tutorial/datastructures.html#list-comprehensions:
>>> from nltk.tokenize import word_tokenize
>>> example = ['Mary had a little lamb' , 
...            'Jack went up the hill' , 
...            'Jill followed suit' ,    
...            'i woke up suddenly' ,
...            'it was a really bad dream...']
>>> tokenized_sents = [word_tokenize(i) for i in example]
>>> for i in tokenized_sents:
...     print i
... 
['Mary', 'had', 'a', 'little', 'lamb']
['Jack', 'went', 'up', 'the', 'hill']
['Jill', 'followed', 'suit']
['i', 'woke', 'up', 'suddenly']
['it', 'was', 'a', 'really', 'bad', 'dream', '...']

3
I strongly advise against using NLTK. While it is popular because it was the first well-documented NLP package for Python, it is outdated. Moreover, word_tokenize has a habit of transforming the input. - Eli Korvigo
Agreed on it transforming the input, but in my opinion tokenization should not be treated as a transformation but as an annotation. An annotation adds a layer of information on top of the data instead of replacing it =) (Disclaimer: I do contribute to NLTK) - alvas
1
Also, NLTK ships more than one tokenizer. The original treebank tokenizer widely used by the NLP community is dated and does not fit every case, but since https://github.com/nltk/nltk/issues/1214 more tokenizers have been included/ported/wrapped in NLTK, including Moses (from machine translation), Toktok (from language modelling), REPP (from grammar engineering), and the Stanford CoreNLP tokenizers for several languages (https://github.com/nltk/nltk/pull/1735#issuecomment-306091826). - alvas
If anyone is looking for a fast and customizable tokenizer, do check out the SpaCy tokenizer https://spacy.io/docs/usage/customizing-tokenizer. It would be great to have a similar contribution in NLTK too =) - alvas
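To make the two points above concrete, here is a small, hedged illustration (it assumes the NLTK data used by word_tokenize is already downloaded, and exact output can vary slightly between NLTK versions): word_tokenize rewrites straight double quotes, which is the kind of input transformation mentioned, while ToktokTokenizer is one of the alternative tokenizers bundled with NLTK.

from nltk.tokenize import word_tokenize
from nltk.tokenize.toktok import ToktokTokenizer

# word_tokenize converts straight double quotes into `` and ''
print(word_tokenize('"Mary had a little lamb"'))
# ['``', 'Mary', 'had', 'a', 'little', 'lamb', "''"]

# one of the alternative tokenizers shipped with NLTK
toktok = ToktokTokenizer()
print(toktok.tokenize('Jack went up the hill, Jill followed suit.'))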

8
I made this script so that everyone can understand how tokenization works and can build their own natural language processing engine.
import re
from contextlib import redirect_stdout
from io import StringIO

example = 'Mary had a little lamb, Jack went up the hill, Jill followed suit, i woke up suddenly, it was a really bad dream...'

def token_to_sentence(str):
    f = StringIO()
    with redirect_stdout(f):
        regex_of_sentence = re.findall(r'([\w\s]{0,})[^\w\s]', str)
        regex_of_sentence = [x for x in regex_of_sentence if x != '']
        for i in regex_of_sentence:
            print(i)
        first_step_to_sentence = (f.getvalue()).split('\n')
    g = StringIO()
    with redirect_stdout(g):
        for i in first_step_to_sentence:
            try:
                # anchored at the start so a sentence without a leading space keeps its first word
                regex_to_clear_sentence = re.search(r'^\s*([\w\s]+)', i)
                print(regex_to_clear_sentence.group(1))
            except:
                print(i)
        sentence = (g.getvalue()).split('\n')
    return sentence

def token_to_words(str):
    f = StringIO()
    with redirect_stdout(f):
        for i in str:
            regex_of_word = re.findall(r'([\w]{0,})', i)
            regex_of_word = [x for x in regex_of_word if x != '']
            for word in regex_of_word:
                print(word)
        words = [w for w in (f.getvalue()).split('\n') if w != '']
    return words

I went through a different process, starting again from the paragraph, so that everyone understands text processing better. The paragraph to process is:
example = 'Mary had a little lamb, Jack went up the hill, Jill followed suit, i woke up suddenly, it was a really bad dream...'

Split the paragraph into sentences:
sentence = token_to_sentence(example)

which will produce:

['Mary had a little lamb', 'Jack went up the hill', 'Jill followed suit', 'i woke up suddenly', 'it was a really bad dream']

Tokenize it into words:
words = token_to_words(sentence)

which will result in:
['Mary', 'had', 'a', 'little', 'lamb', 'Jack', 'went', 'up', 'the', 'hill', 'Jill', 'followed', 'suit', 'i', 'woke', 'up', 'suddenly', 'it', 'was', 'a', 'really', 'bad', 'dream']

Let me explain how this works.
First, I use a regex to capture all the words and the spaces that separate them, stopping when punctuation is found. The regex is:
([\w\s]{0,})[^\w\s]{0,}

So the match picks up the words and spaces shown in the parentheses:
'(Mary had a little lamb),( Jack went up the hill),( Jill followed suit),( i woke up suddenly),( it was a really bad dream)...'

The result is still not clean; it contains some empty entries, so I remove them with:
[x for x in regex_of_sentence if x != '']

So the paragraph is split into sentences, but they are not clean yet; the result is:
['Mary had a little lamb', ' Jack went up the hill', ' Jill followed suit', ' i woke up suddenly', ' it was a really bad dream']

As you can see, some sentences start with a space. To get clean sentences without the leading space, I use a regex anchored at the start of the string:
^\s*([\w\s]+)

which produces clean sentences:
['Mary had a little lamb', 'Jack went up the hill', 'Jill followed suit', 'i woke up suddenly', 'it was a really bad dream']

So it takes two steps to get a good result.
The answer to your question starts here...
To tokenize the sentences into words, I iterate over the sentences and use a regex to capture the words. The regex is:
([\w]{0,})

and clear the empty entries again with:
[x for x in regex_of_word if x != '']

So the result is a really clean list of words only:
['Mary', 'had', 'a', 'little', 'lamb', 'Jack', 'went', 'up', 'the', 'hill', 'Jill', 'followed', 'suit', 'i', 'woke', 'up', 'suddenly', 'it', 'was', 'a', 'really', 'bad', 'dream']
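The same two-step idea can also be sketched more directly, without StringIO and redirect_stdout; this is only a compact restatement of the approach above, not the original script:

import re

example = 'Mary had a little lamb, Jack went up the hill, Jill followed suit, i woke up suddenly, it was a really bad dream...'

# step 1: split on punctuation, then strip leading/trailing spaces
sentences = [s.strip() for s in re.split(r'[^\w\s]+', example) if s.strip()]

# step 2: pull the words out of every sentence (flat list, as above)
words = [w for s in sentences for w in re.findall(r'\w+', s)]

print(sentences)
print(words)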

Going further, to build a good NLP system you need your own phrase database: search whether a phrase appears in the sentence, list those phrases, and the remaining words are plain words.
With this approach I can build an NLP system for my own language (Indonesian), which lacks a lot of modules.
Edit:
I didn't notice the part of your question about comparing the words. So you have another sentence to compare against... I'll give you a bonus: beyond that, I'll also show you how to count the matches.
mod_example = ["'Mary' 'had' 'a' 'little' 'lamb'" , 'Jack' 'went' 'up' 'the' 'hill']

In this case, the steps you have to follow are: 1. iterate over mod_example, 2. compare the first sentence with the words of mod_example, 3. do some calculation.
So the script will be:
import re
from contextlib import redirect_stdout
from io import StringIO

example = 'Mary had a little lamb, Jack went up the hill, Jill followed suit, i woke up suddenly, it was a really bad dream...'
mod_example = ["'Mary' 'had' 'a' 'little' 'lamb'" , 'Jack' 'went' 'up' 'the' 'hill']

def token_to_sentence(str):
    f = StringIO()
    with redirect_stdout(f):
        regex_of_sentence = re.findall(r'([\w\s]{0,})[^\w\s]', str)
        regex_of_sentence = [x for x in regex_of_sentence if x != '']
        for i in regex_of_sentence:
            print(i)
        first_step_to_sentence = (f.getvalue()).split('\n')
    g = StringIO()
    with redirect_stdout(g):
        for i in first_step_to_sentence:
            try:
                # anchored at the start so a sentence without a leading space keeps its first word
                regex_to_clear_sentence = re.search(r'^\s*([\w\s]+)', i)
                print(regex_to_clear_sentence.group(1))
            except:
                print(i)
        sentence = (g.getvalue()).split('\n')
    return sentence

def token_to_words(str):
    f = StringIO()
    with redirect_stdout(f):
        for i in str:
            regex_of_word = re.findall(r'([\w]{0,})', i)
            regex_of_word = [x for x in regex_of_word if x != '']
            for word in regex_of_word:
                print(word)
        words = [w for w in (f.getvalue()).split('\n') if w != '']
    return words

def convert_to_words(str):
    sentences = token_to_sentence(str)
    return token_to_words(sentences)

def compare_list_of_words__to_another_list_of_words(from_strA, to_strB):
    fromA = list(set(from_strA))
    totalB = len(to_strB)
    for word_to_match in fromA:
        number_of_match = (to_strB).count(word_to_match)
        data = str((number_of_match / totalB) * 100)
        print('words: -- ' + word_to_match + ' --' + '\n'
              '       number of match    : ' + str(number_of_match) + ' from ' + str(totalB) + '\n'
              '       percent of match   : ' + data + ' percent')


# preparation is done, now we will use it; the process starts with the script below:

if __name__ == '__main__':
    # tokenize the paragraph in example into sentences:
    getsentences = token_to_sentence(example)

    # tokenize the sentences (in getsentences) into words
    getwords = token_to_words(getsentences)

    # compare the list of words in getwords with the list of words in mod_example
    compare_list_of_words__to_another_list_of_words(getwords, mod_example)

1

You can use nltk (as @alvas suggests) together with a recursive function that takes any object and tokenizes every str in it:

from nltk.tokenize import word_tokenize
def tokenize(obj):
    if obj is None:
        return None
    elif isinstance(obj, str):
        return word_tokenize(obj)
    elif isinstance(obj, list):
        return [tokenize(i) for i in obj]
    else:
        return obj # Or throw an exception, or parse a dict...

Usage:

data = [["Lorem ipsum dolor. Sit amet?", "Hello World!", None], ["a"], "Hi!", None, ""]
print(tokenize(data))

Output:

[[['Lorem', 'ipsum', 'dolor', '.', 'Sit', 'amet', '?'], ['Hello', 'World', '!'], None], [['a']], ['Hi', '!'], None, []]

1
Use a list comprehension to iterate over your sentences and word-tokenize each one:
from nltk import word_tokenize

sentences = ['Mary had a little lamb',
             'Jack went up the hill',
             'Jill followed suit',
             'i woke up suddenly',
             'it was a really bad dream...']

sentences = [word_tokenize(sent) for sent in sentences]

print(sentences)

Output:

 [['Mary', 'had', 'a', 'little', 'lamb'], ['Jack', 'went', 'up', 'the', 'hill'], ['Jill', 'followed', 'suit'], ['i', 'woke', 'up', 'suddenly'], ['it', 'was', 'a', 'really', 'bad', 'dream', '...']]

1
Split the list example:
first_split = []
for i in example:
    first_split.append(i.split())

Split the elements of the first_split list:
second_split = []
for j in first_split:
    for k in j:
        second_split.append(k.split())

Split the elements of the second_split list and append them to the final list, in the form the programmer wants the output:
final_list = []
for m in second_split:
    for n in m:
        if n not in final_list:
            final_list.append(n)

print(final_list)
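Note that final_list flattens and de-duplicates the words of all sentences. If the goal is to keep one token list per sentence, as asked in the question, a nested list comprehension over the same example list is a minimal sketch of that variant:

example = ['Mary had a little lamb', 'Jack went up the hill']  # the question's list, shortened

# one list of tokens per sentence, duplicates kept
per_sentence = [sentence.split() for sentence in example]
# [['Mary', 'had', 'a', 'little', 'lamb'], ['Jack', 'went', 'up', 'the', 'hill']]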

I hope this will be the simplest way. - user11751084
Just give it a try. - user11751084

1
This can also be done with PyTorch's torchtext:
from torchtext.data import get_tokenizer

tokenizer = get_tokenizer('basic_english')
example = ['Mary had a little lamb' , 
            'Jack went up the hill' , 
            'Jill followed suit' ,    
            'i woke up suddenly' ,
            'it was a really bad dream...']
tokens = []
for s in example:
    tokens += tokenizer(s)
# ['mary', 'had', 'a', 'little', 'lamb', 'jack', 'went', 'up', 'the', 'hill', 'jill', 'followed', 'suit', 'i', 'woke', 'up', 'suddenly', 'it', 'was', 'a', 'really', 'bad', 'dream', '.', '.', '.']
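The loop above flattens everything into a single token list; to keep one token list per sentence (which is what the question asks for), a small, hedged variant of the same idea is:

from torchtext.data import get_tokenizer

tokenizer = get_tokenizer('basic_english')
example = ['Mary had a little lamb', 'Jack went up the hill']

# one token list per sentence; 'basic_english' also lower-cases the text
tokens_per_sentence = [tokenizer(s) for s in example]
# [['mary', 'had', 'a', 'little', 'lamb'], ['jack', 'went', 'up', 'the', 'hill']]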

0

It's hard for me to tell exactly what you are trying to do.

How about this?

exclude = set(['Mary', 'Jack', 'Jill', 'i', 'it'])

mod_example = []
for sentence in example:
    words = sentence.split()
    # Optionally sort out some words (iterate over a copy so that
    # removing an item does not skip the following word)
    for word in words[:]:
        if word in exclude:
            words.remove(word)
    mod_example.append('\'' + '\' \''.join(words) + '\'')

print mod_example

Which outputs:

["'had' 'a' 'little' 'lamb'", "'went' 'up' 'the' 'hill'", "'followed' 'suit'", 
"'woke' 'up' 'suddenly'", "'was' 'a' 'really' 'bad' 'dream...'"]
>>> 

Edit: Based on the additional information from the OP, another suggestion:

example = ['Area1 Area1 street one, 4454 hikoland' ,
           'Area2 street 2, 52432 hikoland, area2' ,
           'Area3 ave three, 0534 hikoland' ]

mod_example = []
for sentence in example:
    words = sentence.split()
    # Sort out some words
    col1 = words[0]
    col2 = words[1:]
    if col1 in col2:
        col2.remove(col1)
    elif col1.lower() in col2:
        col2.remove(col1.lower())
    mod_example.append(col1 + ': ' + ' '.join(col2))

Output:

>>> print mod_example
['Area1: street one, 4454 hikoland', 'Area2: street 2, 52432 hikoland,', 
'Area3: ave three, 0534 hikoland']
>>> 

Is this still a list? That's what I want... Yes, and I also want the first word of each sentence... Thanks... I'll check it out. - Hypothetical Ninja
It would be easier if you told me the underlying problem you are trying to solve, @Sword. - embert
@Sword If you only ask about the exact operation, nobody can show you alternative ways of solving the underlying problem. - embert
Okay, let me elaborate. Suppose I have a TSV file where the first column holds an area name and the second column holds the exact address (building name, street, etc.). There are many such addresses, in this form: [Jack went up the hill, Jill followed suit], with the comma indicating the next row. The first column is auto-filled, so the area name is correct, but the second column may contain mistakes: the area name can wrongly end up in the second column (the exact address) as well. What I need to do is compare the first and second columns and remove the duplicate area name from the second column if it appears there. - Hypothetical Ninja
If commas separate the rows, what separates the columns, @Sword? - embert
It is a tab-separated file; I used commas to indicate different rows so you would understand (there are actually many commas within a single row, and they play no role here). For example: Green hill area | small hut, fourth avenue road, near the corn farm. Evergreen woods | big house, off 8 mile road, evergreen woods. As you can see, I need to remove the duplicate, i.e. 'evergreen woods', from the second column since it already appears in the first column. Any ideas? - Hypothetical Ninja
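A rough sketch of that tab-separated clean-up; the file names and the case-insensitive matching are assumptions for illustration, not part of the question:

import csv
import re

with open('addresses.tsv', newline='') as src, \
     open('addresses_clean.tsv', 'w', newline='') as dst:
    reader = csv.reader(src, delimiter='\t')
    writer = csv.writer(dst, delimiter='\t')
    for area, address in reader:
        # drop the area name from the address column if it slipped in there
        cleaned = re.sub(re.escape(area), '', address, flags=re.IGNORECASE)
        cleaned = re.sub(r'\s{2,}', ' ', cleaned).strip(' ,')
        writer.writerow([area, cleaned])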

0
In spaCy this is quite simple:
import spacy

example = ['Mary had a little lamb' , 
           'Jack went up the hill' , 
           'Jill followed suit' ,    
           'i woke up suddenly' ,
           'it was a really bad dream...']

nlp = spacy.load("en_core_web_sm")

result = []

for line in example:
    sent = nlp(line)
    token_result = []
    for token in sent:
        token_result.append(token)
    result.append(token_result)

print(result)

The output will be:

[[Mary, had, a, little, lamb], [Jack, went, up, the, hill], [Jill, followed, suit], [i, woke, up, suddenly], [it, was, a, really, bad, dream, ...]]
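Note that the inner lists here hold spaCy Token objects rather than strings; if plain strings are wanted, token.text gives the raw text of each token (a small variant of the code above):

import spacy

nlp = spacy.load("en_core_web_sm")
example = ['Mary had a little lamb', 'Jack went up the hill']

# token.text returns the plain string for each spaCy Token
result = [[token.text for token in nlp(line)] for line in example]
# [['Mary', 'had', 'a', 'little', 'lamb'], ['Jack', 'went', 'up', 'the', 'hill']]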
