Extracting multi-word expressions from annotated text

Question


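I have a dataframe with two columns: one holds the text, and the other holds the annotations for each text, giving the type of each expression and the character range that contains it. For example, the text column: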

Barack Obama was president of the United States in 2008.

The annotation column:

MWE_type 0 12

This means from character 0 to character 12, so the expression is Barack Obama. And,

MWE_type 34 47

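so that one is United States.

How can I use the annotations to extract the expressions from the text and save them in a new column (for the example text, it would be something like [Barack Obama, United States])?

Thanks for your time! If you need anything more specific, I'm happy to add more information!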

- Radix

How far have you gotten with this? - devReddit

Yes, just plain strings. I create the columns from multiple txt files. - Radix

So, your dataframe looks like this:

{'text' : ['Barack Obama was president of the United States in 2008.'], 'annotation' : [['MWE_type 0 12','MWE_type 34 47']]}

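And each row of the annotation column can be a list of strings, right? - devReddit

Oh, I see, so in a single row you would get either MWE_type 0 12 or MWE_type 34 47, not both, right? - devReddit

Could you post a small sample dataframe (df)? That would help clarify the data structure. - fsimonjetz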

1 Answer


- devReddit · Accepted Answer

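If I understand correctly, and going by the format you described in the comments on the question, here is one way to do it.

First, according to your description, the data looks like this: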

data = {'text' : ['Barack Obama was president of the United States in 2008.'],
    'annotation' : ['MWE_type 0 12 MWE_type 34 47']}

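We maintain a final_list, which is a list of lists: each inner list is the output for one row.

We can iterate over the rows with df.iterrows() and build each row's result from row['text'], using the offsets in row['annotation'].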

for index, row in df.iterrows():

We can extract the index pairs with a regular expression:

re.findall(r'\d+ \d+', row['annotation'])
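As a quick check of what this pattern returns on the example annotation string, here is a minimal sketch. The variable names `annotation` and `index_list` are only for illustration, and it assumes the type label itself never contains two numbers separated by a space.

```python
import re

# Example annotation string from the data above
annotation = 'MWE_type 0 12 MWE_type 34 47'

# Each match is one "start end" pair, still as a single string
index_list = re.findall(r'\d+ \d+', annotation)
print(index_list)  # ['0 12', '34 47']
```

Each element still has to be split and converted to integers before it can be used for slicing.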

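We can loop over this list of index pairs and append the corresponding substring to the per-row result list. Python slicing excludes the end index, so [0:12] yields exactly "Barack Obama".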

for indexes in index_list:
    start, end = map(int, indexes.split())
    result.append(row['text'][start:end])

At the end of each row's iteration, we append that row's result list to final_list:

final_list.append(result)

Finally, assign final_list to df['result']:

df['result'] = final_list

The whole program looks like this:

import pandas as pd
import re

data = {'text' : ['Barack Obama was president of the United States in 2008.'],
    'annotation' : ['MWE_type 0 12 MWE_type 34 47']}

df = pd.DataFrame(data)

final_list = []

for index, row in df.iterrows():
    result = []
    index_list = re.findall(r'\d+ \d+', row['annotation'])
    for indexes in index_list:
        start, end = map(int, indexes.split())
        result.append(row['text'][start:end])
    final_list.append(result)

df['result'] = final_list

print(df)

You will get:

                                                text  ...                         result
0  Barack Obama was president of the United State...  ...  [Barack Obama, United States]
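As an alternative sketch under the same assumption (each annotation is a single string of "type start end" entries), the per-row logic can be moved into a small helper and applied with a list comprehension instead of df.iterrows(). The helper name `extract_spans` is hypothetical, not part of pandas or any library; it captures start and end as regex groups so no split() is needed.

```python
import re
import pandas as pd

def extract_spans(text, annotation):
    """Return the substrings of `text` named by 'start end' pairs in `annotation`."""
    # Capture the start and end offsets as separate groups
    pairs = re.findall(r'(\d+) (\d+)', annotation)
    # End offsets are treated as exclusive, matching Python slicing
    return [text[int(start):int(end)] for start, end in pairs]

data = {'text': ['Barack Obama was president of the United States in 2008.'],
        'annotation': ['MWE_type 0 12 MWE_type 34 47']}
df = pd.DataFrame(data)

# Pair each text with its annotation and build one result list per row
df['result'] = [extract_spans(t, a) for t, a in zip(df['text'], df['annotation'])]
print(df['result'].tolist())  # [['Barack Obama', 'United States']]
```

Keeping the extraction in its own function makes it easy to test on a single text and annotation pair before running it over the whole dataframe.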