The solution provided below runs in approximately O(n) time complexity-wise, where n is the number of tokens in each sentence.
For 5 million sentences and your concepts.txt, it performs the required operations in about 30 seconds; see the basic test in section 3.
As for space complexity, you will need to keep a nested dictionary structure (let's simplify it like that for now), say O(c*u), where u is the number of unique tokens among concepts of a certain (token-wise) length, and c is the length of a concept.
It is hard to pinpoint the exact complexities, but they go roughly like that (for your example data and the [concepts.txt] you provided those are pretty accurate, but we will get to the gory details as we go through the implementation).
I assume you can split your concepts and sentences on whitespace. If that's not the case, I would advise you to take a look at spaCy, which provides a smarter way to tokenize your data.
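One caveat of whitespace splitting worth keeping in mind: punctuation stays glued to the neighbouring token, which would prevent a concept from matching. A quick illustration:

```python
sentence = "We love text mining."
tokens = sentence.split()
# The trailing period sticks to the last token, so the concept
# "text mining" would NOT match here without prior punctuation cleanup
print(tokens)  # ['We', 'love', 'text', 'mining.']
```

This is exactly the kind of case where a real tokenizer such as spaCy's pays off.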
1. Introduction
Let's take your example:
concepts = [
    ["natural language processing", "text mining", "texts", "nlp"],
    ["advanced data mining", "data mining", "data"],
    ["discourse analysis", "learning analytics", "mooc"],
]
As you said, each element from a concept has to be mapped to the first one, so, in Pythonish, it would go roughly along these lines:
for concept in concepts:
    concept[1:] = concept[0]
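As literal Python (the slice assignment above is only pseudocode), such a flat alias-to-target mapping could be sketched like this; `mapping` is a hypothetical name and this is not the structure used later, just the mapping idea made concrete:

```python
concepts = [
    ["natural language processing", "text mining", "texts", "nlp"],
    ["advanced data mining", "data mining", "data"],
    ["discourse analysis", "learning analytics", "mooc"],
]

# Map every alias to the first (target) element of its concept
mapping = {alias: concept[0] for concept in concepts for alias in concept[1:]}
print(mapping["nlp"])   # natural language processing
print(mapping["data"])  # advanced data mining
```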
The task would be easy if all the concepts had a token-wise length equal to one (which is not the case here) and were unique. Let's focus on the second case and one particular (slightly modified) example of a concept to illustrate my point:
["advanced data mining", "data something", "data"]
Here, data would be mapped to advanced data mining, BUT data something, which contains data, should be mapped before it. If I understand you correctly, you would want this sentence:
"Here is data something and another data"
To be mapped onto:
"Here is advanced data mining and another advanced data mining"
Instead of the naive approach:
"Here is advanced data mining something and another advanced data mining"
Notice that in the second example we only mapped data, not data something.
To prioritize data something (and other concepts fitting this pattern), I used an array structure filled with dictionaries, where concepts earlier in the array are those that are token-wise longer.
Continuing our example, such an array would look like this:
structure = [
    {"data": {"something": "advanced data mining"}},
    {"data": "advanced data mining"},
]
Notice that if we go through the tokens in this order (e.g. first going through the first dictionary with consecutive tokens and, if there is no match, moving on to the second dictionary and so on), we will get the longest concepts first.
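A minimal sketch of that lookup order, using the structure above (`lookup` is a hypothetical helper, just to illustrate the longest-first idea):

```python
structure = [
    {"data": {"something": "advanced data mining"}},
    {"data": "advanced data mining"},
]

def lookup(tokens):
    """Try the nested (longer-concept) dictionary first, then fall back to shorter ones."""
    for dictionary in structure:
        node = dictionary
        for token in tokens:
            # Descend one level per token; a dead end means no match in this dictionary
            node = node.get(token) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, str):  # reached a target concept
            return node
    return None

print(lookup(["data", "something"]))  # advanced data mining (via the 2-token dictionary)
print(lookup(["data"]))               # advanced data mining (via the 1-token dictionary)
```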
2. Code
Okay, I hope you get the basic idea by now (if not, post a comment below and I will try to explain the unclear parts in more detail).
Disclaimer: I'm not particularly proud of this code-wise, but it gets the job done, and it could have been worse I suppose.
2.1 Hierarchical dictionary
First, let's get the token-wise longest concept (excluding the first element of each, as it is our target and we never have to change it):
def get_longest(concepts: List[List[str]]):
    return max(len(text.split()) for concept in concepts for text in concept[1:])
Using this information, we can initialize our structure by creating as many dictionaries as there are different concept lengths (2 in the example above). It would be the same for all of your data, but concepts of any length would do:
def init_hierarchical_dictionaries(longest: int):
    return [(length, {}) for length in reversed(range(longest))]
Notice that I am adding the length of each concept to the array; in my opinion it is easier to traverse that way, though you could do without it after some changes to the implementation.
Now, with these helper functions in place, we can create the structure from the list of concepts:
def create_hierarchical_dictionaries(concepts: List[List[str]]):
    longest = get_longest(concepts)
    hierarchical_dictionaries = init_hierarchical_dictionaries(longest)
    for concept in concepts:
        for text in concept[1:]:
            tokens = text.split()
            # Pick the dictionary matching this alias's token-wise length
            current_dictionary = hierarchical_dictionaries[longest - len(tokens)][1]
            # Nest one dictionary per token except the last, reusing an existing
            # nested dictionary if the prefix was seen before (setdefault avoids
            # clobbering entries of aliases sharing the same prefix)
            for token in tokens[:-1]:
                current_dictionary = current_dictionary.setdefault(token, {})
            # The last token maps to the target (first element), pre-split into tokens
            current_dictionary[tokens[-1]] = concept[0].split()
    return hierarchical_dictionaries
This function will create our hierarchical dictionaries; see the comments in the source code for some explanation. You may want to create a custom class wrapping this thing, it should be easier to use that way.
This object is exactly the same as the one described in 1. Introduction.
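For the introduction's concepts, the resulting structure looks like this (a self-contained sketch reproducing the helpers above, with the prefix-preserving setdefault variant):

```python
from typing import List

def get_longest(concepts: List[List[str]]):
    return max(len(text.split()) for concept in concepts for text in concept[1:])

def init_hierarchical_dictionaries(longest: int):
    return [(length, {}) for length in reversed(range(longest))]

def create_hierarchical_dictionaries(concepts: List[List[str]]):
    longest = get_longest(concepts)
    hierarchical_dictionaries = init_hierarchical_dictionaries(longest)
    for concept in concepts:
        for text in concept[1:]:
            tokens = text.split()
            current_dictionary = hierarchical_dictionaries[longest - len(tokens)][1]
            for token in tokens[:-1]:
                current_dictionary = current_dictionary.setdefault(token, {})
            current_dictionary[tokens[-1]] = concept[0].split()
    return hierarchical_dictionaries

concepts = [
    ["natural language processing", "text mining", "texts", "nlp"],
    ["advanced data mining", "data mining", "data"],
    ["discourse analysis", "learning analytics", "mooc"],
]
structure = create_hierarchical_dictionaries(concepts)
# Two-token aliases end up in the nested (length 1) dictionary...
print(structure[0])
# (1, {'text': {'mining': ['natural', 'language', 'processing']},
#      'data': {'mining': ['advanced', 'data', 'mining']},
#      'learning': {'analytics': ['discourse', 'analysis']}})
# ...and one-token aliases in the flat (length 0) dictionary
print(structure[1])
# (0, {'texts': ['natural', 'language', 'processing'],
#      'nlp': ['natural', 'language', 'processing'],
#      'data': ['advanced', 'data', 'mining'],
#      'mooc': ['discourse', 'analysis']})
```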
2.2 Traversing dictionaries
This part is much harder, but let's use a top-bottom approach this time. We'll start easy:
def embed_sentences(sentences: List[str], hierarchical_dictionaries):
    return (traverse(sentence, hierarchical_dictionaries) for sentence in sentences)
Provided with the hierarchical dictionaries, it creates a generator that transforms each sentence according to the concept mapping.
Now the traverse function:
def traverse(sentence: str, hierarchical_dictionaries):
    tokens = sentence.split()
    output_sentence = []
    # Look at each token in the sentence
    index = 0
    while index < len(tokens):
        # Try to match the longest concept starting at this token, checking
        # dictionaries from longest concepts to shortest
        for hierarchical_dictionary_tuple in hierarchical_dictionaries:
            index, concept = traverse_through_dictionary(
                index, tokens, hierarchical_dictionary_tuple
            )
            if concept is not None:
                output_sentence.extend(concept)
                break
        # No dictionary matched; keep the token as-is
        else:
            output_sentence.append(tokens[index])
        index += 1
    return " ".join(output_sentence)
Once again, if you're not sure what's going on, post a comment.
Using this approach, pessimistically, we will perform O(n*c!) checks, where n is the number of tokens in the sentence and c is the token-wise length of the longest concept (c! being its factorial). This case is extremely unlikely to occur in practice: every token in the sentence would have to fit the longest concept almost perfectly, and all the shorter concepts would have to be prefixes of the shortest one (e.g. super data mining, super data and data).
For any practical problem it would be much closer to O(n); as I said earlier, with the data you provided in the .txt file it is O(3*n) in the worst case, and usually O(2*n).
Traversing through each dictionary:
def traverse_through_dictionary(index, tokens, hierarchical_dictionary_tuple):
    # Unpack the token-wise length and the dictionary itself
    length, current_dictionary = hierarchical_dictionary_tuple
    inner_index = index
    # Descend through `length` nested dictionaries, one token at a time
    for _ in range(length):
        current_dictionary = current_dictionary.get(tokens[inner_index])
        inner_index += 1
        # Either the token did not match or we ran out of sentence tokens
        if current_dictionary is None or inner_index >= len(tokens):
            return index, None
    # The last token either maps to the target concept or there is no match
    concept = current_dictionary.get(tokens[inner_index])
    if concept is None:
        return index, None
    return inner_index, concept
This constitutes the meat of my solution.
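To see the two functions cooperate, here is a self-contained run over a hand-built structure for the single concept family ["advanced data mining", "data mining", "data"] (reproducing traverse and traverse_through_dictionary from above):

```python
def traverse_through_dictionary(index, tokens, hierarchical_dictionary_tuple):
    length, current_dictionary = hierarchical_dictionary_tuple
    inner_index = index
    for _ in range(length):
        current_dictionary = current_dictionary.get(tokens[inner_index])
        inner_index += 1
        if current_dictionary is None or inner_index >= len(tokens):
            return index, None
    concept = current_dictionary.get(tokens[inner_index])
    if concept is None:
        return index, None
    return inner_index, concept

def traverse(sentence, hierarchical_dictionaries):
    tokens = sentence.split()
    output_sentence = []
    index = 0
    while index < len(tokens):
        for hierarchical_dictionary_tuple in hierarchical_dictionaries:
            index, concept = traverse_through_dictionary(
                index, tokens, hierarchical_dictionary_tuple
            )
            if concept is not None:
                output_sentence.extend(concept)
                break
        else:
            output_sentence.append(tokens[index])
        index += 1
    return " ".join(output_sentence)

# Hand-built structure for ["advanced data mining", "data mining", "data"]
hierarchical_dictionaries = [
    (1, {"data": {"mining": ["advanced", "data", "mining"]}}),
    (0, {"data": ["advanced", "data", "mining"]}),
]
print(traverse("data and data mining", hierarchical_dictionaries))
# advanced data mining and advanced data mining
```

Note how "data mining" is consumed as a whole via the first dictionary, while the lone "data" falls through to the second one.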
3. Results
Now, for brevity, the whole source code is provided below (concepts.txt being the one you provided):
import ast
import time
from typing import List


def get_longest(concepts: List[List[str]]):
    return max(len(text.split()) for concept in concepts for text in concept[1:])


def init_hierarchical_dictionaries(longest: int):
    return [(length, {}) for length in reversed(range(longest))]


def create_hierarchical_dictionaries(concepts: List[List[str]]):
    longest = get_longest(concepts)
    hierarchical_dictionaries = init_hierarchical_dictionaries(longest)
    for concept in concepts:
        for text in concept[1:]:
            tokens = text.split()
            current_dictionary = hierarchical_dictionaries[longest - len(tokens)][1]
            for token in tokens[:-1]:
                current_dictionary = current_dictionary.setdefault(token, {})
            current_dictionary[tokens[-1]] = concept[0].split()
    return hierarchical_dictionaries


def traverse_through_dictionary(index, tokens, hierarchical_dictionary_tuple):
    length, current_dictionary = hierarchical_dictionary_tuple
    inner_index = index
    for _ in range(length):
        current_dictionary = current_dictionary.get(tokens[inner_index])
        inner_index += 1
        if current_dictionary is None or inner_index >= len(tokens):
            return index, None
    concept = current_dictionary.get(tokens[inner_index])
    if concept is None:
        return index, None
    return inner_index, concept


def traverse(sentence: str, hierarchical_dictionaries):
    tokens = sentence.split()
    output_sentence = []
    index = 0
    while index < len(tokens):
        for hierarchical_dictionary_tuple in hierarchical_dictionaries:
            index, concept = traverse_through_dictionary(
                index, tokens, hierarchical_dictionary_tuple
            )
            if concept is not None:
                output_sentence.extend(concept)
                break
        else:
            output_sentence.append(tokens[index])
        index += 1
    return " ".join(output_sentence)


def embed_sentences(sentences: List[str], hierarchical_dictionaries):
    return (traverse(sentence, hierarchical_dictionaries) for sentence in sentences)


def sanity_check():
    concepts = [
        ["natural language processing", "text mining", "texts", "nlp"],
        ["advanced data mining", "data mining", "data"],
        ["discourse analysis", "learning analytics", "mooc"],
    ]
    sentences = [
        "data mining and text mining",
        "nlp is mainly used by discourse analysis community",
        "data mining in python is fun",
        "mooc data analysis involves texts",
        "data and data mining are both very interesting",
    ]
    targets = [
        "advanced data mining and natural language processing",
        "natural language processing is mainly used by discourse analysis community",
        "advanced data mining in python is fun",
        "discourse analysis advanced data mining analysis involves natural language processing",
        "advanced data mining and advanced data mining are both very interesting",
    ]
    hierarchical_dictionaries = create_hierarchical_dictionaries(concepts)
    results = list(embed_sentences(sentences, hierarchical_dictionaries))
    if results == targets:
        print("Correct results")
    else:
        print("Incorrect results")


def speed_check():
    with open("./concepts.txt") as f:
        concepts = ast.literal_eval(f.read())
    initial_sentences = [
        "data mining and text mining",
        "nlp is mainly used by discourse analysis community",
        "data mining in python is fun",
        "mooc data analysis involves texts",
        "data and data mining are both very interesting",
    ]
    sentences = initial_sentences.copy()
    for i in range(1_000_000):
        sentences += initial_sentences
    start = time.time()
    hierarchical_dictionaries = create_hierarchical_dictionaries(concepts)
    middle = time.time()
    letters = []
    for result in embed_sentences(sentences, hierarchical_dictionaries):
        letters.append(result[0].capitalize())
    end = time.time()
    print(f"Time for hierarchical creation {(middle-start) * 1000.0} ms")
    print(f"Time for embedding {(end-middle) * 1000.0} ms")
    print(f"Overall time elapsed {(end-start) * 1000.0} ms")


def main():
    sanity_check()
    speed_check()


if __name__ == "__main__":
    main()
Results of the speed check are provided below:
Time for hierarchical creation 107.71822929382324 ms
Time for embedding 30460.427284240723 ms
Overall time elapsed 30568.145513534546 ms
So, for 5 million sentences (the 5 sentences you provided, concatenated 1 million times) and the concepts file you provided (1.1 MB), it takes roughly 30 seconds to perform the concept mapping, which isn't bad I suppose.
A dictionary should take, worst case scenario, as much memory as your input file (concepts.txt in this case), but will usually be lower/much lower, as it depends on the lengths of the concepts and the uniqueness of their words.