A more efficient implementation of textacy / spaCy 'subject_verb_object_triples'

3

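I'm trying to apply textacy's "extract.subject_verb_object_triples" function to my dataset. However, the code I've written is very slow and memory-intensive. Is there a more efficient way to implement this?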

import spacy
import textacy

def extract_SVO(text):

    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    tuples = textacy.extract.subject_verb_object_triples(doc)
    tuples_to_list = list(tuples)
    if tuples_to_list != []:
        tuples_list.append(tuples_to_list)

tuples_list = []          
sp500news['title'].apply(extract_SVO)
print(tuples_list)

Sample data (sp500news)

    date_publish  \
0       2013-05-14 17:17:05   
1       2014-05-09 20:15:57   
4       2018-07-19 10:29:54   
6       2012-04-17 21:02:54   
8       2012-12-12 20:17:56   
9       2018-11-08 10:51:49   
11      2013-08-25 07:13:31   
12      2015-01-09 00:54:17   

 title  
0       Italy will not dismantle Montis labour reform  minister                            
1       Exclusive US agency FinCEN rejected veterans in bid to hire lawyers                
4       Xis campaign to draw people back to graying rural China faces uphill battle        
6       Romney begins to win over conservatives                                            
8       Oregon mall shooting survivor in serious condition                                 
9       Polands PGNiG to sign another deal for LNG supplies from US CEO                    
11      Australias opposition leader pledges stronger economy if elected PM                
12      New York shifts into Code Blue to get homeless off frigid streets                  

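Could you provide some sample data? - Vivek Kalyanarangan
Hi @VivekKalyanarangan, I've added sample data. - W.R
Can you copy-paste it and format it as code? That's easier than reading and retyping it from an image. - Vivek Kalyanarangan
@VivekKalyanarangan Done. - W.R
1 Answer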

5
This should speed things up a bit:
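import spacy
import textacy

# Load the model once, at module level, instead of on every call.
nlp = spacy.load('en_core_web_sm')

def extract_SVO(doc):
    # The extractor returns a generator, which is always truthy,
    # so convert it to a list before testing for emptiness.
    tuples_to_list = list(textacy.extract.subject_verb_object_triples(doc))
    if tuples_to_list:
        tuples_list.append(tuples_to_list)

tuples_list = []
# Parse every title once; the column now holds spaCy Doc objects.
sp500news['title'] = sp500news['title'].apply(nlp)
_ = sp500news['title'].apply(extract_SVO)
print(tuples_list)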

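Explanation

In the OP's implementation, calling nlp = spacy.load('en_core_web_sm') inside the function reloads the model on every call. I suspect this is the biggest bottleneck. Moving it out of the function means the model is loaded only once, which should speed things up.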
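
Beyond loading the model once, spaCy can also parse texts in batches through nlp.pipe, which is usually faster than one nlp(text) call per row. The following is only a sketch of that idea, not part of the original answer: extract_svo_batch and run_on_titles are hypothetical helper names, the batch size is arbitrary, and disabling the "ner" component rests on the assumption that SVO extraction does not use named entities.

```python
from typing import Any, Callable, Iterable, List


def extract_svo_batch(
    texts: Iterable[str],
    nlp: Any,
    extractor: Callable[[Any], Iterable[Any]],
    batch_size: int = 256,
) -> List[list]:
    """Parse texts in batches and collect the non-empty triple lists."""
    results = []
    # nlp.pipe streams Doc objects and processes the texts in batches.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        triples = list(extractor(doc))  # materialize the generator first
        if triples:
            results.append(triples)
    return results


def run_on_titles(titles: Iterable[str]) -> List[list]:
    """Hypothetical wrapper: load the model once, then extract triples."""
    # Imported lazily so the helper above can be used without spaCy installed.
    import spacy
    import textacy.extract

    # Assumption: NER is not needed for SVO extraction, so it is disabled.
    nlp = spacy.load("en_core_web_sm", disable=["ner"])
    return extract_svo_batch(
        titles, nlp, textacy.extract.subject_verb_object_triples
    )
```

With the question's DataFrame, run_on_titles(sp500news['title'].tolist()) would return the same kind of nested list that tuples_list holds. It also leaves the title column as plain strings instead of replacing it with Doc objects, which may help with memory use.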

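Also, the triples are only worth storing when there are any. Because subject_verb_object_triples returns a generator, that check has to be made after converting the result to a list.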
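
The emptiness check is easy to get wrong because a generator object is truthy even when it yields nothing. A small sketch with a stand-in generator shows the difference; fake_triples is hypothetical and only imitates the shape of the extractor's output.

```python
def fake_triples(n):
    """Stand-in for a generator-returning extractor; yields n dummy triples."""
    for i in range(n):
        yield ("subj", "verb", f"obj{i}")


# A generator object is always truthy, even if it will yield nothing.
truthy_when_empty = bool(fake_triples(0))

# Converting to a list first gives a value whose truthiness reflects its size.
falsy_when_empty = bool(list(fake_triples(0)))

print(truthy_when_empty, falsy_when_empty)
```

So `if tuples:` on the raw generator never filters anything out, while the same test on the list does.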

