How can I get a parsed NLP tree object from a bracketed parse string using nltk or spacy?

3

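I have the sentence "You could say that they regularly catch a shower, which adds to their exhilaration and joie de vivre.", but I can't get an NLP parse tree like the following example: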

(ROOT (S (NP (PRP You)) (VP (MD could) (VP (VB say) (SBAR (IN that) (S (NP (PRP they)) (ADVP (RB regularly)) (VP (VB catch) (NP (NP (DT a) (NN shower)) (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ adds) (PP (TO to) (NP (NP (PRP$ their) (NN exhilaration)) (CC and) (NP (FW joie) (FW de) (FW vivre))))))))))))) (. .)))

I want to replicate the solution to this question https://dev59.com/Mprga4cB1Zd3GeqPnHmV#39320379, but I have a sentence as a string rather than an NLP tree.
By the way, I'm using Python 3.

See https://stackoverflow.com/questions/45520228/parse-nltk-chunk-string-to-form-tree - Amarpreet Singh
2 Answers

4
Use the Tree.fromstring() method:
>>> from nltk import Tree
>>> parse = Tree.fromstring('(ROOT (S (NP (PRP You)) (VP (MD could) (VP (VB say) (SBAR (IN that) (S (NP (PRP they)) (ADVP (RB regularly)) (VP (VB catch) (NP (NP (DT a) (NN shower)) (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ adds) (PP (TO to) (NP (NP (PRP$ their) (NN exhilaration)) (CC and) (NP (FW joie) (FW de) (FW vivre))))))))))))) (. .)))')

>>> parse
Tree('ROOT', [Tree('S', [Tree('NP', [Tree('PRP', ['You'])]), Tree('VP', [Tree('MD', ['could']), Tree('VP', [Tree('VB', ['say']), Tree('SBAR', [Tree('IN', ['that']), Tree('S', [Tree('NP', [Tree('PRP', ['they'])]), Tree('ADVP', [Tree('RB', ['regularly'])]), Tree('VP', [Tree('VB', ['catch']), Tree('NP', [Tree('NP', [Tree('DT', ['a']), Tree('NN', ['shower'])]), Tree(',', [',']), Tree('SBAR', [Tree('WHNP', [Tree('WDT', ['which'])]), Tree('S', [Tree('VP', [Tree('VBZ', ['adds']), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('NP', [Tree('PRP$', ['their']), Tree('NN', ['exhilaration'])]), Tree('CC', ['and']), Tree('NP', [Tree('FW', ['joie']), Tree('FW', ['de']), Tree('FW', ['vivre'])])])])])])])])])])])])]), Tree('.', ['.'])])])

>>> parse.pretty_print()
                                                       ROOT                                                             
                                                        |                                                                
                                                        S                                                               
  ______________________________________________________|_____________________________________________________________   
 |         VP                                                                                                         | 
 |     ____|___                                                                                                       |  
 |    |        VP                                                                                                     | 
 |    |     ___|____                                                                                                  |  
 |    |    |       SBAR                                                                                               | 
 |    |    |    ____|_______                                                                                          |  
 |    |    |   |            S                                                                                         | 
 |    |    |   |     _______|____________                                                                             |  
 |    |    |   |    |       |            VP                                                                           | 
 |    |    |   |    |       |        ____|______________                                                              |  
 |    |    |   |    |       |       |                   NP                                                            | 
 |    |    |   |    |       |       |         __________|__________                                                   |  
 |    |    |   |    |       |       |        |          |         SBAR                                                | 
 |    |    |   |    |       |       |        |          |      ____|____                                              |  
 |    |    |   |    |       |       |        |          |     |         S                                             | 
 |    |    |   |    |       |       |        |          |     |         |                                             |  
 |    |    |   |    |       |       |        |          |     |         VP                                            | 
 |    |    |   |    |       |       |        |          |     |     ____|____                                         |  
 |    |    |   |    |       |       |        |          |     |    |         PP                                       | 
 |    |    |   |    |       |       |        |          |     |    |     ____|_____________________                   |  
 |    |    |   |    |       |       |        |          |     |    |    |                          NP                 | 
 |    |    |   |    |       |       |        |          |     |    |    |          ________________|________          |  
 NP   |    |   |    NP     ADVP     |        NP         |    WHNP  |    |         NP               |        NP        | 
 |    |    |   |    |       |       |     ___|____      |     |    |    |     ____|_______         |    ____|____     |  
PRP   MD   VB  IN  PRP      RB      VB   DT       NN    ,    WDT  VBZ   TO  PRP$          NN       CC  FW   FW   FW   . 
 |    |    |   |    |       |       |    |        |     |     |    |    |    |            |        |   |    |    |    |  
You could say that they regularly catch  a      shower  ,   which adds  to their     exhilaration and joie  de vivre  . 
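
Once the bracketed string has been read into a Tree, the object can be queried directly. The following is a minimal sketch on a shortened version of the parse above (the shortened string and the variable names are chosen here for illustration only); it uses the Tree methods leaves(), pos() and subtrees():

```python
from nltk import Tree

# A shortened bracketed parse, used here only to keep the example small.
bracketed = '(S (NP (PRP You)) (VP (MD could) (VP (VB say))))'
tree = Tree.fromstring(bracketed)

# The words of the sentence, in order.
words = tree.leaves()

# (word, part-of-speech tag) pairs taken from the pre-terminal nodes.
tagged = tree.pos()

# The text covered by every NP subtree.
noun_phrases = [
    ' '.join(subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == 'NP')
]

print(words)
print(tagged)
print(noun_phrases)
```

The same calls should work unchanged on the full ROOT tree shown above.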

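This doesn't actually solve my problem. I need to know how to obtain the value you pass as the argument to Tree.fromstring(). I have a lot of sentences as strings, about 70k, and I can't specify the NLP tree for each one by hand. - xzegga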
1
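I'm a bit confused =) Do you mean you want to get a parse from a raw string? Or do you want to read an already-parsed bracketed string into an nltk.Tree object? - alvas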

2
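I'll assume you have a good reason for needing the parse tree in that format. spaCy uses a convolutional neural network (CNN) model and produces a dependency parse, not a constituency tree or a context-free grammar (CFG) like the one in your example; it works well, is production-ready, and is very fast. You can try the following code to see what it gives you (and then read the documentation linked above):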
import spacy

# The 'en' shortcut no longer works in spaCy 3; load the model package by name.
nlp = spacy.load('en_core_web_sm')

text = 'You could say that they regularly catch a shower , which adds to their exhilaration and joie de vivre.'

for token in nlp(text):
    print(token.dep_, end='\t')
    print(token.idx, end='\t')
    print(token.text, end='\t')
    print(token.tag_, end='\t')
    print(token.head.text, end='\t')
    print(token.head.tag_, end='\t')
    print(token.head.idx, end='\t')
    print(' '.join([w.text for w in token.subtree]), end='\t')
    print(' '.join([w.text for w in token.children]))
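
The loop above prints each token's dependency label, head, and children. A rough sketch of walking that structure recursively and emitting a bracketed string could look like the following. It only relies on the attributes text, dep_, i and children, which spaCy tokens provide; FakeToken is a hypothetical stand-in defined here so the example runs without a spaCy model, and the bracket layout is one arbitrary choice, not a standard format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeToken:
    """Hypothetical stand-in exposing the token attributes used below."""
    text: str
    dep_: str
    i: int  # position of the token in the sentence
    children: List["FakeToken"] = field(default_factory=list)


def to_bracketed(token):
    """Render a token and its dependents as '(dep ... word ...)'.

    The head word is placed between its left and right children so that,
    for projective parses, the words appear in sentence order.
    """
    children = sorted(token.children, key=lambda t: t.i)
    if not children:
        return f"({token.dep_} {token.text})"
    parts = []
    head_placed = False
    for child in children:
        if not head_placed and child.i > token.i:
            parts.append(token.text)
            head_placed = True
        parts.append(to_bracketed(child))
    if not head_placed:
        parts.append(token.text)
    return f"({token.dep_} {' '.join(parts)})"


mary = FakeToken("Mary", "nsubj", 0)
bob = FakeToken("Bob", "dobj", 2)
saw = FakeToken("saw", "ROOT", 1, [mary, bob])
print(to_bracketed(saw))
```

With spaCy, the starting point would be the root token of a sentence (for example the root attribute of a sentence span). The result is still a dependency tree written with brackets, not a constituency parse like the ROOT/S/NP example in the question, and tokens that are themselves parentheses would need escaping.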

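Now you can write an algorithm to navigate this tree and print it accordingly (sorry, I couldn't find a quick example, but you can see the indexes and how to traverse the parse). Another option is to extract a CFG somehow and then use NLTK to parse and display the result in the desired format. This is from the NLTK playbook (modified to work with Python 3):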
import nltk
from nltk import CFG

grammar = CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  V -> "saw" | "ate"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "dog" | "cat" | "cookie" | "park"
  PP -> P NP
  P -> "in" | "on" | "by" | "with"
  """)

text = 'Mary saw Bob'

sent = text.split()
rd_parser = nltk.RecursiveDescentParser(grammar)
for p in rd_parser.parse(sent):
    print(p)
# (S (NP Mary) (VP (V saw) (NP Bob)))

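However, as you can see, the CFG has to be defined (so if you replace the text in the example with your original sentence, you'll find that the parser cannot handle tokens that aren't defined in the CFG).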
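
To make that limitation concrete, here is a small sketch that checks whether a hand-written grammar covers a sentence before parsing it. It uses the grammar's check_coverage() method, which raises a ValueError when some input words have no lexical rule; the reduced grammar and the helper name coverage_error are made up for this illustration.

```python
from nltk import CFG

# A reduced toy grammar, for illustration only.
grammar = CFG.fromstring("""
  S -> NP VP
  VP -> V NP
  V -> "saw"
  NP -> "Mary" | "Bob"
""")


def coverage_error(grammar, tokens):
    """Return the error message if the grammar lacks some words, else None."""
    try:
        grammar.check_coverage(tokens)
    except ValueError as err:
        return str(err)
    return None


print(coverage_error(grammar, 'Mary saw Bob'.split()))
print(coverage_error(grammar, 'You could say that'.split()))
```

The first call finds nothing missing, while the second reports the uncovered words, which is why a toy CFG is not a practical route for arbitrary sentences.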
Using Stanford's NLP parser seems to be the easiest way to get the format you want. Taken from this stackoverflow question (sorry, I haven't tested it):
from nltk.parse.stanford import StanfordParser  # deprecated in recent NLTK releases

parser = StanfordParser(model_path='edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz')
parsed = parser.raw_parse('Jack payed up to 5% more for each unit')
for line in parsed:
    print(line, end=' ') # This will print all in one line, as desired

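I didn't test this because I haven't had time to install the Stanford parser, which can be a bit cumbersome compared with installing a Python module, that is, if you're looking for a Python solution. I hope this helps, and sorry it isn't a direct answer.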

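The "StanfordParser" code won't work with the latest version of NLTK because it has been deprecated. I suggest using "nltk.parse.corenlp.CoreNLPParser" instead. - alvas
See https://dev59.com/S2Yr5IYBdhLWcg3wLnZA. - alvas
OK, fair enough. However you start the StanfordParser, it seems to output the desired parse tree format. - Eugene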
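
Following the CoreNLPParser suggestion in the comments, a hedged sketch for producing bracketed parses for many raw sentences (such as the roughly 70k mentioned in the question) might look like this. It assumes a Stanford CoreNLP server is already running and reachable; the address http://localhost:9000 and the helper names make_corenlp_parser and bracketed_parses are assumptions made for this example and have not been tested against a live server.

```python
from nltk import Tree


def make_corenlp_parser(url="http://localhost:9000"):
    """Create a CoreNLPParser client; assumes a server is running at `url`."""
    from nltk.parse.corenlp import CoreNLPParser
    return CoreNLPParser(url=url)


def bracketed_parses(sentences, parser):
    """Return one single-line bracketed parse string per input sentence.

    `parser` is any object whose raw_parse(sentence) yields nltk.Tree
    objects, as CoreNLPParser.raw_parse does.
    """
    results = []
    for sentence in sentences:
        tree = next(parser.raw_parse(sentence))  # take the first parse
        # str(tree) may wrap over several lines; collapse the whitespace.
        results.append(" ".join(str(tree).split()))
    return results
```

With a server available, bracketed_parses(my_sentences, make_corenlp_parser()) would return strings that Tree.fromstring() can read back, where my_sentences stands for your own list of sentences.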
