能否在NLTK中使用斯坦福分析器?(我不是指斯坦福词性标注。)
能否在NLTK中使用斯坦福分析器?(我不是指斯坦福词性标注。)
请注意,此答案适用于NLTK v3.0,而不适用于更近期的版本。
在Python中尝试以下代码:
import os
from nltk.parse import stanford
os.environ['STANFORD_PARSER'] = '/path/to/standford/jars'
os.environ['STANFORD_MODELS'] = '/path/to/standford/jars'
parser = stanford.StanfordParser(model_path="/location/of/the/englishPCFG.ser.gz")
sentences = parser.raw_parse_sents(("Hello, My name is Melroy.", "What is your name?"))
print sentences
# GUI
for line in sentences:
for sentence in line:
sentence.draw()
输出:
[Tree('ROOT',[Tree('S',[Tree('INTJ',[Tree('UH',['你好'])]),Tree(',',[',']),Tree('NP',[Tree('PRP $',['我的']),Tree('NN',['名字'])]),Tree('VP',[Tree('VBZ',['是']),Tree('ADJP',[Tree('JJ',['梅尔罗伊'])]])),Tree('。',['。']])]]),Tree('ROOT',[Tree('SBARQ',[Tree('WHNP',[Tree('WP',['什么'])]),Tree('SQ',[Tree('VBZ',['是']),Tree('NP',[Tree('PRP $',['你的']),Tree('NN',['名字'])]]),Tree('。',['?'])]])]
注1:在此示例中,解析器和模型jar包都位于同一个文件夹中。
注2:
注3:englishPCFG.ser.gz文件可以在models.jar文件(/ edu / stanford / nlp / models / lexparser / englishPCFG.ser.gz)中找到。请使用某个归档管理器解压缩models.jar文件。
注4:请确保您正在使用Java JRE(运行时环境)1.8,也称为Oracle JDK 8。否则,您将获得:Unsupported major.minor version 52.0。
从https://github.com/nltk/nltk下载NLTK v3。然后安装NLTK:
sudo python setup.py install
您可以使用Python使用NLTK下载器获取Stanford Parser:
import nltk
nltk.download()
尝试我的示例!(不要忘记更改jar路径并将模型路径更改为ser.gz位置)
或者:
下载并安装与上述相同的NLTK v3。
从以下网址下载最新版本(当前版本文件名为stanford-parser-full-2015-01-29.zip): http://nlp.stanford.edu/software/lex-parser.shtml#Download
解压缩standford-parser-full-20xx-xx-xx.zip。
创建一个新文件夹(在我的示例中为“jars”)。将提取的文件放入此jar文件夹中:stanford-parser-3.x.x-models.jar和stanford-parser.jar。
如上所示,您可以使用环境变量(STANFORD_PARSER& STANFORD_MODELS)指向这个'jars'文件夹。我正在使用Linux,因此如果您使用Windows,请使用类似于:C://folder//jars的内容。
使用归档管理器(7zip)打开stanford-parser-3.x.x-models.jar。
浏览jar文件内部; edu / stanford / nlp / models / lexparser。再次提取名为'englishPCFG.ser.gz'的文件。记住提取此ser.gz文件的位置。
创建StanfordParser实例时,可以将模型路径作为参数提供。这是模型的完整路径,在我们的情况下是/ location / of / englishPCFG.ser.gz。
尝试我的示例!(不要忘记更改jar路径并将模型路径更改为ser.gz位置)
AttributeError: 'StanfordParser' object has no attribute 'raw_batch_parse'
- Nick Retallack for sentence in line:
sentence.draw()```
您只能在Tree对象上执行draw() ;)
- Melroy van den BergSTANFORDTOOLSDIR
放入CLASSPATH
中,用于解析器 jar 文件和 parser_model jar 文件,例如 export CLASSPATH=$STANFORDTOOLSDIR/stanford-parser-full-2015-04-20/stanford-parser.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-04-20/stanford-parser-3.5.2-models.jar
。 - alvas下面的答案已经被废弃,请使用位于https://dev59.com/S2Yr5IYBdhLWcg3wLnZA#51981566的解决方案,适用于NLTK v3.3及以上版本。
注意:以下答案仅适用于:
由于这两个工具变化非常快,API在3-6个月后可能会有很大不同。请将以下答案视为临时解决方案,而不是永久修复。
始终参考https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software获取有关如何使用NLTK接口Stanford NLP工具的最新说明!
cd $HOME
# Update / Install NLTK
pip install -U nltk
# Download the Stanford NLP tools
wget http://nlp.stanford.edu/software/stanford-ner-2015-04-20.zip
wget http://nlp.stanford.edu/software/stanford-postagger-full-2015-04-20.zip
wget http://nlp.stanford.edu/software/stanford-parser-full-2015-04-20.zip
# Extract the zip file.
unzip stanford-ner-2015-04-20.zip
unzip stanford-parser-full-2015-04-20.zip
unzip stanford-postagger-full-2015-04-20.zip
export STANFORDTOOLSDIR=$HOME
export CLASSPATH=$STANFORDTOOLSDIR/stanford-postagger-full-2015-04-20/stanford-postagger.jar:$STANFORDTOOLSDIR/stanford-ner-2015-04-20/stanford-ner.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-04-20/stanford-parser.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-04-20/stanford-parser-3.5.2-models.jar
export STANFORD_MODELS=$STANFORDTOOLSDIR/stanford-postagger-full-2015-04-20/models:$STANFORDTOOLSDIR/stanford-ner-2015-04-20/classifiers
然后:
>>> from nltk.tag.stanford import StanfordPOSTagger
>>> st = StanfordPOSTagger('english-bidirectional-distsim.tagger')
>>> st.tag('What is the airspeed of an unladen swallow ?'.split())
[(u'What', u'WP'), (u'is', u'VBZ'), (u'the', u'DT'), (u'airspeed', u'NN'), (u'of', u'IN'), (u'an', u'DT'), (u'unladen', u'JJ'), (u'swallow', u'VB'), (u'?', u'.')]
>>> from nltk.tag import StanfordNERTagger
>>> st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
>>> st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
[(u'Rami', u'PERSON'), (u'Eid', u'PERSON'), (u'is', u'O'), (u'studying', u'O'), (u'at', u'O'), (u'Stony', u'ORGANIZATION'), (u'Brook', u'ORGANIZATION'), (u'University', u'ORGANIZATION'), (u'in', u'O'), (u'NY', u'O')]
>>> from nltk.parse.stanford import StanfordParser
>>> parser=StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
>>> list(parser.raw_parse("the quick brown fox jumps over the lazy dog"))
[Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['brown']), Tree('NN', ['fox'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['dog'])])])])])])]
>>> from nltk.parse.stanford import StanfordDependencyParser
>>> dep_parser=StanfordDependencyParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
>>> print [parse.tree() for parse in dep_parser.raw_parse("The quick brown fox jumps over the lazy dog.")]
[Tree('jumps', [Tree('fox', ['The', 'quick', 'brown']), Tree('dog', ['over', 'the', 'lazy'])])]
首先,需要注意的是 Stanford NLP 工具是用 Java 编写的,而 NLTK 是用 Python 编写的。NLTK 通过命令行接口调用 Java 工具。
其次,自 NLTK 版本 3.1 以来,NLTK
API 与 Stanford NLP 工具发生了相当大的变化,因此建议将 NLTK 包更新到 v3.1。
第三,NLTK
API 与 Stanford NLP 工具包装了各个 NLP 工具,例如斯坦福 POS 标注器、斯坦福 NER 标注器、斯坦福解析器。
对于 POS 和 NER 标注器,它不会包装斯坦福 Core NLP 包。
对于斯坦福解析器,它是一个特殊情况,它同时包装了斯坦福解析器和斯坦福 Core NLP(个人未使用后者的 NLTK 版本,我更愿意遵循 @dimazest 在http://www.eecs.qmul.ac.uk/~dm303/stanford-dependency-parser-nltk-and-anaconda.html上的演示)
请注意,在 NLTK v3.1 中,STANFORD_JAR
和 STANFORD_PARSER
变量已经被弃用并且不再使用。
假设您已经在操作系统上适当地安装了 Java。
现在,安装/更新您的 NLTK 版本(参见http://www.nltk.org/install.html):
sudo pip install -U nltk
sudo apt-get install python-nltk
对于 Windows(使用 32 位二进制安装):
(为什么不使用64位版本?请参见 https://github.com/nltk/nltk/issues/1079)
然后出于谨慎起见,在 python 中重新检查您的 nltk
版本:
from __future__ import print_function
import nltk
print(nltk.__version__)
或者在命令行上:
python3 -c "import nltk; print(nltk.__version__)"
确保您看到输出为3.1
。
为了更加谨慎,检查所有您喜爱的Stanford NLP工具API是否可用:
from nltk.parse.stanford import StanfordParser
from nltk.parse.stanford import StanfordDependencyParser
from nltk.parse.stanford import StanfordNeuralDependencyParser
from nltk.tag.stanford import StanfordPOSTagger, StanfordNERTagger
from nltk.tokenize.stanford import StanfordTokenizer
(注意:上面的导入仅确保您使用包含这些API的正确NLTK版本。导入时不出现错误并不意味着您已成功配置NLTK API以使用Stanford工具)
现在,您已检查了NLTK的正确版本,其中包含必要的Stanford NLP工具接口。 您需要下载并提取所有必要的Stanford NLP工具。
简而言之,在Unix中:
cd $HOME
# Download the Stanford NLP tools
wget http://nlp.stanford.edu/software/stanford-ner-2015-04-20.zip
wget http://nlp.stanford.edu/software/stanford-postagger-full-2015-04-20.zip
wget http://nlp.stanford.edu/software/stanford-parser-full-2015-04-20.zip
# Extract the zip file.
unzip stanford-ner-2015-04-20.zip
unzip stanford-parser-full-2015-04-20.zip
unzip stanford-postagger-full-2015-04-20.zip
在Windows/Mac中:
设置环境变量,以便NLTK可以自动找到相关的文件路径。您必须设置以下变量:
将适当的Stanford NLP .jar
文件添加到CLASSPATH
环境变量中。
stanford-ner-2015-04-20/stanford-ner.jar
stanford-postagger-full-2015-04-20/stanford-postagger.jar
stanford-parser-full-2015-04-20/stanford-parser.jar
和解析器模型jar文件stanford-parser-full-2015-04-20/stanford-parser-3.5.2-models.jar
将适当的模型目录添加到STANFORD_MODELS
变量中(即可以找到预训练模型的目录)
stanford-ner-2015-04-20/classifiers/
中。stanford-postagger-full-2015-04-20/models/
中。在代码中,请查看它是否在STANFORD_MODELS
目录中搜索模型名称。此外,请查看API是否自动尝试在操作系统环境中搜索`CLASSPATH`。
请注意,从NLTK v3.1开始,STANFORD_JAR
变量已弃用且不再使用。以下Stackoverflow问题中的代码片段可能无法正常工作:
在Ubuntu上的第三步TL;DR
export STANFORDTOOLSDIR=/home/path/to/stanford/tools/
export CLASSPATH=$STANFORDTOOLSDIR/stanford-postagger-full-2015-04-20/stanford-postagger.jar:$STANFORDTOOLSDIR/stanford-ner-2015-04-20/stanford-ner.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-04-20/stanford-parser.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-04-20/stanford-parser-3.5.2-models.jar
export STANFORD_MODELS=$STANFORDTOOLSDIR/stanford-postagger-full-2015-04-20/models:$STANFORDTOOLSDIR/stanford-ner-2015-04-20/classifiers
(对于 Windows 操作系统:请查看 https://dev59.com/qHA65IYBdhLWcg3wqQY8#17176423,了解如何设置环境变量)
在启动 Python 之前,您必须按照上述方式设置变量,然后执行以下操作:
>>> from nltk.tag.stanford import StanfordPOSTagger
>>> st = StanfordPOSTagger('english-bidirectional-distsim.tagger')
>>> st.tag('What is the airspeed of an unladen swallow ?'.split())
[(u'What', u'WP'), (u'is', u'VBZ'), (u'the', u'DT'), (u'airspeed', u'NN'), (u'of', u'IN'), (u'an', u'DT'), (u'unladen', u'JJ'), (u'swallow', u'VB'), (u'?', u'.')]
>>> from nltk.tag import StanfordNERTagger
>>> st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
>>> st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
[(u'Rami', u'PERSON'), (u'Eid', u'PERSON'), (u'is', u'O'), (u'studying', u'O'), (u'at', u'O'), (u'Stony', u'ORGANIZATION'), (u'Brook', u'ORGANIZATION'), (u'University', u'ORGANIZATION'), (u'in', u'O'), (u'NY', u'O')]
>>> from nltk.parse.stanford import StanfordParser
>>> parser=StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
>>> list(parser.raw_parse("the quick brown fox jumps over the lazy dog"))
[Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['brown']), Tree('NN', ['fox'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['dog'])])])])])])]
或者,您可以尝试在Python代码中添加环境变量,如之前的回答所建议的那样,但您也可以直接告诉解析器/标记器初始化为您保存.jar
文件和模型的直接路径。
如果使用以下方法,则无需设置环境变量,但是当API更改其参数名称时,您需要相应地进行更改。这就是为什么设置环境变量比修改Python代码以适应NLTK版本更加明智。
例如(不设置任何环境变量):
# POS tagging:
from nltk.tag import StanfordPOSTagger
stanford_pos_dir = '/home/alvas/stanford-postagger-full-2015-04-20/'
eng_model_filename= stanford_pos_dir + 'models/english-left3words-distsim.tagger'
my_path_to_jar= stanford_pos_dir + 'stanford-postagger.jar'
st = StanfordPOSTagger(model_filename=eng_model_filename, path_to_jar=my_path_to_jar)
st.tag('What is the airspeed of an unladen swallow ?'.split())
# NER Tagging:
from nltk.tag import StanfordNERTagger
stanford_ner_dir = '/home/alvas/stanford-ner/'
eng_model_filename= stanford_ner_dir + 'classifiers/english.all.3class.distsim.crf.ser.gz'
my_path_to_jar= stanford_ner_dir + 'stanford-ner.jar'
st = StanfordNERTagger(model_filename=eng_model_filename, path_to_jar=my_path_to_jar)
st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
# Parsing:
from nltk.parse.stanford import StanfordParser
stanford_parser_dir = '/home/alvas/stanford-parser/'
eng_model_path = stanford_parser_dir + "edu/stanford/nlp/models/lexparser/englishRNN.ser.gz"
my_path_to_models_jar = stanford_parser_dir + "stanford-parser-3.5.2-models.jar"
my_path_to_jar = stanford_parser_dir + "stanford-parser.jar"
parser=StanfordParser(model_path=eng_model_path, path_to_models_jar=my_path_to_models_jar, path_to_jar=my_path_to_jar)
从NLTK v3.3开始,用户应该避免使用来自nltk.tag
的Stanford NER或POS标注器,以及nltk.tokenize
的Stanford分词/分段器。
取而代之,使用新的nltk.parse.corenlp.CoreNLPParser
API。
请参见https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK
(避免只有链接的答案,我已将NLTK github wiki中的文档粘贴在下面)
首先,更新您的NLTK。
pip3 install -U nltk # Make sure is >=3.3
然后下载必要的CoreNLP软件包:
cd ~
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
unzip stanford-corenlp-full-2018-02-27.zip
cd stanford-corenlp-full-2018-02-27
# Get the Chinese model
wget http://nlp.stanford.edu/software/stanford-chinese-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-chinese.properties
# Get the Arabic model
wget http://nlp.stanford.edu/software/stanford-arabic-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-arabic.properties
# Get the French model
wget http://nlp.stanford.edu/software/stanford-french-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-french.properties
# Get the German model
wget http://nlp.stanford.edu/software/stanford-german-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-german.properties
# Get the Spanish model
wget http://nlp.stanford.edu/software/stanford-spanish-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-spanish.properties
在 stanford-corenlp-full-2018-02-27
目录下,启动服务器:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-preload tokenize,ssplit,pos,lemma,ner,parse,depparse \
-status_port 9000 -port 9000 -timeout 15000 &
然后在Python中:
>>> from nltk.parse import CoreNLPParser
# Lexical Parser
>>> parser = CoreNLPParser(url='http://localhost:9000')
# Parse tokenized text.
>>> list(parser.parse('What is the airspeed of an unladen swallow ?'.split()))
[Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['airspeed'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['unladen'])])]), Tree('S', [Tree('VP', [Tree('VB', ['swallow'])])])])]), Tree('.', ['?'])])])]
# Parse raw string.
>>> list(parser.raw_parse('What is the airspeed of an unladen swallow ?'))
[Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['airspeed'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['unladen'])])]), Tree('S', [Tree('VP', [Tree('VB', ['swallow'])])])])]), Tree('.', ['?'])])])]
# Neural Dependency Parser
>>> from nltk.parse.corenlp import CoreNLPDependencyParser
>>> dep_parser = CoreNLPDependencyParser(url='http://localhost:9000')
>>> parses = dep_parser.parse('What is the airspeed of an unladen swallow ?'.split())
>>> [[(governor, dep, dependent) for governor, dep, dependent in parse.triples()] for parse in parses]
[[(('What', 'WP'), 'cop', ('is', 'VBZ')), (('What', 'WP'), 'nsubj', ('airspeed', 'NN')), (('airspeed', 'NN'), 'det', ('the', 'DT')), (('airspeed', 'NN'), 'nmod', ('swallow', 'VB')), (('swallow', 'VB'), 'case', ('of', 'IN')), (('swallow', 'VB'), 'det', ('an', 'DT')), (('swallow', 'VB'), 'amod', ('unladen', 'JJ')), (('What', 'WP'), 'punct', ('?', '.'))]]
# Tokenizer
>>> parser = CoreNLPParser(url='http://localhost:9000')
>>> list(parser.tokenize('What is the airspeed of an unladen swallow?'))
['What', 'is', 'the', 'airspeed', 'of', 'an', 'unladen', 'swallow', '?']
# POS Tagger
>>> pos_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
>>> list(pos_tagger.tag('What is the airspeed of an unladen swallow ?'.split()))
[('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
# NER Tagger
>>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
>>> list(ner_tagger.tag(('Rami Eid is studying at Stony Brook University in NY'.split())))
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'STATE_OR_PROVINCE')]
以略有不同的方式启动服务器,仍然从`stanford-corenlp-full-2018-02-27`目录开始:
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-chinese.properties \
-preload tokenize,ssplit,pos,lemma,ner,parse \
-status_port 9001 -port 9001 -timeout 15000
在Python中:
>>> parser = CoreNLPParser('http://localhost:9001')
>>> list(parser.tokenize(u'我家没有电脑。'))
['我家', '没有', '电脑', '。']
>>> list(parser.parse(parser.tokenize(u'我家没有电脑。')))
[Tree('ROOT', [Tree('IP', [Tree('IP', [Tree('NP', [Tree('NN', ['我家'])]), Tree('VP', [Tree('VE', ['没有']), Tree('NP', [Tree('NN', ['电脑'])])])]), Tree('PU', ['。'])])])]
启动服务器:
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-arabic.properties \
-preload tokenize,ssplit,pos,parse \
-status_port 9005 -port 9005 -timeout 15000
在Python中:
>>> from nltk.parse import CoreNLPParser
>>> parser = CoreNLPParser('http://localhost:9005')
>>> text = u'انا حامل'
# Parser.
>>> parser.raw_parse(text)
<list_iterator object at 0x7f0d894c9940>
>>> list(parser.raw_parse(text))
[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('PRP', ['انا'])]), Tree('NP', [Tree('NN', ['حامل'])])])])]
>>> list(parser.parse(parser.tokenize(text)))
[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('PRP', ['انا'])]), Tree('NP', [Tree('NN', ['حامل'])])])])]
# Tokenizer / Segmenter.
>>> list(parser.tokenize(text))
['انا', 'حامل']
# POS tagg
>>> pos_tagger = CoreNLPParser('http://localhost:9005', tagtype='pos')
>>> list(pos_tagger.tag(parser.tokenize(text)))
[('انا', 'PRP'), ('حامل', 'NN')]
# NER tag
>>> ner_tagger = CoreNLPParser('http://localhost:9005', tagtype='ner')
>>> list(ner_tagger.tag(parser.tokenize(text)))
[('انا', 'O'), ('حامل', 'O')]
启动服务器:
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-french.properties \
-preload tokenize,ssplit,pos,parse \
-status_port 9004 -port 9004 -timeout 15000
在Python中:>>> parser = CoreNLPParser('http://localhost:9004')
>>> list(parser.parse('Je suis enceinte'.split()))
[Tree('ROOT', [Tree('SENT', [Tree('NP', [Tree('PRON', ['Je']), Tree('VERB', ['suis']), Tree('AP', [Tree('ADJ', ['enceinte'])])])])])]
>>> pos_tagger = CoreNLPParser('http://localhost:9004', tagtype='pos')
>>> pos_tagger.tag('Je suis enceinte'.split())
[('Je', 'PRON'), ('suis', 'VERB'), ('enceinte', 'ADJ')]
启动服务器:
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-german.properties \
-preload tokenize,ssplit,pos,ner,parse \
-status_port 9002 -port 9002 -timeout 15000
在Python中:>>> parser = CoreNLPParser('http://localhost:9002')
>>> list(parser.raw_parse('Ich bin schwanger'))
[Tree('ROOT', [Tree('NUR', [Tree('S', [Tree('PPER', ['Ich']), Tree('VAFIN', ['bin']), Tree('AP', [Tree('ADJD', ['schwanger'])])])])])]
>>> list(parser.parse('Ich bin schwanger'.split()))
[Tree('ROOT', [Tree('NUR', [Tree('S', [Tree('PPER', ['Ich']), Tree('VAFIN', ['bin']), Tree('AP', [Tree('ADJD', ['schwanger'])])])])])]
>>> pos_tagger = CoreNLPParser('http://localhost:9002', tagtype='pos')
>>> pos_tagger.tag('Ich bin schwanger'.split())
[('Ich', 'PPER'), ('bin', 'VAFIN'), ('schwanger', 'ADJD')]
>>> pos_tagger = CoreNLPParser('http://localhost:9002', tagtype='pos')
>>> pos_tagger.tag('Ich bin schwanger'.split())
[('Ich', 'PPER'), ('bin', 'VAFIN'), ('schwanger', 'ADJD')]
>>> ner_tagger = CoreNLPParser('http://localhost:9002', tagtype='ner')
>>> ner_tagger.tag('Donald Trump besuchte Angela Merkel in Berlin.'.split())
[('Donald', 'PERSON'), ('Trump', 'PERSON'), ('besuchte', 'O'), ('Angela', 'PERSON'), ('Merkel', 'PERSON'), ('in', 'O'), ('Berlin', 'LOCATION'), ('.', 'O')]
启动服务器:
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-spanish.properties \
-preload tokenize,ssplit,pos,ner,parse \
-status_port 9003 -port 9003 -timeout 15000
在Python中:
>>> pos_tagger = CoreNLPParser('http://localhost:9003', tagtype='pos')
>>> pos_tagger.tag(u'Barack Obama salió con Michael Jackson .'.split())
[('Barack', 'PROPN'), ('Obama', 'PROPN'), ('salió', 'VERB'), ('con', 'ADP'), ('Michael', 'PROPN'), ('Jackson', 'PROPN'), ('.', 'PUNCT')]
>>> ner_tagger = CoreNLPParser('http://localhost:9003', tagtype='ner')
>>> ner_tagger.tag(u'Barack Obama salió con Michael Jackson .'.split())
[('Barack', 'PERSON'), ('Obama', 'PERSON'), ('salió', 'O'), ('con', 'O'), ('Michael', 'PERSON'), ('Jackson', 'PERSON'), ('.', 'O')]
list(parser.raw_parse(text))
或 list(parser.parse(parser.tokenize(text)))
。修正了示例 ;) - alvas以下答案已经被弃用,请使用https://dev59.com/S2Yr5IYBdhLWcg3wLnZA#51981566中适用于NLTK v3.3及以上版本的解决方案。
截至当前Stanford parser(2015-04-20)的默认输出为lexparser.sh
,因此下面的脚本将无法使用。
但出于传统目的,仍将保留这个答案,它仍然可以使用http://nlp.stanford.edu/software/stanford-parser-2012-11-12.zip。
我建议你不要搞乱Jython、JPype之类的东西。让Python处理Python的事情,让Java处理Java的事情,在控制台上获取Stanford Parser的输出结果。
安装了Stanford Parser后,只需使用此Python代码即可获得扁平化分析结果:
import os
sentence = "this is a foo bar i want to parse."
os.popen("echo '"+sentence+"' > ~/stanfordtemp.txt")
parser_out = os.popen("~/stanford-parser-2012-11-12/lexparser.sh ~/stanfordtemp.txt").readlines()
bracketed_parse = " ".join( [i.strip() for i in parser_out if i.strip()[0] == "("] )
print bracketed_parse
len(i.strip()) > 0
,否则我会得到一个索引错误。我猜我的解析器输出至少有一行是纯空格。 - Frank Riccobonoimport os
from nltk.parse import stanford
os.environ['STANFORD_PARSER'] = 'd:/stanford-parser'
os.environ['STANFORD_MODELS'] = 'd:/stanford-parser'
os.environ['JAVAHOME'] = 'c:/Program Files/java/jre7/bin'
parser = stanford.StanfordParser(model_path="d:/stanford-grammars/englishPCFG.ser.gz")
sentences = parser.raw_parse_sents(("Hello, My name is Melroy.", "What is your name?"))
print sentences
<?php
exec( "java -cp /pathTo/stanford-parser.jar -mx100m edu.stanford.nlp.process.DocumentPreprocessor /pathTo/fileToParse > /pathTo/resultFile 2>/dev/null" );
?>
f = open(sys.argv[1]+".output"+".30"+".stp", "r")
parse_trees_text=[]
tree = ""
for line in f:
if line.isspace():
parse_trees_text.append(tree)
tree = ""
elif "(. ...))" in line:
#print "YES"
tree = tree+')'
parse_trees_text.append(tree)
tree = ""
else:
tree = tree + line
parse_trees=[]
for t in parse_trees_text:
tree = nltk.Tree(t)
tree.__delitem__(len(tree)-1) #delete "(. .))" from tree (you don't need that)
s = traverse(tree)
parse_trees.append(tree)
stanford_parser_jar = '../lib/stanford-parser-full-2015-04-20/stanford-parser.jar'
stanford_model_jar = '../lib/stanford-parser-full-2015-04-20/stanford-parser-3.5.2-models.jar'
parser = StanfordParser(path_to_jar=stanford_parser_jar,
path_to_models_jar=stanford_model_jar)
通过这种方式,您不再需要担心路径问题。
对于那些在Ubuntu上无法正确使用它或无法在Eclipse中运行代码的人来说。