NLTK - WordNet: list of long words

6

I want to find words in WordNet that are at least 18 characters long. I tried the following code:

from nltk.corpus import wordnet as wn
sorted(w for w in wn.synset().name() if len(w)>18)

I got the following error message:

sorted(w for w in wn.synset().name() if len(w)>18)

TypeError: synset() missing 1 required positional argument: 'name'

I am using Python 3.4.3. How can I fix my code?
3 Answers

5
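Use wn.all_lemma_names() to get all the lemma names. I believe these are all the words WordNet provides, so there is no need to iterate over synsets (although you can look up each lemma's synsets if you are interested). You may want to sort your hits by length: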
longwords = [ n for n in wn.all_lemma_names() if len(n) > 18 ]
longwords.sort(key=len, reverse=True)
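
A minimal sketch of the same idea wrapped in a reusable helper. The question asks for words of "at least 18" characters, so this version uses >= rather than >. The function names long_words and wordnet_long_words are hypothetical, not part of NLTK. The NLTK import is done lazily so the filtering logic can be tried on any iterable of strings:

```python
def long_words(names, min_len=18):
    """Return the names with at least `min_len` characters, longest first."""
    # The set comprehension drops duplicates; the sort key orders by
    # descending length and then alphabetically for a stable result.
    return sorted({n for n in names if len(n) >= min_len},
                  key=lambda n: (-len(n), n))


def wordnet_long_words(min_len=18):
    """Apply long_words() to every lemma name in WordNet (needs NLTK data)."""
    # Imported here (an assumption of this sketch) so long_words() can be
    # used without NLTK installed.
    from nltk.corpus import wordnet as wn
    return long_words(wn.all_lemma_names(), min_len)
```

Calling wordnet_long_words() assumes the WordNet corpus has already been downloaded, for example with nltk.download('wordnet').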

1
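Before answering, you need to understand how the WordNet interface in NLTK works; see http://www.nltk.org/howto/wordnet.html
WordNet is indexed by concepts, each of which can be expressed by different words, and it carries semantic information. The WordNet interface in NLTK lets you look up the concepts a word can represent, for example: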
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('dog')
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
>>> for ss in wn.synsets('dog'):
...     print(ss, ss.definition())
... 
Synset('dog.n.01') a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
Synset('frump.n.01') a dull unattractive unpleasant girl or woman
Synset('dog.n.03') informal term for a man
Synset('cad.n.01') someone who is morally reprehensible
Synset('frank.n.02') a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll
Synset('pawl.n.01') a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward
Synset('andiron.n.01') metal supports for logs in a fireplace
Synset('chase.v.01') go after with the intent to catch

To access all the synsets in WordNet:

wn.all_synsets()

For each synset, you can call various methods to look up its properties, for example:

>>> ss = wn.synsets('dog')[0] # First synset for the word 'dog'
>>> ss.definition()
'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds'
>>> ss.hypernyms()
[Synset('canine.n.02'), Synset('domestic_animal.n.01')]
>>> ss.hyponyms()
[Synset('basenji.n.01'), Synset('corgi.n.01'), Synset('cur.n.01'), Synset('dalmatian.n.02'), Synset('great_pyrenees.n.01'), Synset('griffon.n.02'), Synset('hunting_dog.n.01'), Synset('lapdog.n.01'), Synset('leonberg.n.01'), Synset('mexican_hairless.n.01'), Synset('newfoundland.n.01'), Synset('pooch.n.01'), Synset('poodle.n.01'), Synset('pug.n.01'), Synset('puppy.n.01'), Synset('spitz.n.01'), Synset('toy_dog.n.01'), Synset('working_dog.n.01')]
>>> ss.name()
'dog.n.01'
>>> ss.lemma_names() # Other words that can represent this concept.
['dog', 'domestic_dog', 'Canis_familiaris']

So you could do this in one line, although it is not very readable:

sorted(ss.name() for ss in wn.all_synsets() if len(ss.name())>18)

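Note that this only gives you the names the synsets are indexed under, not every lemma name. Also, when checking len(ss.name()) > 18, you are counting the POS tag and the index ID (i.e. the .s.01 in a synset's index name).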
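
To illustrate how much the POS tag and index ID inflate the length check, here is a small sketch that takes a synset name apart using plain string handling. The helper name split_synset_name is hypothetical and not part of NLTK; it only assumes the lemma.pos.index layout seen in the output above:

```python
def split_synset_name(name):
    """Split a synset name such as 'dog.n.01' into (lemma, pos, index)."""
    # Split from the right so that a lemma which itself contains dots
    # is kept in one piece.
    lemma, pos, index = name.rsplit(".", 2)
    return lemma, pos, index


name = "dog.n.01"
lemma, pos, index = split_synset_name(name)
# The full synset name is 5 characters longer than the lemma alone.
print(len(name), len(lemma))  # 8 3
```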

So you need to use lemma_names() instead of name():

>>> from itertools import chain
>>> sorted(lemma for lemma in chain(*(ss.lemma_names() for ss in wn.all_synsets())) if len(lemma) > 18)

Alternatively, you can check the length while collecting the lemmas, before chaining and sorting:

>>> sorted(chain(*([lemma for lemma in ss.lemma_names() if len(lemma)>18] for ss in wn.all_synsets())))

Note: by iterating over the synsets and collecting lemma_names(), you will get duplicate lemma names, as well as both capitalized and uncapitalized versions of the same lemma.
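
If the synset-based route is used anyway, the duplicates and case variants described in this note could be collapsed along these lines. This is only a sketch: unique_ignoring_case is a hypothetical helper, and keeping the first spelling encountered is an arbitrary choice:

```python
def unique_ignoring_case(lemmas):
    """Keep the first spelling of each lemma, comparing case-insensitively."""
    seen = set()
    result = []
    for lemma in lemmas:
        key = lemma.lower()
        if key not in seen:
            # First time this lemma (in any capitalization) shows up.
            seen.add(key)
            result.append(lemma)
    return result


print(unique_ignoring_case(["Dog", "dog", "cat", "Dog"]))  # ['Dog', 'cat']
```

The result could then be passed to sorted() in the same way as the chained lemma names above.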
Of course, you don't need to go through all this trouble, because there is a built-in function:
>>> sorted(lemma for lemma in wn.all_lemma_names() if len(lemma) > 18)

-1

This function only gives you the lemma names of the synsets indexed by "word". Also, Synset.name only gives you the index lemma name. ;P - alvas
