Converting an nltk Tree into a JSON representation

Question

3

I want to convert the following nltk Tree representation into JSON format:

{ "label": "S", "children": [ { "label": "NP", "children": [ { "label": "DT", "value": "The" }, { "label": "NN", "value": "cat" } ] }, { "label": "VP", "children": [ { "label": "VBD", "value": "sat" }, { "label": "PP", "children": [ { "label": "IN", "value": "on" }, { "label": "NP", "children": [ { "label": "DT", "value": "the" }, { "label": "NN", "value": "mat" } ] } ] } ] } ] }

Desired output:

{
    "scores": {
        "filler": [
            [
                "scores"
            ],
            [
                "for"
            ]
        ],
        "extent": [
            "highest"
        ],
        "team": [
            "India"
        ]
    }
}

- Raj

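This is not valid JSON: there are two "team" names in the same object. A JSON object is an unordered collection of name/value pairs. Different JSON parsers may produce different results: a parser may keep only the first 'team' pair, or only the last one, or (unlikely) create a list ["India", "Pakistan"]. - jfs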

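See RFC 7159: "When the names within an object are not unique, the behavior of software that receives such an object is unpredictable. Many implementations report the last name/value pair only. Other implementations report an error or fail to parse the object, and some implementations report all of the name/value pairs, including duplicates." - jfs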
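To make the duplicate-name point concrete, here is a minimal sketch using Python's standard `json` module. The input string and the `reject_duplicates` helper are illustrative assumptions, not part of the original question; the sketch only shows how this one parser behaves, and other parsers may differ.

```python
import json

# Hypothetical input with a repeated "team" name in one object.
text = '{"team": ["India"], "team": ["Pakistan"]}'

# Python's json module silently keeps the last value for a repeated name.
print(json.loads(text))  # {'team': ['Pakistan']}


def reject_duplicates(pairs):
    """object_pairs_hook that raises on repeated names (illustrative helper)."""
    seen = {}
    for name, value in pairs:
        if name in seen:
            raise ValueError(f"duplicate name: {name!r}")
        seen[name] = value
    return seen


try:
    json.loads(text, object_pairs_hook=reject_duplicates)
except ValueError as err:
    print(err)  # duplicate name: 'team'
```

Passing `object_pairs_hook` is one way to surface the problem early instead of losing data silently.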

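The source tree contains duplicate names again ('filler', 'filler'). Why do you drop them from the output? - jfs

They were removed automatically while building the dictionary. The filler information is not needed in the output, so dropping it is not a problem. - Raj

I mean, it is not required for my use case. I think your answer is correct! - Raj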

4 Answers

2

This code converts the tree into a dictionary keyed by tree label, which you can then easily convert to JSON with json.dumps.

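    import nltk

    def tree_to_dict(tree):
        """Map each subtree label to the space-joined words directly under it."""
        tree_dict = dict()
        leaves = []
        for subtree in tree:
            if isinstance(subtree, nltk.Tree):
                # Recurse; a repeated label overwrites the earlier entry.
                tree_dict.update(tree_to_dict(subtree))
            else:
                # Leaves are (word, tag) pairs produced by the POS tagger.
                (expression, tag) = subtree
                leaves.append(expression)
        tree_dict[tree.label()] = " ".join(leaves)

        return tree_dict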

- Pranjal Gupta

As a reference for comparison, for the sentence “Tom Brady Plays for the Patriots.” this code outputs {'ORGANIZATION': 'Patriots', 'PERSON': 'Brady', 'S': 'plays for the .'}. - Alex Moore-Niemi

2

Convert the tree into a dictionary, then convert the dictionary to JSON.

import json

import nltk

def tree_to_dict(tree):
    tdict = {}
    for t in tree:
        # NLTK 3 replaced the old t.node attribute with t.label().
        if isinstance(t, nltk.Tree) and isinstance(t[0], nltk.Tree):
            tdict[t.label()] = tree_to_dict(t)
        elif isinstance(t, nltk.Tree):
            tdict[t.label()] = t[0]
    return tdict

def dict_to_json(d):
    return json.dumps(d)

# `tree` is the nltk.Tree to convert.
output_json = dict_to_json({tree.label(): tree_to_dict(tree)})

- Raj

2

Convert the tree to a dict and use json.dump(result_dict, sys.stdout, indent=2) instead of generating the JSON text manually. - jfs
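A short sketch of why the comment above prefers `json.dump` over hand-built text. The dictionary contents and the naive `%` formatting are made-up illustrations, not taken from the question; the point is only that the library handles escaping and indentation.

```python
import json
import sys

# Hypothetical data: a label containing quotes and a value with a backslash.
result_dict = {'say "hi"': ["back\\slash"]}

# Naive manual formatting leaves the inner quotes unescaped,
# so the resulting text is not valid JSON.
manual = '{"%s": ["%s"]}' % ('say "hi"', "back\\slash")
try:
    json.loads(manual)
    manual_is_valid = True
except json.JSONDecodeError:
    manual_is_valid = False

# json.dump escapes special characters and takes care of indentation.
json.dump(result_dict, sys.stdout, indent=2)
sys.stdout.write("\n")

# json.dumps returns the same text as a string instead of writing it.
text = json.dumps(result_dict, indent=2)
```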

Thanks. I will look at it again. - Raj

@J.F.Sebastian, how do I convert the tree to a dict? Which method should I use? - Raj

t.node has to be switched to t.label() now. For the sentence "Tom Brady plays for the Patriots." the output was: {'ORGANIZATION': ('Patriots', 'NNP'), 'PERSON': ('Brady', 'NNP')} - Alex Moore-Niemi

0

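A related alternative. For my purposes I did not need to preserve the exact tree structure; I wanted to extract the entities as keys, with the tokens as lists of values. For the sentence “Tom and Larry play for the Patriots.” I wanted the following JSON: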

{
  "PERSON": [
    "Tom",
    "Larry"
  ],
  "ORGANIZATION": [
    "Patriots"
  ]
}

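This preserves the order of the tokens (per entity type) without "stomping" on the value already set for an entity key. You can reuse the same json.dump code from the other answers to return this dictionary as JSON.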

import nltk  # needed for the nltk.Tree check below
from nltk import tag, chunk, tokenize

def prep(sentence):
    return chunk.ne_chunk(tag.pos_tag(tokenize.word_tokenize(sentence)))

t = prep("Tom and Larry play for the Patriots.")

def tree_to_dict(tree):
    tree_dict = dict()
    for st in tree:
        # not everything gets a NE tag,
        # so we can ignore untagged tokens
        # which are stored in tuples
        if isinstance(st, nltk.Tree):
            if st.label() in tree_dict:
                tree_dict[st.label()] = tree_dict[st.label()] + [st[0][0]]
            else:
                tree_dict[st.label()] = [st[0][0]]
    return tree_dict

print(tree_to_dict(t))
# {'PERSON': ['Tom', 'Larry'], 'ORGANIZATION': ['Patriots']}
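One possible refinement, offered as a sketch rather than as part of the answer: `st[0][0]` keeps only the first token of a chunk, so if a chunk spans several tokens the rest are dropped. The variant below joins all tokens of each chunk. The function name `entities_to_dict` and the hand-built sample tree are assumptions for illustration; the tree is constructed manually so the example does not depend on what `ne_chunk` returns for a given sentence.

```python
from collections import defaultdict

from nltk import Tree


def entities_to_dict(tree):
    """Collect entity chunks by label, joining multi-token chunks with spaces."""
    entities = defaultdict(list)
    for st in tree:
        # Untagged tokens are plain (word, tag) tuples and are skipped.
        if isinstance(st, Tree):
            words = [word for word, _tag in st.leaves()]
            entities[st.label()].append(" ".join(words))
    return dict(entities)


# Hand-built stand-in for a chunked sentence (assumed shape, not ne_chunk output).
sample = Tree('S', [
    Tree('PERSON', [('Tom', 'NNP'), ('Brady', 'NNP')]),
    ('plays', 'VBZ'), ('for', 'IN'), ('the', 'DT'),
    Tree('ORGANIZATION', [('Patriots', 'NNPS')]),
    ('.', '.'),
])

result = entities_to_dict(sample)
print(result)
# {'PERSON': ['Tom Brady'], 'ORGANIZATION': ['Patriots']}
```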

- Alex Moore-Niemi

- jfs · Accepted Answer

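It looks like the input tree may contain children with the same name. To support the general case, you could convert each Tree into a dictionary that maps its name to a list of its children: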

from nltk import Tree # $ pip install nltk

def tree2dict(tree):
    # NLTK 3 uses tree.label(); older releases exposed tree.node instead.
    return {tree.label(): [tree2dict(t) if isinstance(t, Tree) else t
                           for t in tree]}

Example:

import json
import sys

tree = Tree('scores',
            [Tree('extent', ['highest']),
             Tree('filler',
                  [Tree('filler', ['scores']),
                   Tree('filler', ['for'])]),
             Tree('team', ['India'])])
d = tree2dict(tree)
json.dump(d, sys.stdout, indent=2)

Output:

{
  "scores": [
    {
      "extent": [
        "highest"
      ]
    }, 
    {
      "filler": [
        {
          "filler": [
            "scores"
          ]
        }, 
        {
          "filler": [
            "for"
          ]
        }
      ]
    }, 
    {
      "team": [
        "India"
      ]
    }
  ]
}
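Because every node becomes a single-key dictionary holding a list of children, this layout keeps duplicate labels and child order, so the conversion can be reversed. The sketch below assumes NLTK 3 (`label()`); `dict2tree` is a hypothetical helper added here for illustration and is not part of the answer.

```python
import json

from nltk import Tree


def tree2dict(tree):
    """Map a Tree to {label: [children]}, recursing into subtrees."""
    return {tree.label(): [tree2dict(t) if isinstance(t, Tree) else t
                           for t in tree]}


def dict2tree(d):
    """Inverse of tree2dict (hypothetical helper): rebuild the Tree."""
    # Each dictionary holds exactly one label -> children pair.
    (label, children), = d.items()
    return Tree(label, [dict2tree(c) if isinstance(c, dict) else c
                        for c in children])


tree = Tree('scores',
            [Tree('extent', ['highest']),
             Tree('filler',
                  [Tree('filler', ['scores']),
                   Tree('filler', ['for'])]),
             Tree('team', ['India'])])

# Round trip: Tree -> dict -> JSON text -> dict -> Tree.
encoded = json.dumps(tree2dict(tree))
restored = dict2tree(json.loads(encoded))
print(restored == tree)  # True
```

The round trip works here because all leaves are strings; leaves that are tuples, such as (word, tag) pairs, would come back from JSON as lists.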