How do I extract JSON data from an HTML comment tag with BeautifulSoup?

I want to use BeautifulSoup to extract the JSON content that sits inside an HTML comment in a script tag.
<script data_id ="dfsfre2323" data_key="23424sfsfsfdafd" type="application/json"><!--
{"employee": {"name":"sonoo", "salary":56000, "married":true}}--></script>

The output should look like this:
Name: sonoo
Salary: 56000
Married: True

I tried the following:

from bs4 import BeautifulSoup, Comment
import json
soup = BeautifulSoup(webpage, "html.parser")
data = soup.find("script", {"type": "application/json", "data_id": "dfsfre2323", "data_key": "23424sfsfsfdafd"})
comment = soup.find(text=lambda text: isinstance(data, Comment))

comment comes back empty; nothing is found.

Thanks in advance for any help!

1 Answer

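BeautifulSoup does not parse the content inside a <script> tag; it keeps it as raw text, so your .find(text=...) will not find any Comment. Parse the script's string with BeautifulSoup first, then call .find() on the result: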

import json
from bs4 import BeautifulSoup, Comment


txt = '''
<script data_id ="dfsfre2323" data_key="23424sfsfsfdafd" type="application/json"><!--
    {"employee": {"name":"sonoo", "salary":56000, "married":true}}
--></script>'''

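soup = BeautifulSoup(txt, "html.parser")
# Locate the <script> tag by its attributes.
data = soup.find("script", {"type":"application/json", 'data_id':"dfsfre2323", 'data_key':"23424sfsfsfdafd"})
# data.string is the raw script text; re-parse it so the <!-- ... --> wrapper becomes a Comment node.
comment = BeautifulSoup(data.string, "html.parser").find(text=lambda t: isinstance(t, Comment))

# A Comment is a str subclass, so it can be passed straight to json.loads.
data = json.loads(comment)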

print(json.dumps(data, indent=4))

Output:

{
    "employee": {
        "name": "sonoo",
        "salary": 56000,
        "married": true
    }
}
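
The question asked for "Name: ... / Salary: ... / Married: ..." lines rather than a JSON dump. The following is a minimal sketch of one way to get there with the same re-parsing approach. The helper name extract_commented_json is hypothetical (it is not part of BeautifulSoup), and the code assumes a bs4 version that accepts the string= argument, the newer name for text=.

```python
import json
from bs4 import BeautifulSoup, Comment

html = '''
<script data_id="dfsfre2323" data_key="23424sfsfsfdafd" type="application/json"><!--
    {"employee": {"name":"sonoo", "salary":56000, "married":true}}
--></script>'''


def extract_commented_json(markup, attrs):
    """Hypothetical helper: return the JSON wrapped in an HTML comment
    inside the first <script> tag matching attrs, or None if absent."""
    soup = BeautifulSoup(markup, "html.parser")
    script = soup.find("script", attrs)
    if script is None or script.string is None:
        return None
    # The script body is raw text, so parse it again to get a Comment node.
    inner = BeautifulSoup(script.string, "html.parser")
    comment = inner.find(string=lambda s: isinstance(s, Comment))
    return json.loads(comment) if comment is not None else None


data = extract_commented_json(
    html, {"type": "application/json", "data_id": "dfsfre2323"}
)
employee = data["employee"]
# json.loads turns JSON true into Python True, which prints as "True".
lines = [f"{key.capitalize()}: {value}" for key, value in employee.items()]
print("\n".join(lines))
```

Returning None when the tag or comment is missing keeps the helper from raising AttributeError on pages that lack the script block; callers would need to check for None before indexing.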

Content provided by Stack Overflow.