How to parse LD+JSON using Python

Question

How to parse LD+JSON using Python

11

I have been experimenting with some web scraping and found that this tag contains some interesting data:

<script type="application/ld+json">

I have managed to isolate that tag using Beautiful Soup.

html = urlopen(url)
soup = BeautifulSoup(html, "lxml")

p = soup.find('script', {'type':'application/ld+json'})
print p

But I can't use the data or pull anything out of that tag.

If I try to use a regular expression to grab some of its content, I get:

TypeError: expected string or buffer

How can I get the data out of that script tag and work with it like a dictionary or a string? I'm using Python 2.7, by the way.

- wessells

3 Answers

6

You should read the HTML first and then parse it.

html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
p = soup.find('script', {'type':'application/ld+json'})
print p.contents

- Pavan Kumar T S

I get an error from "html.read()". It says: Traceback (most recent call last): File "test.py", line 20, in <module> get_price() File "test.py", line 16, in get_price soup = BeautifulSoup(html, "html.read()") File "C:\PYTHON27\lib\site-packages\bs4\__init__.py", line 165, in __init__ % ",".join(features)) bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html.read(). Do you need to install a parser library? - wessells

You can use lxml instead if needed. - Pavan Kumar T S

@wessells If you only need the text inside it, use print p.find(text=True). - Pavan Kumar T S

0

The comments above didn't help (thanks, though).

In the end I used:

p = str(soup.find('script', {'type':'application/ld+json'}))

I cast it to a string. It isn't pretty, but it gets the job done. I know there is probably a better way, but this works for me.

- wessells
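
A note on why the cast helps and what it costs: find returns a Beautiful Soup Tag object rather than a string, which is most likely why the regular expression raised the TypeError above. str() does give a string, but it still includes the surrounding script tags. The sketch below compares the two, assuming Python 3 and a made-up one-line page (the Widget product is hypothetical, not from the question).

```python
import json

from bs4 import BeautifulSoup

# Hypothetical sample page; the real one would come from urlopen(url).read().
html = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Product", "name": "Widget"}'
    "</script></head></html>"
)

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("script", {"type": "application/ld+json"})

as_markup = str(tag)  # whole element, including <script ...> and </script>
as_text = tag.string  # only the JSON text between the tags

# The bare JSON text can go straight into json.loads.
data = json.loads(as_text)
print(as_markup)
print(data["name"])
```

With the markup string you would still have to strip the tags before parsing; the tag's text can be parsed directly.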

- Mark Chackerian · Accepted Answer

You should use json.loads to read the JSON and convert it into a dictionary.

import json

import requests
from bs4 import BeautifulSoup

def get_ld_json(url: str) -> dict:
    parser = "html.parser"
    req = requests.get(url)
    soup = BeautifulSoup(req.text, parser)
    return json.loads("".join(soup.find("script", {"type":"application/ld+json"}).contents))

The join / contents combination drops the script tags themselves, so only the JSON text inside them is passed to json.loads.
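
Pages often carry more than one ld+json block, and some of them may be empty or malformed. Here is a minimal sketch of a variant that collects every block it can parse. It works on an HTML string so it can be tried without a network request; the extract_ld_json name and the sample markup are illustrative assumptions, not part of the answer above.

```python
import json

from bs4 import BeautifulSoup


def extract_ld_json(html: str) -> list:
    """Return every parseable JSON-LD block found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for tag in soup.find_all("script", {"type": "application/ld+json"}):
        raw = tag.string  # JSON text inside the tag, or None if the tag is empty
        if not raw:
            continue
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError:
            # Skip blocks that are not valid JSON instead of failing outright.
            continue
    return results


# Hypothetical sample page: two valid blocks and one truncated block.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "Product", "name": "Widget"}</script>
<script type="application/ld+json">{"@type": "Broken",</script>
<script type="application/ld+json">{"@type": "Organization", "name": "Acme"}</script>
</head><body></body></html>
"""

blocks = extract_ld_json(SAMPLE_HTML)
print(blocks)
```

Each entry is whatever json.loads produced, usually a dict but sometimes a list, so check the type before indexing by key.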