How do I extract JSON from a script tag using Beautiful Soup?

7
I want to extract reviewCount from a script tag using Beautiful Soup. I have tried several approaches without success.
<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
</script>

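You say you tried several approaches without success. Can you share those attempts? From the tag you posted, it looks like all you need to do is get the tag's contents and parse the result. If you are having trouble extracting the element's contents, this duplicates the question on extracting the contents of a tag with BeautifulSoup. If the problem is parsing the JSON, it duplicates "How to parse JSON in Python?". - AMC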
3 Answers

9

This should work, though I'm sure there is a more elegant solution:

import json
from bs4 import BeautifulSoup

html = '''
<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
</script>
'''

soup = BeautifulSoup(html, 'html.parser')
res = soup.find('script')
json_object = json.loads(res.contents[0])

for language in json_object['languages']:
    print('{}: {}'.format(language['displayName'], language['reviewCount']))

Output:

Toutes les langues: 573
français: 567
English: 6

Thanks, James. I tried the approach you describe above. My main problem is getting the reviewCount number. - free_123
TypeError: object of type 'Response' has no len(). - GGEv
json.loads(res.text) also worked for me. - AK91
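
The TypeError in the comment above typically appears when a `requests` `Response` object is passed to `BeautifulSoup` directly instead of its `.text`. Below is a minimal sketch, assuming the page HTML is already available as a string. The helper name `extract_review_counts` is hypothetical and does not come from the answers. It also selects the script tag by its `data-initial-state` attribute and converts the counts to integers.

```python
import json
from bs4 import BeautifulSoup


def extract_review_counts(html_text):
    """Return {displayName: reviewCount as int} from the review-filter script tag.

    html_text must be a str (or bytes), for example response.text when the
    page was fetched with requests; passing the Response object itself fails.
    """
    soup = BeautifulSoup(html_text, "html.parser")
    tag = soup.find("script", attrs={"data-initial-state": "review-filter"})
    if tag is None or not tag.string:
        # The tag is missing or empty, so there is nothing to parse.
        return {}
    data = json.loads(tag.string)
    return {
        lang["displayName"]: int(lang["reviewCount"])
        for lang in data.get("languages", [])
    }


html = '''
<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
</script>
'''

print(extract_review_counts(html))
```

Selecting by attribute rather than taking the first `script` tag is a design choice that should hold up better on a real page, where several script tags usually appear.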

3

Import json, load the script tag's contents with it, then iterate to get every reviewCount:

import json
from bs4 import BeautifulSoup  # this import was missing from the original snippet

html = '''<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
</script>'''

soup = BeautifulSoup(html, "html.parser")
# Select the script tag by its data attribute and take its text content
raw_json = soup.select_one('script[data-initial-state="review-filter"]').text
jsondata = json.loads(raw_json)
for language in jsondata['languages']:
    print(language['reviewCount'])

Output:

573
567
6
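
If only one number is needed, such as the overall count, the parsed list can be indexed by `isoCode`. This is a minimal sketch under the assumption that the entry with `isoCode` equal to `"all"` holds the total, as the sample data suggests. The names `counts_by_code` and `total_reviews` are illustrative only.

```python
import json
from bs4 import BeautifulSoup

html = '''<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
</script>'''

soup = BeautifulSoup(html, "html.parser")
script = soup.select_one('script[data-initial-state="review-filter"]')
data = json.loads(script.string)

# Map each isoCode to its count; the JSON stores counts as strings, so convert them.
counts_by_code = {
    entry["isoCode"]: int(entry["reviewCount"]) for entry in data["languages"]
}

# Assumption: the "all" entry is the total across languages.
total_reviews = counts_by_code["all"]
print(total_reviews)
```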

2

import re

html = '''<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
</script>'''


match = [item.group(1) for item in re.finditer('reviewCount":"(.+?)"', html)]

print(match)

Output:

['573', '567', '6']
