How to extract data from HTML using Beautiful Soup

Question

4

I am trying to scrape a web page and store the results in a CSV/Excel file. I am using Beautiful Soup for this.

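I am trying to extract data from the soup with the find_all function, but I am not sure how to capture the data held in the field name or in the title attribute.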

The HTML has the following format.

<h3 class="font20">
 <span itemprop="position">36.</span> 
 <a class="font20 c_name_head weight700 detail_page" 
 href="/companies/view/1033/nimblechapps-pvt-ltd" target="_blank" 
 title="Nimblechapps Pvt. Ltd."> 
     <span itemprop="name">Nimblechapps Pvt. Ltd. </span>
</a> </h3>

This is my code so far. I am not sure how to proceed from here.

from bs4 import BeautifulSoup as BS
import requests 
page = 'https://www.goodfirms.co/directory/platform/app-development/iphone?page=2'
res = requests.get(page)
cont = BS(res.content, "html.parser")
names = cont.find_all(class_ = 'font20 c_name_head weight700 detail_page')
names = cont.find_all('a' , attrs = {'class':'font20 c_name_head weight700 detail_page'})

I tried the following:

Input: cont.h3.a.span
Output: <span itemprop="name">Nimblechapps Pvt. Ltd.</span>

I want to extract the company name: "Nimblechapps Pvt. Ltd."

- Keshav c

2

Please post the code you have tried and say specifically where the problem is. - Scott Hunter

@ScottHunter Done! Please check the edited version of the question. - Keshav c

1

Do you want cont.h3.a.span.text? - drec4s

@drec4s Yes, I want cont.h3.a.span.text, but I need it to work for every listing on the page! I cannot get it to return a list. - Keshav c

1

Simple: select the text of each element, e.g. for tag in cont.find_all("span", itemprop="name"): print(tag.text) - t.m.adam


3 Answers

1

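You can get the same result with a CSS selector: the descendant combinator " " joins the type selector a with the attribute-value selector [itemprop="name"], so it matches every element with itemprop="name" nested inside an a tag.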

names = [item.text for item in cont.select('a [itemprop="name"]')]
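
To make the selector concrete, here is a minimal sketch that runs it against an inline copy of the markup from the question instead of the live page. The second listing ("Example Co.") is invented for illustration, and get_text(strip=True) is used in place of .text only to trim the trailing whitespace inside the span.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the snippet in the question.
# The second entry ("Example Co.") is hypothetical, added to show multiple matches.
html = """
<h3 class="font20">
 <span itemprop="position">36.</span>
 <a class="font20 c_name_head weight700 detail_page"
    href="/companies/view/1033/nimblechapps-pvt-ltd" title="Nimblechapps Pvt. Ltd.">
     <span itemprop="name">Nimblechapps Pvt. Ltd. </span>
 </a>
</h3>
<h3 class="font20">
 <span itemprop="position">37.</span>
 <a class="font20 c_name_head weight700 detail_page"
    href="/companies/view/0000/example-co" title="Example Co.">
     <span itemprop="name">Example Co. </span>
 </a>
</h3>
"""

soup = BeautifulSoup(html, "html.parser")

# 'a [itemprop="name"]' matches elements with itemprop="name" that sit inside an <a>.
# The itemprop="position" spans are skipped: wrong attribute value, and not inside an <a>.
names = [item.get_text(strip=True) for item in soup.select('a [itemprop="name"]')]
print(names)
```

Because select() always returns a list, this gives one name per listing rather than only the first match that cont.h3.a.span reaches.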

- QHarr

1

Try not to use compound classes in your script, as they are error-prone. The following script also fetches the desired content.

import requests
from bs4 import BeautifulSoup

link = "https://www.goodfirms.co/directory/platform/app-development/iphone?page=2"

res = requests.get(link)
soup = BeautifulSoup(res.text, 'html.parser')
for items in soup.find_all(class_="commoncompanydetail"):
    names = items.find(class_='detail_page').text
    print(names)
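
A small sketch of why the compound class string is fragile, assuming the behavior described in the Beautiful Soup documentation: a multi-class string passed to class_ is compared against the exact value of the class attribute, so a different class order finds nothing, while a single class name matches regardless of what other classes are present. The inline HTML is a trimmed copy of the tag from the question.

```python
from bs4 import BeautifulSoup

# One anchor tag carrying the four classes shown in the question.
html = '<a class="font20 c_name_head weight700 detail_page" href="#">Nimblechapps Pvt. Ltd.</a>'
soup = BeautifulSoup(html, "html.parser")

# Exact attribute string: matches.
exact = soup.find_all(class_="font20 c_name_head weight700 detail_page")

# Same classes in a different order: no match, because the string is compared as a whole.
reordered = soup.find_all(class_="detail_page font20 c_name_head weight700")

# A single class name: matches whatever other classes the tag has.
single = soup.find_all(class_="detail_page")

print(len(exact), len(reordered), len(single))
```

If the site adds, removes, or reorders a class, the exact-string lookup silently returns an empty list, whereas the single-class lookup keeps working.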

- SIM


- drec4s · Accepted Answer

You can do this with a list comprehension:

from bs4 import BeautifulSoup as BS
import requests

page = 'https://www.goodfirms.co/directory/platform/app-development/iphone?page=2'
res = requests.get(page)
cont = BS(res.content, "html.parser")
names = cont.find_all('a' , attrs = {'class':'font20 c_name_head weight700 detail_page'})
print([n.text for n in names])

You will get:

['Nimblechapps Pvt. Ltd.', (..) , 'InnoApps Technologies Pvt. Ltd', 'Umbrella IT', 'iQlance Solutions', 'getyoteam', 'JetRuby Agency LTD.', 'ONLINICO', 'Dedicated Developers', 'Appingine', 'webnexs']
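
Since the question also asks about saving the results to a CSV file, here is a rough sketch of that last step using the standard csv module. It parses an inline copy of the question's markup rather than fetching the page; the helper write_csv and the column names "name" and "url" are hypothetical choices for illustration, not part of any answer above.

```python
import csv
import io

from bs4 import BeautifulSoup

# Inline copy of the markup from the question, used instead of a live request.
html = """
<h3 class="font20">
 <span itemprop="position">36.</span>
 <a class="font20 c_name_head weight700 detail_page"
    href="/companies/view/1033/nimblechapps-pvt-ltd" title="Nimblechapps Pvt. Ltd.">
     <span itemprop="name">Nimblechapps Pvt. Ltd. </span>
 </a>
</h3>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect one row per company link; strip=True trims the whitespace around the name.
rows = [
    {"name": link.get_text(strip=True), "url": link.get("href", "")}
    for link in soup.find_all("a", class_="detail_page")
]


def write_csv(rows, fileobj):
    """Write the rows to any file-like object (hypothetical helper)."""
    writer = csv.DictWriter(fileobj, fieldnames=["name", "url"])
    writer.writeheader()
    writer.writerows(rows)


# Written to an in-memory buffer here so the sketch has no side effects.
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

To write a real file, the same helper can be called inside with open("companies.csv", "w", newline="", encoding="utf-8") as f: write_csv(rows, f), where the file name is again just an example.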