Scraping href links

I'm trying to find a specific link on this page using the right keywords. So far I have:
from bs4 import BeautifulSoup
import requests
import random
url = 'http://www.thenextdoor.fr/en/4_adidas-originals'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
raw = soup.findAll('a', {'class':'add_to_compare'})
links = raw['href']
keyword1 = 'adidas'
keyword2 = 'thenextdoor'
keyword3 = 'uncaged'
for link in links:
    text = link.text
    if keyword1 in text and keyword2 in text and keyword3 in text:

The link I'm trying to extract is: this link
2 Answers

You can use all() to check whether every keyword is present, and any() to check whether at least one is present.
from bs4 import BeautifulSoup
import requests

res = requests.get("http://www.thenextdoor.fr/en/4_adidas-originals").content
soup = BeautifulSoup(res, 'lxml')

atags = soup.find_all('a', {'class': 'add_to_compare'})
links = [atag['href'] for atag in atags]
keywords = ['adidas', 'thenextdoor', 'Uncaged']

for link in links:
    if all(keyword in link for keyword in keywords):
        print(link)

Output:

http://www.thenextdoor.fr/en/clothing/2042-adidas-originals-Ultraboost-Uncaged-2303002052017.html
http://www.thenextdoor.fr/en/clothing/2042-adidas-originals-Ultraboost-Uncaged-2303002052017.html
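The difference between all() and any() can be seen without any scraping. A minimal sketch, using two hypothetical hrefs in place of the scraped links:

```python
# Sample hrefs standing in for the scraped links (hypothetical URLs)
links = [
    'http://www.thenextdoor.fr/en/clothing/2042-adidas-originals-Ultraboost-Uncaged.html',
    'http://www.thenextdoor.fr/en/clothing/1001-adidas-originals-Stan-Smith.html',
]
keywords = ['adidas', 'thenextdoor', 'Uncaged']

# all(): every keyword must appear in the link
matches_all = [link for link in links if all(k in link for k in keywords)]

# any(): at least one keyword is enough
matches_any = [link for link in links if any(k in link for k in keywords)]

print(len(matches_all))  # 1 -- only the Uncaged link contains all three
print(len(matches_any))  # 2 -- both links contain 'adidas'
```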


Alternatively, you can do it in one go by passing a function as the href attribute value to find_all():

keywords = ['adidas', 'thenextdoor', 'Uncaged']
links = soup.find_all('a',
                      class_='add_to_compare',
                      href=lambda href: all(keyword in href for keyword in keywords))
for link in links:
    print(link["href"])
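One thing to watch with this approach: if a matching tag has no href attribute, BeautifulSoup passes None to the function, and `keyword in None` raises a TypeError. A short guard in the lambda avoids this. A sketch against inline HTML (made-up markup, not the live page):

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the live page; the second tag has no href
html = '''
<a class="add_to_compare" href="/adidas-thenextdoor-Uncaged.html">ok</a>
<a class="add_to_compare">no href</a>
'''
soup = BeautifulSoup(html, 'html.parser')
keywords = ['adidas', 'thenextdoor', 'Uncaged']

# 'href and ...' short-circuits when href is None, so the check is safe
links = soup.find_all(
    'a',
    class_='add_to_compare',
    href=lambda href: href and all(k in href for k in keywords))

for link in links:
    print(link['href'])  # only the tag whose href contains all three keywords
```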

Content provided by Stack Overflow; see the link above for the original English post.