Specifically, when I use
urllib2.urlopen(request)
to read the page content, it doesn't show anything that would be added by JavaScript code, because that code never gets executed anywhere. Normally it would be run by a web browser, but that is not part of my program. How can I access this dynamic content from within my Python code?
See also Can scrapy be used to scrape dynamic content from websites that are using AJAX? for Scrapy-specific answers.
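For reference, a minimal sketch of the kind of plain fetch the question describes, using urllib.request (the Python 3 successor of urllib2) and a placeholder URL; it returns only the initial HTML, before any JavaScript has run:
from urllib.request import Request, urlopen

# Plain HTTP fetch: returns the server's HTML exactly as sent.
# Anything a browser would later add via JavaScript is not present here.
request = Request("http://example.com/page-with-js")  # placeholder URL
html = urlopen(request).read().decode("utf-8")
print(html)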
Edit Sept 2021: phantomjs is no longer maintained, either.
Edit 30/Dec/2017: This answer appears in the top results of Google searches, so I decided to update it. The old answer is still at the end.
dryscrape is no longer maintained, and the library the dryscrape developers recommend is Python 2 only. I have found using Selenium's Python library with PhantomJS as the web driver fast enough and easy enough to get the work done.
Having installed PhantomJS, make sure the phantomjs binary is available in the current path:
phantomjs --version
# result:
2.1.1
#Example To give an example, I created a sample page with the following HTML code (link):
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Javascript scraping test</title>
</head>
<body>
<p id='intro-text'>No javascript support</p>
<script>
document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
</script>
</body>
</html>
Without JavaScript it says: No javascript support and with JavaScript: Yay! Supports javascript
#Scraping without JS support:
import requests
from bs4 import BeautifulSoup
response = requests.get(my_url)
soup = BeautifulSoup(response.text)
soup.find(id="intro-text")
# Result:
<p id="intro-text">No javascript support</p>
#Scraping with JS support:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get(my_url)
p_element = driver.find_element_by_id(id_='intro-text')
print(p_element.text)
# result:
'Yay! Supports javascript'
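Since PhantomJS is no longer maintained (see the Sept 2021 note above), a rough equivalent using headless Chrome and the current Selenium API might look like this (my_url as above; a working ChromeDriver install is assumed):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Run Chrome without opening a visible browser window.
options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
driver.get(my_url)
p_element = driver.find_element(By.ID, "intro-text")
print(p_element.text)  # 'Yay! Supports javascript'
driver.quit()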
You can also use the Python library dryscrape to scrape javascript driven websites.
#Scraping with JS support:
import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response)
soup.find(id="intro-text")
# Result:
<p id="intro-text">Yay! Supports javascript</p>
Selenium support for PhantomJS has been deprecated, use headless versions of Chrome or Firefox instead. Maybe @sytech was talking about Selenium support for it? - jpmc26

Since any content generated by JavaScript needs to be rendered on the DOM, we fail to get the proper results. When we fetch an HTML page, we fetch the initial DOM, unmodified by JavaScript.
Hence, we need to render the JavaScript content before crawling the page.
Since Selenium has already been mentioned many times in this thread (and how slow it can get sometimes has been mentioned too), I will list two other possible solutions.
Solution 1: This is a very nice tutorial on how to use Scrapy to crawl JavaScript generated content, and we are going to follow just that.
What we will need:
Docker installed on our machine. This is a plus over the other solutions up to this point, as it utilizes an OS-independent platform.
Install Splash following the instructions listed for our corresponding OS.
Quoting from splash documentation:
Splash is a javascript rendering service. It’s a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and QT5.
Essentially we are going to use Splash to render Javascript generated content.
Run the splash server: sudo docker run -p 8050:8050 scrapinghub/splash.
Install the scrapy-splash plugin: pip install scrapy-splash
Assuming that we already have a Scrapy project created (if not, let's make one), we will follow the guide and update settings.py:
Then go to your scrapy project’s
settings.py
and set these middlewares:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
The URL of the Splash server (if you're using Win or OSX this should be the URL of the docker machine: How to get a Docker container's IP address from the host?):
SPLASH_URL = 'http://localhost:8050'
And finally you need to set these values too:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
Finally, we can use a SplashRequest
:
In a normal spider you have Request objects which you can use to open URLs. If the page you want to open contains JS generated data you have to use SplashRequest (or SplashFormRequest) to render the page. Here's a simple example:
class MySpider(scrapy.Spider):
    name = "jsscraper"
    start_urls = ["http://quotes.toscrape.com/js/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint='render.html'
            )

    def parse(self, response):
        for q in response.css("div.quote"):
            quote = QuoteItem()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            yield quote
SplashRequest renders the URL as html and returns the response which you can use in the callback (parse) method.
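Not part of the tutorial, but as a quick sanity check that Splash itself is up and rendering, you can call its render.html HTTP endpoint directly (assuming Splash is listening on localhost:8050 as started above):
import requests

# Ask Splash to load the page, run its JavaScript, and return the final HTML.
resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "http://quotes.toscrape.com/js/", "wait": 0.5},
)
print(resp.text[:500])  # rendered HTML, including the JS-generated quotes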
Solution 2: requests-html. This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.
Install requests-html: pipenv install requests-html
Make a request to the page's url:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get(a_page_url)
Render the response to get the Javascript generated bits:
r.html.render()
Finally, the module appears to offer web scraping capabilities as well.
Alternatively, we can try the usual BeautifulSoup approach on the r.html object we just rendered.
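A minimal sketch of that combination (assuming the session/render code above has already run):
from bs4 import BeautifulSoup

# r.html.html holds the JavaScript-rendered markup as a string,
# so it can be fed straight into BeautifulSoup.
soup = BeautifulSoup(r.html.html, "html.parser")
print(soup.title)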
Does it also render the JavaScript injected into all of the page's iframes into the r.html.html object? - fIwJlxSzApHEZIl

Maybe Selenium can do it.
from selenium import webdriver
import time
driver = webdriver.Firefox()
driver.get(url)
time.sleep(5)
htmlSource = driver.page_source
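From there, the captured page source can be handed to any HTML parser; for example (a sketch, with BeautifulSoup as an assumed choice):
from bs4 import BeautifulSoup

# Parse the JavaScript-rendered source that Selenium captured above.
soup = BeautifulSoup(htmlSource, "html.parser")
print(soup.title.string)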
If you have used the Requests module for Python before, I recently found out that the developer created a new module called Requests-HTML which now also has the ability to render JavaScript. Once you install the Requests-HTML module, the following example (shown at the link above) demonstrates how you can use the module to scrape a website and render the JavaScript contained in it:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://python-requests.org/')
r.html.render()
r.html.search('Python 2 will retire in only {months} months!')['months']
'<time>25</time>' #This is the result.
I recently learned about this from a YouTube video. Click here! to watch the YouTube video, which demonstrates how the module works.
This also seems to be a nice solution, taken from a great blog post
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html
# Take this class for granted. Just use the result of rendering.
class Render(QWebPage):
def __init__(self, url):
self.app = QApplication(sys.argv)
QWebPage.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.mainFrame().load(QUrl(url))
self.app.exec_()
def _loadFinished(self, result):
self.frame = self.mainFrame()
self.app.quit()
url = 'http://pycoders.com/archive/'
r = Render(url)
result = r.frame.toHtml()
# This step is important. Converting QString to ASCII for lxml to process
# The following returns an lxml element tree
archive_links = html.fromstring(str(result.toAscii()))
print archive_links
# The following returns an array containing the URLs
raw_links = archive_links.xpath('//div[@class="campaign"]/a/@href')
print raw_links
QtWebKit has been deprecated, use QtWebEngineWidgets instead. - cards
Selenium is the best for scraping JS and Ajax content.
Check out this article for extracting data from the web with Python: https://likegeeks.com/python-web-scraping/
$ pip install selenium
然后下载Chrome浏览器的驱动程序。
from selenium import webdriver
browser = webdriver.Chrome()
browser.get("https://www.python.org/")
nav = browser.find_element_by_id("mainnav")
print(nav.text)
Easy, right?
You can also use webdriver to execute JavaScript:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get(url)
driver.execute_script('document.title')
Or store the value in a variable:
result = driver.execute_script('var text = document.title ; return text')
Or you can just use the driver.title property. - Corey Goldberg

I personally prefer using Scrapy and Selenium and dockerizing both in separate containers. This way you can install both with minimal hassle and crawl modern websites, which almost all contain JavaScript in one form or another. Here's an example:
Create your scraper with scrapy startproject and write your spider; the skeleton can be as simple as this:
import scrapy
class MySpider(scrapy.Spider):
name = 'my_spider'
start_urls = ['https://somewhere.com']
def start_requests(self):
yield scrapy.Request(url=self.start_urls[0])
def parse(self, response):
# do stuff with results, scrape items etc.
# now we're just checking everything worked
print(response.body)
The real magic happens in middlewares.py. Overwrite two methods in the downloader middleware, __init__ and process_request, in the following way:
# import some additional modules that we need
import os
from copy import deepcopy
from time import sleep
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
class SampleProjectDownloaderMiddleware(object):
def __init__(self):
SELENIUM_LOCATION = os.environ.get('SELENIUM_LOCATION', 'NOT_HERE')
SELENIUM_URL = f'http://{SELENIUM_LOCATION}:4444/wd/hub'
chrome_options = webdriver.ChromeOptions()
# chrome_options.add_experimental_option("mobileEmulation", mobile_emulation)
self.driver = webdriver.Remote(command_executor=SELENIUM_URL,
desired_capabilities=chrome_options.to_capabilities())
def process_request(self, request, spider):
self.driver.get(request.url)
# sleep a bit so the page has time to load
# or monitor items on page to continue as soon as page ready
sleep(4)
# if you need to manipulate the page content like clicking and scrolling, you do it here
# self.driver.find_element_by_css_selector('.my-class').click()
# you only need the now properly and completely rendered html from your page to get results
body = deepcopy(self.driver.page_source)
# copy the current url in case of redirects
url = deepcopy(self.driver.current_url)
return HtmlResponse(url, body=body, encoding='utf-8', request=request)
Don't forget to enable this middleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'sample_project.middlewares.SampleProjectDownloaderMiddleware': 543,
}
Dockerfile
# Use an official Python runtime as a parent image
FROM python:3.6-alpine
# install some packages necessary to scrapy and then curl because it's handy for debugging
RUN apk --update add linux-headers libffi-dev openssl-dev build-base libxslt-dev libxml2-dev curl python-dev
WORKDIR /my_scraper
ADD requirements.txt /my_scraper/
RUN pip install -r requirements.txt
ADD . /my_scraper
And finally, pull it all together in docker-compose.yaml:
version: '2'
services:
selenium:
image: selenium/standalone-chrome
ports:
- "4444:4444"
shm_size: 1G
my_scraper:
build: .
depends_on:
- "selenium"
environment:
- SELENIUM_LOCATION=samplecrawler_selenium_1
volumes:
- .:/my_scraper
# use this command to keep the container running
command: tail -f /dev/null
Run docker-compose up -d. If you're doing this for the first time it will take a while to fetch the latest selenium/standalone-chrome and build your scraper image as well.

Once it's done, check with docker ps that the containers are running and that the name of the Selenium container matches the environment variable we passed to our scraper container (here, it was SELENIUM_LOCATION=samplecrawler_selenium_1).

Enter your scraper container with docker exec -ti YOUR_CONTAINER_NAME sh; the command for me was docker exec -ti samplecrawler_my_scraper_1 sh. Then cd into the right directory and run your scraper with scrapy crawl my_spider.

A mix of BeautifulSoup and Selenium works very well for me.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup as bs
driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement")))  # waits up to 10 seconds until the element is located. Can have other wait conditions such as visibility_of_element_located or text_to_be_present_in_element

    html = driver.page_source
    soup = bs(html, "lxml")
    dynamic_text = soup.find_all("p", {"class": "class_name"})  # or other attributes, optional
except TimeoutException:
    print("Couldn't locate element")
TIP: You can find more wait conditions here.
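For instance, a brief sketch of two other conditions, reusing the driver and placeholder element id from above:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until the element is actually visible, not just attached to the DOM...
WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.ID, "myDynamicElement"))
)
# ...or until specific text has been rendered into it.
WebDriverWait(driver, 10).until(
    EC.text_to_be_present_in_element((By.ID, "myDynamicElement"), "loaded")
)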