Authenticated LinkedIn crawling with Scrapy


I've read through "Crawling with an authenticated session in Scrapy" and I'm getting hung up. I'm 99% sure my parsing code is correct, but I don't believe the login is actually redirecting and succeeding.

I'm also having a problem with check_login_response(): I'm not sure which page it is actually checking... though "Sign Out" would make sense.




====== UPDATE ======

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from linkedpy.items import LinkedPyItem

class LinkedPySpider(InitSpider):
    name = 'LinkedPy'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/csearch/results?type=companies&keywords=&pplSearchOrigin=GLHD&pageKey=member-home&search=Search#facets=pplSearchOrigin%3DFCTD%26keywords%3D%26search%3DSubmit%26facet_CS%3DC%26facet_I%3D80%26openFacets%3DJO%252CN%252CCS%252CNFR%252CF%252CCCR%252CI"]

    def init_request(self):
        #"""This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        #"""Generate a login request."""
        return FormRequest.from_response(response,
                    formdata={'session_key': 'user@email.com', 'session_password': 'somepassword'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        #"""Check the response returned by a login request to see if we aresuccessfully logged in."""
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..

            return self.initialized() # ****THIS LINE FIXED THE LAST PROBLEM*****

        else:
            self.log("\n\n\nFailed, Bad times :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ol[@id=\'result-set\']/li')
        items = []
        for site in sites:
            item = LinkedPyItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            items.append(item)
        return items

The problem was solved by adding 'return' before self.initialized().
Thanks! -Mark

What happens when you run the code above? - Acorn
'request_depth_max': 1, 'scheduler/memory_enqueued': 3, 'start_time': datetime.datetime(2012, 6, 8, 18, 31, 48, 252601)} 2012-06-08 14:31:49-0400 [LinkedPy] INFO: Spider closed (finished) 2012-06-08 14:31:49-0400 [scrapy] INFO: Dumping global stats:{} - Gates
That kind of information should go in your original question, not in the comments. - Acorn
@Acorn I'll update my post above now and we'll see if we can figure out what's going on. - Gates
@Gates where did you get that linkedpy library from? - Vipul
1 Answer

class LinkedPySpider(BaseSpider):

should be:

class LinkedPySpider(InitSpider):

Also, you shouldn't override the parse function, as I mentioned in my answer here: https://dev59.com/Y2025IYBdhLWcg3wsIMU#5857202/crawling-with-an-authenticated-session-in-scrapy. If you don't understand how to define rules for extracting links, read the documentation carefully:
http://readthedocs.org/docs/scrapy/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
http://readthedocs.org/docs/scrapy/en/latest/topics/link-extractors.html#topics-link-extractors
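For what it's worth, here is a minimal sketch of how the rule-based approach from the linked answer might be wired into the spider above. The allow= pattern and the parse_item name are illustrative assumptions, not values from the original post; the pattern would have to match the LinkedIn result URLs you actually want to follow:

from scrapy.contrib.spiders.init import InitSpider
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from linkedpy.items import LinkedPyItem

class LinkedPySpider(InitSpider):
    name = 'LinkedPy'
    allowed_domains = ['linkedin.com']
    # login_page, start_urls and the init_request/login/check_login_response
    # methods stay exactly as in the spider above.

    rules = (
        # Follow links whose URLs match allow= and hand each followed page to
        # parse_item(); the pattern below is a placeholder, not LinkedIn's
        # real URL scheme.
        Rule(SgmlLinkExtractor(allow=r'/companies/'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Same extraction as the original parse(), under the callback name
        # referenced by the Rule above.
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//ol[@id='result-set']/li")
        items = []
        for site in sites:
            item = LinkedPyItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            items.append(item)
        return items

In the rule-driven spiders described in the linked docs, the framework uses parse() internally to drive the rules, which is why the answer warns against overriding it; item extraction belongs in the callback named by the Rule instead.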

That definitely helped. I'm seeing the successful-login log message. However, I'm not sure def parse(self, response): is actually being run. I tried putting a self.log() inside it, but nothing came back. - Gates
It looks like parse() should be changed to parse_item(). - Gates
Quite possibly the problem is related to the above and to allow=r'-\w+.html$', because I don't know what that is. - Gates
