Authenticated LinkedIn crawling with Scrapy


I've read through "Crawling with an authenticated session in Scrapy" and I'm getting hung up. I'm 99% sure my parsing code is correct, but I don't believe the login is actually redirecting and succeeding.

I'm also having a problem with check_login_response(): I'm not sure which page it is actually checking... though "Sign Out" would make sense.




====== UPDATE ======

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from linkedpy.items import LinkedPyItem

class LinkedPySpider(InitSpider):
    name = 'LinkedPy'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/csearch/results?type=companies&keywords=&pplSearchOrigin=GLHD&pageKey=member-home&search=Search#facets=pplSearchOrigin%3DFCTD%26keywords%3D%26search%3DSubmit%26facet_CS%3DC%26facet_I%3D80%26openFacets%3DJO%252CN%252CCS%252CNFR%252CF%252CCCR%252CI"]

    def init_request(self):
        #"""This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        #"""Generate a login request."""
        return FormRequest.from_response(response,
                    formdata={'session_key': 'user@email.com', 'session_password': 'somepassword'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        #"""Check the response returned by a login request to see if we aresuccessfully logged in."""
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..

            return self.initialized() # ****THIS LINE FIXED THE LAST PROBLEM*****

        else:
            self.log("\n\n\nFailed, Bad times :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ol[@id=\'result-set\']/li')
        items = []
        for site in sites:
            item = LinkedPyItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            items.append(item)
        return items

The problem was solved by adding 'return' before self.initialized().
Thanks! -Mark

What happens when you run the code above? - Acorn
'request_depth_max': 1, 'scheduler/memory_enqueued': 3, 'start_time': datetime.datetime(2012, 6, 8, 18, 31, 48, 252601)} 2012-06-08 14:31:49-0400 [LinkedPy] INFO: Spider closed (finished) 2012-06-08 14:31:49-0400 [scrapy] INFO: Dumping global stats:{} - Gates
That kind of information should go in your original question, not in the comments. - Acorn
@Acorn I'll update my post above now and we'll see if we can figure out what's going on. - Gates
@Gates where did you get that linkedpy library from? - Vipul
1 Answer

class LinkedPySpider(BaseSpider):

should be:

class LinkedPySpider(InitSpider):

Also, you shouldn't override the parse function, as I mentioned in my answer here: https://dev59.com/Y2025IYBdhLWcg3wsIMU#5857202/crawling-with-an-authenticated-session-in-scrapy. If you don't understand how to define rules for extracting links, read the documentation carefully:
http://readthedocs.org/docs/scrapy/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
http://readthedocs.org/docs/scrapy/en/latest/topics/link-extractors.html#topics-link-extractors
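For what it's worth, here is a minimal sketch of how the rule-based approach from the linked answer might be wired into the spider above. The allow= pattern and the parse_item name are illustrative assumptions, not values from the original post; the pattern would have to match the LinkedIn result URLs you actually want to follow:

from scrapy.contrib.spiders.init import InitSpider
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from linkedpy.items import LinkedPyItem

class LinkedPySpider(InitSpider):
    name = 'LinkedPy'
    allowed_domains = ['linkedin.com']
    # login_page, start_urls and the init_request/login/check_login_response
    # methods stay exactly as in the spider above.

    rules = (
        # Follow links whose URLs match allow= and hand each followed page to
        # parse_item(); the pattern below is a placeholder, not LinkedIn's
        # real URL scheme.
        Rule(SgmlLinkExtractor(allow=r'/companies/'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Same extraction as the original parse(), under the callback name
        # referenced by the Rule above.
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//ol[@id='result-set']/li")
        items = []
        for site in sites:
            item = LinkedPyItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            items.append(item)
        return items

In the rule-driven spiders described in the linked docs, the framework uses parse() internally to drive the rules, which is why the answer warns against overriding it; item extraction belongs in the callback named by the Rule instead.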

That definitely helped. I'm seeing the successful-login log message. However, I'm not sure def parse(self, response): is actually being run. I tried putting a self.log() inside it, but nothing came back. - Gates
It looks like parse() should be changed to parse_item(). - Gates
Quite possibly the problem is related to the above and to allow=r'-\w+.html$', because I don't know what that is. - Gates
