Scraping LinkedIn data with Selenium

3

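I'm a beginner, fairly new to both web development and scraping, and I'm trying to challenge myself by scraping sites like LinkedIn. Because their ids change dynamically, scraping them correctly is somewhat difficult.

I'm trying to scrape the "Experience" section of a LinkedIn profile with the following code: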

experience = driver.find_element_by_xpath('//section[@id = "experience-section"]/ul/li[@class="position"]')

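The driver has loaded the entire LinkedIn profile page. I want every position listed under the "Experience" section. The error message is:
Unable to locate element: {"method":"xpath","selector":"//section[@id = "experience-section"]/ul/li/div[@class="position"]"}
I can scrape other content on LinkedIn, but the "Experience" section is a big challenge for me. Is the XPath wrong? If so, how should I correct it?
Thanks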

<section id="experience-section" class="pv-profile-section experience-section ember-view"><header class="pv-profile-section__card-header">
  <h2 class="pv-profile-section__card-heading t-20 t-black t-normal">
    Experience
  </h2>

<!----></header>

<ul id="ember1620" class="pv-profile-section__section-info section-info pv-profile-section__section-info--has-no-more ember-view"><li id="ember1622" class="pv-profile-section__sortable-item pv-profile-section__section-info-item relative pv-profile-section__list-item sortable-item ember-view"><div id="ember1623" class="pv-entity__position-group-pager ember-view">            <li id="392598211" class="pv-profile-section__sortable-card-item pv-profile-section pv-position-entity ember-view"><!----><a data-control-name="background_details_company" href="/company/8736/" id="ember1626" class="ember-view">      <div class="pv-entity__logo company-logo">
  <img class="lazy-image pv-entity__logo-img pv-entity__logo-img EntityPhoto-square-5 loaded" alt="Bill &amp; Melinda Gates Foundation" src="https://media.licdn.com/dms/image/C560BAQHvFIyUvuKtQA/company-logo_400_400/0?e=1556755200&amp;v=beta&amp;t=Qhh8_KnrE-OiuXAutFyeI69tgUF3c1ptC9N12siDO4o">
</div>
<div class="pv-entity__summary-info pv-entity__summary-info--background-section ">
  <h3 class="t-16 t-black t-bold">Co-chair</h3>

  <h4 class="t-16 t-black t-normal">
    <span class="visually-hidden">Company Name</span>
    <span class="pv-entity__secondary-title">Bill &amp; Melinda Gates Foundation</span>
  </h4>

    <div class="display-flex">
    <h4 class="pv-entity__date-range t-14 t-black--light t-normal">
      <span class="visually-hidden">Dates Employed</span>
      <span>2000 – Present</span>
    </h4>
      <h4 class="t-14 t-black--light t-normal">
        <span class="visually-hidden">Employment Duration</span>
        <span class="pv-entity__bullet-item-v2">19 yrs</span>
      </h4>
  </div>

<!---->
</div>

</a>
<!---->
</li>


</div>
</li><li id="ember1630" class="pv-profile-section__sortable-item pv-profile-section__section-info-item relative pv-profile-section__list-item sortable-item ember-view"><div id="ember1631" class="pv-entity__position-group-pager ember-view">            <li id="392599749" class="pv-profile-section__sortable-card-item pv-profile-section pv-position-entity ember-view"><!----><a data-control-name="background_details_company" href="/company/1035/" id="ember1634" class="ember-view">      <div class="pv-entity__logo company-logo">
  <img class="lazy-image pv-entity__logo-img pv-entity__logo-img EntityPhoto-square-5 loaded" alt="Microsoft" src="https://media.licdn.com/dms/image/C4D0BAQEko6uLz7XylA/company-logo_400_400/0?e=1556755200&amp;v=beta&amp;t=XQhwV5ruWfGBfjgQylV9gkeXD8VnQRBHGd1bOfTs2tw">
</div>
<div class="pv-entity__summary-info pv-entity__summary-info--background-section ">
  <h3 class="t-16 t-black t-bold">Co-founder</h3>

  <h4 class="t-16 t-black t-normal">
    <span class="visually-hidden">Company Name</span>
    <span class="pv-entity__secondary-title">Microsoft</span>
  </h4>

    <div class="display-flex">
    <h4 class="pv-entity__date-range t-14 t-black--light t-normal">
      <span class="visually-hidden">Dates Employed</span>
      <span>1975 – Present</span>
    </h4>
      <h4 class="t-14 t-black--light t-normal">
        <span class="visually-hidden">Employment Duration</span>
        <span class="pv-entity__bullet-item-v2">44 yrs</span>
      </h4>
  </div>

<!---->
</div>

</a>
<!---->
</li>


</div>
</li>
</ul>
<!----></section>

---- Update: I used the solution provided by Sers.

driver.get('https://www.linkedin.com/in/williamhgates/')
experience = driver.find_elements_by_xpath('//section[@id = "experience-section"]/ul//li')
for item in experience:
    print(item.text)
    print("")

I found that each result appears twice:

Co-chair
Company Name
Bill & Melinda Gates Foundation
Dates Employed
2000 – Present
Employment Duration
19 yrs

Co-chair
Company Name
Bill & Melinda Gates Foundation
Dates Employed
2000 – Present
Employment Duration
19 yrs

Co-founder
Company Name
Microsoft
Dates Employed
1975 – Present
Employment Duration
44 yrs

Co-founder
Company Name
Microsoft
Dates Employed
1975 – Present
Employment Duration
44 yrs


1
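Could you post the HTML that is causing the problem? - KunduK
Added the HTML. - meecrob
I don't see any element whose @class value equals "position". Which node are you trying to target? Did you mean to test whether @class contains "position"? - Mads Hansen
1 Answer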

1
The problem with your XPath is that the li is not directly under the ul. Try the following XPath:
//section[@id = "experience-section"]/ul//li
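To see why the original predicate finds nothing while the descendant query does, here is a minimal sketch that runs both against a trimmed-down copy of the posted markup. It uses Python's standard xml.etree.ElementTree as a stand-in for Selenium, and the SAMPLE string is a hypothetical simplification, not the real page. Note that [@class="position"] compares the whole attribute string, and no li in the posted HTML has a class equal to exactly "position".

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the posted markup (not the real page).
SAMPLE = """
<section id="experience-section">
  <ul>
    <li class="pv-profile-section__sortable-item sortable-item">
      <div class="pv-entity__position-group-pager">
        <li class="pv-profile-section pv-position-entity"><h3>Co-chair</h3></li>
      </div>
    </li>
    <li class="pv-profile-section__sortable-item sortable-item">
      <div class="pv-entity__position-group-pager">
        <li class="pv-profile-section pv-position-entity"><h3>Co-founder</h3></li>
      </div>
    </li>
  </ul>
</section>
"""

section = ET.fromstring(SAMPLE)

# Exact-match predicate, as in the question: the full class string must equal "position".
exact = section.findall("./ul/li[@class='position']")

# Descendant step without a class predicate, as suggested in this answer.
descendants = section.findall("./ul//li")

# Keep only the li elements that carry one specific class token.
positions = [li for li in descendants
             if "pv-position-entity" in li.get("class", "").split()]
titles = [li.findtext("h3") for li in positions]

print(len(exact), len(descendants), titles)
```

In Selenium itself, a comparable token test can be written in XPath 1.0 as contains(concat(" ", normalize-space(@class), " "), " pv-position-entity "), assuming the page still uses that class name.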

Update
driver.get('https://www.linkedin.com/in/williamhgates/')
experience = driver.find_elements_by_css_selector('#experience-section .pv-profile-section')
for item in experience:
    print(item.text)
    print("")
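The doubled output seems to come from the nesting in the posted HTML: each entry is an outer li wrapping an inner li, so //li returns both, and the outer element's text includes the inner one's. The CSS selector above matches only descendants that carry the pv-profile-section class token, which in the posted markup is the inner li. Below is a minimal sketch of that difference, assuming a trimmed-down SAMPLE of the markup and using the standard xml.etree.ElementTree in place of Selenium. The helpers text_of and has_class are hypothetical and only approximate item.text and CSS class matching.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the posted markup (not the real page).
SAMPLE = """
<section id="experience-section" class="pv-profile-section experience-section">
  <ul class="pv-profile-section__section-info">
    <li class="pv-profile-section__sortable-item">
      <div class="pv-entity__position-group-pager">
        <li class="pv-profile-section pv-position-entity">
          <h3>Co-chair</h3>
          <span>Bill &amp; Melinda Gates Foundation</span>
        </li>
      </div>
    </li>
    <li class="pv-profile-section__sortable-item">
      <div class="pv-entity__position-group-pager">
        <li class="pv-profile-section pv-position-entity">
          <h3>Co-founder</h3>
          <span>Microsoft</span>
        </li>
      </div>
    </li>
  </ul>
</section>
"""


def text_of(elem):
    """Rough stand-in for Selenium's item.text: non-empty text nodes, one per line."""
    return "\n".join(t.strip() for t in elem.itertext() if t.strip())


def has_class(elem, token):
    """True if the element's class attribute contains the given whole token."""
    return token in elem.get("class", "").split()


section = ET.fromstring(SAMPLE)

# Equivalent of //section[@id="experience-section"]/ul//li: outer and inner li both match.
xpath_texts = [text_of(li) for li in section.findall("./ul//li")]

# Rough equivalent of '#experience-section .pv-profile-section':
# descendants of the section (not the section itself) with that class token.
css_texts = [text_of(e) for e in section.iter()
             if e is not section and has_class(e, "pv-profile-section")]

print(len(xpath_texts), len(css_texts))
```

With this sample, the descendant query yields each position twice, while the class-token filter yields each position once.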

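I know this is basic stuff. It worked! Thank you very much. However, if I use it as follows: experience = driver.find_elements_by_xpath('//section[@id = "experience-section"]/ul//li'), the results are duplicated when I print them: for item in experience: print(item.text) - meecrob
I get the Co-chair role twice in my results. What could be wrong? - meecrob
@meecrob Share your code. Also, feel free to accept the answer. - Sers
I've updated the original post. I'll mark your answer as accepted. - meecrob
@meecrob, please check my update; I changed the XPath to a CSS selector. - Sers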

Content provided by Stack Overflow.