Web scraping: scraping multiple websites with Python

3
from bs4 import BeautifulSoup
import requests

url = 'https://uk.trustpilot.com/review/thread.com'
for pg in range(1, 10):
  pg = url + '?page=' + str(pg)
  page = requests.get(pg)  # fetch the page before parsing it
  soup = BeautifulSoup(page.content, 'lxml')
  for paragraph in soup.find_all('p'):
     print(paragraph.text)

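I want to scrape the ranking, the review text, and the review date from https://uk.trustpilot.com/review/thread.com, but I don't know how to scrape multiple pages and collect the results in a pandas DataFrame.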

Which ranking are you referring to? - Bitto
Hi Bitto, the ranking is the number of stars. - user10196527
3 Answers

1
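Hi. You need to send a request to each page and then process the response. Because some items are not available directly as text inside a tag, you have to get them either from the JavaScript (I use json.loads to get the date this way) or from the class name (which is how I get the rating).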
import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

final_list = []  # rows that will become the DataFrame
url = 'https://uk.trustpilot.com/review/thread.com'
for pg in range(1, 3):
    page_url = url + '?page=' + str(pg)
    r = requests.get(page_url)
    soup = BeautifulSoup(r.text, 'lxml')
    for paragraph in soup.find_all('section', class_='review__content'):
        title = paragraph.find('h2', class_='review-content__title').text.strip()
        content = paragraph.find('p', class_='review-content__text').text.strip()
        # the dates are embedded in the element as a JSON string
        datedata = json.loads(paragraph.find('div', class_='review-content-header__dates').text)
        date = datedata['publishedDate'].split('T')[0]
        # the second class name ends in the star count
        rating_class = paragraph.find('div', class_='star-rating')['class']
        rating = rating_class[1].split('-')[-1]
        final_list.append([title, content, date, rating])
df = pd.DataFrame(final_list, columns=['Title', 'Content', 'Date', 'Rating'])
print(df)

Output

                                                Title                                            Content        Date Rating
0                      I ordered a jacket 2 weeks ago  I ordered a jacket 2 weeks ago.  Still hasn't ...  2019-01-13      1
1              I've used this service for many years…  I've used this service for many years and get ...  2018-12-31      4
2                                       Great website  Great website, tailored recommendations, and e...  2018-12-19      5
3              I was excited by the prospect offered…  I was excited by the prospect offered by threa...  2018-12-18      1
4       Thread set the benchmark for customer service  Firstly, their customer service is second to n...  2018-12-12      5
5                                    It's a good idea  It's a good idea.  I am in between sizes and d...  2018-12-02      3
6                             Great experience so far  Great experience so far. Big choice of clothes...  2018-10-31      5
7                    Absolutely love using Thread.com  Absolutely love using Thread.com.  As a man wh...  2018-10-31      5
8                 I'd like to give Thread a one star…  I'd like to give Thread a one star review, but...  2018-10-30      2
9            Really enjoying the shopping experience…  Really enjoying the shopping experience on thi...  2018-10-22      5
10                         The only way I buy clothes  I absolutely love Thread. I've been surviving ...  2018-10-15      5
11                                  Excellent Service  Excellent ServiceQuick delivery, nice items th...  2018-07-27      5
12             Convenient way to order clothes online  Convenient way to order clothes online, and gr...  2018-07-05      5
13                Superb - would thoroughly recommend  Recommendations have been brilliant - no more ...  2018-06-24      5
14                    First time ordering from Thread  First time ordering from Thread - Very slow de...  2018-06-22      1
15          Some of these criticisms are just madness  I absolutely love thread.com, and I can't reco...  2018-05-28      5
16                                       Top service!  Great idea and fantastic service. I just recei...  2018-05-17      5
17                                      Great service  Great service. Great clothes which come well p...  2018-05-05      5
18                                          Thumbs up  Easy, straightforward and very good costumer s...  2018-04-17      5
19                 Good idea, ruined by slow delivery  I really love the concept and the ordering pro...  2018-04-08      3
20                                      I love Thread  I have been using thread for over a year. It i...  2018-03-12      5
21      Clever simple idea but.. low quality clothing  Clever simple idea but.. low quality clothingL...  2018-03-12      2
22                      Initially I was impressed....  Initially I was impressed with the Thread shop...  2018-02-07      2
23                                 Happy new customer  Joined the site a few weeks ago, took a short ...  2018-02-06      5
24                          Style tips for mature men  I'm a man of mature age, let's say a "baby boo...  2018-01-31      5
25            Every shop, every item and in one place  Simple, intuitive and makes online shopping a ...  2018-01-28      5
26                     Fantastic experience all round  Fantastic experience all round.  Quick to regi...  2018-01-28      5
27          Superb "all in one" shopping experience …  Superb "all in one" shopping experience that i...  2018-01-25      5
28  Great for time poor people who aren’t fond of ...  Rally love this company. Super useful for thos...  2018-01-22      5
29                            Really is worth trying!  Quite cautious at first, however, love the way...  2018-01-10      4
30           14 days for returns is very poor given …  14 days for returns is very poor given most co...  2017-12-20      3
31                  A great intro to online clothes …  A great intro to online clothes shopping. Usef...  2017-12-15      5
32                           I was skeptical at first  I was skeptical at first, but the service is s...  2017-11-16      5
33            seems good to me as i hate to shop in …  seems good to me as i hate to shop in stores, ...  2017-10-23      5
34                          Great concept and service  Great concept and service. This service has be...  2017-10-17      5
35                                      Slow dispatch  My Order Dispatch was extremely slow compared ...  2017-10-07      1
36             This company sends me clothes in boxes  This company sends me clothes in boxes! I find...  2017-08-28      5
37          I've been using Thread for the past six …  I've been using Thread for the past six months...  2017-08-03      5
38                                             Thread  Thread, this site right here is literally the ...  2017-06-22      5
39                                       good concept  The website is a good concept in helping buyer...  2017-06-14      3
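
The two extraction tricks used in this answer (reading the date from embedded JSON and the star count from a class name) can be tried offline on a small HTML fragment. The following is only a sketch: the markup in SAMPLE_HTML is hypothetical and merely modeled on the class names that appear in the answer's code, parse_review is an illustrative helper name, and 'html.parser' is used instead of 'lxml' so that no extra parser is needed.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical markup, modeled on the class names used in the answer.
SAMPLE_HTML = """
<section class="review__content">
  <div class="star-rating star-rating-4 star-rating--medium"></div>
  <div class="review-content-header__dates">
    {"publishedDate": "2018-12-31T10:15:00Z", "updatedDate": null}
  </div>
  <h2 class="review-content__title"> Great website </h2>
  <p class="review-content__text"> Tailored recommendations. </p>
</section>
"""


def parse_review(section):
    """Return [title, content, date, rating] for one review section."""
    title = section.find('h2', class_='review-content__title').text.strip()
    content = section.find('p', class_='review-content__text').text.strip()
    # The element's text is a JSON string rather than a readable date.
    date_data = json.loads(section.find('div', class_='review-content-header__dates').text)
    date = date_data['publishedDate'].split('T')[0]
    # BeautifulSoup returns the class attribute as a list of class names.
    rating_class = section.find('div', class_='star-rating')['class']
    rating = rating_class[1].split('-')[-1]
    return [title, content, date, rating]


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = [parse_review(s) for s in soup.find_all('section', class_='review__content')]
print(rows)
```

Taking index 1 of the class list relies on the star count always being the second class name, which is an assumption about the page's markup and may break if the site changes its class order.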

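Note: although I was able to "hack" my way to the results for this site, it is better to use Selenium to scrape dynamic pages.
Edit: code that finds the number of pages automatically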
import json
import math

import pandas as pd
import requests
from bs4 import BeautifulSoup

final_list = []  # rows that will become the DataFrame
url = 'https://uk.trustpilot.com/review/thread.com'
# make one request to get the number of reviews
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
review_count_h2 = soup.find('h2', class_="header--inline").text
review_count = int(review_count_h2.strip().split(' ')[0].strip())
# there are 20 reviews per page, so the number of pages can be calculated as
pages = int(math.ceil(review_count / 20))
# loop over pages 1 to pages (range stops before pages + 1)
for pg in range(1, pages + 1):
    page_url = url + '?page=' + str(pg)
    r = requests.get(page_url)
    soup = BeautifulSoup(r.text, 'lxml')
    for paragraph in soup.find_all('section', class_='review__content'):
        try:
            title = paragraph.find('h2', class_='review-content__title').text.strip()
            content = paragraph.find('p', class_='review-content__text').text.strip()
            datedata = json.loads(paragraph.find('div', class_='review-content-header__dates').text)
            date = datedata['publishedDate'].split('T')[0]
            rating_class = paragraph.find('div', class_='star-rating')['class']
            rating = rating_class[1].split('-')[-1]
            final_list.append([title, content, date, rating])
        except AttributeError:
            # find() returned None for a missing element; skip this review
            pass
df = pd.DataFrame(final_list, columns=['Title', 'Content', 'Date', 'Rating'])
print(df)
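
The page arithmetic above can be checked in isolation. This is a minimal sketch, not part of the original answer: parse_review_count and page_count are hypothetical helper names, the 20 reviews per page comes from the answer, and the handling of a thousands separator such as "1,234 Reviews" is an assumption about how larger counts might be displayed.

```python
import math
import re


def parse_review_count(header_text):
    """Pull the leading integer out of a header such as '1,234 Reviews'."""
    match = re.search(r'\d[\d,]*', header_text)
    if match is None:
        raise ValueError('no review count in %r' % header_text)
    # Assumption: commas, if present, are thousands separators.
    return int(match.group().replace(',', ''))


def page_count(review_count, per_page=20):
    """Number of pages needed to show review_count reviews."""
    return math.ceil(review_count / per_page)


count = parse_review_count(' 45 Reviews ')
pages = page_count(count)
print(count, pages, list(range(1, pages + 1)))
```

Rounding up with math.ceil matters because a final, partly filled page still has to be requested.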

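This is a perfect solution, Bitto. Do you know how to make it run over all the pages automatically, without having to type in the range()? - user10196527
What can I say? You did an amazing job! May I ask how you came up with soup.find('h2',class_="header--inline").text? - user10196527
@ZakkYang You can look at the page source. That is the class of the element holding the review-count text. - Bitto
@ZakkYang Yes. Some posts were removed by moderators, so they no longer have an h2 tag, and looking it up leads to an AttributeError. I have edited the last code block to catch the exception with try/except. Give it a try. - Bitto
You are the best! - user10196527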
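
The AttributeError discussed in these comments can be reproduced without touching the site. In this sketch the markup is hypothetical (the second section simply lacks a title, as a removed review might); it shows that find() returns None when nothing matches, so accessing .text on the result raises AttributeError, which the try/except then skips.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second review has no title element.
SAMPLE_HTML = """
<section class="review__content"><h2 class="review-content__title">Top service!</h2></section>
<section class="review__content"><p>Removed by a moderator.</p></section>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
titles = []
skipped = 0
for section in soup.find_all('section', class_='review__content'):
    try:
        # find() returns None when nothing matches, so .text raises AttributeError
        titles.append(section.find('h2', class_='review-content__title').text.strip())
    except AttributeError:
        skipped += 1

print(titles, skipped)
```

Counting the skipped sections, instead of silently passing, makes it easier to notice if a markup change causes every review to be dropped.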

0

The site is dynamic. While you can use BeautifulSoup to find some elements of the reviews, you need Selenium to access the dynamically generated content:

from bs4 import BeautifulSoup as soup
from selenium import webdriver
import re, time
d = webdriver.Chrome('/Users/jamespetullo/Downloads/chromedriver')
d.get('https://uk.trustpilot.com/review/thread.com')
def scrape_review(_d:soup) -> dict:
  # raw strings keep the regex escapes (\d, \s) from being read as string escapes
  return {'date':_d.find('time').text, 
          'ranking':re.findall(r'(?<=star-rating-)\d+', str(_d.find('div', {'class':re.compile(r'^star-rating')})['class']))[0],
           'review':_d.find('p', {'class':'review-content__text'}).text
          }

_final_results, page = {}, 1
d1 = soup(d.page_source, 'html.parser')
_final_results[page] = list(map(scrape_review, d1.find_all('div', {'class':'review-content'})))
# keep following the "next page" link until the page no longer has one
while d1.find('a', {'class':re.compile(r'\snext-page')}):
    page += 1
    d.get("https://uk.trustpilot.com"+d1.find('a', {'class':re.compile(r'\snext-page')})['href'])
    d1 = soup(d.page_source, 'html.parser')
    _final_results[page] = list(map(scrape_review, d1.find_all('div', {'class':'review-content'})))
    time.sleep(2)

Output (first page):

{1: [{'date': 'Updated 2 hours ago', 'ranking': '1', 'review': '\n            I ordered a sweatshirt on Dec.21st.  Today is Jan 14th and there is no tracking label because they haven\'t even sent it out.  No way to contact anyone by phone, and their responses to my countless emails never address anything...they just state every time "we will investigate".  Investigate for 3 weeks???  At this point I feel I have no option but to try to recoup my money through Paypal.  BUYER BEWARE!!!  SCAM???\n        '}, {'date': 'A day ago', 'ranking': '1', 'review': "\n            I ordered a jacket 2 weeks ago.  Still hasn't shipped.  No response to my email.  No 800 cutomer service number.  I think I just got scammed out of $200.\n        "}, {'date': '31 Dec 2018', 'ranking': '4', 'review': "\n            I've used this service for many years and get almost all of my clothes from here. It's very efficient compared to shopping in the shops and far more convenient than shopping across many online stores...I find the recommendations a little wanting. They used to be far better, now I find the recommendations a little repetitive week after week.The ability to order so many brands and return them when unwanted all in one place is an excellent convenience factor.The range of clothes should be a little better on the formal side, but for casual and smart casual it's very good.\n        "}, {'date': '19 Dec 2018', 'ranking': '5', 'review': '\n            Great website, tailored recommendations, and even sales. Great to have fast-fashion site dedicated to men.The delivery and return service is very easy - would recommend. Keep it up Thread!\n        '}, {'date': '18 Dec 2018', 'ranking': '1', 'review': '\n            I was excited by the prospect offered by thread.  I thought it was an interesting concept, and one which I needed.  At the time, I couldn\'t find clothes that I was really happy with and I thought the concept of an "online personal shopper" was just what I needed.  
However, having spent an age filling in all the forms, my request for the very first thing that I\'d said I was looking for - just trousers, nothing out of the ordinary - was completely ignored.  All of my expressed preferences were ignored, to the extent that styles that I had specifically stated that I didn\'t like were the very styles offered.  I asked for trousers and was offered endless amount of shoes, which I said I didn\'t need.  It became very clear that the personal shopper was either not listening or was a bot.  Thread\'s messages became simply spam.  Never again.\n        '}, {'date': '12 Dec 2018', 'ranking': '5', 'review': "\n            Firstly, their customer service is second to none! To cut a long story short, I had a question about my order and the person I emailed was extremely helpful and resolved the matter in minutes.Secondly, when my parcel arrived, it was well packaged and looked fantastic. The products were also great quality - and fit perfect as described.I genuinely cannot find a fault with anything. They have however done something damaging - I will not be buying my clothes from anywhere else now, other than thread. Simply because I was made to feel like a person as opposed to just another order number. I'm sincerely impressed and will be telling people about this. Well done Thread!\n        "}, {'date': '2 Dec 2018', 'ranking': '3', 'review': "\n            It's a good idea.  I am in between sizes and don't have a good eye for what looks good on me.But the execution of the idea lets Thread down.I mostly get recommendations that scream Debenhams mid-age wardrobe.  Despite me clicking on several brands I dislike, Thread kept recommending.Price point isn't its selling point: you'd better go elsewhere if you're after a discount.  You can get 10-15% off things.  But in fairness to it, it doesn't set out to be a cost-saving enterprise.I'd use Thread more if it started working more with a wider range of suppliers. 
Currently it seems like it's Debenhams with a few extras here and there. Particularly true of accessories that were recommended to me.\n        "}, {'date': '31 Oct 2018', 'ranking': '5', 'review': '\n            Great experience so far. Big choice of clothes in different styles, option to pay in 30 days gives a lot of flexibility. Up to 10 outfit ideas a week. And the fact that you have a dedicated stylist you can ask pretty much anything is game-changing.\n        '}, {'date': '31 Oct 2018', 'ranking': '5', 'review': "\n            Absolutely love using Thread.com.  As a man who doesn't like to go shopping and is quite lazy about buying new clothes, this has been a revelation.  The style recommendations are great and you know outfits are going to work together.  I probably keep 60-70% of things I order but returns are super easy.  Since using Thread.com I probably add 2-3 new pieces to my wardrobe each month and my friends and co-workers have all commented that I'm dressing sharper!\n        "}, {'date': '30 Oct 2018', 'ranking': '2', 'review': "\n            I'd like to give Thread a one star review, but their behaviour has never been rude, so two stars it isTheir 'personalised' recommendations aren't Their 'genius' AI isn't Their stylists din't give any useful advice or assistance, rarely respond to emails, and when they do don't answer even straightforwards questionsIf you reject item criteria (e.g. No polyester) or even whole item categories (e.g. No jeans) these still crop up week after weekAvoid\n        "}, {'date': 'Updated 22 Oct 2018', 'ranking': '5', 'review': '\n            Really enjoying the shopping experience on this site. I added a view items to my wishlist, and got an email when one of the items it the sale. Speedy delivery, and some lovehearts for free to top it off.\n        '}, {'date': '15 Oct 2018', 'ranking': '5', 'review': "\n            I absolutely love Thread. I've been surviving on dribs and drabs of new clothes for yonks. 
I hate shopping, never feel like I can browse a good range and get frustrated within an hour or so. With Thread I'm spending more time looking around in smaller bursts (10 mins on the site every so often). The personalised suggestions are great and after a few weeks of customising (liking and disliking suggestions) I'm getting mostly things I like. I'm finally buying new clothes semi-regularly and I look less like a scruffy git and more like someone who's putting some effort in. Today I received no fewer than 6 complements for my new jumper I'm happy with the delivery times. It's not next day but I don't feel it needs to be. I'm buying steadily, once every month or two and don't feel next day would add anything. Returns are incredibly easy, they provide a pre-printed Collect+ returns form in the box!If I had one criticism it would be, on behalf of my fiancée, that they don't do women's wear. Yet.\n        "}, {'date': '26 Jul 2018', 'ranking': '5', 'review': "\n            Excellent ServiceQuick delivery, nice items that fitted perfectly and were wrapped really nice. Loved the personal note that was inside the package - perfection! Will be shopping here more often - wish there were more merchants that went above their customer's satisfaction :)\n        "}, {'date': '5 Jul 2018', 'ranking': '5', 'review': '\n            Convenient way to order clothes online, and great for discovering nice things that are a little bit outside my comfort zone. Easy returns process.\n        '}, {'date': '24 Jun 2018', 'ranking': '5', 'review': '\n            Recommendations have been brilliant - no more looking through pages and pages of unsuitable clothes.  Delivery has been spot on and returns was simple and quick.\n        '}, {'date': '22 Jun 2018', 'ranking': '1', 'review': '\n            First time ordering from Thread - Very slow delivery, only to receive a completely different product than the one ordered! 
First impressions, not good!\n        '}, {'date': '28 May 2018', 'ranking': '5', 'review': "\n            I absolutely love thread.com, and I can't recommend them enough.  Great service and competitive prices (they're normally price match if you ask). The few times I've had cause to contact them they've always been really friendly and helpful.I think I'm a fairly typical guy in that i hate shopping, but thread removes all the pain from the experience and I've picked up a lot of great stuff.Some of the criticisms on here are just mad though.  Complaining about the quality of the clothes? Thread dont make the clothes,  they're just reselling brands, so if you dont like the quality spend a little more and get something better.Similar the complaints about sizes and colour.  Again, they dont make the clothes and most of the time they reuse images from the original brand's website.  Returns are free, so just buy a couple of sizes if you're not sure. The delivery time is somewhat fair - sure, they don't usually do next day delivery. If you understand how they make their money/reduce costs by buying in bulk, I don't think its an unreasonable trade off given the huge value-add they're offering. Overall, I just dont understand what more people want from them - they offer free advice on what to buy, sell it at the cheapest possible price, deliver it free, and take it back free (and no questions) if you decide you don't want it.  You don't even have to write the return label!In summary,  great service and dont listen to the naysayers. 5 stars \n        "}, {'date': '17 May 2018', 'ranking': '5', 'review': '\n            Great idea and fantastic service. I just received my first order and couldn’t be happier with the quality of items, their price range, and excellent delivery times. A lot of care also went into packaging, and  they really look after subscribers by recommending styles that match your stats and body shape, amongst other things. 
Highly recommended!\n        '}, {'date': '5 May 2018', 'ranking': '5', 'review': '\n            Great service. Great clothes which come well packaged on delivery. Prompt credit back to account of any items returned.\n        '}, {'date': '17 Apr 2018', 'ranking': '5', 'review': '\n            Easy, straightforward and very good costumer service.\n        '}], }

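import pandas as pd
# flatten the per-page lists so that each review becomes one row
result = pd.DataFrame([review for reviews in _final_results.values() for review in reviews])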

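Hi Ajax, your solution only looks at the first page. Is there any way to make it work for all pages automatically and add the review title to the DataFrame? Thanks. - user10196527
@ZakkYang Please see my recent edit. I used a while loop to keep looking for the next page, because each pagination bar only displays at most six page results. - Ajax1234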
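
The while-loop idea from the comment above (keep following the "next page" link until there is none) can be sketched without a browser. Everything here is a stand-in: FAKE_SITE, fetch, and crawl are hypothetical names, and fetch only imitates what loading a page with Selenium and parsing it with BeautifulSoup would return.

```python
# Hypothetical stand-in for the site: each path maps to (reviews, next path or None).
FAKE_SITE = {
    '/review/example?page=1': (['review a', 'review b'], '/review/example?page=2'),
    '/review/example?page=2': (['review c'], '/review/example?page=3'),
    '/review/example?page=3': (['review d'], None),
}


def fetch(path):
    """Placeholder for loading and parsing one page."""
    return FAKE_SITE[path]


def crawl(start_path):
    """Collect reviews page by page until a page has no next link."""
    results, page = {}, 1
    reviews, next_path = fetch(start_path)
    results[page] = reviews
    while next_path is not None:
        page += 1
        reviews, next_path = fetch(next_path)
        results[page] = reviews
    return results


results = crawl('/review/example?page=1')
# Flatten the page-keyed dict into a single list, one entry per review.
all_reviews = [r for page_reviews in results.values() for r in page_reviews]
print(results)
print(all_reviews)
```

Following the link avoids having to know the total number of pages up front, which is useful when the pagination bar shows only a few page numbers at a time.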

0
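You can extract the information from the script tag that contains JSON. This also lets you calculate the number of pages, because the total review count is given and you can count the reviews on a page.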
import requests
from bs4 import BeautifulSoup as bs
import json
import math
import pandas as pd

def getInfo(url):
    res=requests.get(url)
    soup = bs(res.content, 'lxml')
    data = json.loads(soup.select_one('[type="application/ld+json"]').text.strip()[:-1])[0]
    return data

def addItems(data):
    result = []
    for item in data['review']:

        review = {    
                  'Headline': item['headline'] ,
                  'Ranking': item['reviewRating']['ratingValue'],
                  'Review': item['reviewBody'],
                  'ReviewDate': item['datePublished']
                }

        result.append(review)
    return result

url = 'https://uk.trustpilot.com/review/thread.com?page={}'
results = []
data = getInfo(url.format(1))
results.append(addItems(data))  
totalReviews = int(data['aggregateRating']['reviewCount'])
reviewsPerPage = len(data['review'])
totalPages = math.ceil(totalReviews/reviewsPerPage)

if totalPages > 1:
    for page in range(2, totalPages + 1):
        data = getInfo(url.format(page))
        results.append(addItems(data)) 

final = [item for result in results for item in result]
df = pd.DataFrame(final)
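
To see what the JSON handling in this answer does without making a request, the same steps can be run on a small payload. This is only a sketch: SAMPLE_JSON_LD is invented, and only its field names are taken from the answer's code. The answer also slices off the last character of the script text before parsing, presumably to drop a trailing character on the live page; the sample here is already valid JSON, so that step is omitted.

```python
import json
import math

# Invented payload; only the field names are taken from the answer's code.
SAMPLE_JSON_LD = """
[{
  "aggregateRating": {"reviewCount": "5"},
  "review": [
    {"headline": "Great website", "reviewBody": "Tailored recommendations.",
     "datePublished": "2018-12-19T09:00:00Z", "reviewRating": {"ratingValue": "5"}},
    {"headline": "Slow dispatch", "reviewBody": "My order was slow.",
     "datePublished": "2017-10-07T12:30:00Z", "reviewRating": {"ratingValue": "1"}}
  ]
}]
"""

data = json.loads(SAMPLE_JSON_LD)[0]

# One dict per review, using the same column names as the answer.
rows = [
    {
        'Headline': item['headline'],
        'Ranking': item['reviewRating']['ratingValue'],
        'Review': item['reviewBody'],
        'ReviewDate': item['datePublished'],
    }
    for item in data['review']
]

total_reviews = int(data['aggregateRating']['reviewCount'])
reviews_per_page = len(data['review'])
total_pages = math.ceil(total_reviews / reviews_per_page)
print(rows)
print(total_pages)
```

Deriving reviews_per_page from the first page, as the answer does, avoids hard-coding a page size, at the cost of assuming that the first page is full.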

Content provided by Stack Overflow.