一种可能的解决方案是爬取数据,但这不起作用,因为当您开始向Google发送数百个请求时,它会阻止您。
有任何想法吗?
顺便说一句,“Google Play开发者API”不是一个选择,因为它只允许您访问您自己的应用程序详细信息。
如果在 requests
库中默认使用 user-agent
作为请求头,那么请求可能会被阻止。python-requests
.
另一个额外的步骤是轮换 user-agent
, 例如,在 PC、移动设备和平板电脑之间以及浏览器之间(如 Chrome、Firefox、Safari、Edge 等)进行切换。可以将用户代理旋转与代理旋转(最好是住宅代理)+验证码解决程序结合使用。
目前,Google Play 商店已经进行了大规模的重新设计,现在几乎完全是动态的。但是,所有数据都可以从内联 JSON 中提取。
对于爬取动态网站,selenium
或 playwright
webdriver 是很好的选择。然而,在我们的情况下,使用 BeautifulSoup 和 正则表达式 可以更快地从页面源代码中提取数据。
我们必须使用 正则表达式
从 HTML 中的所有 <script>
元素中提取特定的 <script>
元素,并使用 json.loads()
将其转换为 dict
:
# https://regex101.com/r/zOMOfo/1
basic_app_info = json.loads(re.findall(r"<script nonce=\"\w+\" type=\"application/ld\+json\">({.*?)</script>", str(soup.select("script")[11]), re.DOTALL)[0])
from bs4 import BeautifulSoup
import requests, re, json, lxml
# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
}
# https://requests.readthedocs.io/en/latest/user/quickstart/#passing-parameters-in-urls
params = {
"id": "com.nintendo.zara", # app name
"gl": "US", # country of the search
"hl": "en_GB" # language of the search
}
html = requests.get("https://play.google.com/store/apps/details", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
# where all app data will be stored
app_data = {
"basic_info":{
"developer":{},
"downloads_info": {}
}
}
# [11] index is a basic app information
# https://regex101.com/r/zOMOfo/1
basic_app_info = json.loads(re.findall(r"<script nonce=\"\w+\" type=\"application/ld\+json\">({.*?)</script>", str(soup.select("script")[11]), re.DOTALL)[0])
# https://regex101.com/r/6Reb0M/1
additional_basic_info = re.search(fr"<script nonce=\"\w+\">AF_initDataCallback\(.*?(\"{basic_app_info.get('name')}\".*?)\);<\/script>",
str(soup.select("script")), re.M|re.DOTALL).group(1)
app_data["basic_info"]["name"] = basic_app_info.get("name")
app_data["basic_info"]["type"] = basic_app_info.get("@type")
app_data["basic_info"]["url"] = basic_app_info.get("url")
app_data["basic_info"]["description"] = basic_app_info.get("description").replace("\n", "") # replace new line character to nothing
app_data["basic_info"]["application_category"] = basic_app_info.get("applicationCategory")
app_data["basic_info"]["operating_system"] = basic_app_info.get("operatingSystem")
app_data["basic_info"]["thumbnail"] = basic_app_info.get("image")
app_data["basic_info"]["content_rating"] = basic_app_info.get("contentRating")
app_data["basic_info"]["rating"] = round(float(basic_app_info.get("aggregateRating").get("ratingValue")), 1) # 4.287856 -> 4.3
app_data["basic_info"]["reviews"] = basic_app_info.get("aggregateRating").get("ratingCount")
app_data["basic_info"]["reviews"] = basic_app_info.get("aggregateRating").get("ratingCount")
app_data["basic_info"]["price"] = basic_app_info["offers"][0]["price"]
app_data["basic_info"]["developer"]["name"] = basic_app_info.get("author").get("name")
app_data["basic_info"]["developer"]["url"] = basic_app_info.get("author").get("url")
# https://regex101.com/r/C1WnuO/1
app_data["basic_info"]["developer"]["email"] = re.search(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+", additional_basic_info).group(0)
# https://regex101.com/r/Y2mWEX/1 (a few matches but re.search always matches the first occurence)
app_data["basic_info"]["release_date"] = re.search(r"\d{1,2}\s[A-Z-a-z]{3}\s\d{4}", additional_basic_info).group(0)
# https://regex101.com/r/7yxDJM/1
app_data["basic_info"]["downloads_info"]["long_form_not_formatted"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(1)
app_data["basic_info"]["downloads_info"]["long_form_formatted"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(2)
app_data["basic_info"]["downloads_info"]["as_displayed_short_form"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(4)
app_data["basic_info"]["downloads_info"]["actual_downloads"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(3)
# https://regex101.com/r/jjsdUP/1
# [2:] skips 2 PEGI logo thumbnails and extracts only app images
app_data["basic_info"]["images"] = re.findall(r",\[\d{3,4},\d{3,4}\],.*?(https.*?)\"", additional_basic_info)[2:]
try:
# https://regex101.com/r/C1WnuO/1
app_data["basic_info"]["video_trailer"] = "".join(re.findall(r"\"(https:\/\/play-games\.\w+\.com\/vp\/mp4\/\d+x\d+\/\S+\.mp4)\"", additional_basic_info)[0])
except:
app_data["basic_info"]["video_trailer"] = None
print(json.dumps(app_data, indent=2, ensure_ascii=False))
示例输出:
[
{
"basic_info": {
"developer": {
"name": "Nintendo Co., Ltd.",
"url": "https://supermariorun.com/",
"email": "supermariorun-support@nintendo.co.jp"
},
"downloads_info": {
"long_form_not_formatted": "100,000,000+",
"long_form_formatted": "100000000",
"as_displayed_short_form": "100M+",
"actual_downloads": "213064462"
},
"name": "Super Mario Run",
"type": "SoftwareApplication",
"url": "https://play.google.com/store/apps/details/Super_Mario_Run?id=com.nintendo.zara&hl=en_GB&gl=US",
"description": "Control Mario with just a tap!",
"application_category": "GAME_ACTION",
"operating_system": "ANDROID",
"thumbnail": "https://play-lh.googleusercontent.com/3ZKfMRp_QrdN-LzsZTbXdXBH-LS1iykSg9ikNq_8T2ppc92ltNbFxS-tORxw2-6kGA",
"content_rating": "Everyone",
"rating": 4.0,
"reviews": "1645926",
"price": "0",
"release_date": "22 Mar 2017",
"images": [
"https://play-lh.googleusercontent.com/yT8ZCQHNB_MGT9Oc6mC5_mQS5vZ-5A4fvKQHHOl9NBy8yWGbM5-EFG_uISOXmypBYQ6G",
"https://play-lh.googleusercontent.com/AvRrlEpV8TCryInAnA__FcXqDu5d3i-XrUp8acW2LNmzkU-rFXkAKgmJPA_4AHbNjyY",
"https://play-lh.googleusercontent.com/AESbAa4QFa9-lVJY0vmAWyq2GXysv5VYtpPuDizOQn40jS9Z_ji8HXHA5hnOIzaf_w",
"https://play-lh.googleusercontent.com/KOCWy63UI2p7Fc65_X5gnIHsErEt7gpuKoD-KcvpGfRSHp-4k8YBGyPPopnrNQpdiQ",
"https://play-lh.googleusercontent.com/iDJagD2rKMJ92hNUi5WS2S_mQ6IrKkz6-G8c_zHNU9Ck8XMrZZP-1S_KkDsA6KDJ9No",
# ...
]
以下为 SerpApi 简单代码示例:
from serpapi import GoogleSearch
import os, json
params = {
"api_key": os.getenv("API_KEY"), # your serpapi api key
"engine": "google_play_product", # parsing engine
"store": "apps", # app page
"gl": "us", # country of the search
"product_id": "com.nintendo.zara", # low review count example to show it exits the while loop
"all_reviews": "true" # shows all reviews
}
search = GoogleSearch(params) # where data extraction happens
results = search.get_dict()
print(json.dumps(results["product_info"], indent=2, ensure_ascii=False))
print(json.dumps(results["media"], indent=2, ensure_ascii=False))
# other data
Output exactly the same as in the previous solution.
如果您需要更多的代码说明,可以参考这篇Python中爬取Google Play商店应用程序的博客文章。
免责声明,我在SerpApi工作。
Android Marketing API被用来从Google商店获取所有应用程序的详细信息,你可以在这里查看: https://code.google.com/p/android-market-api/
很遗憾,Google Play(以前称为Android Market)没有公开的API。
要获取所需的数据,您可以开发自己的HTML爬虫,解析页面并提取所需的应用程序元数据。这个话题已经在其他问题中讨论过,例如这里。
如果您不想自己实现所有这些(正如您提到的,这是一个复杂的项目),您可以使用第三方服务通过基于JSON的API访问Android应用程序元数据。
例如,42matters.com(我所在的公司)提供了Android和iOS的API,您可以在这里查看更多详细信息。
端点范围从“查找”(获取一个应用程序的元数据,可能是您需要的)到“搜索”,但我们还公开了来自领先应用商店的“排名历史记录”和其他统计信息。我们为所有支持的功能提供了广泛的文档,您可以在左侧面板中找到它们:42matters docs
我希望这可以帮助您,否则请随时与我联系。我非常了解这个行业,可以指引您正确的方向。
此致
敬礼
Andrea