How to loop through a paginated API using Python

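I need to retrieve the 500 most popular films from a REST API, but the results are limited to 20 per page and I can only make 40 calls every 10 seconds (https://developers.themoviedb.org/3/getting-started/request-rate-limiting). I have not been able to loop through the paginated results dynamically so that the 500 most popular results end up in a single list.
I can successfully return the top 20 most popular films (see below) and enumerate them, but I am stuck on writing a loop that pages through the top 500 without timing out because of the API rate limit.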
import requests #to make TMDB API calls

#Discover API url filtered to movies >= 2004 and containing Drama genre_ID: 18
discover_api = 'https://api.themoviedb.org/3/discover/movie?api_key=['my api key']&language=en-US&sort_by=popularity.desc&include_adult=false&include_video=false&primary_release_year=>%3D2004&with_genres=18'

#Returning all drama films >= 2004 in popularity desc
discover_api = requests.get(discover_api).json()

most_popular_films = discover_api['results']

#printing movie_id and movie_title by popularity desc
for i, film in enumerate(most_popular_films):
    print(i, film['id'], film['title'])



Sample response:

{
  "page": 1,
  "total_results": 101685,
  "total_pages": 5085,
  "results": [
    {
      "vote_count": 13,
      "id": 280960,
      "video": false,
      "vote_average": 5.2,
      "title": "Catarina and the others",
      "popularity": 130.491,
      "poster_path": "/kZMCbp0o46Tsg43omSHNHJKNTx9.jpg",
      "original_language": "pt",
      "original_title": "Catarina e os Outros",
      "genre_ids": [
        18,
        9648
      ],
      "backdrop_path": "/9nDiMhvL3FtaWMsvvvzQIuq276X.jpg",
      "adult": false,
      "overview": "Outside, the first sun rays break the dawn.  Sixteen years old Catarina can't fall asleep.  Inconsequently, in the big city adults are moved by desire...  Catarina found she is HIV positive. She wants to drag everyone else along.",
      "release_date": "2011-03-01"
    },
    {
      "vote_count": 9,
      "id": 531309,
      "video": false,
      "vote_average": 4.6,
      "title": "Brightburn",
      "popularity": 127.582,
      "poster_path": "/roslEbKdY0WSgYaB5KXvPKY0bXS.jpg",
      "original_language": "en",
      "original_title": "Brightburn",
      "genre_ids": [
        27,
        878,
        18,
        53
      ],

I need a Python loop that appends the paginated results to a single list until I have captured the 500 most popular films.


Desired Output:

Movie_ID  Movie_Title
280960    Catarina and the others
531309    Brightburn
438650    Cold Pursuit
537915    After
50465     Glass
457799    Extremely Wicked, Shockingly Evil and Vile


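Doesn't the API response include a next URL field? - AdamGold
That depends on the API. Since the response includes some pagination fields ({"page": 1, "total_pages": 5085, ...}), I would expect it to accept a page=n parameter. - Serge Ballesta
As far as I can tell, @AdamGold, there is no next URL field, but there is a "page" parameter that takes an integer value and defaults to "1" if no value is given. That said, I think your first solution might work, but it doesn't seem to have any logic to limit the loop to the top n (in my case 500) results. - izzy84
@SergeBallesta, the API does accept a "page=n" parameter, but I also need to limit the results across those pages to the top n (in this case 500). I don't necessarily want to loop through all the results, only until I reach the top 500. I also need to make sure I can handle the API rate limit of 40 requests every 10 seconds. - izzy84
1 Answer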


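Most APIs include a next_url field to help you loop through all the results. Let's look at a few cases.

1. No next_url field

You can loop through the pages until the results field comes back empty: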

import requests #to make TMDB API calls

#Discover API url filtered to movies >= 2004 and containing Drama genre_ID: 18
discover_api_url = 'https://api.themoviedb.org/3/discover/movie?api_key=['my api key']&language=en-US&sort_by=popularity.desc&include_adult=false&include_video=false&primary_release_year=>%3D2004&with_genres=18'

most_popular_films = []
new_results = True
page = 1
while new_results:
    discover_api = requests.get(discover_api_url + f"&page={page}").json()
    new_results = discover_api.get("results", [])
    most_popular_films.extend(new_results)
    page += 1

#printing movie_id and movie_title by popularity desc
for i, film in enumerate(most_popular_films):
    print(i, film['id'], film['title'])
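The loop above keeps requesting pages until one comes back empty, which for this query could mean thousands of requests. Below is a minimal sketch of the same idea that stops as soon as n results have been collected. The names collect_top_n and fetch_page are hypothetical, and fetch_page is assumed to be any callable that takes a page number and returns that page's list of results.

```python
def collect_top_n(fetch_page, n):
    """Collect up to n results by requesting pages 1, 2, 3, ... in order.

    fetch_page is an assumed callable: page number -> list of results.
    """
    films = []
    page = 1
    while len(films) < n:
        results = fetch_page(page)
        if not results:
            # An empty page means there is nothing left to fetch.
            break
        films.extend(results)
        page += 1
    # The last page may push the list past n, so trim it.
    return films[:n]


# Hypothetical usage with the URL defined earlier (needs the requests package):
# import requests
# fetch = lambda page: requests.get(
#     discover_api_url + f"&page={page}").json().get("results", [])
# most_popular_films = collect_top_n(fetch, 500)
```

With 20 results per page, collecting 500 films this way takes 25 requests, which stays under the limit of 40 calls per 10 seconds mentioned in the question.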

2. Rely on the total_pages field

import requests #to make TMDB API calls

#Discover API url filtered to movies >= 2004 and containing Drama genre_ID: 18
discover_api_url = 'https://api.themoviedb.org/3/discover/movie?api_key=['my api key']&language=en-US&sort_by=popularity.desc&include_adult=false&include_video=false&primary_release_year=>%3D2004&with_genres=18'

discover_api = requests.get(discover_api_url).json()
most_popular_films = discover_api["results"]
for page in range(2, discover_api["total_pages"]+1):
    discover_api = requests.get(discover_api_url + f"&page={page}").json()
    most_popular_films.extend(discover_api["results"])

#printing movie_id and movie_title by popularity desc
for i, film in enumerate(most_popular_films):
    print(i, film['id'], film['title'])
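As written, the loop above requests every page up to total_pages (5085 in the sample response). A rough sketch of how the page range could be capped for the top n results follows. The helper names pages_needed and collect_with_total_pages are hypothetical, fetch_json is assumed to be a callable that returns the decoded JSON for a page number, and the page size of 20 is taken from the question.

```python
import math


def pages_needed(n, page_size, total_pages):
    """Pages required to collect n results, never more than total_pages."""
    return min(math.ceil(n / page_size), total_pages)


def collect_with_total_pages(fetch_json, n, page_size=20):
    """Fetch page 1, read total_pages from it, then fetch only what is needed.

    fetch_json is an assumed callable: page number -> decoded JSON dict
    containing "results" and "total_pages".
    """
    first = fetch_json(1)
    films = list(first["results"])
    last_page = pages_needed(n, page_size, first["total_pages"])
    for page in range(2, last_page + 1):
        films.extend(fetch_json(page)["results"])
    return films[:n]


# Hypothetical usage with the URL defined earlier (needs the requests package):
# import requests
# fetch = lambda page: requests.get(discover_api_url + f"&page={page}").json()
# most_popular_films = collect_with_total_pages(fetch, 500)
```

For n = 500 and 20 results per page, pages_needed gives 25, so the loop covers range(2, 26).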

3. The next_url field exists! Great!

Same idea, except that now we check whether the next_url field is empty. If it is empty, we are on the last page.

import requests #to make TMDB API calls

#Discover API url filtered to movies >= 2004 and containing Drama genre_ID: 18
discover_api = 'https://api.themoviedb.org/3/discover/movie?api_key=['my api key']&language=en-US&sort_by=popularity.desc&include_adult=false&include_video=false&primary_release_year=>%3D2004&with_genres=18'

discover_api = requests.get(discover_api).json()
most_popular_films = discover_api["results"]
while discover_api.get("next_url"):
    discover_api = requests.get(discover_api["next_url"]).json()
    most_popular_films.extend(discover_api["results"])

#printing movie_id and movie_title by popularity desc
for i, film in enumerate(most_popular_films):
    print(i, film['id'], film['title'])
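The same next_url loop can also be written as a generator that stops early, which may be convenient when only the top n results are wanted. This is a sketch under assumptions: iter_results is a hypothetical name, fetch_json is assumed to be a callable that returns the decoded JSON for a URL, and the response is assumed to carry "results" and "next_url" keys as in the example above.

```python
def iter_results(fetch_json, first_url, limit=None):
    """Yield results page by page, following next_url until it is empty.

    fetch_json is an assumed callable: URL -> decoded JSON dict.
    limit, if given, is the maximum number of results to yield.
    """
    if limit is not None and limit <= 0:
        return
    url = first_url
    count = 0
    while url:
        payload = fetch_json(url)
        for item in payload.get("results", []):
            yield item
            count += 1
            if limit is not None and count >= limit:
                # Stop before requesting another page.
                return
        # A missing or empty next_url marks the last page.
        url = payload.get("next_url")


# Hypothetical usage (needs the requests package):
# import requests
# fetch = lambda url: requests.get(url).json()
# most_popular_films = list(iter_results(fetch, first_page_url, limit=500))
```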

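I tried your first solution but received an error message (see below). Also, if possible, I would like to limit the loop to return only the top n results rather than going through everything and potentially returning more. - izzy84
What error did you get? By the way, I would go with the second solution. It is more elegant. If you want to limit the number of results, you can fetch the first N pages (use range(2, 26) to get the first 25 pages), or check whether the length of most_popular_films exceeds N. - AdamGold
I used your updated solution #1 but received another error message: KeyError Traceback (most recent call last) in 4 while new_results: 5 discover_api = requests.get(discover_api_url + f"&page={page}").json() ----> 6 new_results = discover_api["results"] 7 most_popular_films.extend(new_results) 8 page += 1 KeyError: 'results' - izzy84
Fixed and updated. Again, I recommend using the second solution. - AdamGold
The API response shows 5087 pages and 101729 results in total. Would it make sense to add logic to the while loop, such as while pages >= 50:, so that we don't have to go through every page? Either way, I still need logic to handle the API rate limit of 40 requests every 10 seconds, right? - izzy84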
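On the rate-limit question raised in the comments, one possible approach is a small sliding-window limiter that is called before every request. This is only a sketch: the RateLimiter class is hypothetical and not part of requests or the TMDB API, and the 40 calls per 10 seconds figure is the one quoted in the question.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most max_calls within any window of period seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock  # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.calls = deque()  # timestamps of recent calls, oldest first

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        # Forget calls that have already left the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
            self.calls.popleft()
            now = self.clock()
        self.calls.append(now)


# Hypothetical usage (needs the requests package):
# import requests
# limiter = RateLimiter(max_calls=40, period=10)
# for page in range(1, 26):
#     limiter.wait()
#     data = requests.get(discover_api_url + f"&page={page}").json()
```

For the top 500 films at 20 per page only 25 requests are needed, so the limiter would not actually pause in that case. It starts to matter once more than 40 pages are fetched in a row.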