How to loop through a paginated API using Python

Question

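I need to retrieve the 500 most popular films from a REST API, but results are limited to 20 per page and I can only make 40 calls every 10 seconds (https://developers.themoviedb.org/3/getting-started/request-rate-limiting). I can't work out how to loop through the paginated results dynamically so that the 500 most popular results end up in a single list.

I can successfully return the 20 most popular films (see below) and enumerate them, but I am stuck on writing a loop that pages through the top 500 without timing out because of the API rate limit.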

import requests #to make TMDB API calls

#Discover API url filtered to movies >= 2004 and containing Drama genre_ID: 18
discover_api = ("https://api.themoviedb.org/3/discover/movie?"
                "api_key=['my api key']&language=en-US&sort_by=popularity.desc"
                "&include_adult=false&include_video=false"
                "&primary_release_year=>%3D2004&with_genres=18")

#Returning all drama films >= 2004 in popularity desc
discover_api = requests.get(discover_api).json()

most_popular_films = discover_api['results']

#printing movie_id and movie_title by popularity desc
for i, film in enumerate(most_popular_films):
    print(i, film['id'], film['title'])


Sample response:

{
  "page": 1,
  "total_results": 101685,
  "total_pages": 5085,
  "results": [
    {
      "vote_count": 13,
      "id": 280960,
      "video": false,
      "vote_average": 5.2,
      "title": "Catarina and the others",
      "popularity": 130.491,
      "poster_path": "/kZMCbp0o46Tsg43omSHNHJKNTx9.jpg",
      "original_language": "pt",
      "original_title": "Catarina e os Outros",
      "genre_ids": [
        18,
        9648
      ],
      "backdrop_path": "/9nDiMhvL3FtaWMsvvvzQIuq276X.jpg",
      "adult": false,
      "overview": "Outside, the first sun rays break the dawn.  Sixteen years old Catarina can't fall asleep.  Inconsequently, in the big city adults are moved by desire...  Catarina found she is HIV positive. She wants to drag everyone else along.",
      "release_date": "2011-03-01"
    },
    {
      "vote_count": 9,
      "id": 531309,
      "video": false,
      "vote_average": 4.6,
      "title": "Brightburn",
      "popularity": 127.582,
      "poster_path": "/roslEbKdY0WSgYaB5KXvPKY0bXS.jpg",
      "original_language": "en",
      "original_title": "Brightburn",
      "genre_ids": [
        27,
        878,
        18,
        53
      ],

I need a Python loop that appends the paginated results to a single list until I have captured the 500 most popular films.


Desired Output:

Movie_ID  Movie_Title
280960    Catarina and the others
531309    Brightburn
438650    Cold Pursuit
537915    After
50465     Glass
457799    Extremely Wicked, Shockingly Evil and Vile

- izzy84

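Doesn't the API response include a next URL field? - AdamGold

It depends on the API. Since the response includes some pagination fields ({"page": 1, "total_pages": 5085, ...}), I would expect it to accept a page=n parameter. - Serge Ballesta

As far as I can tell, @AdamGold, there is no next URL field, but there is a "page" parameter that takes an integer value and defaults to 1 if no value is supplied. That said, I think your first solution might work, but it doesn't seem to have any logic to limit the loop to the top n (500 in my case) results. - izzy84

@SergeBallesta, the API does accept a "page=n" parameter, but I also need to limit the results across those pages to the top n (500 in this case). I don't necessarily want to loop through all of the results, only until I reach the top 500. I also need to make sure I can handle the API rate limit of 40 requests per 10 seconds. - izzy84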

1 Answer

- AdamGold · Accepted Answer

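Many APIs include a next_url field to help you loop through all the results. Let's look at a few cases.

1. No `next_url` field

You can loop through all the pages until the results field comes back empty: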

import requests #to make TMDB API calls

#Discover API url filtered to movies >= 2004 and containing Drama genre_ID: 18
discover_api_url = ("https://api.themoviedb.org/3/discover/movie?"
                    "api_key=['my api key']&language=en-US&sort_by=popularity.desc"
                    "&include_adult=false&include_video=false"
                    "&primary_release_year=>%3D2004&with_genres=18")

most_popular_films = []
new_results = True
page = 1
while new_results:
    discover_api = requests.get(discover_api_url + f"&page={page}").json()
    new_results = discover_api.get("results", [])
    most_popular_films.extend(new_results)
    page += 1

#printing movie_id and movie_title by popularity desc
for i, film in enumerate(most_popular_films):
    print(i, film['id'], film['title'])
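
The loop above keeps going until the API runs out of pages. If only the top n results are needed (500 in the question), a cap can be added. The following is a minimal sketch rather than part of the original answer's code: `collect_top_results`, `fetch_page`, and `fake_fetch_page` are hypothetical names, and the fake fetcher stands in for the real API so the example runs offline.

```python
from typing import Callable, Dict, List


def collect_top_results(fetch_page: Callable[[int], Dict], limit: int = 500) -> List[Dict]:
    """Gather up to `limit` items by requesting pages 1, 2, 3, ... in order.

    `fetch_page` is a hypothetical callable: given a page number, it returns
    the decoded JSON body for that page (a dict with a "results" list).
    """
    collected: List[Dict] = []
    page = 1
    while len(collected) < limit:
        new_results = fetch_page(page).get("results", [])
        if not new_results:  # ran out of pages before reaching the limit
            break
        collected.extend(new_results)
        page += 1
    return collected[:limit]  # trim any overshoot from the last page


# Stand-in for the real API so the sketch runs offline: 3 pages of 20 fake films.
def fake_fetch_page(page: int) -> Dict:
    if page > 3:
        return {"page": page, "results": []}
    first_id = (page - 1) * 20
    films = [{"id": i, "title": f"Film {i}"} for i in range(first_id, first_id + 20)]
    return {"page": page, "results": films}


top_films = collect_top_results(fake_fetch_page, limit=50)
print(len(top_films))  # 50
```

With the real API, `fetch_page` could be something like `lambda page: requests.get(discover_api_url + f"&page={page}").json()`.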

2. Relying on the `total_pages` field

import requests #to make TMDB API calls

#Discover API url filtered to movies >= 2004 and containing Drama genre_ID: 18
discover_api_url = ("https://api.themoviedb.org/3/discover/movie?"
                    "api_key=['my api key']&language=en-US&sort_by=popularity.desc"
                    "&include_adult=false&include_video=false"
                    "&primary_release_year=>%3D2004&with_genres=18")

discover_api = requests.get(discover_api_url).json()
most_popular_films = discover_api["results"]
for page in range(2, discover_api["total_pages"]+1):
    discover_api = requests.get(discover_api_url + f"&page={page}").json()
    most_popular_films.extend(discover_api["results"])

#printing movie_id and movie_title by popularity desc
for i, film in enumerate(most_popular_films):
    print(i, film['id'], film['title'])
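
When `total_pages` is available, the number of pages needed for a fixed number of results can be worked out up front. Here is a small sketch, assuming 20 results per page as stated in the question; `pages_needed` is a hypothetical helper name.

```python
import math


def pages_needed(limit: int, per_page: int, total_pages: int) -> int:
    """Number of pages to request to collect `limit` items, capped by what exists."""
    return min(total_pages, math.ceil(limit / per_page))


# Values from the question: 500 films wanted, 20 per page, 5085 pages available.
last_page = pages_needed(limit=500, per_page=20, total_pages=5085)
print(last_page)  # 25
```

The loop in case 2 would then run over `range(2, last_page + 1)` instead of every page, and the combined list can be trimmed with `most_popular_films[:500]`. At 25 requests, this particular job also stays under the limit of 40 calls per 10 seconds mentioned in the question.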

3. The `next_url` field exists! Great!

Same idea, except now we check whether the next_url field is empty. If it is, that was the last page.

import requests #to make TMDB API calls

#Discover API url filtered to movies >= 2004 and containing Drama genre_ID: 18
discover_api_url = ("https://api.themoviedb.org/3/discover/movie?"
                    "api_key=['my api key']&language=en-US&sort_by=popularity.desc"
                    "&include_adult=false&include_video=false"
                    "&primary_release_year=>%3D2004&with_genres=18")

discover_api = requests.get(discover_api_url).json()
most_popular_films = discover_api["results"]
# .get() avoids a KeyError if the last page omits next_url entirely
while discover_api.get("next_url"):
    discover_api = requests.get(discover_api["next_url"]).json()
    most_popular_films.extend(discover_api["results"])

#printing movie_id and movie_title by popularity desc
for i, film in enumerate(most_popular_films):
    print(i, film['id'], film['title'])
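
None of the loops above account for the rate limit mentioned in the question (40 calls every 10 seconds). One possible approach is a small sliding-window limiter that is called before each request. This is a sketch under assumptions: `SlidingWindowLimiter` is a hypothetical helper, not part of `requests` or the TMDB API, and the `clock` and `sleep` parameters exist only so that it can be exercised without real waiting.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls within any `period` seconds."""

    def __init__(self, max_calls: int = 40, period: float = 10.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls: Deque[float] = deque()  # timestamps of recent calls

    def wait(self) -> None:
        """Block until another call is allowed, then record it."""
        now = self.clock()
        # Forget calls that have already left the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest recorded call is `period` seconds old.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls.popleft()
        self.calls.append(now)


# Short demonstration with a tiny window so it finishes quickly.
limiter = SlidingWindowLimiter(max_calls=2, period=0.2)
for page in range(1, 4):
    limiter.wait()  # in real code, the requests.get(...) call would follow here
    print("request for page", page, "allowed")
```

For the question's limit, `SlidingWindowLimiter(max_calls=40, period=10.0)` would be created once and `limiter.wait()` called immediately before each `requests.get(...)` in whichever loop is used.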
