Python BeautifulSoup not scraping a table correctly

Question

3

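I have the following code, which tries to scrape the main table on this page. I need the NORAD ID and launch date from the second and fourth columns, but I can't find the table by its ID.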

import requests
from bs4 import BeautifulSoup

data = []

URL = 'https://www.n2yo.com/satellites/?c=52&srt=2&dir=1'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find("table", id="categoriestab")
rows = table.find_all('tr')

for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values

print(data)

- Luke Prior

3 Answers

0

Change

soup = BeautifulSoup(page.content, 'html.parser')

to

soup = BeautifulSoup(page.content, 'lxml')
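
Whether switching parsers helps depends on how each one handles the page's markup. A minimal sketch for comparing them, assuming the optional lxml and html5lib packages may or may not be installed; the helper name find_table_with_parsers and the sample HTML are hypothetical and stand in for page.content:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def find_table_with_parsers(html, table_id, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to True/False (table found) or None (parser not installed)."""
    results = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # The optional parser package is missing; record that and move on.
            results[parser] = None
            continue
        results[parser] = soup.find("table", id=table_id) is not None
    return results


# Hypothetical stand-in for the downloaded page.
sample_html = '<table id="categoriestab"><tr><td>25544</td></tr></table>'
print(find_table_with_parsers(sample_html, "categoriestab"))
```

If every installed parser reports False for the real page, the table is probably not in the downloaded HTML at all, and changing the parser will not help.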

- dimay

0

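If you print the soup and search it, you won't find the id you're looking for in the output. That most likely means the page is rendered by JavaScript. You can try PhantomJS or Selenium. I ran into a similar problem and solved it with Selenium. You'll need to download ChromeDriver: https://chromedriver.chromium.org/downloads. Here is the code I used.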

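from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
# executable_path is the Selenium 3 style; newer Selenium releases take a Service object instead
driver = webdriver.Chrome(executable_path='YOUR PATH', options=options)
driver.get('YOUR URL')
driver.implicitly_wait(1)
soup_file = BeautifulSoup(driver.page_source, 'html.parser')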

This code sets up the driver, connects to the URL, waits for the page to load, grabs the full page source, and puts it into a BeautifulSoup object.
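
The check described at the start of this answer (searching the parsed soup for the id) can be sketched as follows. The two HTML strings are hypothetical stand-ins for what requests returns and what a browser produces after JavaScript has run, and has_element_with_id is an illustrative helper name:

```python
from bs4 import BeautifulSoup


def has_element_with_id(html, element_id):
    """Return True if the parsed HTML contains an element with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find(id=element_id) is not None


# Hypothetical raw response: the table is built later by a script.
static_html = '<html><body><div id="content"></div><script>/* builds the table */</script></body></html>'
# Hypothetical page source after rendering in a browser.
rendered_html = '<html><body><table id="categoriestab"><tr><td>25544</td></tr></table></body></html>'

print(has_element_with_id(static_html, "categoriestab"))
print(has_element_with_id(rendered_html, "categoriestab"))
```

Running the same check on page.content from requests and on driver.page_source from Selenium shows whether a browser is actually needed for this page.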

Hope this helps!

- Shrey

- Humayun Ahmad Rajib · Accepted Answer

To get the satellites' NORAD ID and launch date, you can try the following:

import pandas as pd

url = "https://www.n2yo.com/satellites/?c=52&srt=2&dir=0"
df = pd.read_html(url)

data = df[2].drop(["Name", "Int'l Code", "Period[minutes]", "Action"], axis=1)
print(data)

The output will be: