Historical weather data from NOAA

I am working on a data-mining project and would like to collect historical weather data. I can access historical data through the web interface they provide at http://www.ncdc.noaa.gov/cdo-web/search. However, I would like to access this data programmatically through an API. From what I have read on StackOverflow, this data should be in the public domain, but I have only been able to find it in non-free services like Wunderground. How can I access this data for free?

Good question. In the absence of an API, the best I could do was fall back on a (respectful) scraping strategy. The NOAA data is a great resource, but it needs some QA/QC. Check out this resource related to the article - metasequoia
Another option is the GHCN-D FTP page - metasequoia
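The GHCN-D archive mentioned in the comment above is distributed as plain fixed-width text files. As a sketch of how to work with it, assuming the column layout documented in NOAA's GHCN-Daily readme for `ghcnd-stations.txt` (the sample line below is illustrative, not a real station record):

```python
# Sketch: parsing one record of GHCN-D's fixed-width station list
# (ghcnd-stations.txt). Column offsets follow the layout documented
# in NOAA's GHCN-Daily readme; verify against the current readme
# before relying on them. The sample line is made up for illustration.

def parse_station(line):
    """Split a ghcnd-stations.txt record into its documented fields."""
    return {
        'id':        line[0:11].strip(),
        'latitude':  float(line[12:20]),
        'longitude': float(line[21:30]),
        'elevation': float(line[31:37]),
        'state':     line[38:40].strip(),
        'name':      line[41:71].strip(),
    }

# Illustrative record in the documented fixed-width layout
sample = "US000000001  40.7789  -73.9692   39.6 NY EXAMPLE STATION"
station = parse_station(sample)
```

The same offset-slicing approach extends to the per-station `.dly` data files, which pack one month of a single element (e.g. TMAX) per line.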
3 Answers

I am having trouble with the token. Here is my curl request: curl -H "Authorization: <token>" http://www.ncdc.noaa.gov/cdo-web/api/v2/datasets, where <token> is the token that was emailed to me, but it returns the error {"status":"400","message":"Token parameter is required."} - azrosen92
I only found a way to do it via curl(), like this -> curl_setopt($init, CURLOPT_URL, 'http://www.ncdc.noaa.gov/cdo-web/api/v2/data?datasetid=GHCND&startdate='.$startDate.'&enddate='.$endDate.'&datatypeid=TMAX&datatypeid=TMIN&stationid=GHCND:'.$city_id.'&limit='.$limit);//'http://www.ncdc.noaa.gov/cdo-web/api/v2/data?datasetid=GHCND&stationid=GHCND:ZI000067964&limit=31'); curl_setopt($init, CURLOPT_HEADER, false); curl_setopt($init, CURLOPT_HTTPHEADER, array('token:<token here>')); curl_setopt($init, CURLOPT_RETURNTRANSFER, 1); - Jurijs Nesterovs
azrosen92: curl -H "token:<token>" http://www.ncdc.noaa.gov/cdo-web/api/v2/datasets - Brian
The API has been updated; the documentation is available at the following link: https://www.ncei.noaa.gov/support/access-data-service-api-user-documentation (yes, it really is an update, despite the lower version number) - RobinReborn
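Putting the comments above together: per Brian's comment, the token goes in a "token" header rather than "Authorization". A minimal Python sketch of calling the `/cdo-web/api/v2/datasets` endpoint discussed in the comments (you need a free token from NOAA; `MY_TOKEN` below is a placeholder, and the network call is only made when you supply a real one):

```python
# Sketch of a CDO v2 API call using only the standard library.
# The endpoint URL and the "token" header come from the comments
# above; MY_TOKEN is a placeholder for the token NOAA emails you.
import json
import urllib.request

API_URL = 'https://www.ncdc.noaa.gov/cdo-web/api/v2/datasets'

def build_request(token, url=API_URL):
    """Build a request carrying the token in the header the API expects."""
    return urllib.request.Request(url, headers={'token': token})

def fetch_datasets(token):
    """Perform the call and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(build_request(token)) as resp:
        return json.loads(resp.read().decode())

req = build_request('MY_TOKEN')  # replace with your emailed token
```

To pull actual observations, the same pattern applies to the `/data` endpoint with the `datasetid`, `stationid`, `startdate`, and `enddate` query parameters shown in the curl/PHP comments above.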

As far as I know, all of NOAA's historical weather data is freely available through the upgini Python library: https://upgini.com
However, if you do not have a machine-learning training task, you will not be able to download the data. A peculiarity of upgini is that it enriches a dataframe with only the relevant data columns, where relevance means the importance of a data column (for example, temperature) for predicting some target event.
If you do have such a task, try enriching your data with upgini to get NOAA's historical weather data for free.
%pip install upgini

from upgini import FeaturesEnricher, SearchKey
enricher = FeaturesEnricher(search_keys={'rep_date': SearchKey.DATE,
                                         'country': SearchKey.COUNTRY,
                                         'postal_code': SearchKey.POSTAL_CODE})
enricher.fit(X_train, Y_train)


Dependencies

  1. pip install selenium
  2. Download the Chrome driver ('chromedriver.exe') # for Windows OS: https://chromedriver.storage.googleapis.com/114.0.5735.90/chromedriver_win32.zip

After downloading the driver and the library, we need to find the code for the desired location by clicking on the map. (Source site: https://www.weather.gov/wrh/climate)

#Keys for required states

# RECAP NAME                   CLICK ON MAP                SELECT UNDER 1. LOCATION
# Dallas                       Fort Worth (fwd)               Dallas Area
# Florida                      Miami  (mfl)                   Miami Area
# New York                     New York  (okx)                NY-Central Park Area
# Minneapolis                  Minneapolis (mpx)              Minneapolis Area
# California                   Los Angeles(lox)               LA Downtown Area

state_code_dict = {'Dallas':['fwd',3],'Florida':['mfl',1],
                   'New York':['okx',24],'Minneapolis':['mpx',1],
                   'California':['lox',2]}

The number in state_code_dict is the position of the desired area in the corresponding dropdown menu. For example: for Florida the code is 'mfl', and within Florida the Miami area is first in the dropdown list.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

options = Options()
options.add_argument("start-maximized")

webdriver_service = Service('chromedriver.exe')

df_ = pd.DataFrame() #(columns = ['Date','Average','Recap_name'])
for i in state_code_dict.keys():
    
    #Load the driver with webpage
    driver = webdriver.Chrome(options=options, service=webdriver_service)
    wait = WebDriverWait(driver, 30)
    print("Running for: ",i)
    ## Below url redirects to the data page
    ## source site is (https://www.weather.gov/wrh/climate)
    url = "https://nowdata.rcc-acis.org/" + state_code_dict[i][0] + "/"
    select_location = "/html/body/div[1]/div[3]/select/option[" + str(state_code_dict[i][1]) + "]"
    select_date = "tDatepicker"
    
    ## Give desired date/month in 'yyyy-mm' format, as it pulls the complete month data at once.
    set_date = "'2023-07'"
    date_freeze = "arguments[0].value = "+ set_date
    
    #X_PATH of go button to click for next window to open. X_PATH can be found from inspect element in chrome.
    click_go = "//*[@id='go']"
    wait_table_span = "//*[@id='results_area']/table[1]/caption/span"
    enlarge_click = "/html/body/div[5]/div[1]/button[1]"
    
    #Get the temperature table from the resulting html using the X_PATH below
    get_table = '//*[@id="results_area"]'
    try:
        driver.get(url)
        # wait up to 20 seconds before looking for the element
        element = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH,select_location)))
        element.click()
        element = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID,select_date)))
        driver.execute_script(date_freeze, element)
        element = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH,click_go)))
        element.click()
        element = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH,wait_table_span)))
        element = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH,enlarge_click)))
        element.click()
        data = driver.find_element(By.XPATH,get_table).get_attribute("innerHTML")
        df = pd.read_html(data)
        df[0].columns = df[0].columns.droplevel(0)
        df_all = df[0][['Date','Average']].copy()
        df_all['Recap_name'] = i
        # DataFrame.append was removed in pandas 2.0; use pd.concat instead.
        # Concatenating inside the try block also avoids a NameError when
        # the page fails to load and df_all was never assigned.
        df_ = pd.concat([df_, df_all], ignore_index=True)
    finally:
        driver.quit()
    
## Write different states data to different sheets in excel    
with pd.ExcelWriter("avg_temp.xlsx") as writer:
    for i in state_code_dict.keys():
        df_write = df_[df_.Recap_name == i]
        df_write.to_excel(writer, sheet_name=i, index=False)
    print("--------Finished----------")
