Efficient way to extract data from NetCDF files

Question

7

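I have about 20,000 coordinates for which I need to extract data from several NetCDF files, each with about 30,000 timesteps (future climate scenarios). Using the solution here is inefficient because of the time spent converting "dsloc" to a "dataframe" at each i,j (see the code below). ** A sample NetCDF file can be downloaded from here **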

import pandas as pd
import xarray as xr
import time

#Generate some coordinates
coords_data = [{'lat': 68.04, 'lon': 15.20, 'stid':1},
    {'lat':67.96, 'lon': 14.95, 'stid': 2}]
crd= pd.DataFrame(coords_data)
lat = crd["lat"]
lon = crd["lon"]
stid=crd["stid"]

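# nc_file is the path to the NetCDF file (not defined in the original post)
NC = xr.open_dataset(nc_file)
point_list = zip(lat, lon, stid)
temp = pd.DataFrame()  # accumulator; it was never initialized in the original post
start_time = time.time()
for i, j, id in point_list:
    print(i, j)
    dsloc = NC.sel(lat=i, lon=j, method='nearest')
    print("--- %s seconds ---" % (time.time() - start_time))
    DT = dsloc.to_dataframe()
    DT.insert(loc=0, column="station", value=id)
    DT.reset_index(inplace=True)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    temp = pd.concat([temp, DT], sort=True)
    print("--- %s seconds ---" % (time.time() - start_time))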

The output is:

68.04 15.2
--- 0.005853414535522461 seconds ---
--- 9.02660846710205 seconds ---
67.96 14.95
--- 9.028568267822266 seconds ---
--- 16.429600715637207 seconds ---

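This means each i,j takes about 9 seconds to process. Given the large number of coordinates and the many timesteps in the netcdf files, I wonder whether there is a Pythonic way to optimize the code. I could also use the CDO and NCO operators, but I ran into a similar problem with them.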

- Seji

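Great question. I can't solve it, but maybe you could try this: https://examples.dask.org/xarray.html. The idea came to me because I know that for dataframes "dask" is faster than "pandas", so I googled "xarray dask" and it turns out it exists. - Jeremy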

Or you could first split the coordinates into groups and then run parallel jobs with "map" in Python. - Jeremy

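Note that Dask is not necessarily faster than pandas or xarray; in fact, for problems that fit in memory it is always slower (in terms of CPU time). Dask is very useful for making use of multiple cores or machines, but it is not a panacea for speeding things up. It looks like the OP wants to apply this code to many datasets, so it is worth tuning this code without Dask first, so that the full Dask workflow (across many files) will be faster. - Michael Delgado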

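That said, I do strongly recommend using dask to handle your large number of files, e.g. by wrapping this code in a function and then applying it to the data with client.map. Alternatively, you could read all the files with xr.open_mfdataset and then use the indexing method I propose below. - Michael Delgado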
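The "wrap it in a function, then map it" idea from the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions: the datasets are small synthetic stand-ins for real NetCDF files, the variable name "tas", the helper names make_dataset and extract_points, and the "scenario" labels are all hypothetical, and the built-in map is used in place of dask's Client.map so that the sketch runs without a cluster.

```python
from functools import partial

import numpy as np
import pandas as pd
import xarray as xr


def make_dataset(seed):
    """Build a small synthetic dataset standing in for one NetCDF file."""
    times = pd.date_range("2050-01-01", periods=4, freq="D")
    lat = np.arange(67.0, 69.01, 0.25)
    lon = np.arange(14.0, 16.01, 0.25)
    values = np.random.default_rng(seed).normal(
        size=(times.size, lat.size, lon.size)
    )
    return xr.Dataset(
        {"tas": (("time", "lat", "lon"), values)},  # "tas" is a made-up name
        coords={"time": times, "lat": lat, "lon": lon},
    )


def extract_points(ds, crd_ix):
    """Select the nearest grid cell for every station in one vectorized call."""
    return ds.sel(lat=crd_ix.lat, lon=crd_ix.lon, method="nearest")


crd = pd.DataFrame(
    [{"lat": 68.04, "lon": 15.20, "stid": 1},
     {"lat": 67.96, "lon": 14.95, "stid": 2}]
)
crd_ix = crd.set_index("stid").to_xarray()

# Three in-memory datasets play the role of three scenario files.
datasets = [make_dataset(seed) for seed in range(3)]

# Built-in map keeps the sketch free of extra dependencies. With
# dask.distributed, the same function could be submitted per file instead.
results = list(map(partial(extract_points, crd_ix=crd_ix), datasets))

# Stack the per-file results along a new, hypothetical "scenario" dimension.
scenario = pd.Index(["run0", "run1", "run2"], name="scenario")
combined = xr.concat(results, dim=scenario)
print(combined)
```

Because each call touches only one file, the function is easy to hand to a parallel scheduler later; whether that actually pays off depends on file sizes and chunking.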

1

If this is something you need to do repeatedly, it may be worth using nccopy to rechunk the files for time-series access. See "Chunking Data: Why it Matters". - Robert Davy

2 Answers

0

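I have a potential solution. The idea is to convert the xarray data array to pandas first, and then take a subset of the pandas dataframe based on the lat/lon conditions.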

# convert xarray data to a pandas dataframe
def xr_to_df(data):
    data = data.to_dataframe()
    data.reset_index(inplace=True)
    return data

# convert your xarray data to a pandas dataframe
full_df = xr_to_df(full_xarray)

# create a 2 columns pandas dataframe containing your target coordinates
points = pd.DataFrame({'lat':target_lat, 'lon':target_lon})

# get the values at your target points only via merging on the left
subset = pd.merge(points,full_df)

I'm not sure how fast this will be given the size of your data. But at least it avoids the loop. I'd guess it should be faster?

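I noticed that your points are scattered (not on grid-cell centers). To handle that, you could first write your own code to snap them to the netcdf resolution, using something like np.argmin(abs(lat - lat_netcdf)) to find the nearest latitude and longitude.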

- Jeremy
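The nearest-grid-point lookup mentioned in this answer could look something like the sketch below. It assumes a rectilinear grid with 1-D latitude and longitude axes that use the same longitude convention as the target points; the grid values and the helper name nearest_index are made up for illustration.

```python
import numpy as np


def nearest_index(grid, targets):
    """Return, for each target value, the index of the closest grid value."""
    grid = np.asarray(grid, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Broadcast to shape (n_targets, n_grid), then take argmin along the grid axis.
    return np.abs(targets[:, None] - grid[None, :]).argmin(axis=1)


# Hypothetical grid axes; a real file would supply these as its lat/lon coordinates.
lat_netcdf = np.arange(67.0, 69.01, 0.25)
lon_netcdf = np.arange(14.0, 16.01, 0.25)

target_lat = np.array([68.04, 67.96])
target_lon = np.array([15.20, 14.95])

lat_idx = nearest_index(lat_netcdf, target_lat)
lon_idx = nearest_index(lon_netcdf, target_lon)

# Target points snapped onto the grid, ready for an exact-match merge.
snapped_lat = lat_netcdf[lat_idx]
snapped_lon = lon_netcdf[lon_idx]
print(snapped_lat, snapped_lon)
```

The intermediate array has one row per target and one column per grid value, so for very large grids it may be worth processing the targets in batches.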

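The original question explicitly states that converting to pandas is not an acceptable answer. This approach will be very slow and memory-inefficient. - Michael Delgado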

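I also tested @Jeremy's answer, and it does improve the run time, but converting the nc file to a dataframe has two problems: one is time, the other is memory. If I split the nc file by year the memory problem goes away, but each file still takes about 8 minutes to convert, so it's probably not the best solution. - Seji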

- Michael Delgado · Accepted Answer

This is a perfect use case for xarray's advanced indexing using DataArray indexers.

# Make the index on your coordinates DataFrame the station ID,
# then convert to a dataset.
# This results in a Dataset with two DataArrays, lat and lon, each
# of which are indexed by a single dimension, stid
crd_ix = crd.set_index('stid').to_xarray()

# now, select using the arrays, and the data will be re-oriented to have
# the data only for the desired pixels, indexed by 'stid'. The
# non-indexing coordinates lat and lon will be indexed by (stid) as well.
NC.sel(lon=crd_ix.lon, lat=crd_ix.lat, method='nearest')

Other dimensions in the data are left untouched, so if your original data has dimensions (lat, lon, z, time), your new data will have dimensions (stid, z, time).
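To make the dimension change concrete, here is a small self-contained sketch of the same selection on synthetic data. The dataset, the variable name "tas" and the grid are invented stand-ins for a real file, so treat this as an illustration rather than a benchmark.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for one NetCDF file; "tas" and the grid are hypothetical.
times = pd.date_range("2050-01-01", periods=5, freq="D")
lat = np.arange(67.0, 69.01, 0.25)
lon = np.arange(14.0, 16.01, 0.25)
rng = np.random.default_rng(0)
NC = xr.Dataset(
    {"tas": (("time", "lat", "lon"),
             rng.normal(size=(times.size, lat.size, lon.size)))},
    coords={"time": times, "lat": lat, "lon": lon},
)

crd = pd.DataFrame(
    [{"lat": 68.04, "lon": 15.20, "stid": 1},
     {"lat": 67.96, "lon": 14.95, "stid": 2}]
)
crd_ix = crd.set_index("stid").to_xarray()

# One vectorized selection for all stations instead of a Python loop.
points = NC.sel(lat=crd_ix.lat, lon=crd_ix.lon, method="nearest")

# lat and lon are gone as dimensions; stid has taken their place.
print(points["tas"].dims)

# If a table is still needed, convert once for every station together.
df = points.to_dataframe().reset_index()
print(df.head())
```

The conversion to a DataFrame now happens a single time on the already reduced data, rather than once per coordinate.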