有没有一种方法可以释放 xarray.Dataset 的文件锁定？

Question

有没有一种方法可以释放 xarray.Dataset 的文件锁定？

8

我有一个进程，每隔5分钟使用netcdf4.Dataset（fn，mode = a）来增加NetCDF文件fn的大小。同时，我使用xarray.Dataset创建了一个bokeh服务器可视化该NetCDF文件的对象（因为这样非常方便）。

问题在于，如果我的bokeh服务器正在使用fn文件，则尝试向其添加新数据时NetCDF更新进程将失败。

ds = xarray.open_dataset(fn)

如果我使用选项autoclose

ds = xarray.open_dataset(fn, autoclose=True)

当bokeh服务器应用程序中的ds处于“打开”状态时，使用其他进程更新fn是有效的，但从fn获取时间片的bokeh图的更新速度非常缓慢。

我的问题是：在使用xarray.Dataset时，是否有另一种释放NetCDF文件锁定的方法？

如果仅在重新加载整个bokeh服务器应用程序后一致地更新xarray.Dataset的形状，则我不会介意。

谢谢！

这是一个最小工作示例：

将以下内容放入文件中并运行：

import time
from datetime import datetime

import numpy as np
import netCDF4

fn = 'my_growing_file.nc'

with netCDF4.Dataset(fn, 'w') as nc_fh:
    # create dimensions
    nc_fh.createDimension('x', 90)
    nc_fh.createDimension('y', 90)
    nc_fh.createDimension('time', None)

    # create variables
    nc_fh.createVariable('x', 'f8', ('x'))
    nc_fh.createVariable('y', 'f8', ('y'))
    nc_fh.createVariable('time', 'f8', ('time'))
    nc_fh.createVariable('rainfall_amount',
                         'i2',
                         ('time', 'y', 'x'),
                         zlib=False,
                         complevel=0,
                         fill_value=-9999,
                         chunksizes=(1, 90, 90))
    nc_fh['rainfall_amount'].scale_factor = 0.1
    nc_fh['rainfall_amount'].add_offset = 0

    nc_fh.set_auto_maskandscale(True)

    # variable attributes
    nc_fh['time'].long_name = 'Time'
    nc_fh['time'].standard_name = 'time'
    nc_fh['time'].units = 'hours since 2000-01-01 00:50:00.0'
    nc_fh['time'].calendar = 'standard'

for i in range(1000):
    with netCDF4.Dataset(fn, 'a') as nc_fh:
        current_length = len(nc_fh['time'])

        print('Appending to NetCDF file {}'.format(fn))
        print(' length of time vector: {}'.format(current_length))

        if current_length > 0:
            last_time_stamp = netCDF4.num2date(
                nc_fh['time'][-1],
                units=nc_fh['time'].units,
                calendar=nc_fh['time'].calendar)
            print(' last time stamp in NetCDF: {}'.format(str(last_time_stamp)))
        else:
            last_time_stamp = '1900-01-01'
            print(' empty file, starting from scratch')

        nc_fh['time'][i] = netCDF4.date2num(
            datetime.utcnow(),
            units=nc_fh['time'].units,
            calendar=nc_fh['time'].calendar)
        nc_fh['rainfall_amount'][i, :, :] = np.random.rand(90, 90)

    print('Sleeping...\n')
    time.sleep(3)

然后，前往IPython并通过以下方式打开增长的文件：

ds = xr.open_dataset('my_growing_file.nc')

这将导致向NetCDF附加的进程失败，并输出以下内容：

Appending to NetCDF file my_growing_file.nc
 length of time vector: 0
 empty file, starting from scratch
Sleeping...

Appending to NetCDF file my_growing_file.nc
 length of time vector: 1
 last time stamp in NetCDF: 2018-04-12 08:52:39.145999
Sleeping...

Appending to NetCDF file my_growing_file.nc
 length of time vector: 2
 last time stamp in NetCDF: 2018-04-12 08:52:42.159254
Sleeping...

Appending to NetCDF file my_growing_file.nc
 length of time vector: 3
 last time stamp in NetCDF: 2018-04-12 08:52:45.169516
Sleeping...

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-17-9950ca2e53a6> in <module>()
     37 
     38 for i in range(1000):
---> 39     with netCDF4.Dataset(fn, 'a') as nc_fh:
     40         current_length = len(nc_fh['time'])
     41 

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.__init__()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()

IOError: [Errno -101] NetCDF: HDF error: 'my_growing_file.nc'

If using

ds = xr.open_dataset('my_growing_file.nc', autoclose=True)

没有错误，但是通过xarray访问时间当然会变慢，这正是我的问题所在，因为我的仪表盘可视化非常卡顿。我可以理解这可能不是xarray的预期用途，如果需要，我将退回到netCDF4提供的更低级别接口（希望它支持并发文件访问，至少支持读取），但我想保留xarray的方便性。

- cchwala

你能否添加一个最小化、完整化和可验证的示例（https://stackoverflow.com/help/mcve）？我怀疑你的问题的答案将取决于你具体的实现。例如，不清楚你是否在任何时候调用了 ds.close()。 - jhamman

@jhamman 感谢您的快速回复。我会更新我的帖子并提供一个 MCVE，但由于其他工作的原因，这可能需要几天时间。我认为可能有一个简单明确的答案，比如“永远不要写入已打开为 xarray.Dataset 的文件”。 - cchwala

我尝试使用netcdf4和h5py进行了一些实验。使用netcdf4时，我没有成功地让一个文件在两个进程中同时打开。对于h5py.File，有一个标志swmr（单写多读），它特别指示我可以在另一个进程保持文件打开以进行写入时进行读取。这样做是可行的，尽管与我的示例略有不同，因为增长的文件必须始终保持打开状态，不能在for循环中关闭并重新打开。我是否遗漏了什么，或者使用h5py是唯一的方法？ - cchwala

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- cchwala · Accepted Answer

我在回答自己的问题，因为我找到了一个解决方案，或者更好地说，是绕过了Python中NetCDF文件锁定问题的方法。

如果您想要在保持数据集对于实时可视化开放的同时不断增加文件大小，那么使用zarr而不是NetCDF文件是一个很好的解决方案。

幸运的是，现在xarray也可以轻松地使用append_dim关键字参数沿着选定的维度将数据添加到现有的zarr文件中，这得益于最近合并的PR。

使用zarr的代码，与我在问题中使用NetCDF的代码相比，如下所示：


import dask.array as da
import xarray as xr
import pandas as pd
import datetime
import time

fn = '/tmp/my_growing_file.zarr'

# Creat a dummy dataset and write it to zarr
data = da.random.random(size=(30, 900, 1200), chunks=(10, 900, 1200))
t = pd.date_range(end=datetime.datetime.utcnow(), periods=30, freq='1s')
ds = xr.Dataset(
    data_vars={'foo': (('time', 'y', 'x'), data)},
    coords={'time': t},
)
#ds.to_zarr(fn, mode='w', encoding={'foo': {'dtype': 'int16', 'scale_factor': 0.1, '_FillValue':-9999}})
#ds.to_zarr(fn, mode='w', encoding={'time': {'_FillValue': -9999}})
ds.to_zarr(fn, mode='w')

# Append new data in smaller chunks
for i in range(100):
    print('Sleeping for 10 seconds...')
    time.sleep(10)

    data = 0.01 * i + da.random.random(size=(7, 900, 1200), chunks=(7, 900, 1200))
    t = pd.date_range(end=datetime.datetime.utcnow(), periods=7, freq='1s')
    ds = xr.Dataset(
        data_vars={'foo': (('time', 'y', 'x'), data)},
        coords={'time': t},
    )
    print(f'Appending 7 new time slices with latest time stamp {t[-1]}')
    ds.to_zarr(fn, append_dim='time')

您可以打开另一个Python进程，例如IPython，并执行以下操作：

 ds = xr.open_zarr('/tmp/my_growing_file.zarr/')

一遍又一遍地进行操作而不会导致写入进程崩溃。

我在这个示例中使用了xarray版本0.15.0和zarr版本2.4.0。

一些额外的说明：

请注意，此示例中的代码故意以小块追加，这些小块不均匀地分割了zarr文件中的块大小，以查看它对块的影响。从我的测试中，我可以说最初选择的zarr文件的块大小得到了保留，这非常好！

还要注意，由于xarray将datetime64数据编码并存储为整数以符合NetCDF的CF约定，因此代码在追加时生成警告。这对于zarr文件也有效，但目前似乎_FillValue没有自动设置。只要您的时间数据中没有NaT，这就无关紧要。

免责声明：我尚未尝试使用较大的数据集和长时间运行的过程来增加文件，因此我无法评论如果zarr文件或其元数据从此过程中被分段可能会出现的性能降低或其他问题。