R箭头: 错误: 不支持编解码器 'snappy'

8

我一直在使用最新版本的Rarrow包(arrow_2.0.0.20201106),该包支持直接读写AWS S3(这太棒了)。

当我写入并读取自己的文件时似乎没有问题(请参见下文):

write_parquet(iris, "iris.parquet")
system("aws s3 mv iris.parquet s3://myawsbucket/iris.parquet")
df <- read_parquet("s3://myawsbucket/iris.parquet")

但是当我尝试读取其中一个示例的 R arrow 文件时,会得到以下错误:

df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
Error in parquet___arrow___FileReader__ReadTable1(self) : 
  IOError: NotImplemented: Support for codec 'snappy' not built

当我检查编解码器是否可用时,它似乎不可用:

codec_is_available(type="snappy")
[1] FALSE

有人知道如何使“snappy”编解码器可用吗?

谢谢, 迈克

###########

跟进

感谢@Neal的答案。这是为我安装了所有必需依赖项的代码。

Sys.setenv(ARROW_S3="ON")
Sys.setenv(NOT_CRAN="true")
install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
2个回答

6

我必须运行

Sys.setenv(ARROW_WITH_SNAPPY = "ON")

在运行install.packages之前.

使用Sys.setenv(ARROW_R_DEV = TRUE)获取详细的构建输出。
有关编译和链接选项的完整列表,请参见以下内容:

-- Compile and link options:
--
--   ARROW_CXXFLAGS="" [default=""]
--       Compiler flags to append when compiling Arrow
--   ARROW_BUILD_STATIC=ON [default=ON]
--       Build static libraries
--   ARROW_BUILD_SHARED=OFF [default=ON]
--       Build shared libraries
--   ARROW_PACKAGE_KIND="" [default=""]
--       Arbitrary string that identifies the kind of package
--       (for informational purposes)
--   ARROW_GIT_ID="" [default=""]
--       The Arrow git commit id (if any)
--   ARROW_GIT_DESCRIPTION="" [default=""]
--       The Arrow git commit description (if any)
--   ARROW_NO_DEPRECATED_API=OFF [default=OFF]
--       Exclude deprecated APIs from build
--   ARROW_USE_CCACHE=ON [default=ON]
--       Use ccache when compiling (if available)
--   ARROW_USE_LD_GOLD=OFF [default=OFF]
--       Use ld.gold for linking on Linux (if available)
--   ARROW_USE_PRECOMPILED_HEADERS=OFF [default=OFF]
--       Use precompiled headers when compiling
--   ARROW_SIMD_LEVEL=SSE4_2 [default=NONE|SSE4_2|AVX2|AVX512]
--       Compile-time SIMD optimization level
--   ARROW_RUNTIME_SIMD_LEVEL=MAX [default=NONE|SSE4_2|AVX2|AVX512|MAX]
--       Max runtime SIMD optimization level
--   ARROW_ARMV8_ARCH=armv8-a [default=armv8-a|armv8-a+crc+crypto]
--       Arm64 arch and extensions
--   ARROW_ALTIVEC=ON [default=ON]
--       Build with Altivec if compiler has support
--   ARROW_RPATH_ORIGIN=OFF [default=OFF]
--       Build Arrow libraries with RATH set to $ORIGIN
--   ARROW_INSTALL_NAME_RPATH=ON [default=ON]
--       Build Arrow libraries with install_name set to @rpath
--   ARROW_GGDB_DEBUG=ON [default=ON]
--       Pass -ggdb flag to debug builds
--
-- Test and benchmark options:
--
--   ARROW_BUILD_EXAMPLES=OFF [default=OFF]
--       Build the Arrow examples
--   ARROW_BUILD_TESTS=OFF [default=OFF]
--       Build the Arrow googletest unit tests
--   ARROW_ENABLE_TIMING_TESTS=ON [default=ON]
--       Enable timing-sensitive tests
--   ARROW_BUILD_INTEGRATION=OFF [default=OFF]
--       Build the Arrow integration test executables
--   ARROW_BUILD_BENCHMARKS=OFF [default=OFF]
--       Build the Arrow micro benchmarks
--   ARROW_BUILD_BENCHMARKS_REFERENCE=OFF [default=OFF]
--       Build the Arrow micro reference benchmarks
--   ARROW_TEST_LINKAGE=static [default=shared|static]
--       Linkage of Arrow libraries with unit tests executables.
--   ARROW_FUZZING=OFF [default=OFF]
--       Build Arrow Fuzzing executables
--   ARROW_LARGE_MEMORY_TESTS=OFF [default=OFF]
--       Enable unit tests which use large memory
--
-- Lint options:
--
--   ARROW_ONLY_LINT=OFF [default=OFF]
--       Only define the lint and check-format targets
--   ARROW_VERBOSE_LINT=OFF [default=OFF]
--       If off, 'quiet' flags will be passed to linting tools
--   ARROW_GENERATE_COVERAGE=OFF [default=OFF]
--       Build with C++ code coverage enabled
--
-- Checks options:
--
--   ARROW_TEST_MEMCHECK=OFF [default=OFF]
--       Run the test suite using valgrind --tool=memcheck
--   ARROW_USE_ASAN=OFF [default=OFF]
--       Enable Address Sanitizer checks
--   ARROW_USE_TSAN=OFF [default=OFF]
--       Enable Thread Sanitizer checks
--   ARROW_USE_UBSAN=OFF [default=OFF]
--       Enable Undefined Behavior sanitizer checks
--
-- Project component options:
--
--   ARROW_BUILD_UTILITIES=OFF [default=OFF]
--       Build Arrow commandline utilities
--   ARROW_COMPUTE=ON [default=OFF]
--       Build the Arrow Compute Modules
--   ARROW_CSV=ON [default=OFF]
--       Build the Arrow CSV Parser Module
--   ARROW_CUDA=OFF [default=OFF]
--       Build the Arrow CUDA extensions (requires CUDA toolkit)
--   ARROW_DATASET=ON [default=OFF]
--       Build the Arrow Dataset Modules
--   ARROW_FILESYSTEM=ON [default=OFF]
--       Build the Arrow Filesystem Layer
--   ARROW_FLIGHT=OFF [default=OFF]
--       Build the Arrow Flight RPC System (requires GRPC, Protocol Buffers)
--   ARROW_GANDIVA=OFF [default=OFF]
--       Build the Gandiva libraries
--   ARROW_HDFS=OFF [default=OFF]
--       Build the Arrow HDFS bridge
--   ARROW_HIVESERVER2=OFF [default=OFF]
--       Build the HiveServer2 client and Arrow adapter
--   ARROW_IPC=ON [default=ON]
--       Build the Arrow IPC extensions
--   ARROW_JEMALLOC=ON [default=ON]
--       Build the Arrow jemalloc-based allocator
--   ARROW_JNI=OFF [default=OFF]
--       Build the Arrow JNI lib
--   ARROW_JSON=ON [default=OFF]
--       Build Arrow with JSON support (requires RapidJSON)
--   ARROW_MIMALLOC=ON [default=OFF]
--       Build the Arrow mimalloc-based allocator
--   ARROW_PARQUET=ON [default=OFF]
--       Build the Parquet libraries
--   ARROW_ORC=OFF [default=OFF]
--       Build the Arrow ORC adapter
--   ARROW_PLASMA=OFF [default=OFF]
--       Build the plasma object store along with Arrow
--   ARROW_PLASMA_JAVA_CLIENT=OFF [default=OFF]
--       Build the plasma object store java client
--   ARROW_PYTHON=OFF [default=OFF]
--       Build the Arrow CPython extensions
--   ARROW_S3=ON [default=OFF]
--       Build Arrow with S3 support (requires the AWS SDK for C++)
--   ARROW_TENSORFLOW=OFF [default=OFF]
--       Build Arrow with TensorFlow support enabled
--   ARROW_TESTING=OFF [default=OFF]
--       Build the Arrow testing libraries
--
-- Thirdparty toolchain options:
--
--   ARROW_DEPENDENCY_SOURCE=BUNDLED [default=AUTO|BUNDLED|SYSTEM|CONDA|VCPKG|BREW]
--       Method to use for acquiring arrow's build dependencies
--   ARROW_VERBOSE_THIRDPARTY_BUILD=OFF [default=OFF]
--       Show output from ExternalProjects rather than just logging to files
--   ARROW_DEPENDENCY_USE_SHARED=ON [default=ON]
--       Link to shared libraries
--   ARROW_BOOST_USE_SHARED=OFF [default=ON]
--       Rely on boost shared libraries where relevant
--   ARROW_BROTLI_USE_SHARED=ON [default=ON]
--       Rely on Brotli shared libraries where relevant
--   ARROW_BZ2_USE_SHARED=ON [default=ON]
--       Rely on Bz2 shared libraries where relevant
--   ARROW_GFLAGS_USE_SHARED=ON [default=ON]
--       Rely on GFlags shared libraries where relevant
--   ARROW_GRPC_USE_SHARED=ON [default=ON]
--       Rely on gRPC shared libraries where relevant
--   ARROW_LZ4_USE_SHARED=ON [default=ON]
--       Rely on lz4 shared libraries where relevant
--   ARROW_OPENSSL_USE_SHARED=ON [default=ON]
--       Rely on OpenSSL shared libraries where relevant
--   ARROW_PROTOBUF_USE_SHARED=ON [default=ON]
--       Rely on Protocol Buffers shared libraries where relevant
--   ARROW_THRIFT_USE_SHARED=ON [default=ON]
--       Rely on thrift shared libraries where relevant
--   ARROW_UTF8PROC_USE_SHARED=ON [default=ON]
--       Rely on utf8proc shared libraries where relevant
--   ARROW_SNAPPY_USE_SHARED=ON [default=ON]
--       Rely on snappy shared libraries where relevant
--   ARROW_UTF8PROC_USE_SHARED=ON [default=ON]
--       Rely on utf8proc shared libraries where relevant
--   ARROW_ZSTD_USE_SHARED=ON [default=ON]
--       Rely on zstd shared libraries where relevant
--   ARROW_USE_GLOG=OFF [default=OFF]
--       Build libraries with glog support for pluggable logging
--   ARROW_WITH_BACKTRACE=ON [default=ON]
--       Build with backtrace support
--   ARROW_WITH_BROTLI=OFF [default=OFF]
--       Build with Brotli compression
--   ARROW_WITH_BZ2=OFF [default=OFF]
--       Build with BZ2 compression
--   ARROW_WITH_LZ4=OFF [default=OFF]
--       Build with lz4 compression
--   ARROW_WITH_SNAPPY=true [default=OFF]
--       Build with Snappy compression
--   ARROW_WITH_ZLIB=ON [default=OFF]
--       Build with zlib compression
--   ARROW_WITH_ZSTD=OFF [default=OFF]
--       Build with zstd compression
--   ARROW_WITH_UTF8PROC=ON [default=ON]
--       Build with support for Unicode properties using the utf8proc library
--       (only used if ARROW_COMPUTE is ON)
--   ARROW_WITH_RE2=ON [default=ON]
--       Build with support for regular expressions using the re2 library
--       (only used if ARROW_COMPUTE or ARROW_GANDIVA is ON)
--
-- Parquet options:
--
--   PARQUET_MINIMAL_DEPENDENCY=OFF [default=OFF]
--       Depend only on Thirdparty headers to build libparquet.
--       Always OFF if building binaries
--   PARQUET_BUILD_EXECUTABLES=OFF [default=OFF]
--       Build the Parquet executable CLI tools. Requires static libraries to be built.
--   PARQUET_BUILD_EXAMPLES=OFF [default=OFF]
--       Build the Parquet examples. Requires static libraries to be built.
--   PARQUET_REQUIRE_ENCRYPTION=OFF [default=OFF]
--       Build support for encryption. Fail if OpenSSL is not found
--
-- Gandiva options:
--
--   ARROW_GANDIVA_JAVA=OFF [default=OFF]
--       Build the Gandiva JNI wrappers
--   ARROW_GANDIVA_STATIC_LIBSTDCPP=OFF [default=OFF]
--       Include -static-libstdc++ -static-libgcc when linking with
--       Gandiva static libraries
--   ARROW_GANDIVA_PC_CXX_FLAGS="" [default=""]
--       Compiler flags to append when pre-compiling Gandiva operations
--
-- Advanced developer options:
--
--   ARROW_EXTRA_ERROR_CONTEXT=OFF [default=OFF]
--       Compile with extra error context (line numbers, code)
--   ARROW_OPTIONAL_INSTALL=OFF [default=OFF]
--       If enabled install ONLY targets that have already been built. Please be
--       advised that if this is enabled 'install' will fail silently on components
--       that have not been built

3
我假设你正在使用Linux,因为macOS和Windows的二进制包都支持snappy--是这样吗?通常,如果你已经安装了带有S3支持的Linux包,那么你也已经构建了所有的压缩库,但是也可以在没有压缩库的情况下构建S3。你究竟是如何安装这个包的? https://arrow.apache.org/docs/r/articles/install.html 可能会是一个有用的参考。
另外请注意:你可以直接write_parquet(iris, "s3://myawsbucket/iris.parquet"),不需要先将其写入本地文件并调用shell命令将其复制到S3。

1
我使用以下命令安装了该软件包:Sys.setenv(ARROW_S3 = "ON"),然后运行 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")。我在Ubuntu 18和Ubuntu 20上以这种方式安装了它。Cran版本在我的本地Mac机器上运行良好。我使用夜间构建版本以利用我在Linux上遇到的Cmake版本问题。 - Mike.Gahan
2
好的。压缩库默认情况下不会被构建。ARROW_S3=ON打开S3支持,但不会改变其他任何内容。设置NOT_CRAN=true可能会给你想要的结果;https://arrow.apache.org/docs/r/articles/install.html#installation-basics解释了一些其他选项。 - Neal Richardson

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接