Spark: are partition columns case-sensitive?

Question

3

I am trying to write a DataFrame (in ORC format) partitioned by a key, using hiveContext:

df.write().partitionBy("event_type").mode(SaveMode.Overwrite).orc("/path");

However, the column I am partitioning on contains values that differ only by letter case, and this raises an error during the write:

Caused by: java.io.IOException: File already exists: file:/path/_temporary/0/_temporary/attempt_201607262359_0001_m_000000_0/event_type=searchFired/part-r-00000-57167cfc-a9db-41c6-91d8-708c4f7c572c.orc

The event_type column has two values: searchFired and SearchFired. If I remove either one of them from the DataFrame, the write succeeds. How can I solve this?

- nish

1 Answer

- Sim · Accepted Answer

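Relying on case differences in file system paths is generally not a good idea. Each distinct partition value becomes a directory name (event_type=searchFired, event_type=SearchFired), and on a case-insensitive file system both names resolve to the same directory. That is consistent with the "File already exists" error on the local file: path in the question.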
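To find out which values would collide before writing, one option is to group the distinct partition values by their lower-cased form. This is a minimal sketch in plain Scala. The names PartitionCaseCheck and caseCollisions are hypothetical and not part of Spark, and the input is assumed to be the distinct event_type values already collected to the driver.

```scala
import java.util.Locale

object PartitionCaseCheck {
  // Groups values by their lower-cased form and keeps only the groups
  // that contain more than one spelling, i.e. values that differ only by case.
  def caseCollisions(values: Seq[String]): Map[String, Seq[String]] =
    values.distinct
      .groupBy(_.toLowerCase(Locale.ROOT))
      .filter { case (_, variants) => variants.size > 1 }
}

object PartitionCaseCheckDemo extends App {
  // "pageView" is a made-up third value, added only for illustration.
  val collisions =
    PartitionCaseCheck.caseCollisions(Seq("searchFired", "SearchFired", "pageView"))

  collisions.foreach { case (folded, variants) =>
    println(s"$folded <- ${variants.mkString(", ")}")
  }
  // prints: searchfired <- searchFired, SearchFired
}
```

An empty result means no two partition values differ only by case, so the original partitionBy("event_type") should not hit this particular collision.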

The solution is to merge the values that differ only by case into the same partition, for example with the Scala DSL:

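import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.expr

// par_event_type is a lower-cased copy of event_type, used only for partitioning
df
  .withColumn("par_event_type", expr("lower(event_type)"))
  .write
  .partitionBy("par_event_type")
  .mode(SaveMode.Overwrite)
  .orc("/path")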

This adds an extra partition column. If that causes problems, you can remove it with drop when reading the data.
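Reading the data back without the helper column could look like the sketch below. The object name ReadWithoutHelperColumn and its read method are hypothetical. It assumes the write used a helper column named par_event_type, and that the SQLContext passed in supports ORC (on Spark 1.x that means a HiveContext).

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

object ReadWithoutHelperColumn {
  // Reads the partitioned ORC output and removes the helper partition column.
  // The original event_type column, with its original casing, is stored in
  // the data files and is left untouched.
  def read(sqlContext: SQLContext, path: String): DataFrame =
    sqlContext.read
      .orc(path)               // partition discovery exposes par_event_type as a column
      .drop("par_event_type")  // drop the helper column after loading
}
```

Filtering on par_event_type before the drop still lets Spark prune partitions. Filtering on event_type alone would not, because that value is no longer encoded in the directory names.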