How do I convert a Dataset[(String, Seq[String])] to a Dataset[(String, String)]?

Question

scala apache-spark apache-spark-sql

3

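This may be a simple question, but I am just starting out with Spark.

Problem: I currently have the structure below and want to turn it into the expected result that follows.

title1, {word11, word12, word13 ...}
title2, {word12, word22, word23 ...}

The data is stored in a Dataset[(String, Seq[String])].

Expected result: I would like to get tuples of (word, title):

word11, {title1}
word12, {title1}

What I do now:
1. Create (title, Seq[word1, word2, word3])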

docs.mapPartitions { iter =>
  iter.map {
     case (title, contents) => {
        val textToLemmas: Seq[String] = toText(....)
        (title, textToLemmas)
     }
  }
}

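I tried to use .map to convert my structure into tuples, but could not make it work.
I also tried iterating over all the elements, but then I could not return a typed result.

Thank you for your answers.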

- meernet

3 Answers

2

Another solution is to call the explode function like this:

import org.apache.spark.sql.functions.{col, explode}
// explode expects a Column, not a String; .as[...] needs spark.implicits._ in scope
dataset.withColumn("_2", explode(col("_2"))).as[(String, String)]
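Note that the explode call above keeps the column order, so it yields (title, word) rather than the (word, title) pairs the question asks for. A minimal sketch of one way to swap the order, assuming a local SparkSession and made-up sample data; the object name `ExplodeSketch` and the helper `wordTitlePairs` are hypothetical names used only for illustration:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical helper object, not part of Spark.
object ExplodeSketch {
  // Produces one (word, title) row per element of the Seq column.
  def wordTitlePairs(ds: Dataset[(String, Seq[String])]): Dataset[(String, String)] = {
    val spark = ds.sparkSession
    import spark.implicits._ // provides the Encoder for (String, String)

    // Tuple columns are named _1 (title) and _2 (words).
    // Selecting the exploded words first puts the word in the first column.
    ds.select(explode(col("_2")).as("word"), col("_1").as("title"))
      .as[(String, String)] // tuples are mapped by column position
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data in the shape described in the question.
    val docs = Seq(
      ("title1", Seq("word11", "word12")),
      ("title2", Seq("word12", "word22"))
    ).toDS()

    wordTitlePairs(docs).show()
    spark.stop()
  }
}
```

Rows whose word list is empty produce no output rows with explode.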

Hope this helps. Best regards.

- Haroun Mohammedi

2

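I'm surprised that nobody has offered a solution using Scala's for-comprehension (which is "desugared" at compile time into flatMap and map calls, like the one shown in Yuval Itzchakov's answer).

Whenever you see a chain of flatMap and map calls (possibly with filter as well), it can be written as a Scala for-comprehension.

So the following: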

val result = dataSet.flatMap { case (title, words) => words.map((_, title)) }

is equivalent to:

val result = for {
  (title, words) <- dataSet
  w <- words
} yield (w, title)

After all, that's why we enjoy the flexibility of Scala, isn't it?
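The equivalence can be checked without a Spark cluster. The sketch below uses a plain Scala Seq as a stand-in for the Dataset (an assumption made only so the example runs on its own); the object name `ForComprehensionSketch` and the sample data are hypothetical:

```scala
// Hypothetical object for illustration; plain collections stand in for a Dataset.
object ForComprehensionSketch {
  // Made-up sample data in the (title, words) shape from the question.
  val docs: Seq[(String, Seq[String])] = Seq(
    "title1" -> Seq("word11", "word12"),
    "title2" -> Seq("word12", "word22")
  )

  // Explicit flatMap/map chain.
  def viaFlatMap(in: Seq[(String, Seq[String])]): Seq[(String, String)] =
    in.flatMap { case (title, words) => words.map(w => (w, title)) }

  // The same logic written as a for-comprehension.
  def viaFor(in: Seq[(String, Seq[String])]): Seq[(String, String)] =
    for {
      (title, words) <- in
      w <- words
    } yield (w, title)

  def main(args: Array[String]): Unit = {
    println(viaFlatMap(docs))
    // Both forms produce the same (word, title) pairs in the same order.
    println(viaFor(docs) == viaFlatMap(docs))
  }
}
```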

- Jacek Laskowski

- Yuval Itzchakov · Accepted Answer

This should work:

val result = dataSet.flatMap { case (title, words) => words.map((_, title)) }
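For context, here is a minimal end-to-end sketch of the flatMap approach, assuming a local SparkSession and made-up sample data. The object name `FlatMapSketch` and the helper `invert` are hypothetical; only the flatMap line comes from the answer:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical wrapper object, not part of Spark.
object FlatMapSketch {
  // Turns each (title, words) row into one (word, title) row per word.
  def invert(dataSet: Dataset[(String, Seq[String])]): Dataset[(String, String)] = {
    val spark = dataSet.sparkSession
    import spark.implicits._ // flatMap needs an implicit Encoder for (String, String)

    dataSet.flatMap { case (title, words) => words.map((_, title)) }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatmap-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data in the shape described in the question.
    val docs = Seq(
      ("title1", Seq("word11", "word12")),
      ("title2", Seq("word12", "word22"))
    ).toDS()

    invert(docs).show()
    spark.stop()
  }
}
```

Because the result is built with typed operations, it stays a Dataset[(String, String)] and no explicit .as[...] conversion is needed.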