I have a large DataFrame (about 1.2 GB) with the following structure:
+---------+--------------+------------------------------------------------------------------------------------------------------+
| country | date_data    | text                                                                                                 |
+---------+--------------+------------------------------------------------------------------------------------------------------+
| "EEUU"  | "2016-10-03" | "T_D: QQWE\nT_NAME: name_1\nT_IN: ind_1\nT_C: c1ws12\nT_ADD: Sec_1_P\n ...........\nT_R: 45ee"       |
| "EEUU"  | "2016-10-03" | "T_D: QQAA\nT_NAME: name_2\nT_IN: ind_2\nT_C: c1ws12\nT_ADD: Sec_1_P\n ...........\nT_R: 46ee"       |
| .       | .            | .                                                                                                    |
| .       | .            | .                                                                                                    |
| "EEUU"  | "2016-10-03" | "T_D: QQWE\nT_NAME: name_300000\nT_IN: ind_65\nT_C: c1ws12\nT_ADD: Sec_1_P\n ...........\nT_R: 47aa" |
+---------+--------------+------------------------------------------------------------------------------------------------------+
There are 300,000 rows, and the "text" field is a string of roughly 5,000 characters.
I want to split the "text" field into the following new fields:
+---------+------------+------+-------------+--------+--------+---------+--------+------+
| country | date_data  | t_d  | t_name      | t_in   | t_c    | t_add   | ...... | t_r  |
+---------+------------+------+-------------+--------+--------+---------+--------+------+
| EEUU    | 2016-10-03 | QQWE | name_1      | ind_1  | c1ws12 | Sec_1_P | ...... | 45ee |
| EEUU    | 2016-10-03 | QQAA | name_2      | ind_2  | c1ws12 | Sec_1_P | ...... | 45ee |
| .       | .          | .    | .           | .      | .      | .       | .      | .    |
| .       | .          | .    | .           | .      | .      | .       | .      | .    |
| .       | .          | .    | .           | .      | .      | .       | .      | .    |
| EEUU    | 2016-10-03 | QQWE | name_300000 | ind_65 | c1ws12 | Sec_1_P | ...... | 47aa |
+---------+------------+------+-------------+--------+--------+---------+--------+------+
Currently I am solving this with regular expressions. First, I define the patterns and a function that extracts a single field from the text (there are 90 regexes in total):
val D_text    = "((?<=T_D: ).*?(?=\\\\n))".r
val NAME_text = "((?<=T_NAME: ).*?(?=\\\\n))".r
val IN_text   = "((?<=T_IN: ).*?(?=\\\\n))".r
val C_text    = "((?<=T_C: ).*?(?=\\\\n))".r
val ADD_text  = "((?<=T_ADD: ).*?(?=\\\\n))".r
// ... (remaining patterns, 90 in total)
val R_text    = "((?<=T_R: ).*?(?=\\\\n))".r
// UDF: return the first match of the given pattern, or "NULL" when the
// pattern does not match (or the input itself is null)
def getFirst(pattern: scala.util.matching.Regex) = udf(
  (text: String) =>
    if (text == null) "NULL"
    else pattern.findFirstIn(text) match {
      case Some(matched) => matched
      case None          => "NULL"
    }
)
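As a quick sanity check outside Spark (the sample string here is made up for illustration), note the double escaping: the "text" column stores literal backslash-n pairs rather than real newlines, so "\\\\n" in the Scala source becomes the regex \\n, which matches exactly that two-character pair:

val sample = "T_D: QQWE\\nT_NAME: name_1\\nT_R: 45ee" // literal \n separators
D_text.findFirstIn(sample)    // Some(QQWE)
NAME_text.findFirstIn(sample) // Some(name_1)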
Then I apply each regex to the "text" column through this function, creating a new DataFrame called tbl_separate_fields:
val tbl_separate_fields = hiveDF.select(
  hiveDF("country"),
  hiveDF("date_data"),
  getFirst(D_text)(hiveDF("text")).alias("t_d"),
  getFirst(NAME_text)(hiveDF("text")).alias("t_name"),
  getFirst(IN_text)(hiveDF("text")).alias("t_in"),
  getFirst(C_text)(hiveDF("text")).alias("t_c"),
  getFirst(ADD_text)(hiveDF("text")).alias("t_add"),
  // ... (one column per pattern, 90 in total)
  getFirst(R_text)(hiveDF("text")).alias("t_r")
)
Finally, I insert this DataFrame into a Hive table:
tbl_separate_fields.registerTempTable("tbl_separate_fields")
hiveContext.sql("INSERT INTO TABLE TABLE_INSERT PARTITION (date_data) SELECT * FROM tbl_separate_fields")
This solution takes about one hour for the whole DataFrame, so I would like to optimize it and reduce the execution time. Is there any way to do that?
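One direction I have been considering, shown here as a minimal sketch rather than production code (the tag list and names such as parseAll and tbl_separate_fields2 are illustrative): parse each text once into a key-to-value Map with a single UDF and select all fields from that Map, instead of running 90 separate regex scans over the same 5,000-character string:

import org.apache.spark.sql.functions.udf

// Illustrative subset of the 90 tag names to extract
val tags = Seq("T_D", "T_NAME", "T_IN", "T_C", "T_ADD", "T_R")

// One pass per row: split on the literal \n separator and build a Map
val parseAll = udf { (text: String) =>
  if (text == null) Map.empty[String, String]
  else
    text.split("""\\n""")              // stored separator is a literal \n pair
      .map(_.split(": ", 2))           // "T_D: QQWE" -> Array("T_D", "QQWE")
      .collect { case Array(k, v) => k.trim -> v }
      .toMap
}

val parsed = hiveDF.withColumn("kv", parseAll(hiveDF("text")))
val tbl_separate_fields2 = parsed.select(
  Seq(parsed("country"), parsed("date_data")) ++
    tags.map(t => parsed("kv").getItem(t).alias(t.toLowerCase)): _*
)

In principle this replaces 90 regex passes per row with a single split pass; whether a UDF returning a Map is actually faster on Spark 1.5.1 is exactly what I would like to know.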
We are using Hadoop 2.7.1 and Apache Spark 1.5.1. The Spark configuration is:
val conf = new SparkConf().set("spark.storage.memoryFraction", "0.1")
val sc = new SparkContext(conf)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
Thanks in advance.
EDIT: sample data:
+---------+--------------+------------------------------------------------------------------------------------------------+
| country | date_data    | text                                                                                           |
+---------+--------------+------------------------------------------------------------------------------------------------+
| "EEUU"  | "2016-10-03" | "T_D: QQWE\nT_NAME: name_1\nT_IN: ind_1\nT_C: c1ws12\nT_ADD: Sec_1_P\n ...........\nT_R: 45ee" |
| "EEUU"  | "2016-10-03" | "T_NAME: name_2\nT_D: QQAA\nT_IN: ind_2\nT_C: c1ws12 ...........\nT_R: 46ee"                   |
| .       | .            | .                                                                                              |
| .       | .            | .                                                                                              |
| "EEUU"  | "2016-10-03" | "T_NAME: name_300000\nT_ADD: Sec_1_P\nT_IN: ind_65\nT_C: c1ws12\n ...........\nT_R: 47aa"      |
+---------+--------------+------------------------------------------------------------------------------------------------+
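As this sample shows, the key order varies from row to row and some keys can be missing entirely, so the extraction must look fields up by name rather than by position: the lookbehind regexes above handle this, and so would the single-pass Map sketch.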