How to use a replace function on an RDD in Spark Scala

5
I have a file of tweets.
396124436845178880,"When's 12.4k gonna roll around",Matty_T_03
396124437168537600,"I really wish I didn't give up everything I did for you.     I'm so mad at my self for even letting it get as far as it did.",savava143
396124436958412800,"I really need to double check who I'm sending my     snapchats to before sending it ",juliannpham
396124437218885632,"@Darrin_myers30 I feel you man, gotta stay prayed up.     Year is important",Ful_of_Ambition
396124437558611968,"tell me what I did in my life to deserve this.",_ItsNotBragging
396124437499502592,"Too many fine men out here...see me drooling",LolaofLife
396124437722198016,"@jaiclynclausen will do",I_harley99

I am trying to replace all the special characters in the lines read from the RDD.
    val fileReadRdd = sc.textFile(fileInput)
    val fileReadRdd2 = fileReadRdd.map(x => x.map(_.replace(","," ")))
    val fileFlat = fileReadRdd.flatMap(rec => rec.split(" "))

I am getting the following error:
Error:(41, 57) value replace is not a member of Char
    val fileReadRdd2 = fileReadRdd.map(x => x.map(_.replace(",","")))
2 Answers

4

I suspect that:

x => x.map(_.replace(",",""))

is treating your string as a sequence of characters, when what you actually want is:
x => x.replace(",", "")
i.e. you don't need to map over the "sequence" of characters.

Thanks Brian. val stripCurly = "[{~,!,@,#,$,%,^,&,*,(,),_,=,-,`,:,',?,/,<,>,.}]" val fileReadRdd2 = fileReadRdd.map(x => stripCurly.replaceAll(x,"")) - Ravinder Karra
But this worked for me: val removeDots = file.map(x => x.replace(".", "")) works on a file with multiple lines. - jack AKA karthik
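The accepted fix can be illustrated without Spark: mapping over a `String` yields `Char`s, and `Char` has no `replace(String, String)` method, which is exactly what the compiler error says. A minimal sketch (the tweet line is taken from the question's sample data):

```scala
// A tweet line as in the question's file
val line = "396124437722198016,\"@jaiclynclausen will do\",I_harley99"

// Broken: x.map(_.replace(",", " ")) maps over the String's Chars,
// and Char has no replace(String, String) -- hence the compile error.

// Fixed: call replace on the whole String at once
val noCommas = line.replace(",", " ")

// The same fix applied per line of an RDD would be:
// val fileReadRdd2 = fileReadRdd.map(x => x.replace(",", " "))
```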

0
Processing regular files on any Spark-supported file system with Spark Scala, the Perl one-liner perl -pi 's/\s+//' $file would look like the following (feel free to adjust your regex):

import org.apache.spark.rdd.RDD

// read the file into an RDD of strings
val rdd: RDD[String] = spark.sparkContext.textFile(uri)

// for each line in rdd apply pattern and save to file
rdd
  .map(line => line.replaceAll("^\\s+", ""))
  .saveAsTextFile(uri + ".tmp")
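Since the question asks about stripping special characters in general, `replaceAll` with a regex character class covers that case too. Note the argument order: it is called on the string being cleaned, with the pattern first (the comment above this answer has the receiver and argument swapped). A hedged sketch with an illustrative pattern of my own choosing:

```scala
// replaceAll takes (regex, replacement); a character class matches
// any one of the listed special characters.
val specials = "[{}~,!@#$%^&*()_=\\-`:'?/<>.]"

val cleaned = "tell me, what I did... @user!".replaceAll(specials, "")

// Applied per line of an RDD:
// rdd.map(line => line.replaceAll(specials, ""))
```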
