Using pyspark or sparkr (ideally both), how can I get the intersection of two DataFrame columns? For example, in sparkr I have the following DataFrames:
newHires <- data.frame(name = c("Thomas", "George", "George", "John"),
                       surname = c("Smith", "Williams", "Brown", "Taylor"))
salesTeam <- data.frame(name = c("Lucas", "Bill", "George"),
                        surname = c("Martin", "Clark", "Williams"))
newHiresDF <- createDataFrame(newHires)
salesTeamDF <- createDataFrame(salesTeam)
# intersect works on whole DataFrames
newSalesHire <- intersect(newHiresDF, salesTeamDF)
head(newSalesHire)
name surname
1 George Williams
# intersect does not work on single columns
newSalesHire <- intersect(newHiresDF$name, salesTeamDF$name)
head(newSalesHire)
Error in as.vector(y): cannot coerce this S4 type to a vector
How can I make intersect work on single columns?
In PySpark you can build single-column DataFrames directly and intersect them (StringType comes from pyspark.sql.types):

spark.createDataFrame(["a", "b", "x"], StringType()).intersect(spark.createDataFrame(["z", "y", "x"], StringType()))
- rogue-one