我对 ss.csv
数据集中的一些列感兴趣:
ss_ = spark.read.csv("ss.csv", header= True,
inferSchema = True)
ss_.columns
['Reporting Area', 'MMWR Year', 'MMWR Week', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Current week', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Current week, flag', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Previous 52 weeks Med', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Previous 52 weeks Med, flag', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Previous 52 weeks Max', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Previous 52 weeks Max, flag', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Cum 2018', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Cum 2018, flag', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Cum 2017', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Cum 2017, flag', 'Shiga toxin-producing Escherichia coli, Current week', 'Shiga toxin-producing Escherichia coli, Current week, flag', 'Shiga toxin-producing Escherichia coli, Previous 52 weeks Med', 'Shiga toxin-producing Escherichia coli, Previous 52 weeks Med, flag', 'Shiga toxin-producing Escherichia coli, Previous 52 weeks Max', 'Shiga toxin-producing Escherichia coli, Previous 52 weeks Max, flag', 'Shiga toxin-producing Escherichia coli, Cum 2018', 'Shiga toxin-producing Escherichia coli, Cum 2018, flag', 'Shiga toxin-producing Escherichia coli, Cum 2017', 'Shiga toxin-producing Escherichia coli, Cum 2017, flag', 'Shigellosis, Current week', 'Shigellosis, Current week, flag', 'Shigellosis, Previous 52 weeks Med', 'Shigellosis, Previous 52 weeks Med, flag', 'Shigellosis, Previous 52 weeks Max', 'Shigellosis, Previous 52 weeks Max, flag', 'Shigellosis, Cum 2018', 'Shigellosis, Cum 2018, flag', 'Shigellosis, Cum 2017', 'Shigellosis, Cum 2017, flag']
但我只需要一点:
columns_lambda = lambda k: k.endswith(', Current week') or k == 'Reporting Area' or k == 'MMWR Year' or k == 'MMWR Week'
过滤器返回所需列的列表,对该列表进行求值:
sss = filter(columns_lambda, ss_.columns)
to_keep = list(sss)
所需列的列表被拆分为数据帧(dataframe)选择函数的参数,该函数返回仅包含列表中列的数据集:
dfss = ss_.select(*to_keep)
dfss.columns
结果:
['Reporting Area',
'MMWR Year',
'MMWR Week',
'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Current week',
'Shiga toxin-producing Escherichia coli, Current week',
'Shigellosis, Current week']
df.select()
有一个互补对应的函数:http://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.drop
可以使用该函数来删除列。