Python/PySpark data frame: rearranging columns

Question

python pyspark apache-spark-sql

62

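I have a data frame in Python/PySpark with columns id, time, city, zip and so on.

I have now added a new column, name, to this data frame.

Now I need to rearrange the columns so that the name column comes right after the id column.

This is what I tried: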

change_cols = ['id', 'name']

cols = ([col for col in change_cols if col in df] 
        + [col for col in df if col not in change_cols])

df = df[cols]

I got this error:

pyspark.sql.utils.AnalysisException: u"Reference 'id' is ambiguous, could be: id#609, id#1224.;"

Why does this error occur, and how can I fix it?

- User12345
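
A possible reading of the error, offered as an assumption rather than a confirmed diagnosis: the message lists two internal attributes (id#609 and id#1224) for the single name 'id', which usually means the data frame holds two columns called id, for example after a join where both inputs had one. The sketch below only inspects a list of column names such as df.columns; the helper name duplicated_columns and the sample list are hypothetical.

```python
from collections import Counter


def duplicated_columns(columns):
    """Return the column names that occur more than once, in first-seen order."""
    counts = Counter(columns)
    duplicates = []
    for name in columns:
        if counts[name] > 1 and name not in duplicates:
            duplicates.append(name)
    return duplicates


# Hypothetical column list, as df.columns might look after a join
# in which both inputs carried an 'id' column.
columns = ["id", "time", "city", "zip", "id", "name"]
print(duplicated_columns(columns))  # ['id']
```

If a name shows up here, selecting it by name stays ambiguous until one of the copies is dropped or renamed. Joining on the column name, as in df1.join(df2, "id"), is one way to end up with a single id column in the first place.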

3 Answers

40

If you are working with a large number of columns:

df.select(sorted(df.columns))

- melchoir55

1

For those who are not Python experts: sorted is a built-in Python function, so you don't need to import anything extra. - PatrykMilewski

On what basis does it sort? Does it sort by column name? - Surender Raja
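
To illustrate the question above: df.columns is a plain Python list of strings, so sorted(df.columns) orders the column names themselves, alphabetically and case-sensitively. A small sketch using the column names from the question (the mixed-case list is made up for illustration):

```python
# Column names from the question, plus the newly added 'name' column.
columns = ["id", "time", "city", "zip", "name"]

# sorted() compares the strings themselves, so the result is alphabetical.
alphabetical = sorted(columns)
print(alphabetical)  # ['city', 'id', 'name', 'time', 'zip']

# The comparison is case-sensitive: uppercase letters sort before lowercase ones.
mixed = ["id", "Name", "city"]
print(sorted(mixed))  # ['Name', 'city', 'id']

# A key function gives a case-insensitive order instead.
print(sorted(mixed, key=str.lower))  # ['city', 'id', 'Name']
```

Note that this puts name after id only by coincidence of the alphabet; it does not give control over where a particular column lands.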

4

If you only want to reorder some of the columns and keep the rest without caring about their order:

def get_cols_to_front(df, columns_to_front):
    original = df.columns
    # Filter to present columns
    columns_to_front = [c for c in columns_to_front if c in original]
    # Keep the rest of the columns and sort it for consistency
    columns_other = list(set(original) - set(columns_to_front))
    columns_other.sort()
    # Apply the order
    df = df.select(*columns_to_front, *columns_other)

    return df

- ZettaP

There was a typo; it should be "columns_other = list(set(original) - set(columns_to_front))". Nice solution! - Pengshe

Corrected. Thanks for spotting it :) - ZettaP
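
A variant of the function above, in case the remaining columns should keep their original order instead of being sorted. This is only a sketch: it works on a list of column names (such as df.columns) rather than on the data frame itself, and the name cols_to_front_keep_order is made up.

```python
def cols_to_front_keep_order(all_columns, columns_to_front):
    """Move columns_to_front to the front; keep the rest in their original order."""
    # Ignore requested columns that are not actually present.
    front = [c for c in columns_to_front if c in all_columns]
    # List comprehensions preserve order, unlike the set difference.
    rest = [c for c in all_columns if c not in front]
    return front + rest


columns = ["time", "id", "zip", "city", "name"]
ordered = cols_to_front_keep_order(columns, ["id", "name"])
print(ordered)  # ['id', 'name', 'time', 'zip', 'city']

# With a data frame this would be applied as: df = df.select(*ordered)
```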


- Alex · Accepted Answer

You can use select to change the order of the columns:

df.select("id","name","time","city")
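
When there are too many columns to type out by hand, the list passed to select can be built from df.columns. A minimal sketch, assuming the column names are unique; move_after is a hypothetical helper, not part of PySpark:

```python
def move_after(columns, column, anchor):
    """Return a new list with `column` placed directly after `anchor`."""
    if column == anchor or column not in columns or anchor not in columns:
        raise ValueError("column and anchor must be two different, existing columns")
    # Take the column out, then re-insert it right behind the anchor.
    rest = [c for c in columns if c != column]
    position = rest.index(anchor) + 1
    return rest[:position] + [column] + rest[position:]


# Column order from the question, with 'name' appended at the end.
columns = ["id", "time", "city", "zip", "name"]
ordered = move_after(columns, "name", "id")
print(ordered)  # ['id', 'name', 'time', 'city', 'zip']

# With a data frame this would be applied as: df = df.select(*ordered)
```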