Convert a Pandas DataFrame column to a list

Question

217

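I'm extracting a subset of one column based on a condition in another column.

I get the correct values back, but they come back as a pandas.core.frame.DataFrame. How do I convert that to a list?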

import pandas as pd

tst = pd.read_csv('C:\\SomeCSV.csv')

lookupValue = tst['SomeCol'] == "SomeValue"
ID = tst[lookupValue][['SomeCol']]
# How do I convert ID to a list?

- user3646105

19

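I'm hesitant to edit such an old question with so many views, but it should be pointed out that although the title talks about "dataframe to list", the question is really about "series to list". Note that while tst is a DataFrame, tst['SomeCol'] is a Series. The distinction matters because the tolist() method works directly on a Series but not on a DataFrame. - JohnE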
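The Series/DataFrame distinction in the comment above can be checked directly. This is a minimal sketch, assuming a small in-memory frame in place of the asker's CSV (the data values are made up for illustration):

```python
import pandas as pd

# Hypothetical data standing in for the asker's CSV file.
tst = pd.DataFrame({'SomeCol': ['SomeValue', 'Other', 'SomeValue']})

as_series = tst['SomeCol']     # single brackets -> Series
as_frame = tst[['SomeCol']]    # double brackets -> one-column DataFrame

# Series defines tolist(); DataFrame does not.
series_has_tolist = hasattr(as_series, 'tolist')
frame_has_tolist = hasattr(as_frame, 'tolist')

values = as_series.tolist()

print(type(as_series).__name__, type(as_frame).__name__)
print(series_has_tolist, frame_has_tolist)
print(values)
```

Calling as_frame.tolist() would raise an AttributeError, which is the situation the question runs into.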

1

Note that working with a DataFrame is more convenient than working with a list. - Burrito

2

If you want to know how to convert a DataFrame to a list (of lists), see this thread: https://dev59.com/nV4c5IYBdhLWcg3wXJRX#28006809 - cs95
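For readers who land here looking for the whole-frame case the comment above mentions, one common approach (not necessarily the one in the linked thread) is to go through the underlying NumPy array. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical example frame; both columns are integers.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# One inner list per row.
rows = df.values.tolist()

# One inner list per column.
columns = [df[name].tolist() for name in df.columns]

print(rows)
print(columns)
```

This works cleanly here because both columns share a dtype. With mixed dtypes, df.values converts everything to a common type first.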

4 Answers

30

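I'd like to clarify a few things:

1. As other answers have pointed out, the simplest approach is pandas.Series.tolist(). I'm not sure why the top-voted answer leads with pandas.Series.values.tolist(), since as far as I can tell it adds syntax and confusion without adding any benefit.
2. tst[lookupValue][['SomeCol']] is a DataFrame (as stated in the question), not a Series (as stated in a comment on the question). This is because tst[lookupValue] is a DataFrame, and slicing it with [['SomeCol']] asks for a list of columns (a list that happens to have length 1), which returns a DataFrame. If you remove the extra set of brackets, as in tst[lookupValue]['SomeCol'], you are asking for a single column rather than a list of columns, so you get a Series back.
3. You need a Series to use pandas.Series.tolist(), so you should skip the second set of brackets in this case. FYI, if you ever end up with a one-column DataFrame that can't easily be avoided, you can use pandas.DataFrame.squeeze() to convert it to a Series.
4. tst[lookupValue]['SomeCol'] gets a subset of a particular column via chained slicing. It slices once to get a DataFrame with only certain rows, then slices again to get a certain column. You can get away with it here because you are only reading, not writing, but the proper way to do it is tst.loc[lookupValue, 'SomeCol'] (which returns a Series).
5. Using the syntax from #4, you can reasonably do everything in one line: ID = tst.loc[tst['SomeCol'] == 'SomeValue', 'SomeCol'].tolist()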

Demo code:

import pandas as pd
df = pd.DataFrame({'colA':[1,2,1],
                   'colB':[4,5,6]})
filter_value = 1

print("df")
print(df)
print(type(df))

# Boolean mask: True for the rows to keep
rows_to_keep = df['colA'] == filter_value
print("\ndf['colA'] == filter_value")
print(rows_to_keep)
print(type(rows_to_keep))

# Chained slicing with single brackets -> Series
result = df[rows_to_keep]['colB']
print("\ndf[rows_to_keep]['colB']")
print(result)
print(type(result))

# Chained slicing with double brackets -> DataFrame
result = df[rows_to_keep][['colB']]
print("\ndf[rows_to_keep][['colB']]")
print(result)
print(type(result))

# squeeze() turns the one-column DataFrame into a Series
result = df[rows_to_keep][['colB']].squeeze()
print("\ndf[rows_to_keep][['colB']].squeeze()")
print(result)
print(type(result))

# .loc selects rows and column in one step -> Series
result = df.loc[rows_to_keep, 'colB']
print("\ndf.loc[rows_to_keep, 'colB']")
print(result)
print(type(result))

result = df.loc[df['colA'] == filter_value, 'colB']
print("\ndf.loc[df['colA'] == filter_value, 'colB']")
print(result)
print(type(result))

# tolist() converts the Series to a plain Python list
ID = df.loc[rows_to_keep, 'colB'].tolist()
print("\ndf.loc[rows_to_keep, 'colB'].tolist()")
print(ID)
print(type(ID))

ID = df.loc[df['colA'] == filter_value, 'colB'].tolist()
print("\ndf.loc[df['colA'] == filter_value, 'colB'].tolist()")
print(ID)
print(type(ID))

Result:

df
   colA  colB
0     1     4
1     2     5
2     1     6
<class 'pandas.core.frame.DataFrame'>

df['colA'] == filter_value
0     True
1    False
2     True
Name: colA, dtype: bool
<class 'pandas.core.series.Series'>

df[rows_to_keep]['colB']
0    4
2    6
Name: colB, dtype: int64
<class 'pandas.core.series.Series'>

df[rows_to_keep][['colB']]
   colB
0     4
2     6
<class 'pandas.core.frame.DataFrame'>

df[rows_to_keep][['colB']].squeeze()
0    4
2    6
Name: colB, dtype: int64
<class 'pandas.core.series.Series'>

df.loc[rows_to_keep, 'colB']
0    4
2    6
Name: colB, dtype: int64
<class 'pandas.core.series.Series'>

df.loc[df['colA'] == filter_value, 'colB']
0    4
2    6
Name: colB, dtype: int64
<class 'pandas.core.series.Series'>

df.loc[rows_to_keep, 'colB'].tolist()
[4, 6]
<class 'list'>

df.loc[df['colA'] == filter_value, 'colB'].tolist()
[4, 6]
<class 'list'>
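One caveat about squeeze() that the demo does not exercise: called without arguments it collapses every length-1 axis, so a filter that matches a single row yields a scalar rather than a Series. A minimal sketch, reusing the demo's column names with a filter value chosen to match exactly one row:

```python
import pandas as pd

df = pd.DataFrame({'colA': [1, 2, 1],
                   'colB': [4, 5, 6]})

# Only one row has colA == 2, so this is a 1x1 DataFrame.
one_row = df.loc[df['colA'] == 2, ['colB']]

# With no axis given, both length-1 axes are squeezed away -> scalar.
collapsed = one_row.squeeze()

# Squeezing only the column axis always leaves a Series.
still_series = one_row.squeeze(axis='columns')

print(type(collapsed).__name__, collapsed)
print(type(still_series).__name__, still_series.tolist())
```

If the number of matching rows is not known in advance, squeeze(axis='columns') or simply selecting the column with single brackets keeps the result a Series.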

- MarredCheese

21

You can use pandas.Series.tolist.

For example:

import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})

Run:

>>> df['a'].tolist()

You will get:

[1, 2, 3]

- zhql0907

3

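The above solutions work when all the data has the same dtype. NumPy arrays are homogeneous containers, and df.values returns a NumPy array. So if the data contains both int and float, the output is converted to a single type (here, float) and the columns lose their original dtypes. Consider this df: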

     a  b
0  1.0  4
1  2.0  5
2  3.0  6

a    float64
b      int64
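The loss of dtype described above is easy to reproduce. A minimal sketch, assuming the frame is built with the same values and dtypes as shown:

```python
import pandas as pd

# Column a is float64, column b is int64, as in the example above.
df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4, 5, 6]})

# The whole frame becomes one NumPy array with a single dtype.
mixed = df.values
whole_frame = mixed.tolist()

# Converting one column at a time keeps that column's own dtype.
only_b = df['b'].tolist()

print(mixed.dtype)
print(whole_frame)
print(only_b)
```

In whole_frame the values from column b come back as floats (4.0, 5.0, 6.0), while only_b still holds integers.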

So if you want to preserve the original dtypes, you can do something like this:

row_list = df.to_csv(None, header=False, index=False).split('\n')

This returns each row as a string.

['1.0,4', '2.0,5', '3.0,6', '']

Then split each row into a list. Each element after splitting is a string, so we need to convert it to the required data type.

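def f(row_str):
  # Split one CSV row and restore each field's type (a: float, b: int)
  row_list = row_str.split(',')
  return [float(row_list[0]), int(row_list[1])]

# Skip the trailing empty string; list() is needed on Python 3, where map is lazy
df_list_of_list = list(map(f, row_list[:-1]))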

[[1.0, 4], [2.0, 5], [3.0, 6]]
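The same rows can be built without the CSV round trip. This is an alternative sketch, not the approach of the answer above: it converts each column separately and then transposes the columns back into rows, assuming the same example frame.

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4, 5, 6]})

# Convert each column on its own, so every column keeps its dtype.
column_lists = [df[name].tolist() for name in df.columns]

# Transpose the list of columns back into a list of rows.
df_list_of_list = [list(row) for row in zip(*column_lists)]

print(df_list_of_list)
```

This avoids hard-coding a conversion function per column, at the cost of building an intermediate list for each column.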

- ShikharDua

1

A simpler way is to just do df['b'].values. If you select the column before using .values, you avoid the conversion and keep the original dtype. It is also more efficient. - JohnE

1

Which ones are the "above solutions"? All of the answers appear above this one. Thanks! - tommy.carstensen

- Akavall · Accepted Answer

You can use the Series.to_list method.

For example:

import pandas as pd

df = pd.DataFrame({'a': [1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9],
                   'b': [3, 5, 6, 2, 4, 6, 7, 8, 7, 8, 9]})

print(df['a'].to_list())

Output:

[1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9]

To drop duplicates, you can do one of the following:

>>> df['a'].drop_duplicates().to_list()
[1, 3, 5, 7, 4, 6, 8, 9]
>>> list(set(df['a'])) # as pointed out by EdChum
[1, 3, 4, 5, 6, 7, 8, 9]
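The two results above differ in order: drop_duplicates() keeps values in order of first appearance, while a set makes no ordering guarantee. A minimal sketch of a few order-preserving options, assuming a pandas version that provides Series.to_list (older versions only have tolist):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9]})

# Each of these keeps the first occurrence of every value, in order.
via_drop = df['a'].drop_duplicates().to_list()
via_unique = df['a'].unique().tolist()
via_dict = list(dict.fromkeys(df['a'].to_list()))

# Sort explicitly when sorted output is what is wanted.
via_sorted_set = sorted(set(df['a']))

print(via_drop)
print(via_sorted_set)
```

Sorting the set explicitly makes the ordering intentional rather than an accident of how small integers happen to hash.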