How do I convert the "rows" of a pandas Series into columns of a DataFrame?

3
I have a pandas Series named ser1 with shape (100,).
import pandas as pd
ser1 = pd.Series(...)
print(ser1.shape)
##  prints (100,)

Each entry in the Series has length 150000, and each of its elements is a single character.

print(len(ser1.iloc[0]))
##  prints 150000

ser1.head()
sample1       xhtrcuviuvjhgfsrexvuvhfgshgckgvghfsgfdsdsg...
sample2       jhkjhgkjvkjgfjyqerwqrbxcvmkoshfkhgjknlkdfk...
sample3       sdfgfdxcvybnjbvtcyuikjhbgfdftgyhujhghjkhjn...
sample4       bbbbbbadfashdwkjhhguhoadfopnpbfjhsaqeqjtyi...
sample5       gfjyqedxcvrexvuvcvmkoshdftgyhujhgcvmkoshfk...
dtype: object

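I would like to convert this pandas Series into a pandas DataFrame such that each element of a Series "row" becomes a column of the DataFrame. In other words, every character of each Series entry would get its own column. In this case, the resulting DataFrame would have 150000 columns.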

print(type(df_ser1)) # DataFrame of ser1
## outputs <class 'pandas.core.frame.DataFrame'>
df_ser1.head()
     samples    char1    char2    char3    char4    char5    char6
0    sample1    x        h        t        r        c        u
1    sample2    j        h        k        j        h        g
2    sample3    s        d        f        g        f        d
3    sample4    b        b        b        b        b        b
........

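How can I convert the pandas Series into a DataFrame like this?

The most obvious idea is to do the following:

df_ser = ser1.to_frame()

But this does not split the elements into separate DataFrame columns:

df_ser = ser1.to_frame()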
df_ser.head()
                                                       0
sample1       xhtrcuviuvjhgfsrexvuvhfgshgckgvghfsgfdsdsg...
sample2       jhkjhgkjvkjgfjyqerwqrbxcvmkoshfkhgjknlkdfk...
sample3       sdfgfdxcvybnjbvtcyuikjhbgfdftgyhujhghjkhjn...
......
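
A quick check on toy data makes the problem concrete. This is only a sketch: the Series `toy` and its three short strings are hypothetical stand-ins for the real data. It shows that `to_frame()` wraps the Series as a single column rather than splitting the strings.

```python
import pandas as pd

# Hypothetical toy data standing in for the real 100 x 150000 Series
toy = pd.Series(["abc", "def", "ghi"], index=["sample1", "sample2", "sample3"])

# to_frame() keeps each string intact and only adds a column axis
toy_df = toy.to_frame()
print(toy_df.shape)  # (3, 1)
```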

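Somehow one would have to iterate over each element of a "Series row" and create a column, though I'm not sure how computationally feasible that is. (It is not very Pythonic.)

How would one do this?

2 Answers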

2
Consider a sample Series ser1
ser1 = pd.Series(
    'abc def ghi'.split(),
    'sample1 sample2 sample3'.split())

Convert each string to a list of characters, then use apply with pd.Series.

ser1.apply(lambda x: pd.Series(list(x))) \
    .rename(columns=lambda x: 'char{}'.format(x + 1))

        char1 char2 char3
sample1     a     b     c
sample2     d     e     f
sample3     g     h     i
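
One behavior worth knowing about this approach: when the function passed to `apply` returns a Series, pandas aligns the results by label, so strings of unequal length are padded with missing values. The sketch below uses a hypothetical two-row Series named `ragged` to illustrate this; it is not part of the original question's data, where all strings have the same length.

```python
import pandas as pd

# Hypothetical Series whose strings differ in length
ragged = pd.Series(["ab", "abc"], index=["s1", "s2"])

# Same technique as above: one Series per string, expanded into columns
wide = (
    ragged.apply(lambda s: pd.Series(list(s)))
    .rename(columns=lambda i: "char{}".format(i + 1))
)

# The shorter string has no third character, so that cell is missing
print(wide)
```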

This is surprisingly efficient for my dataset. Thanks for the help!

2

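My approach is to work with the data as a numpy array and store the final product in a pandas DataFrame. Overall, though, creating 100,000 columns in a DataFrame is quite slow.

My solution is no better than piRSquared's, but I thought I should post it anyway since it takes a different approach.

Sample data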

import pandas as pd
from timeit import default_timer as timer

# setup some sample data
a = ["c"]
a = a*100
a = [x*10**5 for x in a]
a = pd.Series(a)
print("shape of the series = %s" % a.shape)
print("length of each string in the series = %s" % len(a[0]))

Output:

shape of the series = 100
length of each string in the series = 100000

Solution

# get a numpy array representation of the pandas Series
b = a.values
# split each string in the series into a list of individual characters
c = [list(x) for x in b]
# save it as a dataframe
df = pd.DataFrame(c)
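
A possible variation, not taken from the answer itself, is to let numpy do the splitting. The sketch below assumes every string has the same length (as in the question) and relies on numpy's fixed-width unicode dtype: viewing a `U<n>` array as `U1` exposes one character per element. The names `ser`, `arr`, `chars` and `df_chars` are hypothetical, and the sketch also keeps the original index and adds `char1`, `char2`, ... column labels.

```python
import numpy as np
import pandas as pd

# Hypothetical small Series; all strings must have equal length for this trick
ser = pd.Series(["abc", "def", "ghi"], index=["sample1", "sample2", "sample3"])
assert ser.str.len().nunique() == 1, "strings must all have the same length"

# Fixed-width unicode array (dtype '<U3' for this data)
arr = np.array(ser.tolist(), dtype=str)

# Reinterpret each 3-character item as three 1-character items, one row per string
chars = arr.view("U1").reshape(len(arr), -1)

# Keep the original index and label the columns char1, char2, ...
df_chars = pd.DataFrame(
    chars,
    index=ser.index,
    columns=["char{}".format(i + 1) for i in range(chars.shape[1])],
)
print(df_chars)
```

Whether this is faster than the list comprehension on the full dataset would need to be measured; the DataFrame construction step is likely to dominate either way.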

Runtime

Since piRSquared has already posted a solution, I should include a runtime analysis.

execTime=[]
start = timer()
# get a numpy array representation of the pandas Series
b = a.values
end = timer()
execTime.append(end-start)

start = timer()
# split each string in the series into a list of individual characters
c = [list(x) for x in b]
end = timer()
execTime.append(end-start)

start = timer()
# save it as a dataframe
df = pd.DataFrame(c)
end = timer()
execTime.append(end-start)

start = timer()
a.apply(lambda x: pd.Series(list(x))).rename(columns=lambda x: 'char{}'.format(x + 1))
end = timer()
execTime.append(end-start)
print("get numpy array                      = %s" % execTime[0])
print("Split each string into chars runtime = %s" % execTime[1])
print("Save 2D list as Dataframe runtime    = %s" % execTime[2])
print("piRSquared's solution runtime        = %s" % execTime[3])

Output:

get numpy array                      = 7.788003131281585e-06
Split each string into chars runtime = 0.17509693499960122
Save 2D list as Dataframe runtime    = 12.092364584001189
piRSquared's solution runtime        = 13.954442440001003
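
Before reading much into such timings, it can help to confirm on a small input that both approaches produce the same values. The following is a minimal sketch with hypothetical data (20 strings of 50 characters, named `small`); the timings it prints will vary by machine and are not comparable to the numbers above.

```python
import pandas as pd
from timeit import default_timer as timer

# Hypothetical small input: 20 strings of 50 characters each
small = pd.Series(["c" * 50] * 20)

# List-of-lists approach
start = timer()
via_list = pd.DataFrame([list(x) for x in small.values])
list_time = timer() - start

# apply + pd.Series approach
start = timer()
via_apply = small.apply(lambda x: pd.Series(list(x)))
apply_time = timer() - start

# Compare the underlying values element by element
same = (via_list.to_numpy() == via_apply.to_numpy()).all()
print("same values    = %s" % same)
print("list approach  = %s" % list_time)
print("apply approach = %s" % apply_time)
```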
