How do I convert a DataFrame from long to wide format, with the index grouped by year?

Question

6

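The code below works with a csv file I used previously. Both csv files have the same number of columns and the same column names.

The original post linked the data for the csv file that worked and for the one that did not. What does this error mean? Why am I getting it?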

from pandas import read_csv
from pandas import DataFrame
from pandas import Grouper
from matplotlib import pyplot

series = read_csv('carringtonairtemp.csv', header=0, index_col=0, parse_dates=True, squeeze=True)

groups = series.groupby(Grouper(freq='A'))
years = DataFrame()

for name, group in groups:
    years[name.year] = group.values

years = years.T

pyplot.matshow(years, interpolation=None, aspect='auto')
pyplot.show()

Error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-7173fcbe8c08> in <module>
      6 #     display(group.head())
      7 #     print(group.values[:10])
----> 8     years[name.year] = group.values

e:\Anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
   3038         else:
   3039             # set column
-> 3040             self._set_item(key, value)
   3041 
   3042     def _setitem_slice(self, key: slice, value):

e:\Anaconda3\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value)
   3114         """
   3115         self._ensure_valid_index(value)
-> 3116         value = self._sanitize_column(key, value)
   3117         NDFrame._set_item(self, key, value)
   3118 

e:\Anaconda3\lib\site-packages\pandas\core\frame.py in _sanitize_column(self, key, value, broadcast)
   3759 
   3760             # turn me into an ndarray
-> 3761             value = sanitize_index(value, self.index)
   3762             if not isinstance(value, (np.ndarray, Index)):
   3763                 if isinstance(value, list) and len(value) > 0:

e:\Anaconda3\lib\site-packages\pandas\core\internals\construction.py in sanitize_index(data, index)
    745     """
    746     if len(data) != len(index):
--> 747         raise ValueError(
    748             "Length of values "
    749             f"({len(data)}) "

ValueError: Length of values (365) does not match length of index (252)

- Xavier Conzet

2 Answers

1

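You get this error because the groups have different numbers of rows. The first iteration adds a column of 252 values to the empty DataFrame, so its index now has length 252. The next iteration tries to add a column of 365 values, which does not match 252, and that raises the error. In the DataFrame the code worked on, every year (group) has the same number of values (365). But now you have: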

1990-12-31    252
1991-12-31    365
1992-12-31    366
...
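
Counts like the ones above can be obtained by grouping on the year and taking the size of each group. This is a minimal sketch on a synthetic series, not the real file: the start date 1990-04-24 is an assumption chosen only so that the first year is partial.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (hypothetical data, not the real csv).
# The start date is an assumption picked so that 1990 is a partial year.
idx = pd.date_range("1990-04-24", "1992-12-31", freq="D")
series = pd.Series(np.arange(len(idx), dtype=float), index=idx, name="Temp")

# Number of observations in each calendar year
counts = series.groupby(series.index.year).size()
print(counts)
```

If the counts differ between years, assigning each group's values as a column of the same DataFrame will fail as soon as a group's length differs from the first one.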

For example, suppose we have a DataFrame with three rows.

If we try to add a column with only two values, we get the following error:

df['B']=[1,2]

ValueError: Length of values does not match the length of the index

The assignment only works when the number of values matches the number of rows:

df['B']=[1,2,3]
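
Put together, the two assignments can be tried in one runnable sketch. The three-row DataFrame and its column name `A` are assumptions made for illustration, since the example DataFrame is not shown here.

```python
import pandas as pd

# Hypothetical three-row DataFrame; the column name "A" is an assumption
df = pd.DataFrame({"A": [10, 20, 30]})

error_message = None
try:
    df["B"] = [1, 2]  # two values for a three-row index
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

df["B"] = [1, 2, 3]  # lengths match, so the assignment succeeds
print(df)
```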

- Billy Bonaros


- Trenton McKinney · Accepted Answer

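The issue with creating the DataFrame iteratively like this is that each new column must match the length of the existing DataFrame's index, which is set by the first year added.
In the smaller dataset, every year has 365 days with no missing days.
The larger dataset has a mix of 365-day and 366-day years, and data is missing from 1990 and 2020, which causes "ValueError: Length of values (365) does not match length of index (252)".
The following is a more succinct script that produces the desired DataFrame shape and plot.
- This implementation has no problem with unequal data lengths.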

import pandas as pd
import matplotlib.pyplot as plt

# links to data
url1 = 'https://raw.githubusercontent.com/trenton3983/stack_overflow/master/data/so_data/2020-09-19%20%2063975678/daily-min-temperatures.csv'
url2 = 'https://raw.githubusercontent.com/trenton3983/stack_overflow/master/data/so_data/2020-09-19%20%2063975678/carringtonairtemp.csv'

# load the data into a DataFrame, not a Series
# parse the dates, and set them as the index
df1 = pd.read_csv(url1, parse_dates=['Date'], index_col=['Date'])
df2 = pd.read_csv(url2, parse_dates=['Date'], index_col=['Date'])

# groupby year and aggregate Temp into a list
dfg1 = df1.groupby(df1.index.year).agg({'Temp': list})
dfg2 = df2.groupby(df2.index.year).agg({'Temp': list})

# create a wide format dataframe with all the temp data expanded
df1_wide = pd.DataFrame(dfg1.Temp.tolist(), index=dfg1.index)
df2_wide = pd.DataFrame(dfg2.Temp.tolist(), index=dfg2.index)

# plot
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 10))

ax1.matshow(df1_wide, interpolation=None, aspect='auto')
ax2.matshow(df2_wide, interpolation=None, aspect='auto')
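
To see why unequal group lengths are not a problem here, the same group-then-expand steps can be run on synthetic data. This is only a sketch: the date range and the random temperatures are made up for illustration and are not the real files.

```python
import numpy as np
import pandas as pd

# Synthetic daily data (hypothetical): a partial first year, then a 365-day and a 366-day year
idx = pd.date_range("1990-04-24", "1992-12-31", freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame({"Temp": rng.normal(15, 5, len(idx))}, index=idx)

# One list of temperatures per year
lists = df.groupby(df.index.year)["Temp"].agg(list)

# Building a DataFrame from ragged lists pads the shorter rows with NaN
wide = pd.DataFrame(lists.tolist(), index=lists.index)

print(wide.shape)                      # one row per year, columns up to the longest year
print(wide.loc[1990].notna().sum())    # observations actually present in 1990
```

Note that the columns are positional: a partial year is filled from column 0 and padded at the end, so a given column does not necessarily correspond to the same calendar day in every row.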