Pandas DataFrame: getting the unique values from a timestamp column

Question

3

I have some time series data that looks like this:

1998-01-02 09:30:00,0.4298,0.4337,0.4258,0.4317,6426369
1999-01-02 09:45:00,0.4317,0.4337,0.4258,0.4298,10589080
2000-01-02 10:00:00,0.4298,0.4337,0.4278,0.4337,9507980
2001-01-02 10:15:00,0.4337,0.4416,0.4298,0.4416,13639022

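What I want is a list of the years.

years = ['1998', '1999', '2000', '2001']

That way I can use the list to know which years I can query in a given dataframe. Not all of the dataframes contain the same years.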

data = pd.read_csv(str(inFileName), index_col=0, parse_dates=True, header=None)

#data.iloc[:, 0]

print(pd.DatetimeIndex(data.iloc[:, 0]).year)

#print(data.iloc[:, 0])

#years = list(data.index)
#print(years)

for x in years:

I have tried many approaches, but none of them worked. Could someone explain how to solve a problem like this?

Edit 1: After some suggestions, I am trying the following:

data = pd.read_csv(str(inFileName), parse_dates=[0], header=None)
data.iloc[:, 0] = pd.to_datetime(data.iloc[:, 0])
data['year'] = data.iloc[:, 0].apply(lambda x: x.year)
year_list = data['year'].unique().tolist()
print(year_list)
for x in year_list:
    newDF = data[x]
    newDF.head()

    print(newDF.head(5))

I get the list: [2017, 2018, 2019]

But I cannot create new dataframes from that list. I want to create a new dataframe for each value in the list. I get this error:

[2017, 2018, 2019]

Traceback (most recent call last):
  File "/home/jason/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 2017

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./massageSM.py", line 123, in <module>
    main(sys.argv[1:])
  File "./massageSM.py", line 33, in main
    newDF = data[x]
  File "/home/jason/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2688, in __getitem__
    return self._getitem_column(key)
  File "/home/jason/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
    return self._get_item_cache(key)
  File "/home/jason/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
    values = self._data.get(item)
  File "/home/jason/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/internals.py", line 4115, in get
    loc = self.items.get_loc(item)
  File "/home/jason/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 2017

Edit 2:

I am using this:

data = pd.read_csv("RHE.SM", parse_dates=[0], header=None)
data.iloc[:, 0] = pd.to_datetime(data.iloc[:, 0])
data['year'] = data.iloc[:, 0].apply(lambda x: x.year)
year_list = data['year'].unique().tolist()
print(year_list)
  
for x in year_list:
    df = pd.DataFrame({'years':year_list})
    
    print(df.head(5))

And it produces this output:

[2017, 2018, 2019]
   years
0   2017
1   2018
2   2019
   years
0   2017
1   2018
2   2019
   years
0   2017
1   2018
2   2019

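But what I want to create is the following:

a dataframe containing only 2017
a dataframe containing only 2018
a dataframe containing only 2019

I cannot hard-code these, because other files may not contain the same years. I need to build a list of the available years and iterate over it.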

Edit 3:

I have also tried:

data = pd.read_csv("RHE.SM", header=None, parse_dates=[0])
year_list = data[0].dt.year.unique().tolist()
print(year_list)
data.index = pd.DatetimeIndex(data[0])
print(type(data.index))
print(data.index)

for x in year_list:
    print(x)
    newDF = data[x]
    #newDF.head()

    #print(newDF.head(5))

I get the following output. It starts out fine, but then an error occurs when creating newDF.

[2017, 2018, 2019]
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
DatetimeIndex(['2017-10-02 10:15:00', '2017-10-02 10:30:00',
               '2017-10-02 10:45:00', '2017-10-02 11:00:00',
               '2017-10-02 11:15:00', '2017-10-02 11:30:00',
               '2017-10-02 11:45:00', '2017-10-02 12:00:00',
               '2017-10-02 12:15:00', '2017-10-02 12:30:00',
               ...
               '2019-01-03 14:45:00', '2019-01-03 15:00:00',
               '2019-01-03 15:15:00', '2019-01-03 15:30:00',
               '2019-01-03 15:45:00', '2019-01-03 16:00:00',
               '2019-01-03 16:30:00', '2019-01-03 16:45:00',
               '2019-01-03 17:15:00', '2019-01-03 18:30:00'],
              dtype='datetime64[ns]', name=0, length=8685, freq=None)
2017

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3077             try:
-> 3078                 return self._engine.get_loc(key)
   3079             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 2017

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-19-f31493ccbf2a> in <module>
      9 for x in year_list:
     10     print(x)
---> 11     newDF = data[x]
     12     #newDF.head()
     13 

~/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2686             return self._getitem_multilevel(key)
   2687         else:
-> 2688             return self._getitem_column(key)
   2689 
   2690     def _getitem_column(self, key):

~/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   2693         # get column
   2694         if self.columns.is_unique:
-> 2695             return self._get_item_cache(key)
   2696 
   2697         # duplicate columns & possible reduce dimensionality

~/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   2487         res = cache.get(item)
   2488         if res is None:
-> 2489             values = self._data.get(item)
   2490             res = self._box_item_values(item, values)
   2491             cache[item] = res

~/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   4113 
   4114             if not isna(item):
-> 4115                 loc = self.items.get_loc(item)
   4116             else:
   4117                 indexer = np.arange(len(self.items))[isna(self.items)]

~/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3078                 return self._engine.get_loc(key)
   3079             except KeyError:
-> 3080                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   3081 
   3082         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 2017

- Jasmine

4 Answers

1

If you want to split the dataframe into separate dataframes by year, you can do the following:

dfs = {
    year: sub_df.drop(columns=["year"])
    for year, sub_df in data.assign(year=lambda df: df[0].dt.year)\
                            .groupby("year")
}

Output:

{1998:                     0       1       2       3       4        5
 0 1998-01-02 09:30:00  0.4298  0.4337  0.4258  0.4317  6426369,
 1999:                     0       1       2       3       4         5
 1 1999-01-02 09:45:00  0.4317  0.4337  0.4258  0.4298  10589080,
 2000:                     0       1       2       3       4        5
 2 2000-01-02 10:00:00  0.4298  0.4337  0.4278  0.4337  9507980,
 2001:                     0       1       2       3       4         5
 3 2001-01-02 10:15:00  0.4337  0.4416  0.4298  0.4416  13639022}

If you want to iterate over them and write each of the dfs to its own CSV file, you can do the following:

for year, df in dfs.items():
    filename = "base_name_{}.csv".format(year)
    df.to_csv(filename, index=False)

Presumably you would want to derive the base name from the original file name.
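A minimal sketch of deriving the output names from the input file name, assuming the input path is held in a variable (here the hypothetical `in_file`, reusing the `RHE.SM` name from the question). The helper `split_by_year`, the `{stem}_{year}.csv` naming pattern, the temporary output directory, and `header=False` (to mirror the headerless input) are all assumptions made for illustration, not part of the answer above.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd

# Sample rows taken from the question; column 0 holds the timestamps.
CSV_TEXT = """\
1998-01-02 09:30:00,0.4298,0.4337,0.4258,0.4317,6426369
1999-01-02 09:45:00,0.4317,0.4337,0.4258,0.4298,10589080
2000-01-02 10:00:00,0.4298,0.4337,0.4278,0.4337,9507980
2001-01-02 10:15:00,0.4337,0.4416,0.4298,0.4416,13639022
"""


def split_by_year(data):
    """Return {year: sub-dataframe}, grouping on the year of column 0."""
    return {year: sub_df for year, sub_df in data.groupby(data[0].dt.year)}


data = pd.read_csv(io.StringIO(CSV_TEXT), header=None, parse_dates=[0])
dfs = split_by_year(data)

in_file = Path("RHE.SM")            # hypothetical input path
out_dir = Path(tempfile.mkdtemp())  # temporary directory, illustration only

written = []
for year, df in dfs.items():
    # Path.stem drops the extension, so "RHE.SM" becomes "RHE".
    out_path = out_dir / f"{in_file.stem}_{year}.csv"
    df.to_csv(out_path, index=False, header=False)
    written.append(out_path.name)

print(written)
```

Grouping on the Series `data[0].dt.year` directly means no helper `year` column is added, so nothing has to be dropped again before writing.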

- PMende

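I have never seen it done this way. How do I iterate over each df so I can export it to csv? Edit: oh, .items(). Let me try it. - Jasmine

@Jason I have added an edit showing how to write to files. - PMende

This basically works! I just need to remove the first column and the year at the end of each row. For example: 0,2017-10-02 10:15:00,0.971,1.1,0.971,1.1,600,2017. Remove the leading 0 and the trailing 2017. - Jasmine

@Jason, I have added .drop(columns=["year"]) and removed the reset_index() call. Please let me know whether that works. I am assuming data no longer has the datetimes as its index. - PMende

I got it! This answer is great. Can you point me to a resource for learning the approach you used? I have never seen anyone use df like this. - Jasmine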

0

In your case, the simplest approach is:

data = pd.read_csv(inFileName, header=None, parse_dates=[0])
data[0].dt.year.unique().tolist()

This uses the datetime accessor (.dt), which is fast and vectorized.
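As a self-contained sketch of the two lines above, the snippet below reads the sample rows from the question out of an in-memory string (the `CSV_TEXT` name is an assumption made for illustration) and checks that column 0 really was parsed as datetimes before using `.dt`.

```python
import io

import pandas as pd

# Sample rows taken from the question; there is no header row.
CSV_TEXT = """\
1998-01-02 09:30:00,0.4298,0.4337,0.4258,0.4317,6426369
1999-01-02 09:45:00,0.4317,0.4337,0.4258,0.4298,10589080
2000-01-02 10:00:00,0.4298,0.4337,0.4278,0.4337,9507980
2001-01-02 10:15:00,0.4337,0.4416,0.4298,0.4416,13639022
"""

data = pd.read_csv(io.StringIO(CSV_TEXT), header=None, parse_dates=[0])

# .dt is only available on datetime-like columns, so check the dtype first.
assert pd.api.types.is_datetime64_any_dtype(data[0])

year_list = data[0].dt.year.unique().tolist()
print(year_list)  # [1998, 1999, 2000, 2001]
```

This relies on column 0 staying a regular column. With `index_col=0` the timestamps move into the index and `data[0]` is no longer available.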

- Matthijs Brouns

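@Matthijis Brouns - This just prints [1970] - Jasmine

Oops, I used the wrong accessor with iloc; I have updated the answer above. - Matthijs Brouns

@Matthijis Brouns - It prints DatetimeIndex(['1970-01-01 00:00:00.000001970'], dtype='datetime64[ns]', freq=None). - Jasmine

I had not noticed that you were setting the first column as the index. If you skip that step, simply calling .dt.year.unique().tolist() on column 0 will work. - Matthijs Brouns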

0

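First, you need to make sure you are extracting the year from a datetime type. Assuming you know the name of the column that stores the dates, you can do the following: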

df['datetime'] = pd.to_datetime(df['datetime'])
df['year'] = df['datetime'].apply(lambda x: x.year)

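If the dates are in the index, you can do the following:

df = df.reset_index()
df['datetime'] = pd.to_datetime(df['index'])
df['year'] = df['datetime'].apply(lambda x: x.year)

The first line moves the values out of the index and, by default (for an unnamed index), puts them in a column named "index". The second line converts that column to datetime format.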

Once that is done, you can extract the unique years:

years = df['year'].unique().tolist()
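When the timestamps already sit in a `DatetimeIndex`, the years can also be read from the index directly, without an intermediate column. The following is a small sketch under that assumption. The values come from the question's sample rows, but the column names `open` and `volume` are made up for illustration.

```python
import pandas as pd

# Small frame with a DatetimeIndex; column names are hypothetical.
df = pd.DataFrame(
    {
        "open": [0.4298, 0.4317, 0.4298, 0.4337],
        "volume": [6426369, 10589080, 9507980, 13639022],
    },
    index=pd.to_datetime(
        [
            "1998-01-02 09:30:00",
            "1999-01-02 09:45:00",
            "2000-01-02 10:00:00",
            "2001-01-02 10:15:00",
        ]
    ),
)

# DatetimeIndex exposes .year directly (no .dt accessor needed).
years = df.index.year.unique().tolist()
print(years)  # [1998, 1999, 2000, 2001]

# Partial string indexing with .loc selects the rows of a single year.
rows_1999 = df.loc["1999"]
print(rows_1999)
```

Selecting with `.loc[str(year)]` inside a loop over `years` gives one dataframe per year without hard-coding any of them.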

- Sokolokki

- Nathan Hellinga · Accepted Answer

I have not tested this, but I think it will work for you.

data.iloc[:, 0] = pd.to_datetime(data.iloc[:, 0])
data['year'] = data.iloc[:, 0].apply(lambda x: x.year)
year_list = data['year'].unique().tolist()

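This first converts the first column to datetime format. It then creates a new column containing only the year component of each datetime. Finally, it outputs a list of the unique values in that column.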

If you also want to turn the resulting list into a new dataframe, add this line afterwards:

df = pd.DataFrame({'years':year_list})

Edit: if you want to turn each individual item in the list into its own dataframe, you can add the following:

df = []
for x in year_list:
    df.append(pd.DataFrame({'years':[x]}))
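Note that the loop above produces one-row dataframes that hold only the year value itself. If the goal is instead a dataframe containing the original rows for each year, as the question describes, one possible sketch is to filter on the `year` column with a boolean mask. The names `CSV_TEXT` and `frames_by_year` are assumptions for illustration. `data[x]` is avoided because it looks up a column labelled `x` rather than rows, which is what raises the `KeyError` shown in the question.

```python
import io

import pandas as pd

# Sample rows taken from the question; there is no header row.
CSV_TEXT = """\
1998-01-02 09:30:00,0.4298,0.4337,0.4258,0.4317,6426369
1999-01-02 09:45:00,0.4317,0.4337,0.4258,0.4298,10589080
2000-01-02 10:00:00,0.4298,0.4337,0.4278,0.4337,9507980
2001-01-02 10:15:00,0.4337,0.4416,0.4298,0.4416,13639022
"""

data = pd.read_csv(io.StringIO(CSV_TEXT), header=None, parse_dates=[0])
data["year"] = data[0].dt.year
year_list = data["year"].unique().tolist()

# One dataframe per year: keep the matching rows, then drop the helper column.
frames_by_year = {
    year: data.loc[data["year"] == year].drop(columns="year")
    for year in year_list
}

for year, frame in frames_by_year.items():
    print(year, len(frame))
```

Because the keys come from `year_list`, nothing is hard-coded, and files covering different years yield different sets of dataframes.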