I am trying to transpose a very large dataframe. Because of the file's size, I used Dask and looked up how to transpose a Dask dataframe.
import pandas as pd
import numpy as np
import dask.dataframe as dd
genematrix = r"C:\Users\fnafee\Desktop\tobeMerged\GENEMATRIX.csv"
genematrix_df = dd.read_csv(genematrix)
new_df = np.transpose(genematrix_df)
new_df.head()
It returns the following:
---------------------------------------------------------------------------
UnboundLocalError Traceback (most recent call last)
Input In [39], in <cell line: 6>()
4 genematrix = r"C:\Users\fnafee\Desktop\tobeMerged\TSVSMERGED.csv"
5 genematrix_df = dd.read_csv(genematrix)
----> 6 new_df = np.transpose(genematrix_df)
7 new_df.head()
File <__array_function__ internals>:5, in transpose(*args, **kwargs)
File ~\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:660, in transpose(a, axes)
601 @array_function_dispatch(_transpose_dispatcher)
602 def transpose(a, axes=None):
603 """
604 Reverse or permute the axes of an array; returns the modified array.
605
(...)
658
659 """
--> 660 return _wrapfunc(a, 'transpose', axes)
File ~\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:54, in _wrapfunc(obj, method, *args, **kwds)
52 bound = getattr(obj, method, None)
53 if bound is None:
---> 54 return _wrapit(obj, method, *args, **kwds)
56 try:
57 return bound(*args, **kwds)
File ~\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:47, in _wrapit(obj, method, *args, **kwds)
45 if not isinstance(result, mu.ndarray):
46 result = asarray(result)
---> 47 result = wrap(result)
48 return result
File ~\Anaconda3\lib\site-packages\dask\dataframe\core.py:4213, in DataFrame.__array_wrap__(self, array, context)
4210 else:
4211 index = context[1][0].index
-> 4213 return pd.DataFrame(array, index=index, columns=self.columns)
UnboundLocalError: local variable 'index' referenced before assignment
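The problem seems to come from an internal function that I have no control over. Do I need to change the file format, or should I try processing the data in batches rather than as one large dataframe?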
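To make the batch idea concrete, here is a minimal sketch of what I have in mind. It uses plain pandas rather than Dask and reads the CSV in row batches through the `chunksize` argument of `pd.read_csv`. The helper name `transpose_in_chunks` and the tiny in-memory sample are hypothetical stand-ins for my real file, and I am assuming the transposed result still fits in memory, which may not hold for the full matrix.

```python
import io
import pandas as pd

def transpose_in_chunks(source, chunksize):
    """Read a CSV in row batches, transpose each batch, and join them side by side."""
    pieces = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # A batch of rows becomes a batch of columns once transposed.
        pieces.append(chunk.T)
    # The row batches were stacked vertically in the file,
    # so the transposed batches are joined horizontally.
    return pd.concat(pieces, axis=1)

# Hypothetical stand-in for the real CSV: 3 rows, 3 columns.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6\n7,8,9\n")
result = transpose_in_chunks(sample, chunksize=2)
print(result)
```

This only limits how much of the file is parsed at once. The final `pd.concat` still builds the whole transposed frame in memory, so for a truly out-of-core transpose each transposed batch would probably have to be written to disk instead.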