How do I modify the data in a CSV file and rearrange its rows and columns?

3

I have a CSV file whose data is laid out like this:

Countries  variable  1995   1996  1997  1998   1999
  USA        GDP      10     11    12    12     13 
  USA        Inf      100    120   130   120    110
  USA        Trade    200    220   210   235    250
  GER        GDP      8      9     9.5   10     10.5
  GER        Inf      100    105   107   109    111 
  GER        Trade    150    156   149   165    167 

I want to reshape the data into the following:
  Countries  Years    GDP    Inf   Trade
     USA      1995     10    100    200
     USA      1996     11    120    220
     USA      1997     12    130    210
     USA      1998     12    120    235
     USA      1999     13    110    250
     GER      1995     8     100    150
     GER      1996     9     105    156
     GER      1997     9.5   107    149
     GER      1998     10    109    165
     GER      1999     10.5  111    167

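I don't know how to do this in Python. I have loaded the data with pandas, but the only operation I have managed is transposing rows and columns, which is not what I want. Using csv.writerow did not let me reshape the data either.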

4 Answers

3

You can use an OrderedDict to group the data:

import csv
from collections import OrderedDict,defaultdict
from itertools import islice
with open("out.csv") as f:
    od = OrderedDict()
    r = csv.reader(f, delimiter=" ")
    header = next(r)
    years = header[2:]
    zipped = zip(*r)
    countries = OrderedDict.fromkeys(zipped[0]).keys() # next(zipped) python3
    it = iter(countries)
    for  row in zip(*zipped[1:]): #  for  row in zip(*zipped) python3
        if row[0] == "GDP":
            key = next(it)
            od.setdefault(key, defaultdict(list))
            od[key]["Years"] = years
            od[key]["Country"] = [key] * len(years)
        od[key][row[0]].extend(islice(row,1,None))

Output:

OrderedDict([('USA', defaultdict(<type 'list'>, {'GDP': ['10', '11', '12', '12', '13'], 'Inf': ['100', '120', '130', '120', '110'], 'Years': ['1995', '1996', '1997', '1998', '1999'], 'Trade': ['200', '220', '210', '235', '250']})), ('GER', defaultdict(<type 'list'>, {'GDP': ['8', '9', '9.5', '10', '10.5'], 'Inf': ['100', '105', '107', '109', '111'], 'Years': ['1995', '1996', '1997', '1998', '1999'], 'Trade': ['150', '156', '149', '165', '167']}))])
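
The code above indexes the result of zip, which only works on Python 2. As a rough Python 3 sketch of the same grouping idea, the version below splits each line on whitespace with str.split() instead of using csv.reader (a substitution, chosen so that padded columns do not produce empty fields). The SAMPLE string stands in for out.csv, and the function name group_by_country is made up for illustration.

```python
import io
from collections import OrderedDict

# Stand-in for out.csv: same values as in the question, space-padded columns.
SAMPLE = """\
Countries  variable  1995   1996  1997  1998   1999
USA        GDP      10     11    12    12     13
USA        Inf      100    120   130   120    110
USA        Trade    200    220   210   235    250
GER        GDP      8      9     9.5   10     10.5
GER        Inf      100    105   107   109    111
GER        Trade    150    156   149   165    167
"""


def group_by_country(lines):
    """Return {country: {"Years": [...], "GDP": [...], ...}} in file order."""
    # str.split() with no argument splits on any run of whitespace.
    rows = [line.split() for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    years = header[2:]
    od = OrderedDict()
    for country, variable, *values in body:
        # Create the per-country group on first sight, seeded with the years.
        group = od.setdefault(country, OrderedDict(Years=list(years)))
        group[variable] = values
    return od


od = group_by_country(io.StringIO(SAMPLE))
for country, group in od.items():
    print(country, dict(group))
```

Unlike the original, this does not rely on GDP being the first row of each country; it keys every row on its own country column.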

No doubt someone more experienced with pandas can find a better way to do this, but the following will at least build a dataframe:

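import pandas as pd

cols = ["Country","Years","GDP","Inf","Trade"]
df = pd.DataFrame(columns=cols)

for k,v in od.items():
    # one row per column name, transposed so that each name becomes a column
    df_temp = pd.DataFrame((v[c] for c in cols), cols).transpose()
    # append returns a new frame, so the result must be assigned back to df
    # (on pandas >= 2.0, where append was removed: pd.concat([df, df_temp], ignore_index=True))
    df = df.append(df_temp,ignore_index=True)
print(df)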

Output:

    Country Years   GDP  Inf Trade
0     USA  1995    10  100   200
1     USA  1996    11  120   220
2     USA  1997    12  130   210
3     USA  1998    12  120   235
4     USA  1999    13  110   250
5     GER  1995     8  100   150
6     GER  1996     9  105   156
7     GER  1997   9.5  107   149
8     GER  1998    10  109   165
9     GER  1999  10.5  111   167

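If your file is large, you can build the dataframe as you go and reset the OrderedDict after each group, so that you never hold all of the data in the dict at once. You then only need to append the last group after the main loop. On Python 2 you can also use itertools.islice to take the slices and itertools.izip to do the zipping.

import csv
from collections import OrderedDict,defaultdict
from itertools import islice,izip
import pandas as pd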

df = pd.DataFrame(columns=["Country","Years","GDP","Inf","Trade"])
with open("out.csv") as f:
    od = OrderedDict()
    r = csv.reader(f, delimiter=" ")
    header = next(r)
    years = header[2:]
    zipped = izip(*r)
    countries = OrderedDict.fromkeys(next(zipped)).keys()
    it = iter(countries)
    for row in izip(*zipped):
        if row[0] == "GDP":
            if od: # make sure it is not the first line
                for k, v in od.items():
                    df_temp = pd.DataFrame((v[k] for k in ["Country","Years","GDP","Inf","Trade"] ), ["Country","Years","GDP","Inf","Trade"]).transpose()
                    df = df.append(df_temp, ignore_index=True)
                    od = OrderedDict()
            key = next(it)
            od.setdefault(key, defaultdict(list))
            od[key]["Years"] = years
            od[key]["Country"] = [key] * len(years)
        od[key][row[0]].extend(islice(row, 1, None))

# append the last group, which the loop above never flushes
for k,v in od.items():
    df_temp = pd.DataFrame((v[k] for k in ["Country","Years","GDP","Inf","Trade"] ), ["Country","Years","GDP","Inf","Trade"]).transpose()
    df = df.append(df_temp, ignore_index=True)


print(df)

This again produces the same output:

  Country Years   GDP  Inf Trade
0     USA  1995    10  100   200
1     USA  1996    11  120   220
2     USA  1997    12  130   210
3     USA  1998    12  120   235
4     USA  1999    13  110   250
5     GER  1995     8  100   150
6     GER  1996     9  105   156
7     GER  1997   9.5  107   149
8     GER  1998    10  109   165
9     GER  1999  10.5  111   167
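
DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the loops above fail on current versions. A minimal sketch of one alternative is to collect one small frame per country in a list and call pd.concat once at the end. The groups dict below is a hand-written stand-in for the od built earlier (without the "Country" entry), not output copied from the original code.

```python
import pandas as pd

COLUMNS = ["Country", "Years", "GDP", "Inf", "Trade"]
YEARS = ["1995", "1996", "1997", "1998", "1999"]

# Hypothetical stand-in for the grouped dict built from the file.
groups = {
    "USA": {
        "Years": YEARS,
        "GDP": ["10", "11", "12", "12", "13"],
        "Inf": ["100", "120", "130", "120", "110"],
        "Trade": ["200", "220", "210", "235", "250"],
    },
    "GER": {
        "Years": YEARS,
        "GDP": ["8", "9", "9.5", "10", "10.5"],
        "Inf": ["100", "105", "107", "109", "111"],
        "Trade": ["150", "156", "149", "165", "167"],
    },
}

frames = []
for country, group in groups.items():
    frame = pd.DataFrame(group)           # one column per key, one row per year
    frame.insert(0, "Country", country)   # the scalar is repeated on every row
    frames.append(frame[COLUMNS])         # enforce a consistent column order

# A single concat avoids copying the growing frame on every iteration.
df = pd.concat(frames, ignore_index=True)
print(df)
```

The values stay as strings here, as in the original answer; pd.to_numeric could be applied per column if numbers are needed.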

2
Assuming your data is a list of lists:
>>> for line in data:
...     print('\t'.join(line))
...
USA  GDP    10   11   12   12   13
USA  Inf    100  120  130  120  110
USA  Trade  200  220  210  235  250
GER  GDP    8    9    9.5  10   10.5
GER  Inf    100  105  107  109  111
GER  Trade  150  156  149  165  167

With the following code:
from collections import defaultdict
data2 = defaultdict(dict)
for line in data:
    for i, year in ((2,1995),(3,1996),(4,1997),(5,1998),(6,1999)):
        data2[(line[0], year)][line[1]] = line[i]
data3 = [[i,j]+[data2[(i,j)][k] for k in ('GDP','Inf','Trade')] for i,j in data2]
for line in sorted(data3):
    print(line)

you get:

['GER', 1995, '8', '100', '150']
['GER', 1996, '9', '105', '156']
['GER', 1997, '9.5', '107', '149']
['GER', 1998, '10', '109', '165']
['GER', 1999, '10.5', '111', '167']
['USA', 1995, '10', '100', '200']
['USA', 1996, '11', '120', '220']
['USA', 1997, '12', '130', '210']
['USA', 1998, '12', '120', '235']
['USA', 1999, '13', '110', '250']
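
Since the question mentions csv.writerow, here is a small sketch of how rows regrouped this way could be written back out as CSV with the standard library. It keeps the input order by relying on dict insertion order (Python 3.7+) rather than sorting. The data list mirrors the one assumed above, and the file name reshaped.csv in the comment is only a placeholder.

```python
import csv
import io

years = ["1995", "1996", "1997", "1998", "1999"]
variables = ["GDP", "Inf", "Trade"]

# Same list-of-lists shape as assumed in the answer above.
data = [
    ["USA", "GDP", "10", "11", "12", "12", "13"],
    ["USA", "Inf", "100", "120", "130", "120", "110"],
    ["USA", "Trade", "200", "220", "210", "235", "250"],
    ["GER", "GDP", "8", "9", "9.5", "10", "10.5"],
    ["GER", "Inf", "100", "105", "107", "109", "111"],
    ["GER", "Trade", "150", "156", "149", "165", "167"],
]

# (country, year) -> {variable: value}
table = {}
for country, variable, *values in data:
    for year, value in zip(years, values):
        table.setdefault((country, year), {})[variable] = value

# Writing to an in-memory buffer; to write a real file, replace it with
# open("reshaped.csv", "w", newline="") (placeholder name).
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["Countries", "Years"] + variables)
for (country, year), cells in table.items():
    writer.writerow([country, year] + [cells[v] for v in variables])

print(out.getvalue())
```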

2

This answer is very similar to @AmiTavory's deleted answer (it uses unstack rather than pivot_table, but the two are equivalent here), with just one extra step at the end:

df2 = pd.melt(df, id_vars=["Countries", "variable"], var_name="Years")
df2 = df2.set_index(["Countries", "Years", "variable"]).unstack().reset_index()
df2.columns = [x[1] if x[1] else x[0] for x in df2.columns]

which produces

In [149]: df2
Out[149]: 
  Countries Years   GDP  Inf  Trade
0       GER  1995   8.0  100    150
1       GER  1996   9.0  105    156
2       GER  1997   9.5  107    149
3       GER  1998  10.0  109    165
4       GER  1999  10.5  111    167
5       USA  1995  10.0  100    200
6       USA  1996  11.0  120    220
7       USA  1997  12.0  130    210
8       USA  1998  12.0  120    235
9       USA  1999  13.0  110    250

This works because we first create a melted version of the frame:
In [160]: df2 = pd.melt(df, id_vars=["Countries", "variable"], var_name="Years")

In [161]: df2
Out[161]: 
   Countries variable Years  value
0        USA      GDP  1995   10.0
1        USA      Inf  1995  100.0
2        USA    Trade  1995  200.0
3        GER      GDP  1995    8.0
4        GER      Inf  1995  100.0
5        GER    Trade  1995  150.0
6        USA      GDP  1996   11.0
[...]

Then we set the index and unstack:

In [166]: df2 = df2.set_index(["Countries", "Years", "variable"]).unstack().reset_index()

In [167]: df2
Out[167]: 
         Countries Years value           
variable                   GDP  Inf Trade
0              GER  1995   8.0  100   150
1              GER  1996   9.0  105   156
2              GER  1997   9.5  107   149
3              GER  1998  10.0  109   165
4              GER  1999  10.5  111   167
5              USA  1995  10.0  100   200
6              USA  1996  11.0  120   220
7              USA  1997  12.0  130   210
8              USA  1998  12.0  120   235
9              USA  1999  13.0  110   250

This is almost what we want, but the columns are too complicated. We can fix that, though:

In [168]: df2.columns
Out[168]: 
MultiIndex(levels=[['value', 'Years', 'Countries'], ['GDP', 'Inf', 'Trade', '']],
           labels=[[2, 1, 0, 0, 0], [3, 3, 0, 1, 2]],
           names=[None, 'variable'])

In [169]: df2.columns = [x[1] if x[1] else x[0] for x in df2.columns]

In [170]: df2
Out[170]: 
  Countries Years   GDP  Inf  Trade
0       GER  1995   8.0  100    150
1       GER  1996   9.0  105    156
2       GER  1997   9.5  107    149
3       GER  1998  10.0  109    165
4       GER  1999  10.5  111    167
5       USA  1995  10.0  100    200
6       USA  1996  11.0  120    220
7       USA  1997  12.0  130    210
8       USA  1998  12.0  120    235
9       USA  1999  13.0  110    250
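
For reference, the melt and unstack steps can be put together into one self-contained sketch that also covers loading and saving. It assumes the file is whitespace-separated (hence sep=r"\s+") and uses an in-memory SAMPLE string in place of the real file. As a small variation on the steps above, it selects the "value" column before unstacking, which yields flat column names directly and makes the final column-renaming step unnecessary.

```python
import io

import pandas as pd

# Stand-in for the real file: same values as in the question.
SAMPLE = """\
Countries  variable  1995   1996  1997  1998   1999
USA        GDP      10     11    12    12     13
USA        Inf      100    120   130   120    110
USA        Trade    200    220   210   235    250
GER        GDP      8      9     9.5   10     10.5
GER        Inf      100    105   107   109    111
GER        Trade    150    156   149   165    167
"""

df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+")

# Wide to long: one row per (country, variable, year).
long = pd.melt(df, id_vars=["Countries", "variable"], var_name="Years")

# Long to wide again, this time with one column per variable.
wide = (
    long.set_index(["Countries", "Years", "variable"])["value"]
    .unstack("variable")
    .reset_index()
)
wide.columns.name = None  # drop the leftover "variable" label on the columns

# to_csv returns the CSV text when no path is given.
csv_text = wide.to_csv(index=False)
print(csv_text)
```

As in the output above, unstack sorts the index, so GER comes before USA.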

1
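I copied and pasted your data into a spreadsheet. The renaming may not be necessary, but I thought the column name "variable" would cause an error. I also have not checked whether this is the most memory-efficient approach.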
import pandas as pd
import numpy as np
df = pd.read_excel('df_countries.xls','Sheet1')
df.columns=['countries','var','1995','1996','1997','1998','1999']
df_new  = pd.melt(df,id_vars=['countries','var'])
df_new.columns = ['countries','var','year','data']
df_new.set_index(['countries','year','var']).unstack('var')


