Python: sum a column from multiple CSV files, weekly averages and per-place averages


I'm new to Python and have two csv files in a directory.

file1.csv

Id place,Date and hour, Value
1,2018.09.17.12.54,200000
2,2018.09.18.14.16,150000
1,2018.09.19.15.06,78000
3,2018.09.17.16.26,110000
2,2018.09.20.13.54,200000
3,2018.09.17.14.16,150000
1,2018.09.21.12.54,200000

file2.csv

Id place,Date and hour, Value
1,2018.09.24.12.54,200000
3,2018.09.24.14.16,150000
1,2018.09.24.15.06,78000
2,2018.09.26.16.26,110000
1,2018.09.27.12.54,200000
3,2018.09.25.14.16,150000
1,2018.09.28.12.54,200000
3,2018.09.28.14.16,150000

I want to read all the csv files in the directory and save the following information in a new csv file:

  • Sum of the Value column from the files, per Id place

Output

Id place, Value
1, 1 156 000
2, 460 000
3, 710 000
  • Average Value per week

Output

Week, average Value
1 ,  155428,57   (1088000 / 7)
2 ,  154750   (1238000 / 8)
  • Average Value per place and per week

Output

Id place,Week, average Value
1, 1 , 159 333  (478000 / 3)
2, 1 , 175 000  (350000 / 2)
3, 1 , 130 000  (260 000/ 2)
1, 2 , 169 500  (678000 / 4) 
2, 2 , 110 000  (110000 / 1)
3, 2 , 150 000  (450000 / 3)

I don't know how to do this. Thanks in advance.


It sounds like you already have a plan: get the data into Python and do the math. Maybe take it step by step, first try reading the .csv files into Python, then go from there. - sniperd
Hi Lukasz, welcome to SO. You could try Python libraries such as csv or pandas, especially the latter, and show your work. It's fine if your code doesn't work, but it gives us somewhere to start helping you. - amanb
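
As a concrete first step along the lines of these comments, here is a minimal sketch that only reads one of the question's files with the standard csv module and prints its rows:

import csv

# read file1.csv from the question and print every data row
with open('file1.csv', newline='') as f:
    rows = csv.reader(f)
    next(rows)  # skip the header line
    for id_place, date_hour, value in rows:
        print(id_place, date_hour, int(value))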
2 Answers


I suggest using pandas:

import glob
import pandas as pd

#get all files
files = glob.glob('files/*.csv')
#create list of DataFrames, if necessary remove trailing whitespace in csv headers
dfs = [pd.read_csv(fp).rename(columns=lambda x: x.strip()) for fp in files]
#join together all files
df = pd.concat(dfs, ignore_index=True)

#convert column to datetimes
df['Date and hour'] = pd.to_datetime(df['Date and hour'], format='%Y.%m.%d.%H.%M')
#convert to ISO week numbers and use factorize so the week numbering starts at 1
df['week'] = pd.factorize(df['Date and hour'].dt.weekofyear)[0] + 1
print (df)
    Id place       Date and hour   Value  week
0          1 2018-09-17 12:54:00  200000     1
1          2 2018-09-18 14:16:00  150000     1
2          1 2018-09-19 15:06:00   78000     1
3          3 2018-09-17 16:26:00  110000     1
4          2 2018-09-20 13:54:00  200000     1
5          3 2018-09-17 14:16:00  150000     1
6          1 2018-09-21 12:54:00  200000     1
7          1 2018-09-24 12:54:00  200000     2
8          3 2018-09-24 14:16:00  150000     2
9          1 2018-09-24 15:06:00   78000     2
10         2 2018-09-26 16:26:00  110000     2
11         1 2018-09-27 12:54:00  200000     2
12         3 2018-09-25 14:16:00  150000     2
13         1 2018-09-28 12:54:00  200000     2
14         3 2018-09-28 14:16:00  150000     2
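
A note on the week column above: pd.factorize numbers the distinct week values in order of first appearance, so week 38 only becomes 1 because its rows come first in the concatenated data; if the later week appeared first it would get 1 instead. A tiny illustration:

#factorize assigns 1, 2, ... by order of first appearance
print (pd.factorize(pd.Series([38, 38, 39, 38]))[0] + 1)
[1 1 2 1]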

#aggregate sum
df1 = df.groupby('Id place', as_index=False)['Value'].sum()
print (df1)
   Id place    Value
0         1  1156000
1         2   460000
2         3   710000

#aggregate mean
df2 = df.groupby('week', as_index=False)['Value'].mean()
print (df2)
   week          Value
0     1  155428.571429
1     2  154750.000000

#aggregate mean per Id place and week
df3 = df.groupby(['Id place','week'], as_index=False)['Value'].mean()
print (df3)

   Id place  week          Value
0         1     1  159333.333333
1         1     2  169500.000000
2         2     1  175000.000000
3         2     2  110000.000000
4         3     1  130000.000000
5         3     2  150000.000000

#write output DataFrames to files
df1.to_csv('out1.csv', index=False)
df2.to_csv('out2.csv', index=False)
df3.to_csv('out3.csv', index=False)
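
A compatibility note: Series.dt.weekofyear is deprecated in newer pandas releases (1.1+); there the same week column can be built from isocalendar():

#equivalent on newer pandas, where weekofyear is deprecated
df['week'] = pd.factorize(df['Date and hour'].dt.isocalendar().week)[0] + 1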

@jezrael This is really nice. - RoadRunner

Definitely not recommended over pandas, which is clearly the better approach, but a manual implementation can use defaultdicts to group the items and do the calculations at the end. Demo:
from csv import reader
from os import listdir
from collections import defaultdict
from datetime import datetime
from operator import itemgetter
from pprint import pprint

# Collect sums first in a defaultdict
sums = defaultdict(list)

# Collect dates separately since they are more complicated
dates = []

# Get all csv files and open them
for file in listdir("."):
    if file.endswith(".csv"):
        with open(file) as f:
            csv_reader = reader(f)

            # Skip headers
            next(csv_reader)

            # Separately get sums and dates stuff
            for place, date, value in csv_reader:
                sums[int(place)].append(int(value))
                dates.append(
                    (place, datetime.strptime(date, "%Y.%m.%d.%H.%M"), int(value))
                )

# Print out sum of columns
sum_column_values = {k: sum(v) for k, v in sums.items()}
pprint(sum_column_values)

# Week number of the earliest date, used as a baseline so weeks start at 1
min_date = min(map(itemgetter(1), dates)).date().isocalendar()[1]

# Collect weeks stuff in separate dicts
weeks = defaultdict(list)
place_weeks = defaultdict(list)

for place, date, value in dates:

    # Weeknumber calculation
    week_number = date.date().isocalendar()[1] - min_date + 1

    # Collect week stuff
    weeks[week_number].append(value)
    place_weeks[int(place), week_number].append(value)

# Print out week averages
week_averages = {k: sum(v) / len(v) for k, v in weeks.items()}
pprint(week_averages)

# Print out place/week averages
place_week_averages = {k: sum(v) / len(v) for k, v in place_weeks.items()}
pprint(place_week_averages)

Here are the results stored in the different dictionaries:
# sums per place
{1: 1156000, 2: 460000, 3: 710000}

# week averages
{1: 155428.57142857142, 2: 154750.0}

# place/week averages
{(1, 1): 159333.33333333334,
 (1, 2): 169500.0,
 (2, 1): 175000.0,
 (2, 2): 110000.0,
 (3, 1): 130000.0,
 (3, 2): 150000.0}
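
The question also asks to save the results to new csv files; the dictionaries above can be written out with csv.writer, for example like this (reusing the out1.csv/out2.csv/out3.csv names from the pandas answer):

from csv import writer

# per-place sums -> out1.csv
with open("out1.csv", "w", newline="") as f:
    w = writer(f)
    w.writerow(["Id place", "Value"])
    w.writerows(sorted(sum_column_values.items()))

# weekly averages -> out2.csv
with open("out2.csv", "w", newline="") as f:
    w = writer(f)
    w.writerow(["Week", "average Value"])
    w.writerows(sorted(week_averages.items()))

# per-place weekly averages -> out3.csv
with open("out3.csv", "w", newline="") as f:
    w = writer(f)
    w.writerow(["Id place", "Week", "average Value"])
    w.writerows((p, wk, v) for (p, wk), v in sorted(place_week_averages.items()))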
