Python: sum a column from multiple CSV files, weekly averages and per-place averages


I'm new to Python and have two csv files in a directory.

file1.csv

Id place,Date and hour, Value
1,2018.09.17.12.54,200000
2,2018.09.18.14.16,150000
1,2018.09.19.15.06,78000
3,2018.09.17.16.26,110000
2,2018.09.20.13.54,200000
3,2018.09.17.14.16,150000
1,2018.09.21.12.54,200000

file2.csv

Id place,Date and hour, Value
1,2018.09.24.12.54,200000
3,2018.09.24.14.16,150000
1,2018.09.24.15.06,78000
2,2018.09.26.16.26,110000
1,2018.09.27.12.54,200000
3,2018.09.25.14.16,150000
1,2018.09.28.12.54,200000
3,2018.09.28.14.16,150000

I want to read all the csv files in the directory and save the following information in a new csv file:

  • Sum of the Value column from the files, per Id place

Output

Id place, Value
1, 1 156 000
2, 460 000
3, 710 000
  • Average Value per week

Output

Week, average Value
1 ,  155428,57   (1088000 / 7)
2 ,  154750   (1238000 / 8)
  • Average Value per place and per week

Output

Id place,Week, average Value
1, 1 , 159 333  (478000 / 3)
2, 1 , 175 000  (350000 / 2)
3, 1 , 130 000  (260 000/ 2)
1, 2 , 169 500  (678000 / 4) 
2, 2 , 110 000  (110000 / 1)
3, 2 , 150 000  (450000 / 3)

I don't know how to do this. Thanks in advance.


It sounds like you already have a plan: get the data into Python and do the math. Maybe take it step by step, first try reading the .csv files into Python, then go from there. - sniperd
Hi Lukasz, welcome to SO. You could try Python libraries such as csv or pandas, especially the latter, and show your work. It's fine if your code doesn't work, but it gives us somewhere to start helping you. - amanb
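
As a concrete first step along the lines of these comments, here is a minimal sketch that only reads one of the question's files with the standard csv module and prints its rows:

import csv

# read file1.csv from the question and print every data row
with open('file1.csv', newline='') as f:
    rows = csv.reader(f)
    next(rows)  # skip the header line
    for id_place, date_hour, value in rows:
        print(id_place, date_hour, int(value))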
2 Answers


I suggest using pandas:

import glob
import pandas as pd

#get all files
files = glob.glob('files/*.csv')
#create list of DataFrames, if necessary remove trailing whitespace in csv headers
dfs = [pd.read_csv(fp).rename(columns=lambda x: x.strip()) for fp in files]
#join together all files
df = pd.concat(dfs, ignore_index=True)

#convert column to datetimes
df['Date and hour'] = pd.to_datetime(df['Date and hour'], format='%Y.%m.%d.%H.%M')
#convert to ISO week numbers and use factorize so the week numbering starts at 1
df['week'] = pd.factorize(df['Date and hour'].dt.weekofyear)[0] + 1
print (df)
    Id place       Date and hour   Value  week
0          1 2018-09-17 12:54:00  200000     1
1          2 2018-09-18 14:16:00  150000     1
2          1 2018-09-19 15:06:00   78000     1
3          3 2018-09-17 16:26:00  110000     1
4          2 2018-09-20 13:54:00  200000     1
5          3 2018-09-17 14:16:00  150000     1
6          1 2018-09-21 12:54:00  200000     1
7          1 2018-09-24 12:54:00  200000     2
8          3 2018-09-24 14:16:00  150000     2
9          1 2018-09-24 15:06:00   78000     2
10         2 2018-09-26 16:26:00  110000     2
11         1 2018-09-27 12:54:00  200000     2
12         3 2018-09-25 14:16:00  150000     2
13         1 2018-09-28 12:54:00  200000     2
14         3 2018-09-28 14:16:00  150000     2
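
A note on the week column above: pd.factorize numbers the distinct week values in order of first appearance, so week 38 only becomes 1 because its rows come first in the concatenated data; if the later week appeared first it would get 1 instead. A tiny illustration:

#factorize assigns 1, 2, ... by order of first appearance
print (pd.factorize(pd.Series([38, 38, 39, 38]))[0] + 1)
[1 1 2 1]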

#aggregate sum
df1 = df.groupby('Id place', as_index=False)['Value'].sum()
print (df1)
   Id place    Value
0         1  1156000
1         2   460000
2         3   710000

#aggregate mean
df2 = df.groupby('week', as_index=False)['Value'].mean()
print (df2)
   week          Value
0     1  155428.571429
1     2  154750.000000

#aggregate mean per Id place and week
df3 = df.groupby(['Id place','week'], as_index=False)['Value'].mean()
print (df3)

   Id place  week          Value
0         1     1  159333.333333
1         1     2  169500.000000
2         2     1  175000.000000
3         2     2  110000.000000
4         3     1  130000.000000
5         3     2  150000.000000

#write output DataFrames to files
df1.to_csv('out1.csv', index=False)
df2.to_csv('out2.csv', index=False)
df3.to_csv('out3.csv', index=False)
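
A compatibility note: Series.dt.weekofyear is deprecated in newer pandas releases (1.1+); there the same week column can be built from isocalendar():

#equivalent on newer pandas, where weekofyear is deprecated
df['week'] = pd.factorize(df['Date and hour'].dt.isocalendar().week)[0] + 1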

@jezrael This is really nice. - RoadRunner

Definitely not recommended over pandas, which is clearly the better approach, but a manual implementation can use defaultdicts to group the items and do the calculations at the end. Demo:
from csv import reader
from os import listdir
from collections import defaultdict
from datetime import datetime
from operator import itemgetter
from pprint import pprint

# Collect sums first in a defaultdict
sums = defaultdict(list)

# Collect dates separately since they are more complicated
dates = []

# Get all csv files and open them
for file in listdir("."):
    if file.endswith(".csv"):
        with open(file) as f:
            csv_reader = reader(f)

            # Skip headers
            next(csv_reader)

            # Separately get sums and dates stuff
            for place, date, value in csv_reader:
                sums[int(place)].append(int(value))
                dates.append(
                    (place, datetime.strptime(date, "%Y.%m.%d.%H.%M"), int(value))
                )

# Print out sum of columns
sum_column_values = {k: sum(v) for k, v in sums.items()}
pprint(sum_column_values)

# Week number of the earliest date, used as a baseline so weeks start at 1
min_date = min(map(itemgetter(1), dates)).date().isocalendar()[1]

# Collect weeks stuff in separate dicts
weeks = defaultdict(list)
place_weeks = defaultdict(list)

for place, date, value in dates:

    # Weeknumber calculation
    week_number = date.date().isocalendar()[1] - min_date + 1

    # Collect week stuff
    weeks[week_number].append(value)
    place_weeks[int(place), week_number].append(value)

# Print out week averages
week_averages = {k: sum(v) / len(v) for k, v in weeks.items()}
pprint(week_averages)

# Print out place/week averages
place_week_averages = {k: sum(v) / len(v) for k, v in place_weeks.items()}
pprint(place_week_averages)

Here are the results stored in the different dictionaries:
# sums per place
{1: 1156000, 2: 460000, 3: 710000}

# week averages
{1: 155428.57142857142, 2: 154750.0}

# place/week averages
{(1, 1): 159333.33333333334,
 (1, 2): 169500.0,
 (2, 1): 175000.0,
 (2, 2): 110000.0,
 (3, 1): 130000.0,
 (3, 2): 150000.0}
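
The question also asks to save the results to new csv files; the dictionaries above can be written out with csv.writer, for example like this (reusing the out1.csv/out2.csv/out3.csv names from the pandas answer):

from csv import writer

# per-place sums -> out1.csv
with open("out1.csv", "w", newline="") as f:
    w = writer(f)
    w.writerow(["Id place", "Value"])
    w.writerows(sorted(sum_column_values.items()))

# weekly averages -> out2.csv
with open("out2.csv", "w", newline="") as f:
    w = writer(f)
    w.writerow(["Week", "average Value"])
    w.writerows(sorted(week_averages.items()))

# per-place weekly averages -> out3.csv
with open("out3.csv", "w", newline="") as f:
    w = writer(f)
    w.writerow(["Id place", "Week", "average Value"])
    w.writerows((p, wk, v) for (p, wk), v in sorted(place_week_averages.items()))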
