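I have a pandas DataFrame that contains information about people's locations over time. It has more than 300 million rows.
Here is a sample in which each Name is assigned a unique index via groupby, and the rows are sorted by Name and Year: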
import pandas as pd
inp = [{'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Orange county'}, {'Name': 'John', 'Year':2019, 'Address':'New York'}, {'Name': 'Steve', 'Year':2018, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2020, 'Address':'California'}, {'Name': 'Steve', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Beverly hills'}, {'Name': 'Steve', 'Year':2021, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'California'}, {'Name': 'Steve', 'Year':2018, 'Address':'NewYork'}, {'Name': 'Steve', 'Year':2018, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'NewYork'}]
df = pd.DataFrame(inp)
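# Give every distinct Name an integer id (column name matches the output below)
df['Name_Grouped_Index'] = df.groupby(['Name']).ngroup()
# sort_values returns a new frame, so keep the result
df = df.sort_values(['Name', 'Year'], ascending=[False, True])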
Output:
+-------+-------+------+---------------+----------------------+
| Index | Name  | Year | Address       | Name_Grouped_Index   |
+-------+-------+------+---------------+----------------------+
| 5     | Steve | 2018 | Canada        | 1                    |
+-------+-------+------+---------------+----------------------+
| 15    | Steve | 2018 | NewYork       | 1                    |
+-------+-------+------+---------------+----------------------+
| 16    | Steve | 2018 | California    | 1                    |
+-------+-------+------+---------------+----------------------+
| 6     | Steve | 2019 | Canada        | 1                    |
+-------+-------+------+---------------+----------------------+
| 7     | Steve | 2019 | Canada        | 1                    |
+-------+-------+------+---------------+----------------------+
| 8     | Steve | 2020 | California    | 1                    |
+-------+-------+------+---------------+----------------------+
| 9     | Steve | 2020 | Canada        | 1                    |
+-------+-------+------+---------------+----------------------+
| 13    | Steve | 2021 | California    | 1                    |
+-------+-------+------+---------------+----------------------+
| 14    | Steve | 2022 | California    | 1                    |
+-------+-------+------+---------------+----------------------+
| 17    | Steve | 2022 | NewYork       | 1                    |
+-------+-------+------+---------------+----------------------+
| 0     | John  | 2018 | Beverly hills | 0                    |
+-------+-------+------+---------------+----------------------+
| 1     | John  | 2018 | Beverly hills | 0                    |
+-------+-------+------+---------------+----------------------+
| 2     | John  | 2019 | Beverly hills | 0                    |
+-------+-------+------+---------------+----------------------+
| 3     | John  | 2019 | Orange county | 0                    |
+-------+-------+------+---------------+----------------------+
| 4     | John  | 2019 | New York      | 0                    |
+-------+-------+------+---------------+----------------------+
| 10    | John  | 2020 | Canada        | 0                    |
+-------+-------+------+---------------+----------------------+
| 11    | John  | 2021 | Canada        | 0                    |
+-------+-------+------+---------------+----------------------+
| 12    | John  | 2021 | Beverly hills | 0                    |
+-------+-------+------+---------------+----------------------+
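I want to build a network graph matrix (adjacency matrix) so that I can see the total number of moves between addresses. For example: how many people moved from "Canada" to "California" in 2018?
Ideal output:
1) A directed graph built directly from the Address column. Technically, this means converting the Address column into two columns, Source and Target, where each row's Target is the next row's Source. Ideally, repeated pairs would be counted in a separate Weight column rather than listed more than once.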
+------------+------------+------+--------+
| Source     | Target     | Year | Weight |
+------------+------------+------+--------+
| Canada     | NewYork    | 2018 |        |
+------------+------------+------+--------+
| NewYork    | California | 2018 |        |
+------------+------------+------+--------+
| California | Canada     | 2019 |        |
+------------+------------+------+--------+
| Canada     | Canada     | 2019 |        |
+------------+------------+------+--------+
| Canada     | California | 2020 |        |
+------------+------------+------+--------+
| California | Canada     | 2020 |        |
+------------+------------+------+--------+
| Canada     | California | 2021 |        |
+------------+------------+------+--------+
| California | California | 2022 |        |
+------------+------------+------+--------+
| California | NewYork    | 2022 |        |
+------------+------------+------+--------+
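
To make the Source/Target idea concrete, here is a minimal sketch of one way such an edge list could be produced. It assumes the rows are already ordered chronologically within each Name, and that a move is dated by the Year of the row it leads to (which is how the table above reads). The toy rows and the helper name `build_edges` are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy rows, already ordered by person and then by time.
df = pd.DataFrame({
    'Name':    ['Steve', 'Steve', 'Steve', 'John', 'John', 'John'],
    'Year':    [2019, 2020, 2020, 2019, 2020, 2021],
    'Address': ['Canada', 'California', 'Canada',
                'Canada', 'California', 'California'],
})

def build_edges(frame):
    """Pair each row's Address (Source) with the same person's next Address (Target)."""
    grouped = frame.groupby('Name', sort=False)
    edges = pd.DataFrame({
        'Source': frame['Address'],
        # shift(-1) within each Name never pairs one person's last row
        # with the next person's first row.
        'Target': grouped['Address'].shift(-1),
        # Assumption: the move is dated by the year of the target row.
        'Year': grouped['Year'].shift(-1),
    })
    # Each person's last row has no next address, so it is dropped.
    return edges.dropna(subset=['Target']).astype({'Year': int})

edges = build_edges(df)

# Collapse repeated (Source, Target, Year) triples into a Weight column.
weighted = (edges.groupby(['Source', 'Target', 'Year'], sort=False)
                 .size()
                 .reset_index(name='Weight'))
print(weighted)
```

Both steps are vectorised groupby operations, so no Python-level loop over rows is involved; whether that is fast enough for 300 million rows would still need to be measured.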
or
2) A matrix showing the total number of moves between addresses.
+---------------+--------+---------+------------+---------------+---------------+
| From \ To     | Canada | NewYork | California | Beverly hills | Orange county |
+---------------+--------+---------+------------+---------------+---------------+
| Canada        | 2      | 2       | 2          | 2             | 0             |
+---------------+--------+---------+------------+---------------+---------------+
| NewYork       | 1      | 0       | 1          | 0             | 0             |
+---------------+--------+---------+------------+---------------+---------------+
| California    | 2      | 1       | 1          | 0             | 0             |
+---------------+--------+---------+------------+---------------+---------------+
| Beverly hills | 0      | 0       | 0          | 2             | 1             |
+---------------+--------+---------+------------+---------------+---------------+
| Orange county | 0      | 1       | 0          | 0             | 0             |
+---------------+--------+---------+------------+---------------+---------------+
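
A matrix of this From \ To shape could be sketched with `pd.crosstab` over a Source/Target edge list. The edge rows below are hypothetical and stand in for whatever edge list is actually available; the reindex step is there only to make the matrix square, so that addresses appearing on just one side still get a row and a column.

```python
import pandas as pd

# Hypothetical edge list with one row per observed move.
edges = pd.DataFrame({
    'Source': ['Canada', 'California', 'Canada', 'California'],
    'Target': ['California', 'Canada', 'California', 'NewYork'],
})

# Count each (Source, Target) pair; rows are "from", columns are "to".
matrix = pd.crosstab(edges['Source'], edges['Target'])

# Put every address on both axes, filling pairs that never occur with 0.
places = sorted(set(edges['Source']) | set(edges['Target']))
matrix = matrix.reindex(index=places, columns=places, fill_value=0)
print(matrix)
```

A dense matrix like this is only practical when the number of distinct addresses is small, since its size grows with the square of that number; with many distinct addresses the weighted edge list is the more compact representation.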