我尝试使用Andrea的答案来研究一些巨大的CSV文件,但几乎每个地方都出现了多对多的情况 - 即使是我确定是1对1的列也是如此。问题在于重复数据。
这里是稍微修改过的版本,附带演示,并采用与数据库术语相匹配的格式(并提供描述以消除歧义)。
首先是一个更清晰的例子。
医生开具许多处方,每个处方可以开具多种药物,但每种药物只由一个制造商生产,每个制造商只生产一种药物。
doctor prescription drug producer
0 Doctor Who 1 aspirin Bayer
1 Dr Welby 2 aspirin Bayer
2 Dr Oz 3 aspirin Bayer
3 Doctor Who 4 paracetamol Tylenol
4 Dr Welby 5 paracetamol Tylenol
5 Dr Oz 6 antibiotics Merck
6 Doctor Who 7 aspirin Bayer
以下是我的函数返回的正确结果
对Andrea函数主要作以下更改:
- 对于成对变量使用drop_duplicates处理,以避免1-1被视为多-多关系
- 将结果放入数据框report_df中(在函数中查看),使结果更易读
- 反转逻辑以匹配UML术语(我不涉及set与UML的争论——这只是我想要的方式)
column 1 column 2 cardinality description
0 doctor prescription 1-to-many each doctor has many prescriptions (some had 3)
1 doctor drug many-to-many doctors had up to 2 drugs, and drugs up to 3 d...
2 doctor producer many-to-many doctors had up to 2 producers, and producers u...
3 prescription doctor many-to-1 many prescriptions (max 3) to 1 doctor
4 prescription drug many-to-1 many prescriptions (max 4) to 1 drug
5 prescription producer many-to-1 many prescriptions (max 4) to 1 producer
6 drug doctor many-to-many drugs had up to 3 doctors, and doctors up to 2...
7 drug prescription 1-to-many each drug has many prescriptions (some had 4)
8 drug producer 1-to-1 1 drug has 1 producer and vice versa
9 producer doctor many-to-many producers had up to 3 doctors, and doctors up ...
10 producer prescription 1-to-many each producer has many prescriptions (some ha...
11 producer drug 1-to-1 1 producer has 1 drug and vice versa
没有去重的错误结果
这些结果基于我修改过的Andrea算法,但没有进行去重。
您可以看到最后一行 - 医生到药品 - 是多对多的,而应该是1对1的 - 这解释了我的初始结果(在有1000多条记录的情况下难以调试)。
column 1 column 2 cardinality description
0 doctor prescription 1-to-many each doctor has many prescriptions (some had 3)
1 doctor drug many-to-many doctors had up to 3 drugs, and drugs up to 4 d...
2 doctor producer many-to-many doctors had up to 3 producers, and producers u...
3 prescription doctor many-to-1 many prescriptions (max 3) to 1 doctor
4 prescription drug many-to-1 many prescriptions (max 4) to 1 drug
5 prescription producer many-to-1 many prescriptions (max 4) to 1 producer
6 drug doctor many-to-many drugs had up to 4 doctors, and doctors up to 3...
7 drug prescription 1-to-many each drug has many prescriptions (some had 4)
8 drug producer many-to-many drugs had up to 4 producers, and producers up ...
9 producer doctor many-to-many producers had up to 4 doctors, and doctors up ...
10 producer prescription 1-to-many each producer has many prescriptions (some ha...
11 producer drug many-to-many producers had up to 4 drugs, and drugs up to 4...
新功能
from itertools import product
import pandas as pd
def get_relation(df, col1, col2):
first_max = df[[col1, col2]].drop_duplicates().groupby(col1).count().max()[0]
second_max = df[[col1, col2]].drop_duplicates().groupby(col2).count().max()[0]
if first_max==1:
if second_max==1:
return '1-to-1', f'1 {col1} has 1 {col2} and vice versa'
else:
return 'many-to-1',f'many {col1}s (max {second_max}) to 1 {col2}'
else:
if second_max==1:
return '1-to-many', f'each {col1} has many {col2}s (some had {first_max})'
else:
return f'many-to-many', f'{col1}s had up to {first_max} {col2}s, and {col2}s up to {second_max} {col1}s'
def report_relations(df):
report = []
for col_i, col_j in product(df.columns, df.columns):
if col_i == col_j:
continue
relation = get_relation(df, col_i, col_j)
report.append([col_i, col_j, *relation])
report_df = pd.DataFrame(report, columns=["column 1", "column 2", "cardinality", "description"])
pd.set_option('display.max_columns', 1000, 'display.width', 1000, 'display.max_rows',1000)
display(report_df)
print(report_df)
test_df = pd.DataFrame({
'doctor': ['Doctor Who', 'Dr Welby', 'Dr Oz','Doctor Who', 'Dr Welby', 'Dr Oz', 'Doctor Who'],
'prescription': [1, 2, 3, 4, 5, 6, 7],
'drug': [ 'aspirin', 'aspirin', 'aspirin', 'paracetemol', 'paracetemol', 'antibiotics', 'aspirin'],
'producer': [ 'Bayer', 'Bayer', 'Bayer', 'Tylenol', 'Tylenol', 'Merck', 'Bayer']
})
display(test_df)
print(test_df)
report_relations(test_df)
谢谢Andrea - 这对我很有帮助。