Evaluating and removing duplicate dictionaries from a Python list

Question



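Business problem: I have a list of dictionaries that represents a given student's academic history... the courses they took, when they took them, what their grades were (a blank grade means the course is still in progress), etc. I need to find any repeated attempts at a course and keep only the attempt with the highest grade. Here is what I have tried so far: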

acad_hist = [{'crse_id': u'GRG 302P0', 'grade': u''}, {'crse_id': u'URB 3010', 'grade': u'B+'},
             {'crse_id': u'GRG 302P0', 'grade': u'D'}]

grade_list = ['CR', 'D-', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']

At first I tried to loop through the acad_hist list and add any classes not yet seen to a "seen" list. The plan was that when I came across a class that had already been added to the "seen" list, I would go back to the acad_hist list, grab the details (e.g. "grade") of that class, compare the grades, and remove the class with the lower grade from the acad_hist list. The problem is that I'm having a tough time going back and "grabbing" the earlier class from the "seen" list, and even more difficulty correctly pointing to it once I know I need to delete it from the acad_hist list. The code is a mess, but here is what I have so far:
```
key = 'crse_id'
seen = []  # course ids encountered so far
for index, course in enumerate(acad_hist[:]):
    if course[key] not in seen:
        seen.append(course[key])
    else:
        logger.info('found duplicate {0} at index {1}'.format(course[key], index))
        # not sure what to do here...
```
OUTPUT:
```
found duplicate GRG 302P0 at index 11
```
So then I thought I might be able to use the set() function to cull the list for me, but the problem here is that I need to choose which class instance to keep and set() doesn’t seem to allow me a way to do that.
```
names = set(d['compressed_hist_crse_id'] for d in acad_hist_condensed)
logger.info('TEST names: {0}'.format(names))
```
OUTPUT:
```
TEST names: set([u'GRG 302P0', u'URB 3010'])
```

Wanting to see if I could build on the set() approach above, I thought I'd do some "belt-n-suspenders" looping through the set() output in "names" and collect a grade for each course. It's working, but I don't pretend to fully understand what it's doing, nor does it really allow me to do the processing I need to do.

new_dicts = []
for name in names:
    d = dict(name=name)
    d['grade'] = max(d['grade'] for d in acad_hist if d['crse_id'] == name)
    new_dicts.append(d)
logger.info('TEST new_dicts: {0}'.format(new_dicts))

OUTPUT:

TEST new_dicts: [{'grade': u'', 'name': u'GRG 302P0'}, {'grade': u'B+', 'name': u'URB 3010'}]
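One likely reason the max() call above does not do the needed processing: applied to grade strings, max() compares them alphabetically, not by academic value. The sketch below illustrates the difference under that assumption. The names `grade_order`, `rank` and `attempts` are made up for the illustration and are not part of the code above.

```python
# Grades in ascending order of value (same ordering as grade_list above).
grade_order = ['CR', 'D-', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']

# Hypothetical helper: map each grade to its position in grade_order.
rank = {g: i for i, g in enumerate(grade_order)}

# Hypothetical sample of grades earned on repeated attempts at one course.
attempts = ['B+', 'A-', 'D']

# Plain max() compares the strings character by character, so 'D' wins
# because 'D' sorts after 'B' and 'A'.
alphabetical_best = max(attempts)

# With a key function, max() compares positions in grade_order instead.
ranked_best = max(attempts, key=rank.__getitem__)

print(alphabetical_best)  # D
print(ranked_best)        # A-
```

Passing a key function leaves the data untouched and only changes how max() ranks the candidates.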

Can anyone supply the missing pieces, or suggest a better approach?

UPDATE -- the solution I ended up with (adapted from ideas I got from the accepted answer)

import logging


def scrub_for_duplicate_courses(acad_hist_condensed, acad_hist_list):
    """
    Looks for duplicate courses that may have been taken, and if any are found, will look for the one with the highest
    grade and keep that one, deleting the other course from the lists before returning them.
    """

    # -------------------------------------------
    # set logging params
    # -------------------------------------------
    logger = logging.getLogger(__name__)

    # -----------------------------------------------------------------------------------------------------
    # the grade_list is in order of ascending priority/value...a blank grade indicates "in-progress", and
    # will therefore replace any class instance that has a grade.
    # -----------------------------------------------------------------------------------------------------
    grade_list = ['CR', 'D-', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+', '']
    # converting the grade_list in to a more efficient, weighted dict
    grade_list = dict(zip(grade_list, range(len(grade_list))))

    seen_courses = {}

    # iterate over a copy, because records are removed from the original list below
    for course in acad_hist_condensed[:]:
        # -------------------------------------------------------------------------------------------------
        # one of the two keys checked for below should exist in the list, but not both
        # -------------------------------------------------------------------------------------------------
        key = ''
        if 'compressed_hist_crse_id' in course:
            key = 'compressed_hist_crse_id'
        elif 'compressed_ovrd_crse_id' in course:
            key = 'compressed_ovrd_crse_id'

        cid = course[key]
        grade = course['grade']

        if cid not in seen_courses:
            seen_courses[cid] = grade
        else:
            # ---------------------------------------------------------------------------------------------
            # if we get here, a duplicate course_id has been found in the acad_hist_condensed list, so now
            # we'll want to determine which one has the lowest grade, and remove that course instance from
            # both lists.
            # ---------------------------------------------------------------------------------------------
            if grade_list.get(seen_courses[cid], 0) < grade_list.get(grade, 0):
                # remember the lower (earlier) grade BEFORE overlaying it in seen_courses
                grade_for_rec_to_remove = seen_courses[cid]
                seen_courses[cid] = grade
            else:
                grade_for_rec_to_remove = grade
            crse_id_for_rec_to_remove = cid

            # ---------------------------------------------------------------------------------------------
            # find the rec in acad_hist_condensed that needs removal
            # ---------------------------------------------------------------------------------------------
            for rec in acad_hist_condensed:
                # .get() because other records may carry the other course-id key
                if rec.get(key) == crse_id_for_rec_to_remove and rec['grade'] == grade_for_rec_to_remove:
                    acad_hist_condensed.remove(rec)
                    break  # stop here: the list was just modified while being iterated
            for rec in acad_hist_list:
                if rec == crse_id_for_rec_to_remove:
                    acad_hist_list.remove(rec)
                    break  # just want to remove one occurrence

    return acad_hist_condensed, acad_hist_list

- KeithE

2 Answers


This can be done with iterator Lego (i.e. ifilter, sorted, groupby and max).

def find_best_grades(history):
    def course(course_grade):
        return course_grade['crse_id']
    def grade(course_grade):
        return GRADES[course_grade['grade']]
    def has_grade(course_grade):
        return bool(course_grade['grade'])

    # 1) Remove course records that have no grade yet.
    # 2) Sort the history so that records for the same course are
    #    consecutive (this allows groupby to work).
    # 3) Group records for the same course together.
    # 4) Use max to select the highest grade obtained for a course.

    return [max(course_grades, key=grade)
            for _, course_grades in
            groupby(sorted(ifilter(has_grade, history), key=course),
                    key=course)]

The tediously complete code:

from itertools import groupby, ifilter


COURSE_ID = 'crse_id'
GRADE = 'grade'

ACADEMIC_HISTORY = [
    {
        COURSE_ID: 'GRG 302P0',
        GRADE    : 'B',
    },
    {
        COURSE_ID: 'GRG 302P0',
        GRADE    : '',
    },
    {
        COURSE_ID: 'URB 3010',
        GRADE    : 'B+',
    },
    {
        COURSE_ID: 'GRG 302P0',
        GRADE    : 'D',
    },
]

GRADES = [
    'CR',
    'D-',
    'D' ,
    'D+',
    'C-',
    'C' ,
    'C+',
    'B-',
    'B' ,
    'B+',
    'A-',
    'A' ,
    'A+',
]

GRADES = dict(zip(GRADES, range(len(GRADES))))


def find_best_grades(history):
    def course(course_grade):
        return course_grade['crse_id']
    def grade(course_grade):
        return GRADES[course_grade['grade']]
    def has_grade(course_grade):
        return bool(course_grade['grade'])

    # 1) Remove course records that have no grade yet.
    # 2) Sort the history so that records for the same course are
    #    consecutive (this allows groupby to work).
    # 3) Group records for the same course together.
    # 4) Use max to select the highest grade obtained for a course.

    return [max(course_grades, key=grade)
            for _, course_grades in
            groupby(sorted(ifilter(has_grade, history), key=course),
                    key=course)]

best_grades = find_best_grades(ACADEMIC_HISTORY)
print best_grades
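
The code above targets Python 2 (`itertools.ifilter`, the `print` statement). A minimal sketch of how the same filter, sort, group and max pipeline might look in Python 3 follows. It is an assumed port, not part of the original answer. `find_best_grades_py3`, `GRADE_ORDER` and `GRADE_RANK` are names invented for this sketch, and a list comprehension stands in for `ifilter`, which no longer exists in Python 3.

```python
from itertools import groupby

# Grades in ascending order of value, mapped to their rank.
GRADE_ORDER = ['CR', 'D-', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']
GRADE_RANK = {g: i for i, g in enumerate(GRADE_ORDER)}


def find_best_grades_py3(history):
    def course(rec):
        return rec['crse_id']

    def rank(rec):
        return GRADE_RANK[rec['grade']]

    # 1) Drop records with a blank grade (course still in progress).
    graded = [rec for rec in history if rec['grade']]
    # 2) Sort so that records for the same course are consecutive,
    #    which groupby requires.
    graded.sort(key=course)
    # 3) Group by course and 4) keep the highest-ranked record per group.
    return [max(group, key=rank) for _, group in groupby(graded, key=course)]


history = [
    {'crse_id': 'GRG 302P0', 'grade': 'B'},
    {'crse_id': 'GRG 302P0', 'grade': ''},
    {'crse_id': 'URB 3010', 'grade': 'B+'},
    {'crse_id': 'GRG 302P0', 'grade': 'D'},
]

best = find_best_grades_py3(history)
print(best)
```

Like the original, this drops in-progress records entirely, so a course whose only attempt has a blank grade will not appear in the result.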

- Peter Sutton

- Charles · Accepted Answer

A simple solution is to loop over each student's course history and compute the highest grade for each course...

acad_hist = [{'crse_id': u'GRG 302P0', 'grade': u''}, {'crse_id': u'URB 3010', 'grade': u'B+'}, {'crse_id': u'GRG 302P0', 'grade': u'D'}]

grade_list = ['CR', 'D-', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']
#let's turn grade_list into something more efficient:
grade_list = dict(zip(grade_list, range(len(grade_list)))) # 'CR' == 0, 'D-' == 1

courses = {} # keys will be crse_id, values will be grade.
for course in acad_hist:
    cid = course['crse_id']
    g = course['grade']
    if cid not in courses:
        courses[cid] = g 
    else:
        if grade_list.get(courses[cid], 0) < grade_list.get(g,0):
            courses[cid] = g

The resulting courses dict would be:

{u'GRG 302P0': u'D', u'URB 3010': u'B+'}

If needed, it can be converted back to the original form.
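
One way that conversion back might look, as a minimal sketch: it assumes the original form only needs the 'crse_id' and 'grade' keys, and the name `acad_hist_best` is made up for the example. The literal `courses` dict repeats the result shown above so that the snippet runs on its own.

```python
# Result of the loop above: course id -> best grade.
courses = {'GRG 302P0': 'D', 'URB 3010': 'B+'}

# Rebuild a list of dicts in the same shape as acad_hist.
# Sorting by course id only makes the output order predictable.
acad_hist_best = [
    {'crse_id': cid, 'grade': grade}
    for cid, grade in sorted(courses.items())
]

print(acad_hist_best)
# [{'crse_id': 'GRG 302P0', 'grade': 'D'}, {'crse_id': 'URB 3010', 'grade': 'B+'}]
```

Any other fields on the original records (for example, when the course was taken) are lost in this round trip. Keeping them would require storing the whole record in `courses` rather than just the grade.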