Evaluating and removing duplicate dictionaries from a Python list

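Business problem: I have a list of dictionaries representing a given student's academic history... the courses they took, when they took them, what their grades were (blank means the course is still in progress), etc. I need to find any repeated attempts at a course and keep only the attempt with the highest grade. What I have tried so far: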
acad_hist = [{'crse_id': u'GRG 302P0', 'grade': u''}, {'crse_id': u'URB 3010', 'grade': u'B+'},
{'crse_id': u'GRG 302P0', 'grade': u'D'}]

grade_list = ['CR', 'D-', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']
  1. At first I tried to loop through the acad_hist list and add any classes not yet seen to a “seen” list. The plan was that when I came across a class already in the “seen” list, I would go back to the acad_hist list, grab the details (e.g. "grade") of that class, compare the grades, and remove the class with the lower grade from acad_hist. The problem is that I’m having a tough time going back and “grabbing” the earlier-seen class, and even more difficulty correctly pointing to it once I know I need to delete it from the acad_hist list. The code is a mess, but here is what I have so far:

    seen = []
    key = 'crse_id'
    for index, course in enumerate(acad_hist[:]):
        if course[key] not in seen:
            seen.append(course[key])
        else:
            logger.info('found duplicate {0} at index {1}'.format(course[key], index))
            < not sure what to do here… >
    

    OUTPUT:

    found duplicate GRG 302P0 at index 11
    
  2. So then I thought I might be able to use the set() function to cull the list for me, but the problem here is that I need to choose which class instance to keep and set() doesn’t seem to allow me a way to do that.

    names = set(d['compressed_hist_crse_id'] for d in acad_hist_condensed)
    logger.info('TEST names: {0}'.format(names))
    

    OUTPUT:

    TEST names: set([u'GRG 302P0', u'URB 3010'])
    
  3. Wanting to see if I could add to #2 above, I thought I’d do some “belt-n-suspenders” looping through the output of the set() “names” and collect a grade. It’s working, but I don’t pretend to fully understand what it’s doing, nor does it really allow me to do the processing I need to do.

    new_dicts = []
    for name in names:
        d = dict(name=name)
        d['grade'] = max(d['grade'] for d in acad_hist if d['crse_id'] == name)
        new_dicts.append(d)
    logger.info('TEST new_dicts: {0}'.format(new_dicts))
    

    OUTPUT:

    TEST new_dicts: [{'grade': u'', 'name': u'GRG 302P0'}, {'grade': u'B+', 'name': u'URB 3010'}]
    

Can anyone supply the missing pieces, or even a better approach?

UPDATE: the solution I ended up with (adapted from the ideas I got from the accepted answer)

import logging


def scrub_for_duplicate_courses(acad_hist_condensed, acad_hist_list):
    """
    Looks for duplicate courses that may have been taken, and if any are found, will look for the one with the highest
    grade and keep that one, deleting the other course from the lists before returning them.
    """

    # -------------------------------------------
    # set logging params
    # -------------------------------------------
    logger = logging.getLogger(__name__)

    # -----------------------------------------------------------------------------------------------------
    # the grade_list is in order of ascending priority/value...a blank grade indicates "in-progress", and
    # will therefore replace any class instance that has a grade.
    # -----------------------------------------------------------------------------------------------------
    grade_list = ['CR', 'D-', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+', '']
    # converting the grade_list into a more efficient, weighted dict
    grade_list = dict(zip(grade_list, range(len(grade_list))))

    seen_courses = {}

    for course in acad_hist_condensed[:]:
        # -----------------------------------------------------------------------------------------------------
        # one of the two keys checked for below should exist in the list, but not both
        # -----------------------------------------------------------------------------------------------------
        key = ''
        if 'compressed_hist_crse_id' in course:
            key = 'compressed_hist_crse_id'
        elif 'compressed_ovrd_crse_id' in course:
            key = 'compressed_ovrd_crse_id'

        cid = course[key]
        grade = course['grade']

        if cid not in seen_courses:
            seen_courses[cid] = grade
        else:
            # -------------------------------------------------------------------------------------------------
            # if we get here, a duplicate course_id has been found in the acad_hist_condensed list, so now
            # we'll want to determine which one has the lowest grade, and remove that course instance from
            # both lists.
            # -------------------------------------------------------------------------------------------------
            if grade_list.get(seen_courses[cid], 0) < grade_list.get(grade, 0):
                # capture the old (lower) grade BEFORE overlaying it, so the right record gets removed
                grade_for_rec_to_remove = seen_courses[cid]
                seen_courses[cid] = grade  # this will overlay the grade for the record already in seen_courses
            else:
                grade_for_rec_to_remove = grade
            crse_id_for_rec_to_remove = cid

            # -------------------------------------------------------------------------------------------------
            # find the rec in acad_hist_condensed that needs removal
            # -------------------------------------------------------------------------------------------------
            for rec in acad_hist_condensed:
                if rec[key] == crse_id_for_rec_to_remove and rec['grade'] == grade_for_rec_to_remove:
                    acad_hist_condensed.remove(rec)
                    break  # stop iterating: the list was just modified, and only one occurrence should go
            for rec in acad_hist_list:
                if rec == crse_id_for_rec_to_remove:
                    acad_hist_list.remove(rec)
                    break  # just want to remove one occurrence

    return acad_hist_condensed, acad_hist_list
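
For comparison, here is a minimal sketch of the same "keep the best attempt" idea that builds a new list instead of removing items from a list in place. It is not the code from the post: the function name keep_best_attempts, the single 'crse_id' key, and the sample data are assumptions made for illustration, and it is written for Python 3. It keeps the ordering used above, where a blank (in-progress) grade outranks every letter grade.

```python
def keep_best_attempts(records, key='crse_id'):
    """Return a new list holding one record per course: the best-ranked attempt."""
    # Same ordering as in the function above: '' (in progress) ranks highest.
    order = ['CR', 'D-', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+', '']
    rank = dict((g, i) for i, g in enumerate(order))

    best = {}  # course id -> index in `records` of the best attempt seen so far
    for index, rec in enumerate(records):
        cid = rec[key]
        if cid not in best:
            best[cid] = index
        elif rank.get(rec['grade'], 0) > rank.get(records[best[cid]]['grade'], 0):
            best[cid] = index

    keep = set(best.values())
    # Original order is preserved and the input list is left untouched.
    return [rec for index, rec in enumerate(records) if index in keep]


history = [
    {'crse_id': 'GRG 302P0', 'grade': ''},
    {'crse_id': 'URB 3010', 'grade': 'B+'},
    {'crse_id': 'GRG 302P0', 'grade': 'D'},
]
deduped = keep_best_attempts(history)
```

Because the indexes of the winners are recorded rather than the grades, there is no need to search the list again for the record to delete, which was the sticking point in the first attempt.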
2 Answers

A simple solution is to walk through each student's course history and compute the highest grade for each course...
acad_hist = [{'crse_id': u'GRG 302P0', 'grade': u''}, {'crse_id': u'URB 3010', 'grade': u'B+'}, {'crse_id': u'GRG 302P0', 'grade': u'D'}]

grade_list = ['CR', 'D-', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']
#let's turn grade_list into something more efficient:
grade_list = dict(zip(grade_list, range(len(grade_list)))) # 'CR' == 0, 'D-' == 1

courses = {} # keys will be crse_id, values will be grade.
for course in acad_hist:
    cid = course['crse_id']
    g = course['grade']
    if cid not in courses:
        courses[cid] = g 
    else:
        if grade_list.get(courses[cid], 0) < grade_list.get(g,0):
            courses[cid] = g 

The output will be:

{u'GRG 302P0': u'D', u'URB 3010': u'B+'}

If needed, it can be rewritten back into the original form.
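
One way that conversion back could look, as a small sketch: it starts from the courses dict shown in the output above and rebuilds a list of dictionaries with the same two keys. The name best_hist is an assumption for illustration, and sorting by course id is only there to make the order predictable.

```python
# The result of the loop above: course id -> best grade.
courses = {u'GRG 302P0': u'D', u'URB 3010': u'B+'}

# Rebuild a list of dicts in the shape of the original acad_hist.
best_hist = [{'crse_id': cid, 'grade': grade}
             for cid, grade in sorted(courses.items())]
```

Note that only the course id and the grade survive this round trip. If the real records carry other fields (for example when the course was taken), the loop would need to store the whole record rather than just the grade.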

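Charles, thank you. Any thoughts on how I could find and remove the dict in acad_hist that has the GRG 302P0 class and a blank grade? - KeithE
You can use a list comprehension for simple filtering: acad_hist = [c for c in acad_hist if c['grade'] != ''], which returns a new list with every item whose grade is an empty string removed. - Charles
Very sorry... I forgot to mention that I am limited to Python v.2.6, so list comprehensions are not an option. I could, though, break down your suggested list comprehension and code it out "long-hand"?? - KeithE
List comprehensions have been supported since version 2.0. You could also use the filter function with a lambda, or, if you are using the loop above, add an if statement rather than a separate loop. For example: if g == '', use continue to skip any class with a blank grade. - Charles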
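
The three options mentioned in these comments can be sketched side by side. The variable names graded, graded_filter and graded_loop are made up for illustration; the snippet is written for Python 3, where filter returns an iterator and therefore needs list() around it.

```python
acad_hist = [
    {'crse_id': u'GRG 302P0', 'grade': u''},
    {'crse_id': u'URB 3010', 'grade': u'B+'},
    {'crse_id': u'GRG 302P0', 'grade': u'D'},
]

# 1) List comprehension.
graded = [c for c in acad_hist if c['grade'] != '']

# 2) filter with a lambda.
graded_filter = list(filter(lambda c: c['grade'] != '', acad_hist))

# 3) Plain loop that skips blank grades with continue.
graded_loop = []
for c in acad_hist:
    if c['grade'] == '':
        continue
    graded_loop.append(c)
```

All three drop every record with a blank grade, including a course that is in progress and has no other attempt, so this only fits if in-progress courses should be left out entirely.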

This can be done with iterator Lego (i.e. ifilter, sorted, groupby, max).

def find_best_grades(history):
    def course(course_grade):
        return course_grade['crse_id']
    def grade(course_grade):
        return GRADES[course_grade['grade']]
    def has_grade(course_grade):
        return bool(course_grade['grade'])

    # 1) Remove course grades without grades.
    # 2) Sort the history so that grades for the same course are
    #    consecutive (this allows groupby to work).
    # 3) Group grades for the same course together.
    # 4) Use max to select the highest grade obtained for a course.

    return [max(course_grades, key=grade)
            for _, course_grades in
            groupby(sorted(ifilter(has_grade, history), key=course),
                    key=course)]
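
Step 2 matters because itertools.groupby only merges items that sit next to each other. A small illustration, using a made-up list of course ids rather than the full records:

```python
from itertools import groupby

ids = ['GRG 302P0', 'URB 3010', 'GRG 302P0']

# Without sorting, the two GRG 302P0 entries are not adjacent,
# so groupby reports them as two separate groups.
unsorted_groups = [(k, len(list(g))) for k, g in groupby(ids)]

# After sorting, equal ids are consecutive and form a single group.
sorted_groups = [(k, len(list(g))) for k, g in groupby(sorted(ids))]
```

Here unsorted_groups has three entries of size 1, while sorted_groups has one group of size 2 for GRG 302P0 and one of size 1 for URB 3010.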

The boring complete code:

from itertools import groupby, ifilter


COURSE_ID = 'crse_id'
GRADE = 'grade'

ACADEMIC_HISTORY = [
    {
        COURSE_ID: 'GRG 302P0',
        GRADE    : 'B',
    },
    {
        COURSE_ID: 'GRG 302P0',
        GRADE    : '',
    },
    {
        COURSE_ID: 'URB 3010',
        GRADE    : 'B+',
    },
    {
        COURSE_ID: 'GRG 302P0',
        GRADE    : 'D',
    },
]

GRADES = [
    'CR',
    'D-',
    'D' ,
    'D+',
    'C-',
    'C' ,
    'C+',
    'B-',
    'B' ,
    'B+',
    'A-',
    'A' ,
    'A+',
]

GRADES = dict(zip(GRADES, range(len(GRADES))))


def find_best_grades(history):
    def course(course_grade):
        return course_grade['crse_id']
    def grade(course_grade):
        return GRADES[course_grade['grade']]
    def has_grade(course_grade):
        return bool(course_grade['grade'])

    # 1) Remove course grades without grades.
    # 2) Sort the history so that grades for the same course are
    #    consecutive (this allows groupby to work).
    # 3) Group grades for the same course together.
    # 4) Use max to select the highest grade obtained for a course.

    return [max(course_grades, key=grade)
            for _, course_grades in
            groupby(sorted(ifilter(has_grade, history), key=course),
                    key=course)]

best_grades = find_best_grades(ACADEMIC_HISTORY)
print best_grades
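
The code above targets Python 2 (itertools.ifilter and the print statement). On Python 3, ifilter no longer exists and print is a function, so a port might look like the sketch below. This is an adaptation, not part of the original answer; the GRADE_ORDER name is introduced here for illustration, and a list comprehension stands in for ifilter.

```python
from itertools import groupby

GRADE_ORDER = ['CR', 'D-', 'D', 'D+', 'C-', 'C', 'C+',
               'B-', 'B', 'B+', 'A-', 'A', 'A+']
GRADES = dict(zip(GRADE_ORDER, range(len(GRADE_ORDER))))


def find_best_grades(history):
    def course(course_grade):
        return course_grade['crse_id']

    def grade(course_grade):
        return GRADES[course_grade['grade']]

    # Drop attempts without a grade (replaces ifilter from Python 2).
    graded = [rec for rec in history if rec['grade']]
    # Sort so attempts at the same course are adjacent, group them,
    # then keep the highest-ranked attempt from each group.
    return [max(attempts, key=grade)
            for _, attempts in groupby(sorted(graded, key=course), key=course)]


history = [
    {'crse_id': 'GRG 302P0', 'grade': 'B'},
    {'crse_id': 'GRG 302P0', 'grade': ''},
    {'crse_id': 'URB 3010', 'grade': 'B+'},
    {'crse_id': 'GRG 302P0', 'grade': 'D'},
]
best_grades = find_best_grades(history)
print(best_grades)
```

As in the original, a grade that is missing from GRADES raises a KeyError, which may or may not be the behaviour wanted for unexpected grade codes.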
