pandas.DataFrame.drop_duplicates(inplace=True) 报错 'TypeError: unhashable type: 'dict''。

4

以下是我的代码:

BLOCK 1

import requests
import pandas as pd

url = ('http://www.omdbapi.com/' '?apikey=ff21610b&t=social+network')
r = requests.get(url)
json_data = r.json()
# from app
print(json_data['Awards'])
json_dict = dict(json_data)
tab=""
# printing all data as Dictionary
print("JSON as Dictionary (all):\n")
for k,v in json_dict.items():
  if len(k) > 6:
    tab = "\t"
  else:
    tab = "\t\t"
  print(str(k) + ":" + tab + str(v))
df = pd.DataFrame(json_dict)
df.drop_duplicates(inplace=True)
# printing Pandas DataFrame of all data
print("JSON as DataFrame (all):\n{}".format(df))

我只是在DataCamp上测试了一个示例问题,然后去探索了不同的事情。问题停在print(json_data['Awards'])。我进一步测试将JSON文件转换为字典并创建pandas DataFrame。有趣的是,我的输出如下:

Won 3 Oscars. Another 165 wins & 168 nominations.
JSON as Dictionary (all):

Title:      The Social Network
Year:       2010
Rated:      PG-13
Released:   01 Oct 2010
Runtime:    120 min
Genre:      Biography, Drama
Director:   David Fincher
Writer:     Aaron Sorkin (screenplay), Ben Mezrich (book)
Actors:     Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons
Plot:       Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.
Language:   English, French
Country:    USA
Awards:     Won 3 Oscars. Another 165 wins & 168 nominations.
Poster:     https://m.media-amazon.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg
Ratings:    [{'Source': 'Internet Movie Database', 'Value': '7.7/10'}, {'Source': 'Rotten Tomatoes', 'Value': '96%'}, {'Source': 'Metacritic', 'Value': '95/100'}]
Metascore:  95
imdbRating: 7.7
imdbVotes:  542,658
imdbID:     tt1285016
Type:       movie
DVD:        11 Jan 2011
BoxOffice:  $96,400,000
Production: Columbia Pictures
Website:    http://www.thesocialnetwork-movie.com/
Response:   True
Traceback (most recent call last):
  File "C:\Users\rschosta\OneDrive - Incitec Pivot Limited\Documents\Data Science\omdb-api-test.py", line 20, in <module>
    df.drop_duplicates(inplace=True)
  File "C:\Users\rschosta\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 3535, in drop_duplicates
    duplicated = self.duplicated(subset, keep=keep)
  File "C:\Users\rschosta\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 3582, in duplicated
    labels, shape = map(list, zip(*map(f, vals)))
  File "C:\Users\rschosta\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 3570, in f
    vals, size_hint=min(len(self), _SIZE_HINT_LIMIT))
  File "C:\Users\rschosta\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\algorithms.py", line 471, in factorize
    labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls)
  File "pandas/_libs/hashtable_class_helper.pxi", line 1367, in pandas._libs.hashtable.PyObjectHashTable.get_labels
TypeError: unhashable type: 'dict'

我正在研究.drop_duplicates()函数,因为我之前使用过它并且它能正常工作。以下是一个例子代码:

代码块2

import pandas as pd
import numpy as np

#Create a DataFrame
d = {
    'Name':['Alisa','Bobby','jodha','jack','raghu','Cathrine',
            'Alisa','Bobby','kumar','Alisa','Alex','Cathrine'],
    'Age':[26,24,23,22,23,24,26,24,22,23,24,24],

    'Score':[85,63,55,74,31,77,85,63,42,62,89,77]}

df = pd.DataFrame(d,columns=['Name','Age','Score'])
print(df)
df.drop_duplicates(keep=False, inplace=True)
print(df)

请注意两个代码块之间存在一些差异。在我的第一个脚本中,我将numpy导入为np,但它并没有改变结果。

如何让drop_duplicates()方法在BLOCK 1上工作?有任何想法吗?

BLOCK 1的输出 - A

根据@Wen的请求,这里是以字典形式呈现的数据:

{'Title': 'The Social Network', 'Year': '2010', 'Rated': 'PG-13', 'Released': '01 Oct 2010', 'Runtime': '120 min', 'Genre': 'Biography, Drama', 'Director': 'David Fincher', 'Writer': 'Aaron Sorkin (screenplay), Ben Mezrich (book)', 'Actors': 'Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons', 'Plot': 'Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.', 'Language': 'English, French', 'Country': 'USA', 'Awards': 'Won 3 Oscars. Another 165 wins & 168 nominations.', 'Poster': 'https://m.media-amazon.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg', 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '7.7/10'}, {'Source': 'Rotten Tomatoes', 'Value': '96%'}, {'Source': 'Metacritic', 'Value': '95/100'}], 'Metascore': '95', 'imdbRating': '7.7', 'imdbVotes': '542,658', 'imdbID': 'tt1285016', 'Type': 'movie', 'DVD': '11 Jan 2011', 'BoxOffice': '$96,400,000', 'Production': 'Columbia Pictures', 'Website': 'http://www.thesocialnetwork-movie.com/', 'Response': 'True'}

现在我在将Ratings字典转换为列并删除重复项之前,不再调用.drop_duplicates()方法,因此我在打印的字典表格列表中也有更多的输出,这使得阅读起来更加容易:

Title:      The Social Network
Year:       2010
Rated:      PG-13
Released:   01 Oct 2010
Runtime:    120 min
Genre:      Biography, Drama
Director:   David Fincher
Writer:     Aaron Sorkin (screenplay), Ben Mezrich (book)
Actors:     Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons
Plot:       Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.
Language:   English, French
Country:    USA
Awards:     Won 3 Oscars. Another 165 wins & 168 nominations.
Poster:     https://m.media-amazon.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg
Ratings:    [{'Source': 'Internet Movie Database', 'Value': '7.7/10'}, {'Source': 'Rotten Tomatoes', 'Value': '96%'}, {'Source': 'Metacritic', 'Value': '95/100'}]
Metascore:  95
imdbRating: 7.7
imdbVotes:  542,658
imdbID:     tt1285016
Type:       movie
DVD:        11 Jan 2011
BoxOffice:  $96,400,000
Production: Columbia Pictures
Website:    http://www.thesocialnetwork-movie.com/
Response:   True

2
你想在第一块中发布你的样本数据吗? - BENY
@Wen 样例数据已经在输出中了。我会添加我的新输出和代码。 - scriptkiddie
2个回答

6

你有一个填充着字典的 Ratings 列。因此,你不能使用 drop_duplicates ,因为 dicts 是可变的且不可哈希。

作为解决方案,你可以将这些值 transform 成元组的 frozenset ,然后使用 drop_duplicates

df['Ratings'] = df.Ratings.transform(lambda k: frozenset(k.items()))
df.drop_duplicates()

或者只选择您想要用作参考的列。例如,如果您想基于年份标题来删除重复项,可以这样做:

ref_cols = ['Title', 'Year']
df.loc[~df[ref_cols].duplicated()]

感谢您指出df['Ratings']列中的字典。您的解决方案会使我失去评级来源和价值数据,或者通过将字典转换为frozenset,我仍然会有大量重复数据,除了Ratings列外,并且不需要.drop_duplicates()。我正在将评级字典转换为df的列,并在应用.drop_duplicates()之前删除df['Ratings'] - scriptkiddie
谢谢你的回答。它并没有完全给我想要的,我需要一行数据。我正在努力解决问题并取得进展。再次感谢 RafaelC 和 Wen 的帮助。 - scriptkiddie

3

Object通常会导致这些问题,一种方法是将dictlist转换为str

df['Ratings1'] = df.Ratings.astype(str)
df=df.drop_duplicates(df.columns.difference(['Ratings'])).drop('Ratings1')

@Wen 谢谢你的回复。这是很好的东西。我正在努力将 df['Ratings'] 字典中的每个值放入它们自己的列中。我创建了一个工作循环来创建列名。这还在进行中。然后我可以删除重复项。 - scriptkiddie
这正是我正在寻找的。谢谢! - Henrique Florencio

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接