Assigning integers to strings in a list column in Pandas

3

I have a Pandas DataFrame with a column whose values are lists of strings.

>>> df.head()

   genre
0  [Comedy,  Supernatural,  Romance]
1  [Comedy,  Parody,  Romance]
2  [Comedy]
3  [Comedy,  Drama,  Romance,  Fantasy]
4  [Comedy,  Drama,  Romance]

How can I assign a unique id to each value in the lists, so that the same string gets the same id throughout the column?

>>> df.head()

   genre
0  [1,  2,  3]
1  [1,  4,  3]
2  [1]
3  [1,  5,  3,  6]
4  [1,  5,  3]
3 Answers

3

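The complexity here is that we are dealing with a column of lists. Exploding the rows first gives a flat Series to work on, which can help performance somewhat. Then use pd.factorize and group the result back into the original list format: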

v = df['genre'].explode()
v[:] = pd.factorize(v)[0] + 1
df['genre2'] = v.groupby(level=0).agg(list)

df
                               genre        genre2
0    [Comedy, Supernatural, Romance]     [1, 2, 3]
1          [Comedy, Parody, Romance]     [1, 4, 3]
2                           [Comedy]           [1]
3  [Comedy, Drama, Romance, Fantasy]  [1, 5, 3, 6]
4           [Comedy, Drama, Romance]     [1, 5, 3]
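As a self-contained sketch of the same explode/factorize idea, the snippet below rebuilds the sample frame from the question and also keeps the second return value of pd.factorize, so the ids can be mapped back to genre names. The names exploded, ids and id_to_genre are illustrative and not part of the answer above.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "genre": [
        ["Comedy", "Supernatural", "Romance"],
        ["Comedy", "Parody", "Romance"],
        ["Comedy"],
        ["Comedy", "Drama", "Romance", "Fantasy"],
        ["Comedy", "Drama", "Romance"],
    ]
})

# One row per list element; the original row label is repeated in the index.
exploded = df["genre"].explode()

# factorize returns zero-based integer codes (in order of first appearance)
# together with the unique values those codes refer to.
codes, uniques = pd.factorize(exploded)

# Shift to one-based ids and keep the exploded index so rows can be regrouped.
ids = pd.Series(codes + 1, index=exploded.index)
df["genre2"] = ids.groupby(level=0).agg(list)

# Reverse lookup from id to genre name (illustrative helper).
id_to_genre = dict(enumerate(uniques, start=1))
print(df)
print(id_to_genre)
```

Because the codes follow order of first appearance, the ids depend on row order; sorting the frame differently before factorizing would produce a different numbering.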

2

Build a dictionary that gives each genre a unique id:

uniq_genres = df.genre.explode().unique()
dict_genres = {genre:i+1 for i,genre in enumerate(uniq_genres)}
print(dict_genres)
{'Comedy': 1, 'Supernatural': 2, 'Romance': 3, 'Parody': 4, 'Drama': 5, 'Fantasy': 6}

Then use this dictionary to map the genres to their ids:
df.assign(genre_id = df.genre.apply(lambda x: [dict_genres[genre] for genre in x]))

Output:

                               genre      genre_id
0    [Comedy, Supernatural, Romance]     [1, 2, 3]
1          [Comedy, Parody, Romance]     [1, 4, 3]
2                           [Comedy]           [1]
3  [Comedy, Drama, Romance, Fantasy]  [1, 5, 3, 6]
4           [Comedy, Drama, Romance]     [1, 5, 3]
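One detail worth spelling out is that DataFrame.assign returns a new frame and leaves df unchanged, so the result has to be stored if it is needed later. A minimal runnable sketch of the dictionary approach, using a shortened copy of the sample data and the illustrative name result:

```python
import pandas as pd

# Shortened sample data, for illustration only.
df = pd.DataFrame({
    "genre": [
        ["Comedy", "Supernatural", "Romance"],
        ["Comedy", "Parody", "Romance"],
        ["Comedy"],
    ]
})

# Unique genres in order of first appearance, numbered from 1.
uniq_genres = df["genre"].explode().unique()
dict_genres = {genre: i for i, genre in enumerate(uniq_genres, start=1)}

# assign() builds a new DataFrame; df itself gets no genre_id column.
result = df.assign(
    genre_id=df["genre"].apply(lambda genres: [dict_genres[g] for g in genres])
)
print(result)
```

Looking up dict_genres[g] raises a KeyError for a genre that was not present when the dictionary was built, which may or may not be the desired behaviour if the mapping is later reused on other data.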

0

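You can keep a global dictionary to track the ids: if a value is already in the dictionary, reuse its id; otherwise, increment the current maximum id and assign that: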

d = {} # Dictionary to assign numerical ids
maxV = 0 # Max numerical id in the dictionary

def assignId(x):
    lst = []
    global d, maxV
    for item in x:       
        if item in d:
            # Get numerical id from the dictionary.
            lst.append(d.get(item))           
        else:
            # Increment the largest numerical id in the dictionary
            # and add it to the dictionary.
            maxV += 1
            d[item] = maxV
            lst.append(maxV)
    return lst

If I apply this to the DataFrame:

df['genre_ids'] = df['genre'].apply(assignId)

I get:

                               genre     genre_ids
0    [Comedy, Supernatural, Romance]     [1, 2, 3]
1          [Comedy, Parody, Romance]     [1, 4, 3]
2                           [Comedy]           [1]
3  [Comedy, Drama, Romance, Fantasy]  [1, 5, 3, 6]
4           [Comedy, Drama, Romance]     [1, 5, 3]

with this dictionary d:

{'Comedy': 1,
 'Supernatural': 2,
 'Romance': 3,
 'Parody': 4,
 'Drama': 5,
 'Fantasy': 6}
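If the global state is a concern, the same bookkeeping can be kept inside a closure. This is only a sketch of one possible variant, not the answerer's code; make_id_assigner and assign_ids are hypothetical names, and plain Python lists stand in for the DataFrame column.

```python
def make_id_assigner():
    """Return a function that maps items to stable ids, plus the id table."""
    ids = {}  # item -> numerical id, filled in order of first appearance

    def assign_ids(items):
        # setdefault returns the stored id if the item is known; otherwise it
        # stores and returns len(ids) + 1, i.e. the next unused id.
        return [ids.setdefault(item, len(ids) + 1) for item in items]

    return assign_ids, ids


assign_ids, ids = make_id_assigner()

rows = [
    ["Comedy", "Supernatural", "Romance"],
    ["Comedy", "Parody", "Romance"],
    ["Comedy"],
    ["Comedy", "Drama", "Romance", "Fantasy"],
    ["Comedy", "Drama", "Romance"],
]
encoded = [assign_ids(row) for row in rows]
print(encoded)
print(ids)
```

With a DataFrame, the returned function could be passed to apply in the same way as above, for example df['genre_ids'] = df['genre'].apply(assign_ids). Each call to make_id_assigner starts a fresh numbering, so separate columns can be encoded independently.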
