Get the most common names from a column containing lists of names using pandas

Question

3

My DataFrame looks like this:

   star_rating  actors_list
0          9.3  [u'Tim Robbins', u'Morgan Freeman']
1          9.2  [u'Marlon Brando', u'Al Pacino', u'James Caan']
2          9.1  [u'Al Pacino', u'Robert De Niro']
3          9.0  [u'Christian Bale', u'Heath Ledger']
4          8.9  [u'John Travolta', u'Uma Thurman']

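I want to extract the most common names in the actors_list column. I found the code below; do you have a better suggestion, especially for large data?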

import pandas as pd
df = pd.read_table(r'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv', sep=',')
# regex=True is needed on pandas >= 2.0, where str.replace treats the pattern literally by default
df.actors_list.str.replace(r"(u\'|[\[\]]|\')", '', regex=True).str.lower().str.split(',', expand=True).stack().value_counts()

The expected output (for this data) is:

robert de niro    13
tom hanks         12
clint eastwood    11
johnny depp       10
al pacino         10
james stewart      9

- Reza energy

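Please provide the expected output. - Sociopath

It is better to use a for loop than to let pandas do the heavy lifting here. - Bharath M Shetty

@coldspeed I don't think this is a duplicate of the unnesting question. - Bharath M Shetty

If you have a huge list, expand=True will blow up your system. - Bharath M Shetty

@Dark Without expand=True, .stack() will not work. - Reza energy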

4 Answers

3

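If the lists are large, I suggest using pure Python rather than relying on pandas, because pandas will consume a lot of memory here.

If the longest list has 1000 entries, then with expand=True every shorter list is padded with NaN up to 1000 columns, which wastes memory. Try the following instead: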

df = pd.concat([df]*1000) # For the sake of large df. 

%%timeit
df.actors_list.str.replace(r"(u\'|[\[\]]|\')", '', regex=True).str.lower().str.split(',', expand=True).stack().value_counts()
10 loops, best of 3: 65.9 ms per loop

%%timeit     
df['actors_list'] = df['actors_list'].str.strip('[]').str.replace(', ',',').str.split(',')
10 loops, best of 3: 24.1 ms per loop

%%timeit
words = {}
for i in df['actors_list']:
    for w in i : 
        if w in words:
            words[w]+=1
        else:
            words[w]=1

100 loops, best of 3: 5.44 ms per loop
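
The counting loop above can also be written with the standard library's `collections.Counter`. This is a minimal sketch rather than a benchmarked replacement; `actors_lists` is a hypothetical stand-in for the already-split `df['actors_list']` column, and no timing was measured for it.

```python
from collections import Counter
from itertools import chain

# Hypothetical sample rows standing in for the parsed df['actors_list'] column
actors_lists = [
    ["Tim Robbins", "Morgan Freeman"],
    ["Marlon Brando", "Al Pacino", "James Caan"],
    ["Al Pacino", "Robert De Niro"],
]

# Flatten the nested lists lazily and count each name in a single pass
counts = Counter(chain.from_iterable(actors_lists))

# most_common(n) returns the n most frequent (name, count) pairs
top = counts.most_common(1)
```

With a real DataFrame, the same call would presumably be `Counter(chain.from_iterable(df['actors_list']))`, which never builds the wide NaN-padded frame that expand=True creates.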

- Bharath M Shetty

1

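Don't forget to time this part as well: df['actors_list'].str.strip('[]').str.replace(', ',',').str.split(',') - cs95

@Dark: With your code I get the error "Can only use .str accessor with string values, which use np.object_ dtype in pandas". Also, the place where you run your code is a Jupyter notebook or IPython, and it does not accept %%. - Reza energy

@Rezaenergy, please remove the %%timeit lines and their results and use only the code. Since the dtype suggests the column already holds objects, you can run the code starting directly from words = {}. - Bharath M Shetty

@Dark Thanks, it works without %timeit or %%timeit, but I don't know why adding %%timeit causes this error: Can only use .str accessor with string values, which use np.object_ dtype in pandas. - Reza energy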
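
A plausible explanation for the error discussed above, offered as an assumption and not confirmed in the thread: %%timeit runs the cell many times, and the timed statement assigns its result back to df['actors_list'], so from the second run onward .str is applied to a column that no longer holds strings. The sketch below times the same transformation without mutating the frame. The sample rows and the helper name `split_actors` are hypothetical.

```python
import timeit

import pandas as pd

# Hypothetical sample data in the same string format as the question
df = pd.DataFrame({
    "actors_list": [
        "[u'Al Pacino', u'Robert De Niro']",
        "[u'Al Pacino', u'James Caan']",
    ]
})

def split_actors(frame):
    # Build and return a new Series; the input frame is left untouched,
    # so the function can be called any number of times
    return (frame["actors_list"]
            .str.strip("[]")
            .str.replace(", ", ",", regex=False)
            .str.split(","))

# Time 100 repeated calls; df still holds the original strings afterwards
elapsed = timeit.timeit(lambda: split_actors(df), number=100)
parsed = split_actors(df)
```

Assigning `parsed` back to the column once, outside the timed code, keeps the later counting loop working on lists.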

3

I would use `ast` to convert the list-like strings into real `list` objects.

import ast 
df.actors_list=df.actors_list.apply(ast.literal_eval)
pd.DataFrame(df.actors_list.tolist()).melt().value.value_counts()

- BENY

It raises the error ValueError: malformed node or string: ['Tim Robbins', 'Morgan Freeman', 'Bob Gunton']. - Reza energy
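
`ast.literal_eval` raises this ValueError when it receives a value that is already a list instead of a string, which can happen if the conversion has been applied to the column once before. Whether that is what happened here is an assumption. A minimal sketch of a guarded conversion, where `to_list` is a hypothetical helper and `raw` is made-up sample data mixing both cases:

```python
import ast
from collections import Counter

def to_list(value):
    """Parse a list literal stored as text; pass real lists through unchanged."""
    if isinstance(value, str):
        return ast.literal_eval(value)
    return list(value)

# Hypothetical column values: one still a string, one already parsed
raw = ["[u'Al Pacino', u'Robert De Niro']", ["Al Pacino", "James Caan"]]

parsed = [to_list(v) for v in raw]

# Count every name across all parsed lists
counts = Counter(name for names in parsed for name in names)
```

On a DataFrame this would presumably be used as `df.actors_list.apply(to_list)`, which is safe to run more than once.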

0

Based on this code, I got the chart below

where

coldspeed's code is wen2()
Dark's code is wen4()
my code is wen1()
W-B's code is wen3()

- Reza energy

- cs95 · Accepted Answer

In my tests, it is faster to count first and do the regex cleanup afterwards.

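from itertools import chain
import re

import pandas as pd

# str.title() turns the leading u' into U', so match the prefix case-insensitively
p = re.compile(r"""^u['"](.*)['"]$""", re.IGNORECASE)
# Strip the outer brackets, title-case, split into names, then count
ser = pd.Series(list(chain.from_iterable(
    x.title().split(', ') for x in df.actors_list.str[1:-1]))).value_counts()
# Clean only the unique names in the index, not every row
ser.index = [p.sub(r"\1", x) for x in ser.index.tolist()]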


ser.head()

Robert De Niro    18
Brad Pitt         14
Clint Eastwood    14
Tom Hanks         14
Al Pacino         13
dtype: int64