Splitting text based on a delimiter in Python

Question

python pandas

3

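I have just started learning Python. I am working with the netflix_titles dataset downloaded from Kaggle. Some entries in the director column contain multiple director names separated by commas, and I am trying to use the split function to separate the names.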

Here is one of the raw values loaded from the file into the DataFrame:

s7 Movie My Little Pony: A New Generation, Robert Cullen, José Luis Ucha Vanessa Hudgens, ..

I am using the following code to do the split:

def strip(x):
  x = x.strip().split(',')
  return x

director_counts = df["director"].apply(strip)

After running the code above, the output is:

s7 [Robert Cullen, José Luis Ucha]

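The director names are not split on the comma, and even though I pass only the director column to the function, I also see the index (s7) in what the function returns. Can someone tell me why this happens? Edit: I also tried this: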

director_counts =  df['director'].str.split(',\s*')

Colab link: https://colab.research.google.com/drive/1OXJ9XKCBVg4-6W8Hiqfy4ZTkgz0IVqbR?usp=sharing

- logeeks

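Don't give your function the same name as an existing method (such as the string.split() method) unless it does something very similar.
And if you defined it as a top-level function rather than as a method inside a class definition (as in your example), don't call it with object.method() syntax.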

- The Photon

The formatting may be off (apply(strip) is outside the code block). Also, you used strip without parentheses, so it is never called. - Caridorc

@ThePhoton, I have edited the code. There was a typo when I copied it. - logeeks

It is still confusing (maybe to you and to us, possibly even to Python) that you named your function the same as the existing string.strip() method. - The Photon

As Obi-Wan Kenobi said: "That's not the comma you're looking for." - nicomp

2 Answers

1

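When you write df["director"].strip in the line director_counts = df["director"].strip, you are accessing an attribute named strip on the pandas Series object itself, not applying strip() to each element of the Series (string methods on a Series live under the .str accessor). To run your function on each element, use the apply() method on the Series.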

def strip(x):
    x = x.split(',')
    x = [obj.strip() for obj in x]
    return x

director_counts = df["director"].apply(strip)
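One way to check what apply() actually returned is to look inside a single cell. The sketch below uses a hypothetical one-row DataFrame standing in for the s7 entry (the sample data and the name split_names are assumptions for illustration, not taken from the asker's notebook). pandas prints lists of strings without quotes, so a two-element list can look just like one unsplit string, and apply() returns a Series, which is why the index label s7 shows up next to the result.

```python
import pandas as pd

# Hypothetical one-row sample mimicking the s7 entry from the question.
df = pd.DataFrame(
    {"director": ["Robert Cullen, José Luis Ucha"]},
    index=["s7"],
)

def split_names(x):
    # Split on commas, then trim whitespace around each name.
    return [name.strip() for name in x.split(",")]

# apply() returns a Series that keeps the original index (here: "s7").
director_lists = df["director"].apply(split_names)

first = director_lists.loc["s7"]
print(type(first).__name__)  # list
print(len(first))            # 2
print(repr(first))           # ['Robert Cullen', 'José Luis Ucha']
```

If len() reports 2 here, the split worked and only the printed form of the Series is misleading.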

- Abubakar Njumwa

I tried this approach, but I still can't get the s7 value split on the comma. [Robert Cullen, José Luis Ucha] - logeeks

- Corralien · Accepted Answer

Use str.split():

import pandas as pd

df = pd.read_csv('/home/damien/Downloads/netflix_titles.csv.zip')

# Raw string so the regex escape \s reaches pandas unchanged
directors = df['director'].str.split(r',\s*')

Output:

>>> directors
0       [Kirsten Johnson]
1                     NaN
2       [Julien Leclercq]
3                     NaN
4                     NaN
              ...        
8802      [David Fincher]
8803                  NaN
8804    [Ruben Fleischer]
8805       [Peter Hewitt]
8806        [Mozez Singh]
Name: director, Length: 8807, dtype: object
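The output above shows two behaviors worth checking on a small scale: the pattern ,\s* absorbs any whitespace after each comma, and rows with a missing director stay missing instead of raising an error. The following is a minimal sketch on a made-up Series (the names are placeholders, not rows from the dataset).

```python
import pandas as pd

# Hypothetical sample: uneven spacing after commas and one missing value.
s = pd.Series(["A One,B Two,  C Three", None, "D Four"], name="director")

# A comma followed by optional whitespace is treated as the separator.
parts = s.str.split(r",\s*")

print(parts.iloc[0])           # ['A One', 'B Two', 'C Three']
print(pd.isna(parts.iloc[1]))  # True: the missing value is left as missing
print(parts.iloc[2])           # ['D Four']
```

By contrast, a plain Python function passed to apply() receives the missing value itself and would have to handle it explicitly.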

Update

"I was expecting it to be split into two rows"

Use explode:

>>> directors.explode()
0       Kirsten Johnson
1                   NaN
2       Julien Leclercq
3                   NaN
4                   NaN
             ...       
8802      David Fincher
8803                NaN
8804    Ruben Fleischer
8805       Peter Hewitt
8806        Mozez Singh
Name: director, Length: 9612, dtype: object  # <- 9612 rows instead of 8807
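The row count grows because explode() emits one row per list element and repeats the original index label for each of them. Here is a small sketch on a hypothetical Series of lists (the values are made up) that makes the repeated index visible:

```python
import pandas as pd

# Hypothetical Series of lists, shaped like the result of str.split().
lists = pd.Series(
    [["A One", "B Two"], float("nan"), ["C Three"]],
    name="director",
)

exploded = lists.explode()

print(exploded.index.tolist())     # [0, 0, 1, 2]
print(len(lists), len(exploded))   # 3 4

# Optional: replace the repeated labels with a fresh 0..n-1 index.
flat = exploded.reset_index(drop=True)
print(flat.index.tolist())         # [0, 1, 2, 3]
```

Keeping the repeated index is useful when the exploded names need to be joined back to the original rows; reset_index(drop=True) is only needed when a unique index matters.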

To get counts per director, use value_counts (which drops NaN by default):

>>> directors.explode().value_counts()
Rajiv Chilaka     22
Jan Suter         21
Raúl Campos       19
Suhas Kadav       16
Marcus Raboy      16
                  ..
Raymie Muzquiz     1
Stu Livingston     1
Joe Menendez       1
Eric Bross         1
Mozez Singh        1
Name: director, Length: 4993, dtype: int64
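The three steps (split, explode, count) can also be chained into one expression. The sketch below runs the chain on a made-up Series (placeholder names, not real data) and shows the effect of the dropna parameter of value_counts:

```python
import pandas as pd

# Hypothetical sample with one missing director and one repeated name.
s = pd.Series(
    ["A One, B Two", None, "B Two", "C Three,B Two"],
    name="director",
)

names = s.str.split(r",\s*").explode()

# Default: missing values are excluded from the counts.
counts = names.value_counts()
print(counts["B Two"])   # 3
print(len(counts))       # 3 distinct directors

# dropna=False keeps the missing value as its own entry.
counts_with_missing = names.value_counts(dropna=False)
print(len(counts_with_missing))  # 4
```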