Replace strings within strings in a Pandas column using a dictionary

Question

python pandas dictionary dataframe replace

27

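I am trying to use a dictionary's keys to replace strings in a pandas column with the corresponding values. However, each entry in the column is a sentence, so I must first tokenize the sentence, check whether any word in it matches a key in the dictionary, and then replace that word with the corresponding value.

However, the result I keep getting is None. Is there a better, more Pythonic way to solve this?

Here is my current MCVE. In the comments, I have marked where the problem occurs.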

import pandas as pd

data = {'Categories': ['animal','plant','object'],
    'Type': ['tree','dog','rock'],
        'Comment': ['The NYC tree is very big','The cat from the UK is small','The rock was found in LA.']
}

ids = {'Id':['NYC','LA','UK'],
      'City':['New York City','Los Angeles','United Kingdom']}


df = pd.DataFrame(data)
ids = pd.DataFrame(ids)

def col2dict(ids):
    data = ids[['Id', 'City']]
    idDict = data.set_index('Id').to_dict()['City']
    return idDict

def replaceIds(data,idDict):
    ids = idDict.keys()
    types = idDict.values()
    data['commentTest'] = data['Comment']
    words = data['commentTest'].apply(lambda x: x.split())
    for (i,word) in enumerate(words):
        #Here we can see that the words appear
        print word
        print ids
        if word in ids:
        #Here we can see that they are not being recognized. What happened?
            print ids
            print word
            words[i] = idDict[word]
            data['commentTest'] = ' '.apply(lambda x: ''.join(x))
    return data

idDict = col2dict(ids)
results = replaceIds(df, idDict)

Result:

None

I am using Python 2.7, and when I print the dict, its strings appear with the u' Unicode prefix.

My expected output is:

  Categories                       Comment  Type                               commentTest
0     animal      The NYC tree is very big  tree        The New York City tree is very big
1      plant  The cat from the UK is small   dog  The cat from the United Kingdom is small
2     object     The rock was found in LA.  rock        The rock was found in Los Angeles.

- owwoow14

2 Answers

11

Using str.replace() is actually much faster than using replace(), even though str.replace() requires a loop:

ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}

for old, new in ids.items():
    df['Comment'] = df['Comment'].str.replace(old, new, regex=False)

#   Categories  Type                                   Comment
# 0     animal  tree        The New York City tree is very big
# 1      plant   dog  The cat from the United Kingdom is small
# 2     object  rock        The rock was found in Los Angeles.
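
One caveat with literal replacement is that it matches substrings anywhere, so a key such as 'LA' would also be rewritten inside a longer word. A minimal sketch of a whole-word variant follows, assuming a recent pandas in which str.replace accepts a callable replacement; the fourth comment ('LAND in the UK') is hypothetical data added only to show the difference.

```python
import re
import pandas as pd

ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}

# The first three comments come from the question; the last one is a
# hypothetical row added to show what happens with a longer word.
df = pd.DataFrame({'Comment': ['The NYC tree is very big',
                               'The cat from the UK is small',
                               'The rock was found in LA.',
                               'LAND in the UK']})

# Literal loop from the answer: 'LA' is also replaced inside 'LAND'.
literal = df['Comment']
for old, new in ids.items():
    literal = literal.str.replace(old, new, regex=False)

# Single pass: one alternation of all keys wrapped in word boundaries.
# re.escape guards against keys that contain regex metacharacters.
pattern = r'\b(?:' + '|'.join(re.escape(key) for key in ids) + r')\b'

# The callable looks each matched key up in the dictionary.
whole_words = df['Comment'].str.replace(
    pattern, lambda match: ids[match.group(0)], regex=True)

print(literal.tolist())
print(whole_words.tolist())
```

The single-pass version trades the speed of literal matching for stricter matching, so it is worth timing on real data before adopting it.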

replace() only outperforms the str.replace() loop on small DataFrames.

Timing functions for reference:

def Series_replace(df):
    df['Comment'] = df['Comment'].replace(ids, regex=True)
    return df

def Series_str_replace(df):
    for old, new in ids.items():
        df['Comment'] = df['Comment'].str.replace(old, new, regex=False)
    return df
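
The comparison can be reproduced with a small harness along these lines. This is a sketch rather than the benchmark used for the answer: the helper time_both, the row counts, and the choice of timeit are assumptions, and the functions are rewritten to take a Series and return a new one so that repeated runs do not modify their input.

```python
import timeit
import pandas as pd

ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}
comments = ['The NYC tree is very big',
            'The cat from the UK is small',
            'The rock was found in LA.']

def series_replace(s):
    # Dictionary-based replace; the keys are treated as regex patterns.
    return s.replace(ids, regex=True)

def series_str_replace(s):
    # One literal str.replace pass per dictionary entry.
    for old, new in ids.items():
        s = s.str.replace(old, new, regex=False)
    return s

def time_both(n_rows, repeats=3):
    # Hypothetical helper: tile the sample comments up to n_rows rows and
    # report the best wall-clock time of each approach, in seconds.
    s = pd.Series(comments * (n_rows // len(comments)))
    return {func.__name__: min(timeit.repeat(lambda: func(s),
                                             number=1, repeat=repeats))
            for func in (series_replace, series_str_replace)}

for n in (300, 30000):
    print(n, time_both(n))
```

Absolute numbers depend on the machine, the pandas version, and the number of dictionary entries, so the crossover point is best measured on the actual data.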

Note that if ids is a DataFrame rather than a dictionary, you can get the same performance with itertuples():

ids = pd.DataFrame({'Id': ['NYC', 'LA', 'UK'], 'City': ['New York City', 'Los Angeles', 'United Kingdom']})

for row in ids.itertuples():
    df['Comment'] = df['Comment'].str.replace(row.Id, row.City, regex=False)

- tdy

1

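This is indeed true. I am working with 1.5 million rows, and the improvement from running a str.replace loop instead of replace with a dictionary of 40 values (changing a few characters) is substantial. I am sure performance also depends on the number of columns, rows, and changes, but for me this solution was much faster than the accepted one. Thanks! - Svestis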

Content provided by Stack Overflow.

- jezrael · Accepted Answer

You can create a dictionary and then replace:

ids = {'Id':['NYC','LA','UK'],
      'City':['New York City','Los Angeles','United Kingdom']}

ids = dict(zip(ids['Id'], ids['City']))
print (ids)
{'UK': 'United Kingdom', 'LA': 'Los Angeles', 'NYC': 'New York City'}

df['commentTest'] = df['Comment'].replace(ids, regex=True)
print (df)
  Categories                       Comment  Type  \
0     animal      The NYC tree is very big  tree   
1      plant  The cat from the UK is small   dog   
2     object     The rock was found in LA.  rock   

                                commentTest  
0        The New York City tree is very big  
1  The cat from the United Kingdom is small  
2        The rock was found in Los Angeles.
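
With regex=True the dictionary keys are interpreted as regular expressions, which is harmless for keys like 'NYC' but matters once a key contains metacharacters. The sketch below illustrates this with hypothetical data (the dotted abbreviation 'U.K.' is not part of the question) and escapes the keys with re.escape so they match literally.

```python
import re
import pandas as pd

# Hypothetical data, chosen only to show the effect of an unescaped '.'.
s = pd.Series(['Shipped from the U.K. today', 'A UFKO was seen'])
ids = {'U.K.': 'United Kingdom'}

# Unescaped: each '.' matches any character, so 'UFKO' is rewritten too.
unescaped = s.replace(ids, regex=True)

# Escaped: the keys are matched literally, so only 'U.K.' is replaced.
escaped_ids = {re.escape(key): value for key, value in ids.items()}
escaped = s.replace(escaped_ids, regex=True)

print(unescaped.tolist())
print(escaped.tolist())
```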