How to replace groups of strings in a Pandas Series based on a dictionary whose values are lists?

3

I could not find a solution on Stack Overflow for replacing strings based on a dictionary whose values are lists.

Dictionary

dct  = {"LOL": ["laught out loud", "laught-out loud"],
        "TLDR": ["too long didn't read", "too long; did not read"],
        "application": ["app"]}

Input

input_df = pd.DataFrame([("haha too long didn't read and laught out loud :D"),
                         ("laught-out loud so I couldnt too long; did not read"),
                         ("what happened?")], columns=['text'])

Desired output

output_df = pd.DataFrame([("haha TLDR and LOL :D"),
                          ("LOL so I couldnt TLDR"),
                          ("what happened?")], columns=['text'])

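Edit

I added an extra entry to the dictionary: "application": ["app"]

With that entry, the current solution outputs "what happlicationened?" because "app" is matched inside "happened".

Please suggest a fix.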

4 Answers

6
Build an inverted mapping and pass it to Series.replace with regex=True:
mapping = {v : k for k, V in dct.items() for v in V}
input_df['text'] = input_df['text'].replace(mapping, regex=True)

print(input_df)
                    text
0   haha TLDR and LOL :D
1  LOL so I couldnt TLDR

where

print(mapping)
{'laught out loud': 'LOL',
 'laught-out loud': 'LOL',
 "too long didn't read": 'TLDR',
 'too long; did not read': 'TLDR'}

To match whole words only, add word boundaries around each phrase:
mapping = {rf'\b{v}\b' : k for k, V in dct.items() for v in V}
input_df['text'] = input_df['text'].replace(mapping, regex=True)

print(input_df)
                    text
0   haha TLDR and LOL :D
1  LOL so I couldnt TLDR
2         what happened?

Here,

print(mapping)
{'\\bapp\\b': 'application',
 '\\blaught out loud\\b': 'LOL',
 '\\blaught-out loud\\b': 'LOL',
 "\\btoo long didn't read\\b": 'TLDR',
 '\\btoo long; did not read\\b': 'TLDR'}

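Great! Please suggest a solution for the following problem. I added an extra entry to the dictionary, "application": ["app"], but the current solution outputs "what happlicationened?". Please suggest a fix. - GeorgeOfTheRF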
1
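@ML_Pro Do you mean that you only want to match whole words? In that case, try changing "app" to r"\bapp\b", and do the same for every string to be replaced. \b is a regex word boundary, so only whole words will match. - cs95
Thanks. However, I am loading the dictionary from a JSON file. How can I convert "app" to r"\bapp\b" in Python code? I could not find a function that converts a string to a raw string. I have accepted your answer. - GeorgeOfTheRF
Great. Got it. - GeorgeOfTheRF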
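
A note on the raw-string question in the comments: the r prefix only changes how a literal is written in source code, so a string loaded from JSON needs no conversion and the word-boundary markers can simply be concatenated at runtime. The sketch below assumes the dictionary arrives as JSON text (the `raw_json` variable is a hypothetical stand-in for the file). It also applies `re.escape`, which is an addition not present in the answer above, so that any regex metacharacters in a phrase are matched literally.

```python
import json
import re

import pandas as pd

# Hypothetical JSON payload standing in for the file mentioned in the comment.
raw_json = """
{"LOL": ["laught out loud", "laught-out loud"],
 "TLDR": ["too long didn't read", "too long; did not read"],
 "application": ["app"]}
"""
dct = json.loads(raw_json)

# r"\b" is just the two characters backslash and b, so it can be
# concatenated with any runtime string. re.escape (an assumption added
# here) protects phrases that contain regex metacharacters.
mapping = {r"\b" + re.escape(v) + r"\b": k for k, vals in dct.items() for v in vals}

input_df = pd.DataFrame({"text": [
    "haha too long didn't read and laught out loud :D",
    "laught-out loud so I couldnt too long; did not read",
    "what happened?",
]})

result = input_df["text"].replace(mapping, regex=True)
print(result.tolist())
```

Because "app" inside "happened" is not surrounded by word boundaries, the third row should be left unchanged.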

1
Use df.apply with a custom function

Example:

import pandas as pd


def custReplace(value):
    dct  = {"LOL": ["laught out loud", "laught-out loud"],
        "TLDR": ["too long didn't read", "too long; did not read"]
        }

    for k, v in dct.items():
        for i in v:
            if i in value:
                value = value.replace(i, k)
    return value

input_df = pd.DataFrame([("haha too long didn't read and laught out loud :D"),
       ("laught-out loud so I couldnt too long; did not read")], columns=['text'])

print(input_df["text"].apply(custReplace))

Output:

0     haha TLDR and LOL :D
1    LOL so I couldnt TLDR
Name: text, dtype: object

Or, using one regex alternation per key with Series.replace:

dct  = {"LOL": ["laught out loud", "laught-out loud"],
        "TLDR": ["too long didn't read", "too long; did not read"]
        }

dct = { "(" + "|".join(v) + ")": k for k, v in dct.items()}
input_df = pd.DataFrame([("haha too long didn't read and laught out loud :D"),
       ("laught-out loud so I couldnt too long; did not read")], columns=['text'])

print(input_df["text"].replace(dct, regex=True))

1
Here is the approach I would take:

import pandas as pd


dct  = {"LOL": ["laught out loud", "laught-out loud"],
        "TLDR": ["too long didn't read", "too long; did not read"]
        }

input_df = pd.DataFrame([("haha too long didn't read and laught out loud :D"),
       ("laught-out loud so I couldnt too long; did not read")], columns=['text'])

dct_inv = {}
for key, vals in dct.items():
    for val in vals:
        dct_inv[val]=key

dct_inv

def replace_text(input_str):
    for key, val in dct_inv.items():
        input_str = str(input_str).replace(key, val)
    return input_str

# Apply to the 'text' column so each call receives a plain string, not a row Series
input_df["text"].apply(replace_text).to_frame()

1

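I think the most logical starting point is to invert your dictionary so that the keys are the original strings and the values are the new strings. You can do this by hand or in countless other ways, for example: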

import itertools
dict_rev = dict(itertools.chain.from_iterable([list(zip(v, [k]*len(v))) for k, v in dct.items()]))

That is not very readable. Alternatively, you can use this nicer version, which I borrowed from another answer:

dict_rev = {v : k for k, V in dct.items() for v in V}

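This requires every value in your dictionary to be a list (or another iterable), e.g. "new key": ["single_val"]; otherwise the comprehension will iterate over each character of the string.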
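
To illustrate the caveat above, here is a small sketch of what happens when a value is a bare string, along with one possible guard. The `bad` dictionary and the `as_list` helper are hypothetical names introduced only for this illustration.

```python
# A value that is a bare string instead of a list.
bad = {"application": "app"}

# The comprehension iterates over the characters of "app".
broken = {v: k for k, V in bad.items() for v in V}
print(broken)  # {'a': 'application', 'p': 'application'}


def as_list(value):
    """Wrap a bare string in a list; return other iterables unchanged."""
    return [value] if isinstance(value, str) else value


# Normalizing each value first keeps the phrase intact.
fixed = {v: k for k, V in bad.items() for v in as_list(V)}
print(fixed)  # {'app': 'application'}
```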
Then, based on the code from "How to replace multiple substrings of a string?", you can do the following.
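import re
# Escape each phrase so it is matched literally, then join into one alternation
rep = dict((re.escape(k), v) for k, v in dict_rev.items())
pattern = re.compile("|".join(rep.keys()))
# regex=True is required for a compiled pattern in recent pandas versions
input_df["text"] = input_df["text"].str.replace(pattern, lambda m: rep[re.escape(m.group(0))], regex=True)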

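This approach runs roughly 2.7 times faster than the simpler, more elegant solution (160 µs versus 425 µs in the timings below):
Simple: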
%timeit input_df["text"].replace(dict_rev, regex=True)

425 µs ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Faster:
%timeit input_df["text"].str.replace(pattern, lambda m: rep[re.escape(m.group(0))], regex=True)

160 µs ± 7.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
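
The single-pattern variant can also be combined with the whole-word requirement from the question's edit. The following is a sketch under my own assumptions rather than part of the original answer: the phrases are sorted longest first so that a longer phrase wins over a shorter one that shares its prefix, and the callable looks up the matched text directly in `dict_rev`. No timing claim is made for this variant.

```python
import re

import pandas as pd

dct = {"LOL": ["laught out loud", "laught-out loud"],
       "TLDR": ["too long didn't read", "too long; did not read"],
       "application": ["app"]}

# Invert: original phrase -> replacement.
dict_rev = {v: k for k, V in dct.items() for v in V}

# Longest phrases first, each escaped; the whole alternation sits
# between word boundaries so "app" cannot match inside "happened".
alternatives = sorted(dict_rev, key=len, reverse=True)
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, alternatives)) + r")\b")

input_df = pd.DataFrame({"text": [
    "haha too long didn't read and laught out loud :D",
    "laught-out loud so I couldnt too long; did not read",
    "what happened?",
]})

# The matched text is always one of the original phrases, so it can be
# used as a dictionary key without re-escaping.
result = input_df["text"].str.replace(
    pattern, lambda m: dict_rev[m.group(0)], regex=True
)
print(result.tolist())
```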
