Is there a better way to match all regex patterns in nested lists inside a dictionary?

Question

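I'm trying a simple text-matching exercise: I scraped the titles of blog posts and want to match them against predefined categories using specific keywords. For example, given the blog post title "Capture Perfect Night Shots with the Oppo Reno8 Series", once I have established that "Oppo" is listed under one of my categories, the title should be matched to my "phone" category.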

categories = {"phone" : ['apple', 'oppo', 'xiaomi', 'samsung', 'huawei', 'nokia'],
"postpaid" : ['signature', 'postpaid'],
"prepaid" : ['power all', 'giga'],
"sku" : ['data', 'smart bro'],
"ewallet" : ['gigapay'],
"event" : ['gigafest'],
"software" : ['ios', 'android', 'macos', 'windows'],
"subculture" : ['anime', 'korean', 'kpop', 'gaming', 'pop', 'culture', 'lgbtq', 'binge', 'netflix', 'games', 'ml', 'apple music'],
"health" : ['workout', 'workouts', 'exercise', 'exercises'],
"crypto" : ['axie', 'bitcoin', 'coin', 'crypto', 'cryptocurrency', 'nft'],
"virtual" : ['metaverse', 'virtual']}

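My DataFrame would then look like this (screenshot in the original post).

Fortunately, I found a reference on using regular expressions when mapping against a nested dictionary, but it doesn't seem to work on anything beyond the first few words.

The reference is linked in the original post.

So when I use the code below: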

def put_category(cats, text):

    regex = re.compile("(%s)" % "|".join(map(re.escape, categories.keys())))

    if regex.search(text):
        ret = regex.search(text)
        return ret[0]
    else:
        return 'general'

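It usually returns "general" as the category, even when the input is in lowercase (shown in a screenshot in the original post).

I would prefer to keep my current approach, where the keywords to match are listed as values inside the dictionary, rather than running pure regex patterns and then deriving the result through fuzzy matching.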

- Nicoconut

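You are searching for patterns that match the dictionary keys, not the values. Your reference uses the reverse of your data structure. - Kris

Note that you can use the walrus assignment operator so that you don't have to repeat the expression used in the if condition. - outis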
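
As a hedged illustration of the two comments above, the question's function could be written with the walrus operator (`:=`, available from Python 3.8) roughly as follows. The dictionary is abbreviated for the example, and the function reads its `cats` parameter instead of the global. It still builds the pattern from the dictionary keys, so it only shows the syntax change and reproduces the behaviour Kris describes; it is not a fix.

```python
import re

# Abbreviated, illustrative subset of the question's dictionary
categories = {"phone": ["apple", "oppo"], "software": ["ios", "android"]}

def put_category(cats, text):
    # Pattern is built from the KEYS ("phone", "software"), as in the question
    regex = re.compile("(%s)" % "|".join(map(re.escape, cats.keys())))
    # The walrus operator binds the match object and tests it in one step
    if (ret := regex.search(text)):
        return ret[0]
    return "general"

print(put_category(categories, "a new phone case"))  # phone
print(put_category(categories, "Oppo Reno8"))        # general (keywords are never searched)
```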

2 Answers

In this case you are matching fixed words, not patterns, so you can do it without regular expressions.

Going back to your example:

import pandas as pd

CAT_DICT = {"phone" : ['apple', 'oppo', 'xiaomi', 'samsung', 'huawei', 'nokia'],
"postpaid" : ['signature', 'postpaid'],
"prepaid" : ['power all', 'giga'],
"sku" : ['data', 'smart bro'],
"ewallet" : ['gigapay'],
"event" : ['gigafest'],
"software" : ['ios', 'android', 'macos', 'windows'],
"subculture" : ['anime', 'korean', 'kpop', 'gaming', 'pop', 'culture', 'lgbtq', 'binge', 'netflix', 'games', 'ml', 'apple music'],
"health" : ['workout', 'workouts', 'exercise', 'exercises'],
"crypto" : ['axie', 'bitcoin', 'coin', 'crypto', 'cryptocurrency', 'nft'],
"virtual" : ['metaverse', 'virtual']}

df = pd.DataFrame({"title": [
    "Capture Perfect Night Shots with the Oppo Reno8 Series",
    "Personal is Powerful: Why Apple's iOS 16 is the Smartest update"
]})

You can define this function to assign categories to each title:

def assign_cat(title: str, cat_dict: dict[str, list[str]]) -> list[str]:
    title_low = title.lower()
    categories = list()
    for c,words in cat_dict.items():
        if any([w in title_low for w in words]):
            categories.append(c)
    if len(categories) == 0:
        categories.append("general")
    return categories

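The key part is here: any([w in title_low for w in words]). For each word in a category, you check whether it occurs in the lowercased title. If any of the words occurs, you associate that category with the title.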

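You get a list of categories for each title.

The advantage of this approach is that a title can be assigned several categories (see the second title, which mentions both Apple and iOS).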
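
To make the behaviour concrete, here is a minimal sketch that runs the same check without pandas. The dictionary `cats` is an abbreviated, illustrative subset, and `assign_cat` is restated in a compact form with a generator expression; both are assumptions for the example rather than the answer's exact code. The third call shows one consequence of using `in` on strings: it is a substring test, so a short keyword such as "pop" also matches inside a longer word.

```python
# Abbreviated, illustrative subset of the category dictionary
cats = {
    "phone": ["apple", "oppo"],
    "software": ["ios", "android"],
    "subculture": ["pop", "culture"],
}

def assign_cat(title, cat_dict):
    title_low = title.lower()
    # A category matches if any of its keywords occurs anywhere in the title
    found = [c for c, words in cat_dict.items()
             if any(w in title_low for w in words)]
    # Fall back to "general" when no category matched
    return found or ["general"]

print(assign_cat("Capture Perfect Night Shots with the Oppo Reno8 Series", cats))
# ['phone']
print(assign_cat("Personal is Powerful: Why Apple's iOS 16 is the Smartest update", cats))
# ['phone', 'software']
print(assign_cat("Popular science podcasts", cats))
# ['subculture']  ("pop" is a substring of "popular")
```

If whole-word matching is required, a word-boundary regex is one option.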

- slymore

- blhsing · Accepted Answer

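You can create a reverse mapping from keywords to categories, so that the corresponding category can be returned efficiently when a match is found: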

import re

# Invert the question's `categories` dict: keyword -> category
mapping = {keyword: category for category, keywords in categories.items() for keyword in keywords}

def put_category(mapping, text):
    match = re.search(rf'\b(?:{"|".join(map(re.escape, mapping))})\b', text, re.I)
    if match:
        return mapping[match[0].lower()]
    return 'general'

print(put_category(mapping, "Capture Perfect Night Shots with the Oppo Reno8 Series"))

This outputs:

phone

Demo: https://replit.com/@blhsing/BlandAdoredParser
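
The following is a self-contained sketch of the same reverse-mapping idea, under a few assumptions that are not part of the answer: the dictionary is an abbreviated subset, the pattern is compiled once outside the function, and the keywords are sorted longest first. The sorting is an addition made here so that a multi-word keyword such as "apple music" is tried before its prefix "apple"; regex alternation takes the first alternative that matches at a given position.

```python
import re

# Abbreviated, illustrative subset of the category dictionary
categories = {
    "phone": ["apple", "oppo"],
    "prepaid": ["power all", "giga"],
    "ewallet": ["gigapay"],
    "subculture": ["pop", "apple music"],
}

# Reverse mapping: keyword -> category
mapping = {kw: cat for cat, kws in categories.items() for kw in kws}

# Longest keywords first, so "apple music" is tried before "apple"
alternation = "|".join(map(re.escape, sorted(mapping, key=len, reverse=True)))
# \b requires word boundaries, so "pop" does not match inside "popular"
pattern = re.compile(rf"\b(?:{alternation})\b", re.I)

def put_category(text):
    match = pattern.search(text)
    # The match is lowercased because the mapping keys are lowercase
    return mapping[match[0].lower()] if match else "general"

print(put_category("Capture Perfect Night Shots with the Oppo Reno8 Series"))  # phone
print(put_category("Stream Apple Music for free"))  # subculture
print(put_category("Pay with GigaPay today"))       # ewallet
print(put_category("Popular phones"))               # general
```

Unlike the list-based approach, this returns only the category of the first keyword found in the text.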