Create a custom dictionary from two lists

Question

6

I have the following two Python lists.

prob_tokens = ['119', '120', '123', '1234', '12345']

complete_tokens = ['112', '120', '121', '123', '1233', '1234', '1235', '12345']

min_len_sec_list = 3
max_len_sec_list = 5

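I want to create a dictionary whose keys are the elements of the first list, subject to the following constraints:

If the key is not present in the second list, the value is False.
If the key is present in the second list together with variants of it, the value is False.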

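For example:

(i) When checking 123, if 1234 or 12345 (123*) is present in the second list, then the value for 123 is False.

(ii) Similarly, when checking 1234, if 12345 (1234*) is present, then the value is False.

Here * stands for [0-9]{(max_len-len_token)}.

If the key is present in the second list without any variants, the value is True.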

Output: final_token_dict

{'119': False,'120': True, '123': False, '1234': False, '12345': True}

Could I get any suggestions on how to achieve this? Thanks in advance!

- Avinash Clinton

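So, what have you actually tried? - taras

"Here * will be [0-9]{(max_len-len_token)}." Can you explain what this means? - tobias_k

I didn't know about the startswith() function before and had tried many alternatives. Basically, if 123 is my token and max_len is 5, I first check whether 123 exists, then whether any four-digit token starts with 123 (e.g. 1230, 1231, ..., 1239), and then whether any five-digit token starts with 123. - Avinash Clinton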
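One possible reading of the check described in the comment above, sketched with a regular expression. It assumes that `*` means "between 1 and max_len - len(token) extra digits"; the helper name `has_variant` is hypothetical and not part of the original question.

```python
import re

def has_variant(token, complete_tokens, max_len):
    """Return True if a longer token (up to max_len digits) starts with token."""
    extra = max_len - len(token)
    if extra <= 0:
        return False  # token already has the maximum length, so no variant can exist
    # token followed by 1..extra additional digits, matched against the whole string
    pattern = re.compile(re.escape(token) + r"[0-9]{1,%d}" % extra)
    return any(pattern.fullmatch(t) for t in complete_tokens)

complete_tokens = ['112', '120', '121', '123', '1233', '1234', '1235', '12345']
print(has_variant('123', complete_tokens, 5))    # True ('1233' matches)
print(has_variant('120', complete_tokens, 5))    # False
print(has_variant('12345', complete_tokens, 5))  # False
```

Under this reading, a key maps to True when it is in the second list and `has_variant` returns False for it.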

6 Answers

4

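You can convert the list into a trie (prefix tree) and then check whether a key is a proper prefix of anything in the trie. This is faster than checking, for each key, whether it is a prefix of each element of the list. Specifically, if prob_tokens has k elements and complete_tokens has n elements, this approach takes only O(n+k) time, whereas checking every pair takes O(n*k).¹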

def make_trie(lst):
    trie = {}
    for key in lst:
        t = trie
        for c in key:
            t = t.setdefault(c, {})
    return trie

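def check_trie(trie, key):
    for c in key:
        trie = trie.get(c)
        if trie is None:
            return False  # key is not in the trie
    # Only decide after the whole key is consumed, so that a key longer than
    # a stored token (e.g. '1200' vs '120') is not reported as a leaf.
    return not trie  # True for a leaf, False if longer tokens continue from here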

prob_tokens = ['119', '120', '123', '1234', '12345']
complete_tokens = ['112', '120', '121', '123', '1233', '1234', '1235', '12345']

trie = make_trie(complete_tokens)
# {'1': {'1': {'2': {}}, '2': {'0': {}, '1': {}, '3': {'3': {}, '4': {'5': {}}, '5': {}}}}}
res = {key: check_trie(trie, key) for key in prob_tokens}
# {'119': False, '120': True, '123': False, '1234': False, '12345': True}

¹⁾ Strictly speaking, the average length of the keys is also a factor, but that holds for both approaches.
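A variant of the trie idea, sketched here as an illustration only: storing an explicit end-of-token marker makes it visible in the structure which nodes are complete tokens, rather than relying on "leaf means token". The sentinel `END` and the function name `is_unextended_token` are assumptions introduced for this sketch, not part of the answer above.

```python
END = object()  # hypothetical sentinel key meaning "a complete token ends at this node"

def make_trie(tokens):
    trie = {}
    for token in tokens:
        node = trie
        for ch in token:
            node = node.setdefault(ch, {})
        node[END] = True  # mark the end of a complete token
    return trie

def is_unextended_token(trie, key):
    node = trie
    for ch in key:
        node = node.get(ch)
        if node is None:
            return False  # key is not a prefix of any token
    # True only if a token ends here and no child characters follow
    return END in node and len(node) == 1

prob_tokens = ['119', '120', '123', '1234', '12345']
complete_tokens = ['112', '120', '121', '123', '1233', '1234', '1235', '12345']

trie = make_trie(complete_tokens)
res = {key: is_unextended_token(trie, key) for key in prob_tokens}
print(res)
```

For the sample data this gives the same dictionary as the leaf-based check; the marker mainly helps if you later need to tell "not a token at all" apart from "a token that has variants".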

- tobias_k

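Nice! Is there a good way to account for the "maximum number of characters" criterion the OP mentions? - jpp

1

@jpp To be honest, I didn't really understand that part. If you do, could you explain it? - tobias_k

On re-reading, you're right, it isn't entirely clear. I've included my interpretation at the end of my answer. - jpp

1

@jpp In my case I only need the tokens that are present in the second list without variants, so I can filter out all tokens outside the min_len to max_len range. So both of your solutions, as well as tobias_k's solution, work well for me. - Avinash Clinton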

3

This could be another option.

import re

prob_tokens = ['119', '120', '123', '1234', '12345']

complete_tokens = ['112', '120', '121', '123', '1233', '1234', '1235', '12345']

dictionary = dict()
for tok in prob_tokens:
    # A variant is any complete token consisting of tok followed by one or more digits
    variant = re.compile(r'^%s\d+' % re.escape(tok))
    if tok not in complete_tokens or any(variant.search(tok2) for tok2 in complete_tokens):
        dictionary[tok] = False
    else:
        dictionary[tok] = True

print(dictionary)

- thushv89

3

You can use any:

a = ['119', '120', '123', '1234', '12345']
b = ['112', '120', '121', '123', '1233', '1234', '1235', '12345']
new_d = {c:c in b and not any(i.startswith(c) and len(c) < len(i) for i in b) for c in a}

Output:

{'120': True, '1234': False, '119': False, '123': False, '12345': True}

- Ajax1234

2

I think you could also try this:

from collections import Counter

prob_tokens = ['119', '120', '123', '1234', '12345']

complete_tokens = ['112', '120', '121', '123', '1233', '1234', '1235', '12345']

result = {}
for token in prob_tokens:
    token_len = len(token)

    # Create counts of prefix lengths
    counts = Counter(c[:token_len] for c in complete_tokens)

    # True only if the token itself is a complete token and no other
    # complete token shares it as a prefix
    result[token] = token in complete_tokens and counts[token] == 1

print(result)

Which outputs:

{'119': False, '120': True, '123': False, '1234': False, '12345': True}
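The Counter above is rebuilt for every token. As a rough sketch of an alternative, all prefixes of the complete tokens can be counted once up front. This assumes complete_tokens contains no duplicates; the names `prefix_counts` and `complete_set` are introduced here for illustration only.

```python
from collections import Counter

prob_tokens = ['119', '120', '123', '1234', '12345']
complete_tokens = ['112', '120', '121', '123', '1233', '1234', '1235', '12345']

# Count every non-empty prefix of every complete token in a single pass
prefix_counts = Counter(
    token[:i] for token in complete_tokens for i in range(1, len(token) + 1)
)
complete_set = set(complete_tokens)

# A key is True when it is a complete token and is the only token with that prefix
result = {
    token: token in complete_set and prefix_counts[token] == 1
    for token in prob_tokens
}
print(result)
# {'119': False, '120': True, '123': False, '1234': False, '12345': True}
```

The trade-off is extra memory for the prefix table in exchange for a constant-time lookup per key.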

- RoadRunner

2

A plain dictionary comprehension does the job: the value is True if the key is in complete_tokens and exactly one element of complete_tokens starts with the key.

prob_tokens = ['119', '120', '123', '1234', '12345']
complete_tokens = ['112', '120', '121', '123', '1233', '1234', '1235', '12345']

res = {elem: elem in complete_tokens and sum(v.startswith(elem) for v in complete_tokens) == 1 for elem in prob_tokens}
print(res)

Output

{'119': False, '120': True, '123': False, '1234': False, '12345': True}

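For better efficiency, you can convert complete_tokens to a set for fast membership tests and use any, which stops at the first variant it finds instead of checking every element.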

complete_tokens_set = set(complete_tokens)
res = {elem:elem in complete_tokens_set and not any(v!=elem and v.startswith(elem) for v in complete_tokens_set) for elem in prob_tokens}

- Sunitha

- jpp · Accepted Answer

You can use a custom function with a dictionary comprehension:

prob_tokens = ['119', '120', '123', '1234', '12345']
complete_tokens = ['112', '120', '121', '123', '1233', '1234', '1235', '12345']

def mapper(val, ref_list):
    if any(x.startswith(val) and (len(x) > len(val)) for x in ref_list):
        return False
    if val in ref_list:
        return True
    return False

res = {i: mapper(i, complete_tokens) for i in prob_tokens}

print(res)

{'119': False, '120': True, '123': False, '1234': False, '12345': True}

If the character-count criterion matters to you, you can adjust the logic accordingly with a chained comparison and an extra argument:

def mapper(val, ref_list, max_len):
    if any(x.startswith(val) and (0 < (len(x) - len(val)) <= max_len) for x in ref_list):
        return False
    if val in ref_list:
        return True
    return False

min_len_sec_list = 3
max_len_sec_list = 5

add_lens = max_len_sec_list - min_len_sec_list

res = {i: mapper(i, complete_tokens, add_lens) for i in prob_tokens}
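To see what the extra argument changes, here is a small self-contained sketch. The list `tokens` is hypothetical data chosen so that the only longer token falls outside the length window; the function body is a condensed form of the mapper above.

```python
def mapper(val, ref_list, max_len):
    # A longer token counts as a variant only if it adds at most max_len characters
    if any(x.startswith(val) and 0 < len(x) - len(val) <= max_len for x in ref_list):
        return False
    return val in ref_list

tokens = ['123', '123456']  # hypothetical data, not from the question

print(mapper('123', tokens, 2))  # True: '123456' is 3 characters longer, outside the window
print(mapper('123', tokens, 3))  # False: '123456' now counts as a variant
```

With the question's own lists and add_lens = 2, every longer token is within the window, so the result matches the first version.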