如何在线性时间内找出字符串中多重集合的所有组合？

Question

如何在线性时间内找出字符串中多重集合的所有组合？

3

我被赋予一个包含m个字符的袋子B（multiset）和一个大小为n的字符串text S。是否可能在线性时间O(n)内找到所有可以由B（4！= 24种组合）在S中创建的子字符串？

例子：

S = abdcdbcdadcdcbbcadc (n=19)
B = {b, c, c, d} (m=4)
Result: {cdbc (Position 3), cdcb (Position 10)}

我发现最快的解决方案是为每个字符保留计数器，并在每一步中将其与袋子进行比较，这样运行时就是O(n*m)。如果需要，可以展示算法。

- owning Mahowny

3个回答

1

感谢您的回答。为了使算法正常工作，必须更改add()和remove()方法。

add(c):
    if hist[c] > 0 and histrun[c] < hist[c] then
        histrunsum++
    else
        histrunsum--

    histrun[c] = histrun[c] + 1


remove(c):
    if histrun[c] > hist[c] then
        histrunsum++
    else
        histrunsum--

    histrun[c] = histrun[c] - 1

解释： histrunsum 可以看作是两个多重集合相似程度的得分。

add(c)：当 histrun 多重集合中某个字符的出现次数少于 hist 多重集合中该字符的出现次数时，该字符的额外出现次数必须被“奖励”，因为 histrun 多重集合正在接近 hist 多重集合。如果 histrun 集合中已经有至少相等或更多的字符，则额外字符为负。

remove(c)：与 add(c) 类似，其中删除字符的权重为正，当其在 histrun 多重集合中的数量大于 hist 多重集合时。

示例代码（PHP）：

function multisetSubstrings($sequence, $mset)
{
    $multiSet = array();
    $substringLength = 0;
    foreach ($mset as $char)
    {
        $multiSet[$char]++;
        $substringLength++;
    }

    $sum = 0;
    $currentSet = array();
    $result = array();

    for ($i=0;$i<strlen($sequence);$i++)
    {

        if ($i>=$substringLength)
        {
            $c = $sequence[$i-$substringLength];

            if ($currentSet[$c] > $multiSet[$c])
                $sum++;
            else
                $sum--;

            $currentSet[$c]--;
        }


        $c = $sequence[$i];

        if ($currentSet[$c] < $multiSet[$c])
            $sum++;
        else
            $sum--;

        $currentSet[$c]++;

        echo $sum."<br>";


        if ($sum==$substringLength)
            $result[] = $i+1-$substringLength;
    }


    return $result;
}

- owning Mahowny

说实话，我无法理解你的逻辑；你能解释一下你所做更改的目的吗？（例如，在你的版本中histrunsum是什么意思，为什么需要进行这些更改） - zeuxcg

0

使用哈希。对于多重集合中的每个字符，分配一个唯一的质数。通过将与数字关联的质数乘以该数字的频率次数来计算任何字符串的哈希值。

例如：CATTA。让C = 2，A = 3，T = 5。哈希= 2 * 3 * 5 * 5 * 3 = 450

哈希多重集合（将其视为字符串）。现在遍历输入字符串，并计算长度为k的每个子字符串的哈希值（其中k是多重集合中字符的数量）。检查此哈希是否与多重集合哈希匹配。如果是，则这是一个出现。

哈希可以按如下方式非常容易地在线性时间内计算：

让多重集合= {A，A，B，C}，A = 2，B = 3，C = 5。

多重集合哈希= 2 * 2 * 3 * 5 = 60

让文本= CABBAACCA

(i) CABB = 5 * 2 * 3 * 3 = 90

(ii) 现在，下一个字母是A，丢弃的字母是第一个字母C。因此，新哈希=（90/5）* 2 = 36

（iii）现在，A被丢弃，而A也被添加，因此新的哈希值=（36/2）* 2 = 36

（iv）现在B被丢弃，C被添加，因此哈希值=（36/3）* 5 = 60 =多重集哈希。因此我们找到了一个所需的出现-BAAC

这个过程显然需要O（n）时间。

- arya

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- zeuxcg · Accepted Answer

假设我们只对长度为m的子字符串感兴趣，有一种O(n)的方法可以完成它（否则是不可能的，因为对于包含字符串中所有字符的bag，你必须返回s的所有子字符串，这意味着一个O(n^2)的结果无法在O(n)内计算）。

该算法如下：

Convert the bag to a histogram:

hist = []
for c in B do:
    hist[c] = hist[c] + 1

Initialize a running histogram that we're going to modify (histrunsum is the total count of characters in histrun):
```
histrun = []
histrunsum = 0
```

We need two operations: add a character to the histogram and remove it. They operate as follows:

add(c):
    if hist[c] > 0 and histrun[c] < hist[c] then:
        histrun[c] = histrun[c] + 1
        histrunsum = histrunsum + 1

remove(c):
    if histrun[c] > 0 then:
        histrun[c] = histrun[c] - 1
        histrunsum = histrunsum + 1

Essentially, histrun captures the amount of characters that are present in B in current substring. If histrun is equal to hist, our substring has the same characters as B. histrun is equal to hist iff histrunsum is equal to length of B.
Now add first m characters to histrun; if histrunsum is equal to length of B; emit first substring; now, until we reach the end of string, remove the first character of the current substring and add the next character.
add, remove are O(1) since hist and histrun are arrays; checking if hist is equal to histrun is done by comparing histrunsum to length(B), so it's also O(1). Loop iteration count is O(n), the resulting running time is O(n).