例子:
S = abdcdbcdadcdcbbcadc (n=19)
B = {b, c, c, d} (m=4)
Result: {cdbc (Position 3), cdcb (Position 10)}
我发现最快的解决方案是为每个字符保留计数器,并在每一步中将其与袋子进行比较,这样运行时就是O(n*m)
。如果需要,可以展示算法。
S = abdcdbcdadcdcbbcadc (n=19)
B = {b, c, c, d} (m=4)
Result: {cdbc (Position 3), cdcb (Position 10)}
我发现最快的解决方案是为每个字符保留计数器,并在每一步中将其与袋子进行比较,这样运行时就是O(n*m)
。如果需要,可以展示算法。
假设我们只对长度为m的子字符串感兴趣,有一种O(n)的方法可以完成它(否则是不可能的,因为对于包含字符串中所有字符的bag,你必须返回s的所有子字符串,这意味着一个O(n^2)的结果无法在O(n)内计算)。
该算法如下:
Convert the bag to a histogram:
hist = []
for c in B do:
hist[c] = hist[c] + 1
Initialize a running histogram that we're going to modify (histrunsum is the total count of characters in histrun):
histrun = []
histrunsum = 0
We need two operations: add a character to the histogram and remove it. They operate as follows:
add(c):
if hist[c] > 0 and histrun[c] < hist[c] then:
histrun[c] = histrun[c] + 1
histrunsum = histrunsum + 1
remove(c):
if histrun[c] > 0 then:
histrun[c] = histrun[c] - 1
histrunsum = histrunsum + 1
Essentially, histrun captures the amount of characters that are present in B in current substring. If histrun is equal to hist, our substring has the same characters as B. histrun is equal to hist iff histrunsum is equal to length of B.
Now add first m characters to histrun; if histrunsum is equal to length of B; emit first substring; now, until we reach the end of string, remove the first character of the current substring and add the next character.
add, remove are O(1) since hist and histrun are arrays; checking if hist is equal to histrun is done by comparing histrunsum to length(B), so it's also O(1). Loop iteration count is O(n), the resulting running time is O(n).
感谢您的回答。为了使算法正常工作,必须更改add()
和remove()
方法。
add(c):
if hist[c] > 0 and histrun[c] < hist[c] then
histrunsum++
else
histrunsum--
histrun[c] = histrun[c] + 1
remove(c):
if histrun[c] > hist[c] then
histrunsum++
else
histrunsum--
histrun[c] = histrun[c] - 1
解释: histrunsum 可以看作是两个多重集合相似程度的得分。
add(c):当 histrun 多重集合中某个字符的出现次数少于 hist 多重集合中该字符的出现次数时,该字符的额外出现次数必须被“奖励”,因为 histrun 多重集合正在接近 hist 多重集合。如果 histrun 集合中已经有至少相等或更多的字符,则额外字符为负。
remove(c):与 add(c) 类似,其中删除字符的权重为正,当其在 histrun 多重集合中的数量大于 hist 多重集合时。
示例代码(PHP):
function multisetSubstrings($sequence, $mset)
{
$multiSet = array();
$substringLength = 0;
foreach ($mset as $char)
{
$multiSet[$char]++;
$substringLength++;
}
$sum = 0;
$currentSet = array();
$result = array();
for ($i=0;$i<strlen($sequence);$i++)
{
if ($i>=$substringLength)
{
$c = $sequence[$i-$substringLength];
if ($currentSet[$c] > $multiSet[$c])
$sum++;
else
$sum--;
$currentSet[$c]--;
}
$c = $sequence[$i];
if ($currentSet[$c] < $multiSet[$c])
$sum++;
else
$sum--;
$currentSet[$c]++;
echo $sum."<br>";
if ($sum==$substringLength)
$result[] = $i+1-$substringLength;
}
return $result;
}
(iii)现在,A被丢弃,而A也被添加,因此新的哈希值=(36/2)* 2 = 36
(iv)现在B被丢弃,C被添加,因此哈希值=(36/3)* 5 = 60 =多重集哈希。因此我们找到了一个所需的出现-BAAC
这个过程显然需要O(n)时间。