寻找最短的子数组，其中包含所有元素

Question

寻找最短的子数组，其中包含所有元素

23

假设您有一个数字数组和另一组数字。您需要找到包含所有数字的最短子数组，复杂度要最小。

数组中可能有重复项，而我们假设数字集合没有重复项。它没有被排序 - 子数组可以按任何顺序包含数字集合。

例如：

Array: 1 2 5 8 7 6 2 6 5 3 8 5
Numbers: 5 7

那么最短的子数组显然是 Array[2:5] （Python 表示法）。

此外，如果出于某种原因（例如在线算法），您想避免对数组进行排序，您会怎么做？

- R S

我假设这个数组可以有重复的元素？ - Josh

数字的范围是多少？ - Shamim Hafiz - MSFT

“另一组数字”是无序的吗？因为它是一个集合，所以没有重复项？ - Ted Hopp

给出一些数组的相关示例。 - Saurabh Gokhale

你所说的 Python 表示法是指索引 2 到 5，但不包括 5 吗？ - flight

6个回答

6

这听起来像是一个非常适合采用“滑动窗口”方法的问题：维护一个逐渐扩大和收缩的“窗口”（子数组），并使用哈希表来跟踪窗口中每个“有趣”的数字出现的次数。例如，从一个空窗口开始，然后将其扩展到仅包括元素0，然后是元素0-1，然后是0-2、0-3等，通过添加后续元素（并使用哈希表跟踪窗口中存在哪些数字）。当哈希表告诉你所有有趣的数字都在窗口中存在时，你可以开始收缩它：例如0-5、1-5、2-5等，直到发现窗口不再包含所有有趣的数字。然后，你可以再次从右侧开始扩展它，以此类推。我相当确定（但不完全确定）这种方法可以解决你的问题，并且可以实现线性时间运行。

- Aasmund Eldhuset

这听起来确实是正确的方向，但我找不到具体的表述方式。 - R S

@R S：在纸上试一下，看看是否有帮助。 :-) 不用担心让它在线性时间内运行；这可以通过添加一个小技巧来实现，而不改变主算法的工作方式。因此，对于你添加到/从窗口中删除的每个数字，请更新哈希映射（在纸上，可以是数字频率表），并查看是否存在所有有趣的数字。我现在要去睡觉了（猜测这是发布任何东西的坏时机），但如果你卡住了，就发表一下你迄今为止的进展，我（或其他人）明天可以看一下。 - Aasmund Eldhuset

最后，需要注意的是，在算法的每次迭代中，我们要么通过一个元素扩展窗口，要么通过一个元素收缩窗口，并且每个元素只包含一次和排除一次。因此，对于_n_个数字，有_O(n)_次包含/排除操作，并且每个这样的操作使用（预期的）_O(1)_时间（哈希映射查找和更新_m_）。 - Aasmund Eldhuset

我看到的问题是，如果你在左边收缩窗口，然后在右边扩展它而不在左边向后扩展，那么你就会跳过一些窗口。具体来说，在你的例子中，你永远不会考虑窗口0-6或1-6。为了使你的线性时间算法正确，你需要证明这样做永远不会错过更短的窗口。 - j_random_hacker

我同意你对于情况1的推理，但是对于情况2我并不认同——或者说我并不确信这覆盖了所有可能的情况。（此外，我认为你在这部分中的“interval”一词的某个出现意图是“window”，但我不确定是哪一个！）如果窗口的左边缘超过了间隔的左边缘，无论间隔和窗口的最右边缘的相对位置如何，都会错过该间隔。顺便说一句，我相信我已经使用不同的技术证明了你的算法（或者可能是略微变体）的正确性。 - j_random_hacker

显示剩余5条评论

0

这个解决方案绝对不是像上面的伪代码所建议的那样在O(n)时间内运行，然而它是一个真正的（Python）代码，可以解决问题，并根据我的估计以O(n^2)的时间运行：

def small_sub(A, B):
    len_A = len(A)
    len_B = len(B)

    sub_A = []
    sub_size = -1
    dict_b = {}

    for elem in B:
        if elem in dict_b:
            dict_b[elem] += 1
        else:
            dict_b.update({elem: 1})

    for i in range(0, len_A - len_B + 1):
        if A[i] in dict_b:
            temp_size, temp_sub = find_sub(A[i:], dict_b.copy())

            if (sub_size == -1 or (temp_size != -1 and temp_size < sub_size)):
                sub_A = temp_sub
                sub_size = temp_size

    return sub_size, sub_A

def find_sub(A, dict_b):
    index = 0
    for i in A:
        if len(dict_b) == 0:
            break

        if i in dict_b:
            dict_b[i] -= 1

            if dict_b[i] <= 0:
                del(dict_b[i])

        index += 1

    if len(dict_b) > 0:
        return -1, {}
    else:
        return index, A[0:index]

- flyerfye

0

以下是我如何使用collections.Counter对象在线性时间内解决这个问题的方法。

from collections import Counter

def smallest_subsequence(stream, search):
    if not search:
        return []  # the shortest subsequence containing nothing is nothing

    stream_counts = Counter(stream)
    search_counts = Counter(search)

    minimal_subsequence = None

    start = 0
    end = 0
    subsequence_counts = Counter()

    while True:
        # while subsequence_counts doesn't have enough elements to cancel out every
        # element in search_counts, take the next element from search
        while search_counts - subsequence_counts:
            if end == len(stream):  # if we've reached the end of the list, we're done
                return minimal_subsequence
            subsequence_counts[stream[end]] += 1
            end += 1

        # while subsequence_counts has enough elements to cover search_counts, keep
        # removing from the start of the sequence
        while not search_counts - subsequence_counts:
            if minimal_subsequence is None or (end - start) < len(minimal_subsequence):
                minimal_subsequence = stream[start:end]
            subsequence_counts[stream[start]] -= 1
            start += 1

print(smallest_subsequence([1, 2, 5, 8, 7, 6, 2, 6, 5, 3, 8, 5], [5, 7]))
# [5, 8, 7]

- user3064538

0

假设数组有n个元素，集合有m个元素

Sort the array, noting the reverse index (position in the original array)
// O (n log n) time

for each element in given set
   find it in the array
// O (m log n) time  - log n for binary serch, m times

keep track of the minimum and maximum index for each found element

min - max 定义了你的范围

总时间复杂度：O((m+n)logn)

- Josh

什么是min-max？指的是哪个min和max？对于重复项应该怎么处理？ - flight

哦，等等……我明白了最小值和最大值。我之前有点困惑，因为你的解决方案没有考虑重复的情况。 - flight

仅记录每个数字的最小和最大索引是不够的，因为如果一个数字出现3次或更多次，则最短的子数组未必包含其最左侧或最右侧的出现。例如，在3、1、2、3、3中，包含{1, 2, 3}的最短子数组是什么？答案是[1, 4]，它不使用3的最小或最大位置。 - j_random_hacker

0

Java 解决方案

    List<String> paragraph = Arrays.asList("a", "c", "d", "m", "b", "a");
    Set<String> keywords = Arrays.asList("a","b");

    Subarray result = new Subarray(-1,-1);

    Map<String, Integer> keyWordFreq = new HashMap<>();

    int numKeywords = keywords.size();

    // slide the window to contain the all the keywords**
    // starting with [0,0]
    for (int left = 0, right = 0 ; right < paragraph.size() ; right++){

      // expand right to contain all the keywords
      String currRight = paragraph.get(right);

      if (keywords.contains(currRight)){
        keyWordFreq.put(currRight, keyWordFreq.get(currRight) == null ? 1 : keyWordFreq.get(currRight) + 1);
      }

      // loop enters when all the keywords are present in the current window
      // contract left until the all the keywords are still present
      while (keyWordFreq.size() == numKeywords){
        String currLeft = paragraph.get(left);

        if (keywords.contains(currLeft)){

          // remove from the map if its the last available so that loop exists
          if (keyWordFreq.get(currLeft).equals(1)){

            // now check if current sub array is the smallest
            if((result.start == -1 && result.end == -1) || (right - left) < (result.end - result.start)){
              result = new Subarray(left, right);
            }
            keyWordFreq.remove(currLeft);
          }else {

            // else reduce the frequcency
            keyWordFreq.put(currLeft, keyWordFreq.get(currLeft) - 1);
          }
        }
        left++;
      }

    }

    return result;
  }

- Rayudu Alp

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- j_random_hacker · Accepted Answer

线性时间复杂度解法证明

我会使用右扩展表示将范围的右端点增加1，左收缩表示将范围的左端点增加1。这个答案是Aasmund Eldhuset's answer的一个小变化。区别在于，一旦我们找到最小的j使得[0，j]包含所有感兴趣的数字，我们之后就只考虑包含所有感兴趣的数字的范围。（可能会这样理解Aasmund的答案，但也可能将其解释为允许由于左收缩而丢失单个有趣数字）。

基本思想是，对于每个位置j，我们将找到以位置j结束的满足要求的最短范围，假设我们已经知道了以位置j-1 结束的满足要求的最短范围。

编辑: 修正了基础案例中的错误。

基础案例: 找到最小的j'，使得[0，j']包含所有有趣的数字。根据构造，不能存在包含所有有趣的数字的[0，k < j']范围，所以我们不需要进一步考虑它们。现在找到最小最大的i，使得[i，j']包含所有有趣的数字（即保持j'固定）。这是以位置j'结束的最小满足要求的范围。

为了找到以任意位置j结束的最小满足要求的范围，我们可以将以位置j-1 结束的最小满足要求的范围向右扩展1个位置。这个范围必然也包含所有有趣的数字，但可能不是最短的长度。我们已经知道这是一个满足要求的范围，意味着我们不必担心向左“反向”延伸范围，因为那只会增加范围超过其最小长度（即使解决方案更糟）。我们需要考虑的唯一操作是保持包含所有有趣数字属性的左收缩。因此，在满足此属性的情况下尽可能推进范围的左端点。当无法再进行左收缩时，我们就找到了以j结束的最小满足要求的范围（因为进一步的左收缩显然不能再使范围满足要求），这时我们就完成了。

由于我们在每个最右端的位置j上执行此操作，因此可以在所有最右端的位置中取最小长度范围，以找到整体最小值。这可以使用嵌套循环来完成，在每个外部循环周期中，j都会向前推进1次。显然，j会向前推进1 n次。由于在任何时候，我们只需要知道以前一个j值为最佳范围的最左侧位置，因此我们可以将其存储在i中，并根据需要进行更新。 i始终从0开始，始终<= j <= n，仅向上方向上升，最多可以向上升n次。i和j都最多向前推进n次，因此该算法具有线性时间复杂度。

下面是伪代码，将两个阶段合并为单

# x[0..m-1] is the array of interesting numbers.
# Load them into a hash/dictionary:
For i from 0 to m-1:
    isInteresting[x[i]] = 1

i = 0
nDistinctInteresting = 0
minRange = infinity
For j from 0 to n-1:
    If count[a[j]] == 0 and isInteresting[a[j]]:
        nDistinctInteresting++
    count[a[j]]++

    If nDistinctInteresting == m:
        # We are in phase 2: contract the left side as far as possible
        While count[a[i]] > 1 or not isInteresting[a[i]]:
            count[a[i]]--
            i++

        If j - i < minRange:
            (minI, minJ) = (i, j)

count[] 和 isInteresting[] 是哈希/字典（如果涉及的数字较小，则为普通数组）。