How to find the minimum-length subsequence that contains all elements of a sequence

Question

algorithm

16

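Given a sequence such as S = {1,8,2,1,4,1,2,9,1,8,4}, I need to find the minimum-length subsequence that contains all elements of S (no duplicates, order does not matter). How can I find this subsequence efficiently?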

Note: S has 5 distinct elements: {1,2,4,8,9}. The minimum-length subsequence must contain all 5 of them.

- russell

O(n^2) is not good enough. Is there an O(n lg n) or O(n) algorithm? - russell

The number of elements n is 10^5, and the value of each element is <= 10^5. - russell

Yes, the numbers are always positive. - russell

1

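You could do this with an adaptation of the Boyer-Moore algorithm. Lookups can use a hash table. In the general case the time complexity should be O(n) or less. - leppie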

This sounds a lot like an OrderedSet: it would contain the unique elements in the order of their first appearance, and is O(n) in implementation. - F1Rumors

7 Answers

3

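This can be solved with dynamic programming.

At each step k, we compute the shortest subsequence that ends at the k-th position of S and satisfies the requirement of containing all the unique elements of S.

Given the solution for step k (hereafter "the sequence"), computing the solution for step k+1 is easy: append the (k+1)-th element of S to the sequence, then remove elements one by one from the start of the sequence for as long as they occur more than once in the extended sequence.

The solution to the overall problem is the shortest sequence found at any step.

The initialization of the algorithm consists of two stages:

1. Scan S once, building the alphabet of unique values.
2. Find the shortest valid sequence whose first element is the first element of S; the last position of this sequence is the initial value of k.

All of the above can be done in O(n log n) worst-case time (let me know if this needs clarification).

Here is a complete implementation of the above algorithm in Python: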

import collections

S = [1,8,2,1,4,1,2,9,1,8,4,2,4]

# initialization: stage 1
alphabet = set(S)                         # the unique values ("symbols") in S
count = collections.defaultdict(int)      # how many times each symbol appears in the sequence

# initialization: stage 2
start = 0
for end in range(len(S)):
  count[S[end]] += 1
  if len(count) == len(alphabet):         # seen all the symbols yet?
    break
end += 1
while count[S[start]] > 1:                # trim leading symbols that occur again later
  count[S[start]] -= 1
  start += 1

best_start = start
best_end = end

# the induction
while end < len(S):
  count[S[end]] += 1
  while count[S[start]] > 1:
    count[S[start]] -= 1
    start += 1
  end += 1
  if end - start < best_end - best_start: # new shortest sequence?
    best_start = start
    best_end = end

print(S[best_start:best_end])

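Notes:

The data structures I use (dictionaries and sets) are based on hash tables; they perform well on average, but a single operation can degrade to O(n) in the worst case. If it is the worst case you care about, replacing them with tree-based structures gives the overall O(n log n) I promised above.
As @biziclop pointed out, the first scan of S can be eliminated, which makes the algorithm suitable for streaming data.
If the elements of S are small non-negative integers, as your comments indicate, count can be flattened into an integer array, bringing the overall complexity down to O(n).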
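
To illustrate the last note, here is a minimal Python 3 sketch of the same sliding window with `count` flattened into a plain list. It is not the code from this answer: the function name `shortest_cover_array` and the `max_value` parameter are made up for illustration, and it assumes every element is an integer in the range 0..max_value.

```python
def shortest_cover_array(seq, max_value):
    """Return (start, end) so that seq[start:end] is the shortest slice
    containing every distinct value of seq.
    Assumption: all values are integers in 0..max_value."""
    if not seq:
        return (0, 0)

    # Count the distinct values with a flat flag array instead of a set.
    seen = [False] * (max_value + 1)
    distinct = 0
    for x in seq:
        if not seen[x]:
            seen[x] = True
            distinct += 1

    count = [0] * (max_value + 1)  # occurrences of each value in the window
    present = 0                    # distinct values currently in the window
    start = 0
    best = None
    for end, x in enumerate(seq):
        if count[x] == 0:
            present += 1
        count[x] += 1
        # Drop leading elements that occur again later in the window.
        while count[seq[start]] > 1:
            count[seq[start]] -= 1
            start += 1
        if present == distinct:
            if best is None or end + 1 - start < best[1] - best[0]:
                best = (start, end + 1)
    return best


S = [1, 8, 2, 1, 4, 1, 2, 9, 1, 8, 4]
start, end = shortest_cover_array(S, max_value=10**5)
print(S[start:end])
```

Both indices only move forward and every list access is O(1), which is where the O(n) bound comes from; the price is O(max_value) extra memory.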

- NPE

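I think you can avoid scanning the whole sequence for the alphabet if, at each step, you keep track of how many distinct characters the subsequence contains and then pick the one with the most distinct characters. But this is only worth doing if you are processing streaming data. - biziclop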
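
One way to read this suggestion, as a hedged Python 3 sketch rather than anything posted in the thread: consume the input once, keep the trimmed window that ends at the latest item, and throw away the best candidate whenever a previously unseen value arrives (every earlier candidate is then missing that value). The name `shortest_cover_stream` is hypothetical, and the result is reported as offsets into the stream.

```python
from collections import defaultdict, deque


def shortest_cover_stream(stream):
    """Consume an iterable once and return (start, end) offsets, end exclusive,
    of the shortest window containing every distinct value seen.
    Returns None for an empty stream."""
    count = defaultdict(int)  # occurrences of each value in the window
    window = deque()          # buffered items of the current window
    start = 0                 # stream offset of window[0]
    best = None
    for end, x in enumerate(stream, start=1):
        is_new = count[x] == 0  # counts never drop back to 0, so 0 means unseen
        count[x] += 1
        window.append(x)
        # Drop leading items that occur again later in the window.
        while count[window[0]] > 1:
            head = window.popleft()
            count[head] -= 1
            start += 1
        # A new value invalidates all earlier candidates.
        if is_new or end - start < best[1] - best[0]:
            best = (start, end)
    return best


S = [1, 8, 2, 1, 4, 1, 2, 9, 1, 8, 4]
print(shortest_cover_stream(iter(S)))
```

Only the current window is buffered, so memory is bounded by the longest window rather than by the whole input.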

2

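Here is an algorithm that requires O(N) time and O(N) space. It is similar to Grigor Gevorgyan's. It also uses an auxiliary O(N) array of flags. The algorithm finds the longest subsequence of unique elements. If bestLength < numUnique, then there is no subsequence containing all the unique elements. The algorithm assumes that the elements are positive numbers and that the maximal element is less than the length of the sequence.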

#include <iostream>
using namespace std;

bool findLongestSequence() {
    // Data (adapt as needed)
    const int N = 13;
    char flags[N];
    int a[] = {1,8,2,1,4,1,2,9,1,8,1,4,1};

    // Number of unique elements
    int numUnique = 0;
    for (int n = 0; n < N; ++n) flags[n] = 0; // clear flags
    for (int n = 0; n < N; ++n) {
        if (a[n] < 0 || a[n] >= N) return false; // assumptions violated 
        if (flags[a[n]] == 0) {
            ++numUnique;
            flags[a[n]] = 1;
        }
    }

    // Find the longest sequence ("best")
    for (int n = 0; n < N; ++n) flags[n] = 0; // clear flags
    int bestBegin = 0, bestLength = 0;
    int begin = 0, end = 0, currLength = 0;
    for (; begin < N; ++begin) {
        while (end < N) {
            if (flags[a[end]] == 0) {
                ++currLength;
                flags[a[end]] = 1;
                ++end;
            }
            else {
                break; // end-loop
            }
        }
        if (currLength > bestLength) {
            bestLength = currLength;
            bestBegin = begin;
        }
        if (bestLength >= numUnique) {
            break; // begin-loop
        }
        flags[a[begin]] = 0; // reset
        --currLength;
    }

    cout << "numUnique = " << numUnique << endl;
    cout << "bestBegin = " << bestBegin << endl;
    cout << "bestLength = " << bestLength << endl;
    return true; // longest subsequence found
}

- Jiri Kriz

1

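If you need to do this often for the same sequence and different sets, you can use inverted lists. You prepare the inverted lists for the sequence and then collect all the offsets. Then you scan the results from the inverted lists for sequences of m consecutive numbers.

With a sequence of length n and a query of size m, the preparation is in O(n). If I have not miscalculated, the response time for a query is in O(m^2).

If you need more details, have a look at the 2004 paper by Clausen/Kurth on algebraic databases ("Content-Based Information Retrieval by Group Theoretical Methods"). It sketches a general database framework that can be adapted to your task.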
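
As a rough illustration of the preparation step, here is a Python 3 sketch of inverted lists for a sequence. The query function is my own interpretation, not the method from the paper: it merges the offsets of the queried values and slides a window over them, so it sorts the hits and does not reproduce the O(m^2) estimate above. Both function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_lists(seq):
    """Map each value to the ascending list of positions where it occurs.
    Built once in O(n) and reusable for many queries."""
    inverted = defaultdict(list)
    for pos, x in enumerate(seq):
        inverted[x].append(pos)
    return inverted


def shortest_window_for(inverted, query):
    """Return the inclusive (start, end) of the shortest window containing
    every value in query, or None if some value never occurs."""
    query = set(query)
    if not query or any(v not in inverted for v in query):
        return None
    # Merge the offsets of the queried values into one position-ordered list.
    hits = sorted((pos, v) for v in query for pos in inverted[v])
    count = defaultdict(int)  # occurrences of each queried value in the window
    have = 0                  # queried values currently covered
    left = 0                  # index into hits of the window's left end
    best = None
    for pos, v in hits:
        if count[v] == 0:
            have += 1
        count[v] += 1
        while have == len(query):
            lpos, lv = hits[left]
            if best is None or pos - lpos < best[1] - best[0]:
                best = (lpos, pos)
            count[lv] -= 1
            if count[lv] == 0:
                have -= 1
            left += 1
    return best


S = [1, 8, 2, 1, 4, 1, 2, 9, 1, 8, 4]
inverted = build_inverted_lists(S)
print(shortest_window_for(inverted, {1, 2, 4, 8, 9}))
print(shortest_window_for(inverted, {8, 2}))
```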

- LiKao

I think I misunderstood the question, so inverted lists may not work in this case. - LiKao

1

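I have an O(N*M) algorithm, where N is the length of S and M is the number of distinct elements (it is more efficient for small values of M; if there are very few duplicates, M approaches N and it may be a poor algorithm with quadratic cost). EDIT: in practice it is actually much closer to O(N). You only get O(N*M) in the worst case.

Start by going through the sequence and recording all the elements of S. Call this set E.

We will work with a dynamic subsequence of S. Create an empty map M that associates each element with the number of times it occurs in the subsequence.

For example, if subSequence = {1,8,2,1,4} and E = {1, 2, 4, 8, 9}: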

M[9]==0
M[2]==M[4]==M[8]==1
M[1]==2

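You need two indexes, each pointing to an element of S. One is called L because it is the left end of the subsequence formed by the two indexes. The other is called R because it is the index of the right end of the subsequence.

Start by initializing L=0, R=0 and M[S[0]]++. The algorithm is: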

Repeat
{
  While(M does not contain all the elements of E)
  {
    if(R is the last index of S)
      stop
    R++
    M[S[R]]++
  }
  While(M contains all the elements of E)
  {
    if(the subsequence S[L->R] is the shortest one seen so far)
      Record it
    M[S[L]]--
    L++
  }
}

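To check whether M contains all the elements of E, you can use a vector of booleans V, with V[i]==true if M[E[i]]>0 and V[i]==false if M[E[i]]==0. You start by setting all the values of V to false; each time you do M[S[R]]++ you set V for that element to true, and each time you do M[S[L]]-- and M[S[L]]==0 afterwards, you set V for that element to false.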
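
The L/R scheme above could look roughly like this in Python 3. This is a sketch under my own assumptions, not the answerer's code: the function name `shortest_cover_lr` is hypothetical, and instead of scanning the boolean vector V it keeps a `missing` counter (the number of elements of E whose count in M is zero), so the "does M contain all of E" test is O(1).

```python
from collections import defaultdict


def shortest_cover_lr(S):
    """Return the inclusive (L, R) bounds of the shortest window of S that
    contains every element of E = set(S), or None if S is empty."""
    E = set(S)
    M = defaultdict(int)  # M[x] = occurrences of x in S[L..R]
    missing = len(E)      # elements of E with M[x] == 0 (stands in for V)
    best = None
    L = 0
    for R, x in enumerate(S):
        # Extend the window to the right.
        if M[x] == 0:
            missing -= 1
        M[x] += 1
        # While the window covers E, record it and shrink from the left.
        while missing == 0:
            if best is None or R - L < best[1] - best[0]:
                best = (L, R)
            M[S[L]] -= 1
            if M[S[L]] == 0:
                missing += 1
            L += 1
    return best


S = [1, 8, 2, 1, 4, 1, 2, 9, 1, 8, 4]
print(shortest_cover_lr(S))
```

With the counter in place, L and R each advance at most N times, so the whole loop is O(N) apart from the hash-map operations.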

- B. Decoster

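How do you control the order in the map or set? - leppie

I use the two indexes to control the subsequence. The map is only there to indicate which elements are in the subsequence. - B. Decoster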

1

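I had considered this idea. But I am afraid it will not work for a large number of distinct values, say 50000 distinct numbers: the complexity would then be O(50000*50000). Thanks a lot for your reply, though. - russell

I have edited my answer to tell you that it is actually O(N) and largely independent of M. You should accept Grigor's solution, which is almost exactly the same as mine except that it includes a clever cnt trick that makes it fully O(N). - B. Decoster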

0

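I would say:

1. Build the set of elements D.
2. Keep an array of the same size as the sequence S.
3. Fill the array with indexes from S, each indicating the start of the latest sequence that ends at this index and contains all the elements of D.
4. Find the minimum length of the sequences in the array and save its start and end positions.

Obviously, step 3 is the tricky one. I would use a priority queue/heap that assigns a key to each element of D and holds the element as its value. In addition, you need a data structure that can access elements in the heap by their value (a map with pointers to the elements). The key should always be the last position in S at which the element occurred.

So you go through S, and for each element read you do one setKey, O(log n), then look at the current minimum, O(1), and write it into the array.

That should be O(n*log n). I hope I did not miss anything. It just came to my mind, so take it with a grain of salt, or let the community point out any mistakes I may have made.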

- b.buchhold

0

The solution above is correct; here is a Java version of the code above.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MinSequence {

    public static void main(String[] args)
    {
        final int n; // the size of array
        // read n and the array
        final List<Integer> arr=new ArrayList<Integer>(4);
        Map<Integer, Integer> cur = new TreeMap<Integer, Integer>();
        arr.add(1);
        arr.add(2);
        arr.add(1);
        arr.add(3);
        int distinctcount=0;
        for (final Integer integer : arr)
        {
            if(cur.get(integer)==null)
            {
                cur.put(integer, 1);
                ++distinctcount;
            }else
            {
                cur.put(integer,cur.get(integer)+1);
            }
        }

        // now distinctcount is the number of distinct elements
        cur=new TreeMap<Integer,Integer>();
        //   memset( cur, 0, sizeof( cur )); // we need this array anew
        int begin = 0, end = -1; // to make it 0 after first increment
        int best = -1; // best answer currently found
        int ansbegin = 0, ansend = 0; // interval of the best answer currently found
        int cnt = 0; // distinct elements in current subsequence
        final int inpsize = arr.size();
        while(true)
        {
            if( cnt < distinctcount )
            {
                ++end;
                if (end == inpsize) {
                    break;
                }
                if( cur.get(arr.get(end)) == null ) {
                    ++cnt;
                    cur.put(arr.get(end), 1);
                } // this elements wasn't present in current subsequence;
                else
                {
                    cur.put(arr.get(end),cur.get(arr.get(end))+1);
                }
                continue;
            }
            // if we're here it means that [begin, end] interval contains all distinct elements
            // try to shrink it from behind
            while (cur.get(arr.get(begin)) != null && cur.get(arr.get(begin)) > 1) // we have another such element later in the subsequence
            {
                cur.put(arr.get(begin),cur.get(arr.get(begin))-1);
                ++begin;
            }
            // now, compare [begin, end] with the best answer found yet
            if( best == -1 || end - begin < best )
            {
                best = end - begin;
                ansbegin = begin;
                ansend = end;
            }
            // now drop arr[begin] from the window so that cnt < distinctcount, then resume advancing end;
            // remove the key entirely, because the null checks above would treat a stored 0 as "still present"
            cur.remove(arr.get(begin));
            ++begin;
            --cnt;
        }

        // output the [ansbegin, ansend] interval as it's the answer to the problem
        System.out.println(ansbegin+"--->"+ansend);
        for( int i = ansbegin; i <= ansend; ++i ) {
            System.out.println(arr.get(i));
        }
    }
}

- Nagappa L M

- Grigor Gevorgyan · Accepted Answer

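Algorithm:

First, determine the number of distinct elements in the array; this is easily done in linear time. Let there be k distinct elements.

Allocate an array cur of size 10^5 + 1 (one slot per possible value), where each entry shows how many copies of that value are used in the current subsequence (see below).

Keep a cnt variable showing how many distinct elements the currently considered sequence contains. Now take two indexes, begin and end, and iterate them through the array as follows:

Initialize cnt and begin to 0 and end to -1 (so that it becomes 0 after the first increment). Then, for as long as possible, do the following:
If cnt != k:

2.1. Increment end. If end has reached the end of the array, stop. If cur[array[end]] is zero, increment cnt. Increment cur[array[end]].

Else:

2.2 {

Try to advance the begin iterator: while cur[array[begin]] > 1, decrement it and increment begin (cur[array[begin]] > 1 means that the current subsequence contains another such element). Then compare the [begin, end] interval with the current answer and store it if it is better. Finally, decrement cur[array[begin]], increment begin and decrement cnt, so that end starts advancing again.

}

Once you cannot proceed any further, you have the answer. The complexity is O(n): you just pass two iterators through the array.

C++ implementation: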

#include <cstring>   // memset
#include <iostream>

using namespace std;

const int MAXSIZE = 100001; // n <= 10^5 and values <= 10^5 are used as indexes into cur

int arr[ MAXSIZE ];
int cur[ MAXSIZE ];

int main ()
{
   int n; // the size of array
   // read n and the array

   cin >> n;
   for( int i = 0; i < n; ++i )
      cin >> arr[ i ];

   int k = 0;
   for( int i = 0; i < n; ++i )
   {
      if( cur[ arr[ i ] ] == 0 )
         ++k;
      ++cur[ arr[ i ] ];
   }

   // now k is the number of distinct elements

   memset( cur, 0, sizeof( cur )); // we need this array anew
   int begin = 0, end = -1; // to make it 0 after first increment
   int best = -1; // best answer currently found
   int ansbegin = 0, ansend = -1; // interval of the best answer currently found (empty until one is found)
   int cnt = 0; // distinct elements in current subsequence

   while(1)
   {
      if( cnt < k )
      {
         ++end;
         if( end == n )
            break;
         if( cur[ arr[ end ]] == 0 )
            ++cnt; // this elements wasn't present in current subsequence;
         ++cur[ arr[ end ]];
         continue;
      }
      // if we're here it means that [begin, end] interval contains all distinct elements
      // try to shrink it from behind
      while( cur[ arr[ begin ]] > 1 ) // we have another such element later in the subsequence
      {
         --cur[ arr[ begin ]];
         ++begin;
      }
      // now, compare [begin, end] with the best answer found yet
      if( best == -1 || end - begin < best )
      {
         best = end - begin;
         ansbegin = begin;
         ansend = end;
      }
      // now increment the begin iterator to make cur < k and begin increasing the end iterator again
      --cur[ arr[ begin]];
      ++begin;
      --cnt;
   }

   // output the [ansbegin, ansend] interval as it's the answer to the problem

   cout << ansbegin << ' ' << ansend << endl;
   for( int i = ansbegin; i <= ansend; ++i )
      cout << arr[ i ] << ' ';
   cout << endl;

   return 0;
}