给定一个序列,例如S={1,8,2,1,4,1,2,9,1,8,4},我需要找到包含所有S元素(无重复项,顺序不重要)的最小长度子序列。如何以有效的方式找到这个子序列?
注意:在S中有5个不同的元素:{1,2,4,8,9}。最小长度的子序列必须包含所有这5个元素。
给定一个序列,例如S={1,8,2,1,4,1,2,9,1,8,4},我需要找到包含所有S元素(无重复项,顺序不重要)的最小长度子序列。如何以有效的方式找到这个子序列?
注意:在S中有5个不同的元素:{1,2,4,8,9}。最小长度的子序列必须包含所有这5个元素。
算法:
首先,确定数组中不同元素的数量 - 这可以在线性时间内轻松完成。假设有 k
个不同的元素。
分配一个大小为 10^5 的数组 cur
,每个元素显示当前子序列中使用了多少个每个元素(稍后详见)。
保持一个 cnt
变量,显示当前考虑的序列中有多少个不同的元素。现在,取两个索引,begin
和 end
,并按以下方式迭代它们:
cnt
和 begin
初始化为 0
,将 end
初始化为 -1
(在第一次递增后得到 0
)。然后,只要可能,就执行以下操作:如果 cnt != k
:
2.1. 递增 end
。如果 end
已经是数组的末尾,则退出。如果 cur [array [end]]
为零,则递增 cnt
。递增 cur [array [end]]
。
否则:
2.2 {
尝试递增 begin
迭代器:当 cur [array [begin]] > 1
时,递减它,并递增 begin
(cur [array [begin]] > 1
意味着我们在当前子序列中有另一个这样的元素)。最后,将 [begin, end]
区间与当前答案进行比较,并将其存储在其中,如果它更好,则存储。
}
在无法进一步处理之后,您将得到答案。复杂度为 O(n)
- 只需通过数组传递两个迭代器即可。
C++实现:
#include <iostream>
using namespace std;
const int MAXSIZE = 10000;
int arr[ MAXSIZE ];
int cur[ MAXSIZE ];
int main ()
{
int n; // the size of array
// read n and the array
cin >> n;
for( int i = 0; i < n; ++i )
cin >> arr[ i ];
int k = 0;
for( int i = 0; i < n; ++i )
{
if( cur[ arr[ i ] ] == 0 )
++k;
++cur[ arr[ i ] ];
}
// now k is the number of distinct elements
memset( cur, 0, sizeof( cur )); // we need this array anew
int begin = 0, end = -1; // to make it 0 after first increment
int best = -1; // best answer currently found
int ansbegin, ansend; // interval of the best answer currently found
int cnt = 0; // distinct elements in current subsequence
while(1)
{
if( cnt < k )
{
++end;
if( end == n )
break;
if( cur[ arr[ end ]] == 0 )
++cnt; // this elements wasn't present in current subsequence;
++cur[ arr[ end ]];
continue;
}
// if we're here it means that [begin, end] interval contains all distinct elements
// try to shrink it from behind
while( cur[ arr[ begin ]] > 1 ) // we have another such element later in the subsequence
{
--cur[ arr[ begin ]];
++begin;
}
// now, compare [begin, end] with the best answer found yet
if( best == -1 || end - begin < best )
{
best = end - begin;
ansbegin = begin;
ansend = end;
}
// now increment the begin iterator to make cur < k and begin increasing the end iterator again
--cur[ arr[ begin]];
++begin;
--cnt;
}
// output the [ansbegin, ansend] interval as it's the answer to the problem
cout << ansbegin << ' ' << ansend << endl;
for( int i = ansbegin; i <= ansend; ++i )
cout << arr[ i ] << ' ';
cout << endl;
return 0;
}
import collections
S = [1,8,2,1,4,1,2,9,1,8,4,2,4]
# initialization: stage 1
alphabet = set(S) # the unique values ("symbols") in S
count = collections.defaultdict(int) # how many times each symbol appears in the sequence
# initialization: stage 2
start = 0
for end in xrange(len(S)):
count[S[end]] += 1
if len(count) == len(alphabet): # seen all the symbols yet?
break
end += 1
best_start = start
best_end = end
# the induction
while end < len(S):
count[S[end]] += 1
while count[S[start]] > 1:
count[S[start]] -= 1
start += 1
end += 1
if end - start < best_end - best_start: # new shortest sequence?
best_start = start
best_end = end
print S[best_start:best_end]
注意:
O(n)
。如果您关心的是最坏情况,将它们替换为基于树的结构将提供我上面承诺的总体复杂度为 O(n logn)
。S
的第一次扫描,使算法适用于流数据。S
的元素是小的非负整数,如您的评论所示,则可以将 count
扁平化为整数数组,将总体复杂度降至 O(n)
。bestLength < numUnique
,则不存在包含所有唯一元素的子序列。该算法假设元素是正数,并且最大元素小于序列长度。bool findLongestSequence() {
// Data (adapt as needed)
const int N = 13;
char flags[N];
int a[] = {1,8,2,1,4,1,2,9,1,8,1,4,1};
// Number of unique elements
int numUnique = 0;
for (int n = 0; n < N; ++n) flags[n] = 0; // clear flags
for (int n = 0; n < N; ++n) {
if (a[n] < 0 || a[n] >= N) return false; // assumptions violated
if (flags[a[n]] == 0) {
++numUnique;
flags[a[n]] = 1;
}
}
// Find the longest sequence ("best")
for (int n = 0; n < N; ++n) flags[n] = 0; // clear flags
int bestBegin = 0, bestLength = 0;
int begin = 0, end = 0, currLength = 0;
for (; begin < N; ++begin) {
while (end < N) {
if (flags[a[end]] == 0) {
++currLength;
flags[a[end]] = 1;
++end;
}
else {
break; // end-loop
}
}
if (currLength > bestLength) {
bestLength = currLength;
bestBegin = begin;
}
if (bestLength >= numUnique) {
break; // begin-loop
}
flags[a[begin]] = 0; // reset
--currLength;
}
cout << "numUnique = " << numUnique << endl;
cout << "bestBegin = " << bestBegin << endl;
cout << "bestLength = " << bestLength << endl;
return true; // longest subseqence found
}
如果您需要针对相同序列和不同集合经常执行此操作,则可以使用倒排列表。您为序列准备倒排列表,然后收集所有偏移量。然后从倒排列表的结果中扫描m个连续数字的序列。
如果序列长度为n
,查询大小为m
,则准备工作将在O(n)
内完成。如果我没有计算错误的话,查询的响应时间将在O(m^2)
内。
如果您需要更多详细信息,请查看Clausen / Kurth于2004年发表的关于代数数据库的论文(“通过群论方法进行基于内容的信息检索”)。这概述了一个通用的数据库框架,可适用于您的任务。
我有一个O(N*M)的算法,其中N是S的长度,M是元素的数量(它在M的小值时更有效,即:如果有很少的重复,它可能是具有二次成本的不良算法)编辑:实际上,在实践中它更接近于O(N)。你只有在最坏情况下才会得到O(N*M)
首先遍历序列并记录S的所有元素。我们称之为集合E。
我们将使用S的动态子序列。创建一个空的映射M,其中M将每个元素与其在子序列中出现的次数相关联。
例如,如果subSequence = {1,8,2,1,4}
,E = {1, 2, 4, 8, 9}
M[9]==0
M[2]==M[4]==M[8]==1
M[1]==2
L=0
,R=0
和 M[S[0]]++
算法如下:While(M does not contain all the elements of E)
{
if(R is the end of S)
break
R++
M[S[R]]++
}
While(M contains all the elements of E)
{
if(the subsequence S[L->R] is the shortest one seen so far)
Record it
M[S[L]]--
L++
}
要检查M是否包含E的所有元素,您可以使用布尔向量V。V[i]==true
如果M[E[i]]>0
,V[i]==false
如果M[E[i]]==0
。因此,您首先将V的所有值设置为false
,每次执行M[S[R]]++
时,您可以将该元素的V设置为true
,每次执行M[S[L]]--
并且M[S[L]]==0
时,将该元素的V设置为false
。
cnt
技巧,使其完全成为O(N)。 - B. Decoster上述解决方案是正确的,这是上述代码的Java版本
public class MinSequence {
public static void main(String[] args)
{
final int n; // the size of array
// read n and the array
final List<Integer> arr=new ArrayList<Integer>(4);
Map<Integer, Integer> cur = new TreeMap<Integer, Integer>();
arr.add(1);
arr.add(2);
arr.add(1);
arr.add(3);
int distinctcount=0;
for (final Integer integer : arr)
{
if(cur.get(integer)==null)
{
cur.put(integer, 1);
++distinctcount;
}else
{
cur.put(integer,cur.get(integer)+1);
}
}
// now k is the number of distinct elements
cur=new TreeMap<Integer,Integer>();
// memset( cur, 0, sizeof( cur )); // we need this array anew
int begin = 0, end = -1; // to make it 0 after first increment
int best = -1; // best answer currently found
int ansbegin = 0, ansend = 0; // interval of the best answer currently found
int cnt = 0; // distinct elements in current subsequence
final int inpsize = arr.size();
while(true)
{
if( cnt < distinctcount )
{
++end;
if (end == inpsize) {
break;
}
if( cur.get(arr.get(end)) == null ) {
++cnt;
cur.put(arr.get(end), 1);
} // this elements wasn't present in current subsequence;
else
{
cur.put(arr.get(end),cur.get(arr.get(end))+1);
}
continue;
}
// if we're here it means that [begin, end] interval contains all distinct elements
// try to shrink it from behind
while (cur.get(arr.get(begin)) != null && cur.get(arr.get(begin)) > 1) // we have another such element later in the subsequence
{
cur.put(arr.get(begin),cur.get(arr.get(begin))-1);
++begin;
}
// now, compare [begin, end] with the best answer found yet
if( best == -1 || end - begin < best )
{
best = end - begin;
ansbegin = begin;
ansend = end;
}
// now increment the begin iterator to make cur < k and begin increasing the end iterator again
if (cur.get(arr.get(begin)) != null) {
cur.put(arr.get(begin),cur.get(arr.get(begin))-1);
}
++begin;
--cnt;
}
// output the [ansbegin, ansend] interval as it's the answer to the problem
System.out.println(ansbegin+"--->"+ansend);
for( int i = ansbegin; i <= ansend; ++i ) {
System.out.println(arr.get(i));
}
}