回文检测效率

Question

回文检测效率

16

我对Jon Limjap的访谈失误感到好奇，并开始寻找高效的回文检测方法。我查看了回文高尔夫答案，我觉得答案中只有两个算法，一个是反转字符串，另一个是从头和尾部进行检查。

def palindrome_short(s):
    length = len(s)
    for i in xrange(0,length/2):
        if s[i] != s[(length-1)-i]: return False
    return True

def palindrome_reverse(s):
    return s == s[::-1]

我认为这两种方法都不适用于检测巨大DNA序列中的精确回文。我查找了一下，并没有发现任何免费文章介绍超高效的方法。

采用分治的并行化方法，将每个线程或处理器分配给一个字符数组1..n和length-1-n..length-1，可能是一个好的方法。

有更好的方法吗？

你知道吗？

- Vinko Vrsalovic

10个回答

3

这是你第二个函数的另一种变体。我们不需要检查正常字符串和反转字符串的右侧部分是否相等。

def palindrome_reverse(s):
  l = len(s) / 2
  return s[:l] == s[l::-1]

- drnk

2

除非进行模糊匹配，否则没有。这可能是在DNA中所做的（我已经用smith-waterman在DNA中进行了EST搜索，但这显然比在序列中匹配回文或反向互补要困难得多）。

- nlucaroni

2

显然，由于每个字符至少需要检查一次，因此您无法获得比O（n）渐近效率更高的效果。但是，您可以获得更好的乘法常数。

对于单个线程，您可以使用汇编语言来提高速度。您还可以通过一次检查大于一个字节的数据块来取得更好的效果，但由于对齐考虑，这可能有些棘手。如果您可以一次检查16个字节大小的块，则使用SIMD将获得更好的效果。

如果您想并行化它，可以将字符串分成N个部分，并使处理器i将段[i * n / 2，（i + 1）* N / 2）与段[L-（i + 1）* N / 2，L-i * N / 2）进行比较。

- Adam Rosenfield

不必比较16字节的块，一次做4个回文可能更快。这将节省数据交换并且可能不需要太多的水平操作。 - Jasper Bekkers

其他想法：尽可能将密钥存储在一个机器字中。将其与包含测试项的内存缓冲区的每个字节进行比较。在此之前不要使用字符串操作。不要使用超过8位字符的任何内容，因为限制因素将是内存访问。 - Loren Pechtel

1

它们都是O(N)级别的，所以我认为这些解决方案没有什么特别的效率问题。也许我不够有创意，但我看不出如何在少于N步骤中比较N个元素，因此在我看来，类似于O(log N)的算法肯定是不可能的。

并行可能会有所帮助，但它仍然不会改变算法的大O复杂度等级，因为它相当于在更快的机器上运行它。

- DrJokepu

0

除了其他人说的，我还要为真正大的输入添加一些预检条件：

quick check whether tail-character matches 
                    head character 

if NOT, just early exit by returning Boolean-False

if (input-length < 4) { 

    # The quick check just now already confirmed it's palindrome 

    return Boolean-True  

} else if (200 < input-length) {
     
     # adjust this parameter to your preferences
     #
     # e.g. make it 20 for longer than 8000 etc
     #      or make it scale to input size,
     #      either logarithmically, or a fixed ratio like 2.5%
     #
     reverse last ( N = 4 ) characters/bytes of the input  

     if that **DOES NOT** match first N chars/bytes {

         return boolean-false # early exit
                              # no point to reverse rest of it
                              # when head and tail don't even match
     } else {
         
         if N was substantial

             trim out the head and tail of the input
             you've confirmed; avoid duplicated work

             remember to also update the variable(s)
             you've elected to track the input size  

     }

     [optional 1 : if that substring of N characters you've 
                   just checked happened to all contain the
                   same character, let's call it C,
                    
                   then locate the index position, P, for the first               
                   character that isn't C
                   
                   if P == input-size 

                       then you've already proven
                       the entire string is a nonstop repeat 
                       of one single character, which, by def, 
                       must be a palindrome

                       then just return Boolean-True


                   but the P is more than the half-way point,
                       you've also proven it cannot possibly be a 
                       palindrome, because second half contains a              
                       component that doesn't exist in first half,
                       

                       then just return Boolean-False ]


     [optional 2 : for extremely long inputs, 
                   like over 200,000 chars,
                   take the N chars right at the middle of it,
                   and see if the reversed one matches
                 
                   if that fails, then do early exit and save time ]

 }

 if all pre-checks passed,
 then simply do it BAU style :

     reverse second-half of it, 
     and see if it's same as first half

- RARE Kpop Manifesto

0

从中心开始比较总是更有效率的，因为你可以在错过时提前退出，但它也允许你进行更快的最大回文搜索，无论你是寻找最大半径还是所有非重叠回文。

唯一真正的并行化是如果你有多个独立的字符串要处理。将其分成块会浪费很多工作，因为每次错过的情况总是比命中的情况多得多。

- ZXX

你能解释一下从中心开始比较如何让我们提前退出吗？为什么中心附近的不匹配比边缘附近的不匹配更有可能呢？ - Michał Wróbel

你怎么知道“失误总是比命中多得多”？OP是否提到了输入数据的特征？ - Michał Wróbel

-1

你可以使用哈希表来存储字符，并使用一个计数器变量，每当找到一个不在表/映射中的元素时，其值就会增加。如果检查并发现元素已经在表中，则减少计数。

For odd lettered string the counter should be back to 1 and for even it should hit 0.I hope this approach is right.

See below the snippet.
s->refers to string
eg: String s="abbcaddc";
Hashtable<Character,Integer> textMap= new Hashtable<Character,Integer>();
        char charA[]= s.toCharArray();
        for(int i=0;i<charA.length;i++)
        {

            if(!textMap.containsKey(charA[i]))
            {   
                textMap.put(charA[i], ++count);

            }
            else
                {
                textMap.put(charA[i],--count);


        }
        if(length%2 !=0)
        {
            if(count == 1)
            System.out.println("(odd case:PALINDROME)");
            else
                System.out.println("(odd case:not palindrome)");
        }
        else if(length%2==0)    
        {
            if(count ==0)
                System.out.println("(even case:palindrome)");
            else
                System.out.println("(even case :not palindrome)");
        }

- mahashwetha

此答案中提出的算法完全不正确。 - Michał Wróbel

-1

使用Python编程，短代码可以更快，因为它将负载放入VM的更快的内部（还有整个缓存和其他类似的东西）。

def ispalin(x):
   return all(x[a]==x[-a-1] for a in xrange(len(x)>>1))

- Demur Rumed

不错。Python风格。紧凑。不幸的是，与建议相反，这段代码似乎并不比Vinko的提案更快。 - Michał Wróbel

-1

public class Palindrome{
    private static boolean isPalindrome(String s){
        if(s == null)
            return false;   //unitialized String ? return false
        if(s.isEmpty())     //Empty Strings is a Palindrome 
            return true;
        //we want check characters on opposite sides of the string 
        //and stop in the middle <divide and conquer>
        int left = 0;  //left-most char    
        int right = s.length() - 1; //right-most char

        while(left < right){  //this elegantly handles odd characters string 
            if(s.charAt(left) != s.charAt(right)) //left char must equal 
                return false; //right else its not a palindrome
            left++; //converge left to right 
            right--;//converge right to left 
        }
        return true; // return true if the while doesn't exit 
    }
}

虽然我们进行了n/2次计算，但它仍然是O(n)。这也可以使用线程完成，但计算会变得混乱，最好避免这种情况。这不测试特殊字符并且区分大小写。我有处理此类问题的代码，但此代码可以轻松修改以处理此类问题。

- Tangang Atanga

尝试更清晰地解释这个答案为什么是回答这个问题的。 - Jeroen Heier

这个网站是为了回答人们的问题。代码可以帮助回答一些问题，但更重要的是描述代码背后的思想。 - Sneftel

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Claudiu · Accepted Answer

如果只给出一个回文字符串，你需要在O（N）的时间内完成它。是的，你可以通过拆分字符串来利用多处理器来提高效率。

现在假设你想进行精确的DNA匹配。这些字符串有成千上万个字符，并且非常重复。这给了我们优化的机会。

假设你将一个1000个字符长的字符串分成5对100,100。代码将如下所示：

isPal(w[0:100],w[-100:]) and isPal(w[101:200], w[-200:-100]) ...

等等……第一次进行这些匹配时，您必须进行处理。但是，您可以将所有已完成的结果添加到哈希表中，将成对映射到布尔值：

isPal = {("ATTAGC", "CGATTA"): True, ("ATTGCA", "CAGTAA"): False}

等等……这将占用太多内存。对于100,100的一对，哈希表将有2*4^100个元素。假设您只将两个32位字符串哈希值作为键存储，您将需要大约10^55兆字节的空间，这是荒谬的。

也许如果您使用较小的字符串，问题就可以解决。然后，您将拥有一个巨大的哈希表，但至少对于10x10对的回文串，它将需要O(1)的时间，因此检查一个1000个字符的字符串是否为回文将需要100次查找而不是500次比较。尽管如此，它仍然是O(N)的……