Smallest substring that can be replaced to make the string have the same number of each character

7
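I'm trying to solve a problem that is almost exactly the one in the title. Specifically, I have a string s such that s.Length % 4 == 0 and every s[i] is one of 'A', 'C', 'T', 'G'. I want to find the smallest substring that I can replace so that each of 'A', 'C', 'T', 'G' appears exactly s.Length / 4 times.
For example, with s="GAAATAAA", one optimal solution is to replace the substring "AAATA" with "TTCCG", resulting in "GTTCCGAA".
I've described my approach in the comments below, and I'm wondering whether it is generally correct, i.e. whether it will get me to the right answer.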
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
class Solution
{
    static string ReplacementForSteadiness(string s)
    {   
        var counter = new Dictionary<char,int>() {
            { 'A', 0 }, { 'C', 0 }, { 'G', 0 }, { 'T', 0 }
        };
        for(int i = 0; i < s.Length; ++i)
                counter[s[i]] += 1;

        int div = s.Length / 4;

        var pairs = counter.ToList();
        if(pairs.All(p => p.Value == div))
            return "";

        // If here, that means there is an uneven count of characters in s. For example, if
        // s = "AAATGTTCTTGCGGGG", then counter = { A -> 3, T -> 5, C -> 2, G -> 6 },
        // div = 4, and we know that we need to increase the number of As by 1, decrease 
        // the number of Ts by 1, increase the number of Cs by 2 and decrease the number
        // of Gs by 2.

        // The smallest strings to replace will have 1 T and 2 Gs, to be replaced with 1 A and
        // 2 Cs (The order of characters in the replacement string doesn't matter).
        // "TGG" --> "ACC" 
        // "GTG" --> "ACC"
        // "GGT" --> "ACC"

        // None of those strings exist in s. The next smallest strings that could be replaced
        // would have 1 T and 3Gs, to be replaced with 1 A and 2 of the Gs to be replaced with
        // Cs. Or, 2 Ts and 2Gs, 1 of the Ts to be replaced by an A and both the Gs to be replaced
        // by Cs.
        // "TGGG" --> "AGCC"
        // "GTGG" --> "AGCC"
        // "GGTG" --> "AGCC"
        // "GGGT" --> "AGCC"
        // "TTGG" --> "ATCC"
        // "TGTG" --> "ATCC"
        // "GTGT" --> "ATCC"
        // "GGTT" --> "ATCC"

        // None of those strings exist in s. Etc.      

        // Placeholder so the method compiles; the search described above is not implemented yet.
        string r = "";

        // ... 

        return r;
    }

    static void Main(String[] args)
    {
       Console.ReadLine(); // n
       string str = Console.ReadLine();
       string replacement = ReplacementForSteadiness(str);
       Console.WriteLine(replacement.Length);
    }
}
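
The counting step described in the comments above can be sketched in Python as follows. This is only an illustration, assuming the replacement must have the same length as the substring it replaces; `surplus` and `covers_surplus` are hypothetical helper names, not part of the original post. One point worth noting: a candidate substring only has to contain at least the surplus of each over-represented letter. It may also contain other letters, which are simply written back by the replacement.

```python
from collections import Counter

def surplus(s):
    """Map each over-represented letter to how many copies must be replaced."""
    target = len(s) // 4
    counts = Counter(s)
    return {ch: counts[ch] - target for ch in "ACGT" if counts[ch] > target}

def covers_surplus(window, need):
    """True if `window` contains at least the required surplus of every letter."""
    counts = Counter(window)
    return all(counts[ch] >= k for ch, k in need.items())

need = surplus("AAATGTTCTTGCGGGG")
print(need)                          # {'G': 2, 'T': 1}
print(covers_surplus("TGCG", need))  # True
print(covers_surplus("GGG", need))   # False
```

In this example "TGCG" is a length-4 substring of s that qualifies even though it contains a C; replacing it with, say, "ACCC" gives every letter a count of 4.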

Does this answer your question? Bear and Steady Gene - improved solution - Anatolii
4 Answers

0

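If the string already has a balanced set of characters, you're done and there is nothing to do.

Otherwise, you can always solve the problem by replacing zero characters, which is the minimum. You do this by adding whichever characters are missing. For example, take your test case:

GAAATAAA

The most frequent character is A, with 6 occurrences. You need 5 extra Gs, 5 extra Ts and 6 extra Cs. So replace one A with the characters you need (including the A itself):

GAAATAA[AGGGGGTTTTTCCCCCC]

Since the original A is replaced with an A, you have effectively replaced zero characters, which is the smallest possible number.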


1
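Although the OP doesn't say so explicitly, I think (based on the example, and on the fact that, as you've shown, the problem is otherwise trivial to solve) that the replacement string must have the same length as the substring it replaces. - j_random_hacker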

0

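I think your solution would work, but its complexity is too high.
Here is an alternative.
If counting the characters in your string returns {'A',4},{'C',6},{'G',6},{'T',4}, then the substring must start with a C or G, end with a C or G, and have length >= 2.
So what we have to do is take each substring that meets those conditions and test whether it contains the "bad characters" (in our case, one C and one G). If its length is 2, we win; otherwise we save it in a temporary variable and keep testing.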

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
class Solution
{
    static void Main(String[] args)
    {
        string[] inputs = { "GAAATAAA", "CACCGCTACCGC", "CAGCTAGC", "AAAAAAAA", "GAAAAAAA", "GATGAATAACCA", "ACGT" };

        List<string> replacement = new List<string>();
        foreach (var item in inputs)
        {
            replacement.Add(StringThatHasToBeReplaced(item));
        }
    }

    static string StringThatHasToBeReplaced(string s)
    {
        var counter = new Dictionary<char, int>() {
            { 'A', 0 }, { 'C', 0 }, { 'G', 0 }, { 'T', 0 }
        };
        for (int i = 0; i < s.Length; ++i)
            counter[s[i]] += 1;

        int div = s.Length / 4;
        var pairs = counter.ToList();

        if (pairs.Where(p => p.Value != div).Count() == 0)
        {
            return null;
        }

        List<char> surplusCharacter = pairs.Where(p => p.Value > div).Select(p => p.Key).ToList();
        int minLength = pairs.Where(p => p.Value > div).Sum(p => p.Value - div);
        string result = s;
        for (int i = 0; i < s.Length - minLength + 1; i++) // i is the start index
        {
            if (surplusCharacter.Contains(s[i]))
            {
                if (minLength == 1)
                    return s[i].ToString();

                for (int j = i + minLength - 1; j < s.Length; j++) // j is the end index
                {
                    if (surplusCharacter.Contains(s[j]))
                    {
                        // j is an inclusive end index, so the window length is j - i + 1
                        var substring = s.Substring(i, j - i + 1);
                        if (substring.Length >= result.Length)
                        {
                            break;
                        }
                        // test whether this substring can be the one that needs to be replaced
                        var isValid = true;
                        foreach (var c in surplusCharacter)
                        {
                            if (substring.Count(f => f == c) < counter[c] - div)
                            {
                                isValid = false;
                                break;
                            }
                        }
                        if (isValid)
                            result = substring;
                    }
                }
            }
        }
        return result;
    }


}

I made some changes for the edge cases. Here are some test samples; the results I get look fine (they were attached as a screenshot).


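In case you're curious, that solution didn't pass the tests. - user6048670
@user6048670, can you give an example where it fails? Maybe the solution can be improved. - AnotherGeek
Try running it through https://www.hackerrank.com/challenges/bear-and-steady-gene, which is the problem I'm ultimately trying to solve. - user6048670
@AnotherGeek I think your second case can be done with a substring of length 6 rather than 7. You can replace CACCGC with AGAGTT (min length: 6) to get AGAGTTTACCGC. - gowrath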

0
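public int balancedString(String s) {
    // count[c] = occurrences of c outside the current window [i, j]
    int[] count = new int[128];
    int n = s.length(), res = n, i = 0, k = n / 4;
    for (int j = 0; j < n; ++j) {
        ++count[s.charAt(j)];
    }
    for (int j = 0; j < n; ++j) {
        --count[s.charAt(j)]; // s[j] enters the window
        // While no letter outside the window exceeds k, the window can be
        // rewritten to balance the string: record its length, then shrink it.
        while (i < n && count['A'] <= k && count['C'] <= k && count['T'] <= k && count['G'] <= k) {
            res = Math.min(res, j - i + 1);
            ++count[s.charAt(i++)]; // s[i] leaves the window
        }
    }
    return res;
}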
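
The Java snippet above comes without explanation, so here is a commented Python sketch of the same sliding-window idea. It tracks the letter counts outside a window; whenever no letter outside the window exceeds n / 4, the window itself can be rewritten to balance the string, so it is shrunk from the left. The name `min_replace_length` and the early return for an already balanced string are additions made here for illustration and are not taken from the answer.

```python
def min_replace_length(s):
    """Length of the shortest substring whose replacement can balance s."""
    n = len(s)
    target = n // 4
    # Counts of each letter OUTSIDE the current window [left, right].
    outside = {ch: 0 for ch in "ACGT"}
    for ch in s:
        outside[ch] += 1
    if all(v == target for v in outside.values()):
        return 0  # already balanced, nothing to replace
    best = n
    left = 0
    for right, ch in enumerate(s):
        outside[ch] -= 1  # s[right] enters the window
        # Shrink from the left while the letters outside still fit the target.
        while left <= right and all(v <= target for v in outside.values()):
            best = min(best, right - left + 1)
            outside[s[left]] += 1  # s[left] leaves the window
            left += 1
    return best

for s in ["GAAATAAA", "CACCGCTACCGC", "CAGCTAGC", "AAAAAAAA",
          "GAAAAAAA", "GATGAATAACCA", "ACGT"]:
    print(s, min_replace_length(s))
```

Each index enters and leaves the window at most once, so the sketch runs in time linear in the length of s.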

Adding comments to your code would add a lot of value. - Simas Joneliunas

-1

Any thoughts? Sorry for the messy code and for the solution being in Python. I originally wrote this on my phone and was feeling a bit lazy.

import re
from itertools import permutations

def find_min(s):
    freq = {ch:0 for ch in 'ATGC'}
    for ch in s:
        freq[ch] += 1
    desired_len = len(s) // 4
    fixes = {ch:desired_len-freq[ch] for ch in 'ATGC'}
    replacement = ''
    for ch in fixes:
        adj = fixes[ch]
        if adj < 0:
            replacement += ch*(-1*adj)
    perms = set(permutations(replacement))
    m = len(s)
    to_replace = ''
    for rep in perms:
        regex = '.*?'.join([ch for ch in rep])
        finds = re.findall(regex,s)
        if finds:
            x = sorted(finds, key=lambda x:len(x))[0]
            if m >= len(x):
                m = len(x)
                to_replace = x

    print_replacement(s, to_replace, fixes)

def print_replacement(inp, to_replace, fixes):
    replacement = ''
    for ch in to_replace:
        if fixes[ch] > 0:
            replacement += ch
    for ch in fixes:
        if fixes[ch] > 0:
            replacement += ch*fixes[ch]
    print('{0}\t\t- Replace {1} with {2} (min length: {3})'.format(inp ,to_replace, replacement, len(replacement)))


def main():
    inputs =  ["GAAATAAA", "CACCGCTACCGC", "CAGCTAGC", "AAAAAAAA", "GAAAAAAA", "GATGAATAACCA", "ACGT"]
    for inp in inputs:
        find_min(inp)

if __name__ == '__main__':
    main()

Thanks to @AnotherGeek for the test inputs! Here is the output.

GAAATAAA        - Replace AAATA with TCCGT (min length: 5)
CACCGCTACCGC    - Replace CACCGC with AGAGTT (min length: 6)
CAGCTAGC        - Replace C with T (min length: 1)
AAAAAAAA        - Replace AAAAAA with CCGGTT (min length: 6)
GAAAAAAA        - Replace AAAAA with CCGTT (min length: 5)
GATGAATAACCA    - Replace ATGAA with TGCGT (min length: 5)
ACGT            - Replace  with  (min length: 0)
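
A quick way to sanity-check output like the above is to apply each replacement and count the letters. The sketch below does that; `is_balanced` and `apply_replacement` are hypothetical helpers written for this check, and it assumes the reported substring refers to its first occurrence in the input.

```python
from collections import Counter

def is_balanced(s):
    """True if A, C, G and T each occur exactly len(s) // 4 times."""
    counts = Counter(s)
    return all(counts[ch] == len(s) // 4 for ch in "ACGT")

def apply_replacement(s, old, new):
    """Replace the first occurrence of `old` in `s` with the equally long `new`."""
    assert len(old) == len(new)
    return s.replace(old, new, 1)

# (input, substring to replace, replacement) as listed in the output above
reported = [
    ("GAAATAAA", "AAATA", "TCCGT"),
    ("CACCGCTACCGC", "CACCGC", "AGAGTT"),
    ("CAGCTAGC", "C", "T"),
    ("AAAAAAAA", "AAAAAA", "CCGGTT"),
    ("GAAAAAAA", "AAAAA", "CCGTT"),
    ("GATGAATAACCA", "ATGAA", "TGCGT"),
    ("ACGT", "", ""),
]
for s, old, new in reported:
    fixed = apply_replacement(s, old, new)
    print(s, "->", fixed, is_balanced(fixed))
```

All seven listed replacements pass this check, but the check says nothing about minimality. For GATGAATAACCA, replacing the length-4 substring AATA with TGTC also yields a balanced string (GATGTGTCACCA), so 5 is not the minimum for that input.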

I realize this is quite inefficient. Any suggestions for improvement?

