How do I find substrings within a string in F#?

5
I found an interesting F# exercise online. The idea is to count the substrings of a given string that satisfy a condition.
Here is the prompt:
Description:
You are given a DNA sequence:
a string that contains only characters 'A', 'C', 'G', and 'T'.
Your task is to calculate the number of substrings of sequence,
in which each of the symbols appears the same number of times.

Example 1:
For sequence = "ACGTACGT", the output should be 6
All substrings of length 4 contain each symbol exactly once (+5),
and the whole sequence contains each symbol twice (+1).

Example 2:
For sequence = "AAACCGGTTT", the output should be 1
Only substring "AACCGGTT" satisfies the criterion above: it contains each symbol twice.


Input: String, a sequence that consists only of symbols 'A', 'C', 'G', and 'T'.
Length constraint: 0 < sequence.length < 100000.

Output: Integer, the number of substrings where each symbol appears equally many times.

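I'm not sure how to approach this, and specifically where to start. I searched the internet and only found the following code (I added the input variable and the var variable, and changed the show calls so that they search the input for the substrings; I hope that makes sense):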

open System

let countSubstring (where : string) (what : string) =
    match what with
    | "" -> 0
    | _ -> (where.Length - where.Replace(what, "").Length) / what.Length

[<EntryPoint>]
let main argv =
    let input = Console.ReadLine()
    let var = input.Length
    Console.WriteLine(var)
    let show where what =
        printfn @"countSubstring(""%s"", ""%s"") = %d" where what (countSubstring where what)
    show input "ACGT"
    show input "CGTA"
    show input "GTAC"
    show input "TACG"
    0

Anyway, if anyone could help me solve this, I would greatly appreciate it.
Thanks in advance.

1
You want to look at regular expressions - they can return multiple matches. - John Palmer
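As a hedged illustration of the regular-expression suggestion: a zero-width lookahead lets `Regex.Matches` report overlapping occurrences, which the `Replace`-based `countSubstring` from the question cannot do. The name `countOverlapping` is hypothetical and not from the thread, and this only counts occurrences of one fixed pattern; it does not solve the exercise by itself.

```fsharp
open System.Text.RegularExpressions

// Counts occurrences of `what` in `where`, including overlapping ones.
// The lookahead (?=...) matches at a position without consuming characters,
// so the regex engine tries every starting index.
let countOverlapping (where : string) (what : string) =
    if what = "" then 0
    else
        let pattern = "(?=" + Regex.Escape(what) + ")"
        Regex.Matches(where, pattern).Count
```

For example, `countOverlapping "AAA" "AA"` finds both occurrences, whereas the `Replace`-based version reports only one.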
3
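Even if you manage to produce code that looks right, how do you know that it actually works? If you feed it a string of 99987 characters, how would you verify that the returned number is correct? - Mark Seemann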
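One possible way to approach the verification question, sketched under assumptions: keep a slow but obviously correct reference implementation and compare any faster candidate against it on many short random inputs. Both `countNaive` and `agreesWithReference` are hypothetical names introduced here, and the fixed seed and maximum length are arbitrary choices. Agreement on small inputs raises confidence but does not prove correctness on large ones.

```fsharp
// Slow reference: inspect every non-empty substring directly.
let countNaive (sequence : string) =
    let isBalanced (s : string) =
        let count ch = s |> Seq.filter ((=) ch) |> Seq.length
        let a = count 'A'
        a = count 'C' && a = count 'G' && a = count 'T'
    seq {
        for i in 0 .. sequence.Length - 1 do
            for len in 1 .. sequence.Length - i do
                yield sequence.Substring(i, len) }
    |> Seq.filter isBalanced
    |> Seq.length

// Compares a candidate implementation with the reference on random
// short DNA strings (seed 42 and lengths below 30 are assumptions).
let agreesWithReference (candidate : string -> int) (trials : int) =
    let rng = System.Random(42)
    let symbols = "ACGT"
    Seq.init trials (fun _ ->
        let len = rng.Next(1, 30)
        System.String(Array.init len (fun _ -> symbols.[rng.Next(4)])))
    |> Seq.forall (fun s -> candidate s = countNaive s)
```

The two examples from the prompt give a first sanity check: `countNaive "ACGTACGT"` is 6 and `countNaive "AAACCGGTTT"` is 1.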
2 Answers

3

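Here is a solution that generates all substrings whose length is divisible by four and then counts how many of them contain an equal number of each symbol. Note that a substring whose length is not divisible by four cannot contain equal numbers of four different symbols.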

let hasEqualAmountOfSymbols (substring : string) =
    let symbolAppearances =
        ['A'; 'C'; 'G'; 'T']
        |> List.map (fun symbol ->
            substring
            |> Seq.filter ((=) symbol)
            |> Seq.length)
    symbolAppearances
    |> List.pairwise
    |> List.forall (fun (x, y) -> x = y)


let countSubstrings input =
    let potentialSubstrings =
        let lastIndex = String.length input - 1
        [ for i in 0 .. lastIndex do
            for j in i + 3 .. 4 .. lastIndex do
                yield input.Substring(i, j - i + 1) ]
    potentialSubstrings
    |> List.filter hasEqualAmountOfSymbols
    |> List.length


countSubstrings "ACGTACGT" // -> 6
countSubstrings "AAACCGGTTT" // -> 1

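Note that this brute-force solution does not scale well to the allowed problem size (strings of length 99999). For example, on my machine it takes about 20 seconds for a string of length 2000. - kvb
@kvb Yes, this is an O(n^2) brute-force solution. Using mutable data structures could easily make it ten times faster, but I suspect that coming up with an algorithm of better complexity could be quite tricky. - hvester
Actually, I think it is O(n^3), because hasEqualAmountOfSymbols has to be applied to each candidate. You can certainly achieve true O(n^2) behavior. - kvb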

3

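First declare a function numberACGT that takes a string and returns 1 if the number of 'A' characters equals the number of 'C', 'G', and 'T' characters, and 0 otherwise. To do this, declare an array N of 4 integers initialized to 0, then walk through the string, incrementing the corresponding counter. Finally, compare the array elements.

Then, for each substring whose length is a multiple of 4, call numberACGT and add the result to the counter count (initialized to 0 at the start).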

let numberACGT (aString:string) =
    let N = Array.create 4 (0:int)
    let last = aString.Length - 1 
    for i = 0 to last do
        match aString.[i] with
        | 'A' -> N.[0] <- N.[0] + 1
        | 'C' -> N.[1] <- N.[1] + 1
        | 'G' -> N.[2] <- N.[2] + 1
        | _ -> N.[3] <- N.[3] + 1
    if (N.[0] = N.[1]) && (N.[1] = N.[2]) && (N.[2] = N.[3]) then 1 else 0 

let numberSubStrings (aString:string) =
    let mutable count = 0
    let len = aString.Length 
    for k = 1 to len / 4 do //only multiple of 4
        for pos = 0 to len - 4*k do
            count <- count + numberACGT (aString.[pos..pos+4*k-1])
    count

I hope it is fast enough.

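[<EntryPoint>]
let main argv =
    let stopWatch = System.Diagnostics.Stopwatch.StartNew()
    let input = System.Console.ReadLine()
    printf "%i  " (numberSubStrings input)
    stopWatch.Stop()
    // Assumption: the original listing omitted the line that prints the
    // elapsed time; the unit (milliseconds) is a guess.
    printfn "%f" stopWatch.Elapsed.TotalMilliseconds
    // Wait for Enter so the console window stays open.
    System.Console.ReadLine() |> ignore
    0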

Result:

62    4.542700

A new version with O(n²) complexity:
let numberSubStringsBis (aString:string) =
    let mutable count = 0 
    let len = aString.Length 
    for pos = 0 to len - 1 do
        let mutable a = 0 
        let mutable  c = 0 
        let mutable g = 0 
        let mutable t = 0 
        let mutable k = pos 
        while k + 3 <= len - 1 do
            for i = k to k + 3 do
                match aString.[i] with
                | 'A' -> a <- a + 1
                | 'C' -> c <- c + 1
                | 'G' -> g <- g + 1
                | _ -> t <- t + 1
            k <- k + 4 
            if a=c && c=g && g=t then count <- count + 1               
    count

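First of all, I would like to thank you very much for your help. - Marc Karam
Do you know how I could add console input to the code below where you said "I hope it is fast enough"? I tried adding a "let input = console.readline" and then replacing the string with 'var' (I also added a "let var = input.ToString"), but I get an error saying "This expression was expected to have type string but here has type unit -> string". Do you know how I can fix this? - Marc Karam
Thank you. I have made the replacement in the answer. Console.ReadLine returns a string, so you can use the result directly. You must add () after Console.ReadLine, because it expects an argument of type unit. - Jean-Claude Colette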
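The difference described in the last comment can be sketched as follows. Without parentheses, `Console.ReadLine` is the method itself, a function value of type `unit -> string`; applying it to `()` performs the read and yields a `string`. The names `readLineFunction` and `readSequence` are hypothetical and used only for illustration.

```fsharp
open System

// No parentheses: a function value of type unit -> string. Nothing is read yet.
let readLineFunction : unit -> string = Console.ReadLine

// With (): the method is called and a string is returned.
let readSequence () : string =
    Console.ReadLine()
```

Passing `readLineFunction` where a `string` is expected produces the type error quoted in the comment above, while `readSequence ()` or `Console.ReadLine()` yields a string that can be passed on directly.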
