How to replace wildcard characters with random characters in R

4

I have the following sequence:

s0 <- "KDRH?THLA???RT?HLAK"

The wildcard is represented by ?. What I want to do is replace each wildcard with a character sampled from this vector:
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", 
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

Since s0 contains 5 wildcards (?), I sample 5 characters from AADict:

set.seed(1)
nof_wildcard <- 5
tolower(sample(AADict, nof_wildcard, TRUE))

This gives [1] "d" "q" "a" "r" "l"
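The count of 5 is hard-coded above. As a minimal sketch (not part of the original question), the number of wildcards could instead be derived from the string itself, so the same code would also work for s1 and s2. The variable name pos is only illustrative.

```r
s0 <- "KDRH?THLA???RT?HLAK"
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

# Locate every "?"; fixed = TRUE treats "?" as a literal character, not a regex quantifier
pos <- gregexpr("?", s0, fixed = TRUE)[[1]]

# gregexpr() returns -1 when nothing matches, so count only positive positions
nof_wildcard <- sum(pos > 0)
nof_wildcard
#> [1] 5

# Draw one replacement character per wildcard
set.seed(1)
replacements <- tolower(sample(AADict, nof_wildcard, replace = TRUE))
```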

So the expected result is:

     KDRH?THLA???RT?HLAK
     KDRHdTHLAqarRTlHLAK

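The sampled characters must be placed exactly at the positions of the ? characters, but their order does not matter. For example, this answer is also acceptable: KDRHqTHLAdlaRTrHLAK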

How can I do this in R?

Other examples include:

s1 <- "FKDHKHIDVKDRHRTHLAK????RTRHLAK"
s2 <- "FKHIDVKDRHRTRHLAK??????????"
3 Answers

4

One approach is to use a loop that replaces the "?" characters one at a time, e.g.

s0 <- "KDRH?THLA???RT?HLAK"
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", 
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
s0
#> [1] "KDRH?THLA???RT?HLAK"
repeat {
  # sub() replaces only the first remaining "?" on each pass
  s0 <- sub("\\?", sample(tolower(AADict), 1), s0)
  if (!grepl("\\?", s0)) break
}
s0
#> [1] "KDRHtTHLAidwRTyHLAK"

s1 <- "FKDHKHIDVKDRHRTHLAK????RTRHLAK"
repeat {
  s1 <- sub("\\?", sample(tolower(AADict), 1), s1)
  if (!grepl("\\?", s1)) break
}
s1
#> [1] "FKDHKHIDVKDRHRTHLAKrstaRTRHLAK"

s2 <- "FKHIDVKDRHRTRHLAK??????????"
repeat {
  s2 <- sub("\\?", sample(tolower(AADict), 1), s2)
  if (!grepl("\\?", s2)) break
}
s2
#> [1] "FKHIDVKDRHRTRHLAKdvcfmheiqn"
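Because the same loop is repeated for each string, it could be wrapped in a small helper. The following is only a sketch: the function name fill_one_by_one is hypothetical, and it assumes a single string as input and a dictionary that does not itself contain "?" (otherwise the loop might not terminate).

```r
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

# Hypothetical helper: fills the wildcards of ONE string, one at a time
fill_one_by_one <- function(x, dict) {
  # sub() replaces only the first match, so every pass makes a fresh random draw
  while (grepl("?", x, fixed = TRUE)) {
    x <- sub("?", sample(dict, 1), x, fixed = TRUE)
  }
  x
}

s1 <- "FKDHKHIDVKDRHRTHLAK????RTRHLAK"
filled <- fill_one_by_one(s1, tolower(AADict))
filled
```

The loop relies on sub() rather than gsub(): a single gsub() call would put the same sampled character into every wildcard position.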

Another approach, which also allows sampling without replacement:

s0 <- "KDRH?THLA???RT?HLAK"
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", 
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
matches <- gregexpr("\\?", s0)
regmatches(s0, matches) <- lapply(lengths(matches), sample, x = tolower(AADict), replace = FALSE)
s0
#> [1] "KDRHdTHLAlanRTiHLAK"

Created on 2022-10-22 by the reprex package (v2.0.1)
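One consequence of sampling without replacement is that a string cannot have more wildcards than the dictionary has entries (20 here), and the inserted characters are all distinct. A rough sketch of how this could be checked on s2, reusing the regmatches() idea from this answer; the names n, filled and inserted are illustrative only.

```r
s2 <- "FKHIDVKDRHRTRHLAK??????????"
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

matches <- gregexpr("\\?", s2)
n <- lengths(matches)  # number of wildcards in s2

# sample(..., replace = FALSE) fails if more values are requested than exist
stopifnot(n <= length(AADict))

filled <- s2
regmatches(filled, matches) <- list(sample(tolower(AADict), n, replace = FALSE))

# Pull out the characters that ended up at the former wildcard positions
inserted <- substring(filled, matches[[1]], matches[[1]])

# anyDuplicated() returns 0 when every element is unique
anyDuplicated(inserted) == 0
```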


3
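You can split the string into individual characters, which makes it easy to replace the wildcards without having to use a loop (which was my first approach):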
replace_wc <- function(x, dict) {
  x <- strsplit(x, split = "")[[1]]
  ix <- grepl("\\?", x)
  x[ix] <- sample(dict, sum(ix), replace = TRUE)

  return(paste0(x, collapse = ""))
}

s0 <- "KDRH?THLA???RT?HLAK"
AADict <- c(
  "A", "R", "N", "D", "C", "E", "Q", "G", "H",
  "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
)

set.seed(1)

replace_wc(s0, tolower(AADict))
#> [1] "KDRHdTHLAqarRTlHLAK"
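As written, replace_wc() handles one string per call because it takes only the first element of the strsplit() result. A possible way to apply it to several sequences is vapply(); this is a sketch under that assumption, with the function restated (using an equivalent x == "?" test) so the example is self-contained.

```r
replace_wc <- function(x, dict) {
  x <- strsplit(x, split = "")[[1]]
  ix <- x == "?"  # TRUE at wildcard positions
  x[ix] <- sample(dict, sum(ix), replace = TRUE)
  paste0(x, collapse = "")
}

AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

seqs <- c("KDRH?THLA???RT?HLAK",
          "FKDHKHIDVKDRHRTHLAK????RTRHLAK",
          "FKHIDVKDRHRTRHLAK??????????")

# One call per string; character(1) makes vapply() check each result is a single string
out <- vapply(seqs, replace_wc, character(1),
              dict = tolower(AADict), USE.NAMES = FALSE)
out
```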

3
Here is a vectorized function that replaces the question mark characters ("?") in a vector of strings.
fun <- function(x, dict = AADict) {
  dict <- tolower(dict)
  inx <- gregexpr("\\?", x)
  sapply(seq_along(x), \(j) {
    for(i in inx[[j]]) {
      substr(x[j], i, i) <- sample(dict, 1L)
    }
    x[j]
  })
}

AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", 
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

s0 <- "KDRH?THLA???RT?HLAK"
s1 <- "FKDHKHIDVKDRHRTHLAK????RTRHLAK"
s2 <- "FKHIDVKDRHRTRHLAK??????????"

fun(s0)
#> [1] "KDRHsTHLAwppRTwHLAK"

fun(s1)
#> [1] "FKDHKHIDVKDRHRTHLAKyfqfRTRHLAK"

fun(s2)
#> [1] "FKHIDVKDRHRTRHLAKnsfehqwmkv"

fun(c(s0, s1, s2))
#> [1] "KDRHiTHLAdssRTgHLAK"            "FKDHKHIDVKDRHRTHLAKcdivRTRHLAK"
#> [3] "FKHIDVKDRHRTRHLAKfrpafwpnif"

Created on 2022-10-22 with reprex v2.0.2
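Since the output is random, it cannot be compared against a fixed string without a seed. One way to check a result against the requirements in the question is to compare it with the input position by position. The sketch below restates fun() from this answer (written with function(j) so it also runs on R versions before 4.1) and adds a hypothetical checker named check_fill.

```r
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H",
            "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

fun <- function(x, dict = AADict) {
  dict <- tolower(dict)
  inx <- gregexpr("\\?", x)
  sapply(seq_along(x), function(j) {
    for (i in inx[[j]]) {
      substr(x[j], i, i) <- sample(dict, 1L)
    }
    x[j]
  })
}

# Hypothetical checker: TRUE if only the "?" positions changed
# and each of them now holds a dictionary character
check_fill <- function(original, filled, dict) {
  o <- strsplit(original, "")[[1]]
  f <- strsplit(filled, "")[[1]]
  wild <- o == "?"
  length(o) == length(f) && all(o[!wild] == f[!wild]) && all(f[wild] %in% dict)
}

seqs <- c("KDRH?THLA???RT?HLAK",
          "FKDHKHIDVKDRHRTHLAK????RTRHLAK",
          "FKHIDVKDRHRTRHLAK??????????")
out <- fun(seqs)

# Check every input/output pair
ok <- mapply(check_fill, seqs, out, MoreArgs = list(dict = tolower(AADict)))
all(ok)
```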

