Creating a distance matrix for strings

Question


I would like to speed up the code below. Could someone with experience offer suggestions?

library(dplyr)
library(fuzzywuzzyR)

set.seed(42)
rm(list = ls())
options(scipen = 999)

init = FuzzMatcher$new()

data <- data.frame(string = c("hello world", "hello vorld", "hello world 1", "hello world", "hello world hello world"))
data$string <- as.character(data$string)

distance_function <- function(string_1, string_2) {
    init$Token_set_ratio(string1 = string_1, string2 = string_2)
}

combinations <- combn(nrow(data), 2)
distances <- matrix(, nrow = 1, ncol = ncol(combinations))

distance_matrix <- matrix(NA, nrow = nrow(data), ncol = nrow(data), dimnames = list(data$string, data$string))

for (i in 1:ncol(combinations)) {

    distance <- distance_function(data[combinations[1, i], 1], data[combinations[2, i], 1])

    #print(data[combinations[1, i], 1])
    #print(data[combinations[2, i], 1])
    #print(distance)

    distance_matrix[combinations[1, i], combinations[2, i]] <- distance
    distance_matrix[combinations[2, i], combinations[1, i]] <- distance

}

distance_matrix

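By the way, I tried proxy::dist and various other approaches without success. I also don't think the string distance function works as expected, but that is a separate matter. Ultimately, I want to use the distance matrix for clustering, to group similar strings (regardless of word order).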

- cs0815
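
One rough way to make a distance insensitive to word order, as the question asks, is to sort the words of each string before comparing. The sketch below is only an illustration under assumptions: `normalize_tokens` is a hypothetical helper, and base R's `utils::adist()` (generalized Levenshtein distance) stands in for the fuzzywuzzyR scorer, so the numbers are not the same as a `Token_set_ratio` score.

```r
# Hypothetical helper (not from the original post): lower-case each string,
# split it on whitespace, sort the words, and glue them back together.
normalize_tokens <- function(x) {
  tokens <- strsplit(tolower(trimws(x)), "\\s+")
  vapply(tokens, function(tok) paste(sort(tok), collapse = " "), character(1))
}

strings <- c("hello world", "world hello", "hello vorld")
normalized <- normalize_tokens(strings)

# adist() from the utils package computes all pairwise distances in one call.
d <- adist(normalized)
dimnames(d) <- list(strings, strings)
d
```

With this normalization, "hello world" and "world hello" become the same string, so their distance is zero, while "hello vorld" still differs by one substitution.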

Could you profile the program to see which part takes the most time? If distance_function accounts for most of it, that may be hard to improve with the current setup. - akrun

Not sure how to do that, but maybe the loop could be changed to an apply? Sorry, I'm still new to R. - cs0815
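
For reference, the loop-to-apply idea could look roughly like the sketch below. `pairwise_matrix` and `lev` are hypothetical names, and base R's `utils::adist()` is used as a stand-in distance so the example runs without extra packages. Note that this mainly tidies the code: the distance function is still called once per pair, so if that call dominates the run time, little speed is gained.

```r
# Hypothetical helper: build a symmetric matrix from any two-argument
# distance function, calling it once per unordered pair.
pairwise_matrix <- function(strings, dist_fun) {
  n <- length(strings)
  idx <- combn(n, 2)                      # 2 x choose(n, 2) matrix of index pairs
  d <- vapply(
    seq_len(ncol(idx)),
    function(k) dist_fun(strings[idx[1, k]], strings[idx[2, k]]),
    numeric(1)
  )
  m <- matrix(0, n, n, dimnames = list(strings, strings))
  m[t(idx)] <- d                          # fill the upper triangle
  m[t(idx[2:1, , drop = FALSE])] <- d     # mirror into the lower triangle
  m
}

# Stand-in distance (an assumption): Levenshtein distance via utils::adist().
lev <- function(a, b) as.numeric(adist(a, b))

m <- pairwise_matrix(c("hello world", "hello vorld", "hello world 1"), lev)
m
```

Unlike the original loop, this version puts 0 on the diagonal instead of NA.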

You can use profvis for profiling. See here for details: link. - akrun

Thanks, good to know. I'll give it a try, but I'm not sure whether it only works in RStudio, which I don't use... - cs0815
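
Independently of profvis, base R ships a sampling profiler, `Rprof()`, which works from any R console. The sketch below profiles a stand-in workload; the random strings, the `pairwise_loop` helper, and the use of `utils::adist()` are all assumptions made for illustration, not the code from the question.

```r
# Stand-in workload (an assumption): a pairwise loop over random strings.
set.seed(42)
strings <- replicate(200, paste(sample(letters, 12, replace = TRUE), collapse = ""))

pairwise_loop <- function(x) {
  idx <- combn(length(x), 2)
  out <- numeric(ncol(idx))
  for (k in seq_len(ncol(idx))) {
    out[k] <- adist(x[idx[1, k]], x[idx[2, k]])
  }
  out
}

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)   # start sampling the call stack
distances <- pairwise_loop(strings)
Rprof(NULL)                         # stop sampling

# Summarise time per function; by.self shows where time is actually spent.
prof <- summaryRprof(prof_file)
head(prof$by.self)
```

If most of the self time lands inside the distance call rather than in the loop bookkeeping, rewriting the loop will not help much and a faster distance implementation is the better target.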

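If proxy::dist is still too slow for you, you may need to implement your own function in C or C++. I recently showed a multithreaded example using geographic distances, but you could adapt it to support strings and output a full matrix. See also this example. - Alexis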

1 Answer


- Andrew · Accepted Answer

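If you need a matrix, you can use the stringdist package. As far as I can tell, the package you are using computes Levenshtein distance, so I included method = "lv" (you can try other methods as well). Let me know if you run into problems or would prefer a format other than a matrix. You may also want to consider a method other than Levenshtein distance (for example, two changes in a four-letter word count the same as two changes in a 20-word sentence). Good luck!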

library(dplyr)
library(stringdist)

set.seed(42)
rm(list = ls())
options(scipen = 999)

data <- data.frame(string = c("hello world", "hello vorld", "hello world 1", "hello world", "hello world hello world"))
data$string <- as.character(data$string)

dist_mat <- stringdist::stringdistmatrix(data$string, data$string, method = "lv")

rownames(dist_mat) <- data$string
colnames(dist_mat) <- data$string

dist_mat
                        hello world hello vorld hello world 1 hello world hello world hello world
hello world                       0           1             2           0                      12
hello vorld                       1           0             3           1                      13
hello world 1                     2           3             0           2                      11
hello world                       0           1             2           0                      12
hello world hello world          12          13            11          12                       0
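
To illustrate the caveat about string length and the clustering goal from the question, here is a minimal sketch that rescales the Levenshtein distances by the length of the longer string in each pair and then feeds them to hierarchical clustering. The scaling rule, the average linkage, and the cut height of 0.3 are assumptions chosen for this example, not recommendations from the answer.

```r
library(stringdist)

strings <- c("hello world", "hello vorld", "hello world 1",
             "hello world", "hello world hello world")

# Raw Levenshtein distances, as in the answer above.
raw <- stringdistmatrix(strings, strings, method = "lv")

# Divide each distance by the length of the longer string in the pair.
# Levenshtein distance never exceeds that length, so values fall in [0, 1].
longest <- outer(nchar(strings), nchar(strings), pmax)
scaled <- raw / longest

# Hierarchical clustering on the scaled distances (average linkage and the
# cut height are arbitrary choices for illustration).
tree <- hclust(as.dist(scaled), method = "average")
groups <- unname(cutree(tree, h = 0.3))

data.frame(string = strings, group = groups)
```

On this toy data the four short variants end up in one group and the doubled string in another; a different cut height or linkage would change the grouping.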