在R中进行部分匹配的数据合并

6

我有两个数据集

datf1 <- data.frame (name = c("regular", "kklmin", "notSo", "Jijoh",
 "Kish", "Lissp", "Kcn", "CCCa"),
 number1 = c(1, 8, 9,  2,  18, 25, 33,   8))
#-----------
    name number1
1 regular       1
2  kklmin       8
3   notSo       9
4   Jijoh       2
5    Kish      18
6   Lissp      25
7     Kcn      33
8    CCCa       8

 datf2 <- data.frame (name = c("reGulr", "ntSo", "Jijoh", "sean", "LiSsp",
 "KcN", "CaPN"),
   number2 = c(2, 8, 12,    13, 20, 18,   13))
#-------------
   name number2
1 reGulr       2
2   ntSo       8
3  Jijoh      12
4   sean      13
5  LiSsp      20
6    KcN      18
7   CaPN      13

我希望通过名称列将它们合并,但允许部分匹配(以避免在大型数据集中阻碍合并拼写错误,甚至检测此类拼写错误),例如

(1)如果任何位置的连续四个字母(如果字母数量少于4,则全部匹配) - 匹配就可以了

 ABBCD = BBCDK = aBBCD = ramABBBCD = ABB 

(2) 匹配时区分大小写已关闭,例如ABBCD = aBbCd

(3) 新的数据集将保留来自datf1和datf2的名称。这样我们就可以检测到匹配是否完全(可能有一个单独的列来显示有多少个字母匹配)

这样的合并可能性如何?

编辑:

datf1 <- data.frame (name = c("xxregular", "kklmin", "notSo", "Jijoh",
             "Kish", "Lissp", "Kcn", "CCCa"),
                     number1 = c(1, 8, 9,  2,  18, 25, 33,   8))
datf2 <- data.frame (name = c("reGulr", "ntSo", "Jijoh", "sean", 
             "LiSsp", "KcN", "CaPN"),
                     number2 = c(2, 8, 12,  13, 20, 18,   13))


uglyMerge(datf1, datf2)
       name1  name2 number1 number2 matches
1  xxregular   <NA>       1      NA       0
2     kklmin   <NA>       8      NA       0
3      notSo   <NA>       9      NA       0
4      Jijoh  Jijoh       2      12       5
5       Kish   <NA>      18      NA       0
6      Lissp  LiSsp      25      20       5
7        Kcn    KcN      33      18       3
8       CCCa   <NA>       8      NA       0
9       <NA> reGulr      NA       2       0
10      <NA>   ntSo      NA       8       0
11      <NA>   sean      NA      13       0
12      <NA>   CaPN      NA      13       0

尝试修复一些格式。我看到你添加了“uglyMerge”的副本,这似乎是来自@sgibb的回复。'xxregular'与'reGulr'的不匹配可能对你来说很明显,但由于它似乎符合你的规格要求,所以你可能需要向我们解释一下。 - IRTFM
查看 fuzzyjoin 包。 - Ibo
2个回答

10

也许有一个简单的解决方案,但我找不到任何方法。
在我看来,您必须自己实现这种合并操作。
请参考下面的丑陋例子(还有很大的改进空间):

uglyMerge <- function(df1, df2) {

    ## lower all strings to allow case-insensitive comparison
    lowerNames1 <- tolower(df1[, 1]);
    lowerNames2 <- tolower(df2[, 1]);

    ## split strings into single characters
    names1 <- strsplit(lowerNames1, "");
    names2 <- strsplit(lowerNames2, "");

    ## create the final dataframe
    mergedDf <- data.frame(name1=as.character(df1[,1]), name2=NA, 
                        number1=df1[,2], number2=NA, matches=0,
                        stringsAsFactors=FALSE);

    ## store names of dataframe2 (to remember which strings have no match)
    toMerge <- df2[, 1];

    for (i in seq(along=names1)) {
        for (j in seq(along=names2)) {
            ## set minimal match to 4 or to string length
            minMatch <- min(4, length(names2[[j]]));

            ## find single matches
            matches <- names1[[i]] %in% names2[[j]];

            ## look for consecutive matches
            r <- rle(matches);

            ## any matches found?
            if (any(r$values)) {
                ## find max consecutive match
                possibleMatch <- r$value == TRUE;
                maxPos <- which(which.max(r$length[possibleMatch]) & possibleMatch)[1];

                ## store max conscutive match length
                maxMatch <- r$length[maxPos];

                ## to remove FALSE-POSITIVES (e.g. CCC and kcn) find 
                ## largest substring
                start <- sum(r$length[0:(maxPos-1)]) + 1;
                stop <- start + r$length[maxPos] - 1;
                maxSubStr <- substr(lowerNames1[i], start, stop);

                ## all matching criteria fulfilled
                isConsecutiveMatch <- maxMatch >= minMatch &&
                                    grepl(pattern=maxSubStr, x=lowerNames2[j], fixed=TRUE) &&
                                    nchar(maxSubStr) > 0;

                if (isConsecutiveMatch) {
                    ## merging
                    mergedDf[i, "matches"] <- maxMatch
                    mergedDf[i, "name2"] <- as.character(df2[j, 1]);
                    mergedDf[i, "number2"] <- df2[j, 2];

                    ## don't append this row to mergedDf because already merged
                    toMerge[j] <- NA;

                    ## stop inner for loop here to avoid possible second match
                    break;
                }
            }
        } 
    }

    ## append not matched rows to mergedDf
    toMerge <- which(df2[, 1] == toMerge);
    df2 <- data.frame(name1=NA, name2=as.character(df2[toMerge, 1]), 
                    number1=NA, number2=df2[toMerge, 2], matches=0, 
                    stringsAsFactors=FALSE);
    mergedDf <- rbind(mergedDf, df2);

    return (mergedDf);
}

输出:

> uglyMerge(datf1, datf2)
    name1  name2 number1 number2 matches
1  xxregular reGulr       1       2       5
2     kklmin   <NA>       8      NA       0
3      notSo   <NA>       9      NA       0
4      Jijoh  Jijoh       2      12       5
5       Kish   <NA>      18      NA       0
6      Lissp  LiSsp      25      20       5
7        Kcn    KcN      33      18       3
8       CCCa   <NA>       8      NA       0
9       <NA>   ntSo      NA       8       0
10      <NA>   sean      NA      13       0
11      <NA>   CaPN      NA      13       0

非常感谢您提供的出色解决方案。它在我提供的示例中运行良好。但是,似乎如果不是前四个连续字母,它就无法正常工作,只需查看我的编辑,在“regular”之前添加了额外的“xxx”,不匹配。尽管如此,这将为我提供很好的起点,谢谢! - jon
@hijo对不起,我的子字符串计算中有几个bug。请使用我修改过的版本。 - sgibb
对于您来说,使用一些编辑距离进行字符串匹配也可能很有价值。它在R中有实现 - danas.zuokas

6
agrep可以帮助您入门。类似这样:
lapply(tolower(datf1$name), function(x) agrep(x, tolower(datf2$name)))

然后,您可以调整 max.distance 参数,直到获得适当数量的匹配。然后按照您喜欢的方式进行合并。


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接