R code to match a vector of irregular strings against a vector of keywords

Question


3

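I'm stuck in a nightmare. I have searched the forums for an answer without success, so I'm asking directly.

I have a vector of irregular strings, each mentioning some city, and I want to tag each string with the city it mentions, taken from a keyword vector of city names. For example: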

Vector <- c("...the life in Paris is ...","In Roma, there is...","...nice weekend in New York with...")
Cities <- c("London","Paris","Madrid","Roma","New York")

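Each string in Vector should get a corresponding value naming its city.

At first I considered a loop, but the data set is too large and R takes too long to search. I would rather combine some kind of matrix computation with grep, but I keep getting errors.

Do you think this is the right approach?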

- jernac

2 Answers

1

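Here is a method using the text analysis package quanteda. It lets you define a set of patterns that match city names, which is useful when a city has several spellings (e.g. "Roma" and "Rome") but you want to treat them as a single city. The matching below uses the simplified "glob" format, but you could also use regular expression matching.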

require(quanteda)

# only required if you have compound word city names
compoundCities <- dictionary(list(NY = "New York"))
VectorPhrased <- phrasetotoken(Vector, compoundCities)

# uses the "glob" format for Pattern Matching
citiesDict <- dictionary(list(London = c("London", "Londres"), Paris = "Paris", 
                              Rome = "Rom?", NewYork = "New_York"))

dfm(VectorPhrased, dictionary = citiesDict, verbose = FALSE)
# Document-feature matrix of: 3 documents, 4 features.
# 3 x 4 sparse Matrix of class "dfmSparse"
#        features
# docs    London Paris Rome NewYork
#   text1      0     1    0       0
#   text2      0     0    1       0
#   text3      0     0    0       1

- Ken Benoit

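It doesn't work on my side. To avoid compound names, I use tolower and remove all spaces to get one clean, compact string (e.g. thelifeinparisis). Then I build the city dictionary in lowercase. Can it show the actual value for text1? - jernac

I'm not sure what you mean, since that isn't part of your example. But if you set valuetype = "regex", you will be able to find a match for "paris" inside a string like "thelifeinparis". If you extend the example, I can address it directly. - Ken Benoit

Great, it works (this is a really nice package, by the way). The problem was that I hadn't clearly explained the whole process behind the example. I see that valuetype accepts three values (glob, regex and fixed). I don't understand the glob format ("glob"-style wildcards): how does it differ from regex, given that both are used for pattern matching? - jernac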
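
The glob versus regex question in the last comment can be illustrated with base R alone. This is a minimal sketch, not quanteda's own matcher, and the `words` vector is made up for the illustration: `utils::glob2rx()` translates a glob pattern into the equivalent regular expression, which shows that the same pattern text `Rom?` means different things in the two formats.

```r
# Two spellings of the city plus two strings that should not match it
# (hypothetical data, not from the question)
words <- c("Roma", "Rome", "Romania", "Rotterdam")

# Glob: "?" stands for exactly one character and the pattern covers the
# whole string. glob2rx() translates it into a regular expression.
glob_as_regex <- utils::glob2rx("Rom?")   # "^Rom.$"
glob_hits <- grepl(glob_as_regex, words)  # TRUE TRUE FALSE FALSE

# Regex: "?" makes the preceding "m" optional and the match may occur
# anywhere in the string, so "Ro" alone is already enough.
regex_hits <- grepl("Rom?", words)        # TRUE TRUE TRUE TRUE
```

In short, glob patterns use `*` and `?` as simple wildcards over the whole string, while regular expressions are a richer language in which `?` and `*` are quantifiers applied to the preceding element.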


- Cath · Accepted Answer

You can use sapply and grepl:

check_vec <- sapply(Cities, grepl, Vector)
row.names(check_vec) <- Vector

check_vec
#                                    London Paris Madrid  Roma New York
#...the life in Paris is ...          FALSE  TRUE  FALSE FALSE    FALSE
#In Roma, there is...                 FALSE FALSE  FALSE  TRUE    FALSE
#...nice weekend in New York with...  FALSE FALSE  FALSE FALSE     TRUE
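
By default grepl treats each city name as a regular expression. A possible variation, sketched here as an assumption rather than as part of the original answer, passes `fixed = TRUE` so that names are matched as literal text, and lowercases both sides to make the match case-insensitive (`fixed = TRUE` cannot be combined with `ignore.case = TRUE`). The names `check_fixed` and `check_lower` are illustrative.

```r
Vector <- c("...the life in Paris is ...", "In Roma, there is...",
            "...nice weekend in New York with...")
Cities <- c("London", "Paris", "Madrid", "Roma", "New York")

# fixed = TRUE matches each city name literally instead of as a regex
check_fixed <- sapply(Cities, grepl, x = Vector, fixed = TRUE)
rownames(check_fixed) <- Vector

# Case-insensitive literal matching: lowercase both the keywords and the text
check_lower <- sapply(tolower(Cities), grepl, x = tolower(Vector), fixed = TRUE)
colnames(check_lower) <- Cities  # restore the original spelling as column names
```

Literal matching avoids surprises if a keyword ever contains a regex metacharacter such as a dot or a parenthesis.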

If you need the matching keyword for each string in the vector:

apply(check_vec, 1, function (x) colnames(check_vec)[which(x)])
#        ...the life in Paris is ...                In Roma, there is... ...nice weekend in New York with... 
#                            "Paris"                              "Roma"                          "New York"
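
The apply call above returns a plain character vector only when every string matches exactly one city; otherwise it returns a list. A minimal sketch of one way to handle strings with no match or several matches follows. The sample strings and the choice to collapse multiple hits with a comma are assumptions for illustration, not part of the original answer.

```r
# Hypothetical input: one string with two cities, one with none
Vector <- c("From Paris to Roma by train", "In Roma, there is...", "No city here")
Cities <- c("London", "Paris", "Madrid", "Roma", "New York")

check_vec <- vapply(Cities, grepl, logical(length(Vector)), x = Vector, fixed = TRUE)

# Collapse all matching city names per string; NA when nothing matches
matched <- apply(check_vec, 1, function(x) {
  hits <- colnames(check_vec)[x]
  if (length(hits) == 0) NA_character_ else paste(hits, collapse = ", ")
})
matched <- unname(matched)
```

Because every row now yields exactly one string, the result is always a character vector of the same length as Vector.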

Edit

For extra safety, as @nicola wisely suggested, you can use vapply instead of sapply:

check_vec <- vapply(Cities, grepl, x=Vector, logical(length(Vector)))
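
One way to see why vapply is the safer choice is an edge case such as an empty keyword vector. The sketch below is illustrative only (`no_cities` is a hypothetical input): sapply has nothing to simplify and falls back to a list, while vapply keeps the declared logical type.

```r
Vector <- c("...the life in Paris is ...", "In Roma, there is...",
            "...nice weekend in New York with...")
no_cities <- character(0)  # hypothetical edge case: no keywords at all

# sapply cannot simplify zero results, so it returns a list
from_sapply <- sapply(no_cities, grepl, x = Vector)

# vapply declares the result type up front, so the result stays logical
from_vapply <- vapply(no_cities, grepl, logical(length(Vector)), x = Vector)

is.list(from_sapply)
is.logical(from_vapply)
```

Code that later indexes or sums the result therefore behaves consistently with vapply, whereas the sapply result would need a special case.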