How to do "word clustering" with the udpipe package in R?

3
I'm doing some text mining in R with the udpipe package. I followed this tutorial: https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-usecase-postagging-lemmatisation.html#nouns__adjectives_used_in_same_sentence, but now I'm a bit stuck.
In fact, I'd like to group more than two words together so that I can identify key expressions such as "from dusk till dawn". So I'm wondering whether, starting from the graph in the tutorial above, I could run some kind of clustering algorithm that "merges" words that are strongly and frequently linked, and if so, how?
Is there another way to do this?
Thanks!

1
"n-grams" is the keyword you are looking for. - moodymudskipper
2
Have you looked at the `ego()` and `cliques()` functions in the igraph package? Try `cliques(wordnetwork, min = 2, max = NULL)` and `ego(wordnetwork)`. Do the results match what you expect? - nghauran
1
I wasn't referring to a specific package; the sets of words you're looking for (words found one after another) are called n-grams, whereas association (the answer below) is something else: words found together within items of a corpus. - moodymudskipper
1
If I understand correctly, you should be interested in ego networks. In a network example like yours, say `proof -> of` and `of -> concept`, the 2nd-order ego network of `proof` will contain `of` and `concept`, even though `proof` and `concept` are not directly connected. - nghauran
2
The link I gave you provides an example using bigrams; you can use exactly the same example with n set to 4 instead of 2, and you'll have 4-grams... - moodymudskipper
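To make the n-gram suggestion from these comments concrete, here is a minimal sketch using udpipe's `txt_nextgram()` (the function the tutorial's bigram example is built on); the sample tokens are invented for illustration:

```r
library(udpipe)

# A toy token vector; in practice this would be x$lemma or x$token
# from the annotated data frame
tokens <- c("from", "dusk", "till", "dawn", "tonight")

# Sliding 4-grams: each position is joined with the next three tokens;
# trailing positions without a complete 4-gram become NA
txt_nextgram(tokens, n = 4, sep = " ")
# -> "from dusk till dawn" "dusk till dawn tonight" NA NA NA
```

Counting the frequencies of these n-grams over a whole corpus (e.g. with `table()`) then surfaces recurring multi-word expressions.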
1 Answer

4
Here are two options (using ego networks and community detection), based on the tutorial you provided.
library(udpipe)
data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "es")

# Download and load the Spanish model, then annotate the reviews
ud_model <- udpipe_download_model(language = "spanish")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model, x = comments$feedback, doc_id = comments$id)
x <- as.data.frame(x)


# Co-occurrences of nouns and adjectives within the same sentence
cooc <- cooccurrence(x = subset(x, upos %in% c("NOUN", "ADJ")), 
                     term = "lemma", 
                     group = c("doc_id", "paragraph_id", "sentence_id"))
head(cooc)

library(igraph)
library(ggraph)
library(ggplot2)
# Keep the 30 strongest co-occurrences and turn them into a graph
wordnetwork <- head(cooc, 30)
wordnetwork <- graph_from_data_frame(wordnetwork)
ggraph(wordnetwork, layout = "fr") +
        geom_edge_link(aes(width = cooc, edge_alpha = cooc), edge_colour = "pink") +
        geom_node_text(aes(label = name), col = "darkgreen", size = 4) +
        theme_graph(base_family = "Arial Narrow") +
        theme(legend.position = "none") +
        labs(title = "Cooccurrences within sentence", subtitle = "Nouns & Adjectives")


### Option 1: using ego-networks
V(wordnetwork) # the graph has 23 vertices
ego(wordnetwork, order = 2) # 2nd-order ego network of each vertex
ego(wordnetwork, order = 1, nodes = 10) # 1st-order ego network of the 10th vertex (publico)
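To see on a self-contained toy graph what these ego networks return, here is a sketch using the `proof -> of -> concept` example from the comments (the edges are invented for illustration; with the answer's data you would pass `wordnetwork` instead of `g`):

```r
library(igraph)

# Toy directed graph: proof -> of -> concept
g <- graph_from_data_frame(data.frame(from = c("proof", "of"),
                                      to   = c("of", "concept")))

# The 2nd-order ego network of "proof" reaches "concept" through "of",
# even though proof and concept share no direct edge
names(ego(g, order = 2, nodes = "proof")[[1]])
# contains "proof", "of" and "concept"
```

Each element returned by `ego()` is a vertex sequence, so `names()` gives the words belonging to that ego network.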


### Option 2: using community detection

# Community structure detection based on edge betweenness (http://igraph.org/r/doc/cluster_edge_betweenness.html)
cluster_edge_betweenness(wordnetwork, weights = E(wordnetwork)$cooc)

# Community detection via random walks (http://igraph.org/r/doc/cluster_walktrap.html)
cluster_walktrap(wordnetwork, weights = E(wordnetwork)$cooc, steps = 2)

# Community detection via optimization of modularity score
# This works for undirected graphs only
wordnetwork2 <- as.undirected(wordnetwork) # an undirected graph
cluster_fast_greedy(wordnetwork2, weights = E(wordnetwork2)$cooc)

# Note that you can plot community object
comm <- cluster_fast_greedy(wordnetwork2, weights = E(wordnetwork2)$cooc)
plot_dendrogram(comm)
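To actually read off the "merged" word groups the question asks for, the membership of a communities object can be split into one vector of words per community. A self-contained sketch on a toy graph (the words are invented; with the answer's data you would build `comm` from `wordnetwork2` as above):

```r
library(igraph)

# Toy undirected co-occurrence graph with two disconnected word groups
g <- graph_from_data_frame(data.frame(
        from = c("noche", "dia",    "casa",  "proof", "of"),
        to   = c("dia",   "semana", "noche", "of",    "concept")),
     directed = FALSE)

comm <- cluster_fast_greedy(g)

# One character vector of words per detected community
split(names(membership(comm)), membership(comm))
```

`membership()` returns a named integer vector mapping each word to its community id, which is often more convenient than the printed communities object.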

(image: dendrogram of the detected communities)


2
Although I think the asker is really interested in the functions keywords_rake / keywords_collocation / keywords_phrases / textrank_keywords and should look at the arguments of these functions in more detail, I really like this approach. A very interesting use case of word clustering! Thanks for sharing! - user1600826
@jwijffels I had a look at keywords_phrases: https://rdrr.io/cran/udpipe/man/keywords_phrases.html but configuring the patterns seems a bit "heavy"... Is there an easier way? - MysteryGuy
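Regarding this follow-up: `keywords_rake()` needs less configuration than `keywords_phrases()`, since it only takes a logical `relevant` filter instead of a regex over part-of-speech tags. A minimal self-contained sketch on a toy stand-in for the annotated data frame (the rows mimic udpipe output; with the answer's data you would pass `x` directly):

```r
library(udpipe)

# Toy stand-in for the annotated data frame x from the answer above;
# "el" and "en" play the role of irrelevant filler words
x <- data.frame(doc_id = rep(c("d1", "d2"), each = 5),
                lemma  = c("el", "gran", "apartamento", "en", "centro",
                           "el", "gran", "apartamento", "en", "barrio"),
                upos   = c("DET", "ADJ", "NOUN", "ADP", "NOUN",
                           "DET", "ADJ", "NOUN", "ADP", "NOUN"),
                stringsAsFactors = FALSE)

# RAKE scores keywords built from consecutive relevant terms;
# irrelevant rows (DET, ADP) act as separators
keywords_rake(x, term = "lemma", group = "doc_id",
              relevant = x$upos %in% c("NOUN", "ADJ"),
              n_min = 1)
```

With the reviews corpus, "gran apartamento" here stands in for the kind of multi-word expression RAKE surfaces without any pattern configuration.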

Content provided by Stack Overflow; translated from the original English post.