Unsupervised semantic clustering of phrases

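I have about a thousand potential survey items as a vector of strings that I would like to reduce to a few hundred. Usually when we talk about data reduction, we have actual data: I administer the items to participants and use factor analysis, PCA, or some other data reduction method. In my case, though, I don't have any data, just the items (i.e., text strings). I want to reduce the set by eliminating items with similar meaning; presumably they would be highly correlated if actually administered to participants. I've been reading about clustering approaches to text analysis. This SO question demonstrates an approach I've seen used in different examples. The OP notes that the clustering solution doesn't quite answer his/her question. Here is how it would be applied (unsatisfactorily) in my case: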
# get data (2 columns, 152 rows)

Link to a text.R file containing the dput() of the sample items.

# clustering
library(tm)
library(Matrix)
x <- TermDocumentMatrix( Corpus( VectorSource(text$item) ) )
y <- sparseMatrix( i=x$i, j=x$j, x=x$v, dimnames = dimnames(x) )  
plot( hclust(dist(t(y))) )

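The plot shows items 145 and 149 clustered together:
145 "Lets you know you are not wanted"
149 "Lets you know he loves you"
These items share the same stem, "Lets you know", which is probably why they cluster. Semantically, they are opposites.
The OP faced a similar challenge in his/her example. A commenter pointed to the wordnet package as a possible solution.
Question (edited based on feedback):
How can I prevent items like 145 and 149 from clustering just because they share a stem?
A secondary question with less of a programming focus: does anyone see a better solution here? Many of the approaches I've come across involve supervised learning, test/training datasets, and classification. I think what I'm after is closer to semantic similarity/clustering (e.g., FAC pdf).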

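You could strip out some stopwords. One approach, used by Matthew Jockers, is to remove everything except the nouns, which are the words most likely to be informative. - Tyler Rinker
What is prompting the votes to close? I provided a minimal sample dataset, included the code I tried, explained why the code does not produce the result I am after, and asked for ideas about alternatives. There is certainly a conceptual component here, but I think people who have faced a similar problem could offer a programming solution in R to reach the goal. The crowd must know best about closing, but I'm a bit confused. - Eric Green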
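I didn't vote to close, but the question may be more about content than coding. Perhaps edit the question to bring it closer to a coding problem. - Tyler Rinker
Thanks, @TylerRinker. I edited the question to focus on the coding challenge. - Eric Green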
1 Answer


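+1 to @TylerRinker's suggestions, including:

  • removing stopwords
  • clustering on nouns only, as in Jockers's approach (I have a working example here).

Another option you should try is to build your term-document matrix from bigrams rather than unigrams. If you're interested in phrases, bigrams are a good place to start. I have an example here.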
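
To see why bigrams can help with items 145 and 149, here is a minimal base-R sketch. The helper `bigrams` is hypothetical (it is not part of tm or RWeka) and only approximates what a real tokenizer does. It shows that the two items share just the stem bigrams, while the bigrams carrying the meaning differ.

```r
# Base-R sketch: split a phrase into lowercase word bigrams.
# `bigrams` is a hypothetical helper for illustration, not a tm/RWeka function.
bigrams <- function(s) {
  words <- strsplit(tolower(s), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]            # drop empty tokens
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))  # pair each word with its successor
}

a <- bigrams("Lets you know you are not wanted")  # item 145
b <- bigrams("Lets you know he loves you")        # item 149

shared   <- intersect(a, b)                       # bigrams from the common stem
distinct <- union(setdiff(a, b), setdiff(b, a))   # bigrams unique to one item
print(shared)
print(distinct)
```

Once the shared stem bigrams are filtered out (for example via a stopword list), the two items no longer have any terms in common, which is what should pull them apart in the clustering.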

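Here's a working example that combines stopword removal with bigrams. With it you can iterate over different parameter values to get the most sensible clustering.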

Get the data...

dat <- text <- structure(list(id = c("GHQ1", "GHQ2", "GHQ3", "GHQ4", "GHQ5", 
                                 "GHQ6", "GHQ7", "GHQ8", "GHQ9", "GHQ10", "GHQ11", "GHQ12", "GHQ13", 
                                 "GHQ14", "GHQ15", "GHQ16", "GHQ17", "GHQ18", "GHQ19", "GHQ20", 
                                 "GHQ21", "GHQ22", "GHQ23", "GHQ24", "CGMH9", "GHQ25", "GHQ26", 
                                 "GHQ27", "GHQ28", "GHQ29", "GHQ30", "GHQ31", "PARQ01A-P", "PARQ02A-P", 
                                 "PARQ03A-P", "PARQ04A-P", "PARQ05A-P", "PARQ06A-P", "PARQ07A-P", 
                                 "PARQ08A-P", "PARQ09A-P", "PARQ10A-P", "PARQ11A-P", "PARQ12A-P", 
                                 "PARQ13A-P", "PARQ14A-P", "PARQ15A-P", "PARQ16A-P", "PARQ17A-P", 
                                 "PARQ18A-P", "PARQ19A-P", "PARQ20A-P", "PARQ21A-P", "PARQ22A-P", 
                                 "PARQ23A-P", "PARQ24A-P", "PARQ25A-P", "PARQ26A-P", "PARQ27A-P", 
                                 "PARQ28A-P", "PARQ29A-P", "PARQ30A-P", "PARQ31A-P", "PARQ32A-P", 
                                 "PARQ33A-P", "PARQ34A-P", "PARQ35A-P", "PARQ36A-P", "PARQ37A-P", 
                                 "PARQ38A-P", "PARQ39A-P", "PARQ40A-P", "PARQ41A-P", "PARQ42A-P", 
                                 "PARQ43A-P", "PARQ44A-P", "PARQ45A-P", "PARQ46A-P", "PARQ47A-P", 
                                 "PARQ48A-P", "PARQ49A-P", "PARQ50A-P", "PARQ51A-P", "PARQ52A-P", 
                                 "PARQ53A-P", "PARQ54A-P", "PARQ55A-P", "PARQ56A-P", "PARQ57A-P", 
                                 "PARQ58A-P", "PARQ59A-P", "PARQ60A-P", "PARQ01A-C", "PARQ02A-C", 
                                 "PARQ03A-C", "PARQ04A-C", "PARQ05A-C", "PARQ06A-C", "PARQ07A-C", 
                                 "PARQ08A-C", "PARQ09A-C", "PARQ10A-C", "PARQ11A-C", "PARQ12A-C", 
                                 "PARQ13A-C", "PARQ14A-C", "PARQ15A-C", "PARQ16A-C", "PARQ17A-C", 
                                 "PARQ18A-C", "PARQ19A-C", "PARQ20A-C", "PARQ21A-C", "PARQ22A-C", 
                                 "PARQ23A-C", "PARQ24A-C", "PARQ25A-C", "PARQ26A-C", "PARQ27A-C", 
                                 "PARQ28A-C", "PARQ29A-C", "PARQ30A-C", "PARQ31A-C", "PARQ32A-C", 
                                 "PARQ33A-C", "PARQ34A-C", "PARQ35A-C", "PARQ36A-C", "PARQ37A-C", 
                                 "PARQ38A-C", "PARQ39A-C", "PARQ40A-C", "PARQ41A-C", "PARQ42A-C", 
                                 "PARQ43A-C", "PARQ44A-C", "PARQ45A-C", "PARQ46A-C", "PARQ47A-C", 
                                 "PARQ48A-C", "PARQ49A-C", "PARQ50A-C", "PARQ51A-C", "PARQ52A-C", 
                                 "PARQ53A-C", "PARQ54A-C", "PARQ55A-C", "PARQ56A-C", "PARQ57A-C", 
                                 "PARQ58A-C", "PARQ59A-C", "PARQ60A-C"), item = c("Been feeling unhappy or depressed", 
                                                                                  "Been feeling reasonably happy, all things considered", "Feeling edgy and bad-tempered", 
                                                                                  "Feel constantly under strain", "Found everything getting on top of you", 
                                                                                  "Been feeling nervous and strung-up all the time", "found at times you couldn't do anything because your nerves were too bad", 
                                                                                  "found everything getting on top of you", "thought of the possibility that you might make away with yourself", 
                                                                                  "found that the idea of taking your own life kept coming into your mind?", 
                                                                                  "found yourself withing you were dead and away from it all?", 
                                                                                  "felt that life isn't worth living", "felt that life was entirely hopeless?", 
                                                                                  "been able to enjoy your normal day-to-day activities", "been satisfied with the way you've carried out your task", 
                                                                                  "felt that you are playing a useful part in things", "felt on the whole you were doing things well?", 
                                                                                  "been feeling perfectly well and in good health", "been feeling in need of a good tonic", 
                                                                                  "been feeling run down and out of sorts?", "felt that you are ill", 
                                                                                  "been getting any pains in your head", "been getting a feeling of tightness or pressure in your head", 
                                                                                  "been having hot or cold spells", "Do you feel you have physical problems because of stress?", 
                                                                                  "Lost sleep over worry", "Had difficulty in staying asleep once you are off", 
                                                                                  "felt capable of making decisions about things", "been taking longer over the things that you do", 
                                                                                  "been managing to keep yourself busy and occupied", "been thinking of yourself as a worthless person", 
                                                                                  "been getting scared or panicky for no good reason ", "You say nice things about your child", 
                                                                                  "You nag or scold your child when (s)he is bad", "You ignore your child", 
                                                                                  "You wonder if you really love your child", "You talk to your child about daily routines and plans, and listen to what (s)he has to say", 
                                                                                  "You complain about your child to others when (s)he does not listen to you", 
                                                                                  "You take an interest in your child", "You want your child to bring friends home, and you try to make things pleasant for them", 
                                                                                  "You call your child names and make fun of him/her", "You ignore your child as long as (s)he does nothing to bother you", 
                                                                                  "You yell at your child when you are angry", "You sit close with your child so that (s)he feels free to talk about important things", 
                                                                                  "You are harsh with your child", "You enjoy having your child around you", 
                                                                                  "You make your child feel proud when (s)he does well", "Your hit your child even when (s)he may not deserve it, like for small mistakes", 
                                                                                  "You forget things you are supposed to do for your child", "You see your child as an annoyance", 
                                                                                  "You praise your child to others", "You punish your child when you are angry", 
                                                                                  "You make sure your child has the right kind of food to eat", 
                                                                                  "You talk to your child in a warm and loving way", "You get angry easily at your child", 
                                                                                  "You are too busy to answer your child's questions", "You hate/despise your child", 
                                                                                  "You say nice things to your child when (s)he deserves it, such as when (s)he does well in school", 
                                                                                  "You are irritable with your child", "You care about who your child's friends are", 
                                                                                  "You are really interested in what your child does", "You say many unkind things to your child", 
                                                                                  "You pay no attention to your child when (s)he asks for help", 
                                                                                  "You think it is your child's own fault when (s)he is having trouble", 
                                                                                  "You make your child feel wanted and needed", "You tell your child (s)he annoys you", 
                                                                                  "You pay a lot of attention to your child", "You tell your child how proud you are of him/her when (s)he is good", 
                                                                                  "You hurt your child's feelings", "You forget important things your child thinks you should remember", 
                                                                                  "When your child misbehaves, you make him/her feel unloved", 
                                                                                  "You make your child feel what (s)he does is important", "When your child does something wrong, you frighten or threaten him/her", 
                                                                                  "You like to spend time with your child, for example you sit and laugh together", 
                                                                                  "You try to help your child when (s)he is scared or upset", "When your child misbehaves, you shame him/her in front of his/her friends", 
                                                                                  "You avoid your child's company", "You complain about your child", 
                                                                                  "You care about what your child thinks, and encourage him/her to talk about it", 
                                                                                  "You feel other children are better than your own child", "When you make plans, you take your child's thoughts into consideration", 
                                                                                  "You let your child do things (s)he thinks are important, even if it is hard for you", 
                                                                                  "When your child misbehaves, you compare him/her unfavorably with other children", 
                                                                                  "You want to leave your child in someone else's care (for example, a neighbor or relative)", 
                                                                                  "You let your child know (s)he is not wanted", "You are interested in the things your child does", 
                                                                                  "You try to make your child feel better when (s)he is hurt or sick", 
                                                                                  "You tell your child you are ashamed of him/her when (s)he misbehaves", 
                                                                                  "You let your child know you love him/her", "You treat your child gently and with kindness", 
                                                                                  "When your child misbehaves, you make him/her feel ashamed or guilty", 
                                                                                  "You try to make your child happy", "Says nice things about you", 
                                                                                  "Nags or scolds you when you are bad", "Ignores you", "Does not really love you", 
                                                                                  "Talks to you about your plans and listens to what you have to say", 
                                                                                  "Complains about you to others when you do not listen to him", 
                                                                                  "Takes an interest in you", "Wants you to bring your friends home, and tries to make things pleasant for them", 
                                                                                  "Calls you names, ridicules you, and makes fun of you", "Ignores you as long as you do nothing to bother him", 
                                                                                  "Yells at you when he is angry", "Sits close with you so that you feel free to talk about important things", 
                                                                                  "Treats you harshly", "Enjoys having you around him", "Make you feel proud when you do well", 
                                                                                  "Hits you even when you do not deserve it, like for small mistakes", 
                                                                                  "Forgets things he is supposed to do for you", "Sees you as an annoyance", 
                                                                                  "Praises you to others", "Punishes you severely when he is angry", 
                                                                                  "Makes sure you have the right kind of food to eat", "Talks to you in a warm and loving way", 
                                                                                  "Gets angry at you easily", "Is too busy to answer your questions", 
                                                                                  "Seems to hate / despise you", "Says nice things to you when you deserve them, such as when you do well in school", 
                                                                                  "Gets mad quickly and picks on you", "Wants to know who your friends are", 
                                                                                  "Is really interested in what you do", "Says many unkind things to you", 
                                                                                  "Pays no attention when you ask for help", "Thinks it is your own fault when you are having trouble", 
                                                                                  "Makes you feel wanted and needed", "Tells you that you annoy him", 
                                                                                  "Pays a lot of attention to you", "Tells you how proud he is of you when you are good", 
                                                                                  "Goes out of his way to hurt your feelings", "Forgets important things you think he should remember", 
                                                                                  "Makes you feel unloved if you misbehave", "Makes you feel what you do is important", 
                                                                                  "Frightens or threatens you when you do something wrong", "Likes to spend time with you, for example you sit and laugh together", 
                                                                                  "Tries to help you when you are scared or upset", "Shames you in front of your friends when you misbehave", 
                                                                                  "Tries to stay away from you", "Complains about you and talks about you behind your back", 
                                                                                  "Cares about what you think, and likes you to talk about it", 
                                                                                  "Feels other children are better than you are no matter what you do", 
                                                                                  "Cares about what you would like when he makes plans", "Lets you do things you think are important, even if it is hard for him", 
                                                                                  "Thinks other children behave better than you do", "Wants other people to take care of you (for example, a neighbor or relative)", 
                                                                                  "Lets you know you are not wanted", "Is interested in the things you do", 
                                                                                  "Shows concern and tries to make you feel better when you are hurt or sick", 
                                                                                  "Tells you how ashamed he is when you misbehave", "Lets you know he loves you", 
                                                                                  "Treats you gently and with kindness", "Makes you feel ashamed or guilty when you misbehave", 
                                                                                  "Tries to make you happy")), .Names = c("id", "item"), row.names = c(NA, 
                                                                                                                                                       152L), class = "data.frame")

Now make a tdm of bigrams, then remove the bigrams that contain stopwords...
library("RWeka")
library("tm")
library("Matrix")    
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
x <- TermDocumentMatrix(Corpus(VectorSource(text$item)), control = list(tokenize = BigramTokenizer))
# little bit of regex to remove bigrams with stopwords in them, cf. https://dev59.com/0Gw05IYBdhLWcg3w6GDH#6947724
stpwrds <- paste(stopwords("en"), collapse = "|")
x$dimnames$Terms[!grepl(stpwrds, x$dimnames$Terms)]
[1] "cold spells"  "else s"       "feel free"    "feel proud"  
[5] "feel unloved" "feels free"   "lost sleep" 

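A quick test of removing bigrams with the stopword list that ships with the tm package shows that it leaves only 8 bigrams! Clearly we need a smaller stopword list, so let's make a custom one by finding the most frequent words in this particular corpus and removing those.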
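
One thing worth checking here: a pattern built with paste(..., collapse = "|") matches substrings, not whole words, so a short stopword can knock out bigrams that merely contain those letters. The sketch below uses a hypothetical four-word stopword list and made-up bigrams (not the tm list or the real tdm) to contrast plain alternation with a word-boundary pattern. Whether this explains the small number of surviving bigrams above is an assumption I have not verified against the actual data.

```r
# Hypothetical mini stopword list and bigrams, for illustration only.
stops <- c("i", "a", "the", "you")
terms <- c("small mistakes", "cold spells", "the child", "feel proud")

# Plain alternation matches substrings: the letters "a" and "i"
# inside "small mistakes" count as stopword hits.
loose <- grepl(paste(stops, collapse = "|"), terms)

# Wrapping the alternation in \\b ... \\b only matches whole words.
strict_pattern <- paste0("\\b(", paste(stops, collapse = "|"), ")\\b")
strict <- grepl(strict_pattern, terms, perl = TRUE)

kept_loose  <- terms[!loose]
kept_strict <- terms[!strict]
print(kept_loose)
print(kept_strict)
```

With the whole-word pattern, "small mistakes" survives and only "the child" is dropped. Stopwords that contain regex metacharacters would need escaping before being pasted into a pattern like this.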

# find freq words in corpus
x <- TermDocumentMatrix(Corpus(VectorSource(text$item)))
# arbitrary choice of 10 occurrences = high freq
mystopwords <- findFreqTerms(x, 10, Inf)

You should experiment with different lowfreq values. I settled on 10 after trying a few, but other values may well work better.
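
A rough way to explore thresholds before committing to one is to tabulate word counts and see which words each cutoff would turn into stopwords. This is a base-R sketch on a few hypothetical toy items; it only approximates findFreqTerms(), since tm applies its own tokenization and word-length defaults.

```r
# Hypothetical toy items; substitute text$item for the real data.
items <- c("You ignore your child", "You praise your child to others",
           "You hurt your child's feelings", "Ignores you")

# Count how often each lowercase word occurs across all items.
words <- unlist(strsplit(tolower(items), "[^a-z']+"))
words <- words[nzchar(words)]
freq  <- table(words)

# For each candidate threshold, list the words that would become stopwords.
for (lowfreq in 2:4) {
  hits <- names(freq)[freq >= lowfreq]
  cat(lowfreq, ":", paste(hits, collapse = ", "), "\n")
}
```

A lower threshold removes more words and therefore more bigrams, so it is worth re-running the clustering for a few values and checking which grouping looks most sensible.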
# try to filter the bigrams again with custom stopword list
x <- TermDocumentMatrix(Corpus(VectorSource(text$item)), control = list(tokenize = BigramTokenizer))
# little bit of regex to remove bigrams with mystopwords in them, cf. https://dev59.com/0Gw05IYBdhLWcg3w6GDH#6947724
mystpwrds <- paste(mystopwords, collapse = "|")
# subset tdm to keep only bigrams remaining after mystopwords removed
x <- x[x$dimnames$Terms[!grepl(mystpwrds, x$dimnames$Terms)],]
y <- sparseMatrix( i=x$i, j=x$j, x=x$v, dimnames = dimnames(x) )  
plot( hclust(dist(t(y))) )

[Image: dendrogram produced by plot(hclust(dist(t(y))))]

But that's a bit hard to read, so let's print out the group membership like this:

hc <- hclust(dist(t(y)))
cutree(hc, k = 100)

 1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16 
  1   2   3   4   5   3   6   5   3   7   8   9  10  11  12  13 
 17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32 
 14  15  16  17   3  18  19  20  21  22  23  24  25  26  27  28 
 33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48 
  3  29   3  30  31  32  33  34  35  36   3  37   3   3  38  39 
 49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 
 40  41   3   3  42  43  44  45   3  46   3   3  47  48  49  50 
 65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
  3  51  52  53   3   3   3  54  55  56  57  58   3   3   3  59 
 81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96 
 60  61   3  62  63  47  64  65   3   3  66   3   3  67   3  68 
 97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 
 69  70  33  71  35  72  73  74   3  75   3  76  40  41   3  73 
113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 
 42  43  77  45  78  79  80  81  47  48  82  83   3   3  84  85 
129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 
 86  87   3  88  89  90  91  92  93   3   3  59   3  94  95  96 
145 146 147 148 149 150 151 152 
  3  97  98  99 100   3  66   3 
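
The raw cutree() vector is still awkward to scan. One option is split(), which turns the assignments into one list element per cluster. This is a minimal sketch on a hypothetical 4-item count matrix, not the real tdm.

```r
# Hypothetical toy item-by-term count matrix, for illustration only.
m <- rbind(
  item1 = c(2, 0, 0),
  item2 = c(2, 1, 0),
  item3 = c(0, 0, 3),
  item4 = c(0, 1, 3)
)

hc     <- hclust(dist(m))   # same hclust/dist pipeline as above
groups <- cutree(hc, k = 2) # named integer vector: item -> cluster id

# split() returns one list element per cluster, holding the member names.
members <- split(names(groups), groups)
print(members)
```

With the objects from the answer, something like split(text$item, cutree(hc, k = 100)) should list the item wording per group, assuming the columns of the tdm are still in the same order as the rows of text.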

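We can see that rows 145 and 149 are in different groups. Whether that is a good answer is hard to say, though, because you haven't specified the output you expect. That is why you are getting close votes: it is difficult to infer from your question what a good answer would look like (more specifically, the problem is that you are asking to "recommend or find a tool, library or external resource", which leads to opinionated answers), and the SO community seems to prefer concrete examples of desired output. You could try asking on the new Data Science Stack Exchange site.
Anyway, hopefully you now have a few more ideas and some room to tune things as you explore your data. If you run into a specific programming problem related to this, feel free to ask another question.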

Thank you for taking the time to write this very helpful answer. I'm working through it carefully. Running x <- TermDocumentMatrix(Corpus(VectorSource(text$item)), control = list(tokenize = BigramTokenizer)) throws an error: Error in rep(seq_along(x), sapply(tflist, length)) : invalid 'times' argument In addition: Warning message: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL' - Eric Green
It may be related to parallel processing or Java. That's a common error, and no doubt you've already found these through Google: https://dev59.com/SmMm5IYBdhLWcg3wVdt5 and https://dev59.com/amIk5IYBdhLWcg3wE6d8. - Ben
Yes, setting options(mc.cores=1) before calling NGramTokenizer() solves it. - Eric Green
I don't think the following error is related to my setup: x <- x[x$dimnames$Terms[!grepl(mystpwrds, x$dimnames$Terms)]] fails with Error in x$nrow : $ operator is invalid for atomic vectors. The inner part, x$dimnames$Terms[!grepl(mystpwrds, x$dimnames$Terms)], runs fine. Adding a comma makes it work: x <- x[x$dimnames$Terms[!grepl(mystpwrds, x$dimnames$Terms)],]. - Eric Green
Nice, I've edited my answer. Does that answer your question? - Ben
