Predictive analysis with the R tm package: how do I classify a new document?

This is a general question about a text-mining workflow. Suppose I have a corpus of documents classified as spam / not spam. Following the standard procedure, I preprocess the data, removing punctuation, stop words and so on. After converting it into a DocumentTermMatrix, I can build models to predict spam / not spam.
Now my question: I want to use the model I have already built on newly arriving documents. To check a single document, I need to build a DocumentTermVector for it so that I can feed it to the model and predict spam / not spam. In the tm documentation I found that the whole corpus can be converted to a matrix using tf-idf weights. How, then, do I transform a single vector using the idf from the corpus? Do I have to modify my corpus and build a new DocumentTermMatrix every time?
I processed my corpus, converted it into a matrix and split it into a training and a test set. But there the test set is built in the same run as the document matrix of the full set. I can check precision and other metrics, but I don't know the best procedure for classifying new text.
Ben, suppose I have a preprocessed DocumentTermMatrix that I convert into a data.frame.
dtm <- DocumentTermMatrix(CorpusProc,
                          control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE),
                                         stopwords = TRUE,
                                         wordLengths = c(3, Inf),
                                         bounds = list(global = c(4, Inf))))

dtmDataFrame <- as.data.frame(as.matrix(dtm))  # older tm versions used as.data.frame(inspect(dtm))

I add a factor variable and build a model.

Corpus.svm <- svm(Risk_Category ~ ., data = dtmDataFrame)

Now imagine I give you a new document d (not previously in your corpus) and you want to know whether the model predicts spam or not spam. How would you do it?

OK, let me build an example based on the code used here.

examp1 <- "When discussing performance with colleagues, teaching, sending a bug report or searching for guidance on mailing lists and here on SO, a reproducible example is often asked and always helpful. What are your tips for creating an excellent example? How do you paste data structures from r in a text format? What other information should you include? Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc? How does one make a great r reproducible example?"
examp2 <- "Sometimes the problem really isn't reproducible with a smaller piece of data, no matter how hard you try, and doesn't happen with synthetic data (although it's useful to show how you produced synthetic data sets that did not reproduce the problem, because it rules out some hypotheses). Posting the data to the web somewhere and providing a URL may be necessary. If the data can't be released to the public at large but could be shared at all, then you may be able to offer to e-mail it to interested parties (although this will cut down the number of people who will bother to work on it). I haven't actually seen this done, because people who can't release their data are sensitive about releasing it any form, but it would seem plausible that in some cases one could still post data if it were sufficiently anonymized/scrambled/corrupted slightly in some way. If you can't do either of these then you probably need to hire a consultant to solve your problem" 
examp3 <- "You are most likely to get good help with your R problem if you provide a reproducible example. A reproducible example allows someone else to recreate your problem by just copying and pasting R code. There are four things you need to include to make your example reproducible: required packages, data, code, and a description of your R environment. Packages should be loaded at the top of the script, so it's easy to see which ones the example needs. The easiest way to include data in an email is to use dput() to generate the R code to recreate it. For example, to recreate the mtcars dataset in R, I'd perform the following steps: Run dput(mtcars) in R Copy the output In my reproducible script, type mtcars <- then paste. Spend a little bit of time ensuring that your code is easy for others to read: make sure you've used spaces and your variable names are concise, but informative, use comments to indicate where your problem lies, do your best to remove everything that is not related to the problem. The shorter your code is, the easier it is to understand. Include the output of sessionInfo() as a comment. This summarises your R environment and makes it easy to check if you're using an out-of-date package. You can check you have actually made a reproducible example by starting up a fresh R session and pasting your script in. Before putting all of your code in an email, consider putting it on http://gist.github.com/. It will give your code nice syntax highlighting, and you don't have to worry about anything getting mangled by the email system."
examp4 <- "Do your homework before posting: If it is clear that you have done basic background research, you are far more likely to get an informative response. See also Further Resources further down this page. Do help.search(keyword) and apropos(keyword) with different keywords (type this at the R prompt). Do RSiteSearch(keyword) with different keywords (at the R prompt) to search R functions, contributed packages and R-Help postings. See ?RSiteSearch for further options and to restrict searches. Read the online help for relevant functions (type ?functionname, e.g., ?prod, at the R prompt) If something seems to have changed in R, look in the latest NEWS file on CRAN for information about it. Search the R-faq and the R-windows-faq if it might be relevant (http://cran.r-project.org/faqs.html) Read at least the relevant section in An Introduction to R If the function is from a package accompanying a book, e.g., the MASS package, consult the book before posting. The R Wiki has a section on finding functions and documentation"
examp5 <- "Before asking a technical question by e-mail, or in a newsgroup, or on a website chat board, do the following:  Try to find an answer by searching the archives of the forum you plan to post to. Try to find an answer by searching the Web. Try to find an answer by reading the manual. Try to find an answer by reading a FAQ. Try to find an answer by inspection or experimentation. Try to find an answer by asking a skilled friend. If you're a programmer, try to find an answer by reading the source code. When you ask your question, display the fact that you have done these things first; this will help establish that you're not being a lazy sponge and wasting people's time. Better yet, display what you have learned from doing these things. We like answering questions for people who have demonstrated they can learn from the answers. Use tactics like doing a Google search on the text of whatever error message you get (searching Google groups as well as Web pages). This might well take you straight to fix documentation or a mailing list thread answering your question. Even if it doesn't, saying “I googled on the following phrase but didn't get anything that looked promising” is a good thing to do in e-mail or news postings requesting help, if only because it records what searches won't help. It will also help to direct other people with similar problems to your thread by linking the search terms to what will hopefully be your problem and resolution thread. Take your time. Do not expect to be able to solve a complicated problem with a few seconds of Googling. Read and understand the FAQs, sit back, relax and give the problem some thought before approaching experts. Trust us, they will be able to tell from your questions how much reading and thinking you did, and will be more willing to help if you come prepared. Don't instantly fire your whole arsenal of questions just because your first search turned up no answers (or too many). Prepare your question. Think it through. Hasty-sounding questions get hasty answers, or none at all. The more you do to demonstrate that having put thought and effort into solving your problem before seeking help, the more likely you are to actually get help. Beware of asking the wrong question. If you ask one that is based on faulty assumptions, J. Random Hacker is quite likely to reply with a uselessly literal answer while thinking Stupid question..., and hoping the experience of getting what you asked for rather than what you needed will teach you a lesson."



corpus2 <- Corpus(VectorSource(c(examp1, examp2, examp3, examp4)))

Note that I left out examp5.
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
corpus2.proc <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)

corpus2a.dtm <- DocumentTermMatrix(corpus2.proc, control = list(wordLengths = c(3,10)))
dtmDataFrame <- as.data.frame(as.matrix(corpus2a.dtm))

I add a factor variable Spam_Classification with two levels: spam and No_Spam.

dtmFinal <- cbind(dtmDataFrame, Spam_Classification)

I build an SVM model: Corpus.svm <- svm(Spam_Classification ~ ., data = dtmFinal)

Now imagine a new document (an email) arrives, say examp5. How do I generate the spam/No_Spam value for it?


Please update your question to include the code you are currently using, some sample data so that we can reproduce your approach, and an example of the output you expect. With that additional information you are much more likely to get a useful answer. - Ben
Ben, this is quite a general question and I don't think we really need code. In any case, suppose I have a preprocessed DocumentTermMatrix that I convert into a data.frame: dtm <- DocumentTermMatrix(CorpusProc, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE), stopwords = TRUE, wordLengths = c(3, Inf), bounds = list(global = c(4, Inf)))). - Dr VComas
3 Answers

Thanks for this interesting question. I have been pondering it for a while. To keep things short, the essence of my findings is: for weighting methods other than plain tf, I see no way around tedious workarounds or recomputing the whole DTM (and probably re-running the svm).
Only for tf weighting did I find a simple way to classify new content. You have to convert the new document (of course) into a DTM. During that conversion you have to supply a "dictionary" containing all the terms that were used to train the classifier on the old corpus. Then you can use predict() as usual. Here is a very minimal example, and a way to classify a new document:
### I) Data

texts <- c("foo bar spam",
           "bar baz ham",
           "baz qux spam",
           "qux quux ham")

categories <- c("Spam", "Ham", "Spam", "Ham")

new <- "quux quuux ham"

### II) Building Model on Existing Documents "texts"

library(tm)  # text mining package for R
library(e1071)  # package with various machine-learning libraries

## creating DTM for texts
dtm <- DocumentTermMatrix(Corpus(VectorSource(texts)))

## making DTM a data.frame and adding variable categories
df <- data.frame(categories, as.data.frame(as.matrix(dtm)))  # older tm: as.data.frame(inspect(dtm))

model <- svm(categories~., data=df)

### III) Predicting class of new

## creating dtm for new
dtm_n <- DocumentTermMatrix(Corpus(VectorSource(new)),
                            ## without this line predict won't work
                            control=list(dictionary=names(df)))
## creating data.frame for new
df_n <- as.data.frame(as.matrix(dtm_n))

predict(model, df_n)

## > 1 
## > Ham 
## > Levels: Ham Spam
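
The dictionary argument is what makes predict() work here: it builds the new DTM over exactly the training vocabulary, so df_n ends up with the same columns the model was fit on, and any terms the model has never seen are simply dropped (the zero-count categories column that results from using names(df) as the dictionary is ignored by the formula). Applied to the question's own example, the same idea might look roughly like this; it is an untested sketch that reuses examp5, funcs, corpus2a.dtm and Corpus.svm from the question above:

## untested sketch: classify the question's examp5 with the model from the question
new.corpus <- tm_map(Corpus(VectorSource(examp5)), FUN = tm_reduce, tmFuns = funcs)

## dictionary = Terms(...) restricts the new DTM to the training vocabulary
new.dtm <- DocumentTermMatrix(new.corpus,
                              control = list(wordLengths = c(3, 10),
                                             dictionary = Terms(corpus2a.dtm)))

new.df <- as.data.frame(as.matrix(new.dtm))
predict(Corpus.svm, new.df)  # should return spam or No_Spam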


I had the same problem, and I think the RTextTools package can help you.

Take a look at create_matrix:

...
originalMatrix - the original DocumentTermMatrix used to train the models. If supplied, the new matrix is adjusted so that it works with the saved model.
...

So in code:

train.data <- loadDataTable() # load data from DB - 3 columns (info, subject, category)
train.matrix <- create_matrix(train.data[, c("subject", "info")], language="english", removeNumbers=TRUE, stemWords=FALSE, weighting=weightTfIdf)
train.container <- create_container(train.matrix,train.data$category,trainSize=1:nrow(train.data), virgin=FALSE)
model <- train_model(train.container, algorithm=c("SVM"))
# save model & matrix

predict.text <- function(info, subject, train.matrix, model)
{
     predict.matrix <- create_matrix(cbind(subject = subject, info = info), originalMatrix = train.matrix, language="english", removeNumbers=TRUE, stemWords=FALSE, weighting=weightTfIdf)
     predict.container <- create_container(predict.matrix, NULL, testSize = 1, virgin = FALSE) # testSize = 1 - we have only one row!
     return(classify_model(predict.container, model))
}
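
A hypothetical call, assuming the train.matrix and model objects from the snippet above are still in memory (the subject and body strings here are made up):

new.prediction <- predict.text(info = "body text of a newly arrived message",
                               subject = "subject of a newly arrived message",
                               train.matrix = train.matrix,
                               model = model)
new.prediction  # classify_model() returns the predicted label together with its probability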


It is not clear what your question is or what kind of answer you are looking for.

Assuming your question is something like "how do I get a DocumentTermVector that I can pass on to other functions?", here is one method.

Here is some reproducible data:

examp1 <- "When discussing performance with colleagues, teaching, sending a bug report or searching for guidance on mailing lists and here on SO, a reproducible example is often asked and always helpful. What are your tips for creating an excellent example? How do you paste data structures from r in a text format? What other information should you include? Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc? How does one make a great r reproducible example?"
examp2 <- "Sometimes the problem really isn't reproducible with a smaller piece of data, no matter how hard you try, and doesn't happen with synthetic data (although it's useful to show how you produced synthetic data sets that did not reproduce the problem, because it rules out some hypotheses). Posting the data to the web somewhere and providing a URL may be necessary. If the data can't be released to the public at large but could be shared at all, then you may be able to offer to e-mail it to interested parties (although this will cut down the number of people who will bother to work on it). I haven't actually seen this done, because people who can't release their data are sensitive about releasing it any form, but it would seem plausible that in some cases one could still post data if it were sufficiently anonymized/scrambled/corrupted slightly in some way. If you can't do either of these then you probably need to hire a consultant to solve your problem" 
examp3 <- "You are most likely to get good help with your R problem if you provide a reproducible example. A reproducible example allows someone else to recreate your problem by just copying and pasting R code. There are four things you need to include to make your example reproducible: required packages, data, code, and a description of your R environment. Packages should be loaded at the top of the script, so it's easy to see which ones the example needs. The easiest way to include data in an email is to use dput() to generate the R code to recreate it. For example, to recreate the mtcars dataset in R, I'd perform the following steps: Run dput(mtcars) in R Copy the output In my reproducible script, type mtcars <- then paste. Spend a little bit of time ensuring that your code is easy for others to read: make sure you've used spaces and your variable names are concise, but informative, use comments to indicate where your problem lies, do your best to remove everything that is not related to the problem. The shorter your code is, the easier it is to understand. Include the output of sessionInfo() as a comment. This summarises your R environment and makes it easy to check if you're using an out-of-date package. You can check you have actually made a reproducible example by starting up a fresh R session and pasting your script in. Before putting all of your code in an email, consider putting it on http://gist.github.com/. It will give your code nice syntax highlighting, and you don't have to worry about anything getting mangled by the email system."
examp4 <- "Do your homework before posting: If it is clear that you have done basic background research, you are far more likely to get an informative response. See also Further Resources further down this page. Do help.search(keyword) and apropos(keyword) with different keywords (type this at the R prompt). Do RSiteSearch(keyword) with different keywords (at the R prompt) to search R functions, contributed packages and R-Help postings. See ?RSiteSearch for further options and to restrict searches. Read the online help for relevant functions (type ?functionname, e.g., ?prod, at the R prompt) If something seems to have changed in R, look in the latest NEWS file on CRAN for information about it. Search the R-faq and the R-windows-faq if it might be relevant (http://cran.r-project.org/faqs.html) Read at least the relevant section in An Introduction to R If the function is from a package accompanying a book, e.g., the MASS package, consult the book before posting. The R Wiki has a section on finding functions and documentation"
examp5 <- "Before asking a technical question by e-mail, or in a newsgroup, or on a website chat board, do the following:  Try to find an answer by searching the archives of the forum you plan to post to. Try to find an answer by searching the Web. Try to find an answer by reading the manual. Try to find an answer by reading a FAQ. Try to find an answer by inspection or experimentation. Try to find an answer by asking a skilled friend. If you're a programmer, try to find an answer by reading the source code. When you ask your question, display the fact that you have done these things first; this will help establish that you're not being a lazy sponge and wasting people's time. Better yet, display what you have learned from doing these things. We like answering questions for people who have demonstrated they can learn from the answers. Use tactics like doing a Google search on the text of whatever error message you get (searching Google groups as well as Web pages). This might well take you straight to fix documentation or a mailing list thread answering your question. Even if it doesn't, saying “I googled on the following phrase but didn't get anything that looked promising” is a good thing to do in e-mail or news postings requesting help, if only because it records what searches won't help. It will also help to direct other people with similar problems to your thread by linking the search terms to what will hopefully be your problem and resolution thread. Take your time. Do not expect to be able to solve a complicated problem with a few seconds of Googling. Read and understand the FAQs, sit back, relax and give the problem some thought before approaching experts. Trust us, they will be able to tell from your questions how much reading and thinking you did, and will be more willing to help if you come prepared. Don't instantly fire your whole arsenal of questions just because your first search turned up no answers (or too many). Prepare your question. Think it through. Hasty-sounding questions get hasty answers, or none at all. The more you do to demonstrate that having put thought and effort into solving your problem before seeking help, the more likely you are to actually get help. Beware of asking the wrong question. If you ask one that is based on faulty assumptions, J. Random Hacker is quite likely to reply with a uselessly literal answer while thinking Stupid question..., and hoping the experience of getting what you asked for rather than what you needed will teach you a lesson."

Make a corpus from these texts:

corpus2 <- Corpus(VectorSource(c(examp1, examp2, examp3, examp4, examp5)))

Process the text:

skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
corpus2.proc <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)

Convert the processed corpus into a document-term matrix:

corpus2a.dtm <- DocumentTermMatrix(corpus2.proc, control = list(wordLengths = c(3,10)))
inspect(corpus2a.dtm)

A document-term matrix (5 documents, 273 terms)

Non-/sparse entries: 314/1051
Sparsity           : 77%
Maximal term length: 10 
Weighting          : term frequency (tf)

    Terms
Docs able actually addition allows answer answering answers archives are arsenal avoid background based
   1    0        0        2      0      0         0       0        0   1       0     1          0     0
   2    1        1        0      0      0         0       0        0   0       0     0          0     0
   3    0        1        0      1      0         0       0        0   0       0     0          0     0
   4    0        0        0      0      0         0       0        0   0       0     0          1     0
   5    2        1        0      0      8         2       3        1   0       1     0          0     1

And here is the key line of code for getting the "DocumentTerm*Vector*" you mention:

# access vector of first document in the dtm
as.matrix(corpus2a.dtm)[1,]

able   actually   addition     allows     answer  answering    answers   archives        are 
         0          0          2          0          0          0          0          0          1 
   arsenal      avoid background      based      basic     before     better     beware        bit 
         0          1          0          0          0          0          0          0          0 
     board       book     bother        bug    changed       chat      check       

It is in fact a named numeric vector, which should be quite useful for passing on to other functions and so on, which seems to be what you want to do:
str(as.matrix(corpus2a.dtm)[1,])
 Named num [1:273] 0 0 2 0 0 0 0 0 1 0 ...

If you just want a plain numeric vector, try as.numeric(as.matrix(corpus2a.dtm)[1,]). Is that the kind of thing you want to do?

Not exactly. Sorry if I wasn't clear. I have already done all of those steps. Suppose you train a model (e.g. svm) with the matrix you created, using a categorical variable spam/No_spam. Then a new email arrives and you want to use your model on it. The problem is that the new email is not part of your corpus: to predict spam / not spam you have to convert it into the shape of the original matrix and feed it to the model. That is where I run into trouble - the new document that needs to be classified. - Dr VComas
If I understand you correctly, you need to process the new email (as in the tm_map lines above), then append it to the DocumentTermMatrix, then convert the DTM into a matrix, and then run your model on it. You can add the new document to the existing DTM with c, or you can update an existing document in the DTM with Content(myCorpus[[10]]) <- "hey, I'm the new content of this document". Does that help? - Ben
I would rather not change the corpus; every time a new email arrived it would change the tf-idf numbers in the matrix, and of course I would then have to fit a new SVM every time. That is the problem. I want to take the new email, preprocess it, and build a 1-row vector with the same columns as the matrix, with the tf taken from the new document and the idf taken from the corpus, and use that to predict spam / not spam. I don't know whether there is a standard procedure or function for this, or whether I have to code it myself. - Dr VComas
I'm sorry, I'm not familiar with a standard procedure for that. But if you ask a new question and include some sample data, code, and your expected result (which is essential for attracting the attention of the more experienced people here), you may well get an answer from someone who knows more. - Ben
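
For completeness: as far as I know, tm does not ship a function for exactly the procedure described in the last comment (tf from the new document, idf from the training corpus), but it can be hand-rolled in a few lines. This is a rough, untested sketch; train.dtm stands for a plain term-frequency DTM of the training corpus, Corpus.svm for a model fitted on its tf-idf-weighted data.frame, and new_email for the incoming text - all three are placeholder names:

## idf per training term, following weightTfIdf()'s definition: log2(nDocs / docfreq)
train.m <- as.matrix(train.dtm)                       # raw term-frequency matrix
idf     <- log2(nrow(train.m) / colSums(train.m > 0))

## term frequencies of the new document, restricted to the training vocabulary
## (apply the same preprocessing as for the training corpus before this step)
new.dtm <- DocumentTermMatrix(Corpus(VectorSource(new_email)),
                              control = list(dictionary = Terms(train.dtm)))
new.tf  <- as.matrix(new.dtm)[1, Terms(train.dtm)]

## 1-row tf-idf data.frame with the same columns as the training matrix
new.row <- as.data.frame(t(new.tf * idf))
predict(Corpus.svm, new.row)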
