R code to split a corpus into sentences

  1. I have a large number of PDF documents that I read into a corpus with the tm library. How can I split the corpus into sentences?

  2. It can be done by reading the files as text (readLines), building a data frame, and splitting that with the sentSplit function from the qdap package [*] (a minimal sketch follows below). But that means giving up the corpus and reading all the files one by one.

  3. How can I use the sentSplit {qdap} function on a corpus in tm? Or is there a better way?

Note: openNLP used to have the function sentDetect, which is now Maxent_Sent_Token_Annotator - the same question applies: how do I combine it with a corpus [tm]?
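
For reference, a minimal sketch of the qdap route from point 2 (the file name here is hypothetical):

library(qdap)

txt <- paste(readLines("some_document.txt"), collapse = " ")  # one file at a time
dat <- data.frame(person = "doc1", text = txt, stringsAsFactors = FALSE)
sentSplit(dat, "text")   # returns a data frame with one row per sentence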

7 Answers


I don't know how to reshape a corpus, but that would be a fantastic piece of functionality to have.

I guess my approach would be something like this:

Use these packages:

# Load Packages
require(tm)
require(NLP)
require(openNLP)

I would set up my text-to-sentences function as follows:
convert_text_to_sentences <- function(text, lang = "en") {
  # Compute sentence annotations using the Apache OpenNLP Maxent sentence detector (default model for the given language)
  sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = lang)

  # Convert text to class String from package NLP
  text <- as.String(text)

  # Sentence boundaries in text
  sentence.boundaries <- annotate(text, sentence_token_annotator)

  # Extract sentences
  sentences <- text[sentence.boundaries]

  # return sentences
  return(sentences)
}

My reshape-corpus function is a bit of a hack (note: you will lose the meta attributes unless you modify this function to copy them across appropriately - a hedged sketch of one way to do that follows after the usage example below):

reshape_corpus <- function(current.corpus, FUN, ...) {
  # Extract the text from each document in the corpus and put it into a list
  # ("Content" is the accessor in tm 0.5.x; in current tm versions use content() / NLP::content)
  text <- lapply(current.corpus, Content)

  # Basically convert the text
  docs <- lapply(text, FUN, ...)
  docs <- as.vector(unlist(docs))

  # Create a new corpus structure and return it
  new.corpus <- Corpus(VectorSource(docs))
  return(new.corpus)
}

Which works as follows:

## create a corpus
dat <- data.frame(doc1 = "Doctor Who is a British science fiction television programme produced by the BBC. The programme depicts the adventures of a Time Lord—a time travelling, humanoid alien known as the Doctor. He explores the universe in his TARDIS (acronym: Time and Relative Dimension in Space), a sentient time-travelling space ship. Its exterior appears as a blue British police box, a common sight in Britain in 1963, when the series first aired. Along with a succession of companions, the Doctor faces a variety of foes while working to save civilisations, help ordinary people, and right wrongs.",
                  doc2 = "The show has received recognition from critics and the public as one of the finest British television programmes, winning the 2006 British Academy Television Award for Best Drama Series and five consecutive (2005–10) awards at the National Television Awards during Russell T Davies's tenure as Executive Producer.[3][4] In 2011, Matt Smith became the first Doctor to be nominated for a BAFTA Television Award for Best Actor. In 2013, the Peabody Awards honoured Doctor Who with an Institutional Peabody \"for evolving with technology and the times like nothing else in the known television universe.\"[5]",
                  doc3 = "The programme is listed in Guinness World Records as the longest-running science fiction television show in the world[6] and as the \"most successful\" science fiction series of all time—based on its over-all broadcast ratings, DVD and book sales, and iTunes traffic.[7] During its original run, it was recognised for its imaginative stories, creative low-budget special effects, and pioneering use of electronic music (originally produced by the BBC Radiophonic Workshop).",
                  stringsAsFactors = FALSE)

current.corpus <- Corpus(VectorSource(dat))
# A corpus with 3 text documents

## reshape the corpus into sentences (modify this function if you want to keep meta data)
reshape_corpus(current.corpus, convert_text_to_sentences)
# A corpus with 10 text documents
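
Regarding the lost meta attributes: below is a hedged sketch of one way to at least carry the parent document id across to each sentence. It assumes a current tm API in which content() and meta() are the document accessors (not tested against tm 0.5.x, which used Content()):

reshape_corpus_with_ids <- function(current.corpus, FUN, ...) {
  # Split each document into sentences, remembering which document each came from
  pieces <- lapply(seq_along(current.corpus), function(i) {
    sents <- FUN(content(current.corpus[[i]]), ...)
    data.frame(text = as.character(sents),
               parent_id = meta(current.corpus[[i]], "id"),
               stringsAsFactors = FALSE)
  })
  df <- do.call(rbind, pieces)

  # Rebuild the corpus and re-attach the parent id to each sentence-level document
  new.corpus <- Corpus(VectorSource(df$text))
  for (i in seq_along(new.corpus)) {
    meta(new.corpus[[i]], "parent_id") <- df$parent_id[i]
  }
  new.corpus
}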

My sessionInfo output:

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
  [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
  [1] NLP_0.1-0     openNLP_0.2-1 tm_0.5-9.1   

loaded via a namespace (and not attached):
  [1] openNLPdata_1.5.3-1 parallel_3.0.1      rJava_0.9-4         slam_0.1-29         tools_3.0.1  

I adapted your first code block into a separate function. However, I get an error: cannot coerce class "c("Simple_Sent_Token_Annotator", "Annotator")" to a data.frame. See my gist: https://gist.github.com/simkimsia/9ace6002cc758d5a303a - Kim Stacks
@KimStacks I found the problem. It was because ggplot2 and openNLP both have their own annotate method, and I loaded ggplot2 after openNLP, so the annotate object was masked by ggplot2. Try loading openNLP after ggplot2 and it works fine. - Logan Yang
Hi! What is "Content" in the reshape_corpus function? Great solution :) - woodstock
@woodstock Thanks, I'd forgotten about this function. "Content" was a function in the "tm" package that basically extracted the text from a document within a corpus. I think in the newest version of the package it's called "content_transformer", and you can find an example of it in the tm package by doing ?tm_map and ?content_transformer. - Tony Breyal
@TonyBreyal Thank you so much! I may well need this function in the future! :) - woodstock


openNLP has had some major changes. The bad news is it looks very different from how it used to. The good news is it's more flexible, and the functionality you enjoyed before is still there; you just have to find it.

This will get you what you're after:

?Maxent_Sent_Token_Annotator

Just work through the example and you'll see the functionality you're looking for.
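
For the record, here is a condensed sketch of the pattern that help page walks through (openNLP >= 0.2; note that the NLP package must also be attached, since as.String() and annotate() live there):

library(NLP)
library(openNLP)

s <- as.String("This is sentence one. And here is sentence two.")
sent_token_annotator <- Maxent_Sent_Token_Annotator()
a <- annotate(s, sent_token_annotator)   # sentence boundary annotations
s[a]                                     # index the String to get the sentences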


Hi Tyler, I tried that, but I get the following error: > sent_token_annotator <- Maxent_Sent_Token_Annotator() Error: could not find function "Maxent_Sent_Token_Annotator". I have the openNLP and NLP libraries loaded. Also, how can this be applied to a corpus? For a data frame we have the super simple sentDetect {qdap}. - Henk
Updated openNLP to "0.2-1" and NLP to "0.1-0". I copied the example text straight from the documentation, but I still get the error "> sent_token_annotator <- Maxent_Sent_Token_Annotator() Error: could not find function "Maxent_Sent_Token_Annotator"". - Henk
Can you provide sessionInfo()? - Tyler Rinker
The info is too long - where can I send it to you, Tyler? - Henk
You can create your own function and apply it just as you did with sentDetect before. I did this with tagPOS here (see the second function in the file). I basically took the example and remade it into a function. - Tyler Rinker

Here is a function built on this Python solution, which offers flexibility in that the lists of prefixes, suffixes, etc. can be modified to suit your particular text. It's definitely not perfect, but it could be useful with the right text.
caps = "([A-Z])"
prefixes = "(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt)\\."
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
starters = "(Mr|Mrs|Ms|Dr|He\\s|She\\s|It\\s|They\\s|Their\\s|Our\\s|We\\s|But\\s|However\\s|That\\s|This\\s|Wherever)"
websites = "\\.(com|edu|gov|io|me|net|org)"
digits = "([0-9])"

split_into_sentences <- function(text){
  text = gsub("\n|\r\n"," ", text)
  text = gsub(prefixes, "\\1<prd>", text)
  text = gsub(websites, "<prd>\\1", text)
  text = gsub('www\\.', "www<prd>", text)
  text = gsub("Ph.D.","Ph<prd>D<prd>", text)
  text = gsub(paste0("\\s", caps, "\\. "), " \\1<prd> ", text)
  text = gsub(paste0(acronyms, " ", starters), "\\1<stop> \\2", text, perl = TRUE)  # (?:...) in 'acronyms' needs perl = TRUE
  text = gsub(paste0(caps, "\\.", caps, "\\.", caps, "\\."), "\\1<prd>\\2<prd>\\3<prd>", text)
  text = gsub(paste0(caps, "\\.", caps, "\\."), "\\1<prd>\\2<prd>", text)
  text = gsub(paste0(" ", suffixes, "\\. ", starters), " \\1<stop> \\2", text)
  text = gsub(paste0(" ", suffixes, "\\."), " \\1<prd>", text)
  text = gsub(paste0(" ", caps, "\\."), " \\1<prd>",text)
  text = gsub(paste0(digits, "\\.", digits), "\\1<prd>\\2", text)
  text = gsub("...", "<prd><prd><prd>", text, fixed = TRUE)
  text = gsub('\\.”', '”.', text)
  text = gsub('\\."', '\".', text)
  text = gsub('\\!"', '"!', text)
  text = gsub('\\?"', '"?', text)
  text = gsub('\\.', '.<stop>', text)
  text = gsub('\\?', '?<stop>', text)
  text = gsub('\\!', '!<stop>', text)
  text = gsub('<prd>', '.', text)
  sentence = strsplit(text, "<stop>\\s*")
  return(sentence)
}

test_text <- 'Dr. John Johnson, Ph.D. worked for X.Y.Z. Inc. for 4.5 years. He earned $2.5 million when it sold! Now he works at www.website.com.'
sentences <- split_into_sentences(test_text)
names(sentences) <- 'sentence'
df_sentences <- dplyr::bind_rows(sentences) 

df_sentences
# A tibble: 3 x 1
sentence                                                     
<chr>                                                        
1 Dr. John Johnson, Ph.D. worked for X.Y.Z. Inc. for 4.5 years.
2 He earned $2.5 million when it sold!                         
3 Now he works at www.website.com.  

Just convert the corpus into a data frame and use regular expressions to detect the sentences. Here is a function that uses regex to detect the sentences in a paragraph and returns each individual sentence.
chunk_into_sentences <- function(text) {
  break_points <- c(1, as.numeric(gregexpr('[[:alnum:] ][.!?]', text)[[1]]) + 1)
  sentences <- NULL
  for (i in 1:length(break_points)) {
    res <- substr(text, break_points[i], break_points[i + 1])
    if (i > 1) { sentences[i] <- sub('. ', '', res) } else { sentences[i] <- res }
  }
  sentences <- sentences[!is.na(sentences)]  # drop the trailing NA chunk
  return(sentences)
}

...using a paragraph from inside a corpus created with the tm package:

library(tm)  # for VCorpus/VectorSource below

text <- paste('Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.')
mycorpus <- VCorpus(VectorSource(text))
corpus_frame <- data.frame(text=unlist(sapply(mycorpus, `[`, "content")), stringsAsFactors=F)

Use as follows:

chunk_into_sentences(corpus_frame)

Which gives us:

[1] "Lorem Ipsum is simply dummy text of the printing and typesetting industry."                                                                                                                                     
[2] "Lorem Ipsum has been the industry standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book."                                       
[3] "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged."                                                                                       
[4] "It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."

Now with a larger corpus:
text1 <- "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."
text2 <- "It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like)."
text3 <- "There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc."
text_list <- list(text1, text2, text3)
my_big_corpus <- VCorpus(VectorSource(text_list))

Use as follows:

lapply(my_big_corpus, chunk_into_sentences)

"这给了我们:"
$`1`
[1] "Lorem Ipsum is simply dummy text of the printing and typesetting industry."                                                                                                                                     
[2] "Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book."                                      
[3] "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged."                                                                                       
[4] "It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."

$`2`
[1] "It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout."                                                             
[2] "The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English."     
[3] "Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy."

$`3`
[1] "There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable."
[2] "If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text."                                                                     
[3] "All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet."                                                       
[4] "It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable."                                                       
[5] "The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc." 


Using qdap version 1.1.0 you can accomplish this with the following, where tm_corpus2df converts the corpus to a data frame, sentSplit splits the text column into sentences, and df2tm_corpus turns the result back into a tm corpus (I used @Tony Breyal's current.corpus dataset):

library(qdap)
with(sentSplit(tm_corpus2df(current.corpus), "text"), df2tm_corpus(tot, text))

You could also do:

tm_map(current.corpus, sent_detect)


## inspect(tm_map(current.corpus, sent_detect))

## A corpus with 3 text documents
## 
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
##   create_date creator 
## Available variables in the data frame are:
##   MetaID 
## 
## $doc1
## [1] Doctor Who is a British science fiction television programme produced by the BBC.                                                                     
## [2] The programme depicts the adventures of a Time Lord—a time travelling, humanoid alien known as the Doctor.                                            
## [3] He explores the universe in his TARDIS, a sentient time-travelling space ship.                                                                        
## [4] Its exterior appears as a blue British police box, a common sight in Britain in 1963, when the series first aired.                                    
## [5] Along with a succession of companions, the Doctor faces a variety of foes while working to save civilisations, help ordinary people, and right wrongs.
## 
## $doc2
## [1] The show has received recognition from critics and the public as one of the finest British television programmes, winning the 2006 British Academy Television Award for Best Drama Series and five consecutive awards at the National Television Awards during Russell T Davies's tenure as Executive Producer.
## [2] In 2011, Matt Smith became the first Doctor to be nominated for a BAFTA Television Award for Best Actor.                                                                                                                                                                                                       
## [3] In 2013, the Peabody Awards honoured Doctor Who with an Institutional Peabody for evolving with technology and the times like nothing else in the known television universe.                                                                                                                                   
## 
## $doc3
## [1] The programme is listed in Guinness World Records as the longest-running science fiction television show in the world and as the most successful science fiction series of all time—based on its over-all broadcast ratings, DVD and book sales, and iTunes traffic.
## [2] During its original run, it was recognised for its imaginative stor

Unfortunately the sent_detect method picks up periods between digits as sentence separators, whereas openNLP's Maxent_Sent_Token_Annotator identifies these and carries them through as commas before running the sentence detector, leading to more robust sentence detection. - christopherlovell
The dev version (v. 2.2.1) of qdap @ GitHub contains sent_detect_nlp to allow for flexibility, as it uses the method from the NLP package. This allows tm_map(current.corpus, sent_detect_nlp). See commit: https://github.com/trinker/qdap/commit/d54fbd95f0caee68a3845353a06fa324d7fdf42e - Tyler Rinker

I implemented the following code to solve the same problem using the tokenizers package.
# Iterate a list or vector of strings and split into sentences where there are
# periods or question marks
sentences = purrr::map(.x = textList, function(x) {
  return(tokenizers::tokenize_sentences(x))
})

# The code above will return a list of character vectors so unlist
# to give you a character vector of all the sentences
sentences = unlist(sentences)

# Create a corpus from the sentences (VCorpus/VectorSource are from the tm package)
library(tm)
corpus = VCorpus(VectorSource(sentences))
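
Side note: tokenize_sentences() is itself vectorised over a character vector, so if textList is a plain character vector rather than a nested list, the purrr::map() step can probably be dropped - a hedged one-liner:

# Equivalent shortcut, assuming textList is a character vector
corpus = VCorpus(VectorSource(unlist(tokenizers::tokenize_sentences(textList))))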

The error means it is related to the ggplot2 package: both it and openNLP provide an annotate function, and ggplot2's version masks the one this code needs. Detach the ggplot2 package and then try again. Hope this works.
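
If it is the masking problem described in the comments above, something along these lines should work (s and sent_token_annotator as in the openNLP examples earlier on this page):

# Detach ggplot2 so that NLP's annotate() is no longer masked
detach("package:ggplot2", unload = TRUE)

# Or sidestep load order entirely by qualifying the generic with its namespace
a <- NLP::annotate(s, sent_token_annotator)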
