How can I convert RTF strings to plain text using R?


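I have a lot of RTF strings (Base64-encoded) and I would like to get the plain text using R. Is this possible? An example is below.

There are plenty of other languages that can do this, but it would be very useful if I could find the "R way" to do it.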

rtfString <- "e1xydGYxXGFuc2lcYW5zaWNwZzEyNTJcZGVmZjBcZGVmbGFuZzEwNDZcZGVmbGFuZ2ZlMTA0NlxkZWZ0YWI3MDl7XGZvbnR0Ymx7XGYwXGZzd2lzc1xmcHJxMlxmY2hhcnNldDAgQXJpYWw7fX0NClx2aWV3a2luZDRcdWMxXHBhcmRcc2w0ODBcc2xtdWx0MVxxalxmMFxmczI0XHRhYlxiIE8gU1IuIERVRElNQVIgUEFYSVVCQSBcYjAgKFBTREItUEEuIFNlbSByZXZpc1wnZTNvIGRvIG9yYWRvci4pIC0gU3IuIFByZXNpZGVudGUsIFNyYXMuIGUgU3JzLiBQYXJsYW1lbnRhcmVzLCBvY3VwbyBlc3RhIHRyaWJ1bmEgcGFyYSBwYXJhYmVuaXphciBhIHRvcmNpZGEgcGFyYWVuc2UuIE8gZnV0ZWJvbCBwYXJhZW5zZSBkZXUgdW0gXGkgc2hvd1xpMCAgZGUgY2l2aWxpZGFkZSBuZXN0ZSBmaW5hbCBkZSBzZW1hbmEsIGUgYSB0b3JjaWRhIGJpY29sb3IgZG8gUGFwXCdlM28gZGEgQ3VydXp1LCBvIFBheXNhbmR1LCBlc3RcJ2UxIGRlIHBhcmFiXCdlOW5zLCBwb2lzIHNhZ3JvdS1zZSBjYW1wZVwnZTNvIGRvIHByaW1laXJvIHR1cm5vIGVtIGNpbWEgZG8gc2V1IG1haW9yIHJpdmFsLCB2ZW5jZW5kbyBvIHZhbG9yb3NvIENsdWJlIGRvIFJlbW8uDQpccGFyIFx0YWIgRXN0XCdlM28gZGUgcGFyYWJcJ2U5bnMgbyBQYXlzYW5kdSwgbyBHb3Zlcm5vIGRvIEVzdGFkbywgcXVlIGRldSB1bSBcaSBzaG93IFxpMCBkZSBvcmdhbml6YVwnZTdcJ2UzbywgYSBKdXN0aVwnZTdhIHBhcmFlbnNlLCBhIHBvbFwnZWRjaWEsIG9zIFwnZjNyZ1wnZTNvcyBkZSBzZWd1cmFuXCdlN2EgZG8gRXN0YWRvLiBFbmZpbSwgbWFpcyB1bWEgdmV6LCBwYXJhYlwnZTlucyBcJ2UwIHRvcmNpZGEgYmljb2xvci4NClxwYXIgXHRhYiBQYXlzYW5kdSwgbXVpdGFzIGUgbXVpdGFzIGdsXCdmM3JpYXMgdm9jXCdlYSBhaW5kYSBkYXJcJ2UxIHBhcmEgZXNzYSBzdWEgYnJpbGhhbnRlIHRvcmNpZGEsIHF1ZSBcJ2U5IGEgdG9yY2lkYSBiaWNvbG9yIGRlIEJlbFwnZTltIGRvIFBhclwnZTEuDQpccGFyIFx0YWIgTXVpdG8gb2JyaWdhZG8sIFNyLiBQcmVzaWRlbnRlLg0KXHBhciANClxwYXIgXHBhcmRcc2EyMDBcc2wyNzZcc2xtdWx0MSANClxwYXIgDQpccGFyIFxwYXJkXHNsNDgwXHNsbXVsdDFccWogDQpccGFyIH0NCgA="

plainText <- rtf_to_text(rtfString)  # rtf_to_text() is a placeholder for the function I am looking for

# The result will be something similar to this:

plainText

[1] "Sr. Presidente, Sras. e Srs. Parlamentares, ocupo esta tribuna para parabenizar a torcida paraense. O futebol paraense deu um show de civilidade neste final de semana, e a torcida bicolor do Papão da Curuzu, o Paysandu, está de parabéns, pois sagrou-se campeão do primeiro turno em cima do seu maior rival, vencendo o valoroso Clube do Remo.\nEstão de parabéns o Paysandu, o Governo do Estado, que deu um show de organização, a Justiça paraense, a polícia, os órgãos de segurança do Estado. Enfim, mais uma vez, parabéns à torcida bicolor.\nPaysandu, muitas e muitas glórias você ainda dará para essa sua brilhante torcida, que é a torcida bicolor de Belém do Pará.\nMuito obrigado, Sr. Presidente."

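You could write the code in C++ and then deploy it from R with the Rcpp package. - lnNoam
Have you looked at the tm package and its PlainTextDocument function? There is also a plain-text function in the qdap package. - lawyeR
@lawyeR, those functions don't do the job. - Davi Moreira
@lnNoam, that is one possible approach, but I would really like to use only R. - Davi Moreira
1 Answer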

A combination of a few packages and regular expressions can do this:
library(RCurl)
library(stringr)
library(magrittr)

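decode_rtf <- function(txt) {

  txt %>%
    # Decode the Base64 payload into the RTF source text
    base64Decode %>%
    # Map the \'hh hex escapes that occur in this text to their characters
    str_replace_all("\\\\'e3", "ã") %>%
    str_replace_all("\\\\'e1", "á") %>%
    str_replace_all("\\\\'e9", "é") %>%
    str_replace_all("\\\\'e7", "ç") %>%
    str_replace_all("\\\\'ed", "í") %>%
    str_replace_all("\\\\'f3", "ó") %>%
    str_replace_all("\\\\'ea", "ê") %>%
    str_replace_all("\\\\'e0", "à") %>%
    # Drop remaining control words, line breaks, a leading { and a trailing }
    str_replace_all("(\\\\[[:alnum:]']+|[\\r\\n]|^\\{|\\}$)", "") %>%
    # Drop what is left of the font table, e.g. {{ Arial;}}
    str_replace_all("\\{\\{[[:alnum:]; ]+\\}\\}", "") %>%
    str_trim

}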

rtfString <- "e1xydGYxXGFuc2lcYW5zaWNwZzEyNTJcZGVmZjBcZGVmbGFuZzEwNDZcZGVmbGFuZ2ZlMTA0NlxkZWZ0YWI3MDl7XGZvbnR0Ymx7XGYwXGZzd2lzc1xmcHJxMlxmY2hhcnNldDAgQXJpYWw7fX0NClx2aWV3a2luZDRcdWMxXHBhcmRcc2w0ODBcc2xtdWx0MVxxalxmMFxmczI0XHRhYlxiIE8gU1IuIERVRElNQVIgUEFYSVVCQSBcYjAgKFBTREItUEEuIFNlbSByZXZpc1wnZTNvIGRvIG9yYWRvci4pIC0gU3IuIFByZXNpZGVudGUsIFNyYXMuIGUgU3JzLiBQYXJsYW1lbnRhcmVzLCBvY3VwbyBlc3RhIHRyaWJ1bmEgcGFyYSBwYXJhYmVuaXphciBhIHRvcmNpZGEgcGFyYWVuc2UuIE8gZnV0ZWJvbCBwYXJhZW5zZSBkZXUgdW0gXGkgc2hvd1xpMCAgZGUgY2l2aWxpZGFkZSBuZXN0ZSBmaW5hbCBkZSBzZW1hbmEsIGUgYSB0b3JjaWRhIGJpY29sb3IgZG8gUGFwXCdlM28gZGEgQ3VydXp1LCBvIFBheXNhbmR1LCBlc3RcJ2UxIGRlIHBhcmFiXCdlOW5zLCBwb2lzIHNhZ3JvdS1zZSBjYW1wZVwnZTNvIGRvIHByaW1laXJvIHR1cm5vIGVtIGNpbWEgZG8gc2V1IG1haW9yIHJpdmFsLCB2ZW5jZW5kbyBvIHZhbG9yb3NvIENsdWJlIGRvIFJlbW8uDQpccGFyIFx0YWIgRXN0XCdlM28gZGUgcGFyYWJcJ2U5bnMgbyBQYXlzYW5kdSwgbyBHb3Zlcm5vIGRvIEVzdGFkbywgcXVlIGRldSB1bSBcaSBzaG93IFxpMCBkZSBvcmdhbml6YVwnZTdcJ2UzbywgYSBKdXN0aVwnZTdhIHBhcmFlbnNlLCBhIHBvbFwnZWRjaWEsIG9zIFwnZjNyZ1wnZTNvcyBkZSBzZWd1cmFuXCdlN2EgZG8gRXN0YWRvLiBFbmZpbSwgbWFpcyB1bWEgdmV6LCBwYXJhYlwnZTlucyBcJ2UwIHRvcmNpZGEgYmljb2xvci4NClxwYXIgXHRhYiBQYXlzYW5kdSwgbXVpdGFzIGUgbXVpdGFzIGdsXCdmM3JpYXMgdm9jXCdlYSBhaW5kYSBkYXJcJ2UxIHBhcmEgZXNzYSBzdWEgYnJpbGhhbnRlIHRvcmNpZGEsIHF1ZSBcJ2U5IGEgdG9yY2lkYSBiaWNvbG9yIGRlIEJlbFwnZTltIGRvIFBhclwnZTEuDQpccGFyIFx0YWIgTXVpdG8gb2JyaWdhZG8sIFNyLiBQcmVzaWRlbnRlLg0KXHBhciANClxwYXIgXHBhcmRcc2EyMDBcc2wyNzZcc2xtdWx0MSANClxwYXIgDQpccGFyIFxwYXJkXHNsNDgwXHNsbXVsdDFccWogDQpccGFyIH0NCgA="

decode_rtf(rtfString)

## [1] "O SR. DUDIMAR PAXIUBA  (PSDB-PA. Sem revisão do orador.) - Sr. Presidente, Sras. e Srs. Parlamentares, ocupo esta tribuna para parabenizar a torcida paraense. O futebol paraense deu um  show  de civilidade neste final de semana, e a torcida bicolor do Papão da Curuzu, o Paysandu, está de parabéns, pois sagrou-se campeão do primeiro turno em cima do seu maior rival, vencendo o valoroso Clube do Remo.  Estão de parabéns o Paysandu, o Governo do Estado, que deu um  show  de organização, a Justiça paraense, a polícia, os órgãos de segurança do Estado. Enfim, mais uma vez, parabéns à torcida bicolor.  Paysandu, muitas e muitas glórias você ainda dará para essa sua brilhante torcida, que é a torcida bicolor de Belém do Pará.  Muito obrigado, Sr. Presidente."
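
The output above joins paragraphs with spaces, whereas the question asked for "\n" between them. Here is a minimal sketch of one way to keep the breaks, assuming it is applied to the decoded RTF text before the control words are stripped. `par_to_newline()` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: turn RTF paragraph marks (\par) into "\n".
par_to_newline <- function(rtf) {
  # Raw CR/LF characters in RTF source are not paragraph breaks, so drop them.
  rtf <- gsub("[\r\n]", "", rtf)
  # Replace \par (plus one optional delimiter space) with a newline.
  # The lookahead keeps longer control words such as \pard untouched.
  gsub("\\\\par(?![a-zA-Z])[ ]?", "\n", rtf, perl = TRUE)
}

par_to_newline("one.\r\n\\par two.\\pard")
```

For the breaks to survive, the later cleanup step would have to stop deleting "\n" (the `[\\r\\n]` alternative in the pattern above removes it). That would be a change to the code shown, not something it already does.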

I'm sure this will break in some edge cases, but it's definitely a good start.
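
One such edge case is any accented character outside the eight replaced explicitly. The example RTF starts with `\ansicpg1252`, which indicates Windows code page 1252, so each `\'hh` escape can in principle be decoded generically. Here is a minimal base R sketch under that assumption. `decode_hex_escapes()` is a hypothetical helper name, and the sketch does not handle `\'00`, bytes undefined in the code page, or documents in other code pages.

```r
# Hypothetical helper: decode every RTF hex escape (\'hh) in one pass.
# Assumes the bytes are in code page 1252, as \ansicpg1252 suggests.
decode_hex_escapes <- function(rtf, codepage = "CP1252") {
  # Locate every \'hh escape (backslash, apostrophe, two hex digits).
  m <- gregexpr("\\\\'[0-9a-fA-F]{2}", rtf)
  regmatches(rtf, m) <- lapply(regmatches(rtf, m), function(esc) {
    # Characters 3-4 of each match are the hex digits; parse them as base 16.
    codes <- strtoi(substring(esc, 3), 16L)
    # Turn each byte into a one-character string and convert it to UTF-8.
    vapply(codes, function(code) {
      iconv(rawToChar(as.raw(code)), from = codepage, to = "UTF-8")
    }, character(1))
  })
  rtf
}

decode_hex_escapes("Pap\\'e3o da Curuzu")
```

Applied to the decoded text before control words are stripped, this could stand in for the chain of per-character replacements.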


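@hrbrmstr, awesome! I edited your answer to get the correct result. Of course, this could be done more elegantly. - Davi Moreira
Thanks. librtf doesn't look too hard to use on *nix, Windows, and OS X, so I'll try to put out an Rcpp version next week. - hrbrmstr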
