How to remove a trailing ellipsis from a string in R

Question

I have a list of words, obtained with the code below.

tags_vector <- unlist(tags_used)

Some of the strings in this list end with an ellipsis, which I want to remove. Here I print the fifth element of the list and its class.

tags_vector[5]
#[1] "#b…"

class(tags_vector[5])
#[1] "character"

I am trying to remove the ellipsis from the fifth element with gsub, using the following code:

gsub("[…]", "", tags_vector[5])
#[1] "#b…"

This does not work: the output is still "#b…". However, when I put the value of the fifth element directly into the same call, it works as expected:

gsub("[…]", "", "#b…")
#[1] "#b"

I even tried assigning the value of tags_vector[5] to a variable x1 and using that in the gsub() call, but it still does not work.

- Gautam Kumar

Can you provide tags_vector? It works for me with a simple x <- "#b...", so I guess the problem lies in your vector. - LAP

It seems to work, see https://ideone.com/61hWht. By the way, since the ellipsis is not an ASCII character, you could try stringr's `str_replace_all`: `library(stringr)` -> str_replace_all(tags_vector[5], "…", "") - Wiktor Stribiżew

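This does indeed look like a Unicode issue. Perhaps printing tags_vector[5] already changes the character (for example, there are two different Unicode code points for an ellipsis: [2026](http://www.fileformat.info/info/unicode/char/2026/index.htm) and [22EF](http://www.fileformat.info/info/unicode/char/22EF/index.htm)). That would also explain why the direct gsub works. Could you try gsub(gsub("[#b]","",tags_vector[5]), "", tags_vector[5])? - takje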

Since this may be an encoding problem, could you also show the output of sessionInfo()? - takje

1 Answer

- takje · Accepted Answer

This is probably a Unicode issue. In R(Studio), not all characters are created equal.

I tried to create a reproducible example:

# create the ellipsis from the definition (similar to your tags_used)
> ell_def <- rawToChar(as.raw(c('0xE2','0x80','0xA6'))) # from the unicode definition here: http://www.fileformat.info/info/unicode/char/2026/index.htm
> Encoding(ell_def) <- 'UTF-8'
> ell_def
[1] "…"
> Encoding(ell_def)
[1] "UTF-8"

# create the ellipsis from text (similar to your string)
> ell_text <- '…'
> ell_text
[1] "…"
> Encoding(ell_text)
[1] "latin1"

# show that you can get strange results
> gsub(ell_text,'',ell_def)
[1] "…"

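Whether this example reproduces may depend on your locale. In my case I use windows-1252, since it is not possible to set the locale to UTF-8 on Windows. According to this stringi source, "R lets strings in ASCII, UTF-8, and your platform's native encoding coexist peacefully". As the example above shows, this can sometimes lead to contradictory results.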
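To see which situation a given session is in, the locale and the declared encoding of each string can be inspected with base R. This is only a small sketch; the variable names `ctype` and `info` are chosen here for illustration, and the values printed will differ between machines.

```r
# Report the character-type locale of the current session.
ctype <- Sys.getlocale("LC_CTYPE")
info <- l10n_info()   # list with elements such as MBCS, `UTF-8` and `Latin-1`
ctype
info[["UTF-8"]]       # TRUE when the session runs in a UTF-8 locale

# Encoding() shows the declared mark of each string.
# Pure ASCII strings are reported as "unknown".
marks <- Encoding(c("abc", "\u2026"))
marks
```

Running `Encoding(tags_vector)` in the same way would show whether the elements carry a UTF-8 mark, a latin1 mark, or none at all.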

Basically, the output you see looks the same but differs at the byte level.
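One way to check this is to look at the raw bytes rather than the printed glyph. The sketch below uses the `\u2026` escape as a stand-in for the ellipsis; the name `ell_utf8` is an assumption made for this illustration, and the same calls could be applied to `tags_vector[5]` to compare.

```r
# U+2026 HORIZONTAL ELLIPSIS written as an escape, so the source file's
# own encoding does not matter.
ell_utf8 <- "\u2026"

bytes <- charToRaw(ell_utf8)      # the three UTF-8 bytes e2 80 a6
code_point <- utf8ToInt(ell_utf8) # 8230, i.e. 0x2026

bytes
code_point
```

If `charToRaw(tags_vector[5])` ends in different bytes than `charToRaw("…")` typed at the console, the two strings are not the same to gsub, even though they print identically.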

If I run this example in the R terminal, I get similar results, but apparently it displays the ellipsis as a dot: ".".

A quick fix for your example is to use the definition of the ellipsis in gsub. For example:

gsub(ell_def,'',tags_vector[5])
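A variation on the same idea, shown here as a sketch rather than as part of the original answer, is to write the pattern with a Unicode escape and to convert the input to UTF-8 first. The vector `tags_demo` is made up for illustration, since the real `tags_vector` is not shown in the question. Whether `enc2utf8()` helps depends on the strings carrying a correct encoding mark, which is an assumption.

```r
# Hypothetical stand-in for tags_vector; some elements end in U+2026.
tags_demo <- c("#b\u2026", "#rstats", "#data\u2026")

# Anchor the ellipsis to the end of the string so that only a trailing
# one is removed; enc2utf8() converts marked strings to UTF-8 beforehand.
cleaned <- sub("\u2026$", "", enc2utf8(tags_demo))
cleaned
```

Using the escape avoids depending on how the editor or console encodes a literal "…" in the pattern.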