Converting byte encodings to Unicode

Question

3

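Feel free to edit the title if the wording isn't appropriate.

I want to convert a string in which Unicode characters have been replaced by "byte" substitutes back to Unicode. Say I have: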

x <- "bi<df>chen Z<fc>rcher hello world <c6>"

I'd like to get back:

"bißchen Zürcher hello world Æ"

I know that if I could get it into this form, it would print to the console as expected:

"bi\xdfchen Z\xfcrcher \xc6"

I tried:

gsub("<([[a-z0-9]+)>", "\\x\\1", x)
## [1] "bixdfchen Zxfcrcher xc6"

- Tyler Rinker

2 Answers

1

You could also use the gsubfn package.

library(gsubfn)
# Turn a two-digit hex string such as "df" into the single byte it names.
f <- function(x) rawToChar(as.raw(as.integer(paste0("0x", x))), multiple = TRUE)
gsubfn("<([0-9a-f]{2})>", f, "bi<df>chen Z<fc>rcher hello world <c6>")
## [1] "bißchen Zürcher hello world Æ"

- hwnd

- MrFlick · Accepted Answer

How about this:

x <- "bi<df>chen Z<fc>rcher hello world <c6>"

m <- gregexpr("<[0-9a-f]{2}>", x)
codes <- regmatches(x, m)
chars <- lapply(codes, function(x) {
    rawToChar(as.raw(strtoi(paste0("0x", substr(x,2,3)))), multiple = TRUE)
})

regmatches(x, m) <- chars

x
# [1] "bi\xdfchen Z\xfcrcher hello world \xc6"

Encoding(x) <- "latin1"
x
# [1] "bißchen Zürcher hello world Æ"

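Note that you can't create an escape character by pasting "\x" in front of a number. In fact, there is no "\x" in the string at all. That is just how R chooses to represent the byte on screen. Here we use rawToChar() to turn each number into the character we want.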
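To make that distinction concrete, here is a minimal sketch comparing an escape that R parses into one byte with the four ordinary characters you get from pasting. The variable names `escaped` and `pasted` are made up for illustration and are not part of the answer above.

```r
# An escape inside a string literal is parsed into a single byte.
escaped <- "\xdf"

# Pasting a backslash and "x" in front of the digits gives four
# ordinary characters: "\", "x", "d", "f".
pasted <- paste0("\\x", "df")

length(charToRaw(escaped))     # 1
nchar(pasted, type = "bytes")  # 4

# The route used in the answer: go from the number to the byte directly.
from_raw <- rawToChar(as.raw(strtoi("df", base = 16L)))
identical(charToRaw(from_raw), charToRaw(escaped))  # TRUE
```

This is also why the gsub() attempt in the question prints "xdf": the replacement only inserts literal characters and never produces the byte itself.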

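I tested this on a Mac, which is why I had to set the encoding to "latin1" to see the correct symbols in the console. A single byte such as 0xdf is not valid UTF-8 on its own.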
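If you need real UTF-8 rather than Latin-1 bytes with an encoding mark, one option is to re-encode with iconv(). This is a sketch that assumes the placeholders really are Latin-1 bytes, as they appear to be in this example. The variable name `utf8` is only illustrative.

```r
# Bytes as produced above: single Latin-1 bytes.
x <- "bi\xdfchen Z\xfcrcher hello world \xc6"

# Re-encode from Latin-1 to UTF-8; each of the three non-ASCII
# characters becomes a two-byte sequence.
utf8 <- iconv(x, from = "latin1", to = "UTF-8")

Encoding(utf8)         # "UTF-8"
charToRaw(utf8)[3:4]   # c3 9f, the two bytes that encode the sharp s
```

If the bytes came from a different single-byte code page, the `from` argument would have to change accordingly.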