In Rust, is it possible to decode bytes as UTF-8 and convert errors to escape sequences?

Question

20

In Rust, you can get UTF-8 from bytes like this:

if let Ok(s) = str::from_utf8(some_u8_slice) {
    println!("example {}", s);
}

This either succeeds or fails, but Python can handle the errors instead, for example:

s = some_bytes.decode(encoding='utf-8', errors='surrogateescape')

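In this example, the surrogateescape argument converts invalid UTF-8 sequences into escape codes, so instead of ignoring or replacing text that can't be decoded, it replaces them with a byte literal expression that is valid UTF-8. See the Python docs for details.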

Does Rust have a way to get a UTF-8 string from bytes that escapes errors instead of failing entirely?

- ideasman42

2 Answers

2

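You can do one of two things:

Construct the string with strict UTF-8 decoding, which returns an error indicating where decoding failed; you can then escape from that position. This is inefficient, though, because every failed attempt has to be decoded twice.
Try third-party crates, which provide more customizable charset decoders.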
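
As a rough sketch of the first option, the error returned by strict decoding carries enough information to locate the bad bytes. The helper name `describe_error` is made up for illustration; `valid_up_to` and `error_len` are methods on `std::str::Utf8Error`.

```rust
use std::str;

/// Hypothetical helper: report where strict decoding failed.
/// Returns (length of the valid prefix, length of the invalid sequence if known).
fn describe_error(bytes: &[u8]) -> Option<(usize, Option<usize>)> {
    match str::from_utf8(bytes) {
        Ok(_) => None,
        Err(e) => Some((e.valid_up_to(), e.error_len())),
    }
}

fn main() {
    // 0xFF can never appear in UTF-8, so decoding stops at index 2.
    println!("{:?}", describe_error(&[104, 101, 0xFF, 108, 111]));
    // A truncated multi-byte sequence at the end reports no error length.
    println!("{:?}", describe_error(&[104, 0xE2, 0x82]));
}
```

Everything before `valid_up_to()` is known to be valid, which is what makes it possible to escape only the offending bytes and carry on.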

- the8472

"decode twice for each failed attempt": can you elaborate on that? I don't see the double decoding attempt. - Shepmaster

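Re: "This is inefficient, though, because every failed attempt has to be decoded twice." It seems there should be a better way to do this with a small function, similar to this answer but supporting valid utf8: https://dev59.com/SFgR5IYBdhLWcg3wV8He#41450295 - ideasman42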

@Shepmaster, do you think a single pass is possible in the presence of errors? - the8472

@ideasman42, the second option I suggested is the better choice. - the8472

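Starting from the beginning, you parse the data until you hit an error, skip that error / add whatever marker you want, then continue parsing after the error. You read each byte only once, in a single pass over all the data. That's why I'm asking what I'm missing. - Shepmaster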

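To find the error position you need to call from_utf8, and then you need to call it again on the prefix to get the valid partial result. So the input is processed twice. Nothing in the stdlib processes u8 incrementally (essentially a charset decoder), which seems to be what the OP is looking for. - the8472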

- Shepmaster · Accepted Answer

This can be done with String::from_utf8_lossy, which replaces each invalid sequence with U+FFFD (the replacement character):

fn main() {
    let text = [104, 101, 0xFF, 108, 111];
    let s = String::from_utf8_lossy(&text);
    println!("{}", s); // he�lo
}
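
One detail worth knowing about `String::from_utf8_lossy` is that it returns a `Cow<str>`, so valid input is borrowed rather than copied. The following is a small sketch of that behavior, not part of the original answer:

```rust
use std::borrow::Cow;

fn main() {
    // Valid input: no allocation, the result borrows the original bytes.
    let valid = String::from_utf8_lossy(b"hello");
    assert!(matches!(valid, Cow::Borrowed("hello")));

    // Invalid input: an owned String with U+FFFD in place of the bad byte.
    let lossy = String::from_utf8_lossy(&[104, 101, 0xFF, 108, 111]);
    assert!(matches!(lossy, Cow::Owned(_)));
    assert_eq!(lossy, "he\u{FFFD}lo");
}
```

Because the bad byte is replaced rather than escaped, its original value cannot be recovered from the result.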

If you need more control, you can use std::str::from_utf8, as the other answer suggests. However, there is no need to validate the bytes twice as it implies.

A quick example:

use std::str;

fn example(mut bytes: &[u8]) -> String {
    let mut output = String::new();

    loop {
        match str::from_utf8(bytes) {
            Ok(s) => {
                // The entire rest of the string was valid UTF-8, we are done
                output.push_str(s);
                return output;
            }
            Err(e) => {
                let (good, bad) = bytes.split_at(e.valid_up_to());

                if !good.is_empty() {
                    let s = unsafe {
                        // This is safe because we have already validated this
                        // UTF-8 data via the call to `str::from_utf8`; there's
                        // no need to check it a second time
                        str::from_utf8_unchecked(good)
                    };
                    output.push_str(s);
                }

                if bad.is_empty() {
                    //  No more data left
                    return output;
                }

                // Do whatever type of recovery you need to here
                output.push_str("<badbyte>");

                // Skip the bad byte and try again
                bytes = &bad[1..];
            }
        }
    }
}

fn main() {
    let r = example(&[104, 101, 0xFF, 108, 111]);
    println!("{}", r); // he<badbyte>lo
}
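
On newer toolchains the same single pass can be written without `unsafe`. The sketch below assumes Rust 1.79 or later, where the standard library's `<[u8]>::utf8_chunks` is available; the function name `escape_invalid` and the `\xNN` escape format are illustrative choices, not part of the original answer.

```rust
use std::fmt::Write;

/// Illustrative helper: keep valid UTF-8 as-is and write each invalid
/// byte as a `\xNN` escape. Assumes Rust 1.79+ for `utf8_chunks`.
fn escape_invalid(bytes: &[u8]) -> String {
    let mut output = String::with_capacity(bytes.len());

    for chunk in bytes.utf8_chunks() {
        // The valid part of this chunk is already a `&str`.
        output.push_str(chunk.valid());

        // The invalid part holds the bytes that failed to decode, if any.
        for byte in chunk.invalid() {
            // Writing to a String cannot fail.
            write!(output, "\\x{:02X}", byte).unwrap();
        }
    }

    output
}

fn main() {
    let r = escape_invalid(&[104, 101, 0xFF, 108, 111]);
    println!("{}", r); // he\xFFlo
}
```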

You could extend this function to take a value to substitute for bad bytes, or a closure to handle them, and so on. For example:

fn example(mut bytes: &[u8], handler: impl Fn(&mut String, &[u8])) -> String {
    // ...    
                handler(&mut output, bad);
    // ...
}

let r = example(&[104, 101, 0xFF, 108, 111], |output, bytes| {
    use std::fmt::Write;
    write!(output, "\\U{{{}}}", bytes[0]).unwrap()
});
println!("{}", r); // he\U{255}lo
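
For reference, here is one way the elided closure-based version might be filled in as a complete program. It is a sketch under some assumptions: the name `decode_with` is made up, and unlike the snippet above it uses `Utf8Error::error_len()` to hand the handler only the invalid sequence instead of the whole remaining slice.

```rust
use std::fmt::Write;
use std::str;

/// Illustrative helper: decode `bytes`, calling `handler` with each
/// invalid byte sequence so it can append whatever escape it likes.
fn decode_with(mut bytes: &[u8], handler: impl Fn(&mut String, &[u8])) -> String {
    let mut output = String::new();

    loop {
        match str::from_utf8(bytes) {
            Ok(s) => {
                output.push_str(s);
                return output;
            }
            Err(e) => {
                let (good, rest) = bytes.split_at(e.valid_up_to());
                // SAFETY: `from_utf8` just reported that the first
                // `valid_up_to()` bytes are valid UTF-8.
                output.push_str(unsafe { str::from_utf8_unchecked(good) });

                // `error_len()` is `None` when the input ends in the middle
                // of a character; treat the whole tail as invalid then.
                let bad_len = e.error_len().unwrap_or(rest.len());
                let (bad, remaining) = rest.split_at(bad_len);
                handler(&mut output, bad);
                bytes = remaining;
            }
        }
    }
}

fn main() {
    // Escape every invalid byte as `\xNN`.
    let r = decode_with(&[104, 101, 0xFF, 108, 111], |output, bad| {
        for byte in bad {
            write!(output, "\\x{:02X}", byte).unwrap();
        }
    });
    println!("{}", r); // he\xFFlo
}
```

Passing only the invalid sequence means a handler never needs to guess how many bytes to consume, at the cost of differing slightly from the one-byte-at-a-time skipping in the original example.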

See also: