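Is there a code snippet that converts the most common accented characters in European languages to their closest plain ASCII equivalents? For example:
testáén
as a UTF-8 encoded string (hex bytes: 74 65 73 74 c3 a1 c3 a9 6e 00)
converted to
testaen
(I would like to use C/C++ with the standard library, or a small cross-platform library.)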
Here is code that converts characters in the ISO-8859-1 range to ASCII. Anything else outside the ASCII range is replaced with a replacement character.
#include <locale> // declares std::codecvt and std::wstring_convert
#include <array>
#include <string>
#include <iostream>
constexpr char const *rc = "?"; // replacement_char
// table mapping ISO-8859-1 characters to similar ASCII characters
std::array<char const *,96> conversions = {{
" ", "!","c","L", rc,"Y", "|","S", rc,"C","a","<<", rc, "-", "R", "-",
rc,"+/-","2","3","'","u", "P",".",",","1","o",">>","1/4","1/2","3/4", "?",
"A", "A","A","A","A","A","AE","C","E","E","E", "E", "I", "I", "I", "I",
"D", "N","O","O","O","O", "O","*","0","U","U", "U", "U", "Y", "P","ss",
"a", "a","a","a","a","a","ae","c","e","e","e", "e", "i", "i", "i", "i",
"d", "n","o","o","o","o", "o","/","0","u","u", "u", "u", "y", "p", "y"
}};
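// std::codecvt has a protected destructor; this wrapper makes it public
// so that std::wstring_convert can own and destroy the facet.
template <class Facet>
class usable_facet : public Facet {
public:
    using Facet::Facet;
    ~usable_facet() {}
};

std::string to_ascii(std::string const &utf8) {
    // Decode UTF-8 to UTF-32 so that each character is a single code point.
    // from_bytes throws std::range_error on malformed input.
    std::wstring_convert<usable_facet<std::codecvt<char32_t,char,std::mbstate_t>>,
                         char32_t> convert;
    std::u32string utf32 = convert.from_bytes(utf8);

    std::string ascii;
    for (char32_t c : utf32) {
        if (c<=U'\u007F')                       // already ASCII
            ascii.push_back(static_cast<char>(c));
        else if (U'\u00A0'<=c && c<=U'\u00FF')  // covered by the table
            ascii.append(conversions[c - U'\u00A0']);
        else                                    // no mapping available
            ascii.append(rc);
    }
    return ascii;
}

int main() {
    // UTF-8 bytes of "testáén" spelled out: since C++20 a u8"" literal is
    // char8_t-based and no longer converts to std::string.
    std::cout << to_ascii("test\xC3\xA1\xC3\xA9n\n");
}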
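The Unicode character set you would have to deal with is huge, so "small" is not a realistic requirement. The ICU library contains what you need, but for exactly that reason you won't find it small. For example, you need to handle both combining and non-combining (precomposed) diacritics.
If you only care about a small subset of the possible Unicode characters, you can create your own simple mapping table.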
Ã maps to a, but why does © map to e? And what happened to the second Ã? Have you tried writing any such code? - Cody Gray