在二进制模式下将UTF16写入文件

12

我正试图在二进制模式下使用ofstream将wstring写入文件,但我觉得我做错了什么。这是我尝试过的:

ofstream outFile("test.txt", std::ios::out | std::ios::binary);
wstring hello = L"hello";
outFile.write((char *) hello.c_str(), hello.length() * sizeof(wchar_t));
outFile.close();

以UTF16编码在例如Firefox中打开test.txt,它会显示为:

h�e�l�l�o�

有人可以告诉我这是为什么吗?

编辑:

在十六进制编辑器中打开文件,我得到:

FF FE 68 00 00 00 65 00 00 00 6C 00 00 00 6C 00 00 00 6F 00 00 00 

看起来每个字符之间我多了两个额外的字节,不知道为什么?


将一个 facet 添加到与流相关联的本地化对象中,以执行从 wchar_t 到正确输出的转换。请参见下面。 - Martin York
6个回答

14

在这里,我们遇到了很少使用的本地化属性。如果您将字符串输出为字符串(而不是原始数据),则可以让本地化自动执行适当的转换。

N.B.此代码未考虑wchar_t字符的字节序。

#include <locale>
#include <fstream>
#include <iostream>
// See Below for the facet
#include "UTF16Facet.h"

int main(int argc,char* argv[])
{
   // construct a custom unicode facet and add it to a local.
   UTF16Facet *unicodeFacet = new UTF16Facet();
   const std::locale unicodeLocale(std::cout.getloc(), unicodeFacet);

   // Create a stream and imbue it with the facet
   std::wofstream   saveFile;
   saveFile.imbue(unicodeLocale);


   // Now the stream is imbued we can open it.
   // NB If you open the file stream first. Any attempt to imbue it with a local will silently fail.
   saveFile.open("output.uni");
   saveFile << L"This is my Data\n";


   return(0);
}    

文件: UTF16Facet.h

 #include <locale>

class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
   typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
   typedef MyType::state_type          state_type;
   typedef MyType::result              result;


   /* This function deals with converting data from the input stream into the internal stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result  do_in(state_type &s,
                           const char  *from,const char *from_end,const char* &from_next,
                           wchar_t     *to,  wchar_t    *to_limit,wchar_t*    &to_next) const
   {
       // Loop over both the input and output array/
       for(;(from < from_end) && (to < to_limit);from += 2,++to)
       {
           /*Input the Data*/
           /* As the input 16 bits may not fill the wchar_t object
            * Initialise it so that zero out all its bit's. This
            * is important on systems with 32bit wchar_t objects.
            */
           (*to)                               = L'\0';

           /* Next read the data from the input stream into
            * wchar_t object. Remember that we need to copy
            * into the bottom 16 bits no matter what size the
            * the wchar_t object is.
            */
           reinterpret_cast<char*>(to)[0]  = from[0];
           reinterpret_cast<char*>(to)[1]  = from[1];
       }
       from_next   = from;
       to_next     = to;

       return((from > from_end)?partial:ok);
   }



   /* This function deals with converting data from the internal stream to a C/C++ file stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result do_out(state_type &state,
                           const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
                           char          *to,   char          *to_limit, char*          &to_next) const
   {
       for(;(from < from_end) && (to < to_limit);++from,to += 2)
       {
           /* Output the Data */
           /* NB I am assuming the characters are encoded as UTF-16.
            * This means they are 16 bits inside a wchar_t object.
            * As the size of wchar_t varies between platforms I need
            * to take this into consideration and only take the bottom
            * 16 bits of each wchar_t object.
            */
           to[0]     = reinterpret_cast<const char*>(from)[0];
           to[1]     = reinterpret_cast<const char*>(from)[1];

       }
       from_next   = from;
       to_next     = to;

       return((to > to_limit)?partial:ok);
   }
};

1
请注意,您的 Facet 实现了 UCS-2 到/from 的转换,而不是 UTF-16。UTF-16 是一种可变长度编码,它使用所谓的代理对。UCS-2 是 Unicode 的一个子集,这也是发明 UTF-16 的原因。 - Jakob Riedle

6
我猜测在你的环境中,wchar_t的大小为4 - 也就是说它输出的是UTF-32/UCS-4而不是UTF-16。这正是十六进制转储看起来的样子。
测试很简单(只需打印sizeof(wchar_t)),但我非常确定这就是问题所在。
要从UTF-32 wstring转换为UTF-16,您需要应用适当的编码,因为代理对会发挥作用。

是的,你说得对,wchar_t 的大小为4,我在使用Mac电脑。这就解释了很多问题 :) 我知道UTF-16中的代理对,需要再深入了解一下。 - Cactuar
从输出中无法确定它是UTF-16还是UTF-32,它只显示wchar_t为4个字节宽。字符串的编码不是由语言定义的(尽管最可能是UCS-4)。 - Martin York

6
如果使用C++11标准,这将非常容易(因为有很多额外的包含文件,如“utf8”,可以永久解决这些问题)。
但是,如果您想使用旧标准编写多平台代码,可以使用以下流写作方法:
  1. Read the article about UTF converter for streams
  2. Add stxutif.h to your project from sources above
  3. Open the file in ANSI mode and add the BOM to the start of a file, like this:

    std::ofstream fs;
    fs.open(filepath, std::ios::out|std::ios::binary);
    
    unsigned char smarker[3];
    smarker[0] = 0xEF;
    smarker[1] = 0xBB;
    smarker[2] = 0xBF;
    
    fs << smarker;
    fs.close();
    
  4. Then open the file as UTF and write your content there:

    std::wofstream fs;
    fs.open(filepath, std::ios::out|std::ios::app);
    
    std::locale utf8_locale(std::locale(), new utf8cvt<false>);
    fs.imbue(utf8_locale); 
    
    fs << .. // Write anything you want...
    

2
在Windows上使用wofstream和上面定义的utf16 facet会失败,因为wofstream将所有值为0A的字节转换为2个字节的0D 0A,无论您如何传递0A字节,'\x0A'、L'\x0A'、L'\x000A'、'\n'、L'\n'和std::endl都会得到相同的结果。在Windows上,您必须以二进制模式打开文件并使用ofstream(而不是wofsteam)编写输出,就像在原始帖子中所做的那样。

1
提供的 Utf16Facet 在处理大字符串时在 gcc 中无法正常工作,这是我使用的版本... 这样文件将以 UTF-16LE 保存。对于 UTF-16BE,只需反转 do_indo_out 中的赋值,例如 to[0] = from[1]to[1] = from[0]
#include <locale>
#include <bits/codecvt.h>


class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
   typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
   typedef MyType::state_type          state_type;
   typedef MyType::result              result;


   /* This function deals with converting data from the input stream into the internal stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result  do_in(state_type &s,
                           const char  *from,const char *from_end,const char* &from_next,
                           wchar_t     *to,  wchar_t    *to_limit,wchar_t*    &to_next) const
   {

       for(;from < from_end;from += 2,++to)
       {
           if(to<=to_limit){
               (*to)                               = L'\0';

               reinterpret_cast<char*>(to)[0]  = from[0];
               reinterpret_cast<char*>(to)[1]  = from[1];

               from_next   = from;
               to_next     = to;
           }
       }

       return((to != to_limit)?partial:ok);
   }



   /* This function deals with converting data from the internal stream to a C/C++ file stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result do_out(state_type &state,
                           const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
                           char          *to,   char          *to_limit, char*          &to_next) const
   {

       for(;(from < from_end);++from, to += 2)
       {
           if(to <= to_limit){

               to[0]     = reinterpret_cast<const char*>(from)[0];
               to[1]     = reinterpret_cast<const char*>(from)[1];

               from_next   = from;
               to_next     = to;
           }
       }

       return((to != to_limit)?partial:ok);
   }
};

0

您应该使用十六进制编辑器(例如WinHex)查看输出文件,以便查看实际的位和字节,以验证输出是否实际上是UTF-16。请在此处发布并告诉我们结果。这将告诉我们是要责怪Firefox还是您的C++程序。

但我认为您的C++程序是有效的,而Firefox没有正确解释您的UTF-16。UTF-16要求每个字符使用两个字节。但Firefox打印的字符数是应有数量的两倍,因此它可能试图将您的字符串解释为UTF-8或ASCII,这些通常只有每个字符1个字节。

当您说“使用UTF16编码的Firefox”时,您是什么意思?我对此持怀疑态度。


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接