在二进制模式下将UTF16写入文件

Question

在二进制模式下将UTF16写入文件

12

我正试图在二进制模式下使用ofstream将wstring写入文件，但我觉得我做错了什么。这是我尝试过的：

ofstream outFile("test.txt", std::ios::out | std::ios::binary);
wstring hello = L"hello";
outFile.write((char *) hello.c_str(), hello.length() * sizeof(wchar_t));
outFile.close();

以UTF16编码在例如Firefox中打开test.txt，它会显示为：

h�e�l�l�o�

有人可以告诉我这是为什么吗？

编辑：

在十六进制编辑器中打开文件，我得到：

FF FE 68 00 00 00 65 00 00 00 6C 00 00 00 6C 00 00 00 6F 00 00 00

看起来每个字符之间我多了两个额外的字节，不知道为什么？

- Cactuar

将一个 facet 添加到与流相关联的本地化对象中，以执行从 wchar_t 到正确输出的转换。请参见下面。 - Martin York

6个回答

6

我猜测在你的环境中，wchar_t的大小为4 - 也就是说它输出的是UTF-32/UCS-4而不是UTF-16。这正是十六进制转储看起来的样子。

测试很简单（只需打印sizeof(wchar_t)），但我非常确定这就是问题所在。

要从UTF-32 wstring转换为UTF-16，您需要应用适当的编码，因为代理对会发挥作用。

- Jon Skeet

是的，你说得对，wchar_t 的大小为4，我在使用Mac电脑。这就解释了很多问题 :) 我知道UTF-16中的代理对，需要再深入了解一下。 - Cactuar

从输出中无法确定它是UTF-16还是UTF-32，它只显示wchar_t为4个字节宽。字符串的编码不是由语言定义的（尽管最可能是UCS-4）。 - Martin York

6

如果使用C++11标准，这将非常容易（因为有很多额外的包含文件，如“utf8”，可以永久解决这些问题）。

但是，如果您想使用旧标准编写多平台代码，可以使用以下流写作方法：

Read the article about UTF converter for streams
Add stxutif.h to your project from sources above

Open the file in ANSI mode and add the BOM to the start of a file, like this:

std::ofstream fs;
fs.open(filepath, std::ios::out|std::ios::binary);

unsigned char smarker[3];
smarker[0] = 0xEF;
smarker[1] = 0xBB;
smarker[2] = 0xBF;

fs << smarker;
fs.close();

Then open the file as UTF and write your content there:

std::wofstream fs;
fs.open(filepath, std::ios::out|std::ios::app);

std::locale utf8_locale(std::locale(), new utf8cvt<false>);
fs.imbue(utf8_locale); 

fs << .. // Write anything you want...

- Yarkov Anton

2

在Windows上使用wofstream和上面定义的utf16 facet会失败，因为wofstream将所有值为0A的字节转换为2个字节的0D 0A，无论您如何传递0A字节，'\x0A'、L'\x0A'、L'\x000A'、'\n'、L'\n'和std::endl都会得到相同的结果。在Windows上，您必须以二进制模式打开文件并使用ofstream（而不是wofsteam）编写输出，就像在原始帖子中所做的那样。

- gilbert

1

提供的 Utf16Facet 在处理大字符串时在 gcc 中无法正常工作，这是我使用的版本... 这样文件将以 UTF-16LE 保存。对于 UTF-16BE，只需反转 do_in 和 do_out 中的赋值，例如 to[0] = from[1] 和 to[1] = from[0]。

#include <locale>
#include <bits/codecvt.h>


class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
   typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
   typedef MyType::state_type          state_type;
   typedef MyType::result              result;


   /* This function deals with converting data from the input stream into the internal stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result  do_in(state_type &s,
                           const char  *from,const char *from_end,const char* &from_next,
                           wchar_t     *to,  wchar_t    *to_limit,wchar_t*    &to_next) const
   {

       for(;from < from_end;from += 2,++to)
       {
           if(to<=to_limit){
               (*to)                               = L'\0';

               reinterpret_cast<char*>(to)[0]  = from[0];
               reinterpret_cast<char*>(to)[1]  = from[1];

               from_next   = from;
               to_next     = to;
           }
       }

       return((to != to_limit)?partial:ok);
   }



   /* This function deals with converting data from the internal stream to a C/C++ file stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result do_out(state_type &state,
                           const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
                           char          *to,   char          *to_limit, char*          &to_next) const
   {

       for(;(from < from_end);++from, to += 2)
       {
           if(to <= to_limit){

               to[0]     = reinterpret_cast<const char*>(from)[0];
               to[1]     = reinterpret_cast<const char*>(from)[1];

               from_next   = from;
               to_next     = to;
           }
       }

       return((to != to_limit)?partial:ok);
   }
};

- NuNuNO Griesbach

0

您应该使用十六进制编辑器（例如WinHex）查看输出文件，以便查看实际的位和字节，以验证输出是否实际上是UTF-16。请在此处发布并告诉我们结果。这将告诉我们是要责怪Firefox还是您的C++程序。

但我认为您的C++程序是有效的，而Firefox没有正确解释您的UTF-16。UTF-16要求每个字符使用两个字节。但Firefox打印的字符数是应有数量的两倍，因此它可能试图将您的字符串解释为UTF-8或ASCII，这些通常只有每个字符1个字节。

当您说“使用UTF16编码的Firefox”时，您是什么意思？我对此持怀疑态度。

- David Grayson

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Martin York · Accepted Answer

在这里，我们遇到了很少使用的本地化属性。如果您将字符串输出为字符串（而不是原始数据），则可以让本地化自动执行适当的转换。

N.B.此代码未考虑wchar_t字符的字节序。

#include <locale>
#include <fstream>
#include <iostream>
// See Below for the facet
#include "UTF16Facet.h"

int main(int argc,char* argv[])
{
   // construct a custom unicode facet and add it to a local.
   UTF16Facet *unicodeFacet = new UTF16Facet();
   const std::locale unicodeLocale(std::cout.getloc(), unicodeFacet);

   // Create a stream and imbue it with the facet
   std::wofstream   saveFile;
   saveFile.imbue(unicodeLocale);


   // Now the stream is imbued we can open it.
   // NB If you open the file stream first. Any attempt to imbue it with a local will silently fail.
   saveFile.open("output.uni");
   saveFile << L"This is my Data\n";


   return(0);
}

文件: UTF16Facet.h

 #include <locale>

class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
   typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
   typedef MyType::state_type          state_type;
   typedef MyType::result              result;


   /* This function deals with converting data from the input stream into the internal stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result  do_in(state_type &s,
                           const char  *from,const char *from_end,const char* &from_next,
                           wchar_t     *to,  wchar_t    *to_limit,wchar_t*    &to_next) const
   {
       // Loop over both the input and output array/
       for(;(from < from_end) && (to < to_limit);from += 2,++to)
       {
           /*Input the Data*/
           /* As the input 16 bits may not fill the wchar_t object
            * Initialise it so that zero out all its bit's. This
            * is important on systems with 32bit wchar_t objects.
            */
           (*to)                               = L'\0';

           /* Next read the data from the input stream into
            * wchar_t object. Remember that we need to copy
            * into the bottom 16 bits no matter what size the
            * the wchar_t object is.
            */
           reinterpret_cast<char*>(to)[0]  = from[0];
           reinterpret_cast<char*>(to)[1]  = from[1];
       }
       from_next   = from;
       to_next     = to;

       return((from > from_end)?partial:ok);
   }



   /* This function deals with converting data from the internal stream to a C/C++ file stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result do_out(state_type &state,
                           const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
                           char          *to,   char          *to_limit, char*          &to_next) const
   {
       for(;(from < from_end) && (to < to_limit);++from,to += 2)
       {
           /* Output the Data */
           /* NB I am assuming the characters are encoded as UTF-16.
            * This means they are 16 bits inside a wchar_t object.
            * As the size of wchar_t varies between platforms I need
            * to take this into consideration and only take the bottom
            * 16 bits of each wchar_t object.
            */
           to[0]     = reinterpret_cast<const char*>(from)[0];
           to[1]     = reinterpret_cast<const char*>(from)[1];

       }
       from_next   = from;
       to_next     = to;

       return((to > to_limit)?partial:ok);
   }
};