Renaming a file whose name contains an en dash in C++

Question

3

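In the project I am working on, I handle files and need to check that they exist before going on. However, if a file path contains an en dash (a special short dash), it seems that the file cannot be renamed or otherwise operated on.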

std::string _old = "D:\\Folder\\This – by ABC.txt";
std::rename(_old.c_str(), "New.txt");

Here the _old variable is interpreted as D:\Folder\This û by ABC.txt. I tried:

setlocale(LC_ALL, "");
//and
setlocale(LC_ALL, "C");
//or    
setlocale(LC_ALL, "en_US.UTF-8");

But none of them worked... What should I do?

- Viktor Danov

1

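In your C++ source code, use the Unicode or UTF-8 hex constant to represent the dash. If the compiler interprets the character differently, you are relying on your C++ code editor to determine what "–" really is, so all those setlocale calls do nothing for you. - PaulMcKenzie

What happens if you replace the "–" with "\xe2\x80\x93" (UTF-8)? - Petr Skocik

Have you tried u8"This – by ABC.txt"? - Jean-François Fabre

I tried that, but I got those strange symbols (picture linked in the original post). - Viktor Danov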

3 Answers

- Andrey Nasonov · Answer 1

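It depends on the operating system. In Linux, file names are simple byte arrays: forget about encoding and just rename the file.

But it looks like you are using Windows, where file names are actually null-terminated strings of 16-bit characters. In that case, the best approach is to use wstring instead of messing with encodings.

Don't try to write platform-independent code to solve a platform-specific problem. Windows uses Unicode for file names, so you have to write platform-specific code instead of using the standard function rename.

Just write L"D:\\Folder\\This \u2013 by ABC.txt" and call _wrename.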
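As a rough sketch of that advice: the helper below calls _wrename with wide literals on Windows. The helper name rename_en_dash_file, the use of relative file names, and the non-Windows branch (which passes the UTF-8 bytes through std::rename) are assumptions made for illustration, not part of the original answer.

```cpp
#include <cstdio>   // std::rename
#include <cwchar>   // wide-character declarations (_wrename on Windows)

// Hypothetical helper: renames "This – by ABC.txt" in the current
// directory to "New.txt". Returns true on success.
bool rename_en_dash_file()
{
#ifdef _WIN32
    // Wide literal: \u2013 is the en dash; the name reaches the
    // file system as 16-bit characters, no code page involved.
    return _wrename(L"This \u2013 by ABC.txt", L"New.txt") == 0;
#else
    // Assumption: elsewhere file names are plain bytes, so the UTF-8
    // encoding of the en dash (E2 80 93) is passed through unchanged.
    return std::rename("This \xe2\x80\x93 by ABC.txt", "New.txt") == 0;
#endif
}
```

Because the dash is written as an escape, the result does not depend on how the editor saved the source file.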

- Cheers and hth. - Alf · Answer 2

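The Windows ANSI Western encoding has the Unicode n-dash, U+2013, "–", as code point 150 (decimal).

When you output that to a console with active codepage 437, the original IBM PC character set or a compatible one, it is interpreted as "û".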
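To make the mix-up concrete, here is a tiny lookup that covers only what is discussed above: printable ASCII and byte 150. The function name interpret_byte and the use of U+FFFD for everything else are assumptions of this sketch; it is not a real code page table.

```cpp
#include <cassert>  // assert

// Hypothetical lookup: which Unicode code point a single byte stands for
// in code page 1252 or 437. Only printable ASCII and byte 150 are covered.
char32_t interpret_byte(unsigned char byte, int code_page)
{
    assert(code_page == 1252 || code_page == 437);
    if (byte >= 0x20 && byte < 0x7F)
        return byte;                     // printable ASCII is shared by both
    if (byte == 150)
        return code_page == 1252
            ? U'\u2013'                  // en dash in Windows ANSI Western
            : U'\u00FB';                 // 'û' in the IBM PC character set
    return U'\uFFFD';                    // not covered by this sketch
}
```

The same byte in the string literal is therefore correct for the file API (code page 1252) and only looks wrong when echoed through a console that uses code page 437.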

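So your string literal does contain the right codepage 1252 character, either because

you are using Visual C++, which defaults to the Windows ANSI codepage for encoding narrow string literals, or
you are using an old version of g++ that does not do the standard-mandated conversion and checking, but just passes narrow character bytes directly through its machinery, and your source code is encoded as Windows ANSI Western (or compatible), or
something else that I have not thought of.

For either of the first two possibilities, the rename call will work.

I tested that it does work with Visual C++. I do not have an old version of g++ around, but I tested that it works with version 5.1. That is, I tested that the file really was renamed to New.txt.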

// Source encoding: UTF-8
// Execution character set: Windows ANSI Western a.k.a. codepage 1252.
#include <stdio.h>      // rename
#include <stdlib.h>     // EXIT_SUCCESS, EXIT_FAILURE
#include <string>       // std::string
using namespace std;

auto main()
    -> int
{
    string const a = ".\\This – by ABC.txt";    // Literal encoded as CP 1252.
    return rename( a.c_str(), "New.txt" ) == 0? EXIT_SUCCESS : EXIT_FAILURE;
}

Example:

[C:\my\forums\so\265]
> dir /b *.txt
File Not Found
[C:\my\forums\so\265]
> g++ r.cpp -fexec-charset=cp1252

[C:\my\forums\so\265]
> type nul >"This – by ABC.txt"

[C:\my\forums\so\265]
> run a
Exit code 0
[C:\my\forums\so\265]
> dir /b *.txt
New.txt
[C:\my\forums\so\265]
> _

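Here run is just a batch file that reports the exit code.

If your Windows ANSI codepage is not codepage 1252, then you need to use your specific Windows ANSI codepage.

You can check the Windows ANSI codepage via the GetACP API function, or for example with this command: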

[C:\my\forums\so\265]
> wmic os get codeset /value | find "="
CodeSet=1252
[C:\my\forums\so\265]
> _

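If that codepage supports the n-dash character, the code will work.

This model of coding is based on having one version of the executable for each relevant main locale (including its character encoding).

An alternative is to do everything in Unicode. That can be done portably with Boost filesystem, which is to be adopted into the C++17 standard library. Or you can use the Windows API, or the de facto standard wide-character extensions to the standard library in Windows, i.e. _wrename.

Here is an example that uses the experimental filesystem module of Visual C++ 2015: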

// Source encoding: UTF-8
// Execution character set: irrelevant (everything's done in Unicode).
#include <stdlib.h>     // EXIT_SUCCESS, EXIT_FAILURE

#include <filesystem>   // Visual C++ 2015: experimental module in std::tr2::sys.
                        // In C++17 and later the namespace is std::filesystem.
using namespace std::tr2::sys;

auto main()
    -> int
{
    path const old_path = L".\\This – by ABC.txt";    // Literal encoded as wide string.
    path const new_path = L"New.txt";
    try
    {
        rename( old_path, new_path );
        return EXIT_SUCCESS;
    }
    catch( ... )
    {}
    return EXIT_FAILURE;
}

To do this properly in portable code, you can use Boost, or you can create a wrapper header that uses whatever implementation is available.
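One possible shape for such a wrapper, assuming a C++17 compiler where std::filesystem is available, is sketched below. The function name rename_unicode and the choice of UTF-16 strings as the interface are assumptions of this sketch; a real wrapper header would also select between Boost and the standard library.

```cpp
#include <filesystem>    // std::filesystem::path, std::filesystem::rename
#include <string>        // std::u16string
#include <system_error>  // std::error_code

namespace fs = std::filesystem;

// Hypothetical wrapper: renames a file or directory given as UTF-16 names.
// std::filesystem::path converts char16_t input to the platform's native
// path encoding, so no narrow code page is involved.
bool rename_unicode(const std::u16string& from, const std::u16string& to)
{
    std::error_code ec;                          // non-throwing overload
    fs::rename(fs::path(from), fs::path(to), ec);
    return !ec;                                  // true when no error was set
}
```

A call would then look like rename_unicode(u"This \u2013 by ABC.txt", u"New.txt"), with the en dash written as an escape so the source file encoding does not matter.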

- Swift - Friday Pie · Answer 3

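It is really platform dependent, and Unicode is a headache. It depends on which compiler you use. For older compilers from MS (VS2010 or older), you need to use the API described in MSDN. This test example creates a file with the name you are having problems with and then renames it.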

// #define _UNICODE // might be defined in project
#include <string>

#include <tchar.h>
#include <windows.h>

using namespace std;

// Convert a wide Unicode string to an UTF8 string
std::string utf8_encode(const std::wstring &wstr)
{
    if( wstr.empty() ) return std::string();
    int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
    std::string strTo( size_needed, 0 );
    WideCharToMultiByte                  (CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
    return strTo;
}

// Convert an UTF8 string to a wide Unicode String
std::wstring utf8_decode(const std::string &str)
{
    if( str.empty() ) return std::wstring();
    int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
    std::wstring wstrTo( size_needed, 0 );
    MultiByteToWideChar                  (CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
    return wstrTo;
}

int _tmain(int argc, _TCHAR* argv[] ) {
    std::string pFileName = "C:\\This \xe2\x80\x93 by ABC.txt";
    std::wstring pwsFileName = utf8_decode(pFileName);

    // can use CreateFile id instead
    HANDLE hf = CreateFileW( pwsFileName.c_str() ,
                      GENERIC_READ | GENERIC_WRITE,
                      0,
                      0,
                      CREATE_NEW,
                      FILE_ATTRIBUTE_NORMAL,
                      0);
    CloseHandle(hf);
    MoveFileW(utf8_decode("C:\\This \xe2\x80\x93 by ABC.txt").c_str(), utf8_decode("C:\\This \xe2\x80\x93 by ABC 2.txt").c_str());
}

There is still a problem with those helpers; the variant below converts through a null-terminated string:

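std::string utf8_encode(const std::wstring &wstr)
{
    // With -1 as the input length, the returned size includes the terminating null.
    int size_needed = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, NULL, 0, NULL, NULL);
    if( size_needed <= 0 ) return std::string();
    char *szTo = new char[size_needed];
    WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, szTo, size_needed, NULL, NULL);
    std::string strTo = szTo;   // copies up to the terminating null
    delete[] szTo;
    return strTo;
}


// Convert an UTF8 string to a wide Unicode String
std::wstring utf8_decode(const std::string &str)
{
    // With -1 as the input length, the returned count includes the terminating null.
    int size_needed = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, NULL, 0);
    if( size_needed <= 0 ) return std::wstring();
    wchar_t *wszTo = new wchar_t[size_needed];
    MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, wszTo, size_needed);
    std::wstring wstrTo = wszTo;   // copies up to the terminating null
    delete[] wszTo;
    return wstrTo;
}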

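Mind the size of the converted characters: a UTF-8 string can need more bytes than the wide string has characters. Calling WideCharToMultiByte with 0 as the target buffer size performs no conversion and instead returns the number of bytes the target buffer needs. All this juggling explains why frameworks like Qt carry fairly complex code to support Unicode-based file systems. In practice, the most cost-effective way to get rid of all these possible bugs is to use such a framework.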

For VS2015:

std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt"s;

That is according to their documentation; I cannot check that one.

For mingw:

std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt";
std::cout << _old.data();

The output contains the proper file name... but for the file APIs you still need to do a proper conversion.
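To show what that conversion does to the bytes, here is a hand-written UTF-8 to UTF-16 decoder. It is a portable stand-in for MultiByteToWideChar, not the Windows API itself; the name utf8_to_utf16 is hypothetical, and the sketch assumes well-formed input and performs no validation.

```cpp
#include <cstddef>  // std::size_t
#include <string>   // std::string, std::u16string

// Minimal UTF-8 to UTF-16 decoder. Assumes well-formed input.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;      // code point being assembled
        std::size_t len;  // length of the sequence in bytes
        if      (b < 0x80) { cp = b;        len = 1; }
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }
        else               { cp = b & 0x07; len = 4; }
        if (i + len > in.size()) break;  // truncated sequence: stop
        for (std::size_t k = 1; k < len; ++k)  // add 6 bits per continuation byte
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {                         // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For the file name above, the three bytes E2 80 93 collapse into the single 16-bit unit 0x2013, which is the form the wide Windows file functions expect.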