How do I convert a C string (char array) into a Python string when it contains non-ASCII characters?

Question

7

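I have embedded a Python interpreter in a C program. Suppose the C program reads some bytes from a file into a char array and learns (somehow) that the bytes represent text in a certain encoding (e.g., ISO 8859-1, Windows-1252, or UTF-8). How do I decode the contents of this char array into a Python string?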

The Python string should in general be of type unicode. For instance, a 0x93 in Windows-1252 encoded input becomes u'\u201c'.

I have tried to use PyString_Decode, but it always fails when there are non-ASCII characters in the string. Here is an example that fails:

#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *py_string;

     Py_Initialize();

     py_string = PyString_Decode(c_string, 1, "windows_1252", "replace");
     if (!py_string) {
          PyErr_Print();
          return 1;
     }
     return 0;
}

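The error message is "UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)", which indicates that the ascii codec is being used even though we specify windows_1252 in the call to PyString_Decode.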

The following code works around the problem by using PyString_FromString to create a Python string of the undecoded bytes, then calling its decode method:

#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *raw, *decoded;

     Py_Initialize();

     raw = PyString_FromString(c_string);
     printf("Undecoded: ");
     PyObject_Print(raw, stdout, 0);
     printf("\n");
     decoded = PyObject_CallMethod(raw, "decode", "s", "windows_1252");
     Py_DECREF(raw);
     printf("Decoded: ");
     PyObject_Print(decoded, stdout, 0);
     printf("\n");
     return 0;
}

- Vebjorn Ljosa

Nitpick: a string in C is of type char[], not char*. - James Curran

1

To nitpick further: it doesn't matter when you are referring to the value. Arrays are passed to functions as pointers anyway. - gnud

3 Answers

3

You don't want to decode the string into a Unicode representation, you just want to treat it as an array of bytes, right?

Just use PyString_FromString:

char *cstring;
PyObject *pystring = PyString_FromString(cstring);

Now you have a Python str() object. See the documentation here: https://docs.python.org/2/c-api/string.html

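I'm a little confused about whether you mean "str" or "unicode". They are quite different if you have non-ASCII characters. If you want to decode a C string and you know exactly what character set it is in, then PyString_Decode is a good place to start.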

- Dan

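I want to actually decode it, so that whatever Python code ends up using the string doesn't need to know how it was originally encoded (in the input to the C program). Thanks for pointing out that I was unclear; I have edited my question. - Vebjorn Ljosa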

2

Try calling PyErr_Print() in the "if (!py_string)" clause. Perhaps the Python exception will give you some more information.

- Alex Coventry

Thanks, I have incorporated the information into the question. - Vebjorn Ljosa

- Tony Meyer · Accepted Answer

PyString_Decode does this:

PyObject *PyString_Decode(const char *s,
              Py_ssize_t size,
              const char *encoding,
              const char *errors)
{
    PyObject *v, *str;

    str = PyString_FromStringAndSize(s, size);
    if (str == NULL)
        return NULL;
    v = PyString_AsDecodedString(str, encoding, errors);
    Py_DECREF(str);
    return v;
}

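So basically, it does the equivalent of what you do in your second example: it converts the bytes to a string object, then decodes that string. The problem arises from PyString_AsDecodedString, as opposed to PyString_AsDecodedObject. PyString_AsDecodedString calls PyString_AsDecodedObject, but then tries to convert the resulting unicode object back into a string object using the default encoding (which for you appears to be ASCII). That is where it fails.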
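To make the failure concrete, here is a toy model in plain C of the decode-then-re-encode round trip described above. It does not use the Python API at all; the function names are made up for illustration, and the Windows-1252 table covers only a handful of bytes. It is a sketch of the idea, not of CPython's actual implementation.

```c
#include <stdint.h>

/* Hypothetical helper: decode one Windows-1252 byte to a Unicode code point.
 * Only a few bytes in the 0x80-0x9F range are mapped here; other bytes in
 * that range are NOT handled correctly by this toy. Bytes 0x00-0x7F and
 * 0xA0-0xFF map to the code point with the same value. */
static uint32_t toy_cp1252_decode(unsigned char byte)
{
    switch (byte) {
    case 0x80: return 0x20AC; /* EURO SIGN */
    case 0x91: return 0x2018; /* LEFT SINGLE QUOTATION MARK */
    case 0x92: return 0x2019; /* RIGHT SINGLE QUOTATION MARK */
    case 0x93: return 0x201C; /* LEFT DOUBLE QUOTATION MARK */
    case 0x94: return 0x201D; /* RIGHT DOUBLE QUOTATION MARK */
    default:   return byte;
    }
}

/* Hypothetical helper: strict ASCII encode of one code point.
 * Returns 0 on success, -1 if the code point is outside range(128). */
static int toy_ascii_encode(uint32_t code_point, char *out)
{
    if (code_point >= 128)
        return -1;
    *out = (char)code_point;
    return 0;
}

/* Models the round trip: step 1 (decode) succeeds, step 2 (re-encode with
 * an ASCII default encoding) fails for any non-ASCII code point. */
static int toy_decode_then_reencode(unsigned char byte, char *out)
{
    uint32_t code_point = toy_cp1252_decode(byte);
    return toy_ascii_encode(code_point, out);
}
```

In this model, toy_decode_then_reencode(0x93, &out) returns -1 because the decoded code point U+201C cannot be represented in ASCII, which mirrors the UnicodeEncodeError in the question. An ASCII byte such as 'A' survives the round trip unchanged.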

I believe you will need to make two calls, but you can use PyString_AsDecodedObject rather than calling the Python "decode" method. Something like:

#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *py_string, *py_unicode;

     Py_Initialize();

     py_string = PyString_FromStringAndSize(c_string, 1);
     if (!py_string) {
          PyErr_Print();
          return 1;
     }
     py_unicode = PyString_AsDecodedObject(py_string, "windows_1252", "replace");
     Py_DECREF(py_string);

     return 0;
}

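I'm not entirely sure what the reasoning behind PyString_Decode working this way is. A very old thread on python-dev seems to indicate that it has something to do with chaining the output, but since the Python methods don't do the same, I'm not sure whether that is still relevant.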