Convert Windows-1252 to UTF-8 with JS

5

I have some Dutch strings. I know how to encode them using PHP.

$str = iconv( "Windows-1252", "UTF-8", $str );

What is the equivalent code in JavaScript?

2
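I'd love to know whether JS has anything like this. - evolutionxbox
@evolutionxbox Me too. - Kiran Dash
Since I got no response, I tried doing the conversion through AJAX instead, but I'm still running into problems. If anyone is reading this, perhaps you could take a look at my new question: http://stackoverflow.com/questions/42432964/character-conversion-working-different-for-fgetcsv-and-jquery-get-with-ajax - Kiran Dash
3 Answers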

4

Windows-1252 is a single-byte encoding, which is convenient: you can simply build a lookup table.

<?php
$s = '';

for ($i = 0; $i < 256; $i++) {
    $converted = iconv('Windows-1252', 'UTF-8', chr($i));

    if ($converted === false) {
        $s .= "\xef\xbf\xbd";  # UTF-8 replacement character
    } else {
        $s .= $converted;
    }
}

echo $s;

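Assuming you want an ordinary JavaScript string as the result (rather than UTF-8), and that the input is a string in which each character's Unicode code point actually represents one Windows-1252 byte, the generated table can be read as UTF-8 and placed in a JavaScript string literal, like this: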
var WINDOWS_1252 = '\u0000\u0001\u0002\u0003\u0004\u0005\u0006\u0007\b\t\n\u000b\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u007f€\ufffd‚ƒ„…†‡ˆ‰Š‹Œ\ufffdŽ\ufffd\ufffd‘’“”•–—˜™š›œ\ufffdžŸ\u00a0¡¢£¤¥¦§¨©ª«¬\u00ad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ';

function fromWindows1252(binaryString) {
    var text = '';

    for (var i = 0; i < binaryString.length; i++) {
        text += WINDOWS_1252.charAt(binaryString.charCodeAt(i));
    }

    return text;
}
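
If the input is already available as bytes (for example a Uint8Array) rather than as a binary string, the same lookup idea can be written more compactly, because Windows-1252 only differs from Latin-1 in the range 0x80 to 0x9F. The following is a minimal sketch under that assumption; the names CP1252_HIGH and decodeWindows1252Bytes are made up for illustration, and undefined bytes are mapped to U+FFFD to match the table above.

```javascript
// Only bytes 0x80-0x9F differ between Windows-1252 and Latin-1.
// Undefined bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) become U+FFFD here.
var CP1252_HIGH =
    '\u20ac\ufffd\u201a\u0192\u201e\u2026\u2020\u2021' +
    '\u02c6\u2030\u0160\u2039\u0152\ufffd\u017d\ufffd' +
    '\ufffd\u2018\u2019\u201c\u201d\u2022\u2013\u2014' +
    '\u02dc\u2122\u0161\u203a\u0153\ufffd\u017e\u0178';

// Hypothetical helper: accepts a Uint8Array or a plain array of byte values.
function decodeWindows1252Bytes(bytes) {
    var text = '';
    for (var i = 0; i < bytes.length; i++) {
        var b = bytes[i];
        text += (b >= 0x80 && b <= 0x9f)
            ? CP1252_HIGH.charAt(b - 0x80)
            : String.fromCharCode(b);
    }
    return text;
}

// The Dutch word "één", a space, and a euro sign, encoded as Windows-1252.
var input = new Uint8Array([0xe9, 0xe9, 0x6e, 0x20, 0x80]);
var decoded = decodeWindows1252Bytes(input);

// If actual UTF-8 bytes are needed (e.g. to upload), encode the string.
var utf8Bytes = new TextEncoder().encode(decoded);
```

Note that a JavaScript string is not "UTF-8" by itself; TextEncoder is what produces UTF-8 bytes from it.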

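Thanks for your answer. I didn't fully understand it. To explain my problem better, I've created a new question. Perhaps you could take a look at it here: http://stackoverflow.com/questions/42432964/convert-windows-1252-to-utf-8-with-ajax - Kiran Dash
@KiranDash: Why not use the original iconv to encode the CSV file as UTF-8? That way you only need one round trip between the server and the client. - Ry-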

1

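The accepted answer didn't work for me, so let me show what did.

My mis-decoded string came from a readable stream (the response of a fetch call). When I ran await ReadableStreamVar.text(), every accented character in the resulting text had been replaced by the odd "�" character. When I opened the same page manually in the browser, it displayed correctly, and typing document.characterSet in the console showed that the character set was "windows-1252", not UTF-8.

Ry's answer didn't work because the text method had already decoded the data with the wrong encoding, replacing all the different accented characters with the same � character.

Looking through the ReadableStream API, I found no way to pass the desired charset to the .text method, so I tried a different approach: reading the ReadableStream as an array (ReadableStream.arrayBuffer()), and that worked. In the array I could see that the accented characters had the correct byte values. So, following this answer, I used the TextDecoder API, and my working code now looks like this: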

async function getDocumentTextContent(uri) {
    const response = await fetch(uri, { method: "GET" })
    if (!response.ok) throw new Error(`Problem retrieving the resource. Error message: ${response.statusText}`)
    const dec = new TextDecoder("windows-1252") // specify the desired charset here
    const arrBuffer = await response.arrayBuffer()
    const ui8array = new Uint8Array(arrBuffer)
    const text = dec.decode(ui8array)
    console.log(text)
    return text
}
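
The same decoder can be used without fetch when the bytes are already in hand (for example from a file input or a buffer). This is a small sketch, assuming the runtime's TextDecoder accepts the "windows-1252" label; browsers do, while a Node.js build without full ICU data may throw a RangeError instead. The sample bytes are made up for illustration.

```javascript
// Assumes TextDecoder supports the "windows-1252" label in this runtime.
const decoder = new TextDecoder("windows-1252");

// Bytes for "Curaçao €5" as they would appear in a Windows-1252 file:
// 0xE7 is "ç" and 0x80 is the euro sign.
const bytes = new Uint8Array([0x43, 0x75, 0x72, 0x61, 0xe7, 0x61, 0x6f, 0x20, 0x80, 0x35]);
const decodedText = decoder.decode(bytes);

// Re-encode as UTF-8 bytes if that is what the next step needs.
const utf8 = new TextEncoder().encode(decodedText);
console.log(decodedText, utf8.length);
```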

1

I went with a brute-force approach. It may not be the most elegant solution, but it works:

function bruteForceWindows1252toUTF16(s) {

    for (var i = 0; i < globalWin1252toUTF16table.length; i++) {
        if (s.includes(globalWin1252toUTF16table[i]['win1252'])) {
            s = s.replaceAll(globalWin1252toUTF16table[i]['win1252'], globalWin1252toUTF16table[i]['utf16']);
        }
    }
    return s;
}


const globalWin1252toUTF16table = [
        { win1252:'\xe2\x82\xac', utf16:'\u20AC' },
        { win1252:'\xe2\x80\x9a', utf16:'\u201A' },
        { win1252:'\xc6\x92', utf16:'\u0192' },
        { win1252:'\xe2\x80\x9e', utf16:'\u201E' },
        { win1252:'\xe2\x80\xa6', utf16:'\u2026' },
        { win1252:'\xe2\x80\xa0', utf16:'\u2020' },
        { win1252:'\xe2\x80\xa1', utf16:'\u2021' },
        { win1252:'\xcb\x86', utf16:'\u02C6' },
        { win1252:'\xe2\x80\xb0', utf16:'\u2030' },
        { win1252:'\xc5\xa0', utf16:'\u0160' },
        { win1252:'\xe2\x80\xb9', utf16:'\u2039' },
        { win1252:'\xc5\x92', utf16:'\u0152' },
        { win1252:'\xc5\xbd', utf16:'\u017D' },
        { win1252:'\xe2\x80\x98', utf16:'\u2018' },
        { win1252:'\xe2\x80\x99', utf16:'\u2019' },
        { win1252:'\xe2\x80\x9c', utf16:'\u201C' },
        { win1252:'\xe2\x80\x9d', utf16:'\u201D' },
        { win1252:'\xe2\x80\xa2', utf16:'\u2022' },
        { win1252:'\xe2\x80\x93', utf16:'\u2013' },
        { win1252:'\xe2\x80\x94', utf16:'\u2014' },
        { win1252:'\xcb\x9c', utf16:'\u02DC' },
        { win1252:'\xe2\x84\xa2', utf16:'\u2122' },
        { win1252:'\xc5\xa1', utf16:'\u0161' },
        { win1252:'\xe2\x80\xba', utf16:'\u203A' },
        { win1252:'\xc5\x93', utf16:'\u0153' },
        { win1252:'\xc5\xbe', utf16:'\u017E' },
        { win1252:'\xc5\xb8', utf16:'\u0178' },
        { win1252:'\xc2\xa0', utf16:'\u00A0' },
        { win1252:'\xc2\xa1', utf16:'\u00A1' },
        { win1252:'\xc2\xa2', utf16:'\u00A2' },
        { win1252:'\xc2\xa3', utf16:'\u00A3' },
        { win1252:'\xc2\xa4', utf16:'\u00A4' },
        { win1252:'\xc2\xa5', utf16:'\u00A5' },
        { win1252:'\xc2\xa6', utf16:'\u00A6' },
        { win1252:'\xc2\xa7', utf16:'\u00A7' },
        { win1252:'\xc2\xa8', utf16:'\u00A8' },
        { win1252:'\xc2\xa9', utf16:'\u00A9' },
        { win1252:'\xc2\xaa', utf16:'\u00AA' },
        { win1252:'\xc2\xab', utf16:'\u00AB' },
        { win1252:'\xc2\xac', utf16:'\u00AC' },
        { win1252:'\xc2\xad', utf16:'\u00AD' },
        { win1252:'\xc2\xae', utf16:'\u00AE' },
        { win1252:'\xc2\xaf', utf16:'\u00AF' },
        { win1252:'\xc2\xb0', utf16:'\u00B0' },
        { win1252:'\xc2\xb1', utf16:'\u00B1' },
        { win1252:'\xc2\xb2', utf16:'\u00B2' },
        { win1252:'\xc2\xb3', utf16:'\u00B3' },
        { win1252:'\xc2\xb4', utf16:'\u00B4' },
        { win1252:'\xc2\xb5', utf16:'\u00B5' },
        { win1252:'\xc2\xb6', utf16:'\u00B6' },
        { win1252:'\xc2\xb7', utf16:'\u00B7' },
        { win1252:'\xc2\xb8', utf16:'\u00B8' },
        { win1252:'\xc2\xb9', utf16:'\u00B9' },
        { win1252:'\xc2\xba', utf16:'\u00BA' },
        { win1252:'\xc2\xbb', utf16:'\u00BB' },
        { win1252:'\xc2\xbc', utf16:'\u00BC' },
        { win1252:'\xc2\xbd', utf16:'\u00BD' },
        { win1252:'\xc2\xbe', utf16:'\u00BE' },
        { win1252:'\xc2\xbf', utf16:'\u00BF' },
        { win1252:'\xc3\x80', utf16:'\u00C0' },
        { win1252:'\xc3\x81', utf16:'\u00C1' },
        { win1252:'\xc3\x82', utf16:'\u00C2' },
        { win1252:'\xc3\x83', utf16:'\u00C3' },
        { win1252:'\xc3\x84', utf16:'\u00C4' },
        { win1252:'\xc3\x85', utf16:'\u00C5' },
        { win1252:'\xc3\x86', utf16:'\u00C6' },
        { win1252:'\xc3\x87', utf16:'\u00C7' },
        { win1252:'\xc3\x88', utf16:'\u00C8' },
        { win1252:'\xc3\x89', utf16:'\u00C9' },
        { win1252:'\xc3\x8a', utf16:'\u00CA' },
        { win1252:'\xc3\x8b', utf16:'\u00CB' },
        { win1252:'\xc3\x8c', utf16:'\u00CC' },
        { win1252:'\xc3\x8d', utf16:'\u00CD' },
        { win1252:'\xc3\x8e', utf16:'\u00CE' },
        { win1252:'\xc3\x8f', utf16:'\u00CF' },
        { win1252:'\xc3\x90', utf16:'\u00D0' },
        { win1252:'\xc3\x91', utf16:'\u00D1' },
        { win1252:'\xc3\x92', utf16:'\u00D2' },
        { win1252:'\xc3\x93', utf16:'\u00D3' },
        { win1252:'\xc3\x94', utf16:'\u00D4' },
        { win1252:'\xc3\x95', utf16:'\u00D5' },
        { win1252:'\xc3\x96', utf16:'\u00D6' },
        { win1252:'\xc3\x97', utf16:'\u00D7' },
        { win1252:'\xc3\x98', utf16:'\u00D8' },
        { win1252:'\xc3\x99', utf16:'\u00D9' },
        { win1252:'\xc3\x9a', utf16:'\u00DA' },
        { win1252:'\xc3\x9b', utf16:'\u00DB' },
        { win1252:'\xc3\x9c', utf16:'\u00DC' },
        { win1252:'\xc3\x9d', utf16:'\u00DD' },
        { win1252:'\xc3\x9e', utf16:'\u00DE' },
        { win1252:'\xc3\x9f', utf16:'\u00DF' },
        { win1252:'\xc3\xa0', utf16:'\u00E0' },
        { win1252:'\xc3\xa1', utf16:'\u00E1' },
        { win1252:'\xc3\xa2', utf16:'\u00E2' },
        { win1252:'\xc3\xa3', utf16:'\u00E3' },
        { win1252:'\xc3\xa4', utf16:'\u00E4' },
        { win1252:'\xc3\xa5', utf16:'\u00E5' },
        { win1252:'\xc3\xa6', utf16:'\u00E6' },
        { win1252:'\xc3\xa7', utf16:'\u00E7' },
        { win1252:'\xc3\xa8', utf16:'\u00E8' },
        { win1252:'\xc3\xa9', utf16:'\u00E9' },
        { win1252:'\xc3\xaa', utf16:'\u00EA' },
        { win1252:'\xc3\xab', utf16:'\u00EB' },
        { win1252:'\xc3\xac', utf16:'\u00EC' },
        { win1252:'\xc3\xad', utf16:'\u00ED' },
        { win1252:'\xc3\xae', utf16:'\u00EE' },
        { win1252:'\xc3\xaf', utf16:'\u00EF' },
        { win1252:'\xc3\xb0', utf16:'\u00F0' },
        { win1252:'\xc3\xb1', utf16:'\u00F1' },
        { win1252:'\xc3\xb2', utf16:'\u00F2' },
        { win1252:'\xc3\xb3', utf16:'\u00F3' },
        { win1252:'\xc3\xb4', utf16:'\u00F4' },
        { win1252:'\xc3\xb5', utf16:'\u00F5' },
        { win1252:'\xc3\xb6', utf16:'\u00F6' },
        { win1252:'\xc3\xb7', utf16:'\u00F7' },
        { win1252:'\xc3\xb8', utf16:'\u00F8' },
        { win1252:'\xc3\xb9', utf16:'\u00F9' },
        { win1252:'\xc3\xba', utf16:'\u00FA' },
        { win1252:'\xc3\xbb', utf16:'\u00FB' },
        { win1252:'\xc3\xbc', utf16:'\u00FC' },
        { win1252:'\xc3\xbd', utf16:'\u00FD' },
        { win1252:'\xc3\xbe', utf16:'\u00FE' },
        { win1252:'\xc3\xbf', utf16:'\u00FF' }
        ];
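
The win1252 keys in the table above are UTF-8 byte sequences stored one byte per character (for example, '\xe2\x82\xac' is the UTF-8 encoding of U+20AC). Because the replacements run one after another, the output of an earlier replacement can pair up with a neighbouring character and match a later entry; the bytes for "Ã" followed by a no-break space appear to end up as "à" this way. A single-pass alternative is sketched below, assuming every character of the input has a code no greater than 0xFF. The function name decodeUtf8BinaryString is hypothetical.

```javascript
// Hypothetical helper: treats each character of `binaryString` as one byte
// and decodes the whole byte sequence as UTF-8 in a single pass.
function decodeUtf8BinaryString(binaryString) {
    const bytes = new Uint8Array(binaryString.length);
    for (let i = 0; i < binaryString.length; i++) {
        const code = binaryString.charCodeAt(i);
        if (code > 0xff) {
            // The input is not a one-byte-per-character string after all.
            throw new RangeError("Not a binary string: char code " + code + " at index " + i);
        }
        bytes[i] = code;
    }
    return new TextDecoder("utf-8").decode(bytes);
}

// UTF-8 bytes of the euro sign.
const euro = decodeUtf8BinaryString("\xe2\x82\xac");
// UTF-8 bytes of "Ã" followed by a no-break space.
const aTildeNbsp = decodeUtf8BinaryString("\xc3\x83\xc2\xa0");
```

Since the decoder sees all bytes at once, no table is needed and characters outside the listed ranges are handled as well.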

Content provided by Stack Overflow.