JavaScript and string manipulation with UTF-16 surrogate pairs

Question

18

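I'm working on a Twitter app and have just stumbled into the world of UTF-8(16). It seems that most JavaScript string functions are as blind to surrogate pairs as I was. I have to recode some things to make them wide-character aware.

I have this function that parses a string into an array while preserving the surrogate pairs. I will then rewrite several functions to work on arrays rather than strings.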

function sortSurrogates(str){
  var cp = [];                 // array to hold code points
  while(str.length){           // loop till we've done the whole string
    if(/^[\uD800-\uDBFF][\uDC00-\uDFFF]/.test(str)){ // high surrogate followed by a low surrogate
                               // keep the pair together as one code point
      cp.push(str.substr(0,2)); // push the two onto array
      str = str.substr(2);     // clip the two off the string
    }else{                     // else BMP code point
      cp.push(str.substr(0,1)); // push one onto array
      str = str.substr(1);     // clip one from string 
    }
  }                            // loop
  return cp;                   // return the array
}

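My question is: am I overlooking something simpler? I see many people repeating that JavaScript handles UTF-16 natively, yet my testing leads me to believe that UTF-16 may be the data format, but the functions don't know it yet. Am I missing something simple?

EDIT: To help illustrate the issue: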

var a = "0123456789"; // U+0030 - U+0039 2 bytes each
var b = "𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡"; // U+1D7D8 - U+1D7E1 4 bytes each
alert(a.length); // javascript shows 10
alert(b.length); // javascript shows 20

Twitter sees both of these as being 10 characters long.

- BentFX

1

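Basic manipulation. Twitter doesn't return links inline; it returns plain text plus the URLs and the indices where they belong. Those indices are based on code points, not 16-bit characters. I also have a textarea for composing tweets. JavaScript's simple character count is a count of 16-bit chunks, not of individual code points. I can work around it; I just don't want to head off in the wrong direction without asking the experts. - BentFX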
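To make the index mismatch concrete, here is a minimal sketch of converting a code point index (the kind Twitter is described as returning) into a UTF-16 code unit index usable with slice(). The helper name codePointIndexToUnitIndex is hypothetical and not part of any API mentioned in this thread.

```javascript
// Hypothetical helper: map a code point index to the matching
// UTF-16 code unit index in a JavaScript string.
function codePointIndexToUnitIndex(str, cpIndex) {
  var unitIndex = 0; // position in 16-bit code units
  var seen = 0;      // code points consumed so far
  while (seen < cpIndex && unitIndex < str.length) {
    var c = str.charCodeAt(unitIndex);
    var next = str.charCodeAt(unitIndex + 1); // NaN past the end of the string
    var isPair = c >= 0xD800 && c <= 0xDBFF &&   // high surrogate ...
                 next >= 0xDC00 && next <= 0xDFFF; // ... followed by a low surrogate
    unitIndex += isPair ? 2 : 1;
    seen++;
  }
  return unitIndex;
}

// "a", U+1D7D8, "b": 3 code points but 4 code units
var tweet = "a\uD835\uDFD8b";
console.log(codePointIndexToUnitIndex(tweet, 2)); // 3
console.log(tweet.slice(codePointIndexToUnitIndex(tweet, 2))); // "b"
```

An unpaired surrogate is counted as a single unit here, which is an assumption about how such input should be treated.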

6

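JavaScript uses UCS-2 internally, not UTF-16. Because of this, handling Unicode in JavaScript is very difficult, and I don't suggest attempting it. As for what Twitter does, you seem to be saying that it is sanely counting by code point rather than by code unit. - tchrist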

2

@tchrist: What do you mean? JavaScript strings, as exposed to developers, are UTF-16 encoded. - Tim Down

4

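@Tim: They are exposed as UCS-2 strings of individual code units, not as strings of Unicode code points. You can prove this with regexes. Try writing [-] in a pattern and see what happens. It's just broken. If JavaScript actually used UTF-16, I would be able to write document.write(String.fromCharCode(0x1D49C)) instead of being forced to write document.write(String.fromCharCode(0xD835,0xDC9C)). This is wrong-headed UCS-2 insanity. - tchrist

@tchrist: You are correct, sorry. - Tim Down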

5 Answers

12

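I've knocked together the starting point for a Unicode string handling object. It creates a function called UnicodeString() that accepts either a JavaScript string or an array of integers representing Unicode code points, and provides length and codePoints properties plus toString() and slice() methods. Adding regular expression support would be very complicated, but things like indexOf() and split() (without regex support) should be pretty easy to implement.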

var UnicodeString = (function() {
    function surrogatePairToCodePoint(charCode1, charCode2) {
        return ((charCode1 & 0x3FF) << 10) + (charCode2 & 0x3FF) + 0x10000;
    }

    function stringToCodePointArray(str) {
        var codePoints = [], i = 0, charCode;
        while (i < str.length) {
            charCode = str.charCodeAt(i);
            if ((charCode & 0xF800) == 0xD800) {
                codePoints.push(surrogatePairToCodePoint(charCode, str.charCodeAt(++i)));
            } else {
                codePoints.push(charCode);
            }
            ++i;
        }
        return codePoints;
    }

    function codePointArrayToString(codePoints) {
        var stringParts = [];
        for (var i = 0, len = codePoints.length, codePoint, offset, codePointCharCodes; i < len; ++i) {
            codePoint = codePoints[i];
            if (codePoint > 0xFFFF) {
                offset = codePoint - 0x10000;
                codePointCharCodes = [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
            } else {
                codePointCharCodes = [codePoint];
            }
            stringParts.push(String.fromCharCode.apply(String, codePointCharCodes));
        }
        return stringParts.join("");
    }

    function UnicodeString(arg) {
        if (this instanceof UnicodeString) {
            this.codePoints = (typeof arg == "string") ? stringToCodePointArray(arg) : arg;
            this.length = this.codePoints.length;
        } else {
            return new UnicodeString(arg);
        }
    }

    UnicodeString.prototype = {
        slice: function(start, end) {
            return new UnicodeString(this.codePoints.slice(start, end));
        },

        toString: function() {
            return codePointArrayToString(this.codePoints);
        }
    };


    return UnicodeString;
})();

var ustr = UnicodeString("fbar");
document.getElementById("output").textContent = "String: '" + ustr + "', length: " + ustr.length + ", slice(2, 4): " + ustr.slice(2, 4);

<div id="output"></div>
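The answer mentions that indexOf() should be easy to add. A minimal sketch of that idea, assuming both operands are already arrays of code point integers like the codePoints property above; the function name codePointIndexOf is hypothetical and is not part of the posted object.

```javascript
// Hypothetical helper: find the first position at which `needle` occurs
// in `haystack`, both given as arrays of code point integers.
// Returns -1 when there is no match.
function codePointIndexOf(haystack, needle, fromIndex) {
  var start = fromIndex || 0;
  var last = haystack.length - needle.length; // last viable start position
  for (var i = start; i <= last; i++) {
    var j = 0;
    while (j < needle.length && haystack[i + j] === needle[j]) {
      j++;
    }
    if (j === needle.length) {
      return i; // every element of needle matched
    }
  }
  return -1;
}

// "a", U+1D306, "b", "c" expressed as code points
var hay = [0x61, 0x1D306, 0x62, 0x63];
console.log(codePointIndexOf(hay, [0x62, 0x63])); // 2
console.log(codePointIndexOf(hay, [0x7A]));       // -1
```

Because the comparison works on whole code points, a match can never start or end in the middle of a surrogate pair.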

- Tim Down

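Thanks for the effort. I'm very new to JavaScript object structure, and there is a lot to learn from your example. My coding is purely recreational, and I enjoy a good puzzle. Unicode in JavaScript seems to be a lot like a Rubik's cube, but with fewer correct solutions. :) - BentFX

It says something disappointing about the current state of JavaScript when a Stack Overflow answer with 4 upvotes is the best way to handle UTF-16 Unicode. That said, it does an excellent job for my current task (slicing tweets that contain emoji)! - Matt Vukas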

6

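Here are a couple of scripts related to surrogate pairs in JavaScript that might be helpful:

ES6 Unicode shims for ES3+ adds the String.fromCodePoint and String.prototype.codePointAt methods from ECMAScript 6. The ES3/5 fromCharCode and charCodeAt methods do not account for surrogate pairs and therefore give wrong results for code points above U+FFFF.
Full 21-bit Unicode code point matching in XRegExp with \u{10FFFF} allows matching any individual code point in XRegExp regexes.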
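A short sketch of the difference between the two pairs of methods, assuming an ES2015+ engine (or a shim providing String.fromCodePoint and codePointAt). The variable name scriptA is only for illustration.

```javascript
// Assumes String.fromCodePoint and String.prototype.codePointAt are available.
var scriptA = String.fromCodePoint(0x1D49C); // stored as the pair D835 DC9C

console.log(scriptA.length);                      // 2 (code units)
console.log(scriptA.codePointAt(0).toString(16)); // "1d49c"
console.log(scriptA.charCodeAt(0).toString(16));  // "d835" (high surrogate only)

// fromCharCode keeps only the low 16 bits of its argument,
// so 0x1D49C silently becomes U+D49C.
console.log(String.fromCharCode(0x1D49C) === "\uD49C"); // true
```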

- slevithan

5

JavaScript string iterators give you the actual characters (code points) rather than individual surrogate code units:

>>> [..."0123456789"]
["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
>>> [..."𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡"]
["𝟘", "𝟙", "𝟚", "𝟛", "𝟜", "𝟝", "𝟞", "𝟟", "𝟠", "𝟡"]
>>> [..."0123456789"].length
10
>>> [..."𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡"].length
10
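Building on the iterator, a minimal sketch of slicing by code point, assuming an ES2015+ engine with Array.from. The helper name sliceCodePoints is hypothetical.

```javascript
// Hypothetical helper: slice by code points instead of 16-bit code units.
function sliceCodePoints(str, start, end) {
  // Array.from uses the string iterator, so each element is one code point.
  return Array.from(str).slice(start, end).join("");
}

var b = "\uD835\uDFD8\uD835\uDFD9\uD835\uDFDA"; // U+1D7D8, U+1D7D9, U+1D7DA
console.log(b.length);             // 6 code units
console.log(Array.from(b).length); // 3 code points
console.log(sliceCodePoints(b, 1, 2) === "\uD835\uDFD9"); // true
```

Note that the iterator yields code points, not user-perceived characters: a base letter followed by a combining mark still counts as two elements.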

- rumpel

3

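This is more along the lines of what I was looking for, though it needs better support for the different string functions. As I add to it, I will update this answer.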

function wString(str){
  var T = this; //makes 'this' visible in functions
  T.cp = [];    //code point array
  T.length = 0; //length attribute
  T.wString = true; // (item.wString) tests for wString object

//member functions
  var sortSurrogates = function(s){  // returns an array of code points (as strings)
    var chrs = [];
    while(s.length){             // loop till we've done the whole string
      if(/^[\uD800-\uDBFF][\uDC00-\uDFFF]/.test(s)){ // high surrogate followed by a low surrogate
                                 // keep the pair together as one element
        chrs.push(s.substr(0,2)); // push the two onto array
        s = s.substr(2);         // clip the two off the string
      }else{                     // else BMP code point
        chrs.push(s.substr(0,1)); // push one onto array
        s = s.substr(1);         // clip one from string 
      }
    }                            // loop
    return chrs;
  };
//end member functions

//prototype functions
  T.substr = function(start,len){
    if(len){
      return T.cp.slice(start,start+len).join('');
    }else{
      return T.cp.slice(start).join('');
    }
  };

  T.substring = function(start,end){
    return T.cp.slice(start,end).join('');
  };

  T.replace = function(target,str){
    //allow wStrings as parameters
    if(str.wString) str = str.cp.join('');
    if(target.wString) target = target.cp.join('');
    return T.toString().replace(target,str);
  };

  T.equals = function(s){
    if(!s.wString){
      s = sortSurrogates(s);
      T.cp = s;
    }else{
        T.cp = s.cp;
    }
    T.length = T.cp.length;
  };

  T.toString = function(){return T.cp.join('');};
//end prototype functions

  T.equals(str);
}

Test results:

// plain string
var x = "0123456789";
alert(x);                    // 0123456789
alert(x.substr(4,5))         // 45678
alert(x.substring(2,4))      // 23
alert(x.replace("456","x")); // 0123x789
alert(x.length);             // 10

// wString object
x = new wString("𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡");
alert(x);                    // 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡
alert(x.substr(4,5))         // 𝟜𝟝𝟞𝟟𝟠
alert(x.substring(2,4))      // 𝟚𝟛
alert(x.replace("𝟜𝟝𝟞","x")); // 𝟘𝟙𝟚𝟛x𝟟𝟠𝟡
alert(x.length);             // 10

- BentFX

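This looks fairly similar to my effort. - Tim Down

@Tim Yes, the structure is different, but it is mostly a matter of prototyping the functions I need. The main difference is that you encode the code units into code points. I chose not to, because I have no use for true code points; JavaScript can't display them, so why bother. For my purposes, keeping the pairs together so that counting and splitting happen at sane points is enough. Have fun! - BentFX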

1

OK, if you don't need real code points, then don't use them. You might need them if you have to send Unicode strings to a server. - Tim Down

From Angular JS: "...".replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, function(value) { var hi = value.charCodeAt(0); var low = value.charCodeAt(1); return '&#' + (((hi - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000) + ';'; }) This creates an entity-encoded value that can be safely inserted into an attribute or element body. - Ajax
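The one-liner in the comment above can be unpacked into a small function. This is a sketch of the same arithmetic; the wrapper name encodeAstralAsEntities is hypothetical and is not an Angular API.

```javascript
// Hypothetical wrapper around the replace() call quoted in the comment:
// turn each surrogate pair into a decimal numeric character reference.
function encodeAstralAsEntities(text) {
  return text.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, function (pair) {
    var hi = pair.charCodeAt(0);  // high surrogate, 0xD800..0xDBFF
    var low = pair.charCodeAt(1); // low surrogate, 0xDC00..0xDFFF
    var codePoint = (hi - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
    return "&#" + codePoint + ";";
  });
}

// U+1D7D8 is 120792 in decimal
console.log(encodeAstralAsEntities("a\uD835\uDFD8b")); // "a&#120792;b"
```

Characters in the Basic Multilingual Plane pass through unchanged, so this step alone does not escape markup characters such as < or &.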

- tchrist · Accepted Answer

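JavaScript uses UCS-2 internally, not UTF-16. Because of this, handling Unicode in JavaScript is very difficult, and I don't suggest attempting it.

As for what Twitter does, you seem to be saying that it is sanely counting by code point rather than insanely by code unit.

Unless you have no choice, you should use a programming language that actually supports Unicode and has a code-point interface rather than a code-unit interface. JavaScript isn't good enough for that, as you have discovered.

It has The UCS-2 Curse, which is even worse than The UTF-16 Curse, which is already bad enough. I talk about all of this in my OSCON talk, Unicode Support Shootout: The Good, the Bad, & the (mostly) Ugly.

Because of its horrible Curse, you have to hand-simulate UTF-16 on top of UCS-2 in JavaScript, which is simply nuts.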

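JavaScript suffers from all kinds of other terrible Unicode troubles, too. It has no support for graphemes, normalization, or collation, all of which you really need. Its regexes are broken too, sometimes because of the Curse and sometimes just because people got it wrong. For example, JavaScript is incapable of expressing regexes like [-]. JavaScript doesn't even support case folding, so you can't write a pattern like /ΣΤΙΓΜΑΣ/i and have it correctly match στιγμας.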
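For readers on newer engines: ECMAScript 2015, which postdates this answer, added a u flag for regular expressions and String.prototype.normalize, which touch two of the gaps listed above. A minimal sketch, assuming an ES2015+ engine; the variable name astralRange is only for illustration.

```javascript
// Assumes an ES2015+ engine (u flag, \u{...} escapes, normalize()).

// With the u flag, a character class can span code points above U+FFFF.
var astralRange = /^[\u{1D7D8}-\u{1D7E1}]+$/u;
console.log(astralRange.test("\uD835\uDFD8\uD835\uDFE1")); // true (U+1D7D8, U+1D7E1)
console.log(astralRange.test("09"));                       // false

// normalize() converts between composed and decomposed forms.
var decomposed = "e\u0301"; // "e" followed by a combining acute accent
console.log(decomposed.normalize("NFC") === "\u00E9"); // true
```

Strings are still indexed by 16-bit code units in these engines, so length and charAt() behave exactly as described in the question.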

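You can try the XRegExp plugin, but that won't banish the Curse. Only switching to a language with Unicode support will do that, and JavaScript just isn't one of those.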