Getting the length of a string that contains Unicode characters above 0xffff

Question

6

I'm using this character, the double sharp '𝄪', whose Unicode code point is 0x1d12a.
If I use it in a string, I can't get the correct string length:

str = "F𝄪"
str.length // returns 3, even though there are 2 characters!

How can I make a function return the correct answer, whether or not special Unicode characters are used?

- Albizia

1

"̉mủt̉ả̉̉̉t̉ẻd̉W̉ỏ̉r̉̉d̉̉".length == 24 - some characters are longer than expected - Adelin

2

Here is a good blog post on the subject. (link) - Adelin

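It depends on what you are looking for. In JavaScript, strings are sequences of 16-bit "old" Unicode characters. So Unicode code points above 0xffff are encoded as UCS-2, with "surrogates", that is, as two old Unicode characters. New Unicode supports code points up to 10FFFF, so we have UTF-16, and we should treat a character as a code point. [Not considering combining characters and grapheme counting in general] - Giacomo Catenazzi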

@GiacomoCatenazzi: "Unicode code points above 0xffff are encoded as UCS-2" - no, they are encoded as UTF-16. UCS-2 predates UTF-16 and does not support code points > U+FFFF, which is why UTF-16 was created. - Remy Lebeau

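@RemyLebeau: It depends on the point of view. With UTF-16, for consistency, each code point should be thought of as a single character. Many languages predate UTF-16, so they encode with UCS-2. In UCS-2 you have "surrogates" (although code points outside the BMP are not officially supported; this is the compatibility trick in UTF-16's design: it is byte-compatible with UCS-2). In UTF-16 there are no surrogates. - Giacomo Catenazzi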

4 Answers

1

Here is a function I wrote to get the length of a string in code points.

function nbUnicodeLength(string){
    var stringIndex = 0;
    var unicodeIndex = 0;
    var length = string.length;
    var second;
    var first;
    while (stringIndex < length) {

        first = string.charCodeAt(stringIndex);  // returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index.
        if (first >= 0xD800 && first <= 0xDBFF && string.length > stringIndex + 1) {
            second = string.charCodeAt(stringIndex + 1);
            if (second >= 0xDC00 && second <= 0xDFFF) {
                stringIndex += 2;
            } else {
                stringIndex += 1;
            }
        } else {
            stringIndex += 1;
        }

        unicodeIndex += 1;
    }
    return unicodeIndex;
}
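
The same loop can probably be shortened with the built-in String.prototype.codePointAt, which already does the surrogate-pair check. This is only a sketch, assuming an ES2015-capable runtime; the name countCodePoints is made up for illustration and is not part of the answer above.

```javascript
// Hypothetical helper: counts code points by stepping over code units.
function countCodePoints(text) {
  let count = 0;
  let index = 0;
  while (index < text.length) {
    // codePointAt returns a value above 0xFFFF only when a valid
    // high/low surrogate pair starts at this index.
    index += text.codePointAt(index) > 0xFFFF ? 2 : 1;
    count += 1;
  }
  return count;
}

console.log(countCodePoints("F\u{1D12A}")); // 2
console.log(countCodePoints("\uD834"));     // 1 (a lone surrogate counts as one)
```

Like the function above, this counts an unpaired surrogate as a single unit rather than rejecting it.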

- Nathan B

See also https://www.lighttag.io/blog/unicode-surrogate-pairs/. - Sohail Si

0

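To summarize my comments:

That is simply the length of that string.

Some characters are made up of several other characters, even though they look like a single character. "̉mủt̉ả̉̉̉t̉ẻd̉W̉ỏ̉r̉̉d̉̉".length == 24

From this (great) blog post, they have a function that returns the correct length: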

function fancyCount(str){
  const joiner = "\u{200D}";
  const split = str.split(joiner);
  let count = 0;
    
  for(const s of split){
    //removing the variation selectors
    const num = Array.from(s.split(/[\ufe00-\ufe0f]/).join("")).length;
    count += num;
  }
    
  //assuming the joiners are used appropriately
  return count / split.length;
}

console.log(fancyCount("F𝄪") == 2) // true
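
As a side note on the combining-character remark above, a small sketch may help show why such strings look longer than they read. The helper name measure is hypothetical. It compares code units, code points, and the length after NFC normalization. Normalization only helps where Unicode defines a precomposed character, which is assumed here to be the case for "é" but not for "d" with a hook above.

```javascript
// Hypothetical helper: reports three different "lengths" of the same text.
function measure(text) {
  return {
    codeUnits: text.length,                       // UTF-16 code units
    codePoints: [...text].length,                 // Unicode code points
    nfcCodeUnits: text.normalize("NFC").length,   // code units after composing
  };
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT composes to U+00E9.
console.log(measure("e\u0301")); // { codeUnits: 2, codePoints: 2, nfcCodeUnits: 1 }

// "d" followed by U+0309 COMBINING HOOK ABOVE has no precomposed form.
console.log(measure("d\u0309")); // { codeUnits: 2, codePoints: 2, nfcCodeUnits: 2 }
```

So neither length nor a code point count matches what a reader sees once combining marks are involved.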

- Adelin

4

That's too much code. console.log([..."F𝄪"].length); // 2 - daxim

2

for (let i = 0; i < 0x110000; i++) {let c = String.fromCodePoint(i); console.log([...c].length, c);} - daxim

Great! As you can see, I was quoting someone else's findings. Feel free to post an answer. - Adelin

-1

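JavaScript (and Java) strings are encoded in UTF-16.

The Unicode code point U+0046 (F) is encoded in UTF-16 using 1 code unit: 0x0046

The Unicode code point U+1D12A (𝄪) is encoded in UTF-16 using 2 code units (known as a "surrogate pair"): 0xD834 0xDD2A

That is why you get a length of 3 instead of 2. length counts the number of encoded code units, not the number of Unicode code points.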
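
The two code units quoted above can be reproduced by hand. The following is a minimal sketch of the UTF-16 surrogate arithmetic. The function name toSurrogatePair is made up for illustration and is not a built-in.

```javascript
// Hypothetical helper: splits a supplementary code point into its
// UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("expected a code point in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;            // 20-bit value
  const highUnit = 0xD800 + (offset >> 10);      // top 10 bits
  const lowUnit = 0xDC00 + (offset & 0x3FF);     // bottom 10 bits
  return [highUnit, lowUnit];
}

const pair = toSurrogatePair(0x1D12A);
console.log(pair.map((unit) => unit.toString(16)));          // [ 'd834', 'dd2a' ]
console.log(String.fromCharCode(...pair) === "\u{1D12A}");   // true
```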

- Remy Lebeau

- daxim · Accepted Answer

String.prototype.codes = function() { return [...this].length };
String.prototype.chars = function() {
    let GraphemeSplitter = require('grapheme-splitter');
    return (new GraphemeSplitter()).countGraphemes(this);
}

console.log("F𝄪".codes());     // 2
console.log("👩‍❤️‍💋‍👩".codes());     // 8
console.log("❤️".codes());      // 2

console.log("F𝄪".chars());     // 2
console.log("👩‍❤️‍💋‍👩".chars());     // 1
console.log("❤️".chars());      // 1
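
Where adding a dependency is not desirable, the built-in Intl.Segmenter may serve as an alternative for counting graphemes. This is a sketch only. It assumes a reasonably recent runtime that ships Intl.Segmenter, and the function name countGraphemes and the default locale "en" are choices made here for illustration.

```javascript
// Hypothetical helper: counts user-perceived characters (grapheme clusters)
// using the built-in Intl.Segmenter instead of a third-party package.
function countGraphemes(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(text)) {
    count += 1;
  }
  return count;
}

console.log(countGraphemes("F\u{1D12A}"));    // 2
console.log(countGraphemes("\u2764\uFE0F"));  // 1 (heart + variation selector)
```

Unlike the prototype extensions above, this keeps String.prototype untouched, which tends to be safer in shared code.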