Extract a substring by UTF-8 byte positions

10

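I have a string, a start position, and a length, and I need to extract a substring from it. Both the start and the length are byte offsets computed against the original UTF-8 string.

However, there is a problem:

The start and length are in bytes, so I cannot use "substring". The UTF-8 string contains several multi-byte characters. Is there a hyper-efficient way to do this? (I don't need to decode the bytes...)

Example: var orig = '你好吗?'

If s and e are 3 and 3 respectively, this extracts the second character, 好. I'm looking for

var result = orig.substringBytes(3,3);

Help!

Update #1: In C/C++ I would just cast it to a byte array, but I'm not sure whether there is an equivalent in JavaScript. By the way, yes, we could parse it into a byte array and then parse it back into a string, but it seems there should be a quick way to cut it at the right place. Imagine 'orig' is 1000000 characters, s = 6 bytes, and l = 3 bytes.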

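Update #2: Thanks to zerkms's helpful redirection, I ended up with the following, which does not work correctly: it works for multi-byte characters but gets messed up for single-byte ones.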

function substrBytes(str, start, length)
{
    var ch, startIx = 0, endIx = 0, re = '';
    for (var i = 0; 0 < str.length; i++)
    {
        startIx = endIx++;

        ch = str.charCodeAt(i);
        do {
            ch = ch >> 8;   // a better way may exist to measure ch len
            endIx++;
        }
        while (ch);

        if (endIx > start + length)
        {
            return re;
        }
        else if (startIx >= start)
        {
            re += str[i];
        }
    }
}

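Update #3: I don't think shifting the char code really works. I am reading two bytes when the correct answer is three, and somehow I always forget this. The code point is the same in UTF-8 and UTF-16, but the number of bytes taken up depends on the encoding! So this is not the right way to do it.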


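The start and length of substr are counted in characters, not bytes. - nhahtdh
https://dev59.com/UnM_5IYBdhLWcg3ww2EQ - zerkms
1
@zerkms - I found that too, but I think decoding the whole string into bytes, picking out the substring, and converting back would be very inefficient. What if there are ten million characters and I want bytes 6 through 12? Converting the entire string seems like a terrible idea. - tofutim
I updated my answer to make the code compatible with UTF-8 input. It now does exactly what you asked for and does not depend on Buffer() - Kaii
If you can, please change the input format of your "start" and "length" parameters to characters. That would greatly improve performance, because JS cannot really handle utf-8 strings at the byte level. (As stated, all input is converted to utf-16 internally.) - Kaii
7 Answers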

11

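I had a fun time fiddling with this. Hope this helps.

Because JavaScript does not allow direct byte access on a string, the only way to find the start position is a forward scan.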
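A minimal sketch of such a forward scan, written under the assumption that the byte offsets fall on character boundaries. The names `utf8Length` and `substringByBytes` are hypothetical and are not taken from any answer in this thread. The scan walks the string by code point, so a surrogate pair counts as one 4-byte character, and it stops as soon as the requested byte range is covered instead of visiting the whole string.

```javascript
// Hypothetical helper: number of UTF-8 bytes needed for one code point.
function utf8Length(codePoint) {
    if (codePoint < 0x80) return 1;
    if (codePoint < 0x800) return 2;
    if (codePoint < 0x10000) return 3;
    return 4;
}

// Hypothetical sketch: convert UTF-8 byte offsets to string indices by
// scanning forward. Assumes start and length fall on character boundaries.
function substringByBytes(str, startInBytes, lengthInBytes) {
    var endInBytes = startInBytes + lengthInBytes;
    var bytePos = 0;
    var startIdx = -1;
    var i = 0;
    while (i < str.length && bytePos < endInBytes) {
        // Remember the string index where the byte range begins.
        if (startIdx < 0 && bytePos >= startInBytes) startIdx = i;
        var cp = str.codePointAt(i);
        bytePos += utf8Length(cp);
        // Code points above U+FFFF occupy two UTF-16 code units.
        i += cp > 0xFFFF ? 2 : 1;
    }
    if (startIdx < 0) return '';
    return str.slice(startIdx, i);
}

console.log(substringByBytes('abc你好吗', 3, 3)); // "你"
console.log(substringByBytes('abc你好吗', 6, 6)); // "好吗"
```

For a very long string and a range near the beginning, only the first few characters are examined, which addresses the efficiency concern raised in the question.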


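Update #3: I don't think shifting the char code really works. I am reading two bytes when the correct answer is three, and somehow I always forget this. The code point is the same in UTF-8 and UTF-16, but the number of bytes taken up depends on the encoding! So this is not the right way to do it.

This is not correct. Actually, there are no UTF-8 strings in JavaScript. According to the ECMAScript 262 specification, all strings, regardless of the input encoding, must be stored internally as UTF-16 ("[sequence of] 16-bit unsigned integers").

With that in mind, the 8-bit shift is correct (but unnecessary).

The wrong assumption is that your character is stored as a 3-byte sequence...
In fact, every element of a JS (ECMA-262) string is a 16-bit (2-byte) code unit.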
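The difference between UTF-16 code units and UTF-8 bytes can be seen with a short sketch. It assumes an environment that provides the standard `TextEncoder` global (modern browsers and recent Node.js); the helper name `unitCounts` is hypothetical.

```javascript
// TextEncoder always produces UTF-8 bytes.
var utf8Encoder = new TextEncoder();

// Hypothetical helper: compare what String.length counts (UTF-16 code units)
// with the number of bytes the same text needs in UTF-8.
function unitCounts(s) {
    return {
        codeUnits: s.length,
        utf8Bytes: utf8Encoder.encode(s).length
    };
}

console.log(unitCounts('a'));            // { codeUnits: 1, utf8Bytes: 1 }
console.log(unitCounts('\u00FC'));       // ü:  { codeUnits: 1, utf8Bytes: 2 }
console.log(unitCounts('\u4F60'));       // 你: { codeUnits: 1, utf8Bytes: 3 }
console.log(unitCounts('\uD83D\uDE00')); // U+1F600: { codeUnits: 2, utf8Bytes: 4 }
```

The last line shows why code points at or above U+10000 need separate handling: they occupy two code units in the string but four bytes in UTF-8.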

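This can be solved by manually converting multi-byte characters to utf-8, as the code below does.


Update: This solution does not handle code points ≥ U+10000 (including emoji). See APerson's answer for a more complete solution.


See my example code for details: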

function encode_utf8( s )
{
  return unescape( encodeURIComponent( s ) );
}

function substr_utf8_bytes(str, startInBytes, lengthInBytes) {

   /* this function scans a multibyte string and returns a substring. 
    * arguments are start position and length, both defined in bytes.
    * 
    * this is tricky, because javascript only allows character level 
    * and not byte level access on strings. Also, all strings are stored
    * in utf-16 internally - so we need to convert characters to utf-8
    * to detect their length in utf-8 encoding.
    *
    * the startInBytes and lengthInBytes parameters are based on byte 
    * positions in a utf-8 encoded string.
    * in utf-8, for example: 
    *       "a" is 1 byte, 
    *       "ü" is 2 bytes, 
    *  and  "你" is 3 bytes.
    *
    * NOTE:
    * according to ECMAScript 262 all strings are stored as a sequence
    * of 16-bit characters. so we need an encode_utf8() function to safely
    * detect the length our character would have in a utf8 representation.
    * 
    * http://www.ecma-international.org/publications/files/ecma-st/ECMA-262.pdf
    * see "4.3.16 String Value":
    * > Although each value usually represents a single 16-bit unit of 
    * > UTF-16 text, the language does not place any restrictions or 
    * > requirements on the values except that they be 16-bit unsigned 
    * > integers.
    */

    var resultStr = '';
    var startInChars = 0;

    // scan string forward to find index of first character
    // (convert start position in byte to start position in characters)

    for (var bytePos = 0; bytePos < startInBytes; startInChars++) {

        // get numeric code of character (is >= 128 for multibyte character)
        // and increase "bytePos" for each byte of the character sequence

        var ch = str.charCodeAt(startInChars);
        bytePos += (ch < 128) ? 1 : encode_utf8(str[startInChars]).length;
    }

    // now that we have the position of the starting character,
    // we can build the resulting substring

    // as we don't know the end position in chars yet, we start with a mix of
    // chars and bytes. we decrease "end" by the byte count of each selected 
    // character to end up in the right position
    var end = startInChars + lengthInBytes - 1;

    for (var n = startInChars; startInChars <= end; n++) {
        // get numeric code of character (is >= 128 for multibyte character)
        // and decrease "end" for each byte of the character sequence
        ch = str.charCodeAt(n);
        end -= (ch < 128) ? 1 : encode_utf8(str[n]).length;

        resultStr += str[n];
    }

    return resultStr;
}

var orig = 'abc你好吗?';

alert('res: ' + substr_utf8_bytes(orig, 0, 2)); // alerts: "ab"
alert('res: ' + substr_utf8_bytes(orig, 2, 1)); // alerts: "c"
alert('res: ' + substr_utf8_bytes(orig, 3, 3)); // alerts: "你"
alert('res: ' + substr_utf8_bytes(orig, 6, 6)); // alerts: "好吗"

Note that this answer does not handle code points at or above U+10000, which includes emoji. See my answer. - APerson

8

@Kaii's answer is almost correct, but it has a bug: it fails for Unicode characters whose char code is between 128 and 255. Here is the revised version (just change 256 to 128):

function encode_utf8( s )
{
  return unescape( encodeURIComponent( s ) );
}

function substr_utf8_bytes(str, startInBytes, lengthInBytes) {

   /* this function scans a multibyte string and returns a substring. 
    * arguments are start position and length, both defined in bytes.
    * 
    * this is tricky, because javascript only allows character level 
    * and not byte level access on strings. Also, all strings are stored
    * in utf-16 internally - so we need to convert characters to utf-8
    * to detect their length in utf-8 encoding.
    *
    * the startInBytes and lengthInBytes parameters are based on byte 
    * positions in a utf-8 encoded string.
    * in utf-8, for example: 
    *       "a" is 1 byte, 
    *       "ü" is 2 bytes, 
    *  and  "你" is 3 bytes.
    *
    * NOTE:
    * according to ECMAScript 262 all strings are stored as a sequence
    * of 16-bit characters. so we need an encode_utf8() function to safely
    * detect the length our character would have in a utf8 representation.
    * 
    * http://www.ecma-international.org/publications/files/ecma-st/ECMA-262.pdf
    * see "4.3.16 String Value":
    * > Although each value usually represents a single 16-bit unit of 
    * > UTF-16 text, the language does not place any restrictions or 
    * > requirements on the values except that they be 16-bit unsigned 
    * > integers.
    */

    var resultStr = '';
    var startInChars = 0;

    // scan string forward to find index of first character
    // (convert start position in byte to start position in characters)

    for (var bytePos = 0; bytePos < startInBytes; startInChars++) {

        // get numeric code of character (is >= 128 for multibyte character)
        // and increase "bytePos" for each byte of the character sequence

        var ch = str.charCodeAt(startInChars);
        bytePos += (ch < 128) ? 1 : encode_utf8(str[startInChars]).length;
    }

    // now that we have the position of the starting character,
    // we can build the resulting substring

    // as we don't know the end position in chars yet, we start with a mix of
    // chars and bytes. we decrease "end" by the byte count of each selected 
    // character to end up in the right position
    var end = startInChars + lengthInBytes - 1;

    for (var n = startInChars; startInChars <= end; n++) {
        // get numeric code of character (is >= 128 for multibyte character)
        // and decrease "end" for each byte of the character sequence
        ch = str.charCodeAt(n);
        end -= (ch < 128) ? 1 : encode_utf8(str[n]).length;

        resultStr += str[n];
    }

    return resultStr;
}

var orig = 'abc你好吗？©'; // '？' is the full-width question mark (3 bytes in UTF-8)

alert('res: ' + substr_utf8_bytes(orig, 0, 2)); // alerts: "ab"
alert('res: ' + substr_utf8_bytes(orig, 2, 1)); // alerts: "c"
alert('res: ' + substr_utf8_bytes(orig, 3, 3)); // alerts: "你"
alert('res: ' + substr_utf8_bytes(orig, 6, 6)); // alerts: "好吗"
alert('res: ' + substr_utf8_bytes(orig, 15, 2)); // alerts: "©"

By the way, this is a bug fix, and it should be useful to anyone who runs into the same problem.
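The reason the threshold has to be 128 rather than 256 can be checked directly at the encoding boundaries. The sketch below reuses the same unescape(encodeURIComponent(...)) trick as encode_utf8() above; the helper name `utf8ByteLength` is hypothetical, and it is only meant for single characters below U+10000.

```javascript
// Hypothetical helper: UTF-8 byte length of a single BMP character.
// encodeURIComponent percent-encodes the UTF-8 bytes, and unescape turns
// each %XX back into one character, so the length equals the byte count.
function utf8ByteLength(ch) {
    return unescape(encodeURIComponent(ch)).length;
}

console.log(utf8ByteLength('\u007F')); // 1 (last single-byte code point)
console.log(utf8ByteLength('\u0080')); // 2 (first two-byte code point)
console.log(utf8ByteLength('\u00A9')); // 2 (©, char code 169)
console.log(utf8ByteLength('\u07FF')); // 2 (last two-byte code point)
console.log(utf8ByteLength('\u0800')); // 3 (first three-byte code point)
```

Every char code from 128 upward already needs at least two bytes, so treating codes 128 to 255 as single-byte undercounts characters such as ©.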


1
I have credited you for this and edited my answer. Thanks for the sharp eye. - Kaii

3
function substrBytes(str, start, length)
{
    // Node.js only: Buffer.from() replaces the deprecated new Buffer()
    var buf = Buffer.from(str, 'utf8');
    return buf.subarray(start, start + length).toString('utf8');
}

AYB


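I tried this, but I don't have a Buffer() object. Which framework are you using? - Kaii
This does not work for me in Node.js. It returns a bunch of question-mark characters. Regular substr works fine. - Gavin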
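Where Buffer is not available, the same encode, slice, decode idea can be sketched with the standard `TextEncoder` and `TextDecoder` globals, assuming the environment provides them (modern browsers and recent Node.js). The function name `substrBytesEncoded` is hypothetical. Note that this encodes the entire string first, so it is exactly the full-conversion approach the question hoped to avoid.

```javascript
// Hypothetical sketch: encode to UTF-8, take a byte range, decode back.
// Costs time and memory proportional to the whole string.
function substrBytesEncoded(str, start, length) {
    var bytes = new TextEncoder().encode(str);        // Uint8Array of UTF-8 bytes
    var part = bytes.subarray(start, start + length); // a view, not a copy
    return new TextDecoder('utf-8').decode(part);
}

console.log(substrBytesEncoded('你好吗', 3, 3));    // "好"
console.log(substrBytesEncoded('abc你好吗', 0, 2)); // "ab"
```

If the byte range cuts through the middle of a character, the decoder substitutes the U+FFFD replacement character for the incomplete sequence instead of throwing.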

1
Maybe use this to count bytes, with an example. It counts the character 你 as 2 bytes, rather than the 3 bytes that @Kaii's function gives:
jQuery.byteLength = function(target) {
    try {
        var i = 0;
        var length = 0;
        var count = 0;
        var character = '';
        //
        target = jQuery.castString(target);
        length = target.length;
        //
        for (i = 0; i < length; i++) {
            // extract one character and convert it to its Unicode char code
            character = target.charCodeAt(i);
            //
            // Unicode half-width characters: 0x0 - 0x80, 0xf8f0, 0xff61 - 0xff9f,
            // 0xf8f1 - 0xf8f3
            if ((character >= 0x0 && character < 0x81)
                    || (character == 0xf8f0)
                    || (character > 0xff60 && character < 0xffa0)
                    || (character > 0xf8f0 && character < 0xf8f4)) {
                // 1-byte character
                count += 1;
            } else {
                // 2-byte character
                count += 2;
            }
        }
        //
        return (count);
    } catch (e) {
        jQuery.showErrorDetail(e, 'byteLength');
        return (0);
    }
};

for (var j = 1, len = value.length; j <= len; j++) {
    var slice = value.slice(0, j);
    var slength = $.byteLength(slice);
    if ( slength == 106 ) {
        $(this).val(slice);
        break;
    }
}

1

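Kaii's answer is good, but it cannot handle code points at or above U+10000 (such as emoji), because they are stored as surrogate pairs, and passing one half of a pair to encodeURIComponent throws an error. I copied it and changed a few things: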

// return how many bytes the UTF-16 code unit `s` would be, if represented in utf8
function utf8_len(s) {
    var charCode = s.charCodeAt(0);
    if (charCode < 128) return 1;
    if (charCode < 2048) return 2;
    if ((55296 <= charCode) && (charCode <= 56319)) return 4; // UTF-16 high surrogate
    if ((56320 <= charCode) && (charCode <= 57343)) return 0; // UTF-16 low surrogate
    if (charCode < 65536) return 3;
    throw 'Bad char';
}

// Returns the substring of `str` starting at UTF-8 byte index `startInBytes`,
// that extends for `lengthInBytes` UTF-8 bytes. May misbehave if the
// specified string does NOT start and end on character boundaries.
function substr_utf8_bytes(str, startInBytes, lengthInBytes) {
    var currCharIdx = 0;

    // Scan through the string, looking for the start of the substring
    var bytePos = 0;
    while (bytePos < startInBytes) {
        var utf8Len = utf8_len(str.charAt(currCharIdx));
        bytePos += utf8Len;
        currCharIdx++;

        // Make sure to include low surrogate
        if ((utf8Len == 4) && (bytePos == startInBytes)) {
            currCharIdx++;
        }
    }

    // We've found the substring; copy it to resultStr character by character
    var resultStr = '';
    var currLengthInBytes = 0;
    while (currLengthInBytes < lengthInBytes) {
        var utf8Len = utf8_len(str.charAt(currCharIdx));
        currLengthInBytes += utf8Len;
        resultStr += str[currCharIdx];
        currCharIdx++;

        // Make sure to include low surrogate
        if ((utf8Len == 4) && (currLengthInBytes == lengthInBytes)) {
            resultStr += str[currCharIdx];
        }
    }

    return resultStr;
}

var orig2 = 'abc你好吗?';

// the emoji below are illustrative 4-byte (surrogate pair) characters
console.log('res: ' + substr_utf8_bytes('😀', 0, 4)); // logs: "😀"
console.log('res: ' + substr_utf8_bytes('😀😁', 0, 4)); // logs: "😀"
console.log('res: ' + substr_utf8_bytes('😀😁', 4, 4)); // logs: "😁"
console.log('res: ' + substr_utf8_bytes(orig2, 0, 2)); // logs: "ab"
console.log('res: ' + substr_utf8_bytes(orig2, 2, 1)); // logs: "c"
console.log('res: ' + substr_utf8_bytes(orig2, 3, 3)); // logs: "你"
console.log('res: ' + substr_utf8_bytes(orig2, 6, 6)); // logs: "好吗"

(Note that "char" in the variable names should really be something more like "code unit", but I was a bit lazy.)
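The failure that motivates this answer can be reproduced in a few lines. This is only an illustrative sketch: the helper name `tryEncode` and the sample emoji are assumptions, not part of the code above.

```javascript
var emoji = '\uD83D\uDE00'; // U+1F600, stored as a surrogate pair
console.log(emoji.length);  // 2

// Hypothetical helper: return the encoded string, or the error name on failure.
function tryEncode(s) {
    try {
        return encodeURIComponent(s);
    } catch (e) {
        return e.name;
    }
}

console.log(tryEncode(emoji));    // "%F0%9F%98%80" (the whole pair encodes fine)
console.log(tryEncode(emoji[0])); // "URIError" (a lone high surrogate does not)
```

Indexing the string with str[n] yields one code unit at a time, so a loop that encodes each str[n] separately always hands encodeURIComponent a lone surrogate when it meets such a character. Computing the UTF-8 length from the char code, as utf8_len does, avoids the call entirely.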

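Wow, thanks for the improvement! Since this is the more complete solution, I updated my answer to reference yours. - Kaii
Also, I expect this solution to be much faster than my original code, because constantly escaping and unescaping every character in a loop is very inefficient, but I did not know a better way at the time. It was only a proof of concept, though. - Kaii
No problem. Your answer was good; it just had one small bug. - APerson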

1

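For IE users, the code in the answer above will output undefined, because IE does not support str[n]; in other words, you cannot use a string as if it were an array. You need to replace str[n] with str.charAt(n). The code should look like this: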

function encode_utf8( s ) {
  return unescape( encodeURIComponent( s ) );
}

function substr_utf8_bytes(str, startInBytes, lengthInBytes) {

    var resultStr = '';
    var startInChars = 0;

    for (var bytePos = 0; bytePos < startInBytes; startInChars++) {
        var ch = str.charCodeAt(startInChars);
        bytePos += (ch < 128) ? 1 : encode_utf8(str.charAt(startInChars)).length;
    }

    var end = startInChars + lengthInBytes - 1;

    for (var n = startInChars; startInChars <= end; n++) {
        ch = str.charCodeAt(n);
        end -= (ch < 128) ? 1 : encode_utf8(str.charAt(n)).length;

        resultStr += str.charAt(n);
    }

    return resultStr;
}

-1

System.ArraySegment is useful, but you need to construct it with an input array and an offset, and then use its indexer.


Is this in JavaScript? Or is it just a C# library? - tofutim
