How do I split a string into chunks of a specific byte size?

Question

16

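I am interacting with an API that accepts strings of at most 5 KB in size.

I want to split a string that may be larger than 5 KB into chunks that are each smaller than 5 KB.

I then intend to pass each chunk to the API endpoint and perform further actions once all requests have finished, perhaps using something like: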

await Promise.all([get_thing_from_api(string_1), get_thing_from_api(string_2), get_thing_from_api(string_3)])
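As a rough sketch of that plan, assuming the chunks are already available as an array and that `get_thing_from_api` returns a promise. The stub below is a hypothetical stand-in, not a real API client, and `sendAllChunks` is an illustrative name:

```javascript
// Hypothetical stand-in for the real API call; it only reports the chunk's UTF-8 byte size.
async function get_thing_from_api(text) {
  return { bytes: new TextEncoder().encode(text).length };
}

// Send every chunk concurrently and resolve once all requests have finished.
// Promise.all keeps the results in the same order as the input chunks.
async function sendAllChunks(chunks) {
  return Promise.all(chunks.map((chunk) => get_thing_from_api(chunk)));
}

sendAllChunks(["Hey there!", "€ 100 to", "pay"]).then((results) => {
  console.log(results); // one result object per chunk
});
```

Note that `Promise.all` rejects as soon as any single request fails, so error handling per chunk may be needed in practice.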

I have read that a character in a string can take up 1 to 4 bytes.

So, to calculate the length of a string in bytes, we can use:

// in Node, string is UTF-8    
Buffer.byteLength("here is some text"); 

// in Javascript  
new Blob(["here is some text"]).size

Source:
https://dev59.com/gW035IYBdhLWcg3wSN4G#56026151
https://dev59.com/JnE95IYBdhLWcg3wp_kg#52254083
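A third option that should work in both Node and browsers is `TextEncoder`, which always encodes to UTF-8. A minimal sketch (`utf8ByteLength` is an illustrative name, not part of any library) that also shows the 1 to 4 byte range:

```javascript
// Count UTF-8 bytes without Buffer or Blob; TextEncoder always encodes as UTF-8.
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

console.log(utf8ByteLength("a"));  // 1
console.log(utf8ByteLength("ф"));  // 2
console.log(utf8ByteLength("€"));  // 3
console.log(utf8ByteLength("😀")); // 4
console.log("😀".length);          // 2 (UTF-16 code units, not bytes)
```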

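My searches turned up very little about splitting a string into strings of a specific byte length; most results are about splitting a string into strings of a specific character length, for example: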

var my_string = "1234 5 678905";

console.log(my_string.match(/.{1,2}/g));
// ["12", "34", " 5", " 6", "78", "90", "5"]

Source:
https://dev59.com/AGw05IYBdhLWcg3wxkur#7033662
https://dev59.com/VW025IYBdhLWcg3wAxJ9#6259543
https://gist.github.com/hendriklammers/5231994

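Question

Is there a way to split a string into strings of a specific byte length?

I could either:

assume the string only contains characters of 1 byte each
allow for the "worst case", in which every character is 4 bytes

But I would prefer a more accurate solution.

I am interested in both Node and plain JavaScript solutions, if they exist.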
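For comparison, the "worst case" option could be sketched roughly as follows. It assumes every code point needs the UTF-8 maximum of 4 bytes, so chunks are often much smaller than necessary, and it ignores word boundaries. `chunkWorstCase` is a hypothetical helper name:

```javascript
// Conservative chunking: treat every code point as 4 bytes (the UTF-8 maximum),
// so each chunk is guaranteed to fit in maxBytes without measuring anything.
function chunkWorstCase(str, maxBytes) {
  const perChunk = Math.floor(maxBytes / 4);
  if (perChunk < 1) throw new RangeError("maxBytes must be at least 4");
  const codePoints = Array.from(str); // splits by code point, keeping surrogate pairs intact
  const result = [];
  for (let i = 0; i < codePoints.length; i += perChunk) {
    result.push(codePoints.slice(i, i + perChunk).join(""));
  }
  return result;
}

console.log(chunkWorstCase("Hey! ф", 8)); // [ 'He', 'y!', ' ф' ]
```

For mostly ASCII text this wastes up to three quarters of each chunk, which is why a byte-accurate approach is preferable.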

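Edit

This approach to calculating byteLength, which iterates over the characters of the string, gets their character codes, and increments byteLength accordingly, may be helpful: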

function byteLength(str) {
  // returns the byte length of an utf8 string
  var s = str.length;
  for (var i=str.length-1; i>=0; i--) {
    var code = str.charCodeAt(i);
    if (code > 0x7f && code <= 0x7ff) s++;
    else if (code > 0x7ff && code <= 0xffff) s+=2;
    if (code >= 0xDC00 && code <= 0xDFFF) i--; //trail surrogate
  }
  return s;
}

Source: https://dev59.com/gW035IYBdhLWcg3wSN4G#23329386

That led me to some interesting experiments with Buffer's underlying data structure:

var buf = Buffer.from('Hey! ф');
// <Buffer 48 65 79 21 20 d1 84>  
buf.length // 7
buf.toString().charCodeAt(0) // 72
buf.toString().charCodeAt(5) // 1092  
buf.toString().charCodeAt(6) // NaN    
buf[0] // 72
for (let i = 0; i < buf.length; i++) {
  console.log(buf[i]);
}
// logs 72 101 121 33 32 209 132 (a REPL then prints undefined, the loop's completion value)
buf.slice(0,5).toString() // 'Hey! '
buf.slice(0,6).toString() // 'Hey! �'
buf.slice(0,7).toString() // 'Hey! ф'

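But as @trincot pointed out in the comments, what is the correct way to handle multi-byte characters? And how can I make sure chunks are split on spaces (so as not to "break" a word)?

More information on Buffer: https://nodejs.org/api/buffer.html#buffer_buffer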
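One way to avoid the replacement character seen in the experiment above is to step the cut position back until it no longer points at a UTF-8 continuation byte (bit pattern 10xxxxxx). The following is only a sketch of that idea; `safeCutIndex` is a hypothetical helper, and it does not address splitting on spaces:

```javascript
// Return the largest cut index <= maxBytes that does not fall inside a
// multi-byte UTF-8 sequence. Continuation bytes have the bit pattern 10xxxxxx.
function safeCutIndex(bytes, maxBytes) {
  let i = Math.min(maxBytes, bytes.length);
  // If the byte at the cut position is a continuation byte, the cut would
  // split a character, so step back to that character's first byte.
  while (i > 0 && i < bytes.length && (bytes[i] & 0xc0) === 0x80) i--;
  return i;
}

const bytes = new TextEncoder().encode("Hey! ф"); // 48 65 79 21 20 d1 84
const decoder = new TextDecoder();
console.log(decoder.decode(bytes.subarray(0, safeCutIndex(bytes, 6)))); // 'Hey! '
console.log(decoder.decode(bytes.subarray(0, safeCutIndex(bytes, 7)))); // 'Hey! ф'
```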

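Edit

In case it helps anyone else understand the brilliant logic in the accepted answer, the snippet below is a heavily commented version I made so that I could understand it better.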

/**
 * Takes a string and returns an array of substrings that fit within maxBytes
 * when UTF-8 encoded.
 *
 * This is an overly commented version of the non-generator version of the accepted answer,
 * in case it helps anyone understand its (brilliant) logic.
 *
 * Both plain JS and Node variations are shown below; simply un/comment your preference.
 *
 * @param  {string} s - the string to be chunked
 * @param  {number} maxBytes - the maximum size of a chunk, in bytes
 * @return {string[]} - an array of strings that fit within maxBytes (except in extreme edge cases)
 */
function chunk(s, maxBytes) {
  // for plain js  
  const decoder = new TextDecoder("utf-8");
  let buf = new TextEncoder("utf-8").encode(s);
  // for node
  // let buf = Buffer.from(s);
  const result = [];
  var counter = 0;
  while (buf.length) {
    console.log("=============== BEG LOOP " + counter + " ===============");
    console.log("result is now:");
    console.log(result);
    console.log("buf is now:");
    // for plain js
    console.log(decoder.decode(buf));
    // for node  
    // console.log(buf.toString());
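    /* get index of the last space character in the first chunk,
    searching backwards from the maxBytes index; a space found at index
    maxBytes still gives a chunk of exactly maxBytes bytes, whereas starting
    at maxBytes + 1 could let a chunk run one byte over the limit */
    let i = buf.lastIndexOf(32, maxBytes);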
    console.log("i is: " + i);
    /* if no space is found in the first chunk,
    get index of the first space character in the whole string,
    searching forwards from 0 - in edge cases where characters
    between spaces exceeds maxBytes, eg chunk("123456789x 1", 9),
    the chunk will exceed maxBytes */
    if (i < 0) i = buf.indexOf(32, maxBytes);
    console.log("at first condition, i is: " + i);
    /* if there's no space at all, take the whole string,
    again an edge case like chunk("123456789x", 9) will exceed maxBytes*/
    if (i < 0) i = buf.length;
    console.log("at second condition, i is: " + i);
    // this is a safe cut-off point; never half-way a multi-byte
    // because the index is always the index of a space    
    console.log("pushing buf.slice from 0 to " + i + " into result array");
    // for plain js
    result.push(decoder.decode(buf.slice(0, i)));
    // for node
    // result.push(buf.slice(0, i).toString());
    console.log("buf.slicing with value: " + (i + 1));
    // slice the string from the index + 1 forwards  
    // it won't erroneously slice out a value after i, because i is a space  
    buf = buf.slice(i + 1); // skip space (if any)
    console.log("=============== END LOOP " + counter + " ===============");
    counter++;
  }
  return result;
}

console.log(chunk("Hey there! € 100 to pay", 12));

- user1063287

2

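Is it acceptable to split in the middle of a multi-byte character? - trincot

Good question. This is for converting text to speech: if the text is under 5 KB a single audio file is generated, otherwise multiple audio files are generated, so I suppose the condition should be something like "split the chunks at instances of a space character". - user1063287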

1

I like the layout and the edits of your question! - Ali Akram

1

There is a module called iter-ops that includes flexible split and page operators. - vitaly-t

3 Answers

1

One possible solution is to count the number of bytes of each character.

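function charByteCounter(char){
    // Use the full code point so characters outside the BMP (e.g. emoji) count as 4 bytes
    const cp = char.codePointAt(0)
    // UTF-8 encoded size by code point range
    if (cp <= 0x7f) return 1
    if (cp <= 0x7ff) return 2
    if (cp <= 0xffff) return 3
    return 4
}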

function * chunk(string, maxBytes) {
    let byteCounter = 0
    let buildString = ''
    for(const char of string){
        const bytes = charByteCounter(char)
        if(byteCounter + bytes > maxBytes){ // check if the current bytes + this char bytes is greater than maxBytes
            yield buildString // string with less or equal bytes number to maxBytes
            buildString = char
            byteCounter = bytes
            continue
        }
        buildString += char
        byteCounter += bytes
    }

    yield buildString
}

for (const s of chunk("Hey! , nice to meet you!", 12))
    console.log(s);

Reference:

Reading bytes from a JavaScript string

- Ido

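This was the better solution for me. The accepted answer ended up with 3 chunks instead of 2. I suppose it was looking for a space character that does not exist in my string. - iocoker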

-1

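A small addition to @trincot's answer:

If the string you are splitting contains a space (" "), the returned array is always split into at least two items, even when the whole string would fit within maxBytes (so only 1 item should be returned).

To fix this, I added a check on the first line of the while loop: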

export function chunkText (text: string, maxBytes: number): string[] {
  let buf = Buffer.from(text)
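  const result: string[] = []
  while (buf.length) {
    // If the remainder already fits, take all of it; otherwise look for the
    // last space at or before index maxBytes
    let i = buf.length > maxBytes ? buf.lastIndexOf(32, maxBytes) : buf.length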
    // If no space found, try forward search
    if (i < 0) i = buf.indexOf(32, maxBytes)
    // If there's no space at all, take the whole string
    if (i < 0) i = buf.length
    // This is a safe cut-off point; never half-way a multi-byte
    result.push(buf.slice(0, i).toString())
    buf = buf.slice(i+1) // Skip space (if any)
  }
  return result
}

- Christian Kaindl


- trincot · Accepted Answer

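Using Buffer does indeed seem to be the right direction. Given that:

the Buffer prototype has indexOf and lastIndexOf methods,
32 is the ASCII code of a space, and
32 can never occur as part of a multi-byte character, since all the bytes that make up a multi-byte sequence always have their most significant bit set,

...you could proceed as follows: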

function chunk(s, maxBytes) {
    let buf = Buffer.from(s);
    const result = [];
    while (buf.length) {
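        // Search backwards from maxBytes: a space at that index still yields
        // a chunk of exactly maxBytes bytes
        let i = buf.lastIndexOf(32, maxBytes);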
        // If no space found, try forward search
        if (i < 0) i = buf.indexOf(32, maxBytes);
        // If there's no space at all, take the whole string
        if (i < 0) i = buf.length;
        // This is a safe cut-off point; never half-way a multi-byte
        result.push(buf.slice(0, i).toString());
        buf = buf.slice(i+1); // Skip space (if any)
    }
    return result;
}

console.log(chunk("Hey there! € 100 to pay", 12)); 
// -> [ 'Hey there!', '€ 100 to', 'pay' ]
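The claim that byte 32 can never be part of a multi-byte character can be checked with a small experiment. This is only an illustrative sketch using `TextEncoder`; the helper name `multiByteBytesHaveHighBit` is made up for the example:

```javascript
// Every byte of a multi-byte UTF-8 sequence is >= 0x80, so the byte 0x20
// (space) can only ever be a real space character.
function multiByteBytesHaveHighBit(str) {
  const encoder = new TextEncoder();
  for (const char of str) {
    const encoded = encoder.encode(char);
    // Only multi-byte characters are checked; single-byte ASCII is exempt.
    if (encoded.length > 1 && !encoded.every((b) => (b & 0x80) !== 0)) return false;
  }
  return true;
}

console.log([...new TextEncoder().encode("€")].map((b) => b.toString(16))); // [ 'e2', '82', 'ac' ]
console.log(multiByteBytesHaveHighBit("Hey there! € ф 😀")); // true
```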

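You could consider extending this to also look for TAB, LF, or CR as split characters. If you do, and your input text can contain CRLF sequences, you would also need to detect those so that no orphaned CR or LF characters end up in the chunks.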
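One possible way to sketch that extension is shown below. It is not the answer's own code: it uses `TextEncoder`/`TextDecoder` so that it runs in both Node and browsers, it takes the whole remainder when it already fits, and it drops the entire run of delimiters around each cut so that a CRLF pair disappears as a unit. The names `WHITESPACE` and `chunkOnWhitespace` are assumptions made for the example:

```javascript
const WHITESPACE = new Set([9, 10, 13, 32]); // TAB, LF, CR, space

function chunkOnWhitespace(s, maxBytes) {
  const decoder = new TextDecoder();
  let buf = new TextEncoder().encode(s);
  const result = [];
  while (buf.length) {
    let cut = buf.length; // default: the remainder fits, take all of it
    if (buf.length > maxBytes) {
      // Search backwards for the last delimiter at or before index maxBytes.
      cut = maxBytes;
      while (cut >= 0 && !WHITESPACE.has(buf[cut])) cut--;
      if (cut < 0) {
        // No delimiter in range: search forwards and accept an oversized chunk.
        cut = maxBytes;
        while (cut < buf.length && !WHITESPACE.has(buf[cut])) cut++;
      }
    }
    // Trim delimiters on both sides of the cut, so a CRLF pair is dropped as
    // a unit and never leaves a stray CR or LF in a chunk.
    let end = cut;
    while (end > 0 && WHITESPACE.has(buf[end - 1])) end--;
    let next = cut;
    while (next < buf.length && WHITESPACE.has(buf[next])) next++;
    if (end > 0) result.push(decoder.decode(buf.subarray(0, end)));
    buf = buf.subarray(next);
  }
  return result;
}

console.log(chunkOnWhitespace("Hey there!\r\n€ 100\tto pay", 12));
// -> [ 'Hey there!', '€ 100\tto', 'pay' ]
```

Whitespace inside a chunk (the TAB above) is preserved; only the delimiters at the cut points are discarded.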

You could turn the above function into a generator, so that you control when to start the processing for getting the next chunk:

function * chunk(s, maxBytes) {
    let buf = Buffer.from(s);
    while (buf.length) {
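        // Search backwards from maxBytes: a space at that index still yields
        // a chunk of exactly maxBytes bytes
        let i = buf.lastIndexOf(32, maxBytes);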
        // If no space found, try forward search
        if (i < 0) i = buf.indexOf(32, maxBytes);
        // If there's no space at all, take all
        if (i < 0) i = buf.length;
        // This is a safe cut-off point; never half-way a multi-byte
        yield buf.slice(0, i).toString();
        buf = buf.slice(i+1); // Skip space (if any)
    }
}

for (let s of chunk("Hey there! € 100 to pay", 12)) console.log(s);
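To illustrate what "controlling when the next chunk is produced" can mean in practice, here is a small sketch that awaits one asynchronous call per chunk before pulling the next one. `demoChunks` is a stand-in generator and `handleChunk` a hypothetical async handler (for example one API request per chunk); neither comes from the answer:

```javascript
// Stand-in generator that logs when each chunk is actually produced.
function* demoChunks() {
  for (const piece of ["Hey there!", "€ 100 to", "pay"]) {
    console.log("producing:", piece);
    yield piece;
  }
}

// Hypothetical async handler, e.g. one API request per chunk.
async function handleChunk(text) {
  return text.length;
}

// Pull one chunk at a time; the next chunk is only computed after the
// previous call has settled.
async function processSequentially(chunks, handler) {
  const results = [];
  for (const piece of chunks) {
    results.push(await handler(piece));
  }
  return results;
}

processSequentially(demoChunks(), handleChunk).then((r) => console.log(r)); // [ 10, 8, 3 ]
```

Any iterable works as the first argument, so a chunking generator can be passed in directly.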

Browsers

Buffer is specific to Node. Browsers, however, implement TextEncoder and TextDecoder, which leads to similar code:

function * chunk(s, maxBytes) {
    const decoder = new TextDecoder("utf-8");
    let buf = new TextEncoder("utf-8").encode(s);
    while (buf.length) {
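        // Search backwards from maxBytes: a space at that index still yields
        // a chunk of exactly maxBytes bytes
        let i = buf.lastIndexOf(32, maxBytes);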
        // If no space found, try forward search
        if (i < 0) i = buf.indexOf(32, maxBytes);
        // If there's no space at all, take all
        if (i < 0) i = buf.length;
        // This is a safe cut-off point; never half-way a multi-byte
        yield decoder.decode(buf.slice(0, i));
        buf = buf.slice(i+1); // Skip space (if any)
    }
}

for (let s of chunk("Hey there! € 100 to pay", 12)) console.log(s);