Ruby: Limiting a UTF-8 string by byte length

Question

ruby string utf-8 byte rabbitmq

16

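This RabbitMQ page states:

Queue names may be up to 255 bytes of UTF-8 characters.

In Ruby (1.9.3), how would I truncate a UTF-8 string by byte count without breaking in the middle of a character? The result should be the longest valid UTF-8 string that fits within the byte limit.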

- Kelvin

7 Answers

10

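The bytesize method returns the length of the string in bytes, whereas character-based operations such as slicing will not mangle the string as long as its encoding is set correctly.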
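To make the distinction concrete, here is a small sketch. It assumes Ruby 1.9.3 or later (for String#byteslice), and the sample string and variable names are only illustrations, not part of the original answer.

```ruby
s = "abc\u2014d"              # the em dash (U+2014) takes 3 bytes in UTF-8

char_count = s.length         # 5 characters
byte_count = s.bytesize       # 7 bytes

# Character-based slicing always cuts on a character boundary.
by_chars = s[0, 4]            # 4 characters, 6 bytes

# Byte-based slicing can cut through the middle of a character.
by_bytes = s.byteslice(0, 5)  # "abc" plus 2 of the dash's 3 bytes

puts by_chars.valid_encoding? # true
puts by_bytes.valid_encoding? # false
```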

A simple approach is to just iterate through the string.

s.each_char.each_with_object('') do |char, result|
  if result.bytesize + char.bytesize > 255
    break result
  else
    result << char
  end
end

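If you were feeling clever, you could copy over the first 63 characters directly, since any Unicode character is at most 4 bytes in UTF-8 (63 * 4 = 252 bytes, which is under the limit).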
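The shortcut described above could be sketched roughly as follows. The method name limit_bytes_with_head_start and its parameters are made up for illustration; the only assumption it relies on is that a UTF-8 character never exceeds 4 bytes.

```ruby
# Hypothetical helper: copy the first max_bytes / 4 characters in one step,
# then fall back to the character-by-character loop for the remainder.
def limit_bytes_with_head_start(s, max_bytes)
  safe_chars = max_bytes / 4         # these characters always fit
  result = s[0, safe_chars]          # new string, at most max_bytes bytes
  rest = s[safe_chars..-1] || ''     # nil when s is shorter than safe_chars

  rest.each_char do |char|
    break if result.bytesize + char.bytesize > max_bytes
    result << char
  end
  result
end

puts limit_bytes_with_head_start("abc\u2014d", 5)
```

For a 255-byte limit the head start is 63 characters, so the loop only has to examine whatever comes after them.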

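Note that this is still not perfect. For example, suppose your string ends with the character 'e' followed by a combining acute accent (U+0301, which takes 2 bytes in UTF-8). Slicing off those last 2 bytes produces a string that is still valid UTF-8, but in terms of what the user sees, the output changes from 'é' to 'e', which might change the meaning of the text. That probably doesn't matter much when you're just naming RabbitMQ queues, but it could matter in other situations. For example, in French the news headline "Un policier tué" means "A policeman was killed", whereas "Un policier tue" means "A policeman kills".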

- Frederick Cheung

4

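+1 just for the policeman example :). Google Translate confirms it. The pronunciations still sound different enough, though. - Kelvin

Just so everyone knows, the "combining character" issue only arises with decomposed characters. If e-acute etc. is a single (precomposed) character, there is no problem. - Kelvin

You can avoid this by first converting to normalization form C. - Frederick Cheung

Thanks! There is a faster way than building the string linearly, though; I have posted a separate answer here. - Ian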
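The normalization suggestion in the comments above can be illustrated with a short sketch. It assumes Ruby 2.2 or later, where String#unicode_normalize ships with the standard library (this is newer than the Ruby 1.9.3 in the question); the variable names are illustrative.

```ruby
# "tue" followed by a combining acute accent: 'e' and the accent are two
# separate characters, so a character-boundary cut can fall between them.
decomposed = "Un policier tue\u0301"

# Normalization form C merges them into the single character U+00E9.
composed = decomposed.unicode_normalize(:nfc)

puts decomposed.length    # 16 characters
puts decomposed.bytesize  # 17 bytes
puts composed.length      # 15 characters
puts composed.bytesize    # 16 bytes
```

After composing, the accented letter is one 2-byte character, so truncation on character boundaries either keeps it whole or drops it entirely. Note that this only helps for sequences that have a precomposed form.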

5

I think I found something that works.

def limit_bytesize(str, size)
  str.encoding.name == 'UTF-8' or raise ArgumentError, "str must have UTF-8 encoding"

  # Change to canonical unicode form (compose any decomposed characters).
  # Works only if you're using active_support
  str = str.mb_chars.compose.to_s if str.respond_to?(:mb_chars)

  # Start with a string of the correct byte size, but
  # with a possibly incomplete char at the end.
  new_str = str.byteslice(0, size)

  # We need to force_encoding from utf-8 to utf-8 so ruby will re-validate
  # (idea from halfelf).
  until new_str.empty? || new_str[-1].force_encoding('utf-8').valid_encoding?
    # remove the invalid char
    new_str = new_str.slice(0..-2)
  end
  new_str
end

Usage:

>> limit_bytesize("abc\u2014d", 4)
=> "abc"
>> limit_bytesize("abc\u2014d", 5)
=> "abc"
>> limit_bytesize("abc\u2014d", 6)
=> "abc—"
>> limit_bytesize("abc\u2014d", 7)
=> "abc—d"

Update...

Decomposed behavior without active_support:

>> limit_bytesize("abc\u0065\u0301d", 4)
=> "abce"
>> limit_bytesize("abc\u0065\u0301d", 5)
=> "abce"
>> limit_bytesize("abc\u0065\u0301d", 6)
=> "abcé"
>> limit_bytesize("abc\u0065\u0301d", 7)
=> "abcéd"

Decomposed behavior with active_support:

>> limit_bytesize("abc\u0065\u0301d", 4)
=> "abc"
>> limit_bytesize("abc\u0065\u0301d", 5)
=> "abcé"
>> limit_bytesize("abc\u0065\u0301d", 6)
=> "abcéd"

- Kelvin

4

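Rails 6 will provide String#truncate_bytes, which behaves like truncate but takes a byte count instead of a character count. Naturally, it returns a valid string (it does not blindly cut in the middle of a multibyte character).

Taken from the docs: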

>> "🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪".size
=> 20
>> "🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪".bytesize
=> 80
>> "🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪".truncate_bytes(20)
=> "🔪🔪🔪🔪…"

- akim

+1 That code is quite clever - it even handles decomposed characters correctly by using scan(/\X/) to split the string into grapheme clusters. - Kelvin
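For readers without Rails, the grapheme-cluster idea from the comment above could look roughly like this in plain Ruby. This is a sketch of the technique only, not the actual ActiveSupport implementation; the method name truncate_graphemes is hypothetical, and unlike truncate_bytes it appends no omission marker.

```ruby
# Hypothetical helper: keep whole grapheme clusters (a base character plus
# any combining marks) while the running total stays within max_bytes.
def truncate_graphemes(str, max_bytes)
  result = String.new('', encoding: str.encoding)
  str.scan(/\X/) do |cluster|
    break if result.bytesize + cluster.bytesize > max_bytes
    result << cluster
  end
  result
end

decomposed = "abce\u0301d"   # 'e' followed by a combining acute accent
puts truncate_graphemes(decomposed, 4)  # the accented 'e' is dropped as a unit
puts truncate_graphemes(decomposed, 6)  # the accented 'e' is kept as a unit
```

Because 'e' and its accent form one 3-byte cluster, a 4-byte limit yields "abc" rather than "abce", so the visible letter never loses its accent.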

1

How about this?

s = "δogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδog"
count = 0
while true
  more_truncate = "a" + (255-count).to_s
  s2 = s.unpack(more_truncate)[0]
  s2.force_encoding 'utf-8'

  if s2[-1].valid_encoding?
    break
  else
    count += 1
  end
end

s2.force_encoding 'utf-8'
puts s2

- halfelf

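It works, but what if the string is large? Removing one UTF-8 character at a time could be very inefficient. - Kelvin

@Kelvin The answer has been edited. It should be much better now. Since a UTF-8 character is never longer than 6 bytes, the loop will finish quickly. - halfelf

s2[0] appears to be the result, but it has ascii-8bit encoding. If I call .encode('utf-8'), I get an Encoding::UndefinedConversionError. - Kelvin

Really annoying... I edited it again. After force_encoding, you can test with encode('utf-8') or valid_encoding?. By the way, believe me, I have been through character-encoding nightmares too, since my native language is a complicated one. - halfelf

Let us continue this discussion in chat. - halfelf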

0

Without Rails

Frederick Cheung's answer is an excellent O(n) starting point, and it inspired this O(log n) solution:

def limit_bytesize(str, max_bytesize)
  return str unless str.bytesize > max_bytesize

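  # Find the smallest index whose prefix exceeds the byte limit. That index
  # equals the number of leading characters that still fit, which also
  # handles the case where even the first character is too large.
  just_over = (0...str.size).bsearch { |l| str[0..l].bytesize > max_bytesize }
  str[0, just_over]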
end

I believe this also gets the automatic max_bytesize / 4 speedup mentioned in that answer, since bsearch starts in the middle.

- Ian

0

Ruby's String#byteslice can be used with a range. I would suggest trying the following:

string.byteslice(0...max_bytesize)

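The three dots exclude the end of the range, so it covers byte offsets 0 through max_bytesize - 1, which is exactly max_bytesize bytes.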

- kayla.reopelle

I had high hopes for this, but unfortunately it can split a Unicode character apart, leaving an invalid string: "abc\u2014d".byteslice(0, 5).valid_encoding? # => false. - Kelvin
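One way to work around the problem raised in the comment above is to drop the broken tail after slicing. The sketch below assumes Ruby 2.1 or later (for String#scrub) and an input that is already valid UTF-8; the method name limit_bytesize_scrub is hypothetical.

```ruby
# Hypothetical helper: cut at the byte limit, then remove the incomplete
# character (if any) that the cut left at the end.
# Assumes str is valid UTF-8; scrub would also strip invalid bytes elsewhere.
def limit_bytesize_scrub(str, max_bytesize)
  str.byteslice(0, max_bytesize).scrub('')
end

puts limit_bytesize_scrub("abc\u2014d", 5)  # the partial dash is removed
puts limit_bytesize_scrub("abc\u2014d", 6)  # the dash fits and is kept
```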

- jogaco · Accepted Answer

For Rails >= 3.0 you can use the ActiveSupport::Multibyte::Chars limit method.

From the API docs:

- (Object) limit(limit)

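Limits the byte size of the string to a number of bytes without breaking characters. Usable when the storage for a string is limited for some reason.

Example: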

'こんにちは'.mb_chars.limit(7).to_s # => "こん"