Truncating a string by bytes

Question

9

I wrote the following Java code to truncate a string to a new string that fits within a given number of bytes.

        String truncatedValue = "";
        String currentValue = string;
        int pivotIndex = (int) Math.round(((double) string.length())/2);
        while(!truncatedValue.equals(currentValue)){
            currentValue = string.substring(0,pivotIndex);
            byte[] bytes = null;
            bytes = currentValue.getBytes(encoding);
            if(bytes==null){
                return string;
            }
            int byteLength = bytes.length;
            int newIndex =  (int) Math.round(((double) pivotIndex)/2);
            if(byteLength > maxBytesLength){
                pivotIndex = newIndex;
            } else if(byteLength < maxBytesLength){
                pivotIndex = pivotIndex + 1;
            } else {
                truncatedValue = currentValue;
            }
        }
        return truncatedValue;

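This was the first thing that came to mind, and I know it can be improved. I saw another post asking a similar question, but the answers there cut the string by working on the bytes rather than using String.substring. In my case I would prefer to use String.substring.

EDIT: I just removed the reference to UTF-8 because I would also like this to work for different storage types.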

- stevebot

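Let me rephrase your problem. You are trying to fit a string into a byte array that cannot be larger than maxUTF8BytesLength. You want to encode it as UTF-8 and copy as many characters as possible. Is that right? - gawi

Right, I would say that is correct. I would also like to do it efficiently. - stevebot

I just edited the question so that it no longer mentions UTF-8. Sorry about that, it was misleading. - stevebot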

12 Answers

8

A more sensible solution is to use a decoder:

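final Charset CHARSET = Charset.forName("UTF-8"); // or any other charset
final byte[] bytes = inputString.getBytes(CHARSET);
final CharsetDecoder decoder = CHARSET.newDecoder();
// IGNORE drops a multi-byte sequence that the byte limit cuts in half.
decoder.onMalformedInput(CodingErrorAction.IGNORE);
decoder.reset();
// Clamp the limit so that inputs shorter than the limit do not overrun the array.
final int length = Math.min(limit, bytes.length);
final CharBuffer decoded = decoder.decode(ByteBuffer.wrap(bytes, 0, length));
final String outputString = decoded.toString();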

- kan

2

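Cutting at an arbitrary byte index can create invalid encoded data, because a single character may use multiple bytes (especially with UTF-8). Even worse, with other encodings it may produce wrong but valid characters, which will not be ignored. You can easily avoid this problem by first allocating a ByteBuffer of the desired size and then using it with a CharsetEncoder. The CharsetEncoder automatically encodes only as many valid characters as fit into the buffer; you then decode the buffer to a String. It is a similar approach, but without the error, and more efficient, because it does not encode characters beyond the intended limit. - Holger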

1

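See this answer. It can even omit the decoding step. - Holger

@Holger My solution ignores the truncated multi-byte character through CodingErrorAction.IGNORE, so it works fine. I would love to see an example where it fails. However, I agree that your solution looks neater and is probably more efficient. - kan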

2

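Yes, for UTF-8, using CodingErrorAction.IGNORE does the right thing. But the OP said, "I would also like this to work for different storage types", and with other encodings, splitting a multi-byte sequence may produce valid (but wrong) characters. - Holger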

5

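I think Rex Kerr's solution has two bugs.

First, it truncates to limit+1 if a non-ASCII character sits right before the limit. Truncating "123456789á1" yields "123456789á", which takes 11 bytes in UTF-8.
Second, I think he misread the UTF-8 standard. https://en.wikipedia.org/wiki/UTF-8#Description shows that 110xxxxx at the start of a UTF-8 sequence means the representation is 2 bytes long (not 3). That is why his implementation usually does not use all of the available space (as Nissim Avitan pointed out).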
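To make the point about lead bytes concrete, here is a minimal sketch of how the high bits of the first byte of a UTF-8 sequence announce its total length. The class name Utf8LeadByte and the method sequenceLength are hypothetical names chosen for this illustration; they do not appear in any of the answers.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LeadByte {
    /** Returns the sequence length announced by a UTF-8 lead byte, or -1 for a non-lead byte. */
    public static int sequenceLength(byte lead) {
        if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
        if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx: two-byte sequence
        if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx: three-byte sequence
        if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx: four-byte sequence
        return -1;                           // 10xxxxxx continuation byte, or invalid
    }

    public static void main(String[] args) {
        // "\u00e1" is the character "á" from the example above.
        byte[] utf8 = "\u00e1".getBytes(StandardCharsets.UTF_8);
        System.out.println(sequenceLength(utf8[0]));
    }
}
```

A byte that returns -1 here is never a safe place to start a character, so a cut that lands on one has to move back to the preceding lead byte.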

Please see the corrected version below:

public String cut(String s, int charLimit) throws UnsupportedEncodingException {
    byte[] utf8 = s.getBytes("UTF-8");
    if (utf8.length <= charLimit) {
        return s;
    }
    int n16 = 0;
    boolean extraLong = false;
    int i = 0;
    while (i < charLimit) {
        // Unicode characters above U+FFFF need 2 words in utf16
        extraLong = ((utf8[i] & 0xF0) == 0xF0);
        if ((utf8[i] & 0x80) == 0) {
            i += 1;
        } else {
            int b = utf8[i];
            while ((b & 0x80) > 0) {
                ++i;
                b = b << 1;
            }
        }
        if (i <= charLimit) {
            n16 += (extraLong) ? 2 : 1;
        }
    }
    return s.substring(0, n16);
}

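I still thought this was far from efficient. So if you do not really need the String representation of the result and a byte array will do, you can use this: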

private byte[] cutToBytes(String s, int charLimit) throws UnsupportedEncodingException {
    byte[] utf8 = s.getBytes("UTF-8");
    if (utf8.length <= charLimit) {
        return utf8;
    }
    if ((utf8[charLimit] & 0x80) == 0) {
        // the limit doesn't cut an UTF-8 sequence
        return Arrays.copyOf(utf8, charLimit);
    }
    int i = 0;
    while ((utf8[charLimit-i-1] & 0x80) > 0 && (utf8[charLimit-i-1] & 0x40) == 0) {
        ++i;
    }
    if ((utf8[charLimit-i-1] & 0x80) > 0) {
        // we have to skip the starter UTF-8 byte
        return Arrays.copyOf(utf8, charLimit-i-1);
    } else {
        // we passed all UTF-8 bytes
        return Arrays.copyOf(utf8, charLimit-i);
    }
}

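The funny thing is that with a realistic limit of 20-500 bytes, the two perform almost the same if you create a string from the byte array again. Note that both methods assume valid UTF-8 input, which is a valid assumption after using Java's getBytes() function.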

- Zsolt Taskai

You should also catch UnsupportedEncodingException at s.getBytes("UTF-8"). - asalamon74

I don't see getBytes throwing anything. Although http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#getBytes%28java.lang.String%29 says: "The behavior of this method when this string cannot be encoded in the given charset is unspecified." - Zsolt Taskai

1

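The page you linked shows that it throws UnsupportedEncodingException: "public byte[] getBytes(String charsetName) throws UnsupportedEncodingException". - asalamon74

Thanks! Strange, I don't know which version I used when I posted this solution two years ago. Updating the code above. - Zsolt Taskai

You can use the Charset constants from the StandardCharsets class instead of passing the encoding name as a string, because the String#getBytes(Charset charset) method does not throw UnsupportedEncodingException. - Pikachu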

4

String s = "FOOBAR";

int limit = 3;
s = new String(s.getBytes(), 0, limit);

Resulting value of s:

FOO

- Ilya Lysenko

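When MAX_LENGTH breaks the byte array in the middle of a multi-byte sequence, the resulting string ends with "?". For example: s = "ää"; MAX_LENGTH = 3; the result is "ä?". Given the simplicity of this code, it may be an option in some cases. - Martin Rust

Please correct my comment: MAX_LENGTH = 5 (why does the solution use MAX_LENGTH - 2?). Also note that since Java 1.6, "UTF-8" should be replaced with StandardCharsets.UTF_8. - Martin Rust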
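The pitfall described in the comments above can be reproduced with a small sketch. It assumes UTF-8 is passed explicitly instead of relying on the platform default, and the class name NaiveByteCut and the method cut are hypothetical names for this illustration. When the limit splits a two-byte sequence, the String constructor substitutes the replacement character U+FFFD for the dangling byte, which many consoles display as "?".

```java
import java.nio.charset.StandardCharsets;

public class NaiveByteCut {
    /** Cuts the UTF-8 bytes at the limit without regard for character boundaries. */
    public static String cut(String s, int limit) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int length = Math.min(limit, bytes.length); // avoid overrunning short inputs
        return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u00e4\u00e4" is "ää": four bytes in UTF-8, so a limit of 3 splits the second character.
        String result = cut("\u00e4\u00e4", 3);
        System.out.println(result.charAt(result.length() - 1) == '\uFFFD');
    }
}
```

Note that the result re-encodes to more bytes than the limit, because U+FFFD itself takes three bytes in UTF-8, so this shortcut is only safe when the text is known to be single-byte.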

3

Use the UTF-8 CharsetEncoder and encode until the output ByteBuffer contains as many bytes as you are willing to take, which you can detect by looking for CoderResult.OVERFLOW.

- bmargulies
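The encoder-based idea suggested above (and in Holger's comments) can be sketched roughly as follows. This is an illustration under stated assumptions, not code from either author: the class name EncoderTruncate and the method truncate are hypothetical, and the choice of REPLACE for malformed or unmappable input is an assumption made so that the encoder never stops for any reason other than a full buffer. The encoder stops consuming characters once the next one no longer fits, so the position of the input buffer tells us where to cut the original string, and no decoding step is needed.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class EncoderTruncate {
    /** Returns the longest prefix of s whose encoding fits into maxBytes bytes. */
    public static String truncate(String s, int maxBytes, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)        // assumption: replace, do not fail
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(Math.max(maxBytes, 0)); // the byte budget
        CharBuffer in = CharBuffer.wrap(s);
        // Encoding stops with CoderResult.OVERFLOW when the next character does not fit.
        encoder.encode(in, out, true);
        // in.position() is the number of chars that were fully encoded.
        return s.substring(0, in.position());
    }
}
```

For example, truncate("ää", 3, StandardCharsets.UTF_8) keeps only the first "ä", because the second one would need bytes 3 and 4. Since the charset is a parameter, the same sketch applies to encodings other than UTF-8, which is what the question asks for.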

2

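As noted, Peter Lawrey's solution has a major performance disadvantage (about 3,500 ms for 10,000 runs). Rex Kerr's is much faster (about 500 ms for 10,000 runs), but the result is not accurate: it cuts off more than it needs to (for example, in some cases it left only 3,500 bytes where it should have kept 4,000). Here is my solution (about 250 ms for 10,000 runs), assuming that the maximum length of a UTF-8 character is 4 bytes (thanks, Wikipedia):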

public static String cutWord (String word, int dbLimit) throws UnsupportedEncodingException{
    double MAX_UTF8_CHAR_LENGTH = 4.0;
    if(word.length()>dbLimit){
        word = word.substring(0, dbLimit);
    }
    if(word.length() > dbLimit/MAX_UTF8_CHAR_LENGTH){
        int residual=word.getBytes("UTF-8").length-dbLimit;
        if(residual>0){
            int tempResidual = residual,start, end = word.length();
            while(tempResidual > 0){
                start = end-((int) Math.ceil((double)tempResidual/MAX_UTF8_CHAR_LENGTH));
                tempResidual = tempResidual - word.substring(start,end).getBytes("UTF-8").length;
                end=start;
            }
            word = word.substring(0, end);
        }
    }
    return word;
}

- Nissim Avitan

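This solution doesn't seem to prevent a trailing half of a surrogate pair? Second, if getBytes().length happens to be applied to the two halves of a surrogate pair individually (it is not immediately obvious to me that it never will), it would also underestimate the size of the UTF-8 representation of the pair as a whole, assuming the "replacement byte array" is a single byte. Third, all 4-byte UTF-8 code points require a two-char surrogate pair in Java, so the effective maximum is just 3 bytes per Java char. - Stefan L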

1

You could convert the String to bytes and then convert just those bytes back to a string.

public static String substring(String text, int maxBytes) {
   StringBuilder ret = new StringBuilder();
   for(int i = 0;i < text.length(); i++) {
       // works out how many bytes a character takes, 
       // and removes these from the total allowed.
       if((maxBytes -= text.substring(i, i+1).getBytes().length) < 0) break;
       ret.append(text.charAt(i));
   }
   return ret.toString();
}

- Peter Lawrey

2

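@nguyendat, there are many reasons why this is not very performant. The main one is the object creation in substring() and getBytes(). However, you would be surprised how much you can do in a millisecond, and that is usually enough. - Peter Lawrey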

1

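This method does not handle surrogate pairs properly; for example, substring("\uD800\uDF30\uD800\uDF30", 4).getBytes("UTF-8").length will return 8, not 4. Half of a surrogate pair is represented as a single-byte "?" by String.getBytes("UTF-8"). - Stefan L

@StefanL I posted a variant of this answer here (https://dev59.com/CFDTa4cB1Zd3GeqPGA8Y#41071240) that handles surrogate pairs correctly. - Hans Brende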
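One way to avoid the surrogate-pair problem raised in these comments is to step through the string by code point instead of by char. The sketch below is not the linked variant; it is a separate illustration, and the class name CodePointTruncate and the method truncate are hypothetical. It assumes a stateless charset in which encoding each code point on its own gives the same byte count as encoding the whole string (true for UTF-8 and for single-byte charsets, not for charsets that emit a byte-order mark or escape sequences).

```java
import java.nio.charset.Charset;

public class CodePointTruncate {
    /** Keeps whole code points while their encoded size stays within maxBytes. */
    public static String truncate(String text, int maxBytes, Charset charset) {
        int end = 0;   // index (in chars) just past the last code point kept
        int used = 0;  // bytes consumed so far
        while (end < text.length()) {
            int codePoint = text.codePointAt(end);
            int chars = Character.charCount(codePoint); // 2 for a surrogate pair, else 1
            int size = text.substring(end, end + chars).getBytes(charset).length;
            if (used + size > maxBytes) {
                break; // the next code point would exceed the budget
            }
            used += size;
            end += chars;
        }
        return text.substring(0, end);
    }
}
```

With the input from the comment above, truncate("\uD800\uDF30\uD800\uDF30", 4, StandardCharsets.UTF_8) keeps exactly one code point (two chars), which encodes to 4 bytes. Like the original, this still allocates a small string and byte array per character, so it trades speed for simplicity.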

0

This may not be the most efficient solution, but it works.

public static String substring(String s, int byteLimit) {
    if (s.getBytes().length <= byteLimit) {
        return s;
    }

    int n = Math.min(byteLimit-1, s.length()-1);
    do {
        s = s.substring(0, n--);
    } while (s.getBytes().length > byteLimit);

    return s;
}

- Saúl Martínez Vidals

0

Using the regular expression below, you can also remove leading and trailing whitespace, including double-byte (full-width) spaces.

stringtoConvert = stringtoConvert.replaceAll("^[\\s　]*", "").replaceAll("[\\s　]*$", "");

- Gokul Limbe

0

Here is mine:

private static final int FIELD_MAX = 2000;
private static final Charset CHARSET =  Charset.forName("UTF-8"); 

public String trancStatus(String status) {

    if (status != null && (status.getBytes(CHARSET).length > FIELD_MAX)) {
        int maxLength = FIELD_MAX;

        int left = 0, right = status.length();
        int index = 0, bytes = 0, sizeNextChar = 0;

        while (bytes != maxLength && (bytes > maxLength || (bytes + sizeNextChar < maxLength))) {

            index = left + (right - left) / 2;

            bytes = status.substring(0, index).getBytes(CHARSET).length;
            sizeNextChar = String.valueOf(status.charAt(index + 1)).getBytes(CHARSET).length;

            if (bytes < maxLength) {
                left = index - 1;
            } else {
                right = index + 1;
            }
        }

        return status.substring(0, index);

    } else {
        return status;
    }
}

- Сергей Сенько
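The binary-search idea used in the answer above, and in the question itself, can be written with explicit loop invariants so that it always terminates and never indexes past the end of the string. This is a sketch under assumptions rather than a fix of either author's code: the names BinarySearchTruncate and truncate are hypothetical, and it assumes the encoded length of a prefix never shrinks as the prefix grows (true for UTF-8 and fixed-width charsets). It uses String.substring, as the question prefers, and works with any such Charset.

```java
import java.nio.charset.Charset;

public class BinarySearchTruncate {
    /** Finds the longest prefix of s that encodes to at most maxBytes bytes. */
    public static String truncate(String s, int maxBytes, Charset charset) {
        if (maxBytes <= 0) {
            return "";
        }
        if (s.getBytes(charset).length <= maxBytes) {
            return s; // already fits
        }
        // Invariant: the prefix of length lo fits, the prefix of length hi does not.
        int lo = 0;
        int hi = s.length();
        while (hi - lo > 1) {
            int mid = (lo + hi) >>> 1;
            if (s.substring(0, mid).getBytes(charset).length <= maxBytes) {
                lo = mid;
            } else {
                hi = mid;
            }
        }
        // Do not leave half of a surrogate pair at the end of the result.
        if (lo > 0 && Character.isHighSurrogate(s.charAt(lo - 1))) {
            lo--;
        }
        return s.substring(0, lo);
    }
}
```

Each probe re-encodes a prefix, so the cost is roughly the string length times the logarithm of that length; for long inputs, a single-pass encoder is likely to be cheaper.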

- Rex Kerr · Accepted Answer

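Why not convert to bytes and walk forward, obeying UTF-8 character boundaries as you go, until you reach the maximum number, then convert those bytes back into a string?

Or you could just cut the original string if you keep track of where the cut should occur: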

// Assuming that Java will always produce valid UTF8 from a string, so no error checking!
// (Is this always true, I wonder?)
public class UTF8Cutter {
  public static String cut(String s, int n) {
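    // Explicit charset: the no-argument getBytes() uses the platform default, which may not be UTF-8.
    byte[] utf8 = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);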
    if (utf8.length < n) n = utf8.length;
    int n16 = 0;
    int advance = 1;
    int i = 0;
    while (i < n) {
      advance = 1;
      if ((utf8[i] & 0x80) == 0) i += 1;
      else if ((utf8[i] & 0xE0) == 0xC0) i += 2;
      else if ((utf8[i] & 0xF0) == 0xE0) i += 3;
      else { i += 4; advance = 2; }
      if (i <= n) n16 += advance;
    }
    return s.substring(0,n16);
  }
}

Note: edited on 2014-08-25 to fix bugs.