How can I detect whether a string contains Cyrillic letters in Java?

Question

How can I detect whether a string contains Cyrillic letters in Java?

10

I want to detect whether a string contains Cyrillic letters.

In PHP, I did something like this:

preg_match('/\p{Cyrillic}+/ui', $text)

Is there an equivalent in Java?

- knezmilos

3 Answers

2

Here is another way to do the same thing using Java 8 streams:

text.chars()
        .mapToObj(Character.UnicodeBlock::of)
        .filter(Character.UnicodeBlock.CYRILLIC::equals)
        .findAny()
        .ifPresent(block -> { /* the text contains a Cyrillic character */ });

Or another way that keeps the index:

char[] textChars = text.toCharArray();
IntStream.range(0, textChars.length)
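                 // Constant first: UnicodeBlock.of may return null for characters outside any defined block
                 .filter(index -> Character.UnicodeBlock.CYRILLIC
                                .equals(Character.UnicodeBlock.of(textChars[index])))
                 .findAny() // can use findFirst()
                 .ifPresent(index -> { /* use the index of the Cyrillic character here */ });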

Note: I use a char array here rather than the String because looking up an element by index is faster on an array.
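
The index-preserving variant above can be wrapped in a small helper. This is only a minimal sketch: the class name CyrillicIndexSketch and the method firstCyrillicIndex are made up for illustration, and findFirst() is used instead of findAny() so that the result is deterministic.

```java
import java.util.OptionalInt;
import java.util.stream.IntStream;

// Hypothetical helper, not part of the original answer.
final class CyrillicIndexSketch {

    // Returns the index of the first UTF-16 char that lies in the CYRILLIC block, if any.
    static OptionalInt firstCyrillicIndex(String text) {
        char[] chars = text.toCharArray();
        return IntStream.range(0, chars.length)
                // Constant on the left, because UnicodeBlock.of can return null.
                .filter(i -> Character.UnicodeBlock.CYRILLIC
                        .equals(Character.UnicodeBlock.of(chars[i])))
                .findFirst();
    }

    public static void main(String[] args) {
        // The escapes spell a Cyrillic word; it starts at index 4.
        System.out.println(firstCyrillicIndex("abc \u041F\u0440\u0438\u0432\u0435\u0442")); // OptionalInt[4]
        System.out.println(firstCyrillicIndex("hello")); // OptionalInt.empty
    }
}
```

Returning an OptionalInt leaves it to the caller to decide what to do when no Cyrillic character is present.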

- Alexander Druzhynin

1

The UnicodeBlock examples above will work, but if you prefer, you can also use the Character.UnicodeScript enum:

boolean containsCyrillic = "Your String Goes Here".codePoints()
    .mapToObj(Character.UnicodeScript::of)
    .anyMatch(Character.UnicodeScript.CYRILLIC::equals);

If you don't trust your input, you can be more defensive and filter with Character.isValidCodePoint first:

boolean containsCyrillic =
    "Your Untrusted String Goes Here".codePoints()
        .filter(Character::isValidCodePoint)
        .mapToObj(Character.UnicodeScript::of)
        .anyMatch(s -> s == Character.UnicodeScript.CYRILLIC);
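
One practical difference from the block-based check is worth illustrating: a script check also covers Cyrillic letters that live outside the basic Cyrillic block, such as those in the Cyrillic Supplement range (U+0500 to U+052F). The sketch below assumes a hypothetical helper class CyrillicScriptSketch; only the JDK calls are real.

```java
// Hypothetical helper, not part of the original answer.
final class CyrillicScriptSketch {

    // True if any code point in the text belongs to the Cyrillic script.
    static boolean containsCyrillic(String text) {
        return text.codePoints()
                .mapToObj(Character.UnicodeScript::of)
                .anyMatch(s -> s == Character.UnicodeScript.CYRILLIC);
    }

    public static void main(String[] args) {
        String supplementLetter = "\u0501"; // a letter from the Cyrillic Supplement range
        System.out.println(containsCyrillic(supplementLetter));
        // The same letter is not in the basic CYRILLIC block:
        System.out.println(Character.UnicodeBlock.of('\u0501') == Character.UnicodeBlock.CYRILLIC);
    }
}
```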

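If you are interested in analyzing the scripts used in a text, for example to determine its predominant script, you can count the code points that belong to each script: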

Map<Character.UnicodeScript,Long> scripts = 
    "Your Untrusted String Goes Here".codePoints()
        .filter(Character::isValidCodePoint)
        .mapToObj(Character.UnicodeScript::of)
        .collect(groupingBy(
            Function.identity(),
            counting()));

We can be a bit more efficient by using an EnumMap, since Character.UnicodeScript is an enum type:

Map<Character.UnicodeScript,Long> scripts = 
    "Your Untrusted String Goes Here".codePoints()
        .filter(Character::isValidCodePoint)
        .mapToObj(Character.UnicodeScript::of)
        .collect(groupingBy(
            Function.identity(),
            () -> new EnumMap<>(Character.UnicodeScript.class),
            counting()));
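
The snippets above appear to rely on static imports of Collectors.groupingBy and Collectors.counting. For reference, here is a self-contained sketch with ordinary imports; the class name ScriptCountSketch and the method countScripts are illustrative only, and the isValidCodePoint filter is left out for brevity.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original answer.
final class ScriptCountSketch {

    // Counts how many code points of the text fall into each Unicode script.
    static Map<Character.UnicodeScript, Long> countScripts(String text) {
        return text.codePoints()
                .mapToObj(Character.UnicodeScript::of)
                .collect(Collectors.groupingBy(
                        Function.identity(),
                        // Explicit type arguments keep type inference simple.
                        () -> new EnumMap<Character.UnicodeScript, Long>(Character.UnicodeScript.class),
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        // Two Latin letters, three Cyrillic letters, and three COMMON code points
        // (comma, space, exclamation mark).
        System.out.println(countScripts("Hi, \u043C\u0438\u0440!"));
    }
}
```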

If you are only interested in the majority vote, that is, the predominant script, you can try this:

Optional<Character.UnicodeScript> predominantScript = 
    "Your Untrusted String Goes Here".codePoints()
        .filter(Character::isValidCodePoint)
        .mapToObj(Character.UnicodeScript::of)
        .filter(s -> s != Character.UnicodeScript.COMMON
            && s != Character.UnicodeScript.INHERITED
            && s != Character.UnicodeScript.UNKNOWN)
        .collect(groupingBy(
            Function.identity(),
            () -> new EnumMap<>(Character.UnicodeScript.class),
            counting()))
        .entrySet()
        .stream()
        .sorted(
            Comparator
            .<Map.Entry<Character.UnicodeScript, Long>>comparingLong(Map.Entry::getValue)
            .reversed()
            .thenComparing(Map.Entry::getKey))
        .map(Map.Entry::getKey)
        .findFirst();

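We filter out Character.UnicodeScript.COMMON, Character.UnicodeScript.INHERITED, and Character.UnicodeScript.UNKNOWN because they are not individual scripts but catch-all categories (per the spec): COMMON covers code points shared across scripts, INHERITED covers code points that take on the script of the character they attach to, and UNKNOWN covers code points that are simply not recognized.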
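
The same majority vote can also be written without sorting the whole entry set. The following is a sketch of an alternative, not the answer's own code: it tallies into an EnumMap and then scans it once. The class name PredominantScriptSketch and the method predominantScript are made up for illustration. Because an EnumMap iterates in enum declaration order and the loop only replaces the current best on a strictly larger count, ties should resolve to the earlier enum constant, as in the sorted version.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of the original answer.
final class PredominantScriptSketch {

    static Optional<Character.UnicodeScript> predominantScript(String text) {
        Map<Character.UnicodeScript, Long> counts = new EnumMap<>(Character.UnicodeScript.class);
        text.codePoints()
                .mapToObj(Character.UnicodeScript::of)
                // Skip the catch-all categories, as in the answer above.
                .filter(s -> s != Character.UnicodeScript.COMMON
                        && s != Character.UnicodeScript.INHERITED
                        && s != Character.UnicodeScript.UNKNOWN)
                .forEach(s -> counts.merge(s, 1L, Long::sum));

        Character.UnicodeScript best = null;
        long bestCount = 0;
        for (Map.Entry<Character.UnicodeScript, Long> e : counts.entrySet()) {
            if (e.getValue() > bestCount) { // strict comparison keeps the earlier key on ties
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        System.out.println(predominantScript("abc \u043F\u0440")); // three Latin letters vs. two Cyrillic
        System.out.println(predominantScript("123 !!"));          // only COMMON code points
    }
}
```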

- sigpwned

- M A · Accepted Answer

Try the following:

Pattern.matches(".*\\p{InCyrillic}.*", text)
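
One caveat with the call above: Pattern.matches must match the entire input, and by default "." does not match line terminators, so multi-line text can produce a false negative. A possible alternative, sketched here with a hypothetical class CyrillicRegexSketch, is to compile the character class once and use find(). Note that this sketch uses \p{IsCyrillic} (match by script) rather than \p{InCyrillic} (match by block), which is a deliberate substitution.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class CyrillicRegexSketch {

    // "Is" selects a Unicode script; "In" (as in the answer) selects a Unicode block.
    private static final Pattern CYRILLIC = Pattern.compile("\\p{IsCyrillic}");

    // find() looks for a match anywhere, so no leading or trailing ".*" is needed.
    static boolean containsCyrillic(String text) {
        return CYRILLIC.matcher(text).find();
    }

    public static void main(String[] args) {
        String multiLine = "line one\n\u043F\u0440";
        System.out.println(containsCyrillic(multiLine));                            // true
        System.out.println(Pattern.matches(".*\\p{InCyrillic}.*", multiLine));      // false
    }
}
```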

You can also avoid regular expressions and use the Character.UnicodeBlock class instead:

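for (int i = 0; i < text.length(); i++) {
    // Constant first: UnicodeBlock.of returns null for characters outside any defined block
    if (Character.UnicodeBlock.CYRILLIC.equals(Character.UnicodeBlock.of(text.charAt(i)))) {
        // contains Cyrillic
        break; // no need to scan further
    }
}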
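
If the text may contain characters outside the Basic Multilingual Plane, the loop can step by code point instead of by char. A minimal sketch, assuming a hypothetical class CyrillicLoopSketch that wraps the check in a method and returns as soon as a match is found:

```java
// Hypothetical helper, not part of the original answer.
final class CyrillicLoopSketch {

    // True if any code point of the text lies in the CYRILLIC block.
    static boolean containsCyrillicBlock(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.UnicodeBlock.CYRILLIC.equals(Character.UnicodeBlock.of(cp))) {
                return true;
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return false;
    }

    public static void main(String[] args) {
        // A surrogate pair (an emoji) followed by a Cyrillic capital letter.
        System.out.println(containsCyrillicBlock("\uD83D\uDE00\u0416"));
        System.out.println(containsCyrillicBlock("abc"));
    }
}
```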