Unicode Character Range Not Being Consumed by Regex


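Note:

A question has already been asked about C# regular expressions with \Uxxxxxxxx in the pattern. This question differs in that it is not about how surrogate pairs are calculated, but about how to express Unicode planes above 0 in a regular expression. It should be clear from my question that I already know why these code points are represented as 2 characters: they are surrogate pairs (which is what the other question asked). My question is how to convert them generically (since I have no control over what the regular expressions supplied to the program look like) so that they can be consumed by the .NET Regex engine.

Note that I now have a way to do this and would like to add my answer to my question, but since it is now marked as a duplicate, I cannot add it.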

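I have some test data that is passed to a Java library I am porting to C#. I have isolated one particular problem case as an example. The original character class in UTF-32 is \U0001BCA0-\U0001BCA3, which .NET cannot readily consume: we get an "Unrecognized escape sequence \U" error message.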

I tried converting it to UTF-16, and I have confirmed that the results for \U0001BCA0 and \U0001BCA3 are what is expected.

UTF-32      | Codepoint   | High Surrogate  | Low Surrogate  | UTF-16
---------------------------------------------------------------------------
0x0001BCA0  | 113824      | 55343           | 56480          | \uD82F\uDCA0
0x0001BCA3  | 113827      | 55343           | 56483          | \uD82F\uDCA3
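
The values in the table can be reproduced with the standard UTF-16 arithmetic. The following is a minimal sketch, assuming the input is a supplementary code point (U+10000 to U+10FFFF); the class and method names (`SurrogateDemo`, `ToSurrogatePair`) are hypothetical and only serve the illustration. It also cross-checks the result against `char.ConvertFromUtf32`.

```csharp
using System;

public static class SurrogateDemo
{
    // Splits a supplementary code point (U+10000..U+10FFFF) into its
    // UTF-16 high and low surrogate.
    public static (char High, char Low) ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;              // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return (high, low);
    }

    public static void Main()
    {
        var (high, low) = ToSurrogatePair(0x1BCA0);
        Console.WriteLine($"\\u{(int)high:X4}\\u{(int)low:X4}"); // \uD82F\uDCA0

        // Cross-check against the framework's own conversion.
        string s = char.ConvertFromUtf32(0x1BCA0);
        Console.WriteLine(s[0] == high && s[1] == low); // True
    }
}
```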

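However, when I pass the string "([\uD82F\uDCA0-\uD82F\uDCA3])" to the constructor of the Regex class, I get the exception "[x-y] range in reverse order".
Although it is quite clear that the characters are specified in the correct order (it works in Java), I tried them in reverse order and got the same error message.
I also tried changing the UTF-32 characters from \U0001BCA0-\U0001BCA3 to \x01BCA0-\x01BCA3, but still got the exception "[x-y] range in reverse order".
So, how do I get the .NET Regex class to successfully parse this character range?
NOTE: I tried changing the code to generate a regex character class that lists every character instead of using ranges, and that seems to work, but it turns my regexes of a few dozen characters into several thousand characters, which surely will not help performance.
Actual regex example:
Again, the above is an isolated sample of what is failing. What I am looking for is a general way to convert regexes such as the one below so that the .NET Regex class can parse them.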
"([\\u0000-\\u0009\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F\\u00AD" +
"\\u061C\\u180E\\u200B\\u200E\\u200F\\u2028-\\u202E\\u2060-\\u206F\\uD800-" +
"\\uDFFF\\uFEFF\\uFFF0-\\uFFFB\\U0001BCA0-\\U0001BCA3\\U0001D173-" +
"\\U0001D17A\\U000E0000-\\U000E001F\\U000E0080-\\U000E00FF\\U000E01F0-\\U000E0FFF] " +
"| [\\u000D] | [\\u000A]) ()"

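This is definitely not a duplicate. The question is not how to calculate surrogate pairs, but how to express Unicode planes above 0 in a regular expression. - Sefe
@WiktorStribiżew - Please reopen my question so that I can add my answer to it. It is not a duplicate of the linked question. - NightOwl888
3 Answers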

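You are assuming that Regex will recognize "\uD82F\uDCA0" as a single compound character. It will not, because .NET strings are internally represented as 16-bit (UTF-16) code units.
Unicode has the concept of a code point, an abstraction that is independent of the physical representation. Depending on the encoding, not every code point can be represented as a single unit. This is very apparent in UTF-8, where every code point above 127 needs two or more bytes. In .NET a char is a 16-bit UTF-16 code unit, which means that code points in planes above 0 need a pair of chars (a surrogate pair). The regex engine, however, still treats each char of such a pair as an individual character.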
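A small sketch of that behavior, assuming nothing beyond the standard `Regex` and `char` APIs: a single code point above plane 0 occupies two chars, and the `.` metacharacter matches each of them separately. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CodeUnitDemo
{
    // Counts how many separate "characters" the regex engine sees
    // in the UTF-16 encoding of a single code point.
    public static int CharsSeenByRegex(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        return Regex.Matches(s, ".", RegexOptions.Singleline).Count;
    }

    public static void Main()
    {
        string s = "\U0001BCA0";
        Console.WriteLine(s.Length);                   // 2
        Console.WriteLine(char.IsHighSurrogate(s[0])); // True
        Console.WriteLine(char.IsLowSurrogate(s[1]));  // True
        Console.WriteLine(CharsSeenByRegex(0x1BCA0));  // 2
        Console.WriteLine(CharsSeenByRegex('A'));      // 1
    }
}
```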
Long story short: do not treat the char combinations as code points; treat them as individual characters. In your case the regex would therefore be:
using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        var regex = new Regex("(\uD82F[\uDCA0-\uDCA3])");
        Console.WriteLine(regex.Match("\uD82F\uDCA2").Success);
    }
}


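What do I do for ranges that do not share the same high surrogate? Again, this example is an isolated case. My actual strings contain many character classes that specify ranges of code points. - NightOwl888
If you need to do this in a regex, you have to split the range into sub-ranges and combine them with (range1|range2). If a non-regex solution is acceptable, you can convert the text to binary with Encoding.UTF32 and search for the code points there. Note that the regex solution needs at most 3 sub-ranges per code point range. - Sefe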
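
The sub-range splitting described in the comment above can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: both bounds are assumed to be supplementary code points with `first <= last`, and `RangeSplitDemo`, `ToUtf16Alternation`, and `Esc` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class RangeSplitDemo
{
    // Formats one UTF-16 code unit as a regex escape such as \uD82F.
    private static string Esc(int unit) => "\\u" + unit.ToString("X4");

    // Builds an alternation of at most 3 sub-ranges that matches every
    // code point in [first, last] (both assumed to be above U+FFFF).
    public static string ToUtf16Alternation(int first, int last)
    {
        if (first < 0x10000 || last > 0x10FFFF || first > last)
            throw new ArgumentOutOfRangeException();

        string a = char.ConvertFromUtf32(first);
        string b = char.ConvertFromUtf32(last);
        int hi1 = a[0], lo1 = a[1], hi2 = b[0], lo2 = b[1];
        var parts = new List<string>();

        if (hi1 == hi2)
        {
            // Same high surrogate: one range over the low surrogates.
            parts.Add($"{Esc(hi1)}[{Esc(lo1)}-{Esc(lo2)}]");
        }
        else
        {
            // 1. Remainder of the first high surrogate's block.
            parts.Add($"{Esc(hi1)}[{Esc(lo1)}-\\uDFFF]");
            // 2. All complete blocks in between, if any.
            if (hi2 - hi1 > 1)
                parts.Add($"[{Esc(hi1 + 1)}-{Esc(hi2 - 1)}][\\uDC00-\\uDFFF]");
            // 3. Beginning of the last high surrogate's block.
            parts.Add($"{Esc(hi2)}[\\uDC00-{Esc(lo2)}]");
        }
        return "(?:" + string.Join("|", parts) + ")";
    }

    public static void Main()
    {
        Console.WriteLine(ToUtf16Alternation(0x1BCA0, 0x1BCA3));
        // (?:\uD82F[\uDCA0-\uDCA3])
        Console.WriteLine(ToUtf16Alternation(0x11DEF, 0x13E07));
        // (?:\uD807[\uDDEF-\uDFFF]|[\uD808-\uD80E][\uDC00-\uDFFF]|\uD80F[\uDC00-\uDE07])
    }
}
```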


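Strings in C# are UTF-16 encoded. That is why the character class in this regex is interpreted as:

  • the symbol '\uD82F', or
  • the range \uDCA0-\uD82F, or
  • the symbol '\uDCA3'

The range \uDCA0-\uD82F is obviously invalid, because its start is greater than its end, and it causes the "[x-y] range in reverse order" exception.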
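
The failure can be reproduced with a short sketch like the one below. The helper name `TryParse` is hypothetical; the patterns are verbatim strings, so the regex engine itself resolves the \u escapes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReverseRangeDemo
{
    // Returns true if the .NET Regex class can parse the pattern.
    public static bool TryParse(string pattern)
    {
        try
        {
            _ = new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            // Regex parse errors surface as ArgumentException
            // (or a type derived from it).
            return false;
        }
    }

    public static void Main()
    {
        // The class contains the reversed range \uDCA0-\uD82F.
        Console.WriteLine(TryParse(@"([\uD82F\uDCA0-\uD82F\uDCA3])")); // False
        // High surrogate outside the class, low surrogates as a range.
        Console.WriteLine(TryParse(@"(\uD82F[\uDCA0-\uDCA3])"));       // True
    }
}
```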

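Unfortunately, there is no simple solution to your problem, because it is caused by the nature of C# strings. You cannot put a UTF-32 symbol into a single C# char, and you cannot use multi-char strings as range boundaries.

A possible workaround is a semi-regex solution: extract these symbols from the string and compare them in plain C# code. It looks ugly, of course, but I do not see another way to do this with raw regex in C#.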
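
One way such a plain C# comparison might look is sketched below: walk the string code point by code point and test each value against the range numerically. `CodePointScanDemo` and `FindInRange` are hypothetical names, and the sketch treats unpaired surrogates as ordinary chars.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointScanDemo
{
    // Returns the start index of every code point in text whose value
    // lies in the inclusive range [first, last].
    public static List<int> FindInRange(string text, int first, int last)
    {
        var hits = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            int start = i;
            int codePoint;
            if (char.IsSurrogatePair(text, i))
            {
                codePoint = char.ConvertToUtf32(text, i);
                i++; // skip the low surrogate
            }
            else
            {
                codePoint = text[i]; // BMP char or unpaired surrogate
            }

            if (codePoint >= first && codePoint <= last)
                hits.Add(start);
        }
        return hits;
    }

    public static void Main()
    {
        string text = "a\U0001BCA2b\U0001D173";
        List<int> hits = FindInRange(text, 0x1BCA0, 0x1BCA3);
        Console.WriteLine(string.Join(",", hits)); // 1
    }
}
```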


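Thanks. At least now I have a reasonable explanation of why this happens. However, the whole point of the code is that it builds regexes from a set of file-driven rules and then compares them against the production code to ensure it behaves the same way. I will have to think about how best to deal with this. - NightOwl888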

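Although the other answers provided some clues, I needed an actual answer. My test is a rule engine driven by regexes built from file input, so hard-coding the logic into C# is not an option.
However, I learned here that:
1. The .NET Regex class does not support surrogate pairs.
2. Support for surrogate pair ranges can be faked by using regex alternation.
But of course, in my data-driven case I cannot change the regexes by hand into a form .NET accepts; I need to automate it. So I created the Utf32Regex class below, which accepts UTF-32 characters directly in its constructor and internally converts them into a regex that .NET understands.
For example, it converts the regex: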
"[abc\\U00011DEF-\\U00013E07]"

to

"(?:[abc]|\\uD807[\\uDDEF-\\uDFFF]|[\\uD808-\\uD80E][\\uDC00-\\uDFFF]|\\uD80F[\\uDC00-\\uDE07])"

or

"([\\u0000-\\u0009\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F\\u00AD" +
"\\u061C\\u180E\\u200B\\u200E\\u200F\\u2028-\\u202E\\u2060-\\u206F\\uD800-" +
"\\uDFFF\\uFEFF\\uFFF0-\\uFFFB\\U0001BCA0-\\U0001BCA3\\U0001D173-" +
"\\U0001D17A\\U000E0000-\\U000E001F\\U000E0080-\\U000E00FF\\U000E01F0-\\U000E0FFF] " +
"| [\\u000D] | [\\u000A]) ()"

to

"((?:[\\u0000-\\u0009\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F\\u00AD\\u061C\\u180E" + 
"\\u200B\\u200E\\u200F\\u2028-\\u202E\\u2060-\\u206F\\uD800-\\uDFFF\\uFEFF\\uFFF0-\\uFFFB]|" + 
"\\uD82F[\\uDCA0-\\uDCA3]|\\uD834[\\uDD73-\\uDD7A]|\\uDB40[\\uDC00-\\uDC1F]|" + 
"\\uDB40[\\uDC80-\\uDCFF]|\\uDB40[\\uDDF0-\\uDFFF]|[\\uDB41-\\uDB42][\\uDC00-\\uDFFF]|" + 
"\\uDB43[\\uDC00-\\uDFFF]) | [\\u000D] | [\\u000A]) ()"

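As a quick sanity check, the first converted pattern quoted above can be exercised at its boundaries with the plain Regex class. This is only a sketch: `ConvertedPatternCheck` is a hypothetical name, and the `^`/`$` anchors are added here for whole-string tests.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConvertedPatternCheck
{
    // Converted form of "[abc\U00011DEF-\U00013E07]", anchored so that
    // the whole input must be exactly one match.
    private static readonly Regex Converted = new Regex(
        @"^(?:[abc]|\uD807[\uDDEF-\uDFFF]|[\uD808-\uD80E][\uDC00-\uDFFF]|\uD80F[\uDC00-\uDE07])$");

    public static bool Matches(int codePoint) =>
        Converted.IsMatch(char.ConvertFromUtf32(codePoint));

    public static void Main()
    {
        Console.WriteLine(Matches('a'));     // True
        Console.WriteLine(Matches(0x11DEE)); // False (one below the range)
        Console.WriteLine(Matches(0x11DEF)); // True  (first code point)
        Console.WriteLine(Matches(0x12345)); // True  (middle of the range)
        Console.WriteLine(Matches(0x13E07)); // True  (last code point)
        Console.WriteLine(Matches(0x13E08)); // False (one above the range)
    }
}
```
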
Utf32Regex.cs

using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

/// <summary>
/// Patches the <see cref="Regex"/> class so it will automatically convert and interpret
/// UTF32 characters expressed like <c>\U00010000</c> or UTF32 ranges expressed
/// like <c>\U00010000-\U00010001</c>.
/// </summary>
public class Utf32Regex : Regex
{
    private const char MinLowSurrogate = '\uDC00';
    private const char MaxLowSurrogate = '\uDFFF';

    private const char MinHighSurrogate = '\uD800';
    private const char MaxHighSurrogate = '\uDBFF';

    // Match any character class such as [A-z]
    private static readonly Regex characterClass = new Regex(
        "(?<!\\\\)(\\[.*?(?<!\\\\)\\])",
        RegexOptions.Compiled);

    // Match a UTF32 range such as \U000E01F0-\U000E0FFF
    // or an individual character such as \U000E0FFF
    private static readonly Regex utf32Range = new Regex(
        "(?<begin>\\\\U(?:00)?[0-9A-Fa-f]{6})-(?<end>\\\\U(?:00)?[0-9A-Fa-f]{6})|(?<begin>\\\\U(?:00)?[0-9A-Fa-f]{6})",
        RegexOptions.Compiled);

    public Utf32Regex()
        : base()
    {
    }

    public Utf32Regex(string pattern)
        : base(ConvertUTF32Characters(pattern))
    {
    }

    public Utf32Regex(string pattern, RegexOptions options)
        : base(ConvertUTF32Characters(pattern), options)
    {
    }

    public Utf32Regex(string pattern, RegexOptions options, TimeSpan matchTimeout)
        : base(ConvertUTF32Characters(pattern), options, matchTimeout)
    {
    }

    private static string ConvertUTF32Characters(string regexString)
    {
        StringBuilder result = new StringBuilder();
        // Convert any UTF32 character ranges \U00000000-\U00FFFFFF to their
        // equivalent UTF16 characters
        ConvertUTF32CharacterClassesToUTF16Characters(regexString, result);
        // Now find all of the individual characters that were not in ranges and
        // fix those as well.
        ConvertUTF32CharactersToUTF16(result);

        return result.ToString();
    }

    private static void ConvertUTF32CharacterClassesToUTF16Characters(string regexString, StringBuilder result)
    {
        Match match = characterClass.Match(regexString); // Reset
        int lastEnd = 0;
        if (match.Success)
        {
            do
            {
                // Named classText so the local does not shadow the static characterClass field.
                string classText = match.Groups[1].Value;
                string convertedCharacterClass = ConvertUTF32CharacterRangesToUTF16Characters(classText);

                result.Append(regexString.Substring(lastEnd, match.Index - lastEnd)); // Remove the match
                result.Append(convertedCharacterClass); // Append replacement 

                lastEnd = match.Index + match.Length;
            } while ((match = match.NextMatch()).Success);
        }
        result.Append(regexString.Substring(lastEnd)); // Append tail
    }

    private static string ConvertUTF32CharacterRangesToUTF16Characters(string characterClass)
    {
        StringBuilder result = new StringBuilder();
        StringBuilder chars = new StringBuilder();

        Match match = utf32Range.Match(characterClass); // Reset
        int lastEnd = 0;
        if (match.Success)
        {
            do
            {
                string utf16Chars;
                string rangeBegin = match.Groups["begin"].Value.Substring(2);

                if (!string.IsNullOrEmpty(match.Groups["end"].Value))
                {
                    string rangeEnd = match.Groups["end"].Value.Substring(2);
                    utf16Chars = UTF32RangeToUTF16Chars(rangeBegin, rangeEnd);
                }
                else
                {
                    utf16Chars = UTF32ToUTF16Chars(rangeBegin);
                }

                result.Append(characterClass.Substring(lastEnd, match.Index - lastEnd)); // Remove the match
                chars.Append(utf16Chars); // Append replacement 

                lastEnd = match.Index + match.Length;
            } while ((match = match.NextMatch()).Success);
        }
        result.Append(characterClass.Substring(lastEnd)); // Append tail of character class

        // Special case - if we have removed all of the contents of the
        // character class, we need to remove the square brackets and the
        // alternation character |
        int emptyCharClass = result.IndexOf("[]");
        if (emptyCharClass >= 0 && chars.Length > 0) // guard: nothing to strip if no UTF32 escapes were found
        {
            result.Remove(emptyCharClass, 2);
            // Append replacement ranges (exclude beginning |)
            result.Append(chars.ToString(1, chars.Length - 1));
        }
        else
        {
            // Append replacement ranges
            result.Append(chars.ToString());
        }

        if (chars.Length > 0)
        {
            // Wrap both the character class and any UTF16 character alternation into
            // a non-capturing group.
            return "(?:" + result.ToString() + ")";
        }
        return result.ToString();
    }

    private static void ConvertUTF32CharactersToUTF16(StringBuilder result)
    {
        while (true)
        {
            int where = result.IndexOf("\\U00");
            if (where < 0)
            {
                break;
            }
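            // Convert the 8 hex digits after \U. No leading '|' is added here, because
            // this escape sits outside a character class; the pair is grouped so that
            // a following quantifier applies to the whole code point.
            int codePoint = int.Parse(result.ToString(where + 2, 8), NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            var cp = new StringBuilder("(?:");
            AppendUTF16CodePoint(cp, codePoint);
            cp.Append(')');
            result.Replace(where, where + 10, cp.ToString());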
        }
    }

    private static string UTF32RangeToUTF16Chars(string hexBegin, string hexEnd)
    {
        var result = new StringBuilder();
        int beginCodePoint = int.Parse(hexBegin, NumberStyles.HexNumber);
        int endCodePoint = int.Parse(hexEnd, NumberStyles.HexNumber);

        var beginChars = char.ConvertFromUtf32(beginCodePoint);
        var endChars = char.ConvertFromUtf32(endCodePoint);
        int beginDiff = endChars[0] - beginChars[0];

        if (beginDiff == 0)
        {
            // If the begin character is the same, we can just use the syntax \uD807[\uDDEF-\uDFFF]
            result.Append("|");
            AppendUTF16Character(result, beginChars[0]);
            result.Append('[');
            AppendUTF16Character(result, beginChars[1]);
            result.Append('-');
            AppendUTF16Character(result, endChars[1]);
            result.Append(']');
        }
        else
        {
            // If the begin character is not the same, create 3 ranges
            // 1. The remainder of the first
            // 2. A range of all of the middle characters
            // 3. The beginning of the last

            result.Append("|");
            AppendUTF16Character(result, beginChars[0]);
            result.Append('[');
            AppendUTF16Character(result, beginChars[1]);
            result.Append('-');
            AppendUTF16Character(result, MaxLowSurrogate);
            result.Append(']');

            // We only need a middle range if the ranges are not adjacent
            if (beginDiff > 1)
            {
                result.Append("|");
                // We only need a character class if there are more than 1
                // characters in the middle range
                if (beginDiff > 2)
                {
                    result.Append('[');
                }
                AppendUTF16Character(result, (char)(Math.Min(beginChars[0] + 1, MaxHighSurrogate)));
                if (beginDiff > 2)
                {
                    result.Append('-');
                    AppendUTF16Character(result, (char)(Math.Max(endChars[0] - 1, MinHighSurrogate)));
                    result.Append(']');
                }
                result.Append('[');
                AppendUTF16Character(result, MinLowSurrogate);
                result.Append('-');
                AppendUTF16Character(result, MaxLowSurrogate);
                result.Append(']');
            }

            result.Append("|");
            AppendUTF16Character(result, endChars[0]);
            result.Append('[');
            AppendUTF16Character(result, MinLowSurrogate);
            result.Append('-');
            AppendUTF16Character(result, endChars[1]);
            result.Append(']');
        }
        return result.ToString();
    }

    private static string UTF32ToUTF16Chars(string hex)
    {
        int codePoint = int.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return UTF32ToUTF16Chars(codePoint);
    }

    private static string UTF32ToUTF16Chars(int codePoint)
    {
        StringBuilder result = new StringBuilder();
        UTF32ToUTF16Chars(codePoint, result);
        return result.ToString();
    }

    private static void UTF32ToUTF16Chars(int codePoint, StringBuilder result)
    {
        // Use regex alternation for each UTF32 code point so that its
        // surrogate pair is matched as a unit.
        result.Append("|");
        AppendUTF16CodePoint(result, codePoint);
    }

    private static void AppendUTF16CodePoint(StringBuilder text, int cp)
    {
        var chars = char.ConvertFromUtf32(cp);
        AppendUTF16Character(text, chars[0]);
        if (chars.Length == 2)
        {
            AppendUTF16Character(text, chars[1]);
        }
    }

    private static void AppendUTF16Character(StringBuilder text, char c)
    {
        text.Append(@"\u");
        // Always emit 4 hex digits; \u escapes with fewer digits are not valid.
        text.Append(((int)c).ToString("X4", CultureInfo.InvariantCulture));
    }
}

StringBuilderExtensions.cs

using System;
using System.Text;

public static class StringBuilderExtensions
{
    /// <summary>
    /// Searches for the first index of the specified character. The search for
    /// the character starts at the beginning and moves towards the end.
    /// </summary>
    /// <param name="text">This <see cref="StringBuilder"/>.</param>
    /// <param name="value">The string to find.</param>
    /// <returns>The index of the specified character, or -1 if the character isn't found.</returns>
    public static int IndexOf(this StringBuilder text, string value)
    {
        return IndexOf(text, value, 0);
    }

    /// <summary>
    /// Searches for the index of the specified character. The search for the
    /// character starts at the specified offset and moves towards the end.
    /// </summary>
    /// <param name="text">This <see cref="StringBuilder"/>.</param>
    /// <param name="value">The string to find.</param>
    /// <param name="startIndex">The starting offset.</param>
    /// <returns>The index of the specified character, or -1 if the character isn't found.</returns>
    public static int IndexOf(this StringBuilder text, string value, int startIndex)
    {
        if (text == null)
            throw new ArgumentNullException("text");
        if (value == null)
            throw new ArgumentNullException("value");

        int index;
        int length = value.Length;
        int maxSearchLength = (text.Length - length) + 1;

        for (int i = startIndex; i < maxSearchLength; ++i)
        {
            if (text[i] == value[0])
            {
                index = 1;
                while ((index < length) && (text[i + index] == value[index]))
                    ++index;

                if (index == length)
                    return i;
            }
        }

        return -1;
    }

    /// <summary>
    /// Replaces the specified subsequence in this builder with the specified
    /// string.
    /// </summary>
    /// <param name="text">this builder.</param>
    /// <param name="start">the inclusive begin index.</param>
    /// <param name="end">the exclusive end index.</param>
    /// <param name="str">the replacement string.</param>
    /// <returns>this builder.</returns>
    /// <exception cref="IndexOutOfRangeException">
    /// if <paramref name="start"/> is negative, greater than the current
    /// <see cref="StringBuilder.Length"/> or greater than <paramref name="end"/>.
    /// </exception>
    /// <exception cref="ArgumentNullException">if <paramref name="str"/> is <c>null</c>.</exception>
    public static StringBuilder Replace(this StringBuilder text, int start, int end, string str)
    {
        if (str == null)
        {
            throw new ArgumentNullException(nameof(str));
        }
        if (start >= 0)
        {
            if (end > text.Length)
            {
                end = text.Length;
            }
            if (end > start)
            {
                int stringLength = str.Length;
                int diff = end - start - stringLength;
                if (diff > 0)
                { // replacing with fewer characters
                    text.Remove(start, diff);
                }
                else if (diff < 0)
                {
                    // replacing with more characters...need some room
                    text.Insert(start, new char[-diff]);
                }
                // copy the chars based on the new length
                for (int i = 0; i < stringLength; i++)
                {
                    text[i + start] = str[i];
                }
                return text;
            }
            if (start == end)
            {

                text.Insert(start, str);
                return text;
            }
        }
        throw new IndexOutOfRangeException();
    }
}

Note that this is not very well tested and is probably not very robust, but it should be fine for testing purposes.
