How can I split text into lines based on a regular expression?

Question

4

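I have some fragments of text that I want to split into lines. The problem is that the text is already formatted, so I can't split it the usual way, like this: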

 _text = text.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .ToArray();

Here is the sample text:

 adj 1: around the middle of a scale of evaluation of physical
        measures; "an orange of average size"; "intermediate
        capacity"; "a plane with intermediate range"; "medium
        bombers" [syn: {average}, {intermediate}]
 2: (of meat) cooked until there is just a little pink meat
    inside
 n 1: a means or instrumentality for storing or communicating
      information
 2: the surrounding environment; "fish require an aqueous
    medium"
 3: an intervening substance through which signals can travel as
    a means for communication
 4: (bacteriology) a nutrient substance (solid or liquid) that
    is used to cultivate micro-organisms [syn: {culture medium}]
 5: an intervening substance through which something is
    achieved; "the dissolving medium is called a solvent"
 6: a liquid with which pigment is mixed by a painter
 7: (biology) a substance in which specimens are preserved or
    displayed
 8: a state that is intermediate between extremes; a middle
    position; "a happy medium"

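The format is always the same:

Optionally 1-3 letters
A number from 1 to 10
A colon
A space
Text that may be spread over several lines.

So in this case, a line break has to occur wherever there is an optional word of 1-3 characters, followed by a number of 1-2 digits, followed by a colon.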

Can anyone give me some advice? How can I do this with split or some other method?

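Update: I tried Steven's answer, but I'm not sure how to fit it into my function. Below is my original code together with Steven's suggestion; there is one part I'm not sure about.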

    public parser(string text)
    {
        //_text = text.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries)
            // .ToArray();

        string pattern = @"(\w{1,3} )?1?\d: (?<line>[^\r\n]+)(\r?\n\s+(?<line>[^\r\n]+))*";
        foreach (Match m in Regex.Matches(text, pattern))
        {
            if (m.Success)
            {
                string entry = string.Join(Environment.NewLine,
                    m.Groups["line"].Captures.Cast<Capture>().Select(x => x.Value));
                // ...
            }
        }
    }

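For testing purposes, here is the text in a different format:

“medium adj 1: around the middle of a scale of evaluation of physical measures; "an orange of average size"; "intermediate capacity"; "a plane with intermediate range"; "medium bombers" [syn: {average}, {intermediate}] 2: (of meat) cooked until there is just a little pink meat inside n 1: a means or instrumentality for storing or communicating information 2: the surrounding environment; "fish require an aqueous medium" 3: an intervening substance through which signals can travel as a means for communication 4: (bacteriology) a nutrient substance (solid or liquid) that is used to cultivate micro-organisms [syn: {culture medium}] 5: an intervening substance through which something is achieved; "the dissolving medium is called a solvent" 6: a liquid with which pigment is mixed by a painter 7: (biology) a substance in which specimens are preserved or displayed 8: a state that is intermediate between extremes; a middle position; "a happy medium" 9: someone who serves as an intermediary between the living and the dead; "he consulted several mediums" [syn: {spiritualist}] 10: transmissions that are disseminated widely to the public [syn: {mass medium}] 11: an occupation for which you are especially well suited; "in law he found his true metier" [syn: {metier}] [also: {media} (pl)]”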

- Samantha J T Star

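I don't think a regex is the best choice here; just do it the hard way. - pm100

You have already defined the business rules. Also, given limited experience with regular expressions, and for the sake of maintainability, I would suggest simply writing logic that parses the text line by line and checks for those conditions. - Myrtle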
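
The line-by-line approach suggested in these comments might look something like the sketch below. It is not from the original thread: `LineByLineParser`, `IsEntryStart` and `SplitEntries` are hypothetical names, and the check assumes the format described in the question (an optional word of 1-3 letters, a number of 1-2 digits, a colon, then a space).

```csharp
using System;
using System.Collections.Generic;

static class LineByLineParser
{
    // True when the trimmed line looks like "adj 1: ..." or "2: ...":
    // an optional 1-3 letter word plus a space, then 1-2 digits and a colon.
    public static bool IsEntryStart(string line)
    {
        int i = 0;
        while (i < line.Length && char.IsLetter(line[i])) i++;
        if (i > 3) return false;                 // word too long to be a prefix
        if (i > 0)
        {
            if (i >= line.Length || line[i] != ' ') return false;
            i++;                                 // skip the space after the prefix
        }
        int digitStart = i;
        while (i < line.Length && char.IsDigit(line[i])) i++;
        int digits = i - digitStart;
        if (digits < 1 || digits > 2) return false;
        // The colon must be followed by a space or end the line.
        return i < line.Length && line[i] == ':'
            && (i + 1 == line.Length || line[i + 1] == ' ');
    }

    // Joins each entry's wrapped lines into a single string.
    public static List<string> SplitEntries(string text)
    {
        var entries = new List<string>();
        foreach (string raw in text.Split('\n'))
        {
            string line = raw.Trim();            // also drops a trailing '\r'
            if (line.Length == 0) continue;
            if (IsEntryStart(line) || entries.Count == 0)
                entries.Add(line);               // start of a new entry
            else
                entries[entries.Count - 1] += " " + line; // continuation line
        }
        return entries;
    }
}
```

Calling `LineByLineParser.SplitEntries(text).ToArray()` would then give one array element per entry. One known limitation of this sketch: a wrapped line that happens to begin with something like `2: ` would be mistaken for a new entry.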

2 Answers

2

Try this

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;

namespace ConsoleApplication106
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.txt";
        static void Main(string[] args)
        {
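            string inputLine = "";
            List<Data> data = new List<Data>();
            // Anchored with ^ so only lines that START with "prefix index:" open a new entry.
            string pattern = @"^(?'prefix'\w*)?\s*?(?'index'\d+):(?'text'.*)";
            // "using" makes sure the file is closed even if parsing throws.
            using (StreamReader reader = new StreamReader(FILENAME))
            {
                while ((inputLine = reader.ReadLine()) != null)
                {
                    inputLine = inputLine.Trim();
                    Match match = Regex.Match(inputLine, pattern);
                    if (match.Success)
                    {
                        // First line of a new entry.
                        Data newData = new Data();
                        data.Add(newData);
                        newData.prefix = match.Groups["prefix"].Value;
                        newData.index = int.Parse(match.Groups["index"].Value);
                        newData.text = match.Groups["text"].Value.Trim();
                    }
                    else if (inputLine.Length > 0 && data.Count > 0)
                    {
                        // Wrapped line: append it to the previous entry instead of
                        // calling int.Parse on an empty group, which would throw.
                        data[data.Count - 1].text += " " + inputLine;
                    }
                }
            }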
        }
    }
    public class Data
    {
        public string prefix { get; set; }
        public int index { get; set; }
        public string text { get; set; }
    }
}

- jdweng


- Steven Doggart · Accepted Answer

2

A regular expression works very well for this. For example:

public parser(string text)
{
    string pattern = @"(?<line> (\w{1,3} )?1?\d: [^\r\n]+)(\r?\n(?! (\w{1,3} )?1?\d: [^\r\n]+)\s+(?<line>[^\r\n]+))*";
    var entries = new List<string>();
    foreach (Match m in Regex.Matches(text, pattern))
        if(m.Success)
            entries.Add(string.Join(" ", 
                m.Groups["line"].Captures.Cast<Capture>().Select(x=>x.Value)));
    _text = entries.ToArray();
}

- Steven Doggart

1

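I wasn't splitting on it. I just left that part in so you could see where my original code was :-) - Samantha J T Star

Ah, sorry. So _text is a string array, right? What should it be set to? The first line of every entry, or all of the entries, or something else? I'm confused about the output you want. - Steven Doggart

private string[] _text; It is a local array in which each element holds one line. I hope that makes sense :-) It would probably be cleaner if I just returned _text from the function. - Samantha J T Star

In this example, the first element of the array should contain: adj 1: around the middle of a scale of evaluation of physical measures; "an orange of average size"; "intermediate capacity"; "a plane with intermediate range"; "medium bombers" [syn: {average}, {intermediate}] - Samantha J T Star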

1

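The reason it didn't work is that, on the first line of each entry, there is a space before the three-letter word and the number. Going by your description of the format, I had assumed that was not the case. My pattern relied on that space not being there to determine where each entry ends. It can still be done with the space present, it is just a bit more challenging. I've updated my answer to show one way of doing it. - Steven Doggart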
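
As an alternative way to cope with the leading space, here is a minimal sketch that splits on line breaks with a lookahead rather than matching whole entries. It is not the pattern from the answer: `EntrySplitter` and `Boundary` are hypothetical names, and it assumes that every entry starts on its own line, possibly indented, in the format described in the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class EntrySplitter
{
    // Split at every line break that is followed by optional indentation,
    // an optional 1-3 letter word plus a space, 1-2 digits, a colon and a space.
    // The lookahead (?=...) consumes nothing, so the marker stays in the next entry.
    // Only non-capturing groups are used, because Regex.Split would otherwise
    // add captured text to the result array.
    private static readonly Regex Boundary =
        new Regex(@"\r?\n(?=[ \t]*(?:[A-Za-z]{1,3} )?\d{1,2}: )");

    public static string[] Split(string text)
    {
        return Boundary.Split(text)
            // Collapse the line breaks and indentation inside each entry.
            .Select(e => Regex.Replace(e, @"\s+", " ").Trim())
            .Where(e => e.Length > 0)
            .ToArray();
    }
}
```

With this, the constructor body could be reduced to `_text = EntrySplitter.Split(text);`. Because indentation is optional in the lookahead, the sketch does not depend on whether the first line of an entry has a leading space.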
