Regex performance degradation

Question

6

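I'm writing a C# application that runs several regular expressions (about 10) over a large amount of text (about 25 million strings). I tried Googling this, but every search for regex plus "slow down" turns up tutorials about how backreferences and similar features slow a regex down. I don't think that is my problem, because my regexes start out very fast and only become slow towards the end.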

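For the first million strings, running the regexes takes about 60 ms per 1,000 strings. By the end, that has slowed to about 600 ms per 1,000 strings. Does anyone know why?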

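I have already switched to Regex instances instead of the static cached methods, and compiled the expressions that can be compiled. Both changes improved performance.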

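However, some of my regexes need to vary with the user's name, for example mike said (\w*) or john said (\w*). My understanding is that it isn't possible to compile such a regex and pass in a parameter (e.g. saidRegex.Match(inputString, userName)).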

Any suggestions?

- mike1952

6

Can you post some code? - Dave Bish

17

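It's unlikely that the regex engine itself is slowing down. More likely, your application is saving the results, so its memory keeps growing, and that degrades overall performance. Monitor the process's memory size, and check for memory leaks. - Barmar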
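One way to act on this suggestion, as a rough sketch only: time each batch and print memory figures next to the timing, so that a slowdown can be compared with memory growth. `BatchMonitor` and `TimeBatch` are made-up names for illustration, and the dummy workload in `Main` merely stands in for "run the regexes over 1,000 strings".

```csharp
using System;
using System.Diagnostics;

public static class BatchMonitor
{
    // Hypothetical helper: runs one batch, then reports elapsed time and memory use.
    public static long TimeBatch(Action processBatch)
    {
        var stopwatch = Stopwatch.StartNew();
        processBatch();
        stopwatch.Stop();

        long heapBytes = GC.GetTotalMemory(false);                   // managed heap only
        long workingSet = Process.GetCurrentProcess().WorkingSet64;  // whole process
        Console.WriteLine("{0} ms, heap {1:N0} bytes, working set {2:N0} bytes",
            stopwatch.ElapsedMilliseconds, heapBytes, workingSet);
        return stopwatch.ElapsedMilliseconds;
    }

    public static void Main()
    {
        // Placeholder workload; replace with the real per-batch regex loop.
        TimeBatch(() =>
        {
            for (int i = 0; i < 1000; i++) { string s = i.ToString(); }
        });
    }
}
```

If the heap or working-set figure climbs steadily while the per-batch time grows, retained results are a more likely culprit than the regex engine.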

1

Also, how did you determine that the regex itself is what is slowing down? Is anything else in the loop a possible cause, such as retrieving the "current" string? - GalacticCowboy

My worst regex is created like this:

var myRegex = new Regex(string.Format("{0}.*(?:and spent|and paid).*[\\$£](\\d+[\\.,]?\\d{{0,2}})", Regex.Escape(playerName)), RegexOptions.None);

I run it against strings like this: "Mike went into the supermarked and spent $1.57". I want to know whether it was Mike who spent the money, and how much he spent. - mike1952

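Sorry, to be clearer: by the time I run this regex, I already know which players are in the game the string came from. That means there are only 2-3 players, whose names I know, to compare against a given string. I tried searching for (\\w*)(?:and spent|and paid).*[\\$£](\\d+[\\.,]?\\d{{0,2}}) and then comparing the first capture group against the known names, but that was noticeably slower. - mike1952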

2 Answers

0

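Regular expressions take time to evaluate. However, you can use a few tricks to cut that time down. You can also use C#'s string functions to avoid regex calls altogether.

The code may get longer, but performance may improve. String has several functions for slicing and extracting characters and for the kind of pattern matching you need, for example IndexOfAny, LastIndexOf, Contains...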

string str= "mon";
string[] str2= new string[] {"mon","tue","wed"};

// Array.IndexOf returns -1 when str is not an element of str2.
if (Array.IndexOf(str2, str) >= 0)
{
  //success code//
}
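
As a rough illustration of the idea above, a cheap string check can rule out most lines before any regex runs. This is only a sketch: the `MightMention` helper and the sample input are made up for illustration, and it assumes the player name and the phrases "and spent" / "and paid" must appear literally in a matching line.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PreFilterDemo
{
    // Hypothetical helper: cheap ordinal substring tests that run before the regex.
    public static bool MightMention(string line, string playerName)
    {
        return line.IndexOf(playerName, StringComparison.Ordinal) >= 0
            && (line.Contains("and spent") || line.Contains("and paid"));
    }

    public static void Main()
    {
        string playerName = "Mike";
        var regex = new Regex(
            Regex.Escape(playerName) + @".*(?:and spent|and paid).*[\$£](\d+[\.,]?\d{0,2})");

        string line = "Mike went into the shop and spent $1.57";

        // Only lines that pass the cheap check pay for the regex match.
        if (MightMention(line, playerName))
        {
            Match match = regex.Match(line);
            if (match.Success)
                Console.WriteLine(match.Groups[1].Value); // the captured amount
        }
    }
}
```

Whether this helps depends on how many lines the pre-check rejects, so it is worth measuring on real data.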

- Arshad

- Troy Alford · Accepted Answer

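This may not be a direct answer to why your regex performance degrades, which is somewhat fascinating. However, having read all of the comments and discussion above, I'd suggest the following: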

Parse the data once, splitting the matched data out into a database table. It looks like you are trying to capture the following fields:

Player_Name | Monetary_Value

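If you create a database table with those values in each row, and then catch each new row as it is created, parse it, and append it to the table, you can easily run any kind of analysis or calculation against the data without having to parse 25 million rows over and over again (which is a waste).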

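Additionally, on the first run, if you split the 25 million records into blocks of 100,000 and run the algorithm 250 times (100,000 x 250 = 25,000,000), you get the performance you described at the start without the slowdown, because you are chunking the job.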

In other words, consider the following:

Create a database table as follows:

CREATE TABLE PlayerActions (
    RowID          INT PRIMARY KEY IDENTITY,
    Player_Name    VARCHAR(50) NOT NULL,
    Monetary_Value MONEY       NOT NULL
)

Create an algorithm that breaks your 25m rows down into 100k chunks. Example using LINQ / EF5 as an assumption.

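public void ParseFullDataSet(IEnumerable<String> dataSource) {
    const int chunkSize = 100000;

    // Count() and Skip() re-enumerate a lazy IEnumerable on every call;
    // materialize it first (e.g. ToList()) if the source is not already a list.
    var rowCount = dataSource.Count();

    // Integer ceiling division: the number of chunks needed to cover every row.
    var setCount = (rowCount + chunkSize - 1) / chunkSize;

    for (int i = 0; i < setCount; i++) {
        var set = dataSource.Skip(i * chunkSize).Take(chunkSize);
        ParseSet(set);
    }
}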

public void ParseSet(IEnumerable<String> dataSource) {
    String playerName = String.Empty;
    decimal monetaryValue = 0.0m;

    // Assume here that the method reflects your RegEx generator.
    String regex = RegexFactory.Generate();

    foreach (String data in dataSource) {
        Match match = Regex.Match(data, regex);
        if (match.Success) {
            playerName = match.Groups[1].Value;

            // Might want to add error handling here.
            monetaryValue = Convert.ToDecimal(match.Groups[2].Value);

            db.PlayerActions.Add(new PlayerAction() {
                // ID = ..., // Set at DB layer using Auto_Increment
                Player_Name = playerName,
                Monetary_Value = monetaryValue
            });
            db.SaveChanges();

            // If not using Entity Framework, use another method to insert
            // a row to your database table.
        }
    }
}

Run the above one time to get all of your pre-existing data loaded up.
Create a hook someplace which allows you to detect the addition of a new row. Every time a new row is created, call:
```
ParseSet(new List<String>() { newValue });
```
or if multiples are created at once, call:
```
ParseSet(newValues); // Where newValues is an IEnumerable<String>
```
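
For the per-player patterns discussed in the question, one possible shape for the regex generator assumed above is a small cache that builds one compiled Regex per player name and reuses it. This is a sketch under assumptions, not the asker's actual code: `PlayerRegexCache` and `For` are hypothetical names, the pattern is adapted from the one posted in the comments with the name wrapped in a capture group so that group 1 is the player and group 2 the amount, and the dictionary is not thread-safe.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for a regex generator: one compiled Regex per player name.
public static class PlayerRegexCache
{
    private static readonly Dictionary<string, Regex> Cache =
        new Dictionary<string, Regex>();

    public static Regex For(string playerName)
    {
        Regex regex;
        if (!Cache.TryGetValue(playerName, out regex))
        {
            // Group 1 is the (escaped) player name, group 2 the amount.
            string pattern = "(" + Regex.Escape(playerName) + ")"
                + @".*(?:and spent|and paid).*[\$£](\d+[\.,]?\d{0,2})";
            regex = new Regex(pattern, RegexOptions.Compiled);
            Cache[playerName] = regex; // built once, reused on later calls
        }
        return regex;
    }

    public static void Main()
    {
        Match match = For("Mike").Match("Mike went into the shop and spent $1.57");
        if (match.Success)
            Console.WriteLine(match.Groups[1].Value + " | " + match.Groups[2].Value);
    }
}
```

With only 2-3 known players per game, the cost of compiling each pattern is paid once per name rather than once per string.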

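Now you can do whatever calculation, analysis, or data mining you like on the data, without worrying about the performance of processing 25 million rows on the fly.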