Delphi 2010中是否有Boyer-Moore字符串搜索、快速搜索和替换函数以及快速字符串计数的功能?

19
我需要三个快速的针对大字符串的函数:快速查找、快速查找和替换,以及在字符串中快速计算子字符串数量。
我已经在C++和Python中使用了Boyer-Moore字符串搜索,但我发现唯一一个用于实现快速搜索和替换的Delphi Boyer-Moore算法是Peter Morris的FastStrings,他曾经在DroopyEyes软件公司工作,但他的网站和电子邮件都已失效。
我已经将FastStrings移植到Delphi 2009/2010中,使其适用于AnsiStrings,其中一个字节等于一个AnsiChar,但是要使它们也适用于Delphi 2010中的String(UnicodeString)似乎不容易。
使用这个Boyer-Moore算法,应该可以轻松地进行大小写不敏感的搜索,以及进行大小写不敏感的搜索和替换,而不需要任何临时字符串(使用StrUpper等),也不需要调用Pos(),当需要重复在相同文本上进行搜索时,Boyer-Moore搜索比它更快。
2个回答

12
这个答案已经完整并且适用于区分大小写模式,但不适用于不区分大小写模式,而且可能存在其他错误,因为它没有进行很好的单元测试,并且可能可以进一步优化,例如我重复了本地函数__SameChar而没有使用比较函数回调,这样会更快,实际上,允许用户为所有这些传递比较函数对于想要提供一些额外逻辑(某些语言的等价Unicode字形集)的Unicode用户来说是很好的。基于Dorin Dominica的代码,我构建了以下内容。
{ _FindStringBoyer:
  Boyer-Moore search algorith using regular String instead of AnsiSTring, and no ASM.
  Credited to Dorin Duminica.
}
function _FindStringBoyer(const sString, sPattern: string;
  const bCaseSensitive: Boolean = True; const fromPos: Integer = 1): Integer;

    function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    begin
      if bCaseSensitive then
        Result := (sString[StringIndex] = sPattern[PatternIndex])
      else
        Result := (CompareText(sString[StringIndex], sPattern[PatternIndex]) = 0);
    end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;

var
  SkipTable: array [Char] of Integer;
  LengthPattern: Integer;
  LengthString: Integer;
  Index: Integer;
  kIndex: Integer;
  LastMarker: Integer;
  Large: Integer;
  chPattern: Char;
begin
  if fromPos < 1 then
    raise Exception.CreateFmt('Invalid search start position: %d.', [fromPos]);
  LengthPattern := Length(sPattern);
  LengthString := Length(sString);
  for chPattern := Low(Char) to High(Char) do
    SkipTable[chPattern] := LengthPattern;
  for Index := 1 to LengthPattern -1 do
    SkipTable[sPattern[Index]] := LengthPattern - Index;
  Large := LengthPattern + LengthString + 1;
  LastMarker := SkipTable[sPattern[LengthPattern]];
  SkipTable[sPattern[LengthPattern]] := Large;
  Index := fromPos + LengthPattern -1;
  Result := 0;
  while Index <= LengthString do begin
    repeat
      Index := Index + SkipTable[sString[Index]];
    until Index > LengthString;
    if Index <= Large then
      Break
    else
      Index := Index - Large;
    kIndex := 1;
    while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
      Inc(kIndex);
    if kIndex = LengthPattern then begin
      // Found, return.
      Result := Index - kIndex + 1;
      Index := Index + LengthPattern;
      exit;
    end else begin
      if __SameChar(Index, LengthPattern) then
        Index := Index + LastMarker
      else
        Index := Index + SkipTable[sString[Index]];
    end; // if kIndex = LengthPattern then begin
  end; // while Index <= LengthString do begin
end;

{ Written by Warren, using the above code as a starter, we calculate the SkipTable once, and then count the number of instances of
  a substring inside the main string, at a much faster rate than we
  could have done otherwise.  Another thing that would be great is
  to have a function that returns an array of find-locations,
  which would be way faster to do than repeatedly calling Pos.
}
function _StringCountBoyer(const aSourceString, aFindString : String; Const CaseSensitive : Boolean = TRUE) : Integer;
var
  foundPos:Integer;
  fromPos:Integer;
  Limit:Integer;
  guard:Integer;
  SkipTable: array [Char] of Integer;
  LengthPattern: Integer;
  LengthString: Integer;
  Index: Integer;
  kIndex: Integer;
  LastMarker: Integer;
  Large: Integer;
  chPattern: Char;
    function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    begin
      if CaseSensitive then
        Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
      else
        Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
    end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;

begin
  result := 0;
  foundPos := 1;
  fromPos := 1;
  Limit := Length(aSourceString);
  guard := Length(aFindString);
  Index := 0;
  LengthPattern := Length(aFindString);
  LengthString := Length(aSourceString);
  for chPattern := Low(Char) to High(Char) do
    SkipTable[chPattern] := LengthPattern;
  for Index := 1 to LengthPattern -1 do
    SkipTable[aFindString[Index]] := LengthPattern - Index;
  Large := LengthPattern + LengthString + 1;
  LastMarker := SkipTable[aFindString[LengthPattern]];
  SkipTable[aFindString[LengthPattern]] := Large;
  while (foundPos>=1) and (fromPos < Limit) and (Index<Limit) do begin

    Index := fromPos + LengthPattern -1;
    if Index>Limit then
        break;
    kIndex := 0;
    while Index <= LengthString do begin
      repeat
        Index := Index + SkipTable[aSourceString[Index]];
      until Index > LengthString;
      if Index <= Large then
        Break
      else
        Index := Index - Large;
      kIndex := 1;
      while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
        Inc(kIndex);
      if kIndex = LengthPattern then begin
        // Found, return.
        //Result := Index - kIndex + 1;
        Index := Index + LengthPattern;
        fromPos := Index;
        Inc(Result);
        break;
      end else begin
        if __SameChar(Index, LengthPattern) then
          Index := Index + LastMarker
        else
          Index := Index + SkipTable[aSourceString[Index]];
      end; // if kIndex = LengthPattern then begin
    end; // while Index <= LengthString do begin

  end;
end; 

这是一个非常好的算法,因为:

  • 使用这种方法计算字符串Y中子串X的实例要快得多,效果惊人。
  • 仅替换Pos(),_FindStringBoyer()比由FastCode项目人员贡献给Delphi的纯汇编版本的Pos()更快,而如果你需要不区分大小写,则可以想象当我们不必在100兆字节的字符串上调用UpperCase时,性能提升会有多大。(好吧,你的字符串不会那么大。但是,高效算法是一件美妙的事情。)

好的,我用Boyer-Moore风格编写了一个字符串替换函数:

function _StringReplaceBoyer(const aSourceString, aFindString,aReplaceString : String; Flags: TReplaceFlags) : String;
var
  errors:Integer;
  fromPos:Integer;
  Limit:Integer;
  guard:Integer;
  SkipTable: array [Char] of Integer;
  LengthPattern: Integer;
  LengthString: Integer;
  Index: Integer;
  kIndex: Integer;
  LastMarker: Integer;
  Large: Integer;
  chPattern: Char;
  CaseSensitive:Boolean;
  foundAt:Integer;
  lastFoundAt:Integer;
  copyStartsAt:Integer;
  copyLen:Integer;
    function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    begin
      if CaseSensitive then
        Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
      else
        Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
    end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;

begin
  result := '';
  lastFoundAt := 0;
  fromPos := 1;
  errors := 0;
  CaseSensitive := rfIgnoreCase in Flags;
  Limit := Length(aSourceString);
  guard := Length(aFindString);
  Index := 0;
  LengthPattern := Length(aFindString);
  LengthString := Length(aSourceString);
  for chPattern := Low(Char) to High(Char) do
    SkipTable[chPattern] := LengthPattern;
  for Index := 1 to LengthPattern -1 do
    SkipTable[aFindString[Index]] := LengthPattern - Index;
  Large := LengthPattern + LengthString + 1;
  LastMarker := SkipTable[aFindString[LengthPattern]];
  SkipTable[aFindString[LengthPattern]] := Large;
  while (fromPos>=1) and (fromPos <= Limit) and (Index<=Limit) do begin

    Index := fromPos + LengthPattern -1;
    if Index>Limit then
        break;
    kIndex := 0;
    foundAt := 0;
    while Index <= LengthString do begin
      repeat
        Index := Index + SkipTable[aSourceString[Index]];
      until Index > LengthString;
      if Index <= Large then
        Break
      else
        Index := Index - Large;
      kIndex := 1;
      while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
        Inc(kIndex);
      if kIndex = LengthPattern then begin


        foundAt := Index - kIndex + 1;
        Index := Index + LengthPattern;
        //fromPos := Index;
        fromPos := (foundAt+LengthPattern);
        if lastFoundAt=0 then begin
                copyStartsAt := 1;
                copyLen := foundAt-copyStartsAt;
        end else begin
                copyStartsAt := lastFoundAt+LengthPattern;
                copyLen := foundAt-copyStartsAt;
        end;

        if (copyLen<=0)or(copyStartsAt<=0) then begin
                Inc(errors);
        end;

        Result := Result + Copy(aSourceString, copyStartsAt, copyLen ) + aReplaceString;
        lastFoundAt := foundAt;
        if not (rfReplaceAll in Flags) then
                 fromPos := 0; // break out of outer while loop too!
        break;
      end else begin
        if __SameChar(Index, LengthPattern) then
          Index := Index + LastMarker
        else
          Index := Index + SkipTable[aSourceString[Index]];
      end; // if kIndex = LengthPattern then begin
    end; // while Index <= LengthString do begin
  end;
  if (lastFoundAt=0) then
  begin
     // nothing was found, just return whole original string
      Result := aSourceString;
  end
  else
  if (lastFoundAt+LengthPattern < Limit) then begin
     // the part that didn't require any replacing, because nothing more was found,
     // or rfReplaceAll flag was not specified, is copied at the
     // end as the final step.
    copyStartsAt := lastFoundAt+LengthPattern;
    copyLen := Limit; { this number can be larger than needed to be, and it is harmless }
    Result := Result + Copy(aSourceString, copyStartsAt, copyLen );
  end;

end;

好的,问题是:这个堆栈占用了多少空间:
var
  skiptable : array [Char] of Integer;  // 65536*4 bytes stack usage on Unicode delphi

再见CPU地狱,你好栈地狱。如果我使用动态数组,那么我必须在运行时调整它的大小。因此,这个东西基本上很快,因为计算机上的虚拟内存系统不会在堆栈中增加256K而感到困扰,但这并不总是最优的代码。尽管如此,我的电脑对于像这样的大堆栈内容并不会有任何反应。这不会成为Delphi标准库默认或赢得任何快速编码挑战,因为它占用了太多空间。我认为,重复搜索是应该将上述代码编写为类的情况,而跳跃表应该是该类中的数据字段。然后,您可以一次构建Boyer-Moore表,并随着时间的推移,如果字符串不变,则重复使用该对象进行快速查找。

2
我有几个基于SSE##的SIMD汇编暴力搜索版本,对于我的目的轻松击败了BM和Pos()。在新的CPU上,利用PCMPxSTRx的SSE4.2基础例程(可以处理2字节字符)是一个独特的选择。也许可以作为另一个快速代码挑战的内容? - PhiS
2
如果你开始检查 Delphi 的 StringReplace 和上面的代码,你会发现毫无疑问,Delphi 的 StringReplace 需要被替换成更加智能的东西。 - Warren P
1
我知道这已经很老了,但感谢您编写了唯一一个在Delphi中实际可用的Boyer-Moore算法实现,+1。 - Seth Carnegie
关于堆栈使用:您能否在堆上分配它,第一次使用这些方法时进行分配,并在程序退出时释放它?或者说,速度的一部分取决于它特定地位于堆栈中吗? - David
这对于堆栈来说实在太多了(默认堆栈大小为1MB,因此四分之一的大小远远超过了)。我建议添加一个参数,用户可以提供内存,如果为空,则函数执行getmem、freemem(因此不进行内存初始化)。 - mrabat
显示剩余17条评论

2

因为我也在寻找同样的内容: Jedi JCL使用Boyer-Moore算法在jclUnicode.pas中实现了一个支持Unicode的搜索引擎。 我还不知道它的好坏和速度。


我很好奇它是否使用基于堆栈的大表格,还是在运行中主要使用堆内存。 - Warren P
我还没有进行详细分析,但是通过初步的观察,我会说它使用了堆。 - dummzeuch

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接