Excel - 删除文本中的重复值

3
我正在处理一些英国地址数据,这些数据在Excel单元格中被逗号分隔成不同的部分。
我找到了一些来源于网络的VBA代码,用于去除完全重复的条目,但是仍然存在大量数据具有重复的片段,其中一些是连续的,一些则不是。如下图所示,我希望将它们合并,以便于数据的管理和分析。
请问您是否有更好的方法实现这一功能?附上已经使用过的代码,供参考。
Function stringOfUniques(inputString As String, delimiter As String)
Dim xVal As Variant
Dim dict As Object
Set dict = CreateObject("Scripting.Dictionary")

For Each xVal In Split(inputString, delimiter)
dict(xVal) = xVal
Next xVal

stringOfUniques = Join(dict.Keys(), ",")
End Function

我已经成功地清除了其中一些,但是我还要处理大量的问题,所以自动化将是不可思议的。

理想结果


一个带有反向引用的正则表达式是另一种可能的选择。 - brettdj
3个回答

4
也许不是最优雅的方法,但这样做可以解决问题。 我使用 Split 命令在每个逗号处拆分字符串。 此操作返回的结果为:
bat ball banana

代码:

Option Explicit
Private Sub test()
 Dim Mystring As String
 Dim StrResult As String

 Mystring = "bat,ball,bat,ball,banana"
 StrResult = shed_duplicates(Mystring)
End Sub
Private Function shed_duplicates(ByRef Mystring As String) As String
 Dim MySplitz() As String
 Dim J As Integer
 Dim K As Integer
 Dim BooMatch As Boolean
 Dim StrTemp(10) As String ' assumes no more than 10 possible splits!
 Dim StrResult As String


 MySplitz = Split(Mystring, ",")
  For J = 0 To UBound(MySplitz)
     BooMatch = False
     For K = 0 To UBound(StrTemp)
         If MySplitz(J) = StrTemp(K) Then
            BooMatch = True
            Exit For
         End If
     Next K
    If Not BooMatch Then
       StrTemp(J) = MySplitz(J)
    End If
Next
For J = 0 To UBound(StrTemp)
   If Len(StrTemp(J)) > 0 Then ' ignore blank entries
      StrResult = StrResult + StrTemp(J) + " "
   End If
Next J
Debug.Print StrResult
End Function

3

您可以使用正则表达式替换:

^(\d*\s*([^,]*),.*)\2(,|$)

替换模式是:
$1$3

查看正则表达式演示。以下是模式解释

  • ^ - 字符串的开头(如果 .MultiLine = True 则为行的开头)
  • (\d*\s*([^,]*),.*) - 匹配组 1(稍后可以使用替换模式中的反向引用 $1 引用它):
    • \d* - 后跟 0+ 数字
    • \s* - 0+ 空白字符
    • ([^,]*) - 匹配除逗号以外的 0+ 字符的组 2(稍后我们可以使用此子模式捕获的值在模式中进行引用 \2
    • ,.* - 逗号后跟除换行符以外的 0+ 字符
  • \2 - 组 2 捕获的文本
  • (,|$) - 匹配组 3(稍后可以使用替换模式中的 $3 来恢复逗号),匹配逗号或字符串的结尾(如果 .MultiLine = True 则为行的结尾)。

注意:如果您只检查包含一个地址的单个单元格,则不需要 .MultiLine = True

以下是演示如何在 VBA 中使用此代码的示例子程序:

Sub test()
  Dim regEx As Object
  Set regEx = CreateObject("VBScript.RegExp")
  With regEx
      .pattern = "^(\d*\s*([^,]*),.*)\2(,|$)"
      .Global = True
      .MultiLine = True ' Remove if individual addresses are matched
  End With
  s = "66 LAUSANNE ROAD,LAUSANNE ROAD,HORNSEY" & vbCrLf & _
      "9 CARNELL LANE,CARNELL LANE,FERNWOOD" & vbCrLf & _
      "35 FLAT ANDERSON HEIGHTS,1001 LONDON ROAD,FLAT ANDERSON HEIGHTS" & vbCrLf & _
      "27 RUSSELL BANK ROAD,RUSSEL BANK,SUTTON COLDFIELD"
  MsgBox regEx.Replace(s, "$1$3")
End Sub

enter image description here


干得好!不过需要注意一下,在正则表达式替换后去掉重复的逗号会更好。 - brettdj
1
我从未见过正则表达式函数,这是一个很好的解决方案! - Mr Deans

1

第一个解决方案是使用字典来获取唯一分段的列表。然后在分割分段之前简单地跳过第一个地址号即可:

Function RemoveDuplicates1(text As String) As String
  Static dict As Object
  If dict Is Nothing Then
    Set dict = CreateObject("Scripting.Dictionary")
    dict.CompareMode = 1  ' set the case sensitivity to All
  Else
    dict.RemoveAll
  End If

  ' Get the position just after the address number
  Dim c&, istart&, segment
  For istart = 1 To Len(text)
    c = Asc(Mid$(text, istart, 1))
    If (c < 48 Or c > 57) And c <> 32 Then Exit For  ' if not [0-9 ]
  Next

  ' Split the segments and add each one of them to the dictionary. No need to keep 
  ' a reference to each segment since the keys are returned by order of insertion.
  For Each segment In Split(Mid$(text, istart), ",")
    If Len(segment) Then dict(segment) = Empty
  Next

  ' Return the address number and the segments by joining the keys
  RemoveDuplicates1 = Mid$(text, 1, istart - 1) & Join(dict.keys(), ",")
End Function

第二种解决方案是提取所有片段,然后搜索每个片段是否出现在先前的位置:
Function RemoveDuplicates2(text As String) As String
  Dim c&, segments$, segment$, length&, ifirst&, istart&, iend&

  ' Get the position just after the address number
  For ifirst = 1 To Len(text)
    c = Asc(Mid$(text, ifirst, 1))
    If (c < 48 Or c > 57) And c <> 32 Then Exit For  ' if not [0-9 ]
  Next

  ' Get the segments without the address number and add a leading/trailing comma
  segments = "," & Mid$(text, ifirst) & ","
  istart = 1

  ' iterate each segment
  Do While istart < Len(segments)

    ' Get the next segment position
    iend = InStr(istart + 1, segments, ",") - 1 And &HFFFFFF
    If iend - istart Then

      ' Get the segment
      segment = Mid$(segments, istart, iend - istart + 2)

      ' Rewrite the segment if not present at a previous position
      If InStr(1, segments, segment, vbTextCompare) = istart Then
        Mid$(segments, length + 1) = segment
        length = length + Len(segment) - 1
      End If
    End If

    istart = iend + 1
  Loop

  ' Return the address number and the segments
  RemoveDuplicates2 = Mid$(text, 1, ifirst - 1) & Mid$(segments, 2, length - 1)

End Function

第三种解决方案是使用正则表达式来删除所有重复的片段:
Function RemoveDuplicates3(ByVal text As String) As String

  Static re As Object
  If re Is Nothing Then
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True
    ' Match any duplicated segment separated by a comma.
    ' The first segment is compared without the first digits.
    re.Pattern = "((^\d* *|,)([^,]+)(?=,).*),\3?(?=,|$)"
  End If

  ' Remove each matching segment
  Do While re.test(text)
    text = re.Replace(text, "$1")
  Loop

  RemoveDuplicates3 = text
End Function

这是10000次迭代的执行时间(时间越短越好):
input text  : "123 abc,,1 abc,abc 2,ABC,abc,a,c"
output text : "123 abc,1 abc,abc 2,a,c"

RemoveDuplicates1 (dictionary)  : 718 ms
RemoveDuplicates2 (text search) : 219 ms
RemoveDuplicates3 (regex)       : 1469 ms

再次感谢您提供如此多的优雅解决方案,每个都似乎能满足我的需求。非常感激! - Mr Deans

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接