Excel - 删除文本中的重复值

Question

Excel - 删除文本中的重复值

3

我正在处理一些英国地址数据，这些数据在Excel单元格中被逗号分隔成不同的部分。

我找到了一些来源于网络的VBA代码，用于去除完全重复的条目，但是仍然存在大量数据具有重复的片段，其中一些是连续的，一些则不是。如下图所示，我希望将它们合并，以便于数据的管理和分析。

请问您是否有更好的方法实现这一功能？附上已经使用过的代码，供参考。

Function stringOfUniques(inputString As String, delimiter As String)
Dim xVal As Variant
Dim dict As Object
Set dict = CreateObject("Scripting.Dictionary")

For Each xVal In Split(inputString, delimiter)
dict(xVal) = xVal
Next xVal

stringOfUniques = Join(dict.Keys(), ",")
End Function

我已经成功地清除了其中一些，但是我还要处理大量的问题，所以自动化将是不可思议的。

- Mr Deans

一个带有反向引用的正则表达式是另一种可能的选择。 - brettdj

3个回答

3

您可以使用正则表达式替换：

^(\d*\s*([^,]*),.*)\2(,|$)

替换模式是：

$1$3

查看正则表达式演示。以下是模式解释：

^ - 字符串的开头（如果 .MultiLine = True 则为行的开头）
(\d*\s*([^,]*),.*) - 匹配组 1（稍后可以使用替换模式中的反向引用 $1 引用它）：
- \d* - 后跟 0+ 数字
- \s* - 0+ 空白字符
- ([^,]*) - 匹配除逗号以外的 0+ 字符的组 2（稍后我们可以使用此子模式捕获的值在模式中进行引用 \2）
- ,.* - 逗号后跟除换行符以外的 0+ 字符
\2 - 组 2 捕获的文本
(,|$) - 匹配组 3（稍后可以使用替换模式中的 $3 来恢复逗号），匹配逗号或字符串的结尾（如果 .MultiLine = True 则为行的结尾）。

注意：如果您只检查包含一个地址的单个单元格，则不需要 .MultiLine = True。

以下是演示如何在 VBA 中使用此代码的示例子程序：

Sub test()
  Dim regEx As Object
  Set regEx = CreateObject("VBScript.RegExp")
  With regEx
      .pattern = "^(\d*\s*([^,]*),.*)\2(,|$)"
      .Global = True
      .MultiLine = True ' Remove if individual addresses are matched
  End With
  s = "66 LAUSANNE ROAD,LAUSANNE ROAD,HORNSEY" & vbCrLf & _
      "9 CARNELL LANE,CARNELL LANE,FERNWOOD" & vbCrLf & _
      "35 FLAT ANDERSON HEIGHTS,1001 LONDON ROAD,FLAT ANDERSON HEIGHTS" & vbCrLf & _
      "27 RUSSELL BANK ROAD,RUSSEL BANK,SUTTON COLDFIELD"
  MsgBox regEx.Replace(s, "$1$3")
End Sub

- Wiktor Stribiżew

干得好！不过需要注意一下，在正则表达式替换后去掉重复的逗号会更好。 - brettdj

1

我从未见过正则表达式函数，这是一个很好的解决方案！ - Mr Deans

1

第一个解决方案是使用字典来获取唯一分段的列表。然后在分割分段之前简单地跳过第一个地址号即可：

Function RemoveDuplicates1(text As String) As String
  Static dict As Object
  If dict Is Nothing Then
    Set dict = CreateObject("Scripting.Dictionary")
    dict.CompareMode = 1  ' set the case sensitivity to All
  Else
    dict.RemoveAll
  End If

  ' Get the position just after the address number
  Dim c&, istart&, segment
  For istart = 1 To Len(text)
    c = Asc(Mid$(text, istart, 1))
    If (c < 48 Or c > 57) And c <> 32 Then Exit For  ' if not [0-9 ]
  Next

  ' Split the segments and add each one of them to the dictionary. No need to keep 
  ' a reference to each segment since the keys are returned by order of insertion.
  For Each segment In Split(Mid$(text, istart), ",")
    If Len(segment) Then dict(segment) = Empty
  Next

  ' Return the address number and the segments by joining the keys
  RemoveDuplicates1 = Mid$(text, 1, istart - 1) & Join(dict.keys(), ",")
End Function

第二种解决方案是提取所有片段，然后搜索每个片段是否出现在先前的位置：

Function RemoveDuplicates2(text As String) As String
  Dim c&, segments$, segment$, length&, ifirst&, istart&, iend&

  ' Get the position just after the address number
  For ifirst = 1 To Len(text)
    c = Asc(Mid$(text, ifirst, 1))
    If (c < 48 Or c > 57) And c <> 32 Then Exit For  ' if not [0-9 ]
  Next

  ' Get the segments without the address number and add a leading/trailing comma
  segments = "," & Mid$(text, ifirst) & ","
  istart = 1

  ' iterate each segment
  Do While istart < Len(segments)

    ' Get the next segment position
    iend = InStr(istart + 1, segments, ",") - 1 And &HFFFFFF
    If iend - istart Then

      ' Get the segment
      segment = Mid$(segments, istart, iend - istart + 2)

      ' Rewrite the segment if not present at a previous position
      If InStr(1, segments, segment, vbTextCompare) = istart Then
        Mid$(segments, length + 1) = segment
        length = length + Len(segment) - 1
      End If
    End If

    istart = iend + 1
  Loop

  ' Return the address number and the segments
  RemoveDuplicates2 = Mid$(text, 1, ifirst - 1) & Mid$(segments, 2, length - 1)

End Function

第三种解决方案是使用正则表达式来删除所有重复的片段：

Function RemoveDuplicates3(ByVal text As String) As String

  Static re As Object
  If re Is Nothing Then
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True
    ' Match any duplicated segment separated by a comma.
    ' The first segment is compared without the first digits.
    re.Pattern = "((^\d* *|,)([^,]+)(?=,).*),\3?(?=,|$)"
  End If

  ' Remove each matching segment
  Do While re.test(text)
    text = re.Replace(text, "$1")
  Loop

  RemoveDuplicates3 = text
End Function

这是10000次迭代的执行时间（时间越短越好）：

input text  : "123 abc,,1 abc,abc 2,ABC,abc,a,c"
output text : "123 abc,1 abc,abc 2,a,c"

RemoveDuplicates1 (dictionary)  : 718 ms
RemoveDuplicates2 (text search) : 219 ms
RemoveDuplicates3 (regex)       : 1469 ms

- Florent B.

再次感谢您提供如此多的优雅解决方案，每个都似乎能满足我的需求。非常感激！ - Mr Deans

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- MiguelH · Accepted Answer

也许不是最优雅的方法，但这样做可以解决问题。我使用 Split 命令在每个逗号处拆分字符串。此操作返回的结果为：

bat ball banana

代码：

Option Explicit
Private Sub test()
 Dim Mystring As String
 Dim StrResult As String

 Mystring = "bat,ball,bat,ball,banana"
 StrResult = shed_duplicates(Mystring)
End Sub
Private Function shed_duplicates(ByRef Mystring As String) As String
 Dim MySplitz() As String
 Dim J As Integer
 Dim K As Integer
 Dim BooMatch As Boolean
 Dim StrTemp(10) As String ' assumes no more than 10 possible splits!
 Dim StrResult As String


 MySplitz = Split(Mystring, ",")
  For J = 0 To UBound(MySplitz)
     BooMatch = False
     For K = 0 To UBound(StrTemp)
         If MySplitz(J) = StrTemp(K) Then
            BooMatch = True
            Exit For
         End If
     Next K
    If Not BooMatch Then
       StrTemp(J) = MySplitz(J)
    End If
Next
For J = 0 To UBound(StrTemp)
   If Len(StrTemp(J)) > 0 Then ' ignore blank entries
      StrResult = StrResult + StrTemp(J) + " "
   End If
Next J
Debug.Print StrResult
End Function