列表中最常见的单词

3

我有一个Excel列表,其中的子集如下所示:

Food and Human Nutrition
Food and Human Nutrition with Placement
Food and Nutrition with Professional Experience
Food Marketing and Nutrition
Food Marketing and Nutrition with Placement
Food, Nutrition and Health

我希望能找出该列表中最常见的n个单词。我尝试使用以下公式来找到最常见的单词:

=INDEX(rng,MODE(MATCH(rng,rng,0)))

这种方法将每个单元格视为一个字符串,由于6行都不同,因此无法找到最常见的单词。我希望它能够输出“食品”、“营养”和“和”作为最常见的单词,然后是“市场营销”、“位置”、“带有”等。

如果可以的话,使用Google Sheets公式会更容易,因为它具有SPLIT()和正则表达式。否则,您可以将其复制到Word中,用"^p"(段落标记)替换非单词字符,然后将其粘贴回Excel以获取单词列。 - Slai
或者尝试在互联网上搜索类似于“在线词频计数器”的东西 http://countwordsfree.com/ - Slai
3个回答

2
这里有一个提供您所需功能的VBA宏。
  • 使用字典对象测试唯一性
  • 在字典中进行计数
  • 然后对结果进行排序
请仔细阅读代码中的注释以了解需要做出的假设和需要设置的引用。同时,注意标点符号可能会导致同一个单词被计入不同的类别。如果这可能是个问题,我们只需要以不同的方式拆分源数据,例如在拆分空格之前消除所有标点符号,或者使用正则表达式进行拆分。
'Set Reference to Microsoft Scripting Runtime

Option Explicit
Sub UniqueWordCounts()
    Dim wsSrc As Worksheet, wsRes As Worksheet
    Dim rSrc As Range, rRes As Range
    Dim vSrc As Variant, vRes As Variant
    Dim vWords As Variant
    Dim dWords As Dictionary
    Dim I As Long, J As Long
    Dim V As Variant, vKey As Variant

'Assume source data is in column 1, starting at A1
'  Could easily be anyplace
Set wsSrc = Worksheets("sheet2")
With wsSrc
    Set rSrc = .Range(.Cells(1, 1), .Cells(.Rows.Count, 1).End(xlUp))
End With

'Results to go a few columns over
Set wsRes = Worksheets("sheet2")
    Set rRes = rSrc(1, 1).Offset(0, 2)

'Read source data into vba array (for processing speed)
vSrc = rSrc

'Collect individual words and counts into dictionary
Set dWords = New Dictionary
    dWords.CompareMode = TextCompare

For I = 1 To UBound(vSrc, 1)

    'Split the sentence into individual words
    For Each vKey In Split(vSrc(I, 1))
        If Not dWords.Exists(vKey) Then
            dWords.Add Key:=vKey, Item:=1
        Else
            dWords(vKey) = dWords(vKey) + 1
        End If
    Next vKey
Next I

'Size results array
ReDim vRes(0 To dWords.Count, 1 To 2)

'Column headers
    vRes(0, 1) = "Word"
    vRes(0, 2) = "Count"

'Populate the columns
    I = 0
    For Each V In dWords.Keys
        I = I + 1
        vRes(I, 1) = V
        vRes(I, 2) = dWords(V)
    Next V

'Size results range
Set rRes = rRes.Resize(UBound(vRes, 1) + 1, UBound(vRes, 2))

'Populate, format and sort the Results range
With rRes
    .EntireColumn.Clear
    .Value = vRes
    With .Rows(1)
        .Font.Bold = True
        .HorizontalAlignment = xlCenter
    End With
    .EntireColumn.AutoFit
    .Sort key1:=.Columns(2), order1:=xlDescending, key2:=.Columns(1), order2:=xlAscending, MatchCase:=False, Header:=xlYes
End With

End Sub

enter image description here


1
如果您了解并且想要使用VBA,那么这将是一个非常有趣的任务。因此,一些自定义公式,例如=MostCommonWords(Range;Optional WordsNumber),会给您这样的结果:

enter image description here

这是公式背后的代码:

Public Function MostCommonWords(inputRange As Range, _
                        Optional NumberOfWords As Long = 1) As String

    Dim myCell      As Range
    Dim inputString As String, tempString As String, myResult As String
    Dim myArr       As Variant, myKey As Variant
    Dim cnt         As Long, topNumber As Long
    Dim myColl      As Object

    Set myColl = CreateObject("Scripting.Dictionary")        
    For Each myCell In inputRange
        tempString = LCase(Replace(myCell, ",", ""))
        inputString = inputString & " " & tempString
    Next myCell        
    myArr = Split(inputString)        
    For cnt = LBound(myArr) To UBound(myArr)
        If myColl.exists(myArr(cnt)) Then
            myColl(myArr(cnt)) = myColl(myArr(cnt)) + 1
        Else
            myColl.Add myArr(cnt), 1
        End If
    Next cnt 
    For cnt = 1 To NumberOfWords
        topNumber = 0
        myResult = vbNullString            
        For Each myKey In myColl
            If topNumber < myColl(myKey) Then
                topNumber = myColl(myKey)
                myResult = myKey
            End If
        Next myKey            
        MostCommonWords = MostCommonWords & " " & myResult
        myColl.Remove myResult
    Next cnt

End Function

它是如何工作的?

  • 将输入范围中的所有单元格添加到字符串中,命名为inputString
  • 删除逗号并将所有单词转换为小写。
  • 使用字典结构,每个单词都被添加为键,使用次数作为值。
  • 根据需要显示的单词数量,在字典上执行最大值检查,并呈现其键。
  • 一旦找到最大值,就从字典中删除该键,因此可以找到第二大的值-myColl.Remove myResult

@pnuts - OP想要在列表中找到最常见的n个单词。这个公式正好可以做到这一点。如果有两个单词出现频率相同,它只会显示其中一个而不是两个。 - Vityata

0
最简单的方法可能是使用 Concordance 程序(例如 Word),但也可以将内容转换成一个 Word 中的单列表,然后在 Excel 中透视表格。如果仅在空格上拆分表格,则 FoodFood 将显示为不同的单词。因此,建议先删除标点符号(查找/替换)。

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接