SQL Server中最常见的词语/字符组合

3
我正在分析一个名为'RawDataDescriptions'的表格中的数据,其中有一个名为'description'的字段供用户输入。我正在寻找一种将描述按短语或经常出现的字符串进行广泛分类的方法(包括它们出现的次数)。我没有特定的单词或短语需要查找,可以使用'like'语句,而是在字段之间寻找共性。在查找其他问题时,我找到了一个查询,我对自己的表格进行了调整以获取最常见的单词(如下所示),但是仅有一个单词提供了很少 -如果有任何-关于描述的见解。是否可能制作一个查询,提供短语而不仅仅是单个单词的计数呢?如果可能的话,它的主要组成部分会是什么?
WITH E1(N) AS 
(
    SELECT 1 
    FROM (VALUES
        (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ) t(N)
),
E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),
E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b)
SELECT
    x.Item,
    COUNT(*)
FROM RawDataDescriptions p
CROSS APPLY (
SELECT 
        ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
        Item       = LTRIM(RTRIM(SUBSTRING(p.[Description], l.N1, l.L1)))
        FROM (
            SELECT s.N1,
                L1 = ISNULL(NULLIF(CHARINDEX(' ',p.[Description],s.N1),0)-
s.N1,4000)
            FROM(
                SELECT 1 UNION ALL
                SELECT t.N+1 
                FROM(
                    SELECT TOP (ISNULL(DATALENGTH(p.[Description])/2,0))
                        ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
                    FROM E4
                ) t(N)
                WHERE SUBSTRING(p.[Description] ,t.N,1) = ' '
            ) s(N1)
        ) l(N1, L1)
) x
WHERE x.item <> ''
GROUP BY x.Item
ORDER BY COUNT(*) DESC

*编辑 - 不可行。期望的替代结果:

示例表格

Id | Description  
---+--------------------------
01 | Customer didn't like it  
02 | Person liked it  
03 | Person didn't like it  
04 | Client didn't like it  
05 | person liked it   

@Parameter = 3

期望的结果:

string           | count  
-----------------+-------
didn't like it   | 3  
Person liked it  | 2  

编辑2**:原始问题是可以解决的 - 请参考答案。


如果我用类似于OP中的查询方式,但是针对三个单词而不仅仅是一个单词,会怎样呢? - Saleem Khan
返回翻译后的文本:@int参数将是单词数量,“count>1”表示这3个单词在2个或更多记录中出现。 - Saleem Khan
什么三个单词?你有想法,但我不明白你想做什么。请提供一些细节。 - Sean Lange
所以我的列是[description](varchar)。表是RawDataDescription。[Description]充满了用户输入的文本。我想查看每行中x个单词序列的出现频率,其中x是我传递的参数(这样我可以将其设置为3、4或5个单词等)。如果它们只出现一次(其中许多将),我不需要选择它们,因此想法是使计数> 1。 - Saleem Khan
edited in the OP - Saleem Khan
显示剩余7条评论
2个回答

4

这里有一种选择。我有几个担忧,例如标点符号、控制字符,特别是在大表上的性能问题。

例子

Declare @RawDataDescriptions Table ([Id] varchar(50),[Description] varchar(50))
Insert Into @RawDataDescriptions Values 
 ('01','Customer didn''t like it')
,('02','Person liked it')
,('03','Person didn''t like it')
,('04','Client didn''t like it')
,('05','person liked it')

;with cte as (
    Select Id
          ,B.* 
      From  @RawDataDescriptions A
      Cross Apply (
                    Select RetSeq = Row_Number() over (Order By (Select null))
                          ,RetVal = LTrim(RTrim(B.i.value('(./text())[1]', 'varchar(max)')))
                    From  (Select x = Cast('<x>' + replace((Select replace(A.[Description],' ','§§Split§§') as [*] For XML Path('')),'§§Split§§','</x><x>')+'</x>' as xml).query('.')) as A 
                    Cross Apply x.nodes('x') AS B(i)
                  ) B 
)
Select Phrase
      ,Cnt  = count(*)
 From  cte A
 Cross Apply (
     Select Phrase = stuff((Select ' '+RetVal
                            From  cte 
                            Where ID = A.ID
                              and RetSeq between A.RetSeq and A.RetSeq+2
                            Order By RetSeq
                            For XML Path('')),1,1,'')

             ) B
  Where Phrase like '% % %'
  Group By Phrase
  Having count(*)>1
  Order By 2 Desc

返回值

Phrase           Cnt
didn't like it   3
Person liked it  2

更新 - TVF - 更好的性能

我决定将此转换为表值函数,并对其性能提升感到惊讶。例如,我从 FRED(美国联邦储备经济数据)得到了 13 万个描述,我能够在 9 秒内生成常见短语列表(n 个词)。

使用方法

Select Phrase = B.RetVal
      ,Cnt    = count(*)
 From YourTable A
 Cross Apply [dbo].[tvf-Str-Parse-Phrase](A.YourColumn,' ',4) B
 Group By B.RetVal
 Having count(*)>1
 Order By 2 Desc

如果您感兴趣,TVF 是什么呢?
CREATE FUNCTION [dbo].[tvf-Str-Parse-Phrase] (@String varchar(max),@Delimeter varchar(25),@WordCnt int)
Returns Table 
As
Return (  
 with cte as (
      Select RetSeq = Row_Number() over (Order By (Select null))
            ,RetVal = LTrim(RTrim(B.i.value('(./text())[1]', 'varchar(max)')))
      From  (Select x = Cast('<x>' + replace((Select replace(@String,@Delimeter,'§§Split§§') as [*] For XML Path('')),'§§Split§§','</x><x>')+'</x>' as xml).query('.')) as A 
      Cross Apply x.nodes('x') AS B(i)
)
Select RetSeq = Row_Number() over (Order By (Select Null))
      ,B.RetVal
 From  cte A
 Cross Apply (Select RetVal = stuff((Select ' '+RetVal From cte Where RetSeq between A.RetSeq and A.RetSeq+@WordCnt-1 For XML Path('')),1,1,'') ) B
 Where B.RetVal like Replicate('% ',@WordCnt-1)+'%'
);
--Select * from [dbo].[tvf-Str-Parse-Phrase]('This is some text that I want parsed',' ',4)

1
这很不错,约翰。我为此苦苦挣扎了一会儿(愚蠢的区分大小写排序)。但它运行得很好。我甚至稍微调整了一下,加入了一个长度参数,它也很好用。我只是调整了 Where 子句,使用了 replicate。 - Sean Lange
1
@SeanLange 如果你有兴趣,我把这个转换成了一个TVF。性能提升令人震惊。请看EDIT/更新。 - John Cappelletti
@SaleemKhan Happy it helped - John Cappelletti

0

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接