SQL: selecting the most similar product

Question

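OK, I have a relation that stores two keys, a product ID and an attribute ID. I want to find out which product is most similar to a given product. (The attributes are actually numbers, but they have been changed to letters to simplify the visual representation.)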

Prod_att

Product | Attributes  
   1   |    A     
   1   |    B  
   1   |    C  
   2   |    A  
   2   |    B  
   2   |    D  
   3   |    A  
   3   |    E  
   4   |    A

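At first this looks simple: select the attributes the given product has, then count how many of those attributes each other product shares. Comparing that count with the number of attributes the product has shows how similar two products are. This works for products with relatively many attributes, but it breaks down when a product has very few. For example, Product 3 will tie with almost every other product (because A is very common).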

SELECT Product, count(Attributes)
FROM Prod_att
WHERE Attributes IN
    (SELECT Attributes
     FROM Prod_att
     WHERE Product = 1)
GROUP BY Product;

Any suggestions on how to fix this or improve the current query? Thanks!

*Edit: for Product 4, every other product returns count() = 1. I would like Product 3 to be shown as more similar, since it has fewer differing attributes.
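One way to express "fewer differing attributes" is the Jaccard index: shared attributes divided by the size of the union of both attribute sets. This is not something proposed in the thread; it is a minimal sketch that assumes each (Product, Attributes) pair appears only once, and it runs the query through Python's built-in sqlite3 module so it can be tried without a database server. The names `rank_similar` and `JACCARD_SQL` are made up for illustration.

```python
import sqlite3

# Sample rows copied from the Prod_att table in the question.
ROWS = [(1, "A"), (1, "B"), (1, "C"),
        (2, "A"), (2, "B"), (2, "D"),
        (3, "A"), (3, "E"),
        (4, "A")]

# Jaccard = shared / (size of target + size of other - shared).
# Assumes (Product, Attributes) pairs are unique.
JACCARD_SQL = """
WITH sizes AS (
    SELECT Product, COUNT(*) AS n FROM Prod_att GROUP BY Product
),
shared AS (
    SELECT o.Product AS other_product, COUNT(*) AS n_shared
    FROM Prod_att AS t
    JOIN Prod_att AS o
      ON o.Attributes = t.Attributes AND o.Product <> t.Product
    WHERE t.Product = :target
    GROUP BY o.Product
)
SELECT s.other_product,
       1.0 * s.n_shared / (ts.n + os.n - s.n_shared) AS jaccard
FROM shared AS s
JOIN sizes AS ts ON ts.Product = :target
JOIN sizes AS os ON os.Product = s.other_product
ORDER BY jaccard DESC, s.other_product
"""


def rank_similar(target):
    """Return (other_product, jaccard) pairs, most similar first."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Prod_att (Product INTEGER, Attributes TEXT)")
    conn.executemany("INSERT INTO Prod_att VALUES (?, ?)", ROWS)
    result = conn.execute(JACCARD_SQL, {"target": target}).fetchall()
    conn.close()
    return result
```

With the sample data, `rank_similar(4)` puts Product 3 first (1 shared attribute out of a union of 2, so 0.5), ahead of Products 1 and 2 (1 out of 3 each). Products that share nothing with the target are left out of the result entirely.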

- Crp

How about defining a minimum set of similar attributes? That can be done with a HAVING clause. - Luiggi Mendoza

http://stackoverflow.com/questions/384276/how-to-create-search-engines-like-google - Denis de Bernardy

Which database are you using? - Whit Kemmey

Do you want to return the rows with the highest number of matching attributes? - Matthew

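How should Product 3 be handled in this case? It sounds like you need some additional factor to discount the similarity of products that have fewer attributes... but it is hard to suggest anything without knowing what result you want. - Dan J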

3 Answers

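Try the "lower bound of the Wilson score confidence interval for a Bernoulli parameter". It explicitly deals with the question of statistical confidence when n is small. It looks like a lot of math, but it is actually about the minimum amount of math you need to do this kind of thing properly. And the website explains it well.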

This assumes the problem can be translated from positive/negative ratings to matching/non-matching attributes.

Here is an example for positive/negative ratings and a 95% confidence level:

SELECT widget_id, ((positive + 1.9208) / (positive + negative) - 
1.96 * SQRT((positive * negative) / (positive + negative) + 0.9604) / 
(positive + negative)) / (1 + 3.8416 / (positive + negative)) 
AS ci_lower_bound FROM widgets WHERE positive + negative > 0 
ORDER BY ci_lower_bound DESC;
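The same lower bound can be written as a small Python function, which makes it easier to see how it behaves for small n. This is only a sketch: treating shared attributes as "positive" and non-shared attributes as "negative" is an assumption about how the rating formula would be mapped onto this problem, and the function name `wilson_lower_bound` is invented for the example.

```python
import math


def wilson_lower_bound(matches, mismatches, z=1.96):
    """Lower bound of the Wilson score interval for matches / (matches + mismatches).

    z=1.96 corresponds to the 95% confidence level used in the SQL above.
    """
    n = matches + mismatches
    if n == 0:
        # No evidence at all; the SQL version filters these rows out instead.
        return 0.0
    p_hat = matches / n
    centre = p_hat + z * z / (2 * n)
    spread = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (centre - spread) / (1 + z * z / n)
```

For the data in the question, comparing Product 4 with Product 3 would give 1 match and 1 mismatch, while comparing it with Product 1 would give 1 match and 2 mismatches, so `wilson_lower_bound(1, 1)` comes out higher than `wilson_lower_bound(1, 2)`. The bound also rises as evidence accumulates at the same ratio: `wilson_lower_bound(2, 2)` is higher than `wilson_lower_bound(1, 1)`.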

- criticalfix

You could write a small view that gives the total number of attributes shared between two products.

create view vw_shared_attributes as
select a.product,
       b.product as product_match,
       count(*) as shared_attributes
from your_table a
  -- self-join: pair each row with rows of other products that have the same attribute
  inner join your_table b on b.attribute = a.attribute and b.product <> a.product
group by a.product, b.product

Then use that view to select the best match.

select product,
       (select top 1 s.product_match
        from vw_shared_attributes s
        where t.product = s.product
        order by s.shared_attributes desc) as best_match
from your_table t
group by product

See this example: http://www.sqlfiddle.com/#!6/53039/1

- Nate

- Akash · Accepted Answer

Try this:

SELECT
  a_product_id,
  COALESCE( b_product_id, 'no_matches_found' ) AS closest_product_match
FROM (
  SELECT
    *,
    -- number the rows within each a_product_id; row 1 is the best match
    @row_num := IF(@prev_value = a_product_id, @row_num + 1, 1) AS row_num,
    @prev_value := a_product_id
  FROM
    (SELECT @prev_value := 0) r
    JOIN (
        SELECT
         a.product_id AS a_product_id,
         b.product_id AS b_product_id,
         count( distinct b.Attributes ) AS shared_attributes,
         -- despite its name, this is the total number of attributes of product b
         count( distinct b2.Attributes ) AS total_products
        FROM
          products a
          LEFT JOIN products b ON ( a.Attributes = b.Attributes AND a.product_id <> b.product_id )
          LEFT JOIN products b2 ON ( b2.product_id = b.product_id )
       /*WHERE */
         /*  a.product_id = 3 */
        GROUP BY
         a.product_id,
         b.product_id
        -- per product: most shared attributes first, then fewest total attributes
        ORDER BY
          1, 3 desc, 4
  ) t
) t2
WHERE
  row_num = 1
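The query above returns the closest match for every product; you can filter on product_id in the innermost query to get the result for one specific product_id. I used LEFT JOIN so that a product is still shown even if it has no matches.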
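For engines with window functions, the same ranking can be sketched without MySQL user variables. The version below is an illustration, not the query from this answer: it uses `ROW_NUMBER()` in SQLite (3.25 or later is assumed) through Python's built-in sqlite3 module, keeps the LEFT JOIN so unmatched products still appear, and adds a hypothetical Product 5 with attribute Z to show that case. It breaks ties the same way as the query above: most shared attributes first, then the candidate with the fewest attributes in total.

```python
import sqlite3

# Rows from the question, plus a hypothetical product 5 that shares nothing.
ROWS = [(1, "A"), (1, "B"), (1, "C"),
        (2, "A"), (2, "B"), (2, "D"),
        (3, "A"), (3, "E"),
        (4, "A"),
        (5, "Z")]

CLOSEST_SQL = """
WITH sizes AS (
    SELECT product_id, COUNT(*) AS total FROM products GROUP BY product_id
),
pairs AS (
    -- LEFT JOIN keeps products whose attributes match nobody (b is NULL)
    SELECT a.product_id AS a_product_id,
           b.product_id AS b_product_id,
           COUNT(b.product_id) AS shared
    FROM products AS a
    LEFT JOIN products AS b
      ON b.Attributes = a.Attributes AND b.product_id <> a.product_id
    GROUP BY a.product_id, b.product_id
),
ranked AS (
    SELECT p.a_product_id, p.b_product_id,
           ROW_NUMBER() OVER (
               PARTITION BY p.a_product_id
               ORDER BY p.shared DESC, s.total, p.b_product_id
           ) AS row_num
    FROM pairs AS p
    LEFT JOIN sizes AS s ON s.product_id = p.b_product_id
)
SELECT a_product_id, b_product_id
FROM ranked
WHERE row_num = 1
ORDER BY a_product_id
"""


def closest_matches():
    """Return (product_id, closest_product_id or None) for every product."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (product_id INTEGER, Attributes TEXT)")
    conn.executemany("INSERT INTO products VALUES (?, ?)", ROWS)
    rows = conn.execute(CLOSEST_SQL).fetchall()
    conn.close()
    return rows
```

With this data, Product 4 is matched to Product 3 and Product 3 to Product 4, while the hypothetical Product 5 comes back with `None` as its closest match rather than disappearing from the result.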

Hope this helps. SQLFIDDLE