UserA-UserB-UserC-UserD-UserF
由'-'连接的用户互相认识。
我需要一个算法来完成以下两个任务:
- 计算从用户X到用户Y的路径。
- 对于用户X,计算不超过3步的所有用户。
是否有有效的解决方案?
编辑
我的目的不是证明它的正确或错误,而是在必要时实时计算结果。
此外,我认为最具表现力的方式是代码,甚至是伪代码。
再次编辑
我已决定这种工作必须在数据库内完成,因此必须使用SQL解决方案!
UserA-UserB-UserC-UserD-UserF
由'-'连接的用户互相认识。
我需要一个算法来完成以下两个任务:
是否有有效的解决方案?
编辑
我的目的不是证明它的正确或错误,而是在必要时实时计算结果。
此外,我认为最具表现力的方式是代码,甚至是伪代码。
再次编辑
我已决定这种工作必须在数据库内完成,因此必须使用SQL解决方案!
用图表表示这个用户列表
- 计算从UserX到UserY的路径
- 对于UserX,计算所有不超过3步的用户。
这些问题非常相关,实际上同样的算法可以解决这两个问题。你可以称其为"所有边权重为1的Dijkstra算法"或"广度优先搜索"。
基本上,从第一个节点开始,访问所有相关节点;然后标记它们都已经被访问过,记录到它们的最短路径(到它们的最短路径+您刚遍历的边),并对每个节点重复此操作。对于问题#1,在到达目的地后停止;对于问题#2,在最短路径> 3后停止。
六度分隔的最快O(n)算法可能是找到所有距离UserX和UserY 1步的用户集,并找到这两个集合的交集。如果没有,则添加距离UserX 2步的用户并相交;然后添加距离UserY 2步的用户并相交;等到3步为止。
如果每个人平均有100个朋友,则这可能需要查看多达2,020,200个用户,而Dijkstra算法需要查看1,010亿个用户。在实践中,这些数字会小得多,因为通常你的两个朋友也互相认识。
这是解决六度分隔的唯一方法(到目前为止提到的)(已经提到的)可以在实践中使用。
图形算法可以帮助你,学习它们也很有趣!
如果要查找两个用户之间的最短路径,则应使用Dijkstra或类似方法。它们之间可能有多条路径,Dijkstra会在找到比以前找到的更短的路径时进行记录。
要查找所有用户之间的最短路径,您必须使用类似Floyd-Warshall的东西。这是动态编程的一个很好的例子,并且非常容易实现。维基百科上的伪代码如下:
1 /* Assume a function edgeCost(i,j) which returns the cost of the edge from i to j
2 (infinity if there is none).
3 Also assume that n is the number of vertices and edgeCost(i,i) = 0
4 */
5
6 int path[][];
7 /* A 2-dimensional matrix. At each step in the algorithm, path[i][j] is the shortest path
8 from i to j using intermediate vertices (1..k−1). Each path[i][j] is initialized to
9 edgeCost(i,j) or infinity if there is no edge between i and j.
10 */
11
12 procedure FloydWarshall ()
13 for k := 1 to n
14 for i := 1 to n
15 for j := 1 to n
16 path[i][j] = min ( path[i][j], path[i][k]+path[k][j] );
CREATE PROCEDURE FindPath (@UserX int, @UserY int)
CREATE TABLE #SixDegrees(
ConnectedUser int,
Depth int,
Path varchar(100),
CONSTRAINT PK_SixDegrees PRIMARY KEY CLUSTERED (
ConnectedUser
)
)
DECLARE @Depth int,
@PathFound varchar(100)
SET @Depth = 0
INSERT INTO #SixDegrees (@UserX, 0, CAST(@UserX as varchar))
/*Next line just in case X & Y are the same*/
SET @PathFound = (SELECT Path
FROM #SixDegrees
WHERE ConnectedUser = @UserY)
WHILE @Depth < 6 AND @PathFound IS NULL
BEGIN
SET @Depth = @Depth + 1
INSERT INTO #SixDegrees
SELECT k.KnowsPersonID, @Depth, (SELECT Path
FROM #SixDegrees
WHERE ConnectedUser = k.Link) + ',' + CAST(k.KnowsPersonID AS varchar)
FROM (
SELECT MIN(ConnectedUser) Link, KnowsPersonID
FROM #SixDegrees
JOIN Connections ON
PersonID = ConnectedUser
WHERE Depth = @Depth
/*EDIT: Added the following*/
AND KnowsPersonID NOT IN (
SELECT ConnectedUser
FROM #SixDegrees
)
GROUP BY KnowsPersonID
) k
SET @PathFound = (SELECT Path
FROM #SixDegrees
WHERE ConnectedUser = @UserY)
END
IF @Path IS NULL
PRINT 'No path found'
ELSE
PRINT @Path
GO
编辑:在上面的解决方案中,我最初忘记排除已经在#SixDegrees临时表中的用户。
2) 在上述解决方案的基础上进行一些微调,始终循环到深度为3,将使您拥有包含所有感兴趣的用户的#SixDegrees。
但是,以下纯集合解决方案应该更有效:
SELECT DISTINCT KnowsPersonID
FROM Connections
WHERE PersonID IN (
SELECT DISTINCT KnowsPersonID
FROM Connections
WHERE PersonID IN (
SELECT KnowsPersonID
FROM Connections
WHERE PersonID = @UserX
) l1
) l2
create table #connections (
my_user varchar(10) not null ,
knows varchar(10) not null ,
CONSTRAINT connection_pk PRIMARY KEY CLUSTERED ( my_user, knows)
)
create table #traversed (id varchar(10) primary key)
insert into #connections VALUES ('UserA','UserB')
insert into #connections VALUES ('UserB','UserA')
insert into #connections VALUES ('UserB','UserC')
insert into #connections VALUES ('UserC','UserB')
insert into #connections VALUES ('UserC','UserD')
insert into #connections VALUES ('UserD','UserC')
insert into #connections VALUES ('UserD','UserF')
insert into #connections VALUES ('UserF','UserD')
DECLARE @str_sql varchar(200)
DECLARE @str_order varchar(60)
declare @start varchar(10)
set @start = ('UserD')
declare @end varchar(10)
set @end = ('UserA')
if (@start >= @end)
set @str_order = " order by id desc"
else
set @str_order = " order by id asc"
INSERT INTO #traversed VALUES (@start)
WHILE (select count(*) from #traversed where id = @end) = 0
BEGIN
INSERT INTO #traversed (id)
SELECT DISTINCT knows
FROM #connections e JOIN #traversed p ON p.id = e.my_user
WHERE e.knows NOT IN (SELECT id FROM #traversed)
AND e.knows between (select case when @start < @end then @start else @end end)
and (select case when @start < @end then @end else @start end)
END
set @str_sql = "SELECT #traversed.id FROM #traversed" + @str_order
exec (@str_sql)
问题 2.
create table #connections (
my_user varchar(10) not null ,
knows varchar(10) not null ,
CONSTRAINT connection_pk PRIMARY KEY CLUSTERED ( my_user, knows)
)
create table #traversed (id varchar(10) primary key)
insert into #connections VALUES ('UserA','UserB')
insert into #connections VALUES ('UserB','UserA')
insert into #connections VALUES ('UserB','UserC')
insert into #connections VALUES ('UserC','UserB')
insert into #connections VALUES ('UserC','UserD')
insert into #connections VALUES ('UserD','UserC')
insert into #connections VALUES ('UserD','UserF')
insert into #connections VALUES ('UserF','UserD')
declare @start varchar(10)
set @start = ('UserB')
declare @higher_counter int
declare @lower_counter int
set @higher_counter = 0
set @lower_counter = 0
INSERT INTO #traversed VALUES (@start)
WHILE (@higher_counter < 3)
BEGIN
INSERT INTO #traversed (id)
SELECT DISTINCT knows
FROM #connections e JOIN #traversed p ON p.id = e.my_user
WHERE e.knows NOT IN (SELECT id FROM #traversed)
AND e.knows > @start
set @higher_counter = @higher_counter +1
END
WHILE (@lower_counter < 3)
BEGIN
INSERT INTO #traversed (id)
SELECT DISTINCT knows
FROM #connections e JOIN #traversed p ON p.id = e.my_user
WHERE e.knows NOT IN (SELECT id FROM #traversed)
AND e.knows < @start
set @lower_counter = @lower_counter +1
END
SELECT #traversed.id FROM #traversed
P
的连通性。M
,其中:M(A, B)= A is directly connected to B
那么
(M(A, B))^P= A is connected to B within P links.
AND
进行乘法和OR
进行加法:使用MariaDB和OQGraph,实际上有一种相当高效的方法来完成此操作。
假设数据包含在两个表中:
CREATE TABLE `entity` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`type` enum('ACTOR','MOVIE','TV MOVIE','TV MINI','TV SERIES','VIDEO MOVIE','VIDEO GAME','VOICE','ARCHIVE') NOT NULL,
`name` varchar(128) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `type` (`type`,`name`) USING BTREE
) ENGINE=InnoDB;
CREATE TABLE `link` (
`rel_id` int(11) NOT NULL AUTO_INCREMENT,
`link_from` int(11) NOT NULL,
`link_to` int(11) NOT NULL,
PRIMARY KEY (`rel_id`),
KEY `link_from` (`link_from`,`link_to`),
KEY `link_to` (`link_to`)
) ENGINE=InnoDB;
OQGraph 虚拟表的声明如下:
CREATE TABLE movie_graph (
latch SMALLINT UNSIGNED NULL,
origid BIGINT UNSIGNED NULL,
destid BIGINT UNSIGNED NULL,
weight DOUBLE NULL,
seq BIGINT UNSIGNED NULL,
linkid BIGINT UNSIGNED NULL,
KEY (latch, origid, destid) USING HASH,
KEY (latch, destid, origid) USING HASH
) ENGINE=OQGRAPH
data_table='link' origid='link_from' destid='link_to';
然后可以查询数据:
MariaDB [imdb]> SELECT
-> GROUP_CONCAT(name ORDER BY seq SEPARATOR ' -> ') AS path
-> FROM imdb_graph JOIN entity ON (id=linkid)
-> WHERE latch=1
-> AND origid=(SELECT a.id FROM entity a
-> WHERE name='Kevin Bacon')
-> AND destid=(SELECT b.id FROM entity b
WHERE name='N!xau')\G
*************************** 1. row ***************************
path: Kevin Bacon -> The 45th Annual Golden Globe Awards (1988) -> Richard Attenborough -> In Darkest Hollywood: Cinema and Apartheid (1993) -> N!xau
1 row in set (10 min 6.55 sec)
这是一张拥有大约370万个节点和3000万条边的图表。表格大约占用3.5GB,InnoDB在笔记本硬盘上配置了512MB的缓冲池。大约有1600万个次要关键字读取。没有数据预加载到缓冲池中,因此速度会很慢。这个例子来自于:https://www.slideshare.net/AntonyTCurtis/oqgraph-scale2013-17332168/21
当然,如果表格可以保留在缓冲池中,速度会更快。