内存中的列表全外连接

3

我该如何编写类似SQL的全连接操作,用于以下签名的列表?

fullJoin :: [a] -> [b] -> (a -> b -> Bool) -> [(Maybe a, Maybe b)]
fullJoin xs [] _ = map (\x -> (Just x, Nothing)) xs
fullJoin [] ys _ = map (\y -> (Nothing, Just y)) ys
fullJoin xs@(xh:xt) ys@(yh:yt) on = 
  if on xh yh
    then (Just xh, Just yh) : fullJoin xt yt
    else ???

这个条件是由 a -> b -> Bool 提供的。在这个例子中,它被设置为 (\n i -> length n == i),意思是从 names 中选择与 nums 中数字相同长度的记录。
示例:
names = ["Foo","Bar", "John" , "Emily", "Connor"]
nums = [1,2,3,4,5]

fullJoin names nums (\n i -> length n == i)
  == [ (Just "Foo", Just 3)
     , (Just "Bar", Just 3)
     , (Just "John", Just 4)
     , (Just "Emily", Just 5)
     , (Just "Connor", Nothing)
     , (Nothing, Just 1)
     , (Nothing, Just 2)
     ]

为了明确该函数的SQL等效写法,以下是在PostgreSQL中编写的方法:
create table names as
select * from (values 
               ('Foo'),
               ('Bar'),
               ('John'),
               ('Emily'),
               ('Connor')
              ) 
              as z(name);

create table nums as
select * from (values 
               (1),
               (2),
               (3),
               (4),
               (5)
              ) 
              as z(num);

select * from names
full join nums
on char_length(name) = num

运行此命令将产生以下结果:

name    num
(null)  1
(null)  2
Foo     3
Bar     3
John    4
Emily   5
Connor  (null)

Fiddle链接:sqlfiddle

评论不是用于长时间讨论的;此对话已被移至聊天室 - Samuel Liew
3个回答

2
在条件rel(x, y)下,表XY之间的完全外连接可以看作是三个(不相交)部分的并集:
  • (x,y)对的集合,其中rel(x,y)
  • 对于所有的y in Ynot (rel(x0, y))(x0,null)对的集合
  • 对于所有的x in Xnot (rel(x, y0))(null,y0)对的集合
我们可以用类似的方式构建Haskell程序:
fullJoin :: [a] -> [b] -> (a -> b -> Bool) -> [(Maybe a, Maybe b)]
fullJoin xs ys rel = concat $
  [[(Just x, Just y) | y <- ys, x `rel` y] | x <- xs] -- for each x, find all the related ys
  <> [[(Just x, Nothing)] | x <- xs, all (\y -> not (x `rel` y)) ys] -- find all xs that have no related ys
  <> [[(Nothing, Just y)] | y <- ys, all (\x -> not (x `rel` y)) xs] -- find all ys that have no related xs

问题的提出方式,你不能比 O(n^2) 更快地解决它。尽管如此,我的解决方案虽然是 O(n^2),但不是最优的:它需要遍历两次 xs。你可以考虑一下,为了只遍历一次 xs,你需要做什么。这与找到一种跟踪已经看到哪些 xs 的方式有关。


1
在函数签名中加入一个累加器列表参数acc是否有助于性能?不断搜索增长的acc以查找已经遇到的xs会如何影响计算复杂度? - shayan
嗯,我撤回之前的说法。按照我的想法,你最终会得到 O(i*log(i)*j),而我的原始建议是 O(i*j)(其中 i, jxs, ys 的长度)。 - Fried Brice
话虽如此,我相信有一种方法可以改进我的答案,只是不是我想的那种方式。 - Fried Brice
1
我认为你的解决方案甚至比我的更加低效。 :) - Will Ness
@WillNess 但它的语义很清晰!因此,它可以用作测试更快实现正确性的参考实现 XD。 - Fried Brice
1
@FriedBrice 这个朴素的方法很简单。其余的都是在边缘优化。如果你想要更快,你必须按照我的建议去做,就像数据库内部一样提取键来匹配列表,这样你甚至不需要查看大多数可能的配对。 - btilly

1

只是简单地实现你想要它做的事情。超级低效:

fullJoin :: [a] -> [b] -> (a -> b -> Bool) 
         -> [(Maybe a, Maybe b)]
fullJoin xs ys p = concatMap (map (Just *** Just) . snd) a
                    ++ [(Just x, Nothing) | (x, []) <- a]
                    ++ [(Nothing, Just y) | (y, []) <- b]
  where
    a = [(x, [(x, y) | y <- ys, p x y]) | x <- xs]
    b = [(y, [(x, y) | x <- xs, p x y]) | y <- ys]

如果我们可以在类型ab上添加Eq约束,我们可以使用Data.List.\\来查找未匹配的元素,而不是进行第二次扫描。
测试:
> mapM_ print $ fullJoin names nums (\n i -> length n == i)
(Just "Foo",Just 3)
(Just "Bar",Just 3)
(Just "John",Just 4)
(Just "Emily",Just 5)
(Just "Connor",Nothing)
(Nothing,Just 1)
(Nothing,Just 2)

你能否解释一下你的方法比@FriedBrice的答案更高效的原因? - shayan
1
我进行了两次扫描(x<-xsy<-ys),他进行了三次。在渐进意义下,它们都是O(n^2)。 - Will Ness

1
这里是一个简单的 Python 实现。它的时间复杂度为 O(n^2)。
#! /usr/bin/env python

def full_join_cond (list1, list2, condition_fn):
    answer = []
    for x in list1:
        matches = []
        for y in list2:
            if condition_fn(x, y):
                matches.append(y)
        if len(matches):
            answer.extend([[x, y] for y in matches])
        else:
            answer.append([x, None])

    for y in list2:
        matched = False
        for x in list1:
            if condition_fn(x, y):
                matched = True
                break
        if not matched:
            answer.append([None, y])

    return answer


names = ["Foo","Bar", "John" , "Emily", "Connor"]
nums = [1,2,3,4,5]
cond = lambda x, y: len(x) == y

print(full_join_cond(names, nums, cond))

以下是更接近数据库执行此连接的实现。 它是O(size_of_output),通常为O(n)

def list_map_to_dict (l, m):
    d = {}
    for x in l:
        mx = m(x)
        if mx in d:
            d[mx].append(x)
        else:
            d[mx] = [x]
    return d

def full_join_map (list1, map1, list2, map2):
    dict1 = list_map_to_dict(list1, map1)
    dict2 = list_map_to_dict(list2, map2)

    answer = []
    for mx, xs in dict1.iteritems():
        if mx in dict2:
            for y in dict2[mx]:
                answer.extend([[x, y] for x in xs])
            dict2.pop(mx)
        else:
            answer.extend([[x, None] for x in xs])

    for my, ys in dict2.iteritems():
        answer.extend([[None, y] for y in ys])
    return answer

print(full_join_map(names, lambda x: len(x), nums, lambda y: y))

为了好玩,这两种方法可以结合在一个函数中,既通用又能快速运行。(我没有试图使函数签名合理-只是试图展示这个想法。)

def full_join (list1, map1, list2, map2, cond=None):
    if map1 is None:
        map1 = lambda x: None

    if map2 is None:
        map2 = lambda y: None

    if cond is None:
        cond = lambda x, y: True

    dict1 = list_map_to_dict(list1, map1)
    dict2 = list_map_to_dict(list2, map2)

    answer = []
    for mx, xs in dict1.iteritems():
        if mx in dict2:
            answer.extend(full_join_cond(xs, dict2[mx], cond))
            dict2.pop(mx)
        else:
            answer.extend([[x, None] for x in xs])

    for my, ys in dict2.iteritems():
        answer.extend([[None, y] for y in ys])
    return answer

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接