在Pyspark中,与scala.util.Try相当的是什么?

8

我有一个糟糕的HTTPD访问日志,只想跳过“糟糕”的行。

在Scala中,这很简单:

import scala.util.Try

val log = sc.textFile("access_log")

log.map(_.split(' ')).map(a => Try(a(8))).filter(_.isSuccess).map(_.get).map(code => (code,1)).reduceByKey(_ + _).collect()

对于Python,我通过显式定义函数来得到以下解决方案,与使用“lambda”符号相反:

log = sc.textFile("access_log")

def wrapException(a):
    try:
        return a[8]
    except:
        return 'error'

log.map(lambda s : s.split(' ')).map(wrapException).filter(lambda s : s!='error').map(lambda code : (code,1)).reduceByKey(lambda acu,value : acu + value).collect()

有没有更好的方法在pyspark中执行这个操作(例如,像Scala一样)?
非常感谢!
2个回答

7
更好是一个主观的词汇,但你可以尝试一些方法。
  • The simplest thing you can do in this particular case is to avoid exceptions whatsoever. All you need is a flatMap and some slicing:

    log.flatMap(lambda s : s.split(' ')[8:9])
    

    As you can see it means no need for an exception handling or subsequent filter.

  • Previous idea can be extended with a simple wrapper

    def seq_try(f, *args, **kwargs):
        try:
            return [f(*args, **kwargs)]
        except:
            return []
    

    and example usage

    from operator import div # FYI operator provides getitem as well.
    
    rdd = sc.parallelize([1, 2, 0, 3, 0, 5, "foo"])
    
    rdd.flatMap(lambda x: seq_try(div, 1., x)).collect()
    ## [1.0, 0.5, 0.3333333333333333, 0.2]
    
  • finally more OO approach:

    import inspect as _inspect
    
    class _Try(object): pass    
    
    class Failure(_Try):
        def __init__(self, e):
            if Exception not in _inspect.getmro(e.__class__):
                msg = "Invalid type for Failure: {0}"
                raise TypeError(msg.format(e.__class__))
            self._e = e
            self.isSuccess =  False
            self.isFailure = True
    
        def get(self): raise self._e
    
        def __repr__(self):
            return "Failure({0})".format(repr(self._e))
    
    class Success(_Try):
        def __init__(self, v):
            self._v = v
            self.isSuccess = True
            self.isFailure = False
    
        def get(self): return self._v
    
        def __repr__(self):
            return "Success({0})".format(repr(self._v))
    
    def Try(f, *args, **kwargs):
        try:
            return Success(f(*args, **kwargs))
        except Exception as e:
            return Failure(e)
    

    and example usage:

    tries = rdd.map(lambda x: Try(div, 1.0, x))
    tries.collect()
    ## [Success(1.0),
    ##  Success(0.5),
    ##  Failure(ZeroDivisionError('float division by zero',)),
    ##  Success(0.3333333333333333),
    ##  Failure(ZeroDivisionError('float division by zero',)),
    ##  Success(0.2),
    ##  Failure(TypeError("unsupported operand type(s) for /: 'float' and 'str'",))]
    
    tries.filter(lambda x: x.isSuccess).map(lambda x: x.get()).collect()
    ## [1.0, 0.5, 0.3333333333333333, 0.2]
    

    You can even use pattern matching with multipledispatch

    from multipledispatch import dispatch
    from operator import getitem
    
    @dispatch(Success)
    def check(x): return "Another great success"
    
    @dispatch(Failure)
    def check(x): return "What a failure"
    
    a_list = [1, 2, 3]
    
    check(Try(getitem, a_list, 1))
    ## 'Another great success'
    
    check(Try(getitem, a_list, 10)) 
    ## 'What a failure'
    

    If you like this approach I've pushed a little bit more complete implementation to GitHub and pypi.


2

首先,让我生成一些随机数据以开始工作。

import random
number_of_rows = int(1e6)
line_error = "error line"
text = []
for i in range(number_of_rows):
    choice = random.choice([1,2,3,4])
    if choice == 1:
        line = line_error
    elif choice == 2:
        line = "1 2 3 4 5 6 7 8 9_1"
    elif choice == 3:
        line = "1 2 3 4 5 6 7 8 9_2"
    elif choice == 4:
        line = "1 2 3 4 5 6 7 8 9_3"
    text.append(line)

现在我有一个字符串text,它看起来像这样:
  1 2 3 4 5 6 7 8 9_2
  error line
  1 2 3 4 5 6 7 8 9_3
  1 2 3 4 5 6 7 8 9_2
  1 2 3 4 5 6 7 8 9_3
  1 2 3 4 5 6 7 8 9_1
  error line
  1 2 3 4 5 6 7 8 9_2
  ....

您的解决方案:

def wrapException(a):
    try:
        return a[8]
    except:
        return 'error'

log.map(lambda s : s.split(' ')).map(wrapException).filter(lambda s : s!='error').map(lambda code : (code,1)).reduceByKey(lambda acu,value : acu + value).collect()

#[('9_3', 250885), ('9_1', 249307), ('9_2', 249772)]

这是我的解决方案:
from operator import add
def myfunction(l):
    try:
        return (l.split(' ')[8],1)
    except: 
        return ('MYERROR', 1) 
log.map(myfunction).reduceByKey(add).collect()
#[('9_3', 250885), ('9_1', 249307), ('MYERROR', 250036), ('9_2', 249772)]

评论:

(1)我强烈建议也计算“错误”的行,因为它不会增加太多开销,并且还可以用于检查,例如,所有计数应该加起来等于日志中的总行数,如果你过滤掉这些行,你就不知道这些是真正的坏行还是你的编码逻辑出了问题。

(2)我将尝试将所有行级操作打包在一个函数中,以避免链接mapfilter函数,这样更易读。

(3)从性能角度来看,我生成了100万条记录的样本,我的代码在3秒内完成,而你的代码在2秒内完成,这不是一个公平的比较,因为数据太小,我的集群非常强大,我建议你生成一个更大的文件(1e12?)并对你的代码进行基准测试。


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接