Clojure: "OutOfMemoryError Java heap space" when parsing a large log file

4

Hi all,
I want to parse a large log file with Clojure.
Each record has the structure "UserID,Latitude,Longitude,Timestamp".
My implementation steps are:
----> Read the log file and build the list of the top-n users
----> Find each top-n user's records and store them in a separate log file (UserID.log)

The implementation source:

;======================================================
;; assumes these namespaces are required:
;;   [clojure.java.io :as io]
;;   [clojure.string :as string]
;; MAIN-PATH (the output directory) is defined elsewhere
(defn parse-file
  "Count records per user, print the top-n list, and write each top-n user's records to its own file."
  [file n]
  (with-open [rdr (io/reader file)]
    (println "001 begin with open ")
    (let [lines (line-seq rdr)
          res (parse-recur lines)
          sorted
          (into (sorted-map-by (fn [key1 key2]
                                 (compare [(get res key2) key2]
                                          [(get res key1) key1])))
                res)]
      (println "Statistic result : " res)
      (println "Top-N User List : " sorted)
      (find-write-recur lines sorted n)
      )))

(defn parse-recur
  "Walk all lines and build a map of user ID -> record count."
  [lines]
  (loop [ls  lines
         res {}]
    (if ls
      (recur (next ls)
             (update-res res (first ls)))
      res)))

(defn update-res
  "Increment the record count for the user ID found in the line."
  [res line]
  (let [params (string/split line #",")
        id     (if (> (count params) 1) (params 0) "0")]
    (if (res id)
      (update-in res [id] inc)
      (assoc res id 1))))

(defn find-write-recur
  "Get each users' records and store into separate log file"
  [lines sorted n]
  (loop [x n
         sd sorted
         id (first (keys sd))]
    (if (and (> x 0) sd)
      (do (create-write-file id
                             (find-recur lines id))
          (recur (dec x)
                 (rest sd)
                 (nth (keys sd) 1))))))

(defn find-recur
  "Collect all lines that belong to the given user ID."
  [lines id]
  (loop [ls  lines
         res []]
    (if ls
      (recur (next ls)
             (update-vec res id (first ls)))
      res)))

(defn update-vec
  "Append the line to the result vector if it belongs to the given user ID."
  [res id line]
  (let [params (string/split line #",")
        id_    (if (> (count params) 1) (params 0) "0")]
    (if (= id id_)
      (conj res line)
      res)))

(defn create-write-file
  "Create a new file and write information into the file."
  ([file info-lines]
   (with-open [wr (io/writer (str MAIN-PATH file))]
     (doseq [line info-lines] (.write wr (str line "\n")))
     ))
  ([file info-lines append?]
   (with-open [wr (io/writer (str MAIN-PATH file) :append append?)]
     (doseq [line info-lines] (.write wr (str line "\n"))))
   ))
;======================================================

I tested this clj in the REPL with the command (parse-file "./DATA/log.log" 3), and got the following results:

Records --------- Size ------ Time ------- Result
1,000 ----------- 42KB ------ <1 s ------- OK
10,000 ---------- 420KB ----- <1 s ------- OK
100,000 --------- 4.3MB ----- 3 s -------- OK
1,000,000 ------- 43MB ------ 15 s ------- OK
6,000,000 ------- 258MB ----- >20 min ---- "OutOfMemoryError Java heap space  java.lang.String.substring (String.java:1913)"

======================================================
My questions are:
1. How can I fix the error when parsing log files larger than 200MB?
2. How can I optimize these functions to run faster?
3. How could the functions handle a log file larger than 1GB?

I am still new to Clojure; any suggestion or solution would be much appreciated.
Thanks

3 Answers

9
As a direct answer to your questions, from my little bit of Clojure experience:
  1. The quick and dirty fix for running out of memory boils down to giving the JVM more memory. You can try adding this to your project.clj (a fuller sketch of where it goes follows this list):

    :jvm-opts ["-Xmx1G"] ;; or more
    

    That will make Leiningen launch the JVM with a higher memory cap.

  2. This kind of work is going to use a lot of memory no matter how you work it. @Vidya's suggestion to use a library is definitely worth considering. However, there's one optimization that you can make that should help a little.

    Whenever you're dealing with your (line-seq ...) object (a lazy sequence), make sure you keep it lazy. Calling next on it realizes elements eagerly; use rest instead. Take a look at the clojure site, especially the section on laziness:

    (rest aseq) - returns a possibly empty seq, never nil

    [snip]

    a (possibly) delayed path to the remaining items, if any

    You may even want to traverse the log twice--once to pull just the user ID from each line as a lazy seq, and again to pull out those users' records (see the sketch after this list). This will minimize the amount of the file you're holding onto at any one time.

  3. Making sure your function is lazy should reduce the sheer overhead that having the file as a sequence in memory creates. Whether that's enough to parse a 1G file, I don't think I can say.
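
For point 1, here is a fuller picture of where that key lives in project.clj; the project name, version, and Clojure version below are placeholders, not taken from the question:

(defproject log-parser "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.8.0"]]
  ;; raise the heap cap for the JVMs Leiningen launches (repl, run, test)
  :jvm-opts ["-Xmx1G"])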
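
To make point 2 concrete, here is a rough two-pass sketch -- not the original poster's code; the comma-separated record layout, the function names, and the output paths are assumptions based on the question. The first pass only keeps a small count map in memory, and the second pass streams the file again, writing matching lines out as it goes:

(require '[clojure.java.io :as io]
         '[clojure.string :as string])

(defn top-n-users
  "First pass: stream the lines, count records per user ID, keep the n biggest."
  [file n]
  (with-open [rdr (io/reader file)]
    (->> (line-seq rdr)
         (reduce (fn [counts line]
                   (let [id (first (string/split line #","))]
                     (update counts id (fnil inc 0))))
                 {})
         (sort-by val >)
         (take n)
         (mapv key))))   ; mapv realizes the result before the reader closes

(defn write-user-logs
  "Second pass: stream the file again and append each matching record to <out-dir><UserID>.log."
  [file out-dir user-ids]
  (let [users (set user-ids)]
    (with-open [rdr (io/reader file)]
      (doseq [line (line-seq rdr)
              :let [id (first (string/split line #","))]
              :when (users id)]
        ;; spit reopens the output file per line: simple, but slow for huge outputs
        (spit (str out-dir id ".log") (str line "\n") :append true)))))

;; usage (out-dir plays the role of MAIN-PATH in the question):
;; (write-user-logs "./DATA/log.log" "./DATA/" (top-n-users "./DATA/log.log" 3))

Opening one writer per top-n user up front would avoid reopening the output file for every matching line, but spit keeps the sketch short.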


I was going to say that adding memory only treats the symptom rather than the cause, but your point about staying lazy is a good one. - Vidya

2
You definitely don't need Cascalog or Hadoop simply to parse a file which exceeds your Java heap size. This SO question provides some working examples of how to process large files lazily. The main point is that you need to keep the file open while traversing the lazy seq. Here is what I have used in similar situations:
(defn lazy-file-lines [file]
  (letfn [(helper [rdr]
                  (lazy-seq
                    (if-let [line (.readLine rdr)]
                      (cons line (helper rdr))
                      (do (.close rdr) nil))))]
         (helper (clojure.java.io/reader file))))

You can map, reduce, count, etc. over this lazy sequence:
(count (lazy-file-lines "/tmp/massive-file.txt"))
;=> <a large integer>
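
For the question's record format, a rough sketch of reducing over that seq -- the comma-separated layout and the function name are assumptions, not part of this answer -- could tally records per user without ever holding the file in memory:

(require '[clojure.string :as string])

(defn count-per-user
  "Tally records per user ID while streaming the lazy line seq."
  [file]
  (reduce (fn [counts line]
            (update counts (first (string/split line #",")) (fnil inc 0)))
          {}
          (lazy-file-lines file)))

;; (count-per-user "/tmp/massive-file.txt")
;; ;=> a map of user ID -> record count

Because reduce consumes the whole seq, lazy-file-lines closes its reader once the tally finishes.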

"Parsing is a separate, simpler problem."

0

I'm also relatively new to Clojure, so I don't see any obvious optimizations. Hopefully someone more experienced can offer some advice. But I feel this is simply a case of the data being too big for the tools at hand.

For that reason, I would suggest using Cascalog, a Clojure abstraction over Hadoop (or your local machine). I think the syntax for querying large log files would be pretty straightforward for you.

