Hi all,
I want to parse a large log file with Clojure.
Each record line has the structure "UserID,Latitude,Longitude,Timestamp" (a sample line is split below).
My implementation steps are:
----> Read the log file and build the list of the top-n users (by record count).
----> Find every record belonging to each of those top-n users and store them in a separate log file (UserID.log).
Implementation source code:
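For illustration, here is how one line (the values are made up) splits into its four fields; all of that user's lines would then end up in user-23.log:

(require '[clojure.string :as string])
(string/split "user-23,39.9042,116.4074,1432412340" #",")
;=> ["user-23" "39.9042" "116.4074" "1432412340"]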
;======================================================
;; ns form simplified for this post; MAIN-PATH is the output directory
;; where the per-user log files are written.
(ns logparse.core
  (:require [clojure.java.io :as io]
            [clojure.string :as string]))

(def MAIN-PATH "./DATA/")

(declare parse-recur find-write-recur find-recur
         update-res update-vec create-write-file)

(defn parse-file
  "Count the records per user, sort the users by record count, and write the
  records of the top n users to separate files."
  [file n]
  (with-open [rdr (io/reader file)]
    (println "001 begin with open ")
    (let [lines  (line-seq rdr)
          ;; user id -> record count
          res    (parse-recur lines)
          ;; users sorted by record count, highest first
          sorted (into (sorted-map-by (fn [key1 key2]
                                        (compare [(get res key2) key2]
                                                 [(get res key1) key1])))
                       res)]
      (println "Statistic result : " res)
      (println "Top-N User List : " sorted)
      (find-write-recur lines sorted n))))
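;; For illustration (made-up counts): with res = {"user-1" 5, "user-2" 9},
;; sorted comes out as {"user-2" 9, "user-1" 5}, i.e. descending by record count.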
(defn parse-recur
  "Walk all lines once and count the number of records per user id,
  returning a map of id -> count."
  [lines]
  (loop [ls  lines
         res {}]
    (if ls
      (recur (next ls)
             (update-res res (first ls)))
      res)))
(defn update-res
  "Increment the record count for the user id found in one log line."
  [res line]
  (let [params (string/split line #",")
        id     (if (> (count params) 1) (params 0) "0")]
    (if (res id)
      (update-in res [id] inc)
      (assoc res id 1))))
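;; For illustration (made-up values), counting two lines for the same user:
;;   (update-res {} "user-23,39.90,116.40,1432412340")            ;=> {"user-23" 1}
;;   (update-res {"user-23" 1} "user-23,39.91,116.41,1432412999") ;=> {"user-23" 2}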
(defn find-write-recur
  "Get each user's records and store them in a separate log file."
  [lines sorted n]
  (loop [x  n
         sd sorted
         id (first (keys sd))]
    (when (and (pos? x) (seq sd))
      (create-write-file id
                         (find-recur lines id))
      (recur (dec x)
             (rest sd)
             (ffirst (rest sd))))))   ; key of the next entry, nil when none left
(defn find-recur
  "Collect all log lines that belong to the given user id."
  [lines id]
  (loop [ls  lines
         res []]
    (if ls
      (recur (next ls)
             (update-vec res id (first ls)))
      res)))
(defn update-vec
  "Add the line to the result vector when it belongs to the given user id."
  [res id line]
  (let [params (string/split line #",")
        id_    (if (> (count params) 1) (params 0) "0")]
    (if (= id id_)
      (conj res line)
      res)))
(defn create-write-file
  "Create a new file and write the given lines into it."
  ([file info-lines]
   (with-open [wr (io/writer (str MAIN-PATH file))]
     (doseq [line info-lines] (.write wr (str line "\n")))))
  ([file info-lines append?]
   (with-open [wr (io/writer (str MAIN-PATH file) :append append?)]
     (doseq [line info-lines] (.write wr (str line "\n"))))))
;======================================================
I tested this clj file in the REPL with the command (parse-file "./DATA/log.log" 3) and got the following results:
Records     Size     Time      Result
1,000       42KB     <1 s      OK
10,000      420KB    <1 s      OK
100,000     4.3MB    3 s       OK
1,000,000   43MB     15 s      OK
6,000,000   258MB    >20 min   "OutOfMemoryError Java heap space java.lang.String.substring (String.java:1913)"
======================================================
My questions are:
1. How can I fix the error when I try to parse a log file larger than 200MB?
2. How can I optimize these functions to run faster?
3. How should the functions handle a log file larger than 1GB?
I'm still new to Clojure, so any suggestions or solutions would be much appreciated~
Thanks