I have written a piece of Perl code that processes a large number of CSV files and produces the output; the whole run takes 0.8326 seconds.
my $opname = $ARGV[0];
my @files = `find . -name "*${opname}*.csv" -mtime -10 -type f`;
my %hash;
foreach my $file (@files) {
    chomp $file;
    my $time = $file;
    $time =~ s/.*\~(.*?)\..*/$1/;
    open(IN, $file) or print "Can't open $file\n";
    while (<IN>) {
        my $line = $_;
        chomp $line;
        my $severity = (split(",", $line))[6];
        next if $severity =~ m/NORMAL/i;
        $hash{$time}{$severity}++;
    }
    close(IN);
}
foreach my $time (sort {$b <=> $a} keys %hash) {
    foreach my $severity ( keys %{$hash{$time}} ) {
        print $time . ',' . $severity . ',' . $hash{$time}{$severity} . "\n";
    }
}
Now I am writing the same logic in Java, but it takes 2600 ms, i.e. 2.6 seconds, to complete. My question is: why does Java take so long, and how can I reach the same speed as Perl? Note: I am already excluding JVM initialization and class-loading time.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
public class MonitoringFileReader {
    static Map<String, Map<String, Integer>> store = new TreeMap<String, Map<String, Integer>>();
    static String opname;

    public static void testRead(String filepath) throws IOException {
        File file = new File(filepath);
        FileFilter fileFilter = new FileFilter() {
            @Override
            public boolean accept(File pathname) {
                // 86400000 ms = 1 day, so this is an age in days, not hours
                int timediffindays = (int) ((System.currentTimeMillis() - pathname.lastModified()) / 86400000);
                return timediffindays < 10
                        && pathname.getName().endsWith(".csv")
                        && pathname.getName().contains(opname);
            }
        };
        File[] listoffiles = file.listFiles(fileFilter);
        long time = System.currentTimeMillis();
        for (File mf : listoffiles) {
            String timestamp = mf.getName().split("~")[5].replace(".csv", "");
            BufferedReader br = new BufferedReader(new FileReader(mf), 1024 * 500);
            String line;
            Map<String, Integer> tmp = store.containsKey(timestamp) ? store.get(timestamp) : new HashMap<String, Integer>();
            while ((line = br.readLine()) != null) {
                String severity = line.split(",")[6];
                if (!severity.equals("NORMAL")) {
                    tmp.put(severity, tmp.containsKey(severity) ? tmp.get(severity) + 1 : 1);
                }
            }
            br.close();
            store.put(timestamp, tmp);
        }
        time = System.currentTimeMillis() - time;
        System.out.println(time + "ms");
        System.out.println(store);
    }

    public static void main(String[] args) throws IOException {
        opname = args[0];
        long time = System.currentTimeMillis();
        testRead("./SMF/data/analyser/archive");
        time = System.currentTimeMillis() - time;
        System.out.println(time + "ms");
    }
}
File name format: A~B~C~D~E~20150715080000.csv. There are about 500 files, each around 1 MB, with lines like:
A,B,C,D,E,F,CRITICAL,G
A,B,C,D,E,F,NORMAL,G
A,B,C,D,E,F,INFO,G
A,B,C,D,E,F,MEDIUM,G
A,B,C,D,E,F,CRITICAL,G
Java version: 1.7
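(Since this targets Java 7, the file filtering above could also be done with NIO's `DirectoryStream` and a glob, which avoids building an anonymous `FileFilter`. This is a sketch of that idea only, not the code from the post; the `CsvFileLister` class name and `recentCsvFiles` helper are mine.)

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

public class CsvFileLister {
    static final long TEN_DAYS_MS = 10L * 24 * 60 * 60 * 1000;

    // List regular *.csv files whose name contains opname and whose
    // modification time falls within the last ten days.
    static List<Path> recentCsvFiles(Path dir, String opname) throws IOException {
        long cutoff = System.currentTimeMillis() - TEN_DAYS_MS;
        List<Path> result = new ArrayList<Path>();
        // The glob pattern plays the role of the find/FileFilter name test.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*" + opname + "*.csv")) {
            for (Path p : stream) {
                BasicFileAttributes attrs = Files.readAttributes(p, BasicFileAttributes.class);
                if (attrs.isRegularFile() && attrs.lastModifiedTime().toMillis() >= cutoff) {
                    result.add(p);
                }
            }
        }
        return result;
    }
}
```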
//////////////////// UPDATE ////////////////////
Based on the comments below, I replaced the split operation with a regex and performance improved a lot. I am now running the whole thing in a loop, and after 3-10 iterations the performance becomes quite acceptable.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class MonitoringFileReader {
    static Map<String, Map<String, Integer>> store = new HashMap<String, Map<String, Integer>>();
    static String opname = "Etis_Egypt";
    // capture the digits of the timestamp that precede the ".csv" extension
    static Pattern pattern1 = Pattern.compile("(\\d+)\\.");
    // one comma-terminated CSV field: quoted in group 1, unquoted in group 2
    static Pattern pattern2 = Pattern.compile("(?:\"([^\"]*)\"|([^,]*))(?:[,])");
    static long currentsystime = System.currentTimeMillis();

    public static void testRead(String filepath) throws IOException {
        File file = new File(filepath);
        FileFilter fileFilter = new FileFilter() {
            @Override
            public boolean accept(File pathname) {
                // 86400000 ms = 1 day, so this is an age in days
                int timediffindays = (int) ((currentsystime - pathname.lastModified()) / 86400000);
                return timediffindays < 10
                        && pathname.getName().endsWith(".csv")
                        && pathname.getName().contains(opname);
            }
        };
        File[] listoffiles = file.listFiles(fileFilter);
        long time = System.currentTimeMillis();
        for (File mf : listoffiles) {
            Matcher matcher = pattern1.matcher(mf.getName());
            matcher.find();
            //String timestamp = mf.getName().split("~")[5].replace(".csv", "");
            String timestamp = matcher.group(1);
            BufferedReader br = new BufferedReader(new FileReader(mf));
            String line;
            Map<String, Integer> tmp = store.containsKey(timestamp) ? store.get(timestamp) : new HashMap<String, Integer>();
            while ((line = br.readLine()) != null) {
                matcher = pattern2.matcher(line);
                // advance to the 7th field
                for (int i = 0; i < 7; i++) {
                    matcher.find();
                }
                //String severity = line.split(",")[6];
                // group() would include the trailing comma, so take the captured field
                String severity = matcher.group(1) != null ? matcher.group(1) : matcher.group(2);
                if (!severity.equals("NORMAL")) {
                    tmp.put(severity, tmp.containsKey(severity) ? tmp.get(severity) + 1 : 1);
                }
            }
            br.close();
            store.put(timestamp, tmp);
        }
        time = System.currentTimeMillis() - time;
        //System.out.println(time + "ms");
        //System.out.println(store);
    }

    public static void main(String[] args) throws IOException {
        //opname = args[0];
        for (int i = 0; i < 20; i++) {
            long time = System.currentTimeMillis();
            testRead("./SMF/data/analyser/archive");
            time = System.currentTimeMillis() - time;
            System.out.println("Time taken for " + i + " is " + time + "ms");
        }
    }
}
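For what it's worth, the seventh field can also be extracted without `split` or a `Matcher` at all, by scanning for commas with `indexOf`. This is only a sketch of that alternative (the `SeverityFieldExtractor` class and `extractSeventhField` helper are mine, not from the post), and it assumes fields contain no embedded commas, as in the sample data:

```java
public class SeverityFieldExtractor {
    // Return the 7th comma-separated field (index 6), or null if the line
    // has fewer than 7 fields. Avoids the object churn of String.split.
    static String extractSeventhField(String line) {
        int start = 0;
        for (int i = 0; i < 6; i++) {
            int comma = line.indexOf(',', start);
            if (comma < 0) return null;
            start = comma + 1;
        }
        int end = line.indexOf(',', start);
        return end < 0 ? line.substring(start) : line.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(extractSeventhField("A,B,C,D,E,F,CRITICAL,G")); // CRITICAL
        System.out.println(extractSeventhField("A,B,C,D,E,F,NORMAL,G"));   // NORMAL
    }
}
```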
But now I have another question. Look at the results when running on a small dataset:
**Time taken for 0 is 218ms
Time taken for 1 is 134ms
Time taken for 2 is 127ms**
Time taken for 3 is 98ms
Time taken for 4 is 90ms
Time taken for 5 is 77ms
Time taken for 6 is 71ms
Time taken for 7 is 72ms
Time taken for 8 is 62ms
Time taken for 9 is 57ms
Time taken for 10 is 53ms
Time taken for 11 is 58ms
Time taken for 12 is 59ms
Time taken for 13 is 46ms
Time taken for 14 is 44ms
Time taken for 15 is 45ms
Time taken for 16 is 53ms
Time taken for 17 is 45ms
Time taken for 18 is 61ms
Time taken for 19 is 42ms
The first few iterations take noticeably longer, and then the time gradually drops. Why?
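(The ramp-down in these numbers is what HotSpot JIT warm-up typically looks like: the first iterations run interpreted, and also pay for cold OS file caches, while later iterations run JIT-compiled code. A self-contained sketch that usually reproduces the same effect on a pure string-processing loop; the `WarmupDemo` class and its timings are mine, not from the post:)

```java
public class WarmupDemo {
    // A small string-heavy workload, loosely resembling CSV parsing,
    // that HotSpot will compile once the method becomes hot.
    static long work() {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += Integer.toString(i).length();
        }
        return sum;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            long t = System.nanoTime();
            work();
            long ms = (System.nanoTime() - t) / 1_000_000;
            // Early iterations are typically slower (interpreted),
            // later ones faster (JIT-compiled).
            System.out.println("Iteration " + i + ": " + ms + "ms");
        }
    }
}
```

Running with `-XX:+PrintCompilation` shows when the hot method actually gets compiled.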
Thanks.
Text::CSV
may be safer, but it will probably be slower than your existing implementation. - mob