Running a Hadoop job remotely

I am trying to run a MapReduce job from outside the cluster. The Hadoop cluster runs on Linux machines, and we have a web application running on a Windows machine. We want to submit Hadoop jobs from this remote web application, then retrieve the Hadoop output directory and present it as a graph. We wrote the following code:
Configuration conf = new Configuration();
Job job = new Job(conf);

conf.set("mapred.job.tracker", "192.168.56.101:54311");
conf.set("fs.default.name", "hdfs://192.168.56.101:54310");

job.setJarByClass(Analysis.class);
//job.setOutputKeyClass(Text.class);
//job.setOutputValueClass(IntWritable.class);

job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);

job.setInputFormatClass(CustomFileInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

job.waitForCompletion(true);

This is the error we get. The error remains the same even when we shut down the Hadoop 1.1.2 cluster.

14/03/07 00:23:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/03/07 00:23:37 ERROR security.UserGroupInformation: PriviledgedActionException as:user cause:java.io.IOException: Failed to set permissions of path: \tmp\hadoop-user\mapred\staging\user818037780\.staging to 0700
Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-user\mapred\staging\user818037780\.staging to 0700
at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:691)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:664)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:193)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:126)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:942)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:550)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:580)
at LineCounter.main(LineCounter.java:86)

Try including the cluster's mapred-site.xml and hdfs-site.xml on the application's classpath, and add those files to your Configuration object by calling the addResource(Path) method. - Venkat
Take a look at this link: https://stackoverflow.com/questions/17444720/running-hadoop-in-windows-7-setting-via-cygwin-priviledgedactionexception-asp?rq=1 - Daniel S.
The bottom line is that some code inside Hadoop does not expect to run in a Windows environment. You could write a patch to fix it, but if that is too much trouble, the best approach is probably to ship the job to a Linux machine and run it from there. - Daniel S.
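Venkat's suggestion above could be sketched roughly as follows. This is an untested sketch, not verified against a cluster; the file paths and class name are placeholders for wherever you copy the cluster's configuration files:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class RemoteJobConfig {
    public static void main(String[] args) throws Exception {
        // Load the cluster's own site files so the client picks up the real
        // JobTracker/NameNode addresses instead of the local defaults.
        Configuration conf = new Configuration();
        conf.addResource(new Path("conf/mapred-site.xml")); // path copied from the cluster (placeholder)
        conf.addResource(new Path("conf/hdfs-site.xml"));   // path copied from the cluster (placeholder)

        // Note: add/set everything on conf *before* constructing the Job;
        // new Job(conf) takes a copy of the configuration, so conf.set(...)
        // calls made afterwards (as in the question's code) are ignored.
        Job job = new Job(conf);
        // ... configure mapper/reducer/paths as in the question ...
    }
}
```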
1 Answer


When running from a remote system, you should run as the remote user. You can do this in your main class as follows:

public static void main(String[] args) {
    UserGroupInformation ugi = UserGroupInformation.createRemoteUser("root");

    try {
        ugi.doAs(new PrivilegedExceptionAction<Void>() {
            public Void run() throws Exception {
                // Set properties before constructing the Job, since
                // new Job(conf) takes a copy of the configuration.
                Configuration conf = new Configuration();
                conf.set("hadoop.job.ugi", "root");

                Job job = new Job(conf);

                // write your remaining piece of code here.
                return null;
            }
        });
    } catch (Exception e) {
        e.printStackTrace();
    }
}

Also, when submitting a MapReduce job, the Java class and its dependent jars must be copied to the Hadoop cluster, where the MapReduce job is executed. You can read more here.
So you need to package your code (the Analysis main class, in your case) into a runnable JAR file with all the dependent jars on its manifest classpath, and then run the JAR from the command line:

java -jar job-jar-with-dependencies.jar arguments
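If you build with Maven, one way to produce such a runnable jar-with-dependencies is the shade plugin. This is a sketch of the relevant pom.xml fragment; the main class name is taken from the question, and you would still add your Hadoop dependencies separately:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <transformers>
          <!-- Writes Main-Class into the jar's manifest -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>Analysis</mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```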

HTH!

