Using Spring together with Spark

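I'm developing a Spark application and I'm used to Spring as a dependency injection framework. Now I'm stuck on a problem: the processing part uses Spring's @Autowired feature, but it gets serialized and deserialized by Spark, and that causes problems.
So the following code gets me into trouble: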
Processor processor = ...; // a Spring-constructed object;
                           // this is what causes all the trouble
JavaRDD<Txn> rdd = ...; // some data for Spark
rdd.foreachPartition(processor);

The processor looks like this:

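import java.io.Serializable;
import java.util.Iterator;
import org.apache.spark.api.java.function.VoidFunction;
import org.springframework.beans.factory.annotation.Autowired;

public class Processor implements VoidFunction<Iterator<Txn>>, Serializable {
    private static final long serialVersionUID = 1L;

    @Autowired // This will not work if the object is deserialized
    private transient DatabaseConnection db;

    @Override
    public void call(Iterator<Txn> txns) {
        // ... do some fancy stuff
        db.store(txns);
    }
}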

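So my question is: is it possible to use Spring in combination with Spark? If not, what is the most elegant way to achieve something similar? Thanks for any help!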


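If the problem is that you deserialize the object and @Autowired only runs on first initialization, then you could technically get hold of the ApplicationContext and force it to inject your transient objects manually. - EpicPandaForce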
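The mechanics behind this comment can be shown without Spring or Spark. The sketch below is only an illustration under stated assumptions: `DatabaseConnection` is an empty stand-in, and `Dependencies` is a hypothetical static holder that plays the role the ApplicationContext would play in a real application. It shows that a transient field comes back as null after a serialization round trip, and that a private `readObject` hook can restore it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class ReinjectAfterDeserialization {

    /** Hypothetical stand-in for the Spring-managed DatabaseConnection bean. */
    public static class DatabaseConnection {
    }

    /** Hypothetical static holder; in a Spring application the ApplicationContext plays this role. */
    public static class Dependencies {
        public static final DatabaseConnection DB = new DatabaseConnection();
    }

    /** Mirrors the question: the transient field is skipped by serialization. */
    public static class NaiveProcessor implements Serializable {
        private static final long serialVersionUID = 1L;
        private transient DatabaseConnection db = Dependencies.DB;

        public boolean hasDb() {
            return db != null;
        }
    }

    /** Same class plus a readObject hook that restores the dependency. */
    public static class ReinjectingProcessor implements Serializable {
        private static final long serialVersionUID = 1L;
        private transient DatabaseConnection db = Dependencies.DB;

        // Java serialization calls this private method while deserializing the object.
        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            // With Spring, this line would instead ask the context to inject the fields, e.g.
            // context.getAutowireCapableBeanFactory().autowireBean(this);
            db = Dependencies.DB;
        }

        public boolean hasDb() {
            return db != null;
        }
    }

    /** Serializes and deserializes an object in memory. */
    @SuppressWarnings("unchecked")
    public static <T extends Serializable> T roundTrip(T object) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(object);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new NaiveProcessor()).hasDb());        // false
        System.out.println(roundTrip(new ReinjectingProcessor()).hasDb());  // true
    }
}
```

On a real cluster the deserialization happens in the executor JVM, so whatever the hook looks up (here the static holder, in Spring an ApplicationContext) has to exist in that JVM as well. How to make the context available there is not covered by this sketch.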
1 Answer


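FROM THE QUESTION ASKER: Added: To hook into the deserialization step directly, without modifying your own classes, use the spring-spark project by parapluplu. This project autowires your beans when they are deserialized.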


EDIT:

To use Spark, you need the following setup (which can also be seen in this repository):

  • Spring Boot + Spark:


<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>1.5.2.RELEASE</version>
    <relativePath/>
    <!-- lookup parent from repository -->
</parent>

...

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
        <exclusions>
            <exclusion>
                <groupId>ch.qos.logback</groupId>
                <artifactId>logback-classic</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.1.0</version>
        <exclusions>
            <exclusion>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-log4j12</artifactId>
            </exclusion>
            <exclusion>
                <groupId>log4j</groupId>
                <artifactId>log4j</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.1.0</version>
    </dependency>

    <!-- fix java.lang.ClassNotFoundException: org.codehaus.commons.compiler.UncheckedCompileException -->
    <dependency>
        <groupId>org.codehaus.janino</groupId>
        <artifactId>commons-compiler</artifactId>
        <version>2.7.8</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.slf4j/log4j-over-slf4j -->
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>log4j-over-slf4j</artifactId>
        <version>1.7.25</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-simple</artifactId>
        <version>1.6.4</version>
    </dependency>

</dependencies>

Then you need the application class, as usual with Spring Boot:

@SpringBootApplication
public class SparkExperimentApplication {

    public static void main(String[] args) {
        SpringApplication.run(SparkExperimentApplication.class, args);
    }
}

And then the configuration that binds it all together:
@Configuration
@PropertySource("classpath:application.properties")
public class ApplicationConfig {

    @Autowired
    private Environment env;

    @Value("${app.name:jigsaw}")
    private String appName;

    @Value("${spark.home}")
    private String sparkHome;

    @Value("${master.uri:local}")
    private String masterUri;

    @Bean
    public SparkConf sparkConf() {
        SparkConf sparkConf = new SparkConf()
                .setAppName(appName)
                .setSparkHome(sparkHome)
                .setMaster(masterUri);

        return sparkConf;
    }

    @Bean
    public JavaSparkContext javaSparkContext() {
        return new JavaSparkContext(sparkConf());
    }

    @Bean
    public SparkSession sparkSession() {
        return SparkSession
                .builder()
                .sparkContext(javaSparkContext().sc())
                .appName("Java Spark SQL basic example")
                .getOrCreate();
    }

    @Bean
    public static PropertySourcesPlaceholderConfigurer propertySourcesPlaceholderConfigurer() {
        return new PropertySourcesPlaceholderConfigurer();
    }
}
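
The configuration class above reads three properties through @Value: app.name and master.uri have defaults (jigsaw and local), while spark.home has none and therefore has to be resolvable from somewhere. A minimal application.properties might look like the fragment below; the spark.home path is an assumed placeholder, not a value taken from the original post.

```properties
# src/main/resources/application.properties
# Same values as the defaults declared in ApplicationConfig
app.name=jigsaw
master.uri=local
# Assumed path: replace with the actual Spark installation directory
spark.home=/opt/spark
```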

Then you can use the SparkSession class to communicate with Spark SQL:

/**
 * Created by achat1 on 9/23/15.
 * Just an example to see if it works.
 */
@Component
public class WordCount {
    @Autowired
    private SparkSession sparkSession;

    public List<Count> count() {
        String input = "hello world hello hello hello";
        String[] _words = input.split(" ");
        List<Word> words = Arrays.stream(_words).map(Word::new).collect(Collectors.toList());
        Dataset<Row> dataFrame = sparkSession.createDataFrame(words, Word.class);
        dataFrame.show();

        // col(...) requires: import static org.apache.spark.sql.functions.col;
        // Group by the "word" column, count each group, and reuse the resulting Dataset.
        Dataset<Row> counts = dataFrame.groupBy(col("word")).count();
        counts.show();
        List<Row> rows = counts.collectAsList();
        return rows.stream()
                .map(row -> new Count(row.getString(0), row.getLong(1)))
                .collect(Collectors.toList());
    }
}

These two classes are involved:

public class Word {
    private String word;

    public Word() {
    }

    public Word(String word) {
        this.word = word;
    }

    public void setWord(String word) {
        this.word = word;
    }

    public String getWord() {
        return word;
    }
}

public class Count {
    private String word;
    private long count;

    public Count() {
    }

    public Count(String word, long count) {
        this.word = word;
        this.count = count;
    }

    public String getWord() {
        return word;
    }

    public void setWord(String word) {
        this.word = word;
    }

    public long getCount() {
        return count;
    }

    public void setCount(long count) {
        this.count = count;
    }
}

Then you can run it and check that it returns the right data:
@RequestMapping("api")
@Controller
public class ApiController {
    @Autowired
    WordCount wordCount;

    @RequestMapping("wordcount")
    public ResponseEntity<List<Count>> words() {
        return new ResponseEntity<>(wordCount.count(), HttpStatus.OK);
    }
}

It returns:
[{"word":"hello","count":4},{"word":"world","count":1}]

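I don't know, this is the only solution I could come up with based on what I know, but it might not be the best :) - EpicPandaForce
OK, this helped me in any case. I just use it in a slightly different way. I will inject the dependencies automatically after deserialization, so I will extend the Kryo serialization class. When I have done that, I will edit your answer, because this solution will be based on your answer. - itsme
OK, I created a project on GitHub for this and linked it in your post. I hope that is fine with you. Thanks a lot for your hints! - itsme
@selvinsource Unfortunately, I put this together and was happy with it, but in the end we did not actually "need" Spark, so I'm not sure whether that part is right, because I never had to figure it out. - EpicPandaForce
@EpicPandaForce, I'm running some tests, and it looks like you can do it by setting .config("spark.jars", "target/simple-project-1.0.jar") in addition to the master URL you suggested. Thanks anyway. - selvinsource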