Persisting data to DynamoDB using Apache Spark

I have an application where:

1. I read JSON files from S3 into a DataFrame using SqlContext.read.json.
2. I then run some transformations on the DataFrame.
3. Finally, I want to persist the records to DynamoDB, using one record value as the key and the remaining JSON attributes as values/columns.

This is what I am trying:

JobConf jobConf = new JobConf(sc.hadoopConfiguration());
jobConf.set("dynamodb.servicename", "dynamodb");
jobConf.set("dynamodb.input.tableName", "my-dynamo-table");   // Pointing to DynamoDB table
jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com");
jobConf.set("dynamodb.regionid", "us-east-1");
jobConf.set("dynamodb.throughput.read", "1");
jobConf.set("dynamodb.throughput.read.percent", "1");
jobConf.set("dynamodb.version", "2011-12-05");

jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat");
jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat");

DataFrame df = sqlContext.read().json("s3n://mybucket/abc.json");
RDD<String> jsonRDD = df.toJSON();
JavaRDD<String> jsonJavaRDD = jsonRDD.toJavaRDD();
PairFunction<String, Text, DynamoDBItemWritable> keyData = new PairFunction<String, Text, DynamoDBItemWritable>() {
    public Tuple2<Text, DynamoDBItemWritable> call(String row) {
        DynamoDBItemWritable writeable = new DynamoDBItemWritable();
        try {
            System.out.println("JSON : " + row);
            JSONObject jsonObject = new JSONObject(row);

            System.out.println("JSON Object: " + jsonObject);

            Map<String, AttributeValue> attributes = new HashMap<String, AttributeValue>();
            AttributeValue attributeValue = new AttributeValue();
            attributeValue.setS(row);
            attributes.put("values", attributeValue);

            AttributeValue attributeKeyValue = new AttributeValue();
            attributeValue.setS(jsonObject.getString("external_id"));
            attributes.put("primary_key", attributeKeyValue);

            AttributeValue attributeSecValue = new AttributeValue();
            attributeValue.setS(jsonObject.getString("123434335"));
            attributes.put("creation_date", attributeSecValue);
            writeable.setItem(attributes);
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        return new Tuple2<Text, DynamoDBItemWritable>(new Text(row), writeable);
    }
};

JavaPairRDD<Text, DynamoDBItemWritable> pairs = jsonJavaRDD
        .mapToPair(keyData);

Map<Text, DynamoDBItemWritable> map = pairs.collectAsMap();
System.out.println("Results : " + map);
pairs.saveAsHadoopDataset(jobConf);    

However, I don't see any data being written to DynamoDB, and I don't get any error messages either.
1 Answer


I'm not sure, but your code seems more complicated than necessary.

I have successfully written an RDD to DynamoDB using the following code:

import java.util
import com.amazonaws.services.dynamodbv2.model.AttributeValue
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf

val ddbInsertFormattedRDD = inputRDD.map { case (skey, svalue) =>
    val ddbMap = new util.HashMap[String, AttributeValue]()

    val key = new AttributeValue()
    key.setS(skey.toString)
    ddbMap.put("DynamoDbKey", key)


    val value = new AttributeValue()
    value.setS(svalue.toString)
    ddbMap.put("DynamoDbKey", value)

    val item = new DynamoDBItemWritable()
    item.setItem(ddbMap)

    (new Text(""), item)
}

val ddbConf = new JobConf(sc.hadoopConfiguration)
ddbConf.set("dynamodb.output.tableName", "my-dynamo-table")
ddbConf.set("dynamodb.throughput.write.percent", "0.5")
ddbConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
ddbConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
ddbInsertFormattedRDD.saveAsHadoopDataset(ddbConf)
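
Applied to the question's data, here is a minimal sketch of the same pattern (a sketch only: it assumes Spark 1.x, where df.toJSON returns an RDD[String], an external_id field in every row, and the ddbConf above; the attribute names primary_key and values simply mirror the question):

import java.util
import com.amazonaws.services.dynamodbv2.model.AttributeValue
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.io.Text
import org.json.JSONObject

// Turn each JSON row into a (Text, DynamoDBItemWritable) pair keyed on external_id
val ddbFromJsonRDD = df.toJSON.map { row =>
  val json = new JSONObject(row)
  val ddbMap = new util.HashMap[String, AttributeValue]()
  ddbMap.put("primary_key", new AttributeValue().withS(json.getString("external_id")))
  ddbMap.put("values", new AttributeValue().withS(row))
  val item = new DynamoDBItemWritable()
  item.setItem(ddbMap)
  (new Text(""), item) // the key is ignored by the DynamoDB output format
}

ddbFromJsonRDD.saveAsHadoopDataset(ddbConf)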

Also, have you checked that you have provisioned enough write capacity on the table?
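
For example, a quick sketch to inspect the provisioned write capacity (assuming the AWS SDK for Java v1, 1.11+, and the region and table name from the question):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder

val ddbClient = AmazonDynamoDBClientBuilder.standard()
  .withRegion("us-east-1")
  .build()

// DescribeTable reports the currently provisioned read/write capacity units
val throughput = ddbClient.describeTable("my-dynamo-table").getTable.getProvisionedThroughput
println(s"Write capacity units: ${throughput.getWriteCapacityUnits}")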
