Reading Hadoop SequenceFiles with Hive

Question

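I have some mapred data from the Common Crawl that I have stored in SequenceFile format. I have tried repeatedly to use this data "as is" with Hive so that I can query and sample it at various stages, but I always get the following error in my job output: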

LazySimpleSerDe: expects either BytesWritable or Text object!

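I even built a simpler (and smaller) dataset of [Text, LongWritable] records, but that fails as well. If I output the data in text format and then create a table over it, it works fine: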

hive> create external table page_urls_1346823845675
    >     (pageurl string, xcount bigint) 
    >     location 's3://mybucket/text-parse/1346823845675/';
OK
Time taken: 0.434 seconds
hive> select * from page_urls_1346823845675 limit 10;
OK
http://0-italy.com/tag/package-deals    643    NULL
http://011.hebiichigo.com/d63e83abff92df5f5913827798251276/d1ca3aaf52b41acd68ebb3bf69079bd1.html    9    NULL
http://01fishing.com/fly-fishing-knots/    3437    NULL
http://01fishing.com/flyin-slab-creek/    1005    NULL
...

I tried using a custom input format:

// My custom input class--very simple
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
public class UrlXCountDataInputFormat extends 
     SequenceFileInputFormat<Text, LongWritable> {  }

I then create the table with:

create external table page_urls_1346823845675_seq 
  (pageurl string, xcount bigint) 
  stored as inputformat 'my.package.io.UrlXCountDataInputFormat' 
  outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat'  
  location 's3://mybucket/seq-parse/1346823845675/';

But I still get the same SerDe error.

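I'm sure I'm missing something basic, but I can't seem to get it right. In addition, I have to be able to parse the SequenceFiles in place (i.e. I can't convert my data to text), so I need to figure out the SequenceFile approach for later parts of my project.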

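SOLUTION: As @mark-grover pointed out below, the problem is that Hive ignores the key by default. With only one column available (i.e. the value), the SerDe could not map my second column.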

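The solution was to use a custom InputFormat that is considerably more complex than the one I started with. I found an answer by following a link about using the key instead of the value, and adapted it to my needs: take the key and the value from the internal SequenceFile.Reader and combine them into the final BytesWritable. Something like this (excerpted from the custom reader, since that is where all the hard work happens):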

// I used generics so I can use this all with 
// other output files with just a small amount
// of additional code ...
public abstract class HiveKeyValueSequenceFileReader<K,V> implements RecordReader<K, BytesWritable> {

    public synchronized boolean next(K key, BytesWritable value) throws IOException {
        if (!more) return false;

        long pos = in.getPosition();
        V trueValue = (V) ReflectionUtils.newInstance(in.getValueClass(), conf);
        boolean remaining = in.next((Writable)key, (Writable)trueValue);
        if (remaining) combineKeyValue(key, trueValue, value);
        if (pos >= end && in.syncSeen()) {
          more = false;
        } else {
          more = remaining;
        }
        return more;
    }

    protected abstract void combineKeyValue(K key, V trueValue, BytesWritable newValue);

}

// from my final implementation
public class UrlXCountDataReader extends HiveKeyValueSequenceFileReader<Text,LongWritable> {
    @Override
    protected void combineKeyValue(Text key, LongWritable trueValue, BytesWritable newValue) {
        // TODO I think we need to use straight bytes--I'm not sure this works?
        StringBuilder builder = new StringBuilder();
        builder.append(key);
        builder.append('\001');
        builder.append(trueValue.get());
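        // Encode explicitly as UTF-8 (the encoding Text uses) instead of the platform default charset
        newValue.set(new BytesWritable(builder.toString().getBytes(java.nio.charset.Charset.forName("UTF-8"))));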
    }
}

And with that, I get all my columns!

http://0-italy.com/tag/package-deals    643
http://011.hebiichigo.com/d63e83abff92df5f5913827798251276/d1ca3aaf52b41acd68ebb3bf69079bd1.html    9
http://01fishing.com/fly-fishing-knots/ 3437
http://01fishing.com/flyin-slab-creek/  1005
http://01fishing.com/pflueger-1195x-automatic-fly-reels/    1999
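
To make the byte layout concrete, here is a minimal standalone sketch of what combineKeyValue produces and why two columns then come out. It deliberately does not use the Hadoop or Hive classes: the class name CombinedRowSketch and the methods combine and splitFields are hypothetical names made up for illustration, and splitFields is only a rough stand-in for how a delimited-text SerDe breaks a row on Hive's default field delimiter (Ctrl-A, byte 0x01). The real LazySimpleSerDe does more than this.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: plain Java, no Hadoop or Hive dependencies.
public class CombinedRowSketch {
    // Hive's default field delimiter for delimited text rows is Ctrl-A (0x01).
    static final byte FIELD_DELIMITER = 0x01;

    // Builds one row: UTF-8 key bytes, the delimiter, then the count as decimal text.
    static byte[] combine(String pageUrl, long xcount) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] keyBytes = pageUrl.getBytes(StandardCharsets.UTF_8);
        byte[] countBytes = Long.toString(xcount).getBytes(StandardCharsets.UTF_8);
        out.write(keyBytes, 0, keyBytes.length);
        out.write(FIELD_DELIMITER);
        out.write(countBytes, 0, countBytes.length);
        return out.toByteArray();
    }

    // Rough stand-in for the SerDe: decode the row and split it on the delimiter.
    static String[] splitFields(byte[] row) {
        String text = new String(row, StandardCharsets.UTF_8);
        return text.split("\001", -1);
    }

    public static void main(String[] args) {
        // Key and value combined: two fields, one per table column.
        byte[] row = combine("http://0-italy.com/tag/package-deals", 643L);
        String[] fields = splitFields(row);
        System.out.println(fields.length + " fields: " + fields[0] + "\t" + fields[1]);

        // Value only (the key dropped): a single field, so the second column has nothing to map to.
        byte[] valueOnly = "643".getBytes(StandardCharsets.UTF_8);
        System.out.println(splitFields(valueOnly).length + " field");
    }
}
```

The count is written as decimal text rather than as raw binary because, in this sketch's assumption, the row is parsed as delimited text, which is also what the StringBuilder version above does.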

- codingmonk

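Found a more detailed discussion of using the key instead of the value here: apache hive thread, which led me to a gist containing a custom format and reader. Those two links, plus some other information, let me build the above. - codingmonk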

1 Answer

- Mark Grover · Answer 1

Not sure if this is what is affecting you, but Hive ignores keys when reading SequenceFiles. You may need to create a custom InputFormat (unless you can find one online :-)).

Reference: http://mail-archives.apache.org/mod_mbox/hive-user/200910.mbox/%3C5573211B-634D-4BB0-9123-E389D90A786C@metaweb.com%3E