Upload compressed files to Elastic MapReduce

I have a group of snappy-compressed server logs in S3, and I need to process them using streaming on Elastic MapReduce. How do I tell Amazon and Hadoop that the logs are already compressed (before they are pulled into HDFS!) so that they can be decompressed before being sent to the streaming mapper script?

The only documentation I can find is here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HadoopDataCompression.html#emr-using-snappy, and it seems to refer to intermediate compression, not to files that are already compressed when they arrive in HDFS.

By the way, I mainly work in Python, so bonus points if you have a solution in boto!

+5
2 answers

The answer is "it can't be done." At least, not for the specific case of applying Hadoop streaming to snappy-compressed files that originate outside of Hadoop.

I (thoroughly!) explored two main options to come to this conclusion: (1) try to use Hadoop's built-in snappy compression, as suggested in the other answer, or (2) write my own streaming module to consume and decompress the snappy files.

For option (1), it looks like Hadoop adds some markup to files when it compresses them using snappy. Since my files were compressed using snappy outside of Hadoop, Hadoop's built-in codec cannot decompress them.
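To illustrate how a framing mismatch could surface as a heap error, here is a minimal Python sketch. It assumes that Hadoop's block codecs expect each block to begin with a 4-byte big-endian uncompressed size followed by a 4-byte compressed chunk length, and, purely for illustration, that the input was written in the snappy framing format, whose stream identifier is the bytes ff 06 00 00 followed by "sNaPpY". The helper name is hypothetical, and the actual format of the files in this question is not known.

```python
import struct

# Stream identifier that opens a file in the snappy framing format.
SNAPPY_FRAMING_MAGIC = b"\xff\x06\x00\x00sNaPpY"


def read_as_hadoop_block_header(data):
    """Interpret the first 8 bytes the way a Hadoop-style block decompressor
    is assumed to: two signed big-endian 32-bit integers giving the
    uncompressed block size and the length of the first compressed chunk."""
    if len(data) < 8:
        raise ValueError("need at least 8 bytes")
    uncompressed_size, chunk_length = struct.unpack(">ii", data[:8])
    return uncompressed_size, chunk_length


if __name__ == "__main__":
    size, chunk = read_as_hadoop_block_header(SNAPPY_FRAMING_MAGIC)
    print("uncompressed size field:", size)
    print("first chunk length field:", chunk)
```

Under these assumptions the chunk length field decodes to roughly 1.9 billion bytes, so a decompressor that allocates a buffer of that size would plausibly exhaust the heap, which is consistent with the heap error reported in this answer.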

One symptom of this problem was a heap error:

2013-04-03 20:14:49,739 FATAL org.apache.hadoop.mapred.Child (main): Error running child : java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:102)
    at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:76)
    at java.io.InputStream.read(InputStream.java:85)
    ...

After I raised the mapred.child.java.opts setting, I got a new error:

java.io.IOException: IO error in map input file s3n://my-bucket/my-file.snappy

Hadoop's snappy codec simply does not work with files generated outside of Hadoop.

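For option (2), the problem is that Hadoop streaming does not distinguish between \n, \r, and \r\n line breaks. Since snappy compression ends up scattering those byte values throughout the compressed files, this is fatal. Here is my error trace: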

2013-04-03 22:29:50,194 WARN org.apache.hadoop.mapred.Child (main): Error running child
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:372)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:586)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    ...
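The line-break problem can be reproduced without Hadoop. The following is a minimal Python sketch, not the streaming implementation itself: it uses bytes.splitlines(), which treats \n, \r, and \r\n alike as line breaks, to stand in for a reader that does not distinguish between them. The payload bytes are made up for illustration.

```python
def roundtrip_through_lines(payload):
    """Split binary data into 'lines' and rejoin them with \\n, as a
    line-oriented reader that normalizes line endings would."""
    return b"\n".join(payload.splitlines())


if __name__ == "__main__":
    # Hypothetical binary payload that happens to contain 0x0d and 0x0a bytes.
    payload = b"\x00\x01\r\x02\x03\r\n\x04\n\x05"
    restored = roundtrip_through_lines(payload)
    # The original bytes are not recovered, so a decompressor fed the
    # rejoined data would see a corrupted stream.
    print(restored == payload)
    print(len(payload), len(restored))
```

Once the \r and \r\n bytes have been normalized away, the script receiving the records has no way to tell which separator was originally present.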

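With a little work on Hadoop's Java classes, we could probably fix the \r vs. \n problem. But, as I said at the start, my goal was to build within Hadoop streaming, without touching Java. With that constraint, there does not seem to be any way to solve this problem.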

In the end, I went back to the people generating these files and persuaded them to switch to gzip or lzo.

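PS - For option (2), I played with splitting records on different characters (for example, textinputformat.record.delimiter=X), but it felt very hacky and did not work anyway.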

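PPS - Another workaround would be to write scripts that download the files from S3, decompress them, and then run -copyFromLocal to pull them into HDFS. Computationally there is nothing wrong with this, but from a workflow perspective it would introduce all kinds of hassles.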
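A rough sketch of that workaround in Python follows. It assumes the boto3 and python-snappy packages are installed, that the hadoop command is on the PATH, and that the files are in raw snappy format (files in the framing format would need snappy.stream_decompress instead). The function names and local and HDFS paths are hypothetical.

```python
import subprocess


def copy_from_local_command(local_path, hdfs_path):
    """Build the argv for `hadoop fs -copyFromLocal`."""
    return ["hadoop", "fs", "-copyFromLocal", local_path, hdfs_path]


def fetch_decompress_and_load(bucket, key, local_path, hdfs_path):
    """Download one object from S3, decompress it, and copy it into HDFS."""
    # Third-party imports are kept local so the helper above can be used
    # without them installed.
    import boto3
    import snappy

    compressed_path = local_path + ".snappy"
    boto3.client("s3").download_file(bucket, key, compressed_path)

    # Reads the whole file into memory; fine for a sketch, not for huge logs.
    with open(compressed_path, "rb") as src, open(local_path, "wb") as dst:
        dst.write(snappy.decompress(src.read()))

    subprocess.check_call(copy_from_local_command(local_path, hdfs_path))


if __name__ == "__main__":
    # Only prints the command; call fetch_decompress_and_load(...) with real
    # bucket, key, and path values to run the full workflow.
    print(" ".join(copy_from_local_command("/tmp/my-file.log", "/logs/my-file.log")))
```

Each file has to be staged on local disk before it reaches HDFS, which is the workflow overhead mentioned above.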

+7

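Assuming you are using TextInputFormat (or one of its subclasses), compressed input files with a .snappy extension are handled automatically.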

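You might want to consider using lzo compression (.gz extension) instead of snappy. You give up some compression speed for a better compression ratio and an input file that is splittable. Cloudera mentions this on their blog: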

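One thing to note is that Snappy is intended to be used with a container format, like Sequence Files or Avro Data Files, rather than being used directly on plain text, for example, since the latter is not splittable and cannot be processed in parallel using MapReduce. This is different from LZO, where it is possible to index LZO-compressed files to determine split points so that LZO files can be processed efficiently in subsequent processing.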

+1
