I am trying to parse data from commoncrawl.org using Hadoop streaming. I set up a local Hadoop installation to test my code, and have a simple Ruby mapper that uses a streaming ARC file reader. When I invoke my code by itself like
cat 1262876244253_18.arc.gz | mapper.rb | reducer.rb
It works as expected.
When the same job runs under Hadoop streaming, however, Hadoop seems to notice that the file has a .gz extension and decompress it before handing it to the mapper, and in doing so it converts \r\n line breaks in the stream to \n. Because ARC relies on the record length given in the header line, this change breaks the parser (the data length has changed).
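To illustrate why the conversion matters, here is a minimal Ruby sketch of a length-prefixed reader. It is not my actual mapper and not the real ARC format: the record layout (a header line of "url length" followed by the body), the helper names `build_record` and `read_records`, and the example URLs are all made up for the illustration.

```ruby
require "stringio"

# Build a simplified, hypothetical record: a header line ending in the body
# length in bytes, followed by the body itself. Real ARC headers carry more
# fields; only the trailing length matters for this illustration.
def build_record(url, body)
  "#{url} #{body.bytesize}\n#{body}"
end

# Read records by trusting the byte length declared in each header line.
def read_records(io)
  records = []
  until io.eof?
    header = io.gets("\n")
    url, length = header.chomp.split(" ")
    body = io.read(Integer(length))
    records << [url, body]
  end
  records
end

body_a = "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<html>a</html>"
body_b = "HTTP/1.0 200 OK\r\n\r\n<html>b</html>"
stream = build_record("http://a.example/", body_a) +
         build_record("http://b.example/", body_b)

# Byte-exact stream: both records come back unchanged.
intact = read_records(StringIO.new(stream))

# Simulate the suspected line-break conversion. Every body shrinks by one
# byte per \r\n, but the lengths declared in the headers stay the same.
converted = stream.gsub("\r\n", "\n")
damaged = read_records(StringIO.new(converted))

puts intact.map(&:first).inspect
puts damaged.map(&:first).inspect
```

In this sketch the first body contains three \r\n pairs, so after conversion the reader overshoots by three bytes, swallows the start of the next header, and reads the second URL without its first three characters.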
To double check, I changed my mapper to expect uncompressed data, and ran:
cat 1262876244253_18.arc.gz | zcat | mapper.rb | reducer.rb
And it works.
I do not mind Hadoop decompressing automatically (although I can quite happily handle streaming .gz files myself), but if it does, I need it to decompress in binary mode, without any line-break conversion or similar. I believe the default behavior is to feed the decompressed files to one mapper per file, which is ideal.
How can I either ask it not to decompress .gz files (renaming the files is not an option) or make it decompress correctly? I would prefer not to use a special InputFormat class that I have to ship in a jar, if at all possible.
All of this will run on AWS ElasticMapReduce.