Hadoop converts \r\n to \n and breaks the ARC format

I am trying to parse data from commoncrawl.org using Hadoop streaming. I set up a local Hadoop installation to test my code, and have a simple Ruby mapper that uses a streaming ARC file reader. When I invoke my code by itself like

cat 1262876244253_18.arc.gz | mapper.rb | reducer.rb

It works as expected.

It seems that Hadoop automatically sees that the file has a .gz extension and decompresses it before handing it to the mapper, but in doing so it converts the \r\n line breaks in the stream to \n. Because ARC relies on the record length given in the header line, this change breaks the parser (the data length no longer matches).
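To illustrate why a length-driven parser is so sensitive to this, here is a minimal sketch. The reader, the helper names (`each_record`, `records`, `sample_record`) and the sample records are all hypothetical and much simpler than a real ARC reader; the only assumption carried over from the format is that the last field of the header line is the body length in bytes.

```ruby
require "stringio"

# Simplified, hypothetical ARC-style reader (not the asker's code).
# Each record is a header line whose last field is the body length in
# bytes, followed by exactly that many body bytes and a blank line.
def each_record(io)
  io.binmode
  until io.eof?
    header = io.gets("\n")
    next if header.strip.empty?            # skip the separator line
    length = Integer(header.split.last)    # declared body length
    body = io.read(length)
    raise EOFError, "truncated record" if body.nil? || body.bytesize < length
    yield header.chomp, body
  end
end

# Collects all records of a byte string into an array of [header, body].
def records(data)
  out = []
  each_record(StringIO.new(data.dup)) { |header, body| out << [header, body] }
  out
end

# Builds one made-up record whose body uses \r\n line endings.
def sample_record(url, text)
  body = "HTTP/1.1 200 OK\r\n\r\n#{text}\n"
  "#{url} 127.0.0.1 20100107000000 text/html #{body.bytesize}\n#{body}\n"
end

stream  = (sample_record("http://example.com/a", "hello") +
           sample_record("http://example.com/b", "world")).b
mangled = stream.gsub("\r\n", "\n")        # the conversion described above

puts "intact:  #{records(stream).length} records parsed"
begin
  records(mangled)
rescue EOFError, ArgumentError => e
  puts "mangled: parser failed (#{e.class}: #{e.message})"
end
```

With the line endings rewritten, each body is shorter than its header claims, so the reader consumes the start of the next record and loses alignment from then on.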

To double check, I changed my mapper to expect uncompressed data, and ran:

cat 1262876244253_18.arc.gz | zcat | mapper.rb | reducer.rb

And it works.

I do not mind Hadoop decompressing automatically (although I can quite happily handle streaming .gz files myself), but if it does, I need it to decompress in binary mode, without any line-break conversion or similar. I believe the default behavior is to feed each decompressed file to a single mapper, which is ideal.
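For reference, byte-exact decompression inside a Ruby mapper could look roughly like the sketch below. The function name `gunzip_binary` and the sample payload are made up for illustration, and the sketch reads a single gzip member only; a file made of several concatenated gzip members would need extra handling.

```ruby
require "zlib"
require "stringio"

# Decompresses one gzip member from the given IO without any
# line-ending translation. Hypothetical helper, for illustration only.
def gunzip_binary(io)
  io.binmode
  Zlib::GzipReader.new(io).read
end

# Build a small gzip stream in memory so the sketch is self-contained.
payload = "HTTP/1.1 200 OK\r\n\r\nhello\n".b
buf = StringIO.new(String.new)             # binary, unfrozen buffer
gz = Zlib::GzipWriter.new(buf)
gz.write(payload)
gz.finish                                  # flush without closing buf
compressed = buf.string

restored = gunzip_binary(StringIO.new(compressed))
puts "bytes in:  #{payload.bytesize}"
puts "bytes out: #{restored.bytesize}"
puts "CRLF kept: #{restored.include?("\r\n")}"
```

In a mapper that received the raw .gz bytes, the same helper would be called on `$stdin` instead of a `StringIO`.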

How can I either ask it not to decompress .gz files (renaming the files is not an option) or make it decompress them correctly? I would prefer not to use a special InputFormat class that I would have to ship in a jar, if at all possible.

All of this will eventually run on AWS ElasticMapReduce.

1 answer

It looks like Hadoop's PipeMapper.java is to blame (at least in 0.20.2):

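Around line 106, the input from TextInputFormat is passed to the mapper (by which point the \r\n has already been stripped), and PipeMapper writes it to stdout terminated with just \n.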
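The effect of a line-oriented hand-off of this kind can be simulated in a few lines of Ruby. This is only a simulation of the behavior attributed to PipeMapper here, not Hadoop's code; `line_oriented_round_trip` and the sample data are made up for illustration.

```ruby
# Simulation only: split the data into lines, drop each line terminator
# (\n or \r\n), then write every line back out followed by a bare \n.
def line_oriented_round_trip(data)
  lines = data.each_line.map(&:chomp)      # terminators are discarded here
  lines.map { |line| line + "\n" }.join    # and replaced by a plain \n
end

original = "HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello\n".b
round_tripped = line_oriented_round_trip(original)

puts original.inspect
puts round_tripped.inspect
puts "bytes lost: #{original.bytesize - round_tripped.bytesize}"
```

Every \r\n in the input costs one byte, which is why the result no longer matches lengths recorded before the round trip.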

A suggestion would be to amend the source of PipeMapper.java as required.
