Writing one file per group in Pig Latin

Problem: I have many files that contain Apache web server log entries. The entries are not in chronological order and are scattered across all of the files. I am trying to use Pig to read a day's worth of files, group and sort the log entries by date, and then write them to files named for the day and hour of the records they contain.

Setup: Once I have loaded my files, I use a regex to extract the date field, then truncate it to the hour. This produces a relation with the original record in one field and the date truncated to the hour in another. From there I group on the date-hour field.

First attempt: My first thought was to call STORE while iterating over my groups with FOREACH, but I quickly found that Pig does not allow that.

Second attempt: My second attempt was to use MultiStorage() from the piggybank, which seemed to work until I looked at the output files. The problem is that MultiStorage writes all of the fields to the file, including the field I used for grouping. I really want only the original record written to the file.

Question: So, am I using Pig for something it is not intended for, or is there a better way to approach this problem with Pig? Now that I have posted this question, I will work on a simple code example to explain my problem further. As soon as I have it, I will post it here. Thanks in advance.

1 answer

Out of the box, Pig does not have a lot of functionality. It covers the basics, but more often than not I have to write custom UDFs or load/store funcs to get from 95% of the way there to 100%. I usually find it worth the effort, because writing a small store function takes much less Java than writing a whole MapReduce program.

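In your case, I would write a custom store function based on MultiStorage. In putNext, drop the field you grouped on and write out only the rest. Note that Tuple has no remove or delete method, so you will probably have to copy the fields you want to keep into a new Tuple.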
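To make the "drop the grouping field before writing" idea concrete, here is a minimal sketch. It is not the piggybank code and not a real store function: a plain List<Object> stands in for a Pig Tuple so the example runs without the Pig jars, and the class and method names (DropGroupField, withoutField, toLine) are made up for illustration. In an actual store func you would presumably do the same copy inside putNext and build the result with TupleFactory.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in sketch: a List<Object> plays the role of a Pig Tuple here.
// All names in this class are hypothetical, not part of Pig or piggybank.
class DropGroupField {

    // Copy every field except the one at splitIndex into a new list.
    // A copy is made because the tuple cannot simply have a field removed.
    static List<Object> withoutField(List<Object> fields, int splitIndex) {
        List<Object> kept = new ArrayList<>(fields.size());
        for (int i = 0; i < fields.size(); i++) {
            if (i != splitIndex) {
                kept.add(fields.get(i));
            }
        }
        return kept;
    }

    // Join the remaining fields into the line that would be written out.
    static String toLine(List<Object> fields, String delimiter) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(delimiter);
            }
            sb.append(String.valueOf(fields.get(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Object> record = new ArrayList<>();
        record.add("2012-06-01-13"); // hour key used for grouping (assumed format)
        record.add("127.0.0.1 - - [01/Jun/2012:13:55:36 -0700] \"GET / HTTP/1.0\" 200 2326");

        // Field 0 is the grouping key, so only the original log line is written.
        System.out.println(toLine(withoutField(record, 0), "\t"));
    }
}
```

The grouping key is still needed to choose the output file, so a real implementation would read it from the incoming tuple first and only then strip it from what gets written.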

For more on writing load/store functions, see: http://pig.apache.org/docs/r0.10.0/udf.html#load-store-functions
