Problem:
I have many files that contain Apache web server log entries. The entries are not in chronological order and are scattered across all of the files. I am trying to use Pig to read a day's worth of files, group and order the log entries by date, and then write them to files named for the day and hour of the records they contain.
Setup:
Once I have loaded my files, I use a regex to extract the date field and then truncate it to the hour. This produces a relation in which one field holds the original record and another holds the date truncated to the hour. From there I group on the truncated date field.
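To make the setup concrete, here is a rough sketch of the same logic in plain Python rather than Pig. It is only an illustration, not the actual Pig script: the Common Log Format timestamp, the names `hour_key` and `group_by_hour`, and the `YYYY-MM-DD-HH` key layout are all assumptions.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumption: timestamps look like [10/Oct/2000:13:55:36 -0700] (Common Log Format).
# The capture group keeps everything up to and including the hour.
TIMESTAMP_RE = re.compile(
    r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}):\d{2}:\d{2} [+-]\d{4}\]"
)


def hour_key(line):
    """Return the entry's timestamp truncated to the hour, or None if absent."""
    match = TIMESTAMP_RE.search(line)
    if match is None:
        return None
    truncated = datetime.strptime(match.group(1), "%d/%b/%Y:%H")
    return truncated.strftime("%Y-%m-%d-%H")  # hypothetical key layout


def group_by_hour(lines):
    """Group the original records by their truncated date, like GROUP ... BY."""
    groups = defaultdict(list)
    for line in lines:
        groups[hour_key(line)].append(line)
    return dict(groups)
```

Each group keeps the untouched original record, and the truncated date exists only as the grouping key.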
First attempt:
My first thought was to use the STORE command while iterating over my groups with FOREACH, but I quickly found that Pig does not allow that.
Second attempt:
My second attempt was to use the MultiStorage() function from Piggybank, which seemed to work fine until I looked at the output files. The problem is that MultiStorage writes all of the fields to the file, including the field that I grouped on. I really want only the original record written to the file.
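The output I am after can be sketched in plain Python (again, not Pig, and not a suggested fix). The function name `write_records_by_key`, the `.log` extension, and the one-file-per-key layout are assumptions for illustration. Given (record, key) pairs, the key only chooses the file name and is never written into the file.

```python
import os
from collections import defaultdict


def write_records_by_key(pairs, out_dir):
    """Write each record to out_dir/<key>.log, omitting the key itself.

    pairs is an iterable of (record, key) tuples, mirroring the two-field
    relation described in the setup. Returns a dict of key -> file path.
    """
    by_key = defaultdict(list)
    for record, key in pairs:
        by_key[key].append(record)

    os.makedirs(out_dir, exist_ok=True)
    paths = {}
    for key, records in by_key.items():
        path = os.path.join(out_dir, key + ".log")  # hypothetical naming scheme
        with open(path, "w", encoding="utf-8") as handle:
            for record in records:
                # Only the original record is written; the grouping key is not.
                handle.write(record + "\n")
        paths[key] = path
    return paths
```

By contrast, the MultiStorage output described above would correspond to writing both the record and the key on every line.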
Question:
So, am I using Pig for something it was not intended for, or is there a better way to approach this problem with Pig? Now that I have posted this question, I will work on a simple code example to explain my problem further. As soon as I have it, I will post it here. Thanks in advance.