Importing tiered log directories into Hadoop/Pig

We store our logs on S3, and one of our Pig queries needs to read three different log types. Each log type lives in a set of subdirectories organized by type and date. For instance:

/logs/<type>/<year>/<month>/<day>/<hour>/lots_of_logs_for_this_hour_and_type.log*

My query needs to load all three log types for a given time period. For instance:

type1 = load 's3:/logs/type1/2011/03/08' as ...
type2 = load 's3:/logs/type2/2011/03/08' as ...
type3 = load 's3:/logs/type3/2011/03/08' as ...
result = join type1 ..., type2, etc...

My queries would then run against all of these logs.

What is the most effective way to handle this?

  • Do we need to use bash script expansion? I'm not sure whether this works with multiple directories, and I doubt it would be efficient (or even possible) with 10k logs to load.
  • Do we create a service that aggregates all the logs and pushes them directly to HDFS?
  • Custom Java/Python importers?
  • Other thoughts?



Globbing is supported with PigStorage, so you could try:

type1 = load 's3:/logs/type{1,2,3}/2011/03/08' as ..

or

type1 = load 's3:/logs/*/2011/03/08' as ..
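
If the three types still need to be joined against each other, one option is to keep a separate relation per type and glob over the date components instead. The following is only a sketch: the `(line:chararray)` schema and the three-day range are assumptions made for illustration, since the question elides the real schema.

```pig
-- Sketch only: the schema and the day range below are assumed, not from the question.
-- {07,08,09} matches three day directories; the trailing * matches every hour directory.
type1 = LOAD 's3:/logs/type1/2011/03/{07,08,09}/*' AS (line:chararray);
type2 = LOAD 's3:/logs/type2/2011/03/{07,08,09}/*' AS (line:chararray);
type3 = LOAD 's3:/logs/type3/2011/03/{07,08,09}/*' AS (line:chararray);
```

Each relation can then be used in a join as in the question, with the glob resolved when the input paths are listed rather than by a wrapper script.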


With the current layout, loading a whole month for a single type is straightforward:

type1 = load 's3:/logs/type1/2011/03/' as ...

This keeps type1 separate from type2. To analyze by date across types instead, the layout could be changed to:

/logs/<year>/<month>/<day>/<hour>/<type>/lots_of_logs_for_this_hour_and_type.log*
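
Under that date-first layout, a single path would cover every type for a day, and a glob over the hour directory would still pick out one type. This is a rough sketch, assuming a single-column `(line:chararray)` schema that does not appear in the original post:

```pig
-- Sketch only: assumes the date-first layout above and a one-column schema.
-- Every type for one day:
all_types = LOAD 's3:/logs/2011/03/08' AS (line:chararray);
-- A single type for the same day; * stands for every hour directory:
type1 = LOAD 's3:/logs/2011/03/08/*/type1' AS (line:chararray);
```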



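If your data is partitioned Hive-style, some PiggyBank loaders (for example, AllLoader) support partition filtering, as long as the directory structure looks like: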

.../type=value1/...
.../type=value2/...
.../type=value3/...

You can then LOAD the data and FILTER BY type == 'value1'.

Example:

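REGISTER piggybank.jar;
-- AllLoader lives in the PiggyBank storage package
I = LOAD '/hive/warehouse/mytable' USING org.apache.pig.piggybank.storage.AllLoader() AS ( a:int, b:int );
-- Pig Latin uses == for equality comparison
F = FILTER I BY type == 1 OR type == 2;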