Import tiered log directories into hadoop / pig

Question

We store our logs in S3, and one of our Pig queries needs to read three different log types. Each log type lives in a set of subdirectories organized by type and date. For instance:

/logs/<type>/<year>/<month>/<day>/<hour>/lots_of_logs_for_this_hour_and_type.log*

My query needs to load all three log types for a given time. For instance:

type1 = load 's3:/logs/type1/2011/03/08' as ...
type2 = load 's3:/logs/type2/2011/03/08' as ...
type3 = load 's3:/logs/type3/2011/03/08' as ...
result = join type1 ..., type2, etc...

My queries would then run against all of these logs.

What is the most efficient way to handle this?

Do we need to use bash script expansion? I'm not sure it works with multiple directories, and I doubt it would be efficient (or even possible) with 10k logs to load.
Do we create a service to aggregate all the logs and push them directly to HDFS?
Custom Java / Python importers?
Other thoughts?

+3

hadoop hdfs apache-pig

Joshua Ball 11 . '11 20:01

3

You can load a whole month of a single log type by pointing the load at the month directory:

type1 = load 's3:/logs/type1/2011/03/' as ...

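That reads type1 without mixing in type2. If you want to analyze by date rather than by type, you could restructure the directories as: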

/logs/<year>/<month>/<day>/<hour>/<type>/lots_of_logs_for_this_hour_and_type.log*

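As a rough illustration of the date-first layout, here is a minimal Python sketch that maps a path from the type-first layout in the question to the date-first one. The helper name `to_date_first` and the file name `events.log` are made up for the example, and the sketch only rewrites path strings; it does not touch S3 or HDFS.

```python
import re

# Hypothetical helper. The pattern mirrors the layout quoted in the question:
# /logs/<type>/<year>/<month>/<day>/<hour>/<file>
TYPE_FIRST = re.compile(
    r"^/logs/(?P<type>[^/]+)/(?P<year>\d{4})/(?P<month>\d{2})/"
    r"(?P<day>\d{2})/(?P<hour>\d{2})/(?P<name>[^/]+)$"
)


def to_date_first(path: str) -> str:
    """Map a type-first log path to /logs/<year>/<month>/<day>/<hour>/<type>/<file>."""
    m = TYPE_FIRST.match(path)
    if m is None:
        raise ValueError(f"not a type-first log path: {path!r}")
    return "/logs/{year}/{month}/{day}/{hour}/{type}/{name}".format(**m.groupdict())


# 'events.log' is an invented file name used only for illustration.
print(to_date_first("/logs/type1/2011/03/08/14/events.log"))
```

With the type below the hour, every log for a given date shares one prefix, so a single date path (plus a pattern for the lower levels) can cover all types.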
0

frail 13 . '11 13:06

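If you use Hive and your data is partitioned, some of the loaders in PiggyBank (for example, AllLoader) support partitioning, as long as the directory structure looks like: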

.../type=value1/...
.../type=value2/...
.../type=value3/...

Then you can LOAD the data and FILTER BY type == 'value1'.

Example:

REGISTER piggybank.jar;
I = LOAD '/hive/warehouse/mytable' using AllLoader() AS ( a:int, b:int );
-- Pig Latin tests equality with ==, not =
F = FILTER I BY type == 1 OR type == 2;
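To make the `key=value` directory convention concrete, here is a small Python sketch that pulls partition values out of a path and keeps only the paths whose `type` matches. It is not how AllLoader is implemented; the function names and the `part-00000` file names are assumptions made for the example.

```python
def partition_values(path: str) -> dict:
    """Collect key=value directory components from a path into a dict."""
    values = {}
    for part in path.strip("/").split("/"):
        key, sep, value = part.partition("=")
        if sep and key:  # only components that actually contain '='
            values[key] = value
    return values


def keep(path: str, key: str, wanted: set) -> bool:
    """True when the path's partition value for `key` is one of `wanted`."""
    return partition_values(path).get(key) in wanted


# Invented example paths under the table directory from the snippet above.
paths = [
    "/hive/warehouse/mytable/type=1/part-00000",
    "/hive/warehouse/mytable/type=2/part-00000",
    "/hive/warehouse/mytable/type=3/part-00000",
]
selected = [p for p in paths if keep(p, "type", {"1", "2"})]
print(selected)
```

The values come out as strings because they are read from directory names; a partition-aware loader is what lets the filter run against them as a column.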

0

Samuel kerrien Aug 2 '12 at 10:30

Romain · Accepted Answer · 2011-03-14T18:18:11+0000

Globbing works with PigStorage, so you could try:

type1 = load 's3:/logs/type{1,2,3}/2011/03/08' as ..

or, to match every type:

type1 = load 's3:/logs/*/2011/03/08' as ..
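To see which directories the brace pattern stands for, here is a minimal Python sketch that expands `{a,b,c}` groups into explicit paths. It only mimics the brace part of the glob syntax for illustration (the `*` wildcard is left untouched) and is not the code Hadoop or Pig runs; `expand_braces` is a name invented for this example.

```python
import re


def expand_braces(pattern: str) -> list:
    """Expand simple {a,b,c} groups in a path pattern into explicit paths."""
    m = re.search(r"\{([^{}]*)\}", pattern)
    if m is None:
        return [pattern]  # nothing left to expand
    head, tail = pattern[:m.start()], pattern[m.end():]
    expanded = []
    for alternative in m.group(1).split(","):
        # Recurse so that later groups in the same pattern expand too.
        expanded.extend(expand_braces(head + alternative + tail))
    return expanded


for path in expand_braces("s3:/logs/type{1,2,3}/2011/03/08"):
    print(path)
```

The single load statement therefore covers the same three directories as the three separate loads in the question, but yields one relation, so the types have to be told apart by a field in the records if they are joined afterwards.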