We save our logs on S3, and one of our (Pig) requests will capture three different types of logs. Each type of log is in sets of subdirectories based on type / date. For instance:
/logs/<type>/<year>/<month>/<day>/<hour>/lots_of_logs_for_this_hour_and_type.log*
My request would like to download all three types of magazines in order to give time. For instance:
type1 = load 's3:/logs/type1/2011/03/08' as ...
type2 = load 's3:/logs/type2/2011/03/08' as ...
type3 = load 's3:/logs/type3/2011/03/08' as ...
result = join type1 ..., type2, etc...
my requests will be executed against all these magazines.
What is the most effective way to handle this?
- Do I need to use the bash script extension? Not sure if this works with multiple directories, and I doubt it would be effective (or even possible) if 10k logs were loaded.
- Are we creating a service to aggregate all the logs and push them directly to hdfs?
- Custom java / python importers?
- Other thoughts?
, , .