Multiple source files for s3distcp

Is there a way to copy a list of files from S3 to HDFS, instead of a whole folder, using s3distcp? This is for cases where srcPattern cannot work.

I have several files with different names in an S3 folder, and I want to copy only certain ones to an HDFS directory. I did not find a way to give s3distcp several source paths.

The workaround I am currently using is to list all the file names in srcPattern:

hadoop jar s3distcp.jar \
    --src s3n://bucket/src_folder/ \
    --dest hdfs:///test/output/ \
    --srcPattern '.*somefile.*|.*anotherone.*'
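If the wanted names are kept in a list, the alternation can be generated rather than typed by hand. This is only a sketch: build_src_pattern is a made-up helper, the two file names are placeholders, and it assumes srcPattern is matched against the full object path. It escapes only the dots in each name, so names containing other regex metacharacters would need more escaping.

```shell
#!/bin/sh
# Hypothetical helper: read file names on stdin (one per line) and
# print a single regex alternation suitable for --srcPattern.
build_src_pattern() {
    # 1) escape literal dots, 2) prefix each name with ".*/",
    # 3) join all lines with "|".
    sed -e 's/\./\\./g' -e 's|^|.*/|' | paste -sd '|' -
}

# Example with two placeholder names; a real list would come from a file.
pattern=$(printf '%s\n' abc.txt abd.txt | build_src_pattern)
echo "$pattern"
# prints: .*/abc\.txt|.*/abd\.txt

# The result would then be passed on (not run here):
#   hadoop jar s3distcp.jar --src ... --dest ... --srcPattern "$pattern"
```

With thousands of names the resulting argument gets very long, so it is worth checking it against the system's command-line length limit (getconf ARG_MAX) before relying on this.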

Can this work when the number of files is large, say about 10,000?

0
2 answers

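Yes, you can. Create a manifest file listing the files you need and pass it to s3distcp with the --copyFromManifest option.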

+2

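hadoop distcp should solve your problem. You can use distcp to copy data from S3 to HDFS.

It also supports wildcards, and you can give several source paths in one command.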

http://hadoop.apache.org/docs/r1.2.1/distcp.html

See the usage section at that URL.

Example: say the S3 bucket (test-bucket) contains these files in the test1 folder:

abc.txt
abd.txt
defg.txt

and these in the test2 folder:

hijk.txt
hjikl.txt
xyz.txt

and your HDFS destination is hdfs://localhost.localdomain:9000/user/test/

Then the distcp command for a particular pattern is:

hadoop distcp s3n://test-bucket/test1/ab*.txt \
    s3n://test-bucket/test2/hi*.txt \
    hdfs://localhost.localdomain:9000/user/test/
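When the files do not share a convenient wildcard, distcp can also read its sources from a list file through the -f option described on the same usage page. A rough sketch, assuming the bucket layout from the example above: make_srclist is a made-up helper, and the HDFS location of srclist.txt is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: print one full source URI per line for the
# given prefix and file names.
make_srclist() {
    prefix=$1
    shift
    for name in "$@"; do
        # Strip a trailing slash from the prefix so URIs have a single "/".
        printf '%s/%s\n' "${prefix%/}" "$name"
    done
}

# Build the list locally from the example files.
make_srclist s3n://test-bucket/test1 abc.txt abd.txt > srclist.txt
make_srclist s3n://test-bucket/test2/ hijk.txt hjikl.txt >> srclist.txt

# On a machine with Hadoop installed (not run here; the list path is a placeholder):
#   hadoop fs -put srclist.txt hdfs://localhost.localdomain:9000/user/srclist.txt
#   hadoop distcp -f hdfs://localhost.localdomain:9000/user/srclist.txt \
#       hdfs://localhost.localdomain:9000/user/test/
```

Because the list lives in a file rather than on the command line, this should be easier to manage than one long pattern when there are thousands of names.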
+4
