hadoop - Merging MapReduce output -

- February 15, 2012

I have 2 MapReduce jobs that produce files in 2 separate directories, like so:

 Directory output1:
 ------------------
 /output/20140102-r-00000.txt
 /output/20140102-r-00000.txt
 /output/20140103-r-00000.txt
 /output/20140104-r-00000.txt

 Directory output2:
 ------------------
 /output-update/20140102-r-00000.txt

I want to merge these 2 directories into a new directory, /output-complete/, where 20140102-r-00000.txt from /output-update replaces the original file from the /output directory, and the "-r-0000x" part is removed from every file name. The 2 original directories will then be empty, and the resulting directory should look as follows:

 Directory output3:
 -------------------
 /output-complete/20140102.txt
 /output-complete/20140102.txt
 /output-complete/20140103.txt
 /output-complete/20140104.txt

What is the best way to do this? Can I use only HDFS shell commands, or do I need to write a Java program that traverses both directories and applies the logic?

You can use Pig:

get_data = LOAD '/output*/20140102*.txt' USING loader();
STORE get_data INTO '/output-complete/20140102.txt';

Or an HDFS shell command:

hadoop fs -cat '/output*/20140102*.txt' > output-complete/20140102.txt

Single quotes may not work; if they don't, try double quotes. Note that the redirect writes the result to a local file, not to HDFS.
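
The commands above concatenate every matching file into one. If the file from /output-update should instead replace the one from /output, as the question describes, a small shell script around hadoop fs may be enough. The following is only a sketch under some assumptions: the function names target_name and merge_dir are made up for illustration, part files are assumed to be named <name>-r-<digits>.<ext>, and the -C flag of -ls (paths only) is assumed to be available in the installed Hadoop version.

```shell
#!/usr/bin/env bash
# Sketch only: target_name and merge_dir are hypothetical helper names.

# Map "20140102-r-00000.txt" to "20140102.txt" (strip the "-r-0000x" part).
target_name() {
  local base="${1##*/}"                 # drop any directory prefix
  printf '%s\n' "$base" | sed -E 's/-r-[0-9]+(\.[^.]+)$/\1/'
}

# Move every file from HDFS directory $1 into $2 under its stripped name,
# replacing a file of the same name that is already there.
merge_dir() {
  local src="$1" dst="$2" path dest
  hadoop fs -mkdir -p "$dst"
  # Assumption: "-ls -C" prints one bare path per line.
  hadoop fs -ls -C "$src" | while read -r path; do
    dest="$dst/$(target_name "$path")"
    # -mv does not overwrite an existing file, so remove it first.
    hadoop fs -rm -f "$dest" < /dev/null
    hadoop fs -mv "$path" "$dest" < /dev/null
  done
}
```

Running merge_dir /output /output-complete first and merge_dir /output-update /output-complete second would let the updated 20140102 file overwrite the original one, and since the files are moved rather than copied, both source directories end up empty.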
