hadoop - Downloading a list of files in parallel in Apache Pig


I have a simple text file containing a list of folders on FTP servers, one folder per line. Each folder contains a couple of thousand images. I want to connect to each folder, store the files inside the folder in a SequenceFile, and then remove the folder from the FTP server. I have written a simple Pig UDF for this. Here it is:

dirs = LOAD '/var/location.txt' USING PigStorage();
results = FOREACH dirs GENERATE download_whole_folder_into_single_sequence_file($0);
/* I don't need the results bag. It is just a dummy bag */

The problem is that I'm not sure whether each line of the input is processed in a separate mapper. The input file is not huge, just a couple of hundred lines. In pure map/reduce I would use NLineInputFormat and process each line in a separate mapper. How can I achieve the same thing in Pig?

Pig lets you write your own load functions, which let you specify the InputFormat you'll be using. So you could write your own.
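As a minimal sketch, assuming Pig's LoadFunc API and Hadoop's NLineInputFormat (the class name NLineLoader is hypothetical), a loader that puts each line of the input file into its own split, and therefore its own mapper, could look like this:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical loader: each input line becomes its own split, so each
// folder path is handed to a separate mapper.
public class NLineLoader extends LoadFunc {

    private RecordReader<LongWritable, Text> reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public InputFormat getInputFormat() throws IOException {
        return new NLineInputFormat();
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
        // One line per split -> one mapper per folder.
        NLineInputFormat.setNumLinesPerSplit(job, 1);
    }

    @Override
    @SuppressWarnings("unchecked")
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null; // no more lines in this split
            }
            Text line = reader.getCurrentValue();
            return tupleFactory.newTuple(line.toString());
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}

Your script would then become dirs = LOAD '/var/location.txt' USING NLineLoader(); and each call to your UDF should run in its own map task.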

That said, the job you describe sounds like it would only involve a single map-reduce step. Since using Pig wouldn't reduce complexity in this case, and you'd have to write custom code just to use Pig, I'd suggest doing it in vanilla map-reduce. If the total file size is gigabytes or less, I'd do it directly on a single host. It's simpler not to use map-reduce if you don't have to.
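For illustration, a rough map-only job along those lines might look like the following (FolderDownloadJob and DownloadMapper are hypothetical names, and the FTP download / SequenceFile / delete logic is left as a comment):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FolderDownloadJob {

    // Hypothetical mapper: gets one folder path per map() call.
    public static class DownloadMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text folder, Context context)
                throws IOException, InterruptedException {
            // downloadFolderIntoSequenceFileAndDelete(folder.toString());
            context.write(folder, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "download-ftp-folders");
        job.setJarByClass(FolderDownloadJob.class);

        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path("/var/location.txt"));
        NLineInputFormat.setNumLinesPerSplit(job, 1); // one mapper per line

        job.setMapperClass(DownloadMapper.class);
        job.setNumReduceTasks(0); // map-only job

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}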

I typically use map-reduce to first load the data into HDFS, and then Pig for the data processing. Pig doesn't add any benefits over vanilla Hadoop for loading data, IMO; it's just a wrapper around InputFormat/RecordReader with additional methods you need to implement. Plus, it's technically possible for a Pig loader to be called multiple times. That's a gotcha you don't need to worry about when using Hadoop map-reduce directly.

