Monday 27 January 2014

Running Hadoop examples on Cloudera Quickstart VM CDH4

Cloudera provides a Quickstart VM, a downloadable image for VMware, KVM or VirtualBox that contains everything needed to run a single-node Hadoop environment. It also includes additional components such as Hive and ZooKeeper.

1. Open a terminal (right-click on the desktop or click the Terminal icon in the top toolbar).

2. Navigate to the Hadoop MapReduce library directory, which contains the examples jar:
cd /usr/lib/hadoop-mapreduce/


3. Execute the Hadoop jar command to run the WordCount example:
hadoop jar hadoop-mapreduce-examples.jar wordcount

4. Run without arguments, the wordcount example prints a usage message because it requires input and output paths:
Usage: wordcount <in> <out>

5. Create one or more text files containing a few words for testing, or use a log file:
echo "count these words for me hadoop" > /home/cloudera/file1
echo "hadoop counts words for me" > /home/cloudera/file2

6. Create an input directory in HDFS:
hdfs dfs -mkdir /user/cloudera/input

7. Copy the files from the local filesystem to HDFS:
hdfs dfs -put /home/cloudera/file1 /user/cloudera/input
hdfs dfs -put /home/cloudera/file2 /user/cloudera/input

8. Run the Hadoop WordCount example with the input and output directories specified. The output directory must not already exist, or the job will fail:
hadoop jar hadoop-mapreduce-examples.jar wordcount /user/cloudera/input /user/cloudera/output

9. Hadoop prints a lot of logging information. After the job completes, list the output directory:
hdfs dfs -ls /user/cloudera/output

10. Check the output file to see the results:
hdfs dfs -cat /user/cloudera/output/part-r-00000
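
To sanity-check the result, the same counts can be reproduced locally with standard Unix tools, without Hadoop. This is only a rough sketch: it recreates the two sample files from step 5 in a temporary directory (a location chosen here for illustration, not part of the steps above) and assumes words are separated by whitespace. Its output should match what step 10 prints for these two files, one word and its count per line, separated by a tab.

```shell
# Recreate the two sample files in a temporary directory (hypothetical location).
tmpdir=$(mktemp -d)
echo "count these words for me hadoop" > "$tmpdir/file1"
echo "hadoop counts words for me" > "$tmpdir/file2"

# Put one word per line, sort, count duplicates, then print "word<TAB>count".
# LC_ALL=C forces byte-order sorting so the order does not depend on the locale.
expected=$(cat "$tmpdir/file1" "$tmpdir/file2" \
  | tr -s ' \t' '\n' \
  | LC_ALL=C sort \
  | uniq -c \
  | awk '{print $2 "\t" $1}')

echo "$expected"

# Clean up the temporary files.
rm -r "$tmpdir"
```

For the two sample sentences this yields seven distinct words; "for", "hadoop", "me" and "words" each appear twice, and the rest appear once.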