Set Up a MapReduce Program in Cloudera


1. Download the Cloudera VMware image from the URL below.

    The downloaded Cloudera VMware image includes:

  • CentOS
  • Hadoop setup (with all libraries)
  • Java 
  • Eclipse IDE

2. Download VMware Player to open the downloaded Cloudera virtual machine (use the URL below to download VMware Player):
                               https://www.vmware.com/products/player

3. Open Eclipse, which is located on the desktop of the CentOS VM (the CentOS image is loaded by VMware Player from the Cloudera virtual machine).

4. Create a new Java project in Eclipse (e.g.: MapReduce).

5. Add all the Hadoop dependency libraries to the project from the location below. (You can add them through the Java Build Path: right-click the project, open Java Build Path, and add the external JARs in the Libraries tab.)
               Library Path:    /usr/lib/hadoop/client

6. Create a package for each problem (a MapReduce co-occurrence problem can be solved by three methods: the pairs, stripes, and hybrid approaches).

    e.g.: package names com.mapreduce.pair, com.mapreduce.stripe, com.mapreduce.hybrid
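The difference between the pairs and stripes approaches can be illustrated in plain Java, without any Hadoop boilerplate. This is a sketch only, assuming the problem is word co-occurrence counting (the usual setting for these approaches); the class and method names below are made up for illustration and are not part of the Hadoop API. A stripes mapper emits one associative array ("stripe") per word instead of one key-value pair per co-occurrence, which greatly reduces the volume of intermediate data:

```java
import java.util.*;

// Sketch of the stripes approach (plain Java, Hadoop boilerplate omitted).
// For each word w in a line, the mapper emits one "stripe" mapping every
// co-occurring word u to its count; the reducer merges stripes element-wise.
class StripesSketch {

    // What a stripes mapper would emit for one input line: word -> stripe.
    static Map<String, Map<String, Integer>> map(String line) {
        Map<String, Map<String, Integer>> stripes = new TreeMap<>();
        String[] words = line.trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            Map<String, Integer> stripe =
                stripes.computeIfAbsent(words[i], k -> new TreeMap<>());
            for (int j = 0; j < words.length; j++) {
                if (i != j) {
                    // count one co-occurrence of words[j] next to words[i]
                    stripe.merge(words[j], 1, Integer::sum);
                }
            }
        }
        return stripes;
    }

    // What a stripes reducer would do: merge all stripes for one word.
    static Map<String, Integer> reduce(List<Map<String, Integer>> stripes) {
        Map<String, Integer> merged = new TreeMap<>();
        for (Map<String, Integer> s : stripes) {
            s.forEach((u, n) -> merged.merge(u, n, Integer::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        // prints {a={a=2, b=2}, b={a=2}}
        System.out.println(map("a b a"));
    }
}
```

The pairs approach would instead emit a separate ((w, u), 1) record for every co-occurrence; the hybrid approach combines the two.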

7. Create the MainClass, Mapper, Partitioner, and Reducer classes for each problem with an appropriate naming convention.
                e.g.: PairMain, PairMapper, PairPartitioner, PairReducer

8. Implement all the classes with the necessary extends and implements clauses (e.g. the mapper extends Hadoop's Mapper class and the reducer extends Reducer).
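As a rough sketch of the logic a PairMapper and PairReducer would implement, here is the pairs approach in plain Java, with the Hadoop boilerplate omitted. This assumes the problem is word co-occurrence counting; the class name PairsSketch and its methods are hypothetical, chosen only to mirror the map/reduce roles:

```java
import java.util.*;

// Sketch of the pairs approach (plain Java, Hadoop boilerplate omitted).
// The mapper emits ((w, u), 1) for every ordered pair of words that
// co-occur on a line; the reducer sums the counts for each pair key.
class PairsSketch {

    // What a pairs mapper would emit for one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        String[] words = line.trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            for (int j = 0; j < words.length; j++) {
                if (i != j) {
                    // key is the pair "(w,u)", value is a count of 1
                    out.add(Map.entry("(" + words[i] + "," + words[j] + ")", 1));
                }
            }
        }
        return out;
    }

    // What a pairs reducer would do: sum all values for the same pair key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> emitted) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> e : emitted) {
            totals.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // prints {(a,a)=2, (a,b)=2, (b,a)=2}
        System.out.println(reduce(map("a b a")));
    }
}
```

In the real Hadoop classes, the pair key would typically be a custom WritableComparable rather than a String, and the PairPartitioner would route all pairs with the same left word to the same reducer.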

9. Create a local input file with the given input (e.g.: input.txt).

10. Put the input file into HDFS from a terminal by using the Linux command below:
              >> hadoop fs -put <inputfile path> <HDFS file path>

              e.g.: >> hadoop fs -put input.txt hdfsinput.txt

Run the MapReduce program using Linux commands in the terminal

1. In Eclipse, right-click the project and export it as a JAR file with a specific name (e.g.: project.jar).

2. Check the HDFS input file created in step 10 above by using the Linux command:
             >> hadoop fs -ls

3. Execute the JAR file on the created HDFS input by using the Linux commands below:

         General syntax:
             >> hadoop jar <jarfile> <main class with package> <HDFS input file> <output directory>

             PairMain class execution:
             >> hadoop jar project.jar com.mapreduce.pair.PairMain hdfsinput.txt pairoutput

             StripeMain class execution:
             >> hadoop jar project.jar com.mapreduce.stripe.StripeMain hdfsinput.txt stripeoutput

             HybridMain class execution:
             >> hadoop jar project.jar com.mapreduce.hybrid.HybridMain hdfsinput.txt hybridoutput

4. We can view the corresponding output in the terminal by using the Linux command below (the output name is a directory; the results are in its part files):
           >> hadoop fs -cat pairoutput/part*

5. We can also see the job status, all the HDFS files, and the outputs in a browser GUI at the URLs below:
           http://localhost:8888 (Hue)
           http://localhost:50070 (NameNode web UI)

6. The running status of a deployed application's job can be seen in Hue.

7. HDFS files and output in Hue: once you see the output directory (e.g.: pairoutput), click it; inside there is a file named like part-r-00000, and clicking that file shows the output in the same browser.




8. The same HDFS files and output can be seen at http://localhost:50070 as well.
