
Posts

Showing posts from 2016

Apache Spark

Spark on Windows
==========================================
Step 1: Download and install the JDK from http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
Step 2: Download Spark from http://spark.apache.org/downloads.html and choose the following options for the download:
1. Choose a Spark release - 1.6.1
2. Pre-built for Hadoop 2.6 or later
3. Direct Download
4. Click on spark-1.6.1-bin-hadoop2.6.tgz
Step 3: Extract the tar file
Step 4: Copy the contents of the tar file into the C:\spark\ folder
Step 5: Update log4j.properties so only warnings are logged: open C:\spark\conf\log4j.properties.template, set the property log4j.rootCategory=WARN, and save the file as log4j.properties
Step 6: Download winutils.exe from here
1. Create a folder C:\winutils\bin
2. Copy the winutils.exe file into C:\winutils\bin\winutils.exe
Step 7: Set the environment variables (tell Windows where Spark is):
SPARK_HOME = C:\spark
JAVA_HOME = …
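As a rough sketch of Step 7, the variables can be set from a Windows Command Prompt as shown below and the install verified with spark-shell; the JDK path is only an example, so point it at wherever Step 1 actually installed Java.
:: run in a Windows Command Prompt; paths follow the folder layout used above
setx SPARK_HOME C:\spark
setx HADOOP_HOME C:\winutils
:: example JDK location only - use the path from your own JDK install
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_92"
:: open a new prompt so the variables are picked up, then start the shell
C:\spark\bin\spark-shell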

Platform LSF Installation on Ubuntu

LSF INSTALLATION STEPS
====================================================================================
STEP 1: Download the specific LSF package from the IBM resources website.
STEP 2: sudo tar -xvf lsfce9.1.3-ppc64le.tar
STEP 3: cd lsfce9.1.3-ppc64le
STEP 4: sudo vi install.config and edit the following:
LSF_TOP="/usr/share/lsf"
LSF_ADMINS="lsf_admin"
LSF_CLUSTER_NAME="my_first_cluster"
LSF_TARDIR="/tmp"
STEP 5: Inside the lsfce9.1.3-ppc64le folder, run the following command:
./lsfinstall -f install.config
Press "1" to accept the license.
Once it shows the success message, navigate to /usr/share/lsf/conf and run the following command:
source ./profile.lsf
Start the LSF daemons:
lsadmin limstartup
lsadmin resstartup
badmin hstartup
Once all the daemons are started, test the cluster by issuing the following commands:
lsid --> displays the cluster name and other information
lshosts --> displays the number o…
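A quick smoke test, assuming the cluster built above is up and you are the LSF administrator (the sleep job is just a throwaway example):
# source the profile first so the LSF commands are on the PATH
source /usr/share/lsf/conf/profile.lsf
lsid              # cluster name and current master host
bhosts            # batch status of the hosts in the cluster
bsub sleep 60     # submit a trivial test job
bjobs             # the test job should show up as PEND or RUN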

Apache Hive 1.2.1 installation on Hadoop 2.7.1 in Ubuntu 14.04

Steps to follow for installation
Step 1: Download the Hive 1.2.1 tarball from the link
Step 2: Extract apache-hive-1.2.1-bin.tar.gz using the command
root@ubuntu:/usr/local# tar -xvzf apache-hive-1.2.1-bin.tar.gz
Step 3: Move the extracted folder to /usr/local/
Step 4: Navigate inside /usr/local/apache-hive-1.2.1-bin
Step 5: Export HIVE_HOME using the following command
root@ubuntu:/usr/local/apache-hive-1.2.1-bin# export HIVE_HOME="/usr/local/apache-hive-1.2.1-bin"
Step 6: Set the classpath for Hive 1.2.1
root@ubuntu:/usr/local/apache-hive-1.2.1-bin# PATH=$PATH:$HIVE_HOME/bin
root@ubuntu:/usr/local/apache-hive-1.2.1-bin# export PATH
Step 7: Make changes in hive-config.sh using the following command
root@ubuntu:/usr/local/apache-hive-1.2.1-bin# vi bin/hive-config.sh
Add the following at the end:
export HADOOP_HOME=/usr/local/hadoop
Step 8: Start Hive using
root@ubuntu:/usr/local/apache-hive-1.2.1-bin# bin/hive
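Once the hive prompt comes up in Step 8, a small smoke test can confirm that the metastore and the HDFS warehouse are reachable; the table name below is just an illustration.
hive> CREATE TABLE test_tbl (id INT, name STRING);   -- create a throwaway table
hive> SHOW TABLES;                                   -- test_tbl should be listed
hive> DESCRIBE test_tbl;                             -- shows the two columns
hive> DROP TABLE test_tbl;                           -- clean up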

How to Pull Twitter Data Using Apache Flume into HDFS

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications. Flume lets Hadoop users make the most of valuable log data. Specifically, Flume allows users to:
Stream data from multiple sources into Hadoop for analysis
Collect high-volume Web logs in real time
Insulate themselves from transient spikes when the rate of incoming data exceeds the rate at which data can be written to the destination
Guarantee data delivery
Scale horizontally to handle additional data volume
Flume's high-level architecture is focused on delivering a streamlined codebase that is easy to use and easy to extend. The project team has designed Flume…
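For the Twitter use case in this post, a minimal agent configuration sketch looks roughly like the following; the OAuth values and the HDFS path are placeholders, and the exact source class depends on the Flume build you use.
# agent named TwitterAgent with one Twitter source, one memory channel, one HDFS sink
TwitterAgent.sources  = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks    = HDFS
# source: the Twitter source shipped with Apache Flume (OAuth values are placeholders)
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = <your-consumer-key>
TwitterAgent.sources.Twitter.consumerSecret = <your-consumer-secret>
TwitterAgent.sources.Twitter.accessToken = <your-access-token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your-access-token-secret>
TwitterAgent.sources.Twitter.channels = MemChannel
# channel: in-memory buffer between the source and the sink
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
# sink: write the incoming events into HDFS (path is an example)
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.channel = MemChannel
The agent would then be started with something like: flume-ng agent --conf conf --conf-file twitter.conf --name TwitterAgent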

WordCount using Eclipse

Starting Eclipse and running a MapReduce program
===============================
Step 1: Open the home directory
Step 2: Go inside the eclipse directory
Step 3: Double-click on the eclipse.exe file to open the Eclipse IDE
Step 4: Provide the workspace path; the Eclipse IDE then starts and looks like this
Step 5: Just close the workbench; after closing the workbench you will see something like the figure below
Step 6: Now let's create a project
Right-click in the Project Explorer
Select → New → Project → Map/Reduce Project
Provide the project name → Wordcount in the Project name field
Next, configure the Hadoop installation directory:
Click on → Configure Hadoop installation directory
Click → Browse → go to the Hadoop installation directory, i.e., /usr/local/hadoop → OK → OK → Next → Finish
Once done you will see something like the figure below
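After the project builds in Eclipse, one way to try it out is to export it as a JAR and run it against a small input file; the jar name, main class, and HDFS paths below are only illustrative.
# export the project from Eclipse as wordcount.jar, then on the Hadoop node:
hadoop fs -mkdir -p /user/hduser/input
hadoop fs -put sample.txt /user/hduser/input
hadoop jar wordcount.jar WordCount /user/hduser/input /user/hduser/output
hadoop fs -cat /user/hduser/output/part-r-00000   # the word counts appear here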

Apache Sqoop

INTRODUCTION
Sqoop is a tool designed to transfer data between Hadoop and relational databases or mainframes. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle, or from a mainframe, into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS. Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
Steps for Installation
Step 1 - Download sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz from the mirror website sqoop.apache.org
wget http://mirror.cogentco.com/pub/apache/sqoop/1.4.4/sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz
Step 2 - Untar the downloaded file
tar -xvzf sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz
Step 3 - Copy the extracted folder to the /usr/local/sqoop location
sudo cp -…
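Once Sqoop is installed and on the PATH, a typical import looks like the sketch below; the database name, table, target directory, and credentials are placeholders.
# import a MySQL table into HDFS with a single mapper
sqoop import \
  --connect jdbc:mysql://localhost/testdb \
  --username dbuser --password dbpass \
  --table employees \
  --target-dir /user/hduser/employees \
  -m 1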