Apache Pig is
a platform for analyzing large data sets that consists of a high-level language
for expressing data analysis programs, coupled with infrastructure for
evaluating these programs. The salient property of Pig programs is that their
structure is amenable to substantial parallelization, which in turn enables
them to handle very large data sets.
At the present time, Pig's
infrastructure layer consists of a compiler that produces sequences of
Map-Reduce programs, for which large-scale parallel implementations already
exist (e.g., the Hadoop sub-project). Pig's language layer currently consists of
a textual language called Pig Latin, which has the following key properties:
- Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
- Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
- Extensibility. Users can create their own functions to do special-purpose processing.
How Pig Works
Pig runs on Apache Hadoop YARN and makes use of
MapReduce and the Hadoop Distributed File System (HDFS). The language for the
platform is called Pig Latin, which abstracts the Java MapReduce idiom into a
notation similar to SQL. While SQL is designed to query data, Pig Latin lets
you write a data flow that describes how your data will be transformed
(through operations such as aggregate, join, and sort).
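Such a data flow might look like the following Pig Latin sketch; the file names and schemas here are illustrative, not part of any standard data set:

```pig
-- Load two illustrative data sets (paths and schemas are assumptions)
users  = LOAD 'users.txt'  USING PigStorage(',') AS (id:int, name:chararray);
visits = LOAD 'visits.txt' USING PigStorage(',') AS (user_id:int, url:chararray);

-- Join, aggregate, and sort: each step names one transformation explicitly
joined  = JOIN users BY id, visits BY user_id;
grouped = GROUP joined BY users::name;
counts  = FOREACH grouped GENERATE group AS name, COUNT(joined) AS n_visits;
ranked  = ORDER counts BY n_visits DESC;

STORE ranked INTO 'visits_per_user';
```

Each statement defines a relation in terms of earlier relations, so the script reads as exactly the kind of step-by-step data flow described above.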
Since Pig Latin scripts can be graphs (rather than requiring a single output),
it is possible to build complex data flows involving multiple inputs,
transforms, and outputs. Users can extend Pig Latin by writing their own User
Defined Functions (UDFs) in Java, Python, Ruby, or other scripting languages,
and then call them directly from Pig Latin.
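As a minimal sketch, a Python UDF can be registered and then invoked like a built-in function; the file name, namespace, and function name below are hypothetical:

```pig
-- Register a Python UDF file under a namespace (names are assumptions)
REGISTER 'my_udfs.py' USING jython AS myfuncs;

lines   = LOAD 'pages.txt' AS (url:chararray);
-- Call the user-defined function like any built-in
domains = FOREACH lines GENERATE myfuncs.extract_domain(url);
```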
The user can run Pig in two modes, using either the
“pig” command or the “java” command:
- MapReduce Mode. This is the default mode, and it requires access to a Hadoop cluster. It can also be selected explicitly with "pig -x mapreduce".
- Local Mode. Pig runs on a single machine, using the local host and local file system instead of a cluster; select it with "pig -x local". This is useful for prototyping and for small data sets.
Installation Steps:
Step 1: Download the latest version of Pig from the Apache Pig website (pig.apache.org)
Step 2: Extract the downloaded tarball using the following command
$ tar -xvf pig-0.10.tar.gz
Step 3: Move the extracted files into the "/opt" folder using the command
$ sudo mv pig-0.10 /opt/pig
Step 4: Change the ownership using the command
$ sudo chown hduser:hadoop -R /opt/pig
Step 5: Change the permissions using the command
$ sudo chmod 755 -R /opt/pig
Step 6: Add the following lines to the "~/.bashrc" file, then reload it so the changes take effect
$ vi ~/.bashrc
export PIG_HOME=/opt/pig
export PIG_CLASSPATH=/opt/hadoop/conf
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
export PATH=$PATH:$PIG_HOME/bin
$ source ~/.bashrc
Step 7: Start pig using the command
$ /opt/pig/bin/pig
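Once the Grunt shell is up, a quick smoke test might look like the following; the input file and schema are assumptions:

```pig
grunt> lines = LOAD '/tmp/sample.txt' AS (line:chararray);
grunt> DUMP lines;
```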
Step 8: To exit the Grunt shell and stop Pig, use the command given below
grunt> quit;