Apache Pig is
a platform for analyzing large data sets that consists of a high-level language
for expressing data analysis programs, coupled with infrastructure for
evaluating these programs. The salient property of Pig programs is that their
structure is amenable to substantial parallelization, which in turn enables
them to handle very large data sets.
At the present time, Pig's
infrastructure layer consists of a compiler that produces sequences of
Map-Reduce programs, for which large-scale parallel implementations already
exist (e.g., the Hadoop sub-project). Pig's language layer currently consists of
a textual language called Pig Latin, which has the following key properties:
- Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
- Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
- Extensibility. Users can create their own functions to do special-purpose processing.
How Pig Works
Pig runs on Apache Hadoop YARN and makes use of
MapReduce and the Hadoop Distributed File System (HDFS). The language for the
platform is called Pig Latin, which abstracts the Java MapReduce idiom into a
notation similar to SQL. While SQL is designed to query data, Pig Latin lets
you write a data flow that describes how your data will be transformed
(through operations such as aggregate, join, and sort).
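Such a data flow might look like the following Pig Latin sketch; the file names and schemas here are illustrative, not part of any standard data set:

```pig
-- Load two illustrative data sets (paths and schemas are assumptions)
users  = LOAD 'users.txt'  USING PigStorage(',') AS (id:int, name:chararray);
visits = LOAD 'visits.txt' USING PigStorage(',') AS (user_id:int, url:chararray);

-- Join, aggregate, and sort: each step names one transformation explicitly
joined  = JOIN users BY id, visits BY user_id;
grouped = GROUP joined BY users::name;
counts  = FOREACH grouped GENERATE group AS name, COUNT(joined) AS n_visits;
ranked  = ORDER counts BY n_visits DESC;

STORE ranked INTO 'visits_per_user';
```

Each statement defines a relation in terms of earlier relations, so the script reads as exactly the kind of step-by-step data flow described above.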
Since Pig Latin scripts can be graphs (rather than requiring a single output),
it is possible to build complex data flows involving multiple inputs,
transforms, and outputs. Users can extend Pig Latin by writing their own User
Defined Functions (UDFs) in Java, Python, Ruby, or other scripting languages,
and then call them directly from Pig Latin.
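As a minimal sketch, a Python UDF can be registered and then invoked like a built-in function; the file name, namespace, and function name below are hypothetical:

```pig
-- Register a Python UDF file under a namespace (names are assumptions)
REGISTER 'my_udfs.py' USING jython AS myfuncs;

lines   = LOAD 'pages.txt' AS (url:chararray);
-- Call the user-defined function like any built-in
domains = FOREACH lines GENERATE myfuncs.extract_domain(url);
```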
The user can run Pig in two modes, using either the
“pig” command or the “java” command:
- MapReduce Mode. This is the default mode, and it requires access to a Hadoop cluster. It can also be selected explicitly with "pig -x mapreduce".
- Local Mode. Pig runs on a single machine, using the local host and local file system instead of a cluster; select it with "pig -x local". This is useful for prototyping and for small data sets.
Installation Steps:
Step 1: Download the latest version of Pig from the Apache Pig website (pig.apache.org)
Step 2: Extract the downloaded tarball using the following command
$ tar -xvf pig-0.10.tar.gz
Step 3: Move the extracted files into the "/opt" folder using the command
$ sudo mv pig-0.10 /opt/pig
Step 4: Change the ownership using the command
$ sudo chown hduser:hadoop -R /opt/pig
Step 5: Change the permissions using the command
$ sudo chmod 755 -R /opt/pig
Step 6: Add the following lines to the "~/.bashrc" file, then reload it so the changes take effect
$ vi ~/.bashrc
export PIG_HOME=/opt/pig
export PIG_CLASSPATH=/opt/hadoop/conf
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
export PATH=$PATH:$PIG_HOME/bin
$ source ~/.bashrc
Step 7: Start pig using the command
$ /opt/pig/bin/pig
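Once the Grunt shell is up, a quick smoke test might look like the following; the input file and schema are assumptions:

```pig
grunt> lines = LOAD '/tmp/sample.txt' AS (line:chararray);
grunt> DUMP lines;
```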
Step 8: To exit the Grunt shell and stop Pig, use the command given below
grunt> quit;