I would consider PIG a step toward a 4th-generation language. PIG emerged as an ideal language for programmers: it is a data flow language in the Hadoop ecosystem. It has become a gap filler in the big data analytics world between two audiences, the ETL developer and the Java/Python programmer. PIG has some very powerful features that give it an edge:
- Bringing a schema-less approach to unstructured data.
- Bringing programming support to data flow.
PIG brings ETL capabilities to big data without requiring a schema to be defined in advance. That is an indispensable quality. Together, these features give the power of analytics on the cloud, backed by HDFS storage and the MapReduce (MR) programming model. Here, we'll see in simple steps how we can use PIG as a data flow tool for analysis. Obviously, you should have PIG installed on your cluster.
PIG uses the Hadoop configuration for data flow processing. Below are the Hadoop configuration steps; I would prefer to set them in /etc/bash.bashrc:
- Point PIG to JAVA_HOME.
- Set PIG_HOME to the PIG installation directory.
- Set PIG_DIR to the bin directory.
- Set PIG_CONF_DIR to the Hadoop configuration directory.
- Finally, set PIG_CLASSPATH and add it to the CLASSPATH.
Here is the exact code for above 5 steps:
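A minimal sketch of those five exports, assuming typical install paths (JAVA_HOME, HADOOP_HOME, and PIG_HOME are assumptions here; adjust them to your cluster):

```shell
# Assumed install locations -- adjust to match your machines.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # step 1: point PIG to JAVA_HOME
export HADOOP_HOME=/usr/local/hadoop                 # assumed Hadoop install root
export PIG_HOME=/usr/local/pig                       # step 2: PIG installation directory
export PIG_DIR=$PIG_HOME/bin                         # step 3: bin directory
export PIG_CONF_DIR=$HADOOP_HOME/etc/hadoop          # step 4: Hadoop configuration directory
export PIG_CLASSPATH=$PIG_CONF_DIR                   # step 5: PIG classpath ...
export CLASSPATH=$CLASSPATH:$PIG_CLASSPATH           # ... added to CLASSPATH
export PATH=$PATH:$PIG_DIR                           # so the pig command is on PATH
```

After sourcing the file (`source /etc/bash.bashrc`), `pig -x local` should start the Grunt shell if the install paths are correct.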
Now, if you have written a UDF, first register the JAR that contains it and then define the function:
register '/path/to/udf.jar';
define <function> org.apache.pig.piggybank.storage.<functionname>();
Now the UDF is available to use throughout your script.
A = load '/path/to/inputfile' using org.apache.pig.piggybank.storage.<functionname>() as (variable:<datatype>);
Life becomes easy once a UDF is available to use. You just need a basic understanding of SQL functionality to perform data flow operations in the PIG scripting language.
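As a sketch of how SQL-style thinking maps onto a PIG data flow (the input path, field names, and numbers here are illustrative assumptions, and the built-in PigStorage loader stands in for a custom UDF):

```pig
-- Load a comma-separated sales file with an explicit schema.
sales = LOAD '/path/to/sales.csv' USING PigStorage(',')
        AS (region:chararray, amount:double);

-- WHERE in SQL ~ FILTER in PIG.
big_sales = FILTER sales BY amount > 100.0;

-- GROUP BY + SUM in SQL ~ GROUP followed by FOREACH ... GENERATE in PIG.
by_region = GROUP big_sales BY region;
totals    = FOREACH by_region GENERATE group AS region,
                                       SUM(big_sales.amount) AS total;

DUMP totals;
```

Each statement just names a new relation in the flow, which is what makes PIG feel like ETL rather than MR programming.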
The next write-up in this PIG series will be on ANALYTICS: how PIG creates touch points for data flow as well as analytics.