Published August 3rd, 2016 by Chirag Thumar

How to Process Daily Data to Detect the Latest Data with MapReduce

In this post, a big data application development expert uses Apache Pig to express a MapReduce job. You will learn how to process daily data and detect the latest records in a big data application with the help of MapReduce. Follow the steps below to make it happen.

Introduction: merging daily data with previously processed data

In a big data application, we collect/ingest daily data through many approaches, such as streaming, copying, putting, or shipping log data from other data sources into our system. The result is a huge daily data set. But how do we reorganize that daily data and merge it with the old data that was already processed the previous day? In this blog, I will show how to handle that issue with MapReduce in Hadoop so that we keep only the latest data from the daily set and merge it with the old data.

In this blog I use Apache Pig to express the MapReduce job. The script loads the old data and the new data, sorts each group of records by collect date, keeps only the records collected today, and filters out records that were already processed on earlier days. For example, if product id 2 appears in both yesterday's and today's files, only today's record (the one with the newer collectdate) survives.

Environment
Java: JDK 1.7
Cloudera version: CDH4.6.0
Initial steps

1. We need to prepare some input data files. Open a new file in a Linux terminal:

vi file1

Enter some input data in the format id;product;price;collectdate:
1;XY milk;2000;20160730000000
2;AB candy;5000;20160730000000
3;B chair;6000;20160730000000
vi file2

Enter some input data in the format id;product;price;collectdate:
2;AB candy;3000;20160731000000
3;B chair;1000;20160731000000

2. We need to put the local files into the Hadoop Distributed File System (HDFS). Use these commands:

hadoop fs -mkdir -p /data/mysample/mergedData_20160730000000
hadoop fs -mkdir -p /data/mysample/mergedData_20160731000000
hadoop fs -put file1 /data/mysample/mergedData_20160730000000/
hadoop fs -put file2 /data/mysample/mergedData_20160731000000/
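
To confirm the files landed where the Pig script expects them, we can do a quick optional check (the paths follow the directories created above):

hadoop fs -ls /data/mysample/
hadoop fs -cat /data/mysample/mergedData_20160730000000/file1
hadoop fs -cat /data/mysample/mergedData_20160731000000/file2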

Code walk through

This Pig script merges the old and new data and keeps only the latest record for each id from the combined daily data set.

SET job.name 'merge old and new data with map reduce by Pig script';

-- Load the old data, which was already processed yesterday
previousDayData = LOAD '/data/mysample/mergedData_20160730000000/' USING PigStorage(';') AS (id:chararray,
    product:chararray,
    price:chararray,
    collectdate:chararray);

-- Load the new data, which was collected today
todayData = LOAD '/data/mysample/mergedData_20160731000000/' USING PigStorage(';') AS (id:chararray,
    product:chararray,
    price:chararray,
    collectdate:chararray);

-- Combine the two data sets
unionData = UNION previousDayData, todayData;

-- Group the data by id, which acts as the key of the data set
groupData = GROUP unionData BY id;

-- Within each group, sort by collect date descending so the latest record comes first,
-- then keep only that one record. This de-duplicates the data and generates the output.
outputData = FOREACH groupData {
    sortedGroup = ORDER unionData BY collectdate DESC;
    removeDuplication = LIMIT sortedGroup 1;
    GENERATE FLATTEN(removeDuplication);
}

-- Store outputData to HDFS
STORE outputData INTO '/data/mysample/mergedData_20160731000000_processed/' USING PigStorage(';');
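
To run the job, save the script to a local file and submit it with the pig client, which compiles it into MapReduce jobs on the cluster. The file name mergeDailyData.pig below is just an example:

pig -f mergeDailyData.pig

You can also add -x mapreduce to be explicit that it should run on the cluster rather than in local mode.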

Verify the result

We can check the output in the HDFS location:
hadoop fs -text /data/mysample/mergedData_20160731000000_processed/* | head -n 10
The latest data in HDFS for 31/Jul will be:
1;XY milk;2000;20160730000000
2;AB candy;3000;20160731000000
3;B chair;1000;20160731000000
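
Since this merge has to run every day, hard-coding the dates quickly becomes tedious. One approach, sketched here as a suggestion rather than part of the original script, is Pig parameter substitution: replace the dates in the LOAD and STORE paths with placeholders such as $PREV_DATE and $TODAY_DATE, then pass the actual values on the command line:

pig -param PREV_DATE=20160730000000 -param TODAY_DATE=20160731000000 -f mergeDailyData.pig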

I hope this blog helps you understand the steps to merge daily data in a big data application with MapReduce.

Follow all the steps discussed in this post for best results. The big data application development experts are here to help you: write your queries in the comments and get a response from qualified professionals.

Chirag Thumar, Director at Technoligent
Chirag Thumar has been working with Technoligent as a Java application and software developer for many years. His major contributions include web solutions in Java, Python, ASP.NET, mobile apps, and more. He has extensive experience in Java web development and object-oriented programming. You can follow him on Twitter @Techno_Ligent and like our Facebook page TechnoLigent.