In this post, our big data application development expert uses Apache Pig to express a MapReduce job. You will learn how to detect and process the latest data in a big data application with the help of MapReduce. Follow the steps below and make it happen.
Introduction: merging and processing daily data
In a big data application, we collect/ingest daily data in many ways, such as streaming, copying, putting, or shipping log data from other data sources into our system. This gives us a huge daily data set. But how do we reorganize that daily data and merge it with the old data that was already processed the previous day? In this blog, I will show how to handle that with MapReduce in Hadoop, keeping only the latest records from the daily data and merging them with the old data.
I use Apache Pig to express the MapReduce job in this blog. The script loads the old data and the new data, sorts each group of records by collect date, keeps only the record collected most recently, and filters out records processed a few days ago.
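Before diving into Pig, here is the core idea as a minimal sketch in plain Python (an illustration only, not part of the Hadoop job; the function name and sample rows are my own): union the old and new rows, then for each id keep the row with the newest collect date.

```python
# Sketch of the merge-and-dedup idea: union old and new records,
# then keep only the newest record per id.
from itertools import chain

def latest_per_id(old_rows, new_rows):
    # Each row is (id, product, price, collectdate); the yyyyMMddHHmmss
    # format sorts correctly as a plain string comparison.
    latest = {}
    for row in chain(old_rows, new_rows):
        rid, collectdate = row[0], row[3]
        if rid not in latest or collectdate > latest[rid][3]:
            latest[rid] = row
    return sorted(latest.values())

old = [("1", "XY milk", "2000", "20160730000000"),
       ("2", "AB candy", "5000", "20160730000000")]
new = [("2", "AB candy", "3000", "20160731000000")]
merged = latest_per_id(old, new)  # id 2 now carries the 31/Jul record
```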
Environment
Java: JDK 1.7
Cloudera version: CDH4.6.0
Initial steps
1. We need to prepare some input data files. Open a new file in a Linux terminal:
vi file1
Enter some input data in the format: id;product;price;collectdate
1;XY milk;2000;20160730000000
2;AB candy;5000;20160730000000
3;B chair;6000;20160730000000
vi file2
Enter some input data in the format: id;product;price;collectdate
2;AB candy;3000;20160731000000
3;B chair;1000;20160731000000
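As an aside (not in the original post), the two sample files could also be generated with a short Python script instead of typing them in vi:

```python
# Generate the two sample input files used in this walkthrough.
rows_day1 = ["1;XY milk;2000;20160730000000",
             "2;AB candy;5000;20160730000000",
             "3;B chair;6000;20160730000000"]
rows_day2 = ["2;AB candy;3000;20160731000000",
             "3;B chair;1000;20160731000000"]

for name, rows in (("file1", rows_day1), ("file2", rows_day2)):
    with open(name, "w") as f:
        f.write("\n".join(rows) + "\n")
```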
2. Next, put the local files into the Hadoop Distributed File System (HDFS). Create a directory for each day's data, then upload the files:
hadoop fs -mkdir -p /data/mysample/mergedData_20160730000000
hadoop fs -mkdir -p /data/mysample/mergedData_20160731000000
hadoop fs -put file1 /data/mysample/mergedData_20160730000000/
hadoop fs -put file2 /data/mysample/mergedData_20160731000000/
Code walk through
This Pig script merges the old and new data and keeps only the latest record for each key in the daily data set.
SET job.name 'merge old and new data with map reduce by Pig script';
-- Load the old data, which was already processed yesterday
previousDayData = LOAD '/data/mysample/mergedData_20160730000000/' USING PigStorage(';') AS (id:chararray,
    product:chararray,
    price:chararray,
    collectdate:chararray);
-- Load the data collected today
todayData = LOAD '/data/mysample/mergedData_20160731000000/' USING PigStorage(';') AS (id:chararray,
    product:chararray,
    price:chararray,
    collectdate:chararray);
-- Combine the two data sets
unionData = UNION previousDayData, todayData;
-- Group the data by id, the key of the data set
groupData = GROUP unionData BY id;
-- Within each group, sort by collect date descending so the latest record ranks first,
-- then keep only that top record. This de-duplicates the data so only the most
-- recently collected version of each record survives.
outputData = FOREACH groupData {
    sortedGroup = ORDER unionData BY collectdate DESC;
    removeDuplication = LIMIT sortedGroup 1;
    GENERATE FLATTEN(removeDuplication);
}
-- Store outputData to HDFS
STORE outputData INTO '/data/mysample/mergedData_20160731000000_processed/' USING PigStorage(';');
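To see exactly what the GROUP / ORDER ... DESC / LIMIT 1 pipeline does, here is the same de-duplication traced in plain Python (an illustration of the logic, not part of the Hadoop job):

```python
# Mirror the Pig script's de-duplication: parse the ';'-delimited lines,
# group by id, and keep the record with the largest collectdate per group.
from collections import defaultdict

def dedup_latest(lines):
    groups = defaultdict(list)
    for line in lines:
        fields = line.strip().split(";")  # id, product, price, collectdate
        groups[fields[0]].append(fields)
    out = []
    for rid in sorted(groups):
        # Equivalent of ORDER unionData BY collectdate DESC; LIMIT 1
        out.append(max(groups[rid], key=lambda r: r[3]))
    return [";".join(r) for r in out]

# unionData: file1 rows followed by file2 rows
union_data = ["1;XY milk;2000;20160730000000",
              "2;AB candy;5000;20160730000000",
              "3;B chair;6000;20160730000000",
              "2;AB candy;3000;20160731000000",
              "3;B chair;1000;20160731000000"]
result = dedup_latest(union_data)
```

Running this yields the same three records shown in the verification step.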
Verify the result
We can check the output in the HDFS location:
hadoop fs -text /data/mysample/mergedData_20160731000000_processed/* | head -n 10
The latest data in HDFS for 31/Jul will be:
1;XY milk;2000;20160730000000
2;AB candy;3000;20160731000000
3;B chair;1000;20160731000000
Hope this blog helps you understand the steps to merge daily data in a big data application with MapReduce.
Follow all the steps as discussed in this post for best results. The big data application development experts are here to help you. You can write your queries in the comments and get a response from qualified professionals.
Technoligent