In this post, big data Hadoop architect professionals explain HDFS commands and Linux commands, and how combining the two helps in querying data in a big data application. Read on to learn more.
Introduction to HDFS commands and Linux commands for text processing
HDFS commands are the commands that interact with HDFS and other file systems in Hadoop. For instance, we can use basic commands to move, copy, and delete data folders in HDFS. The main purpose of this blog is to show how to query and scan data in HDFS with HDFS commands.
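As an illustration, the basic HDFS file-management commands look like the sketch below. These require a running Hadoop cluster, and the paths used here are hypothetical examples:

```shell
# Copy a folder within HDFS (source and destination paths are examples)
hadoop fs -cp /data/source/ /data/backup/

# Move (rename) a folder in HDFS
hadoop fs -mv /data/staging/ /data/final/

# Delete a folder in HDFS recursively
hadoop fs -rm -r /data/old/
```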
In Linux, we use basic commands such as grep, sort, and uniq to work with text data. These are really helpful with local files, but how can we use them with data stored in HDFS? This blog will show how to combine the advantages of both sets of commands to query data in HDFS.
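To see these Linux text commands in action locally first, here is a minimal sketch using a small sample file in the same semicolon-delimited format used later in this post (the /tmp path is just an example):

```shell
# Create a small local sample file (the /tmp path is just an example)
printf '1;Music 1;2;4;70;Pop\n2;Music 2;3;5;66;Rock\n2;Music 2;3;5;66;Rock\n' > /tmp/sample.txt

# grep: keep only the Rock records
grep "Rock" /tmp/sample.txt

# sort + uniq: remove the duplicate record
sort /tmp/sample.txt | uniq
```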
Environment
Java: JDK 1.7
Cloudera version: CDH4.6.0
Initial steps
- First, prepare an input data file containing user information about music. Open a new file in a Linux terminal:
vi file1
Enter some input data with the format: id;music;viewTime;duration;price;musicType
1;Music 1;2;4;70;Pop
2;Music 2;3;5;66;Rock
3;Music 3;1;5;87;Classic
4;Music 4;3;5;90;Dance
5;Music 5;2;3;34;Rock
5;Music 5;2;3;34;Rock
- Next, put the local file into the Hadoop Distributed File System (HDFS) with these commands:
hadoop fs -mkdir -p /data/mysample/
hadoop fs -put file1 /data/mysample/
Scripts and result verification
We will implement scripts for the queries below:
List all user information for users who listen to Rock or Pop
hadoop fs -text /data/mysample/* | grep -E "(Rock|Pop)"
List all user information for users who listen to music with a view time/duration ratio less than 0.3
hadoop fs -text /data/mysample/* | awk -F ";" '{if($3 !="" && $4 != "" && ($3/$4 < 0.3)) print $0;}'
Select the music and music type columns from the data set
hadoop fs -text /data/mysample/* | cut -d ";" -f2,6
Count occurrences of each record in the data set (duplicates appear with a count greater than 1)
hadoop fs -text /data/mysample/* | sort | uniq -c
List the first 5 records in natural order in the data set
hadoop fs -text /data/mysample/* | head -n 5
List the last 5 records in natural order in the data set
hadoop fs -text /data/mysample/* | tail -n 5
List all records with a price greater than 50
hadoop fs -text /data/mysample/* | awk -F ";" '{if($5 !="" && ($5 > 50)) print $0;}'
Remove duplicate records from the data set
hadoop fs -text /data/mysample/* | sort | uniq
To automate the queries above, we can put them all into one .sh file and run it from the command line, or add the .sh file to a cron job so that everything runs automatically.
vi auto.sh
hadoop fs -text /data/mysample/* | grep -E "(Rock|Pop)"
hadoop fs -text /data/mysample/* | awk -F ";" '{if($3 != "" && $4 != "" && ($3/$4 < 0.3)) print $0;}'
hadoop fs -text /data/mysample/* | cut -d ";" -f2,6
hadoop fs -text /data/mysample/* | sort | uniq -c
hadoop fs -text /data/mysample/* | head -n 5
hadoop fs -text /data/mysample/* | tail -n 5
hadoop fs -text /data/mysample/* | awk -F ";" '{if($5 != "" && ($5 > 50)) print $0;}'
hadoop fs -text /data/mysample/* | sort | uniq
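For the cron option, a hypothetical setup might look like the following. The path to auto.sh and the schedule are assumptions for illustration:

```shell
# Make the script executable (path is hypothetical)
chmod +x /home/user/auto.sh

# Crontab entry (add via: crontab -e): run every day at 01:00,
# appending output and errors to a log file
0 1 * * * /home/user/auto.sh >> /home/user/auto.log 2>&1
```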
We hope you now understand how to combine Linux shell commands and HDFS commands to query data in HDFS without a Hive mapping. We use this approach when we need to run ad hoc queries.
All the instances and code shared by big data Hadoop architect experts are for reference purposes only. You can ask them your questions (if any) and get answers quickly.
Ethan Millar