Published March 24th, 2017 by Joseph Macwan

Apache Spark: Excellent Data Processing, but Needs a Few Improvements

Apache Spark is a data processing toolkit that lets you work with large data sets without having to worry about the underlying infrastructure. It provides the abstraction needed to process huge volumes of data, either in real time or in batch (offline) mode.

Although there are other data processing frameworks to choose from (Apache Samza or Apache Storm, say), Apache Spark is often preferred for its speed: it improves on the MapReduce model by keeping intermediate data in memory rather than writing it to disk between stages.

This article takes an objective look at the best features of Apache Spark and at the areas that could be improved.

The best implementation scenarios of Apache Spark

  • Powerful real-time data processing

Spark is extremely efficient at processing massive amounts of incoming real-time data. Out of the box, Spark Streaming supports sources such as HDFS, Kafka, Flume, Twitter and ZeroMQ, and custom data sources can be plugged in as well, as the sketch below illustrates.
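
As a minimal sketch, here is a streaming word count over 10-second micro-batches in Scala. The socket source on localhost:9999 is a placeholder; a real deployment would use the Kafka, Flume or Twitter connectors instead:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StreamingWordCount")
        val ssc  = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

        // Placeholder source; Kafka, Flume, etc. have their own connector APIs.
        val lines  = ssc.socketTextStream("localhost", 9999)
        val counts = lines.flatMap(_.split("\\s+"))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }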

  • Intelligent machine learning

MLlib is Spark's built-in machine learning library, which makes it possible to apply machine learning algorithms to large datasets. These algorithms are particularly useful for clustering offline datasets. Combined with Spark Streaming, MLlib can also be used for real-time analysis.
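
For example, a minimal k-means clustering job with MLlib might look like the following; the input path, number of clusters and iteration count are placeholders, not recommendations:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object ClusterPoints {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ClusterPoints"))

        // Placeholder path; each line holds space-separated numeric features.
        val points = sc.textFile("hdfs:///data/points.txt")
          .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
          .cache()

        // Train a k-means model with 3 clusters and up to 20 iterations.
        val model = KMeans.train(points, 3, 20)
        points.take(5).foreach(p => println(s"$p -> cluster ${model.predict(p)}"))

        sc.stop()
      }
    }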

  • Compatibility with IoT (Internet of Things)

IoT devices constantly log data, which is stored in backend systems for further processing. Apache Spark helps build data pipelines that apply algorithms at given intervals (hourly, weekly, monthly, even per minute if required) and trigger actions when a configurable set of events occurs, as sketched below.
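
A sketch of such a pipeline, assuming sensor readings arrive as "deviceId,celsius" lines over a placeholder socket source; the alert threshold is purely illustrative:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object IotAlerts {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("IotAlerts"), Seconds(10))

        // Placeholder source emitting "deviceId,celsius" lines.
        val readings = ssc.socketTextStream("localhost", 9999)
          .map(_.split(','))
          .map(parts => (parts(0), parts(1).toDouble))

        // Per-device average over a 60-second window, recomputed every 10 seconds.
        val averages = readings
          .mapValues(v => (v, 1))
          .reduceByKeyAndWindow(
            (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2),
            Seconds(60), Seconds(10))
          .mapValues { case (sum, n) => sum / n }

        // Trigger an action when a device runs hot (threshold is illustrative).
        averages.foreachRDD { rdd =>
          rdd.filter(_._2 > 80.0).foreach { case (device, avg) =>
            println(s"ALERT: $device averaged $avg C over the last minute")
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }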

  • Deciphering social media trends

Apache Spark lets you stay on top of the latest trends by analyzing real-time data from social network streams. Trends are determined by processing updates tied to particular events, which makes it possible to discover what is trending within a specific time window (hourly, daily, monthly and so on).
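
For instance, a sliding-window count of hashtags is enough to surface what is trending right now. The sketch below assumes a stream of raw status text, stood in for here by a socket source; a real job would use a connector such as the Twitter integration:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    object TrendingHashtags {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("TrendingHashtags"), Seconds(10))

        // Placeholder for a real social stream connector.
        val statuses = ssc.socketTextStream("localhost", 9999)

        val trending = statuses
          .flatMap(_.split("\\s+"))
          .filter(_.startsWith("#"))
          .map((_, 1))
          // Count over the last 10 minutes, refreshed every 30 seconds.
          .reduceByKeyAndWindow(_ + _, Minutes(10), Seconds(30))
          .transform(_.sortBy(_._2, ascending = false))

        trending.foreachRDD(rdd => rdd.take(10).foreach(println))

        ssc.start()
        ssc.awaitTermination()
      }
    }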

Challenges that need to be overcome for a successful Apache Spark implementation

  • Deployment issues

Apache Spark can be deployed in several modes, the most basic being standalone mode. Beyond that, it runs on cluster managers such as Apache Mesos and Hadoop YARN, and without a working knowledge of these you cannot deploy the toolkit correctly. Bundling dependencies is a real headache if you are not familiar with either of them, and an application that works fine in standalone mode may throw exceptions once it runs in cluster mode.
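
The choice of cluster manager is expressed through the master URL. A minimal sketch of the options, with placeholder host names; in practice the master is usually supplied via spark-submit's --master flag rather than hard-coded:

    import org.apache.spark.SparkConf

    // Master URLs select the deployment mode; the hosts below are placeholders.
    val conf = new SparkConf()
      .setAppName("MyJob")
      .setMaster("spark://master-host:7077") // standalone cluster
      // .setMaster("mesos://mesos-host:5050") // Apache Mesos
      // .setMaster("yarn")                    // Hadoop YARN
      // .setMaster("local[*]")                // local testing, one worker per core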

  • Memory configuration

Apache Spark is built to deal with huge amounts of incoming data, which makes it essential to monitor and allocate memory according to the data flow. There are numerous configuration settings in Spark that need to be tuned to match the inflow of data; if this is not done properly, jobs exceed their memory limits and fail with out-of-memory exceptions.
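
A minimal sketch of the main memory knobs; the values are illustrative, not recommendations:

    import org.apache.spark.SparkConf

    // Illustrative values; the right numbers depend on cluster size and data volume.
    val conf = new SparkConf()
      .setAppName("MemoryTunedJob")
      .set("spark.executor.memory", "4g")         // heap available to each executor
      .set("spark.memory.fraction", "0.6")        // share of heap for execution + storage (Spark 1.6+)
      .set("spark.memory.storageFraction", "0.5") // portion of the above reserved for caching

    // Driver memory must be set before the driver JVM starts, e.g. via
    // spark-submit --driver-memory 2g, not in SparkConf at runtime.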

  • Frequent releases affecting APIs

Apache Spark 1.x releases arrived roughly every three months, and 2.x releases every three to four months. That is frequent even by current software development standards. While it means developers can push changes and patches regularly, it plays havoc with the APIs, which in some cases are modified even when changes are not strictly required at the time.

  • Limitations in Python-based implementations

Apache Spark supports Scala, Java and Python. While support for the first two is solid, the Python implementation has a few rough edges: new Spark features land in the Scala and Java APIs first, and the Python library takes some time to catch up.

  • Lack of detailed (official) documentation

Comprehensive documentation is a must if new developers are to get up to speed with Apache Spark quickly. Although plenty of material is available from other sources, such as developer communities, the official documentation falls short on code examples, walkthroughs and demos for the more complex aspects of Spark.

Joseph Macwan

Technical writer at Aegis softwares
Joseph Macwan is a technical writer with a keen interest in business, technology and marketing topics. He writes on Java, Big Data and ASP.NET, with solutions and code.
