Keet Malin Sugathadasa

Why Real Time Analytics is the Next Big Thing


In today's world, the gold mine behind every business is its data. Whichever industry you work in, data is the backbone of your company. Maybe a decade ago, data didn't seem to matter much, due to the lack of both resources and clear needs for it. But today, data has become a primary source of income for many organizations, since it can be mined for hidden patterns and insights that are not visible to the naked eye. Day by day, the data gathered by companies keeps growing, to the point where data volumes appear to be increasing without bound. What organizations used to do back then was process all the data at the end of a period, to check for anomalies and other fraudulent activities.

Today's industry has become so advanced that, in the time it takes you to read the next ten words, user activity alone generates over a million gigabytes of data. What matters more, though, is how we manage that data and make the most of it. Storing and analyzing it places a significant strain on organizations' resources, which makes all these complexities difficult to manage. With different use cases coming in, real-time data has become key for many organizations, since it is what drives the actual patterns and analytics drawn from the data streaming in every second. This article is the beginning of a series on the field of real-time data analytics. I would like to cover the basics of real-time data and why it is important in today's industry. At the end of this article, I will also mention a few open-source tools for real-time data analytics that you can use directly in your next data project.

What is Real-time Analytics?

Real-time analytics is the analysis of data as soon as it becomes available: users should be able to draw conclusions the moment data enters the system. "Real time" is defined as a level of computer responsiveness that the user senses as sufficiently immediate, or that enables the computer to keep up with some external process [1]. The term gives the impression that the machine on the other side responds immediately when the user fires an action. Why is it so important nowadays to handle everything in real time? Users' lifestyles grow more complicated by the day, so they lean toward services that solve their problems faster and more conveniently. In today's market, even a delay of a few milliseconds can cost you a customer, which makes response time a critical property of an application. This is why it is important to respond to the user as quickly as possible.

Real-time systems differ mainly from "batch processing" systems, in which all transactions are stored until a specific point in time and then processed at once, so the user receives a response only after a certain delay.

If real-time systems are about providing an immediate response, real-time analytics is the same idea with an analysis step applied to the incoming data. As soon as data becomes available, the system runs a set of analytical tools to determine what kind of response should be given to the user. Almost every system we see today is struggling internally to provide the fastest response possible, so that the user gets the feel of a real-time system. Let's look at some practical examples where real-time analytics plays a major role.

  • Adaptive Authentication: the user is prompted with additional authentication steps based on the user's credibility, assessed in real time.

  • Bank Credit Score: financial institutions decide whether a customer is suitable to receive financial aid from the bank, based on the customer's past transaction history.

  • Fraud Detection: at point-of-sale machines, fraud needs to be identified in real time.

  • Customer Relationship Management: providing dynamic, satisfactory business results to customers interacting with the system.

  • Dynamic Analysis: testing and evaluating a program by executing it in real time.

According to TechTarget, there are several technologies that support real-time analytics. There are of course many possible ways to implement real-time systems, but given below are some common approaches taken by most of the tools described in this article.

  • Massively Parallel Processing: the data is separated into parts and processed by different machines, with a coordinator combining the partial results into the final one. E.g. Hadoop.

  • In-Memory Processing: the data is kept in memory and processed there, rather than being queried from a physical disk, since disk I/O operations are very costly and time-consuming.

  • Database Triggers: some databases can fire a function as soon as data is inserted into the database (a minimal sketch follows this list).

  • Data-warehouse Applications: data is stored in the warehouse after it has been processed; the processing is driven by the analytical use cases, with data cubes generated according to the relevant requirements.
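
To make the database-trigger approach concrete, here is a minimal sketch using Python's built-in sqlite3 module, whose embedded SQLite engine supports triggers. The table names, the 10,000 threshold and the fraud rule are all invented for illustration; a production system would use a server-grade database and real business rules.

```python
import sqlite3

# Hypothetical schema and threshold, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE alerts (transaction_id INTEGER, reason TEXT);

    -- Fires the moment a suspiciously large transaction is inserted.
    CREATE TRIGGER flag_large_transaction
    AFTER INSERT ON transactions
    WHEN NEW.amount > 10000
    BEGIN
        INSERT INTO alerts VALUES (NEW.id, 'amount over threshold');
    END;
""")

conn.execute("INSERT INTO transactions (amount) VALUES (25000)")
conn.execute("INSERT INTO transactions (amount) VALUES (12.50)")
print(conn.execute("SELECT * FROM alerts").fetchall())
# -> [(1, 'amount over threshold')]
```

The alert row appears the instant the insert commits, with no polling job involved, which is exactly the appeal of this approach.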

If data analytics is the "next big thing" in the industry, why is it so hard for applications to adopt it and provide real-time responses to their customers? This is mainly due to the very expensive resources that real-time processing systems require. Whatever resources we pick, they must be highly available and reliable. They must also be able to handle large amounts of data, up to and including terabytes, and yet still return answers to queries within seconds.

Real-time Analytics vs Batch Analytics

We all know the difference between batch processing and real-time processing (stream processing). In batch processing, data is stored in specific places as it arrives and processed after a certain amount of time, when it is gathered as needed. Calculating an employee's salary at the end of the month is a good example: at month end, you verify the number of hours worked, the leave taken and other relevant details to determine the employee's average monthly salary (a toy version of this is sketched below). Batch processing thus delivers results only after a certain amount of time.
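
As a toy illustration of that batch style (all names and numbers here are made up), a month-end job folds the stored timesheet records into results in a single pass:

```python
from statistics import mean

# Invented month-end timesheet records: (employee, hours_worked_that_day).
timesheet = [
    ("alice", 8.0), ("alice", 7.5), ("alice", 8.0),
    ("bob", 6.0), ("bob", 8.0),
]
HOURLY_RATE = 20  # assumed flat rate

# The whole month's data is processed at once, after the fact.
hours_by_employee = {}
for employee, hours in timesheet:
    hours_by_employee.setdefault(employee, []).append(hours)

for employee, hours in sorted(hours_by_employee.items()):
    print(employee, "avg daily hours:", mean(hours),
          "pay:", sum(hours) * HOURLY_RATE)
```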

To decide which is better, we also need to understand the use case and the resources at hand. If employees do not need to know their average salary every day, there is no point processing that information in real time; it would just be redundant information to the user. Processing it in real time would also mean the company has to invest heavily in resources for this one purpose, and I really don't think a company would spend that kind of money on something that brings no real benefit to the organization. Hence, use cases like calculating an employee's average monthly salary can be handled by batch processing systems.

But what about fraudulent activity during point-of-sale (POS) transactions? Can a fraudulent transaction wait until the end of the month to be detected and acted upon? I don't think so. Activities like this need to be identified then and there, so that the transaction can be investigated in real time. The comparison between real-time and batch processing is therefore a tough one, because different use cases place different demands.

In the businesses we see in the industry today, data has become the backbone of the technology, and everything has turned into a data-driven system. Data is the key to real wealth in most businesses, and collecting, storing and processing it is both important and potentially costly. Imagine a security system in which CCTV cameras gather video footage from a jewelry store. If we need to spot fraudulent activity as it happens, we have to analyze the video streams in real time. But what about storing this information? If we store the raw data, it becomes meaningless as it amasses in a very expensive storage system of our own. Hence, companies prefer to process data as soon as it is collected and store only the processed results, which significantly reduces the size of the data set that must be kept. On the other hand, processing data in real time requires a lot of compute resources, which can drive a company's costs up indefinitely. So even though real-time data processing seems tempting, there is always a limitation that could hinder overall efficiency.
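
Here is a toy sketch of that process-then-store idea in plain Python, with no real streaming framework and invented numbers: keeping only running per-minute aggregates instead of raw events shrinks what has to be persisted.

```python
from collections import defaultdict

# Tiny per-minute summaries stand in for raw readings: we keep one small
# aggregate per time bucket instead of persisting every incoming event.
summaries = defaultdict(lambda: {"count": 0, "total": 0.0})

def ingest(timestamp_s: int, value: float) -> None:
    bucket = summaries[timestamp_s // 60]  # one bucket per minute
    bucket["count"] += 1
    bucket["total"] += value

# Simulate a stream of 6,000 sensor readings, one per second.
for second in range(6000):
    ingest(second, 0.5)

print(f"{len(summaries)} summary rows kept instead of 6000 raw events")
```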

If we look at the most widely used tools for each of these types, Hadoop MapReduce is a popular framework for processing data in batches, while Apache Spark is a common choice for processing data in real time. A more detailed explanation of each is given in the sections below.

In summary, factors such as latency, data volume, resource cost and the demands of the use case are what should be weighed when comparing these two processing types.

Popular Tools for Data Analytics

In this section, let's go through some of the available tools for real-time data processing and look at how each works, from a very high-level perspective.

Apache Hadoop

Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for the distributed storage and processing of big data using the MapReduce programming model. Hadoop is primarily used to process massive data sets in batches, with the final result produced by the map and reduce functions: the data is distributed across a set of commodity machines, each of which processes its share, and the partial results are then reduced down to the final value as required. Big-data work uses MapReduce for two different purposes: data analytics, and data reduction for storage.

It is at the center of a growing ecosystem of big data technologies that are primarily used to support advanced analytics initiatives, including predictive analytics, data mining and machine learning applications. Hadoop can handle various forms of structured and unstructured data, giving users more flexibility for collecting, processing and analyzing data than relational databases and data warehouses provide.
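
To make the MapReduce model concrete, here is a minimal word-count sketch in the style of Hadoop Streaming, which lets plain scripts play the map and reduce roles. This is an illustrative assumption of how one might wire it up, not code from this article, and the submit command in the docstring uses placeholder paths.

```python
#!/usr/bin/env python3
"""A hypothetical Hadoop Streaming word count: one script, two roles.

Submit with something like (paths are placeholders):
  hadoop jar hadoop-streaming.jar \
    -input /data/in -output /data/out \
    -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
    -file wordcount.py
"""
import sys

def mapper():
    # Emit one (word, 1) pair per word; Hadoop shuffles and sorts by key.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Keys arrive sorted, so all counts for a word are contiguous.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current and current is not None:
            print(f"{current}\t{total}")
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

The pipeline can be smoke-tested locally with `cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce`, since the local `sort` stands in for Hadoop's shuffle phase.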

Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing. It can run on top of Apache Hadoop infrastructure and is used to process massive data volumes. Even though it is sometimes considered a successor to Hadoop, it is more of an alternative, built to overcome Hadoop's shortcomings. It can process both real-time and batch data, and it can outperform MapReduce by up to 100 times. Its specialty is in-memory processing, which turns out to be far faster than disk-based processing as in Hadoop. Apache Spark achieves high performance for both batch and streaming data using a state-of-the-art DAG scheduler, a query optimizer and a physical execution engine.
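
For comparison, here is the same word count as a minimal PySpark sketch, assuming a local Spark installation ("input.txt" is a placeholder path). The `.cache()` call pins the intermediate data set in memory, which is the in-memory processing described above.

```python
from pyspark.sql import SparkSession

# Start a local Spark session (assumes Spark/pyspark is installed).
spark = SparkSession.builder.appName("wordcount").getOrCreate()

words = (
    spark.sparkContext.textFile("input.txt")  # placeholder input path
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .cache()  # keep the intermediate data in memory for reuse
)

counts = words.reduceByKey(lambda a, b: a + b)
for word, count in counts.take(10):
    print(word, count)

spark.stop()
```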

Apache Kafka

Kafka is used for building real-time data pipelines and streaming applications, and is one of the most popular open-source tools for the queuing side of real-time systems. It provides the capabilities given below.

  • Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.

  • Store streams of records in a fault-tolerant durable way.

  • Process streams of records as they occur.

Kafka can be used for multiple analytical purposes, but it isn't enough to just read, write and store streams of data: the purpose is to enable real-time processing of those streams. Simple processing can be done directly with the producer and consumer APIs, while for more complex transformations Kafka provides a fully integrated Streams API, which allows building applications that perform non-trivial processing, such as computing aggregations over streams or joining streams together. Kafka aims for zero data loss and zero downtime by running as a cluster on one or more servers that can span multiple data centers.
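
Below is a minimal sketch of the producer and consumer APIs mentioned above, using the third-party kafka-python package and assuming a broker at localhost:9092; the "transactions" topic and the payload are invented for illustration.

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish a record to a topic (broker address and topic are assumptions).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("transactions", b'{"card": "1234", "amount": 42.50}')
producer.flush()  # block until the record is acknowledged

# Subscribe and process records as they arrive.
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the oldest available record
)
for message in consumer:
    print(message.topic, message.offset, message.value)
```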

Apache Zeppelin

Apache Zeppelin is a multi-purpose, web-based notebook that provides the following features.

  • Data Ingestion

  • Data Discovery

  • Data Analytics

  • Data Visualization and Collaboration

It gives data engineers an interactive, browser-based application for analyzing and viewing data patterns in the system, avoiding the hassle of writing redundant lines of code and running analytics from a console. The Spark interpreter can be configured through properties provided by Zeppelin, and many other analytical tools can likewise be integrated with Zeppelin for additional data-visualization features.

Elasticsearch

Elasticsearch is based on Lucene, a free and open-source information retrieval library written in Java. It is a distributed, RESTful search engine that allows users to search large volumes of data in seconds. In a nutshell, Elasticsearch stores real-time streaming data in a way that lets users query and retrieve it within seconds. Search engines generally require prior processing time before data becomes queryable, but in Elasticsearch even real-time data can be indexed directly into the search database, so users can query it alongside the older data.
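
As a minimal sketch of that workflow, here is the official elasticsearch-py client (8.x-style API) indexing a fresh event and then searching for it, assuming a node at localhost:9200; the index name and documents are invented.

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

# Connect to a (hypothetical) local node.
es = Elasticsearch("http://localhost:9200")

# Index a "streaming" event; it becomes searchable after the next index
# refresh, which happens about once per second by default.
es.index(index="events", document={
    "message": "POS transaction flagged",
    "amount": 25000,
    "timestamp": datetime.now(timezone.utc).isoformat(),
})

# Query new and old data together with a full-text match.
hits = es.search(index="events", query={"match": {"message": "transaction"}})
for hit in hits["hits"]["hits"]:
    print(hit["_source"])
```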

It is also an analytics engine, designed for horizontal scalability, reliability and easy management. It combines the speed of search with the power of analytics via a sophisticated, developer-friendly query language covering structured, unstructured and time-series data. Many companies today use Elasticsearch, and it is changing the way their entire businesses work, making them more sophisticated and efficient than ever before.

Apache Storm

Apache Storm is an open-source distributed real-time computation engine that supports the processing of unbounded real-time streams of data. Storm's power goes well beyond online real-time stream processing: it also supports online machine learning, continuous computation, distributed remote procedure calls, ETL and much more. Storm integrates with the queuing and database technologies you already use. A Storm topology consumes streams of data and processes them in arbitrarily complex ways, repartitioning the streams between each stage of the computation as needed.

This stream processor was originally written in the Clojure programming language and was open-sourced after its creator, BackType, was acquired by Twitter.

Project R

Even though R is a statistical platform for mathematically oriented analytics, it can definitely be used in real-time applications. The key is to design the tech stack so that it supports real-time data streams as the input to R's processing. R is also open source, with a very active community of developers contributing to it frequently. It is widely used by data miners, analysts and statisticians who want to arrive at meaningful conclusions from large data volumes.

Apache Samza

Apache Samza is an open-source, near-real-time, asynchronous computational framework for stream processing. It uses Apache Kafka for messaging and Apache Hadoop YARN to provide a fault-tolerant and secure stream processor for the end user. Samza's highlight is its simple callback-based "process message" API, comparable to MapReduce. It comes out of the box with YARN and Apache Kafka support, and has a pluggable API that helps integrate the platform with other messaging systems.

 

References

[1] TechTarget, definition of "real time" (whatis.techtarget.com).
