by Nikita Ivanov

How to do real-time analytics across historical and live data

Analysis
Aug 14, 2019 • 5 mins
Analytics | Apache Spark | Data Management

5 in-memory computing platform capabilities that support analytical processing of both data lake data and operational streams


Today’s analytical requirements are putting unprecedented pressure on existing data infrastructures. Performing real-time analytics across operational and stored data is often critical to success, yet consistently challenging to implement.

Consider an airline that wants to collect and analyze a continuous stream of data from its jet engines to enable predictive maintenance and faster issue resolution. Each engine has hundreds of sensors that monitor conditions such as temperature, speed, and vibration, and continuously send this information to an internet of things (IoT) platform. After the IoT platform ingests, processes, and analyzes the data, the data is stored in a data lake, with only the most recent data retained in the operational database.

Now, whenever an anomalous reading in the live data triggers an alert for a particular engine, the airline needs to run real-time analytics across the live operational data and the stored historical data for that engine. However, the airline may find that accomplishing real-time analytics using its current infrastructure is impossible. 

Today, companies developing big data initiatives typically use Hadoop to store a copy of their operational data in a data lake, where data scientists can access the data for various analyses. When the use case requires running real-time analytics across the incoming operational data as well as a subset of the data stored in the data lake, the traditional infrastructure becomes a stumbling block. There are inherent delays in accessing data stored in a data lake as well as challenges running federated queries across the combined data lake and operational data.

In-memory computing solutions solve the challenge of real-time analytics across data lake and operational data by providing real-time performance, massive scalability, and built-in integrations with popular data platforms. These capabilities enable hybrid transactional/analytical processing (HTAP) with the ability to run real-time federated queries across data lake and operational data sets.

In-memory computing platform capabilities

In-memory computing platforms deliver real-time performance and petabyte-level scalability for ingesting, processing, and analyzing operational data. A given platform may provide some or all of the following capabilities:

  • In-memory data grids and in-memory databases. In-memory data grids and in-memory databases pool the available memory and compute of a cluster of servers, allowing data to be processed in memory and eliminating the delays of retrieving data from disk prior to processing. In-memory data grids are deployed atop existing databases and keep the in-memory data and the underlying database in sync, while in-memory databases maintain the full data set in memory, writing data to disk only periodically, for backup and recovery purposes. In-memory data grids and in-memory databases can be deployed on premises, in a public or private cloud, or in a hybrid environment. (A minimal data grid sketch follows this list.)
  • Streaming data processing. In-memory computing platforms can ingest, process, and analyze high-volume data streams with real-time performance from popular streaming platforms such as Apache Kafka. (A consumer sketch follows this list.)
  • Machine learning and deep learning. In-memory computing platforms incorporating machine learning libraries allow real-time training of machine learning models in place using the operational data. In-memory computing platforms with native integrations with deep learning platforms such as TensorFlow can dramatically decrease the cost and complexity of preparing and transferring data to these deep learning training platforms.
  • Federated queries. Some in-memory computing platforms leverage built-in integrations with popular data processing platforms, including Apache Kafka and Apache Spark, to enable federated queries across data lake and operational data sets. Apache Kafka is used to build the real-time data pipelines and streaming applications that deliver incoming data for processing. Apache Spark is a unified analytics engine for large-scale data processing that can run federated queries across data from a Hadoop-based data lake and an operational database.
  • Hybrid transactional/analytical processing (HTAP) or hybrid operational/analytical processing (HOAP). HTAP, HOAP, or translytics lets companies maintain a single data set on which they perform transactional and analytical processing simultaneously, eliminating the costly and slow extract, transform, load (ETL) process otherwise required to move data from a dedicated transactional database to a separate, dedicated analytical database. (An SQL sketch of this pattern follows this list.)
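To make the data grid idea concrete, here is a minimal sketch using the open source Apache Ignite project (the basis of GridGain’s platform) and its Java thin client. The address, cache name, and key/value types are illustrative assumptions, not a prescribed schema:

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.client.ClientCache;
import org.apache.ignite.client.IgniteClient;
import org.apache.ignite.configuration.ClientConfiguration;

public class DataGridSketch {
    public static void main(String[] args) {
        // Connect to a running Ignite cluster on the default thin-client port (assumed address).
        ClientConfiguration cfg = new ClientConfiguration().setAddresses("127.0.0.1:10800");
        try (IgniteClient client = Ignition.startClient(cfg)) {
            // Values live in the cluster's pooled memory, so reads never wait on disk.
            ClientCache<Long, Double> readings = client.getOrCreateCache("engineReadings");
            readings.put(42L, 642.5);               // store a sensor reading (hypothetical data)
            System.out.println(readings.get(42L));  // read it straight back from memory
        }
    }
}
```

In a data grid deployment, the same cache would typically be configured with read-through/write-through against the underlying database so the two stay in sync.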
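For the streaming side, a plain Kafka consumer loop is enough to illustrate ingestion. This sketch assumes a broker at localhost:9092 and a hypothetical engine-sensors topic; in practice the in-memory platform’s built-in Kafka integration would handle this step:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SensorStreamIngest {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "engine-monitor");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("engine-sensors")); // hypothetical topic
            while (true) {
                // Each record is one sensor measurement from one engine.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // This is where a reading would be written into the in-memory
                    // grid (e.g., the cache in the previous sketch) for immediate analysis.
                    System.out.printf("engine=%s reading=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```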
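The HTAP pattern itself needs nothing exotic at the query level: the transactional write and the analytical read both hit the same in-memory data set. A sketch using Ignite’s JDBC thin driver, assuming an engine_readings table has already been created:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HtapSketch {
    public static void main(String[] args) throws Exception {
        // JDBC thin driver URL for a local Ignite node (host and table are assumptions).
        try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1/");
             Statement stmt = conn.createStatement()) {
            // Transactional side: land a new sensor reading as it arrives.
            stmt.executeUpdate(
                "INSERT INTO engine_readings (engine_id, ts, vibration) " +
                "VALUES ('E-1042', CURRENT_TIMESTAMP, 0.73)");
            // Analytical side: aggregate over the very same live data set, with no ETL hop.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT engine_id, AVG(vibration) FROM engine_readings GROUP BY engine_id")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " avg vibration: " + rs.getDouble(2));
                }
            }
        }
    }
}
```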

From Apache Kafka to Apache Spark to real-time insights

An in-memory computing platform that integrates with Kafka, Spark, and Hadoop makes it possible for the airline in our example to run real-time analytics across its live operational data and the historical data for a particular engine. Apache Kafka feeds the live streaming data to the in-memory computing platform. The in-memory computing platform maintains operational and warm data in memory and runs real-time queries across those data sets. Spark retrieves the historical data from the data lake and warm operational data from the in-memory computing platform to provide deeper insights by running queries across the data silos. With this architecture, the airline gains instant insight into the causes of an anomalous reading.
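Here is a hedged sketch of that federated query step, using Spark SQL from Java: historical readings come from Parquet files in the data lake, warm operational readings come from the in-memory store over JDBC, and a single SQL statement spans both. The path, URL, table names, and columns are illustrative assumptions, and the two sources are presumed to share a schema:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FederatedEngineQuery {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("engine-anomaly-analysis")
                .getOrCreate();

        // Historical readings from the Hadoop-based data lake (hypothetical path).
        spark.read().parquet("hdfs:///lake/engine_readings/")
                .createOrReplaceTempView("historical");

        // Warm operational readings from the in-memory store, reached here over
        // JDBC (assumed URL and table; the JDBC driver must be on the classpath).
        spark.read()
                .format("jdbc")
                .option("url", "jdbc:ignite:thin://127.0.0.1/")
                .option("dbtable", "ENGINE_READINGS")
                .load()
                .createOrReplaceTempView("operational");

        // One query spans both data sets: the federated part.
        Dataset<Row> vibrationTrend = spark.sql(
                "SELECT engine_id, AVG(vibration) AS avg_vibration " +
                "FROM (SELECT * FROM operational UNION ALL SELECT * FROM historical) combined " +
                "WHERE engine_id = 'E-1042' " +
                "GROUP BY engine_id");
        vibrationTrend.show();
    }
}
```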

Some of the most sought-after benefits of modern data infrastructure include predictive maintenance and faster time to issue resolution, which lead to increased customer satisfaction, improved asset utilization, and higher ROI. Using in-memory computing platforms to run real-time analytics on operational data and a subset of data lake data can make these and other new real-time IoT services a reality.

Nikita Ivanov is co-founder and CTO of GridGain Systems, where he has led the development of advanced and distributed in-memory data processing technologies. He has more than 20 years of experience in software application development, building HPC and middleware platforms, and contributing to the efforts of companies including Adaptec, Visa, and BEA Systems.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.