by Ashish Thusoo

How Qubole addresses Apache Spark challenges

analysis
Aug 28, 2019 · 9 mins
Analytics, Apache Spark, Cloud Computing

The Qubole Data Platform brings streamlined configuration, auto-scaling, cost management, and performance optimizations to Spark-as-a-service

Traditional relational databases have been highly effective at handling large sets of structured data. That’s because structured data conforms nicely to a fixed schema model of neat columns and rows that can be manipulated using SQL commands to establish relationships and obtain results. Then big data came along.

Big data required a new way to store, manage, and query the massive sets of messy, unstructured data that are often involved. Traditional data processing tools have failed to meet the performance and reliability requirements of big data for machine learning and advanced analytics applications. Organizations needed a way to build reliable pipelines that could handle these vast, complex workloads.

This led to the emergence of distributed data processing engines such as Apache Spark that split the data into smaller, manageable chunks and process it across multiple compute nodes. Distributed engines greatly improve processing times and enable a wide spectrum of use cases in machine learning and big data analytics, which in turn allow for more experimentation and innovation.
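The split-and-combine idea can be sketched in a few lines of plain Python. This is not Spark code: a thread pool stands in for the compute nodes, and the function names (`split_into_chunks`, `process_chunk`, `distributed_sum_of_squares`) are hypothetical, chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_chunks):
    """Split a list into roughly equal, contiguous chunks."""
    size = max(1, -(-len(data) // num_chunks))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def process_chunk(chunk):
    """Work done independently on one chunk: a partial sum of squares."""
    return sum(x * x for x in chunk)


def distributed_sum_of_squares(data, num_workers=4):
    chunks = split_into_chunks(data, num_workers)
    # Each worker thread stands in for a compute node handling one chunk.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    # Combine the partial results into the final answer.
    return sum(partials)


total = distributed_sum_of_squares(list(range(1, 101)))
```

A real engine adds scheduling, data movement between nodes, and failure handling on top of this pattern, but the shape is the same: partition, process in parallel, combine.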

What is Apache Spark?

Apache Spark is a high-performance, distributed data processing engine that has become a widely adopted framework for machine learning, stream processing, batch processing, ETL, complex analytics, and other big data projects. The open source engine stands out for its ability to process large volumes of data significantly faster than MapReduce, because it keeps data in memory between processing steps rather than writing intermediate results to disk.

First developed at the AMPLab at UC Berkeley, Spark now has a vibrant community of more than 1,900 contributors. These developers prioritize agility, flexibility, and scalability through the 300 to 400 code commits made each month. And data professionals confirm the framework’s rising popularity: Spark is the leading big data framework in use and continues to experience growth, with usage up 29 percent since 2017.

Why use Apache Spark?

Created to address the speed limitations of traditional distributed processing engines, Spark’s in-memory data engine provides a faster and more fault-tolerant way to run workloads. With Spark, you can concurrently process large amounts of data across clusters or servers without incurring speed issues or job losses. Due to the nature of in-memory processing, organizations can rapidly process large workloads for machine learning, streaming data, ETL, and batch and graph processing.

Unlike Hadoop MapReduce, Spark relies on its own parallel data processing framework. This framework places data in Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programs perform in-memory calculations on large clusters in a fault-tolerant manner. Because data is persisted in-memory (and on disk if required), Spark can be significantly faster and more flexible than MapReduce for certain applications.
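One way to picture how an in-memory data set can still be fault tolerant is to record how each piece of data was derived, so that a lost piece can be rebuilt rather than restored from a copy. The sketch below is a toy, single-machine illustration in plain Python, not Spark's API; the `ToyDataset` class and its methods are hypothetical.

```python
class ToyDataset:
    """A toy, single-machine stand-in for an RDD (not Spark's API)."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self._source = partitions  # only set for the root data set
        self._parent = parent      # lineage: where the data came from
        self._fn = fn              # lineage: how each partition is derived
        self._cache = {}           # partition index -> in-memory result

    def num_partitions(self):
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions()

    def map(self, fn):
        """Record a transformation lazily; nothing is computed yet."""
        return ToyDataset(parent=self, fn=lambda part: [fn(x) for x in part])

    def filter(self, pred):
        return ToyDataset(parent=self,
                          fn=lambda part: [x for x in part if pred(x)])

    def compute(self, index):
        """Return one partition from memory if cached, else from lineage."""
        if index not in self._cache:
            if self._parent is None:
                self._cache[index] = list(self._source[index])
            else:
                self._cache[index] = self._fn(self._parent.compute(index))
        return self._cache[index]

    def lose_partition(self, index):
        """Simulate a node failure that drops one cached partition."""
        self._cache.pop(index, None)

    def collect(self):
        return [x for i in range(self.num_partitions())
                for x in self.compute(i)]


base = ToyDataset(partitions=[[1, 2, 3], [4, 5, 6]])
evens_squared = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

first = evens_squared.collect()
evens_squared.lose_partition(1)   # simulated node failure
second = evens_squared.collect()  # partition 1 is rebuilt from its lineage
```

Only the lost partition is recomputed; the partition that stayed in memory is served from the cache.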

Data professionals also like Spark because the framework works for multiple use cases and supports multiple languages. Data scientists, data engineers, and data analysts can leverage Spark in conjunction with their preferred coding language—whether R, SQL, Scala, Python, or Java—or alongside their BI or analytics tool of choice.

Common uses of Apache Spark

Spark provides a simple and expressive programming model that supports a wide range of machine learning and data processing applications. Spark’s in-memory data model and fast processing make it particularly suitable for applications such as batch processing, stream processing, machine learning and graph processing, ad hoc analysis, and interactive queries against in-memory data.

Batch processing 

Batch processing takes a large data set as input all at once, processes it, and writes a large output. While MapReduce handles batch processing, Spark can process batch jobs even faster. Spark offers an in-memory computational engine that allows the creation of programs that can read data, build a pipeline, and export the results, and it may be the better choice if speed of processing is critical. By reducing the number of writes and reads to and from disk, Spark can execute batch processing jobs 10 to 100 times faster than the Hadoop MapReduce engine.

Stream processing

With so much data being processed daily, companies need the ability to stream and analyze all that information in real time. Spark Streaming can handle this extra workload, and some believe Spark will become the go-to platform for stream computing applications no matter the type. That’s because Spark Streaming unifies disparate data processing capabilities, allowing developers to use a single framework for all of their processing needs. Businesses today use Spark in this way for streaming ETL, data enrichment, trigger event detection, and complex session analysis.
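Spark Streaming's original model treats a stream as a series of small batches, which is part of why batch and streaming code can share one framework. A minimal plain-Python sketch of that micro-batch idea follows; it does not use Spark, and the function names and sample events are hypothetical.

```python
from collections import Counter, defaultdict


def micro_batches(events, interval_seconds):
    """Group (timestamp, key) events into fixed-width time intervals."""
    batches = defaultdict(list)
    for timestamp, key in events:
        batches[int(timestamp // interval_seconds)].append(key)
    # Yield batches in time order, as a streaming engine would see them.
    for batch_id in sorted(batches):
        yield batch_id, batches[batch_id]


def running_counts(events, interval_seconds):
    """Update a running count per key after every micro-batch."""
    state = Counter()
    snapshots = []
    for batch_id, keys in micro_batches(events, interval_seconds):
        state.update(keys)  # stateful update carried across batches
        snapshots.append((batch_id, dict(state)))
    return snapshots


events = [(0.5, "login"), (1.2, "click"), (1.9, "click"), (4.1, "login")]
snapshots = running_counts(events, interval_seconds=2)
```

Each snapshot is the state after one interval, which is the kind of running aggregate that session analysis and trigger event detection build on.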

Machine learning

Spark comes with an integrated framework for performing advanced analytics that helps users run repeated queries on sets of data—which essentially amounts to processing machine learning algorithms. Among the components in this framework is Spark’s scalable machine learning library, MLlib. MLlib can work in areas such as classification, clustering, and dimensionality reduction, among many others. All of this enables Spark to be used for common big data functions like predictive intelligence, customer segmentation, and sentiment analysis. 
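To make the clustering case concrete, here is a minimal sketch of k-means, one of the clustering algorithms MLlib provides, written in plain Python for one-dimensional data. It is not MLlib code; the function name, the sample spend figures, and the starting centroids are assumptions for illustration.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on one-dimensional points."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical monthly spend per customer, segmented into two groups.
spend = [10, 12, 11, 95, 100, 105]
centroids, segments = kmeans_1d(spend, centroids=[0, 50])
```

The repeated passes over the same data are what make this kind of workload a good fit for an engine that keeps data in memory.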

Graph processing

Unlike traditional relational databases, where data processing requires numerous temporary data sets, Spark separates the storage of SQL from the database object store, allowing temporary data sets to be included as variables. GraphX is Apache Spark’s API for graphs and graph-parallel computation. GraphX unifies ETL, exploratory analysis, and iterative graph computation within a single system. A library of common graph algorithms and operators allows users to manipulate graphs and perform graph-parallel computations. 
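Connected components is one of the graph algorithms GraphX ships. The sketch below shows the iterative, vertex-centric style of such a computation in plain Python; it is not GraphX code, and the function name and sample graph are hypothetical.

```python
def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex id in its component."""
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    labels = {v: v for v in vertices}
    changed = True
    while changed:  # each pass plays the role of one parallel iteration
        changed = False
        new_labels = {}
        for v in vertices:
            # A vertex adopts the smallest label among itself and neighbors.
            best = min([labels[v]] + [labels[n] for n in neighbors[v]])
            new_labels[v] = best
            if best != labels[v]:
                changed = True
        labels = new_labels
    return labels


labels = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
```

In a graph-parallel system the per-vertex update inside the loop runs across the cluster, with labels exchanged along edges between iterations.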

Ad hoc analysis

Among Spark’s most notable features is its capacity for interactive analytics. MapReduce was built to handle batch processing, and higher-level tools that run on it, such as Apache Pig, are frequently too slow for interactive analysis. But Apache Spark is fast enough to perform exploratory queries without sampling. By combining Spark with interactive analysis tools such as notebooks, complex data sets can be processed and visualized interactively. With Spark, business analysts and data scientists can quickly explore data sets to answer specific questions and drill deeper into results.

Interactive programming

Spark also works well for data sets that require a programmatic approach, because it interfaces with a number of development languages including SQL, R, Scala, and Python. One example is data stored as a hierarchical tree that must be traversed just to be read, as is the case with the 835 and 837 file formats widely used in healthcare insurance processing. The same language interfaces can also be used to quickly understand the nature and statistical distribution of the data.

Apache Spark challenges

Some of the same elements that make Spark attractive—flexibility, speed, and customization—also create challenges. Here are a few of the most common challenges that Spark users face, and how we address them with Qubole. Qubole is a cloud native platform for big data analytics and machine learning that integrates with the major public cloud providers and supports all of the most popular data processing engines, including Spark and Presto. 

Configuration

One of the pitfalls of Spark is that it can be highly challenging to get up and running. A variety of different configurations are needed to run different workloads at scale, and Spark becomes unstable if they’re not set up properly. All of these different modifications mean Spark is not intuitive for end users.
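To give a sense of what these configurations look like, the sketch below renders a handful of standard Spark properties as `--conf` flags for `spark-submit`. The property names are real Spark settings, but the values are illustrative assumptions only, and the helper `to_spark_submit_args` is hypothetical.

```python
def to_spark_submit_args(conf):
    """Render a dict of Spark properties as spark-submit --conf flags."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


# Illustrative values only; the right numbers depend on workload and cluster.
etl_conf = {
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
    "spark.sql.shuffle.partitions": "400",
    "spark.dynamicAllocation.enabled": "true",
}

args = to_spark_submit_args(etl_conf)
```

Even this short list touches memory sizing, parallelism, and scaling behavior, and settings that suit an ETL job may suit an interactive workload poorly.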

Qubole developed its own version of Spark that is part of the cluster management framework at the heart of the platform. It streamlines configuration into a few guided steps, with the option to use default settings or highly customized parameters depending on use case and level of expertise.

Performance

Your success with Spark will depend at least partly on performance, or how quickly you can read, write, and process data. As a stand-alone open source solution, Spark provides a general-purpose engine that can be configured and optimized depending on the use case. However, customizing Spark is daunting and time-consuming. As a result, users often face performance and usability hiccups that impact the value Spark can deliver. 

Qubole’s version of Spark has been optimized for the cloud by improving I/O, such as direct writes and batch listing with cloud-based object storage. It also carries specific engine-level performance enhancements such as join optimizations and dynamic filtering, which handle skew joins more efficiently, along with various other query optimizations like Python UDF (user-defined function) pushdown. Qubole also integrates with a distributed caching layer known as RubiX, which the company released as open source, to further enhance performance when running against data in a cloud object store.
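The general idea behind dynamic filtering can be sketched in plain Python: filter the small side of a join first, then use its surviving keys to prune the large side before the join runs. This illustrates the technique in general and is not Qubole's implementation; the function name, row layout, and sample data are assumptions.

```python
def join_with_dynamic_filter(fact_rows, dim_rows, dim_predicate):
    """Join fact and dimension rows on 'key', pruning fact rows early."""
    # 1. Apply the filter to the small (dimension) side first.
    dim_by_key = {d["key"]: d for d in dim_rows if dim_predicate(d)}
    # 2. The surviving keys become a runtime filter for the large side.
    wanted = set(dim_by_key)
    pruned = [f for f in fact_rows if f["key"] in wanted]
    # 3. Only the pruned rows take part in the join.
    joined = [{**f, **dim_by_key[f["key"]]} for f in pruned]
    return joined, len(pruned)


dims = [{"key": 1, "region": "EU"}, {"key": 2, "region": "US"}]
facts = [
    {"key": 1, "amount": 10},
    {"key": 2, "amount": 20},
    {"key": 1, "amount": 5},
    {"key": 3, "amount": 7},
]

joined, rows_scanned_in_join = join_with_dynamic_filter(
    facts, dims, lambda d: d["region"] == "EU"
)
```

Here only two of the four fact rows reach the join. On a cluster, the rows pruned in step 2 are rows that never need to be shuffled across the network.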

Cost and resources

Cost can be a critical obstacle to Spark adoption, both in terms of infrastructure setup and resource utilization. Because Spark is memory intensive, it may not be the most cost-effective engine for all use cases, and it can strain your infrastructure and resources. Often, data teams spend significant time on routine maintenance requests and servicing internal customers. Poor resource utilization can impede or delay success, because Spark—like any distributed data processing engine—is only as useful as your ability to optimize related resources and costs. 

Qubole takes management and operation costs into consideration by automatically managing the lifecycle of clusters, including bringing up new clusters, scaling them up and down to match workloads, and terminating clusters when they are idle. Qubole also developed an open source tool called Sparklens that provides insight into compute resource usage and helps you to optimize it. And Qubole manages Amazon EC2 spot instances automatically, so customers can take advantage of these lower-priced cloud resources without risking job failures.
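A workload-driven scaling policy of this kind can be sketched as two small decisions: how many nodes the current backlog calls for, and whether an idle cluster should be shut down. The rules below are a hypothetical illustration, not Qubole's algorithm; every name and threshold is an assumption.

```python
import math


def target_node_count(pending_tasks, tasks_per_node, min_nodes, max_nodes):
    """Size the cluster to the backlog, within configured bounds."""
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))


def should_terminate(idle_minutes, idle_timeout_minutes):
    """Shut the cluster down once it has been idle long enough."""
    return idle_minutes >= idle_timeout_minutes


busy = target_node_count(pending_tasks=95, tasks_per_node=10,
                         min_nodes=2, max_nodes=20)
quiet = target_node_count(pending_tasks=0, tasks_per_node=10,
                          min_nodes=2, max_nodes=20)
spike = target_node_count(pending_tasks=1000, tasks_per_node=10,
                          min_nodes=2, max_nodes=20)
```

The minimum keeps the cluster responsive when work arrives, the maximum caps spend during spikes, and the idle timeout stops paying for a cluster nobody is using.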

Apache Spark in the cloud

Spark in the cloud offers a multitude of benefits for organizations focused on big data, and includes many configuration parameters that allow you to optimize the application for your specific needs. However, there are a few caveats to keep in mind. Data processed by Spark is subject to change, and a Spark application that is tightly optimized for one data profile may fail when the volume or distribution of that data shifts. Thus, you should consider the end-to-end process and be cautious when optimizing a Spark application whose many stages each run numerous tasks but complete in under a minute or two. That’s why it’s helpful to have an adaptive, intelligent cloud platform that dynamically optimizes to meet your business needs.

Ashish Thusoo is co-founder and chief executive officer of Qubole. Before co-founding Qubole, Ashish ran Facebook’s data infrastructure team. Under his leadership, the Facebook data infrastructure team built one of the largest data processing and analytics platforms in the world and created a host of tools, technologies, and templates that are used throughout the industry today.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.