Overhauling Apache Kylin for the cloud

Apache Kylin was built to query massive relational tables with sub-second response times. A new, fully distributed query engine in Kylin 4 steps up performance of both cubing and queries.


Recently, the Apache Kylin community released a major update with the general availability of Kylin 4. Kylin 4 continues the mission to provide a unified, high-performance, cloud-friendly, open source OLAP (online analytical processing) platform. Kylin 4 upgrades the Kylin architecture to make it easy to deploy and scale in the cloud. The new release features three major platform updates and myriad other improvements.

First, Kylin 4 replaces its previous HBase storage engine with Apache Parquet, making it possible to decouple compute and storage so that each can scale independently. Second, Kylin 4 unifies the compute engine and removes any previous dependencies on the Hadoop ecosystem. This makes resource allocation much more flexible, resulting in a significant reduction in total cloud resource usage and associated costs. Third, by introducing a brand new, fully distributed query engine, Kylin 4 substantially shortens cubing duration and lowers query latency compared to previous releases.

In this article, we will dive into the details of these new innovations and the new capabilities they enable.

What is Apache Kylin?

Apache Kylin is an open source distributed analysis engine that provides SQL query interfaces above Hadoop and Spark, along with OLAP capabilities to support extremely large data sets. It was initially developed at eBay and contributed to the Apache Software Foundation. Kylin can query massive relational tables with sub-second response times.

Kylin’s core idea is the precomputation of result sets, meaning it calculates all possible query results in advance according to the specified dimensions and measures. In effect, Kylin trades storage space for query time to speed up OLAP queries with fixed query patterns.

Apache Kylin lets you query billions of rows at sub-second latency in three steps:

Identify a star or snowflake schema on Hadoop/Spark.
Build a cube from the identified tables.
Query using ANSI-SQL and get results via ODBC, JDBC, or RESTful API.
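
As a rough sketch of the third step, the Python snippet below prepares a query for Kylin's REST API without sending it. It assumes the query endpoint is `POST /kylin/api/query` with a JSON body carrying `sql`, `project`, and `limit`, protected by HTTP Basic authentication; the host, project, table, and credentials are placeholders invented for illustration, so check the details against the documentation for your Kylin version.

```python
import base64
import json
import urllib.request


def build_kylin_query_request(base_url, project, sql, user, password, limit=100):
    """Prepare (but do not send) an HTTP request for Kylin's REST query API.

    Assumption: the endpoint is POST {base_url}/kylin/api/query and accepts a
    JSON body with "sql", "project", and "limit"; verify for your version.
    """
    body = json.dumps({"sql": sql, "project": project, "limit": limit}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/kylin/api/query",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


# Hypothetical host, project, table, and credentials, for illustration only.
req = build_kylin_query_request(
    base_url="http://kylin.example.com:7070",
    project="sales_project",
    sql="SELECT region, SUM(price) FROM sales_fact GROUP BY region",
    user="analyst",
    password="secret",
)

# To run the query against a live server:
#   with urllib.request.urlopen(req) as response:
#       result = json.load(response)
```

Building the request separately from sending it keeps the example runnable without a Kylin server; JDBC and ODBC clients submit the same SQL through their respective drivers.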

How Kylin works

Concepts

Kylin’s concepts of a cube and a cuboid can be understood from the following figure:

Each combination of dimensions is called a cuboid and the set of all cuboids is a cube. The cuboid composed of all dimensions is called the base cuboid. All cuboids can be calculated from the base cuboid. A cuboid can be understood as a wide table after precomputation. During the query, Kylin will automatically select the most suitable cuboid that meets the query requirements.
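
To make the cube and cuboid terminology concrete, here is a minimal Python sketch, not Kylin's implementation. It enumerates every cuboid for three made-up dimensions and derives each one from the base cuboid by rolling up a SUM measure. The dimension names, rows, and helper functions are all assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Toy fact rows: three dimensions and one measure (all names are illustrative).
DIMENSIONS = ("year", "region", "category")
ROWS = [
    {"year": 2020, "region": "EU", "category": "books", "sales": 10},
    {"year": 2020, "region": "EU", "category": "games", "sales": 5},
    {"year": 2020, "region": "US", "category": "books", "sales": 7},
    {"year": 2021, "region": "US", "category": "books", "sales": 3},
]


def enumerate_cuboids(dimensions):
    """Return every combination of dimensions, base cuboid first."""
    return [
        combo
        for size in range(len(dimensions), -1, -1)
        for combo in combinations(dimensions, size)
    ]


def build_base_cuboid(rows, dimensions, measure):
    """Aggregate raw rows by all dimensions (SUM of the measure)."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dimensions)] += row[measure]
    return dict(totals)


def roll_up(base_cuboid, dimensions, cuboid):
    """Derive a smaller cuboid from the base cuboid by dropping dimensions."""
    keep = [dimensions.index(d) for d in cuboid]
    totals = defaultdict(int)
    for key, value in base_cuboid.items():
        totals[tuple(key[i] for i in keep)] += value
    return dict(totals)


cuboids = enumerate_cuboids(DIMENSIONS)
base = build_base_cuboid(ROWS, DIMENSIONS, "sales")
cube = {cuboid: roll_up(base, DIMENSIONS, cuboid) for cuboid in cuboids}

print(len(cuboids))       # 8 cuboids for 3 dimensions (2 ** 3)
print(cube[("region",)])  # {('EU',): 15, ('US',): 10}
print(cube[()])           # {(): 25}
```

With n dimensions there are 2^n cuboids, which is why precomputation costs storage. The simple roll-up works here because SUM is additive; measures such as distinct counts need more care.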

Basic query process

The figure above shows a scenario without precomputation, in which every query must be calculated on the fly. The Agg and Join operators both involve a shuffle, so with large amounts of data performance is poor and each query occupies more resources, which limits query concurrency.

After precomputation, the two most time-consuming operations (Agg and Join) disappear from the rewritten execution plan, which instead reads a precisely matching cuboid. Additionally, when defining the cube we can choose a column to order by, so the Sort operation does not need to be calculated at query time. The whole calculation is a single stage without the expense of a shuffle, and it can be completed with only a few tasks, which improves query concurrency.
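
The idea of matching a query to a precomputed cuboid can be sketched as follows. This is a simplified stand-in, not Kylin's actual matching logic: it assumes each cuboid is described only by its set of dimensions and a row count, and it picks the smallest cuboid that covers every dimension the query uses. The cuboids and row counts are hypothetical.

```python
def select_cuboid(cuboid_row_counts, query_dimensions):
    """Pick the smallest precomputed cuboid covering all query dimensions.

    cuboid_row_counts maps a frozenset of dimension names to the number of
    rows stored for that cuboid. Simplified illustration only.
    """
    needed = frozenset(query_dimensions)
    candidates = [
        (rows, sorted(dims))
        for dims, rows in cuboid_row_counts.items()
        if needed <= dims  # cuboid must contain every dimension the query uses
    ]
    if not candidates:
        raise LookupError(f"no cuboid covers dimensions {sorted(needed)}")
    rows, dims = min(candidates)
    return frozenset(dims), rows


# Hypothetical cuboids and row counts for a cube with three dimensions.
CUBOIDS = {
    frozenset({"year", "region", "category"}): 1_000_000,  # base cuboid
    frozenset({"year", "region"}): 500,
    frozenset({"year", "category"}): 2_000,
    frozenset({"year"}): 10,
}

# GROUP BY year: an exact match, so no further aggregation is needed.
exact = select_cuboid(CUBOIDS, ["year"])

# GROUP BY region: no exact match, so the smallest covering cuboid is scanned
# and aggregated further at query time.
covering = select_cuboid(CUBOIDS, ["region"])
```

An exact match is the best case described above; when only a larger covering cuboid exists, some aggregation still happens at query time, but over far fewer rows than the raw fact table.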

Cloud-friendly architecture

New storage engine

When Apache Kylin was born, it relied on Hadoop. In Kylin 3.x and earlier, Kylin used HBase as the storage engine for the precomputed results generated by cube builds; supported MapReduce, Spark, and Flink as build engines; and used a query engine based on Apache Calcite.

Time in production use and continued development have gradually exposed a variety of problems with this architecture, such as the high maintenance cost of HBase and the performance limitations of the Calcite query engine, which is difficult to scale horizontally. And while HBase, as a database built on top of HDFS, has been excellent in terms of query performance, it still has the following disadvantages:

HBase is not a true columnar store.
HBase has no secondary index; Rowkey is the only index.
HBase has no encoding; Kylin has to do the encoding by itself.
HBase is not well suited to cloud deployment and auto-scaling.
HBase has different API versions and compatibility issues between them (e.g., 0.98, 1.0, 1.1, 2.0).
HBase has different vendor releases and compatibility issues between them (e.g., Cloudera’s is not compatible with others).

Facing the above problems, the Apache Kylin community proposed to replace HBase with Apache Parquet and Apache Spark, for the following reasons:

Parquet is a mature and stable open source columnar storage format.
Parquet is more cloud-friendly, able to work with most cloud file systems (HDFS, Amazon S3, Azure Blob Storage, Alibaba Cloud Object Storage Service, etc.).
Parquet can be tightly integrated with Hadoop, Hive, Spark, Impala, etc.
Parquet supports custom indexes.

New Spark build engine

In Kylin 4, the Spark engine is the only build engine. Compared with the build engine in previous versions, the Spark engine has the following characteristics:

Kylin 4 simplifies many build steps. For example, Kylin 4 only needs two steps to build a cube: resource detection and cubing.
Because Parquet encodes the stored data, an encoding process for dimension dictionaries and dimension columns is no longer needed in Kylin 4.
Kylin 4 implements a new global dictionary. For more details, please refer to this Kylin Wiki article.
Kylin 4 will automatically adjust the parameters of Spark according to available cluster resources and the build job.
Kylin 4 improves build performance.

New distributed query engine

Sparder, the new query engine of Kylin 4, is a distributed query engine implemented on a Spark back end. Compared with the original query engine, Sparder has the following advantages:

Distributed query engine eliminates a single point of failure.
Unified computation framework (Spark) for building and querying.
Substantial increase in performance of complex queries.
Can benefit from new features in Spark and the Spark ecosystem.

Kylin 4 in the cloud

Cloud computing has many compelling features (unlimited storage capacity, easy maintenance, paying for what you use) that are drawing more enterprises into the public cloud. We see many companies benefiting from moving their on-premises infrastructure to cloud, achieving goals of lower TCO (total cost of ownership), greater scalability and reliability, and stronger data protection.

On the engineering side, cloud computing also brings changes to the way enterprises design and deploy their software. Modular software architecture makes applications user-friendly and flexible to develop and use.

Kylin 3 relies on Hadoop. Before deploying a Kylin 3 instance, users must prepare a Hadoop cluster including heavy services such as HDFS and HBase. Kylin 3 users must acquire a lot of knowledge about how to maintain and optimize these Hadoop components. Because Kylin 3 has a complex architecture and suffers from reliability and scalability problems, it is not generally suitable for cloud deployment.

All of this changes with Kylin 4. Kylin 4 removes Kylin’s dependency on Hadoop components such as Yarn and HBase. The “Kylin plus Spark plus object storage” architecture has less complexity, making deployment in the cloud easier and more manageable. In this new architecture, Parquet replaces HBase and Spark replaces Yarn and MapReduce.

This figure shows how Kylin 4 could be deployed on a public cloud. First, the new architecture is lightweight and requires fewer components than before. Deployment is easier and faster, and most components are stateless, unlike stateful services such as HDFS and HBase. Statelessness means we can delete these resources when we do not need them. Second, scaling is much easier than before: you simply add components to or remove them from your Spark cluster.

Kylin 4 performance on AWS

Preparation

In order to help readers understand the performance differences between Kylin 3 and Kylin 4, we have provided a performance benchmark report in a standard software and hardware environment. Amazon EMR was chosen as our benchmark platform.

Additionally, we chose TPC-H and SSB as our benchmark standards. The scale factor used in this test is 10 (meaning the fact table has about 60 million rows).

The following table shows the aspects compared between different versions in this benchmark report.

Metrics/Aspect	Description
Cubing duration	Duration of precalculation (cube building) process (load source table into Kylin)
Cube size	Disk space occupied by cube/index
Response time	Serial query test lasting 15 minutes, taking the 95th percentile of the overall response time as the result.
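
For readers unfamiliar with the response time metric, the sketch below computes a 95th percentile (P95) with the nearest-rank method. The article does not say which percentile definition the benchmark used, so the method is an assumption, and the response times are made-up values.

```python
import math


def percentile_nearest_rank(samples, percent):
    """Return the given percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(percent * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up response times in milliseconds from a serial query run.
response_times_ms = [120, 95, 110, 105, 2300, 130, 98, 101, 115, 125,
                     99, 108, 112, 118, 122, 97, 103, 107, 109, 400]

p95 = percentile_nearest_rank(response_times_ms, 95)
mean = sum(response_times_ms) / len(response_times_ms)

print(p95)   # 400: 95% of the queries finished within this time
print(mean)  # pulled upward by the single 2300 ms outlier
```

A percentile is less sensitive than the mean to a single extreme outlier while still reflecting the slow tail that users notice.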

The following table shows information about software and hardware used in this performance benchmark.

Item	Value
Instance type	m5.4xlarge
Node memory	64 GB
Node vCPU	16
Node disk	400 * 2; SSD
Network bandwidth	Up to 10 Gbps
Node count	A master node and four worker nodes
Allocated memory on Yarn	202 GB
Allocated cores on Yarn	52
Kylin version	3.1.2 & 4.0.0
EMR version	5.31
Hadoop version	2.10.0
HBase version	1.4.13

Benchmark results

[Figure: Cubing duration of TPC-H (sf = 10)]

[Figure: Storage size of TPC-H (sf = 10)]

[Figure: Average response time of SSB query (sf = 10)]

[Figure: Average response time of TPC-H query (sf = 10)]

Conclusions

Cubing duration and cube size

Compared with Kylin 3’s MapReduce cube engine, Kylin 4 reduces cubing duration by 62.6%, thanks to higher resource utilization and no longer having to convert cuboids to a specific data format (HFile).

In Kylin 3, the cuboid files are stored in two different formats, whereas Kylin 4 uses Parquet. Because Parquet has better encoding efficiency and a higher compression ratio, the disk space consumed by the same cubes was reduced by 72.56%.
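
To put the two reported reductions in perspective, the short calculation below converts each percentage into the fraction that remains and the equivalent improvement factor. It uses only the figures quoted above; the helper name is illustrative.

```python
def reduction_to_ratio(reduction_percent):
    """Convert a percentage reduction into (remaining fraction, improvement factor)."""
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction_percent must be in [0, 100)")
    remaining = 1 - reduction_percent / 100
    return remaining, 1 / remaining


cubing_remaining, cubing_speedup = reduction_to_ratio(62.6)
size_remaining, size_factor = reduction_to_ratio(72.56)

print(f"cubing: {cubing_remaining:.1%} of Kylin 3 time, {cubing_speedup:.2f}x faster")
print(f"size:   {size_remaining:.2%} of Kylin 3 size, {size_factor:.2f}x smaller")
```

In other words, a 62.6% shorter build is roughly 2.7 times faster, and a 72.56% smaller cube occupies a little over a quarter of the original disk space.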

[Figure: Kylin 3 (MapReduce engine) has lower resource utilization.]

[Figure: Kylin 4 (new Spark engine) has higher and more stable resource utilization.]

Query performance

In big query scenarios (queries that scan and do complex on-site calculations on large numbers of partitions/files), Kylin 3 query optimization is difficult, requiring repeated tuning of HBase RegionServers and Kylin query servers. In stress test scenarios, query nodes become unstable because they need to do post-calculation on large data sets, and performance (query latency) degrades over time. Kylin 4 removes the single bottleneck of the Kylin query server, significantly improving both response time and QPS. Further, performance is stable during the stress test. In the TPC-H query set, the response time of Kylin 4 is improved by 5x to 7x, and its concurrency is improved by 4x.

[Figure: P95 response time of TPC-H query under different concurrency levels.]

In point query scenarios (queries that scan small numbers of partitions/files and do not do many on-site calculations), Kylin 4 can meet the sub-second query latency requirement after some simple parameter adjustments, and its performance is close to that of Kylin 3 (specifically, only slightly worse).

Cost of learning and difficulty of performance optimization (parameter adjustment)

Kylin 3 has many build steps including steps that depend on different components, such as Hive, MapReduce, and HBase. Operating Kylin 3 requires learning and understanding many architectures and technical details, and being familiar with many parameters related to these components.

Kylin 4 removes all of these limitations. Cubing and queries in Kylin 4 both run on the popular Spark engine, so new users only need to learn Spark to understand and tune the relevant parameters. Learning materials for Spark are easy to find, and the commonly used parameters are far fewer than in Kylin 3.

Xiaoxiang Yu is a software engineer at Kyligence and member of Apache Kylin PMC (Project Management Committee). He became an active Kylin project maintainer in 2018 and has been the release manager of several of the most recent versions. Yaqian Zhang is a software engineer at Kyligence, committer and maintainer of Apache Kylin.

—

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.
