Martin Heller
Contributor

Review: Databricks Lakehouse Platform

Reviews
Apr 27, 2022 | 15 mins
Analytics | Cloud Computing | Data Management

Databricks Lakehouse Platform combines cost-effective data storage with machine learning and data analytics, and it's available on AWS, Azure, and GCP. Could it be an affordable alternative for your data warehouse needs?


Data lakes and data warehouses used to be completely different animals, but now they seem to be merging. A data lake was a single data repository that held all your data for analysis. The data was stored in its native form, at least initially. A data warehouse was an analytic database, usually relational, created from two or more data sources. The data warehouse was typically used to store historical data, most often using a star schema or at least a large set of indexes to support queries.

Data lakes contained a very large amount of data and usually resided on Apache Hadoop clusters of commodity computers, using HDFS (Hadoop Distributed File System) and open source analytics frameworks. Originally, analytics meant MapReduce, but Apache Spark made a huge improvement in processing speed. It also supported stream processing and machine learning, as well as analyzing historic data. Data lakes didn’t impose a schema on data until it was used—a process known as schema on read.

Data warehouses tended to have less data but it was better curated, with a predetermined schema that was imposed as the data was written (schema on write). Since they were designed primarily for fast analysis, data warehouses used the fastest possible storage, including solid-state disks (SSDs) once they were available, and as much RAM as possible. That made the storage hardware for data warehouses expensive.

Databricks was founded by the people behind Apache Spark, and the company still contributes heavily to the open source Spark project. Databricks has also contributed several other products to open source, including MLflow, Delta Lake, Delta Sharing, Redash, and Koalas.

This review is about Databricks’ current commercial cloud offering, Databricks Lakehouse Platform. Lakehouse, as you might guess, is a portmanteau of data lake and data warehouse. The platform essentially adds fast SQL, a data catalog, and analytics capabilities to a data lake. It has the functionality of a data warehouse without the need for expensive storage.

Direct competitors to Databricks Lakehouse Platform include Dremio, Ahana Presto, and Amazon Athena, in addition to the open source Apache Spark project. Data warehouses that support external tables, such as Snowflake and Azure Synapse Analytics, are indirect competitors.

The following screenshot is an overview of Databricks Lakehouse Platform. Note that as of this writing, Unity Catalog is not yet generally available.

Databricks Lakehouse Platform. IDG

A top-level view of Databricks Lakehouse Platform.

Getting started with Databricks Lakehouse

Databricks Lakehouse Platform is offered on the three major clouds: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). It is also available on Alibaba Cloud. Databricks Lakehouse is supported as a first-party product on Azure, which is where I did my hands-on trial for this review. The platform is not offered on-prem, although you can deploy its open source products wherever you wish.

While you can accomplish most data science, data engineering, and machine learning tasks in Databricks Lakehouse using notebooks, the command-line interface, and the console interface, you can also use various client tools and languages. Databricks notebooks support Python, R, and Scala.

For external IDEs, Databricks Lakehouse supports Eclipse, IntelliJ, Jupyter, PyCharm, SBT, sparklyr and RStudio Desktop, SparkR and RStudio Desktop, and Visual Studio Code. It also has connectors and drivers, including the Databricks SQL Connector for Python, pyodbc, the Databricks ODBC driver, and the Databricks JDBC driver.
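
As a minimal sketch of how the Databricks SQL Connector for Python is used, the snippet below connects to a SQL endpoint and runs a trivial query; the hostname, HTTP path, and access token are placeholders you would copy from your own workspace, not real values.

    # Minimal sketch: query a Databricks SQL endpoint from Python.
    # The connection values below are placeholders, not a real workspace.
    from databricks import sql

    connection = sql.connect(
        server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder
        http_path="/sql/1.0/endpoints/0123456789abcdef",               # placeholder
        access_token="dapi-XXXXXXXXXXXXXXXX",                          # placeholder
    )
    cursor = connection.cursor()
    cursor.execute("SELECT current_date()")
    print(cursor.fetchall())
    cursor.close()
    connection.close()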

Here’s a look at the opening screen for Azure Databricks.

The initial screen for Azure Databricks. IDG

The Get Started column on the left offers introductory actions; the Data Science & Engineering screen on the right points to routine actions.

I went through three tutorials using Azure Databricks. One covered basic data science and engineering using the lakehouse architecture provided by Delta Lake. The other two covered machine learning and SQL, respectively.

Delta Lake

The Delta Lake documentation describes Delta Lake as an open source project that can be used to build a lakehouse architecture on top of existing data lakes such as Amazon S3, Azure Data Lake Storage, Google Cloud Storage, and HDFS. Delta Lake provides ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing.

The Delta Lake documentation has more about these offerings:

  1. ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data.
  2. Scalable metadata handling: Leverages Spark's distributed processing power to handle all the metadata for petabyte-scale tables with billions of files with ease.
  3. Streaming and batch unification: A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  4. Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during ingestion.
  5. Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
  6. Upserts and deletes: Supports merge, update and delete operations to enable complex use cases like change-data-capture, slowly-changing-dimension (SCD) operations, streaming upserts, and so on.

The Delta file format combines the information needed to support Delta Lake operations with a Parquet file, meaning that it carries over Parquet’s efficient compressed columnar storage. The additional transaction log is written in JSON format in a subdirectory.
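
You can see that layout for yourself with a minimal sketch like the following, run in a Databricks notebook. The path is a placeholder; spark and dbutils are predefined in Databricks notebooks.

    # Minimal sketch: write a tiny Delta table and inspect its on-disk layout.
    path = "/tmp/delta/events"                     # placeholder path
    spark.range(0, 1000).write.format("delta").mode("overwrite").save(path)

    display(dbutils.fs.ls(path))                   # Parquet data files plus a _delta_log directory
    display(dbutils.fs.ls(path + "/_delta_log"))   # JSON transaction-log entries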

Data science and engineering on Databricks Lakehouse

It took me about two hours to step through the data science and engineering notebook supplied for this trial.

The Delta format is an enhancement of the Parquet format, so you can substitute the word “delta” for the word “parquet” in Apache Spark and Spark SQL commands.

The Delta format is an enhancement of the Parquet format. IDG

The Delta format is an enhancement of the Parquet format.

Before you can do much with a Databricks notebook or workspace, you need to create or access a compute cluster and attach it to the notebook you want to execute. In the following screenshot, we see the configuration of a standard cluster for running notebooks.

Configuration of a standard cluster for running notebooks. IDG

A standard cluster configured for running notebooks.

There are three basic ways to upgrade a Parquet table to a Delta table. In a Spark Python notebook, you can use spark.read.format("parquet") to read a Parquet file into a data frame and then write it out using the data frame’s write.format("delta") method.

How to upgrade a Parquet table to a Delta table (1). IDG

Upgrading a Parquet table to a Delta table in a Spark Python notebook.

You can accomplish essentially the same thing using the Spark SQL construction CREATE TABLE <table> USING DELTA AS SELECT * FROM parquet.`<path>`.

How to upgrade a Parquet table to a Delta table (2). IDG

You can use a Spark SQL construction to convert a Parquet table to a Delta table.

The simplest SQL statement to convert to Delta format in place is CONVERT TO DELTA.

How to upgrade a Parquet table to a Delta table (3). IDG

You may also use the CONVERT TO DELTA statement.
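
Putting the three methods together, here is a minimal sketch of all three upgrade paths as they might look in a notebook; the paths and table name are placeholders, and in practice you would pick just one approach.

    # Minimal sketch: three ways to turn a Parquet table into a Delta table.
    # Paths and the table name are placeholders; pick one approach in practice.

    # 1. Read the Parquet data and rewrite it in Delta format
    df = spark.read.format("parquet").load("/tmp/parquet/loans")
    df.write.format("delta").mode("overwrite").save("/tmp/delta/loans")

    # 2. Create a Delta table from the Parquet files with Spark SQL
    spark.sql("""
        CREATE TABLE loans_delta
        USING DELTA
        AS SELECT * FROM parquet.`/tmp/parquet/loans`
    """)

    # 3. Convert the Parquet files to Delta format in place
    spark.sql("CONVERT TO DELTA parquet.`/tmp/parquet/loans`")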

Delta table stream capabilities take full advantage of Spark streams. Here, we see two streaming writes that are being appended to a table.

Two streaming writes being appended to a table. IDG

Delta files can handle streams as well as static batch data.

The spark.readStream API can read and separate batch data and stream data, as shown by this display of 10-second window histograms.

A display of 10-second window histograms. IDG

The spark.readStream API reads and separates batch data and stream data.
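
As a rough sketch of what the notebook does, the following uses Spark's built-in rate source to append a stream into a Delta table at a placeholder path, then reads the same table back as a stream aggregated into 10-second windows. The path and column names are assumptions for illustration.

    # Minimal sketch: stream into a Delta table, then read it back in 10-second windows.
    from pyspark.sql.functions import window, col

    stream_path = "/tmp/delta/loans_stream"        # placeholder path

    # Append a synthetic stream into the Delta table
    (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
        .selectExpr("value AS loan_id", "timestamp")
        .writeStream.format("delta")
        .option("checkpointLocation", stream_path + "/_checkpoint")
        .outputMode("append")
        .start(stream_path))

    # Read the same Delta table as a stream and aggregate into 10-second windows
    windowed = (spark.readStream.format("delta").load(stream_path)
        .groupBy(window(col("timestamp"), "10 seconds"))
        .count())
    display(windowed)   # Databricks renders the streaming aggregate as a live, updating chart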

Delta tables support “time travel” version queries, restore from a previous version, schema enforcement, schema evolution, data manipulation language (DML) operations, and transaction history.

A display of historical version queries. IDG

The VERSION AS OF clause turns SELECT statements into historical version queries.

You can roll back to any version of a Delta table using the RESTORE VERSION AS OF <n> SQL command. Here, we are rolling back to before the streams started.

It is easy to roll back to a previous version of a Delta table. IDG

You can roll back to any version of a Delta table; in this case, we are rolling back to before the streams started.
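
In notebook form, time travel and rollback look roughly like this sketch; the table name and version number are placeholders.

    # Minimal sketch: query an old version of a Delta table, then roll back to it.
    spark.sql("SELECT COUNT(*) FROM loans_delta VERSION AS OF 0").show()

    # Restore the table to that version, rolling back later writes
    spark.sql("RESTORE TABLE loans_delta TO VERSION AS OF 0")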

You can’t blithely change the schema of a Delta Lake table. Here, we attempt to add a column, but the Delta engine throws an analysis exception for the schema mismatch.

The screen shows an analysis exception. IDG

The Delta engine throws an analysis exception.

Of course, sometimes you do want to change the schema. Delta Lake supports schema evolution using the mergeSchema option to the new_data.write.format command.

Delta Lake supports schema evolution. IDG

Delta Lake supports schema evolution using the mergeSchema option to the new_data.write.format command.
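
A minimal sketch of both behaviors, assuming a Delta table at a placeholder path and a DataFrame new_data that carries an extra column not present in the table's schema:

    # Minimal sketch: schema enforcement rejects a mismatched write;
    # mergeSchema evolves the table schema instead. The path is a placeholder.
    delta_path = "/tmp/delta/loans"

    try:
        new_data.write.format("delta").mode("append").save(delta_path)
    except Exception as e:
        print("Schema mismatch rejected:", e)   # AnalysisException from schema enforcement

    (new_data.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")          # evolve the schema to add the new column
        .save(delta_path))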

Parquet and Apache ORC file formats don't support DML commands, but Delta Lake supports DELETE, UPDATE, and MERGE INTO. These commands are essential for complying with a mandated request to remove a user's personal information under the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA).

An SQL DELETE command deletes a loan ID. IDG

We can delete a specific loan ID using one line of standard SQL.
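
Here is a minimal sketch of the three DML operations; the table, columns, source table, and ID values are hypothetical.

    # Minimal sketch of Delta Lake DML; names and values are hypothetical.
    spark.sql("DELETE FROM loans_delta WHERE loan_id = 4420")
    spark.sql("UPDATE loans_delta SET funded_amnt = 22000 WHERE loan_id = 21325")
    spark.sql("""
        MERGE INTO loans_delta AS target
        USING loan_updates AS source
        ON target.loan_id = source.loan_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)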

ACID transactions and the Delta Lake transaction log also support the time travel and DML features that we’ve just seen.

DESCRIBE HISTORY displays the transaction log in a table. IDG

The DESCRIBE HISTORY command displays the transaction log in a tabular format.
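
From a notebook, the same history is one line of Spark SQL; the table name is a placeholder.

    # Minimal sketch: show the transaction log, including version, timestamp, and operation.
    display(spark.sql("DESCRIBE HISTORY loans_delta"))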

Machine learning on Databricks Lakehouse

Machine learning capabilities differentiate Databricks from some of its competitors:

Databricks Runtime ML clusters include the most popular machine learning libraries, such as TensorFlow, PyTorch, Keras, and XGBoost, and also include libraries required for distributed training such as Horovod. Using Databricks Runtime ML speeds up cluster creation and ensures that the installed library versions are compatible.

In addition, Databricks offers AutoML, Feature Store, pipelines, MLflow, and SHAP (SHapley Additive exPlanations) capabilities.

I went through a hands-on tutorial using Databricks Machine Learning to analyze a telco churn prediction problem. It starts with data exploration, which is standard. In this case the basic descriptive statistics don’t tell us much.

Initial data exploration. IDG

To start the churn analysis, we try basic descriptive statistics and data exploration. They’re not very useful.

The analysis moves on to running AutoML with a limit of five models. We very quickly see results, but there isn’t enough training data included (specifically, not enough columns) for a good model. As you’ll see, there’s more data we can use, as well as several machine learning techniques for building a better model.

A quick run with AutoML. IDG

A quick AutoML run gives us a best result using an XGBoost model. The best val F1 score is only 0.649, however. This needs more work.
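
Kicking off a comparable run from a notebook takes a few lines of the Databricks AutoML API. In this sketch the input DataFrame and the "churn" target column are assumptions, not the tutorial's exact names.

    # Minimal sketch of Databricks AutoML; the DataFrame and column names are assumptions.
    from databricks import automl

    summary = automl.classify(
        dataset=input_df,        # Spark DataFrame of training data (hypothetical)
        target_col="churn",      # label column (hypothetical)
        primary_metric="f1",     # optimize for F1, as in the tutorial
        timeout_minutes=30,
    )
    print(summary.best_trial.metrics)   # metrics of the best model found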

Next, we try merging the existing training data with additional data from the Databricks Feature Store, which “just happened” to be left over from a previous analysis.

Merging the training data with additional data. IDG

The two tables are joined on the customerID column, the customerID column is excluded from the training set, and the table is saved in Delta format.
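
With the Databricks Feature Store client, that join looks roughly like the following sketch; the feature table name and the labels_df DataFrame are assumptions for illustration.

    # Minimal sketch: join Feature Store features into a training set.
    from databricks.feature_store import FeatureStoreClient, FeatureLookup

    fs = FeatureStoreClient()
    lookups = [FeatureLookup(
        table_name="churn_features.service_features",   # hypothetical feature table
        lookup_key="customerID",
    )]
    training_set = fs.create_training_set(
        df=labels_df,                    # DataFrame with customerID and the churn label
        feature_lookups=lookups,
        label="churn",
        exclude_columns=["customerID"],  # drop the join key from the training data
    )
    training_df = training_set.load_df()
    training_df.write.format("delta").mode("overwrite").saveAsTable("churn_training")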

Using MLflow and scikit-learn, we fit the expanded training table to a LightGBM model and get an F1 score close to 0.8.

Put the expanded training table into a LightGBM model. IDG

With an F1 score close to 0.8, we’re getting close.
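
A minimal sketch of that training step, assuming the expanded table has already been converted to pandas features X and labels y (both assumptions):

    # Minimal sketch: track a LightGBM fit with MLflow and log the validation F1 score.
    import mlflow
    import mlflow.sklearn
    from lightgbm import LGBMClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    with mlflow.start_run(run_name="lightgbm_churn"):
        model = LGBMClassifier(n_estimators=200, learning_rate=0.1)   # illustrative hyperparameters
        model.fit(X_train, y_train)
        val_f1 = f1_score(y_val, model.predict(X_val))
        mlflow.log_metric("val_f1", val_f1)
        mlflow.sklearn.log_model(model, "model")   # LGBMClassifier follows the scikit-learn API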

A SHAP plot shows the relative impact of each variable on the model’s prediction. SHAP is a post-fit XAI (explainable AI) technique.

A SHAP plot. IDG

A SHAP plot shows the relative impact of each variable on the model’s prediction.
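
Generating the plot for the LightGBM model from the earlier sketch takes just a few lines; model and X_val carry over from that sketch.

    # Minimal sketch: SHAP values for the fitted LightGBM model from the sketch above.
    import shap

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_val)
    shap.summary_plot(shap_values, X_val)   # relative impact of each feature on the prediction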

t-Distributed Stochastic Neighbor Embedding (t-SNE) is an unsupervised, non-linear technique primarily used for data exploration and visualizing high-dimensional data. Here, a PCA (principal component analysis) projection provides the initial low-dimensional embedding, which t-SNE then refines. Note the obvious clustering patterns.

Using PCA initialization for dimensionality reduction. IDG

t-Distributed Stochastic Neighbor Embedding (t-SNE) is an unsupervised, non-linear technique primarily used for data exploration and visualizing high-dimensional data.
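
A minimal scikit-learn sketch of the same idea, reusing the assumed X_val and y_val from the earlier sketches:

    # Minimal sketch: t-SNE with PCA initialization on the validation features.
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    embedding = TSNE(n_components=2, init="pca", perplexity=30,
                     random_state=42).fit_transform(X_val)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=y_val, s=5, cmap="coolwarm")
    plt.title("t-SNE projection of the churn features")
    plt.show()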

Here, we fine-tune the model by using hyperparameter optimization to search for the best parameters.

Hyperparameter optimization searches for the best parameters. IDG

Fine-tuning the model with hyperparameter optimization.
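
Databricks Runtime ML bundles Hyperopt for this kind of search. Here is a minimal sketch with an illustrative search space, reusing the assumed training and validation splits from the earlier sketches.

    # Minimal sketch: Hyperopt search over LightGBM hyperparameters (illustrative space).
    from hyperopt import fmin, hp, tpe, SparkTrials, STATUS_OK
    from lightgbm import LGBMClassifier
    from sklearn.metrics import f1_score

    def objective(params):
        model = LGBMClassifier(
            n_estimators=int(params["n_estimators"]),
            learning_rate=params["learning_rate"],
            max_depth=int(params["max_depth"]),
        )
        model.fit(X_train, y_train)
        # Hyperopt minimizes the loss, so negate the F1 score
        return {"loss": -f1_score(y_val, model.predict(X_val)), "status": STATUS_OK}

    space = {
        "n_estimators": hp.quniform("n_estimators", 100, 500, 50),
        "learning_rate": hp.loguniform("learning_rate", -4, -1),
        "max_depth": hp.quniform("max_depth", 3, 10, 1),
    }
    best = fmin(objective, space, algo=tpe.suggest, max_evals=32,
                trials=SparkTrials(parallelism=4))   # distribute trials across the cluster
    print(best)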

We run a final experiment using the optimized parameters, and get an F1 score above 0.8.

The final experiment. IDG

Our final experiment yields an F1 score above 0.8—success!

Finally, we test the accuracy score of the registered model. If it is at least 0.8, we can promote it to production in the model registry.

Testing the model's accuracy score. IDG

Testing the accuracy score of the registered model.
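
A minimal sketch of that gate using the MLflow Model Registry; the registered model name is a placeholder, and X_val and y_val are the assumed validation data from earlier.

    # Minimal sketch: check accuracy of the latest registered version, then promote it.
    import mlflow
    from mlflow.tracking import MlflowClient
    from sklearn.metrics import accuracy_score

    model_name = "telco_churn"                     # placeholder registry name
    client = MlflowClient()
    latest = client.get_latest_versions(model_name, stages=["None"])[0]
    candidate = mlflow.pyfunc.load_model(f"models:/{model_name}/{latest.version}")

    if accuracy_score(y_val, candidate.predict(X_val)) >= 0.8:
        client.transition_model_version_stage(model_name, latest.version, stage="Production")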

SQL queries on Databricks Lakehouse

One of the features that makes Databricks Lakehouse Platform viable as a substitute for a standard data warehouse is its much improved SQL engine, called Photon. Photon is two to four times faster than Databricks’ previous SQL engine, and was the SQL engine used in Databricks’ stunning TPC-DS V3 benchmark results.

Using Databricks Photon SQL engine with spot instances brought the Price/Performance metric for a 100TB TPC-DS decision support benchmark down to $146, which is roughly 12 times better than the most expensive cloud data warehouse tested, and roughly 2.5 times better than the least expensive cloud data warehouse tested.

Databricks Photon SQL engine. IDG

Using Databricks Photon SQL engine with spot instances.

I went through a tutorial introduction to SQL on Databricks Lakehouse. This tutorial concentrated on building and running SQL queries, as opposed to the Python notebooks used in the first two tutorials.

One of the useful things you can build using Databricks Lakehouse Platform is a visualization based on an SQL query.

A dashboard shows visualizations based on an SQL query. IDG

This dashboard combines half a dozen visualizations and can be shared with others without granting them access to the data.

SQL endpoints are computation resources that run SQL queries. I used the provided starter endpoint.

The starter endpoint. IDG

The provided starter endpoint: a single, small cluster.

The Databricks SQL data explorer allows you to view your databases and tables, including the schema data shown here.

Databricks SQL data explorer. IDG

Viewing databases and tables with the Databricks SQL data explorer.

This next screen shows the sample data for the file we used to view the schema in the previous screen.

Sample data. IDG

Sample data for the tpch.orders Delta file.

The Databricks SQL Editor is closely linked to the Data Explorer. In this next screen, we load and run a SELECT query against six of the tables in the tpch database.

A SELECT query. IDG

A query against tables in the tpch database lists the total national revenue by year and nation.
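
For reference, a revenue-by-year-and-nation query over six TPC-H tables looks roughly like the sketch below. The column names follow the standard TPC-H schema and may differ from the sample database; in the SQL editor you would paste the bare SQL rather than wrap it in spark.sql.

    # Minimal sketch: total revenue by year and nation over six TPC-H tables.
    # Standard TPC-H column names are assumed; the sample schema may differ.
    revenue = spark.sql("""
        SELECT year(o.o_orderdate) AS order_year,
               n.n_name            AS nation,
               SUM(l.l_extendedprice * (1 - l.l_discount)) AS revenue
        FROM tpch.orders o
        JOIN tpch.customer c ON o.o_custkey   = c.c_custkey
        JOIN tpch.lineitem l ON l.l_orderkey  = o.o_orderkey
        JOIN tpch.supplier s ON l.l_suppkey   = s.s_suppkey
        JOIN tpch.nation   n ON c.c_nationkey = n.n_nationkey
        JOIN tpch.region   r ON n.n_regionkey = r.r_regionkey
        GROUP BY year(o.o_orderdate), n.n_name
        ORDER BY order_year, revenue DESC
    """)
    display(revenue)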

You can define one or more visualizations for each SQL query.

A stacked bar chart. IDG

This stacked bar chart appears in the visualization dashboard along with other visualizations.

Databricks SQL maintains a query history for you. Note that the SELECT year query we’ve been looking at took almost 30 seconds to run. Most of the other queries were generated by console actions.

A sample query history. IDG

A sample query history.

Clicking on a query in the history brings up details about the query plan and timings.

Query details. IDG

We can access details about the query plan and timing.

From the query details, we can bring up additional information such as a graph view of the query.

A graph view of the query. IDG

We can also bring up a graph view of the query.

A tree view offers additional information.

A tree view of the query. IDG

A tree view of the same query.

The monitoring screen for the SQL endpoint shows its utilization. The number of clusters here is only 1 or 0 because we didn’t allow for more than one small cluster, which is appropriate for a single endpoint user (in this case, me).

Monitoring the SQL endpoint. IDG

The monitoring screen for the SQL endpoint shows its utilization.

We can also define an alert tied to the results of an SQL query. This particular query is set to refresh every hour.

Place an alert on an SQL query. IDG

We’ll receive an alert if the top row of query results for revenue slides under $10 million.

Unity Catalog and Delta Sharing

The Azure Databricks Lakehouse instance that I reviewed did not include Unity Catalog or Delta Sharing. Instead, Databricks gave me a demonstration of a pre-release version of both features on AWS.

Unity Catalog provides fine-grained governance for data and AI on Lakehouse. One result is that Databricks Lakehouse can actually be the single source of truth for your enterprise. Other benefits are access permissions for workloads, access management across clouds, and access controls on tables, files, rows, and columns. You also get visibility into data lineage (shown below) and fine-grained auditing of data consumption.

A data lineage diagram. IDG

A data lineage diagram courtesy of Unity Catalog.

Delta Sharing allows you to securely share live data across clouds, platforms, and on-prem without any data duplication. There’s no more need to email data sets or post them on separate servers. A Delta Sharing server can interact with clients in all the major data tools and programming languages.
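
On the consumer side, reading a shared table with the open source delta-sharing client is only a couple of lines; the profile file and the share, schema, and table names below are placeholders from a hypothetical provider.

    # Minimal sketch: read a Delta Sharing table as a pandas DataFrame.
    import delta_sharing

    profile = "/dbfs/FileStore/config.share"             # credential file from the data provider (placeholder)
    table_url = profile + "#my_share.my_schema.my_table" # placeholder share, schema, and table

    df = delta_sharing.load_as_pandas(table_url)         # or delta_sharing.load_as_spark(table_url)
    print(df.head())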

Conclusion

Overall, Databricks Lakehouse Platform is almost excellent as a data lake that can replace a data warehouse. It’s not for everyone, however, and it’s not quite feature complete.

To use Databricks Lakehouse effectively through its own interfaces, you need to know SQL and be able to program in at least one of its supported languages: Python, Scala, or R. The Databricks APIs allow other programs to control it, and some of those can serve as easy-to-use front ends, for example Tellius.

On the machine learning side, Databricks now supports most of the prominent machine learning and deep learning frameworks, in addition to its own machine learning libraries. It offers both AutoML and hyperparameter optimization to automate the machine learning process. That said, it’s still best suited for data scientists: A business analyst without a background in statistics and machine learning could easily accept bad models as adequate, or have difficulty deciding how to improve weak models.

The promised Unity Catalog and Delta Sharing, which should improve governance and live data sharing, are not yet generally available, and not completely fleshed out. When both of these are finished and released, Databricks Lakehouse Platform should indeed be able to replace cloud data warehouses, but it will still be a platform best used by programmers and data scientists.

Cost: 14-day free trial (not including cloud infrastructure costs). Pricing is based on compute usage, with spot instance and committed usage discounts. You can choose pay-as-you-go or committed-use plans on AWS, Azure, and GCP.

Platform: Hosted on AWS, Azure, or GCP. Databricks Lakehouse Platform supports numerous client IDEs, connectors, drivers, and APIs.

Martin Heller
Contributor

Martin Heller is a contributing editor and reviewer for InfoWorld. Formerly a web and Windows programming consultant, he developed databases, software, and websites from his office in Andover, Massachusetts, from 1986 to 2010. More recently, he has served as VP of technology and education at Alpha Software and chairman and CEO at Tubifi.
