Anirban Ghoshal
Senior Writer

Snowflake adopts open source strategy to grab data catalog mind share

news
Jun 03, 2024 | 5 mins
Data Management | Data Warehousing

With its plan to make its Polaris data catalog open source, Snowflake hopes the new offering will be seen as vendor-neutral, boosting its attractiveness when compared to Databricks’ Unity Catalog.

Snowflake says it will open up the source code to its new Polaris Catalog, a strategy that suggests it wants to lure data catalog users away from rival Databricks’ Unity Catalog while bolstering the attractiveness of its own offering, analysts said.

“The move to launch Polaris Catalog provides a competitive response to Databricks’s Unity Catalog, thereby enhancing Snowflake’s value proposition, attracting a broader range of customers, and fostering a vibrant community around the new data catalog,” said Jayesh Chaurasia, analyst at research and advisory services firm Forrester.

How Polaris Catalog is different from Databricks’ Unity Catalog

Databricks’ Unity Catalog, which was made generally available in June 2022 and was updated with Okera’s capabilities the following year, is a closed-source unified governance offering that provides centralized access control, auditing, lineage, and data discovery capabilities across Databricks workspaces.

Polaris Catalog, released during Snowflake’s annual conference this week, offers similar capabilities to Unity Catalog, but is built atop the popular open source Apache Iceberg data table format.

“With Polaris Catalog, users now gain a single, centralized place for any engine to find and access an organization’s Iceberg tables with consistent security and full, open interoperability,” Snowflake said in a statement. The company added that Polaris Catalog relies on Iceberg’s open source REST protocol, which provides an open standard for users to access and retrieve data from any engine that supports the Iceberg REST API, including Apache Flink, Apache Spark, Dremio, Python, and Trino, among others.
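
To make the REST-based access pattern concrete, here is a minimal Python sketch of the URLs a client might request from an Iceberg REST catalog. It is an illustration rather than Snowflake's implementation: the host name is made up, the helper `iceberg_rest_paths` is hypothetical, and the paths follow the open Iceberg REST catalog specification as generally documented, not anything specific to Polaris. It assumes a single-level namespace and performs no network calls or authentication.

```python
from urllib.parse import quote


def iceberg_rest_paths(base_url, namespace, table, prefix=""):
    """Build example Iceberg REST catalog URLs (hypothetical helper).

    Assumes a single-level namespace; `prefix` is the optional path
    prefix a catalog may advertise through its config endpoint.
    """
    root = base_url.rstrip("/") + "/v1"
    # Catalog-scoped routes sit under the optional prefix.
    scoped = f"{root}/{quote(prefix, safe='')}" if prefix else root
    ns = quote(namespace, safe="")
    tbl = quote(table, safe="")
    return {
        # Catalog configuration and defaults
        "config": f"{root}/config",
        # List the tables registered in a namespace
        "list_tables": f"{scoped}/namespaces/{ns}/tables",
        # Load one table's metadata
        "load_table": f"{scoped}/namespaces/{ns}/tables/{tbl}",
    }


if __name__ == "__main__":
    # catalog.example.com is a placeholder, not a real Polaris endpoint.
    paths = iceberg_rest_paths("https://catalog.example.com/api", "sales", "orders")
    for name, url in paths.items():
        print(f"{name}: {url}")
```

Because every engine talks to the same set of HTTP routes, a Spark job and a Python script could, in principle, resolve the same table through one catalog, which is the interoperability argument Snowflake is making.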

The complexity and diversity of data systems, coupled with the universal desire of organizations to leverage AI, necessitate the use of an interoperable data catalog, which is likely to be open source in nature, according to Chaurasia.

“An open-source data catalog addresses interoperability and other needs, such as scalability, especially if it is built on top of a popular table format as Iceberg. This approach facilitates data management across various platforms and cloud environments,” Chaurasia said.

Separately, Stewart Bond, research vice president at market research firm IDC, pointed out that Polaris Catalog may have leveraged Apache Iceberg’s native Iceberg Catalogs and added enterprise-grade capabilities to them, such as managing multiple distributed instances of Iceberg repositories, providing data lineage, search capability for data utilities, and data description capabilities, among others.

Polaris Catalog, which Snowflake expects to open source in the next 90 days, can either be hosted in its proprietary AI Data Cloud or be self-hosted in an enterprise’s own infrastructure using containers such as Docker or Kubernetes.

“Since Polaris Catalog’s backend implementation will be open source, organizations can freely swap the hosting infrastructure while retaining all security controls and eliminating vendor lock-in,” the company said, adding that Polaris Catalog inside Snowflake’s AI Data Cloud is currently in public preview.

Is Polaris Snowflake’s ticket to garnering community goodwill?

While experts such as Forrester’s Chaurasia and dbInsight’s Tony Baer think that Polaris Catalog is an extended strategy for the company to broaden its reach to acquire new customers, The Futurum Group’s research vice president Steven Dickens thinks it is a “desperate” attempt to garner “goodwill” from customers and the open source community.

The soon-to-be-open-sourced data catalog, according to Dickens, is a direct consequence of Snowflake’s shortcomings and limitations, including poor interoperability, vendor lock-in, exorbitant costs, lack of innovation, and dependency on partnerships.

“Snowflake is notoriously expensive, and its cost structure has driven many customers to seek alternatives. Polaris can be seen as a last-ditch effort to retain customers by offering a potentially cheaper, open-source alternative,” Dickens said.

Further, Dickens sees Snowflake’s move to open-source Polaris Catalog as a way to counter its “slower, insular development pace.”

“Polaris is an attempt to leverage external innovation to compensate for Snowflake’s internal stagnation,” Dickens explained.

Polaris Catalog has open source rivals

Chaurasia and Dickens also pointed out that Polaris Catalog isn’t the only open source data catalog available in the market.

“There are several other open-source projects in the data cataloguing and metadata management space, including Apache Atlas, Amundsen, and LinkedIn’s DataHub. Each provides capabilities for data discovery, governance, and metadata management,” Chaurasia said.

Apache Atlas is designed for governance and compliance within Apache Hadoop environments, offering scalable metadata management, lineage, and governance capabilities for Hadoop and associated big data technologies. Amundsen, which originated at Lyft, aims to enhance the productivity of data analysts, scientists, and engineers by indexing data resources (metadata) and facilitating the discovery and exploration of datasets based on usage and relevance.

Another alternative is LinkedIn’s DataHub, which provides real-time metadata architecture that supports various data systems and environments through pluggable integration.

“It focuses on metadata ingestion, indexing, data discovery, and governance,” Chaurasia said, adding that Amundsen and DataHub have become popular due to their emphasis on user experience, support for multiple integrations (both real-time and batch), and data discovery capabilities in the wake of demand for efficient data management offerings.