by Benoit Dageville

Striking a balance with ‘open’ at Snowflake

Feature | May 20, 2021 | 13 mins
Analytics, Cloud Computing, Databases

At Snowflake, we fully embrace the value of open standards and open source. But we strive to avoid misguided applications of open that create costly complexity instead of low-cost ease of use.

The relative merits of “open” have been hotly debated in our industry for years. There is a sense in some quarters that being open is beneficial by default, but this view does not always fully consider the objectives being served. What matters most to the vast majority of organizations are security, performance, costs, simplicity, and innovation. Open should always be employed in service of those goals, not as the goal in itself.

When we develop products at Snowflake, we evaluate where open standards, open formats, and open source can create the best outcome for our customers. We believe strongly in the positive impact of open and we are grateful for the open source community’s efforts, which have propelled the big data revolution and much more. But open is not the answer in every instance, and by sharing our thinking on this topic we hope to provide a useful perspective to others creating innovative technologies.

Open is often understood to describe two broad elements: open standards and open source. We’ll look at each in more detail below.

Open standards

Open standards encompass file formats, protocols, and programming models, which include languages and APIs. Although open standards generally provide value to users and vendors alike, it’s important to understand where they serve higher-level priorities and where they do not.

File formats

We agree that open file formats are an important counter to the very real problem of vendor lock-in. Where we differ is in the assertion that those open formats are the optimal way to represent data during processing, and that direct file access should be a key characteristic of a data platform. 

At first glance, the ability to directly access files in a standard, well-known format is appealing, but it becomes troublesome when the format needs to evolve. Consider an enhancement that enables better compression or better processing. How do we coordinate across all possible users and applications to understand the new format?

Or consider a new security capability where data access depends on a broader context. How do we roll out a new privacy capability that reasons through a broader semantic understanding of the data to avoid re-identification of individuals? Is it necessary to coordinate all possible users and applications to adopt these changes in lockstep? What happens if one of these is missed?

Our long experience with these trade-offs gives us a strong conviction about the superior value of providing abstraction and indirection versus exposing raw files and file formats. We strongly believe in API-driven access to data and in higher-level constructs that abstract away physical storage details. This is not about rejecting open; it’s about delivering better value for customers. We balance this by making it very easy to get data in and out in standard formats.
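To make the distinction concrete, consider a minimal Python sketch of the two access styles. The table, function, and connection names here are hypothetical, and the second function assumes any standard DB-API connection; neither is Snowflake’s actual interface.

    import pyarrow.parquet as pq

    # Direct file access: the consumer is coupled to the physical layout.
    # Any change to the underlying format (compression, encoding, file
    # organization) risks breaking this code or forcing a migration.
    def read_orders_directly(path):
        table = pq.read_table(path)  # assumes Parquet, forever
        return table.to_pylist()

    # API-driven access: the consumer states what it wants; the platform
    # remains free to change the physical representation underneath.
    def read_orders_via_api(conn):
        cur = conn.cursor()
        cur.execute("SELECT order_id, amount FROM orders")
        return cur.fetchall()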

A good illustration of where abstracting away the details of file formats significantly helps end users is compression. An ability to transparently modify the underlying representation of data to achieve better compression translates to storage savings, compute savings, and better performance. Exposing the details of file formats makes it next to impossible to roll out better compression without incurring long migrations, breaking changes, or added complexity for applications and developers. 
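Here is a minimal sketch of how this can work when the engine owns the storage layer; the codec choices and version scheme are hypothetical, not Snowflake’s internals. Each stored block records its format version, so a better codec can be adopted for new writes while every existing block remains readable, with no coordination required from consumers.

    import zlib
    import lzma

    # Hypothetical versioned block format: one version byte, then the
    # compressed payload. Consumers never see this; they only see rows
    # returned by the query API.
    _CODECS = {
        1: (zlib.compress, zlib.decompress),  # original codec
        2: (lzma.compress, lzma.decompress),  # newer, better compression
    }
    _CURRENT_VERSION = 2

    def write_block(payload):
        compress, _ = _CODECS[_CURRENT_VERSION]
        return bytes([_CURRENT_VERSION]) + compress(payload)

    def read_block(block):
        _, decompress = _CODECS[block[0]]
        return decompress(block[1:])

    # Blocks written before and after the codec upgrade coexist.
    old = bytes([1]) + zlib.compress(b"legacy data")
    assert read_block(old) == b"legacy data"
    assert read_block(write_block(b"fresh data")) == b"fresh data"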

Similar issues arise when we think about enhancements to security, data governance, data integrity, privacy, and many other areas. The history of database systems offers plenty of examples, such as ISAM and CODASYL, showing that physical access to data leads to an innovation dead end. More recently, adopters of Hadoop found themselves managing costly, complex, and unsecured environments that didn’t deliver the promised performance.

In a world with direct file access, introducing new capabilities translates into delays in realizing the benefits of those capabilities, complexity for application developers, and, potentially, governance breaches. This is another point arguing for abstracting away the internal representation of data to provide more value to customers, while supporting ingestion and export of open file formats. 

Open protocols and APIs

Data access methods are more important than file formats. We all agree that avoiding vendor lock-in is a key customer priority. But while some believe that open formats are the solution, the heavy lifting in any migration is really about code and data access, whether it’s protocols and connectivity drivers, query languages, or business logic. Those who have gone through a system migration can likely attest that the topic of file formats is a red herring.

For us, this is where open matters most — it’s where costly lock-in can be avoided, data governance can be maximized, and greater innovation is possible. Focusing on open protocols and APIs is key to avoiding complexity for users and enabling continuous, transparent innovation.

Open source

The benefits cited for open source include a greater understanding of the technology, increased security through transparency, lower costs, and community development. Open source can deliver against some of these goals, and does so primarily when technology is installed on-premises, but the shift to managed services greatly alters these dynamics.

When it comes to greater understanding of code, consider that a sophisticated query processor is typically built and optimized over several years by dozens of Ph.D. graduates. Making the source code available will not magically allow its users to understand its inner workings, but there may be greater value in surfacing data, metadata, and metrics that provide clarity to customers.

Another aspect of this discussion is the desire to copy and modify source code. This can provide value and optionality to organizations that can invest to build these capabilities, but we’ve also seen it lead to undesirable consequences, including fragmented platforms, less agility to implement changes, and competitive dysfunction. 

Increased security

This has traditionally been one of the main arguments for open source. When an organization deploys software within its security perimeter, source code availability can indeed increase confidence about security. But there is a growing awareness of the risks in software supply chains, and complex technology solutions often aggregate multiple software subsystems without an understanding of the full end-to-end impact on security.

Fortunately, there is a better model: deploying technology as managed cloud services. Encapsulating the inner workings of these services allows for faster evolution and speedy delivery of innovation to customers. With additional focus, managed services can remove the configuration burden and eliminate the effort required for provisioning and tuning.

Lower cost

Most organizations have recognized by now that not paying for a software license does not necessarily mean lower costs. That view overlooks not only the cost of maintenance and support, but also the cost and complexity of deploying, updating, and fixing software. Cost should be measured in terms of total cost of ownership and out-of-the-box price/performance. Here, too, managed services are preferable, removing among other things the need to manage versions, work around maintenance windows, and fine-tune software.

Community

One of the most powerful aspects of open source is the notion of community, by which a group of users work collaboratively to improve a technology and help one another. But collaboration does not need to imply source code contribution. We think of community as users helping one another, sharing best practices, and discussing future directions for the technology. 

As the shift from on-premises to the cloud and managed services continues, these topics of control, security, cost, and community recur. What’s interesting is that the original goals of open source are being met in these cloud environments without necessarily providing source code for everyone—which is where we started this discussion. We must not lose sight of the desired outcomes by focusing on tactics that may no longer be the best route to those outcomes.

Open at Snowflake

At Snowflake, we think about first principles, about desired outcomes, about intended and unintended consequences, and, most importantly, about what’s best for our customers. As such, we don’t think of open as a blanket, non-negotiable attribute of our platform, and we are very intentional in choosing where and how we embrace it. 

Our priorities are clear: 

  1. Deliver the highest levels of security and governance; 
  2. Provide industry-leading performance and price/performance through continuous innovation; and 
  3. Set the highest levels of quality, capabilities, and ease of use so our customers can focus on deriving value from data without the need to manage infrastructure. 

We also want to ensure that our customers continue to use Snowflake because they want to and not because they’re locked in. To the extent that open standards, open formats, and open source help us achieve those goals, we embrace them. But when open conflicts with those goals, our priorities dictate against it.

Open standards at Snowflake

With those priorities in mind, we have fully embraced standard file formats, standard protocols, standard languages, and standard APIs. We’re intentional about where and how we do so, and we have invested heavily in leveraging our parallel processing engine so that customers can get their data out of Snowflake quickly should they need or choose to. At the same time, abstracting away the details of our low-level data representation allows us to continually improve our compression and deliver other optimizations in a way that is transparent to users.
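As an illustration of how simple getting data out can be, unloading a table to an open format is a single statement. This sketch uses Snowflake’s Python connector; the account, credentials, table, and stage names are placeholders.

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="<password>",
        warehouse="my_wh", database="my_db", schema="public",
    )

    # Unload a table to Parquet files on a named stage; the parallel
    # engine writes the files, which can then be fetched with GET or
    # read by any Parquet-aware tool.
    conn.cursor().execute(
        "COPY INTO @my_stage/orders/ FROM orders "
        "FILE_FORMAT = (TYPE = PARQUET)"
    )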

We can also advance the controls for security and data governance quickly, without the burden of managing direct (and brittle) access to files. Similarly, our transactional integrity benefits from our level of abstraction and not exposing underlying files directly to users. 

We also embrace open protocols, languages, and APIs. This includes open standards for data access, traditional APIs such as ODBC and JDBC, and also REST-based access. Similarly, supporting the ANSI SQL standard is key to query compatibility while offering the power of a declarative, higher-level model. Other examples we embrace include enterprise security standards such as SAML, OAuth, and SCIM, and numerous technology certifications.
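In practice, this means application code written against the standard interfaces stays portable. A short sketch, with placeholder connection details: the Python connector follows the standard Python DB-API, and the query is plain ANSI SQL, so nothing below depends on Snowflake’s internal representation of the data.

    import snowflake.connector  # implements the standard Python DB-API

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="<password>",
    )
    cur = conn.cursor()

    # Plain ANSI SQL against standard cursor semantics; the same code
    # shape works with any DB-API-compliant database driver.
    cur.execute(
        "SELECT region, COUNT(*) AS n "
        "FROM customers GROUP BY region ORDER BY n DESC"
    )
    for region, n in cur.fetchall():
        print(region, n)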

With proper abstractions in place, and by promoting open where it matters, open protocols allow us to move faster (we don’t need to reinvent them), let our customers reuse their knowledge, and enable fast innovation by separating the “what” from the “how.”

Open source at Snowflake

We deliver a small number of components that get deployed as software into our customers’ systems, such as our connectivity drivers (JDBC, Python) and our Kafka connector. For all of these we provide the source code. Our goal is to enable maximum security for our customers: we deliver our core platform as a managed service, and we increase peace of mind for installable software by open-sourcing it.

However, a misguided application of open can create costly complexity instead of low-cost ease of use. Offering stable, standard APIs while not opening up our internals allows us to quickly iterate, innovate, and deliver value to customers. But customers cannot create—deliberately or unintentionally—dependencies on internal implementation details, because we encapsulate them behind APIs that follow solid software engineering practices. That is a major benefit for both sides, and it’s key to maintaining our weekly cadence of releases, to continuous innovation, and to resource efficiency. Customers who have migrated to Snowflake tell us consistently that they appreciate those choices.

The interface to our fully managed service, run in its own security perimeter, is the contract between us and our customers. We can do this because we understand every component and devote a great amount of resources to security. Snowflake has been evaluated by security teams across the gamut of company profiles and industries, including highly regulated industries such as healthcare and financial services. The system is not only secure, but the separation of the security perimeter through the clean abstraction of a managed service simplifies the job of securing data and data systems for customers.

On a final note, we love our user groups, our customer councils, and our user conferences. We fully embrace the value of a vibrant community, open communications, open forums, and open discussions. Open source is an orthogonal concept, from which we do not shy away. For example, we collaborated on open sourcing FoundationDB, and made significant contributions to evolving FoundationDB further. 

However, we don’t extrapolate from this to claim that open source software has inherent merit. We could equally have used a different operational store, adapting it to suit our requirements if needed. The FoundationDB example illustrates our key thesis: Open is a great collection of initiatives and processes, but it’s one of many tools. It is not the hammer for all nails, and it is the best choice only in some situations.

At Snowflake, we believe in the value of open standards and open source, but also in the value of data governance and security; we believe in the value of ease of use, the power of community, and the value of abstractions that enable transparent optimizations and improvements over time. Some companies pride themselves on being open source, but their embrace is rarely 100%; as described above, there are good reasons for such departures. Our goal is to be clear and transparent about how we think about these topics at Snowflake and to dispel any myths and misconceptions.

We remain committed to open-sourcing components that get deployed in customer premises or security perimeters, and to supporting the import and export of open formats. We remain committed to standards-based APIs and programming models. And above all, we remain committed to continuing to innovate and to raising the bar of what’s possible in our industry.

Benoit Dageville co-founded Snowflake and currently serves as president of the product division. Benoit is a leading expert in parallel execution and self-tuning database systems. Prior to founding Snowflake, Benoit was with Oracle for 15 years as a lead architect for parallel execution in Oracle RAC and a key architect in the SQL Manageability group. Prior to Oracle, Benoit worked at Bull Information Systems. He helped define the architecture and lead database performance efforts for Bull’s parallel systems. Benoit has a Ph.D. in computer science with a focus in parallel database systems and holds more than 80 patents.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.