by Simon Bisson

Contributor

Azure Databricks: Fast analytics in the cloud with Apache Spark

analysis

Nov 28, 2017 · 6 mins

Apache Spark, Cloud Computing, Data Management

Microsoft’s partnership with Databricks adds new analytics tools to Azure’s data platform

We’re living in a world of big data. The current generation of line-of-business computer systems generates terabytes of data every year, tracking sales and production through CRM and ERP. It’s a flood of data that’s only going to get bigger as we add the sensors of the industrial internet of things, and the data that’s needed to deliver even the simplest predictive-maintenance systems.

Having that data is one thing; using it is another. Big data is often unstructured, spread across many servers and databases. You need something to bring it together. That’s where big data analysis tools like Apache Spark come into play; these distributed analytical tools work across clusters of computers. Building on techniques developed for the MapReduce algorithms used by tools like Hadoop, today’s big data analysis tools go further to support more database-like behavior, working with in-memory data at scale, using loops to speed up queries, and providing a foundation for machine learning systems.

Apache Spark is fast, but Databricks is faster. Founded by the Spark team, Databricks is a cloud-optimized version of Spark that takes advantage of public cloud services to scale rapidly and uses cloud storage to host its data. It also offers tools to make it easier to explore your data, using the notebook model popularized by tools like Jupyter Notebooks.

Microsoft’s new support for Databricks on Azure, called Azure Databricks, signals a new direction for its cloud services, bringing Databricks in as a partner rather than through an acquisition.

Although you’ve always been able to install Spark or Databricks on Azure, Azure Databricks makes it a one-click experience, driving the setup process from the Azure Portal. You can host multiple analytical clusters, using autoscaling to minimize the resources in use. You can clone and edit clusters, tuning them for specific jobs or running different analyses on the same underlying data.

Configuring the Azure Databricks virtual appliance

The heart of Microsoft’s new service is a managed Databricks virtual appliance built using containers running on Azure Container Service. You choose the number of VMs in each cluster that it controls; once the appliance is configured and running, the service handles load automatically, starting new VMs as demand grows.
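As a rough illustration of the scaling behavior described above, the sketch below picks a worker count between a configured minimum and maximum based on pending work. The function name, its parameters, and the tasks-per-worker heuristic are all assumptions made for illustration; the service’s actual scaling logic isn’t described here.

```python
import math


def target_workers(pending_tasks: int, tasks_per_worker: int,
                   min_workers: int, max_workers: int) -> int:
    """Return how many workers a hypothetical autoscaler would request.

    Illustrative heuristic only: one worker per `tasks_per_worker` pending
    tasks, clamped to the [min_workers, max_workers] range chosen at setup.
    """
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    if min_workers > max_workers:
        raise ValueError("min_workers cannot exceed max_workers")
    wanted = math.ceil(pending_tasks / tasks_per_worker)
    return max(min_workers, min(max_workers, wanted))


if __name__ == "__main__":
    # Idle, moderate, and heavy load against a 2-to-8 worker cluster.
    for load in (0, 35, 500):
        print(load, target_workers(load, tasks_per_worker=10,
                                   min_workers=2, max_workers=8))
```

The clamp is the point of the sketch: the cluster never drops below the floor you set and never grows past the ceiling, which is how autoscaling keeps resource use bounded.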

Databricks’ tools interact directly with the Azure Resource Manager, which adds a security group and a dedicated storage account and virtual network to your Azure subscription. It lets you use any class of Azure VM for your Databricks cluster – so if you’re planning on using it to train machine learning systems, you’ll want to choose one of the latest GPU-based VMs. And of course, if one VM model isn’t right for your problem, you can switch it out for another. All you need to do is clone a cluster and change the VM definitions.
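The clone-and-change workflow can be pictured as copying a cluster definition and overriding a single field. The following is a minimal sketch, not the Azure Databricks API: `ClusterSpec`, `clone_cluster`, the field names, and the VM size strings are all hypothetical placeholders.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical cluster definition; field names are illustrative."""
    name: str
    vm_size: str
    min_workers: int
    max_workers: int


def clone_cluster(source: ClusterSpec, new_name: str, **changes) -> ClusterSpec:
    """Copy a cluster definition, overriding only the fields in `changes`."""
    return replace(source, name=new_name, **changes)


if __name__ == "__main__":
    # Placeholder VM size labels, not real Azure VM names.
    etl = ClusterSpec(name="etl", vm_size="general-purpose-vm",
                      min_workers=2, max_workers=8)
    training = clone_cluster(etl, "training", vm_size="gpu-vm")
    print(etl)
    print(training)
```

Because the spec is immutable, the original cluster definition is left untouched, so both clusters can run different analyses against the same underlying data.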

Querying in Spark brings engineering to data science

Spark has its own query language based on SQL, which works with Spark DataFrames to handle both structured and unstructured data. DataFrames are the equivalent of a relational table, constructed on top of collections of distributed data in different stores. Using named columns, you can construct and manipulate DataFrames with languages like R and Python; thus, both developers and data scientists can take advantage of them.

DataFrames are essentially a domain-specific language for your data, one that extends the data analysis features of your chosen platform. By using familiar libraries with DataFrames, you can construct complex queries that take data from multiple sources, working across columns.

Because Azure Databricks is inherently data-parallel, and its queries are evaluated only when called to deliver actions, results can be delivered very quickly. Because Spark supports most common data sources, either natively or through extensions, you can add Azure Databricks DataFrames and queries to existing data relatively easily, reducing the need to migrate data to take advantage of its capabilities.
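To make the idea of “evaluated only when called to deliver actions” concrete, here is a toy sketch of lazy evaluation in plain Python. It is not Spark’s API: the `LazyFrame` class and its `filter`, `select`, and `collect` methods are simplified stand-ins invented for this example, and the sensor data is made up.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


class LazyFrame:
    """Toy stand-in for a lazily evaluated table; not Spark's API."""

    def __init__(self, rows: Iterable[Row], steps=()):
        self._rows = rows
        self._steps = tuple(steps)  # recorded transformations, not yet run

    def filter(self, predicate: Callable[[Row], bool]) -> "LazyFrame":
        # Transformation: only records the step and returns a new frame.
        return LazyFrame(self._rows, self._steps + (("filter", predicate),))

    def select(self, *columns: str) -> "LazyFrame":
        # Transformation: keeps just the named columns, again deferred.
        return LazyFrame(self._rows, self._steps + (("select", columns),))

    def collect(self) -> List[Row]:
        # Action: this is the point where the recorded steps actually run.
        result = []
        for row in self._rows:
            keep = True
            for kind, arg in self._steps:
                if kind == "filter":
                    if not arg(row):
                        keep = False
                        break
                else:  # "select"
                    row = {name: row[name] for name in arg}
            if keep:
                result.append(row)
        return result


if __name__ == "__main__":
    readings = [
        {"sensor": "pump-1", "temp_c": 71.5},
        {"sensor": "pump-2", "temp_c": 48.0},
    ]
    hot = LazyFrame(readings).filter(lambda r: r["temp_c"] > 60).select("sensor")
    # Nothing has been evaluated yet; collect() triggers the work.
    print(hot.collect())
```

Building the query costs almost nothing because `filter` and `select` only record what to do. A real engine can use that recorded plan to optimize and parallelize the work before any data is touched, which this toy version does not attempt.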

Although Azure Databricks provides a high-speed analytics layer across multiple sources, it’s also a useful tool for data scientists and developers trying to build and explore new models, turning data science into data engineering. Using Databricks Notebooks, you can develop scratchpad views of your data, with code and results in a single view.

The resulting notebooks are shared resources, so anyone can use them to explore their data and try out new queries. Once a query is tested and turned into a regular job, its output can be exposed as an element of a Power BI dashboard, making Azure Databricks part of an end-to-end data architecture that allows more complex reporting than a simple SQL or NoSQL service, or even Hadoop.

Microsoft plus Databricks: a new model for Azure Services

Microsoft hasn’t yet detailed its pricing for Azure Databricks, but it does claim that it can improve performance and reduce cost by as much as 99 percent compared to running your own unmanaged Spark installation on Azure’s infrastructure services. If Microsoft’s claim bears out, that promises to be a significant saving, especially when you factor in no longer having to run your own Spark infrastructure.

Azure’s Databricks service will connect directly to Azure storage services, including Azure Data Lake, with optimizations for queries and caching. There’s also the option of using it with Cosmos DB, so you can take advantage of global data sources and a range of NoSQL data models, including MongoDB and Cassandra compatibility—as well as Cosmos DB’s graph APIs. It should also work well with Azure’s data-streaming tools, giving you a new option for near real-time IoT analytics.

If you’re already using Databricks’ Spark tools, this new service won’t affect you or your relationship with Databricks. It’s only if you take the models and analytics you’ve developed on-premises to Azure’s cloud that you’ll get a billing relationship with Microsoft. You’ll also have fewer management tasks, leaving you more time to work with your data.

Microsoft’s decision to work with an expert partner on a new service makes a lot of sense. Databricks has the expertise, and Microsoft has the platform. If the resulting service is successful, it could set a new pattern for how Azure evolves in the future, building on what businesses are already using and making them part of the Azure hybrid cloud without absorbing those services into Microsoft.

Author of InfoWorld's Enterprise Microsoft blog, Simon Bisson prefers to think of “career” as a verb rather than a noun, having worked in academic and telecoms research, as well as having been the CTO of a startup, running the technical side of UK Online (the first national ISP with content as well as connections), before moving into consultancy and technology strategy. He’s built plenty of large-scale web applications, designed architectures for multi-terabyte online image stores, implemented B2B information hubs, and come up with next generation mobile network architectures and knowledge management solutions. In between doing all that, he’s been a freelance journalist since the early days of the web and writes about everything from enterprise architecture down to gadgets.
