Microsoft’s new Azure data analytics platform is an evolution of Azure SQL Data Warehouse.

Hyperscale cloud services such as Azure are designed to work with large amounts of data, taking advantage of their economies of scale when purchasing storage hardware. Their close relationship with search engines like Bing and Google allows them to build on the algorithms and tools developed to analyze the public internet. It’s a combination that makes them an ideal platform for building applications that need to process massive data sets, at a scale that would be prohibitive in your own data center.

Microsoft has offered a range of data and analytics services on Azure since its early days, starting with its own SQL database (which quickly became a cloud-hosted version of the familiar SQL Server), adding HDInsight for Hadoop and other Apache data services, and offering a large-scale data lake that lets you mix structured and unstructured data.

Until recently most of these services have been stand-alone, and if you wanted to bring them together, you’d need to build your own analytics tooling. At Ignite 2019, Microsoft relaunched Azure’s existing SQL Data Warehouse as Azure Synapse, rearchitected and rebranded, adding support for Apache Spark and its own Studio development and analytics tools.

Introducing Azure Synapse

Azure Synapse is more than a rebranding of an existing product; it focuses on integrating many of Azure’s data analysis capabilities into a single service. Unlike traditional data warehouses, it supports mixed relational and unstructured data, while still allowing you to use existing SQL skills to build and test analytical models, building on Azure SQL’s PolyBase big data query engine. Because it uses column stores in memory, it’s fast and efficient, an important feature when you’re using a cloud service consumption model.

Where Synapse differs from other data warehouse products is its roots in Azure SQL’s hyperscale option.
Instead of a single compute node handling all your queries, it uses a cluster of what Microsoft is calling “data warehouse units.” These separate query compute from the underlying storage and let Synapse take a massively parallel approach to working with your queries. Each data warehouse unit has compute and a custom application, the Data Movement Service, that works across nodes and with Azure Storage to ensure that the right data is available in the right node. It’s certainly fast; a demo at Ignite compared it with Google’s BigQuery on a 30-petabyte data set, and showed Synapse to be 75 times faster.

There’s a price advantage to this architecture, too. Like much of Azure’s PaaS functionality, it’s essentially serverless, spinning up as required and pausing when no longer needed. This way you only pay for the compute used to run a query and for your underlying storage, adding more compute as required for more complex analysis.

Getting started with Azure Synapse SQL Pools

Creating a new SQL Pool for Azure Synapse follows the same path as Azure SQL Data Warehouse. Start by making a new SQL Data Warehouse resource in the Azure Portal. This lets you create a new server, with an admin login and password. Choosing a performance level picks the default server type, which sets the compute costs for your pool. Once the pool is provisioned and the appropriate firewall rules are set, you can start managing it with your choice of SQL Server management tool.

Queries are built using the familiar T-SQL dialect. You can use PolyBase to load data from Azure blob storage into staging tables, which are then used to build the production tables that you run your queries against.

The same platform will work with Apache Spark rather than SQL if you prefer to use it to analyze your data. That approach gives you an extra level of flexibility, letting you choose the appropriate tool for your problem.

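The PolyBase load path described above, from blob storage to a staging table to a production table, can be sketched as a short sequence of T-SQL statements. The Python below only assembles those statements as strings so the order is easy to see; it does not connect to Azure. Every name in it (the `examplestore` account, the `rawdata` container, the `ext_sales`, `stg_sales`, and `fact_sales` tables, and the column list) is a hypothetical placeholder, and credential setup for a private storage account is left out. Treat it as a minimal sketch under those assumptions rather than a recommended load design.

```python
"""Sketch: build the T-SQL statements for a PolyBase staging load.

Every object name here (storage account, container, tables, columns) is a
hypothetical placeholder, not something taken from the article.
"""
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BlobSource:
    account: str    # hypothetical storage account name
    container: str  # hypothetical blob container
    folder: str     # folder in the container that holds the delimited files

    @property
    def location(self) -> str:
        # wasbs:// is the TLS-secured Azure Blob Storage URI scheme.
        return f"wasbs://{self.container}@{self.account}.blob.core.windows.net"


def polybase_load_script(
    source: BlobSource, columns: Dict[str, str], distribution_key: str
) -> List[str]:
    """Return the load statements in the order they would be run."""
    if distribution_key not in columns:
        raise ValueError(f"{distribution_key!r} is not one of the table columns")

    column_defs = ",\n    ".join(
        f"{name} {sql_type}" for name, sql_type in columns.items()
    )

    return [
        # 1. Point the SQL pool at the blob container (credentials omitted).
        "CREATE EXTERNAL DATA SOURCE sales_blob\n"
        f"WITH (TYPE = HADOOP, LOCATION = '{source.location}');",
        # 2. Describe how the files are laid out.
        "CREATE EXTERNAL FILE FORMAT sales_csv\n"
        "WITH (FORMAT_TYPE = DELIMITEDTEXT, "
        "FORMAT_OPTIONS (FIELD_TERMINATOR = ','));",
        # 3. Expose the files as an external table; no data is copied yet.
        f"CREATE EXTERNAL TABLE ext_sales (\n    {column_defs}\n)\n"
        f"WITH (LOCATION = '{source.folder}', DATA_SOURCE = sales_blob, "
        "FILE_FORMAT = sales_csv);",
        # 4. Copy the external data into a staging table inside the pool.
        "CREATE TABLE stg_sales\n"
        "WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)\n"
        "AS SELECT * FROM ext_sales;",
        # 5. Build the production table, hash-distributed and column-stored.
        "CREATE TABLE fact_sales\n"
        f"WITH (DISTRIBUTION = HASH({distribution_key}), "
        "CLUSTERED COLUMNSTORE INDEX)\n"
        "AS SELECT * FROM stg_sales;",
    ]


if __name__ == "__main__":
    source = BlobSource(account="examplestore", container="rawdata", folder="/sales/")
    columns = {"sale_id": "INT", "customer_id": "INT", "amount": "DECIMAL(18, 2)"}
    for statement in polybase_load_script(source, columns, "customer_id"):
        print(statement, end="\n\n")
```

The printed statements could then be run from whichever SQL Server management tool you use against the pool. Choosing a round-robin heap for staging and a hash-distributed columnstore table for production is one common pattern, assumed here for illustration; the right distribution key depends on your own data and queries.
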
If you’ve got an existing Azure SQL Data Warehouse instance, you can run it in Azure Synapse, giving you more options for working with complex data sets that combine relational and nonrelational data. Interestingly, Microsoft promises integration with Open Data Initiative sources, which should include support for data from Adobe’s Marketing Cloud, from SAP, and from Microsoft’s own Dynamics platform. Microsoft is also planning connectors for other data sources, including third parties.

Building queries in Synapse Studio

Synapse Studio is a key element of the new combined analytics platform. It functions as a single pane of glass for building, testing, and viewing the results of queries. You can use it with code or without, picking and choosing different tables and sources before running queries and building reports. The same tool can be used for one-off queries that help solve a specific problem, or to build repeated queries that can be automated using Azure Data Factory where you need regular reports across your various data sources.

One useful feature is Azure Synapse’s integration with Power BI. Data analysts can use Azure Synapse to build models that can be handed over to business analysts, who can use Power BI’s visual analysis tools to build and run their own queries, looking for results that are relevant to their own interests.

This approach reduces the time needed to find the answers to specific business questions, as there’s no need for the data team to continually rebuild models and run expensive analytical processes while business analysts refine their queries. Power BI integration will ensure that prebuilt queries using machine learning and other Azure services can be built into executive dashboards for near real-time access to key business metrics.

From analytics to machine learning

Access to large data stores like those in Azure Synapse can help train your own machine learning models.
Synapse’s Studio development tools include a Jupyter Notebook-like sketchpad for building and testing new models before using them in Azure Machine Learning pipelines. There’s support for the R statistical data exploration language as well as for Python, the machine learning staple.

Microsoft describes Azure Synapse as a “petabyte scale” platform. Most of us don’t need to work at that scale, but the tools that deliver results at that scale work equally well with much smaller data sets, as well as with streamed data from IoT devices. Microsoft recommends using Azure Synapse with more than 1TB of data, with 250GB the minimum data set size.

The ability to start relatively small and quickly scale with Azure Synapse, and to share the data in your data warehouse with data scientists and business analysts as well as with data warehouse specialists, makes it an attractive alternative to other on-premises and in-cloud analytics tools, especially with the option of using role-specific tooling on the same underlying data sets.