Microsoft’s cloud-hosted data lake and lakehouse platform gains new data science tools and opens up Power BI datasets to Python, R, and SparkSQL.

The modern enterprise is powered by data, bringing together information from across the organization and using business analysis tools to answer relevant questions. Those tools give access to real-time information and use historical data to predict future trends based on the current state of the business.

Delivering that tooling depends on a common data layer across the enterprise, one that brings in many different sources and provides a single place to query that data. A common data layer, or “data fabric,” gives the organization a baseline of truth that can inform both short-term and long-term decision-making, powering instantaneous dashboard views as well as the machine learning models that help identify trends and issues.

Building up from the data lake

It wasn’t surprising to see Microsoft bring many of its data analysis tools together under the Microsoft Fabric brand, with a mix of relational and non-relational data stored in cloud-hosted data lakes and managed with lakehouses. Building on the open-source Delta table format and the Apache Spark engine, Fabric takes big data concepts and makes them accessible to both common programming languages and more specialized analytics tooling, like the visual data explorations and complex query engine provided by Power BI.

The initial preview releases of Microsoft Fabric focused on building out the data lakehouses and data lakes that are essential for at-scale, data-driven applications. Getting your data estate into the requisite shape for this scale of project takes a lot of heavy lifting, and that data engineering needs to be complete before you start to build more complex applications on top of your data.
Adding data science to data engineering

While the Fabric service remains in preview, Microsoft has continued to add new features and tools. The latest updates address the developer side of the story, adding integration with familiar developer tools and services, features that go beyond the basics of a set of REST APIs. These new tools bring Fabric to data scientists, linking Power BI datasets to Azure’s existing data science platform.

Power Query in Power BI is one of the most important tools in Microsoft’s data analysis platform. Perhaps best thought of as an extension of the pivot table tools in Excel, Power Query is a way of slicing and dicing large amounts of data across multiple sources and extracting relevant data quickly and easily. The key to Power BI’s analytical capabilities is DAX, Data Analysis Expressions, a query language for data analysis that provides the tools needed to filter and refine data.

Then there is Microsoft Fabric’s new semantic link feature, which provides a bridge between this data-centric world and the data science tools provided by languages like Python, using familiar Pandas and Apache Spark APIs. By adding these new libraries to your Python code, you can use semantic link from inside notebooks to build machine learning models in frameworks like PyTorch. You can then use your Power BI data with any of Python’s many numerical analysis tools, allowing you to apply complex analysis to datasets.

That’s an important development, bringing data analysis and data science together in tools and frameworks that are familiar to both sides. Semantic link lets BI and data science teams collaborate more effectively. The BI team can use tools like DAX to build their report datasets, which are then linked to the notebooks and models used by the data science team, ensuring that both teams are always working with the same data and the same models.

Using semantic link in Fabric workspaces

The semantic link Python API uses familiar Pandas methods.
From those methods you can discover and list the datasets and tables created by Power BI, and read the contents of the tables. If there are associated measures you can write code to evaluate them, and you can run DAX from your Python code.

You can use standard Python tools to install the semantic link library, as it’s available from PyPI, the Python Package Index, through pip. Once the library is installed in your Python workspace, all you need to do is import sempy.fabric to access your Fabric-hosted data, then use it to extract data for use in your Python code. As you’re working inside the context of your Fabric environment, there’s no need for additional authentication beyond your Azure login. Once you’re in your workspace you can create notebooks and load data.

The semantic link package is a meta-package, containing several different packages that can be installed individually if you prefer. One useful part of the package is a set of functions that let you use Fabric data as geodata, letting you quickly add geographic information to your Fabric data frames and use Power BI’s geographic tools in reports.

A useful feature for anyone working with semantic link in an interactive notebook is the ability to execute DAX code directly, using the IPython interactive syntax. As with ordinary Python code, you’ll need to install the library in your environment before loading sempy into the notebook. You can then use the %%dax cell magic to run DAX commands and view the output. This approach works well for experimenting with Fabric-hosted data, where data analysts and scientists are working together in the same notebook.

DAX queries can also be run directly from Python, with sempy’s evaluate_dax function. To use it, call the function with the name of the dataset and a string containing your query. The result comes back as a data frame that you can use in the rest of your application.

Other tools in the semantic link package help data scientists validate data.
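As a rough illustration of the Python workflow described above, the sketch below builds a DAX query string and hands it to sempy’s evaluate_dax function. The dataset, table, column, and measure names ("Sales Model", Customer, Country, Total Revenue) and the helper functions build_measure_query, run_dax, and preview_table are hypothetical, chosen only for this example. The sempy import is deferred so that the query-building step can run outside a Fabric notebook, and the sketch assumes the library has already been installed with pip (it is published as semantic-link).

```python
def build_measure_query(table, column, measure):
    """Build a DAX query that evaluates one measure grouped by one column.

    Hypothetical helper: names are inserted as-is, so a table name that
    contains an apostrophe would need escaping first.
    """
    return (
        "EVALUATE SUMMARIZECOLUMNS("
        f"'{table}'[{column}], "
        f'"{measure}", [{measure}])'
    )


def run_dax(dataset, query):
    """Run a DAX query against a Power BI dataset from a Fabric notebook."""
    # Imported lazily: sempy is only expected to be usable inside Fabric.
    import sempy.fabric as fabric

    return fabric.evaluate_dax(dataset, query)


def preview_table(dataset, table, rows=5):
    """List the datasets in the workspace, then read the top of one table."""
    import sempy.fabric as fabric

    print(fabric.list_datasets())
    return fabric.read_table(dataset, table).head(rows)


# Hypothetical table, column, and measure names, for illustration only.
query = build_measure_query("Customer", "Country", "Total Revenue")
print(query)
```

Inside a Fabric notebook, calling run_dax("Sales Model", query) should return the grouped measure as a data frame, which can then be passed to other Python analysis code. Keeping the query construction separate from the call makes it easy to inspect or log the DAX before it runs.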
The package’s validation tools are quick to use: a couple of lines of code will visualize the relationships in a dataset. Again, this is a useful tool for collaborative working, as the output can be used to refine the selections made in Power BI, helping to ensure that the right queries are used to build the dataset you want to use. Other options include the ability to visualize the dependencies between the entities in your data, helping you refine the results of your queries and understand the structures of your datasets.

A foundation for data science at scale

Finally, you’re not limited to Python notebooks. If you want to use big data tooling, you can work with both Power BI data and Spark data in a single query, as Power BI datasets are treated as Spark tables by Fabric. That means you can use PySpark to query across both Power BI data and Spark tables hosted in Fabric. You can even use Spark’s R and SQL tools if you prefer.

There’s a lot happening in Microsoft Fabric, with new features being added to the service preview on a monthly cadence. It’s clear that the semantic link library is only the start of bridging the divide between data analysis and data science, making it easier for users to build data-driven applications and services. It will be interesting to see what Microsoft does next.