by Michael Berthold

How CI/CD is different for data science

Feature
Nov 23, 2021 | 8 mins
Analytics | CI/CD | Data Science

Moving data science into production has quite a few similarities to deploying an application. But there are key differences you shouldn’t overlook.


Agile programming is the most-used methodology that enables development teams to release their software into production frequently in order to gather feedback and refine the underlying requirements. For agile to work in practice, however, processes are needed that allow the revised application to be built and released into production automatically—generally known as continuous integration/continuous deployment, or CI/CD. CI/CD enables software teams to build complex applications without running the risk of missing the initial requirements, by regularly involving the actual users and iteratively incorporating their feedback.

Data science faces similar challenges. Although the risk of data science teams missing the initial requirements is less of a threat right now (this will change in the coming decade), the challenge inherent in automatically deploying data science into production brings many data science projects to a grinding halt. First, IT too often needs to be involved to put anything into the production system. Second, validation is typically an unspecified, manual task (if it even exists). And third, updating a production data science process reliably is often so difficult, it’s treated as an entirely new project.

What can data science learn from software development? Let’s have a look at the main aspects of CI/CD in software development first before we dive deeper into where things are similar and where data scientists need to take a different turn.

CI/CD in software development

Repeatable production processes for software development have been around for a while, and continuous integration/continuous deployment is the de facto standard today. Large-scale software development usually follows a highly modular approach. Teams work on parts of the code base and test those modules independently (usually using highly automated test cases for those modules).

During the continuous integration phase of CI/CD, the different parts of the code base are plugged together and, again automatically, tested in their entirety. This integration job is ideally done frequently (hence “continuous”) so that side effects that do not affect an individual module but break the overall application can be found instantly. In an ideal scenario, when we have complete test coverage, we can be sure that problems caused by a change in any of our modules are caught almost instantaneously. In reality, no test setup is complete and the complete integration tests might run only once each night. But we can try to get close.
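As a minimal, self-contained illustration of that idea, the sketch below inlines two functions that, in a real code base, would live in separate modules owned by different teams; the module-level test checks one piece in isolation, while the integration test checks the pieces plugged together.

```python
import pytest

def apply_discount(price: float, percent: float) -> float:
    """'Pricing' module: discount a single price."""
    return price * (1 - percent / 100)

def order_total(prices: list[float], discount_percent: float) -> float:
    """'Checkout' module: total an order using the pricing module."""
    return sum(apply_discount(p, discount_percent) for p in prices)

def test_pricing_module():
    # Module-level test: the pricing logic in isolation.
    assert apply_discount(100.0, 10) == pytest.approx(90.0)

def test_checkout_integration():
    # Integration test: catches side effects that only appear
    # when the modules are combined.
    assert order_total([100.0, 50.0], 10) == pytest.approx(135.0)
```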

The second part of CI/CD, continuous deployment, refers to the move of the newly built application into production. Updating tens of thousands of desktop applications every minute is hardly feasible (and the deployment processes are more complicated). But for server-based applications, with increasingly available cloud-based tools, we can roll out changes and complete updates much more frequently; we can also revert quickly if we end up rolling out something buggy. The deployed application will then need to be continuously monitored for possible failures, but that tends to be less of an issue if the testing was done well.

CI/CD in data science

Data science processes tend not to be built by different teams independently but by different experts working collaboratively: data engineers, machine learning experts, and visualization specialists. It is extremely important to note that data science creation is not concerned with ML algorithm development—which is software engineering—but with the application of an ML algorithm to data. This difference between algorithm development and algorithm usage frequently causes confusion.

“Integration” in data science also refers to pulling the underlying pieces together. In data science, this integration means ensuring that the right libraries of a specific toolkit are bundled with our final data science process, and, if our data science creation tool allows abstraction, ensuring the correct versions of those modules are bundled as well.
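One way to picture this bundling step, sketched here in Python under the assumption that the toolkit is a set of Python packages (the file name and the package list are illustrative): record the exact library versions used during creation and verify them when the production process is assembled.

```python
import importlib.metadata
import json

PACKAGES = ["numpy", "pandas", "scikit-learn"]  # assumed toolkit

def snapshot_versions(path="model_requirements.json"):
    """Record the versions present in the creation environment."""
    versions = {pkg: importlib.metadata.version(pkg) for pkg in PACKAGES}
    with open(path, "w") as f:
        json.dump(versions, f, indent=2)
    return versions

def verify_versions(path="model_requirements.json"):
    """At integration time, fail fast if the bundled versions differ."""
    with open(path) as f:
        expected = json.load(f)
    for pkg, version in expected.items():
        found = importlib.metadata.version(pkg)
        if found != version:
            raise RuntimeError(f"{pkg}: expected {version}, found {found}")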

However, there’s one big difference between software development and data science during the integration phase. In software development, what we build is the application that is being deployed. Maybe during integration some debugging code is removed, but the final product is what has been built during development. In data science, that is not the case.

During the data science creation phase, a complex process has been built that optimizes how and which data are being combined and transformed. This data science creation process often iterates over different types and parameters of models and potentially even combines some of those models differently at each run. What happens during integration is that the results of these optimization steps are combined into the data science production process. In other words, during development, we generate the features and train the model; during integration, we combine the optimized feature generation process and the trained model. And this integration comprises the production process.
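A minimal sketch of this difference, assuming scikit-learn and synthetic stand-in data: the creation phase searches over feature-generation and model parameters, and what integration hands to production is the best fitted pipeline, not the search itself.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in training data; in reality this comes from the creation phase.
X_train, y_train = make_classification(n_samples=500, n_features=20, random_state=0)

# Creation phase: iterate over how features are generated and which
# model parameters are used.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest()),
    ("model", LogisticRegression(max_iter=1000)),
])
search = GridSearchCV(
    pipeline,
    param_grid={"select__k": [5, 10, "all"], "model__C": [0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_train, y_train)

# Integration phase: the artifact that moves on is the best fitted pipeline
# (optimized feature generation plus trained model), not the search itself.
joblib.dump(search.best_estimator_, "production_process.joblib")
```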

So what is “continuous deployment” for data science? As already highlighted, the production process—that is, the result of integration that needs to be deployed—is different from the data science creation process. The actual deployment is then similar to software deployment. We want to automatically replace an existing application or API service, ideally with all of the usual goodies such as proper versioning and the ability to roll back to a previous version if we capture problems during production.
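Reduced to a file-based sketch, such a deployment step might look like the following; the registry layout and paths are illustrative assumptions, not a specific product's API.

```python
import shutil
from pathlib import Path

REGISTRY = Path("model_registry")                  # versioned artifacts (assumed layout)
LIVE = Path("serving/production_process.joblib")   # what the API service loads

def deploy(version: str) -> None:
    """Replace the live production process with a specific version."""
    artifact = REGISTRY / f"production_process_v{version}.joblib"
    LIVE.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(artifact, LIVE)

def rollback(previous_version: str) -> None:
    """Revert to a known-good version if problems show up in production."""
    deploy(previous_version)
```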

An interesting additional requirement for data science production processes is the need to continuously monitor model performance—because reality tends to change! Change detection is crucial for data science processes. We need to put mechanisms in place that recognize when the performance of our production process deteriorates. Then we either automatically retrain and redeploy the models or alert our data science team to the issue so they can create a new data science process, triggering the data science CI/CD process anew.
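A minimal sketch of such a monitoring hook, assuming we periodically receive labeled production data; the baseline, the thresholds, and the retrain/alert callbacks are illustrative assumptions.

```python
from sklearn.metrics import accuracy_score

BASELINE_ACCURACY = 0.92   # performance measured at deployment time (assumed)
ALLOWED_DROP = 0.05        # tolerated degradation before we react (assumed)

def monitor(model, X_recent, y_recent, retrain_and_redeploy, alert_team):
    """Check recent performance and react when reality has changed."""
    accuracy = accuracy_score(y_recent, model.predict(X_recent))
    drop = BASELINE_ACCURACY - accuracy
    if drop > 2 * ALLOWED_DROP:
        # Grave change: hand the problem back to the data science team,
        # which restarts the data science CI/CD cycle.
        alert_team(f"Production accuracy dropped to {accuracy:.2f}")
    elif drop > ALLOWED_DROP:
        # Moderate drift: retrain on fresh data and redeploy automatically.
        retrain_and_redeploy()
```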

So while monitoring software applications tends not to result in automatic code changes and redeployment, these are very typical requirements in data science. How this automatic integration and deployment involves (parts of) the original validation and testing setup depends on the complexity of those automatic changes. In data science, both testing and monitoring are much more integral components of the process itself. We focus less on testing our creation process (although we do want to archive/version the path to our solution), and we focus more on continuously testing the production process. Test cases here are also “input-result” pairs, but they are more likely to consist of data points than of conventional unit tests.

This difference in monitoring also affects the validation before deployment. In software deployment, we make sure our application passes its tests. For a data science production process, we may need to test to ensure that standard data points are still predicted to belong to the same class (e.g., “good” customers continue to receive a high credit ranking) and that known anomalies are still caught (e.g., known product faults continue to be classified as “faulty”). We also may want to ensure that our data science process still refuses to process totally absurd patterns (the infamous “male and pregnant” patient). In short, we want to ensure that test cases that refer to typical or abnormal data points or simple outliers continue to be treated as expected.
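Written as pytest-style checks against the candidate production process, such validation might look like the following sketch; the artifact name, feature layout, class labels, and plausibility rule are all illustrative assumptions.

```python
import joblib

# The candidate production process produced by integration (assumed to exist).
process = joblib.load("candidate_process.joblib")

def is_plausible(record: dict) -> bool:
    """Refuse patterns that cannot occur in reality (assumed business rule)."""
    return not (record["sex"] == "male" and record["pregnant"])

def test_good_customer_keeps_high_rating():
    # Typical data point: must still land in the expected class.
    good_customer = [[45, 90000, 0]]    # age, income, prior defaults (assumed features)
    assert process.predict(good_customer)[0] == "high_credit"

def test_known_bad_case_is_still_caught():
    # Known anomaly: must still be flagged.
    known_defaulter = [[23, 12000, 4]]
    assert process.predict(known_defaulter)[0] == "low_credit"

def test_absurd_pattern_is_refused():
    absurd_patient = {"sex": "male", "pregnant": True}
    assert not is_plausible(absurd_patient)
```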

MLOps, ModelOps, and XOps

How does all of this relate to MLOps, ModelOps, or XOps (as Gartner calls the combination of DataOps, ModelOps, and DevOps)? People referring to those terms often ignore two key facts: First, that data preprocessing is part of the production process (and not just a “model” that is put into production), and second, that model monitoring in the production environment is often only static and non-reactive.

Right now, many data science stacks address only parts of the data science life cycle. Not only must the other parts be done manually, but in many cases gaps between technologies require re-coding, so fully automatic extraction of the production data science process is all but impossible. Until people realize that truly productionizing data science is more than throwing a nicely packaged model over the wall, we will continue to see failures whenever organizations try to reliably make data science an integral part of their operations.

Data science processes still have a long way to go, but CI/CD offers quite a few lessons that can be built upon. However, there are two fundamental differences between CI/CD for data science and CI/CD for software development. First, the “data science production process” that is automatically created during integration is different from what has been created by the data science team. And second, monitoring in production may result in automatic updating and redeployment. That is, it is possible that the deployment cycle is triggered automatically by the monitoring process that checks the data science process in production, and only when that monitoring detects grave changes do we go back to the trenches and restart the entire process.

Michael Berthold is CEO and co-founder at KNIME, an open source data analytics company. He has more than 25 years of experience in data science, working in academia, most recently as a full professor at Konstanz University (Germany) and previously at University of California (Berkeley) and Carnegie Mellon, and in industry at Intel’s Neural Network Group, Utopy, and Tripos. Michael has published extensively on data analytics, machine learning, and artificial intelligence. Follow Michael on Twitter, LinkedIn and the KNIME blog.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.