Because building reliable data pipelines is hard, and the first step to becoming a data-driven organization is trusting your data.
It’s 8 a.m., and a business leader is looking at a financial performance dashboard, questioning if the results are accurate. A few hours later, a customer logs in to your company’s portal and wonders why their orders aren’t showing the latest pricing information. In the afternoon, the head of digital marketing is frustrated because data feeds from their SaaS tools never made it into their customer data platform. The data scientists are also upset because they can’t retrain their machine learning models without the latest data sets loaded.
These are dataops issues, and they’re important. Businesses should rightly expect that accurate and timely data will be delivered to data visualizations, analytics platforms, customer portals, data catalogs, ML models, and wherever data gets consumed.
Data management and dataops teams spend significant effort building and supporting data lakes and data warehouses. Ideally, they are fed by real-time data streams, data integration platforms, or API integrations, but many organizations still have data processing scripts and manual workflows that should be on the data debt list. Unfortunately, the robustness of the data pipelines is sometimes an afterthought, and dataops teams are often reactive in addressing source, pipeline, and quality issues in their data integrations.
In my book Digital Trailblazer, I write about the days when there were fewer data integration tools, and manually fixing data quality issues was the norm. “Every data processing app has a log, and every process, regardless of how many scripts are daisy‐chained, also has a log. I became a wizard with Unix tools like sed, awk, grep, and find to parse through these logs when seeking a root cause of a failed process.”
Today, there are far more robust tools than Unix commands to implement observability into data pipelines. Dataops teams are responsible for going beyond connecting and transforming data sources; they must also ensure that data integrations perform reliably and resolve data quality issues efficiently.
Dataops observability helps address data reliability
Observability is a practice employed by devops teams to enable tracing through customer journeys, applications, microservices, and database functions. Practices include centralizing application log files, monitoring application performance, and using AIops platforms to correlate alerts into manageable incidents. The goal is to create visibility, resolve incidents faster, perform root cause analysis, identify performance trends, enable security forensics, and resolve production defects.
Dataops observability targets similar objectives, except that its tools analyze data pipelines, ensure reliable data deliveries, and aid in resolving data quality issues.
Lior Gavish, cofounder and CTO at Monte Carlo, says, “Data observability refers to an organization’s ability to understand the health of their data at each stage in the dataops life cycle, from ingestion in the warehouse or lake down to the business intelligence layer, where most data quality issues surface to stakeholders.”
Sean Knapp, CEO and founder of Ascend.io, elaborates on the dataops problem statement: “Observability must help identify critical factors like the real-time operational state of pipelines and trends in the data shape,” he says. “Delays and errors should be identified early to ensure seamless data delivery within agreed-upon service levels. Businesses should have a grasp on pipeline code breaks and data quality issues so they can be quickly addressed and not propagated to downstream consumers.”
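To make Knapp's two signals concrete, here is a minimal Python sketch of a late-delivery check and a simple "data shape" trend check. It is an illustration under stated assumptions, not how Ascend.io or any other product works: the function names `sla_breached` and `shape_drift`, the use of daily row counts as the shape metric, and the three-standard-deviation threshold are all hypothetical choices.

```python
from datetime import datetime
from statistics import mean, stdev


def sla_breached(expected_by, delivered_at, now):
    """Return True if a delivery missed its agreed deadline.

    delivered_at is None when the data has not arrived yet; in that case
    the SLA is breached as soon as the deadline has passed.
    """
    if delivered_at is None:
        return now > expected_by
    return delivered_at > expected_by


def shape_drift(history, latest, threshold=3.0):
    """Return True if the latest value is far from the recent trend.

    history holds recent values of one shape metric (assumed here to be
    daily row counts). The check is a plain z-score against the mean.
    """
    if len(history) < 2:
        return False  # not enough history to judge a trend
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Example: a feed that usually carries about 1,000 rows delivers only 400.
recent_row_counts = [1000, 1010, 990, 1005, 995]
if shape_drift(recent_row_counts, 400):
    print("row count is outside the recent trend")
```

Production tools typically track many more shape metrics, such as null rates and value distributions, and account for seasonality, but the basic idea of comparing each delivery against its own history is the same.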
Knapp highlights businesspeople as key customers of dataops pipelines. Many companies are striving to become data-driven organizations, so when data pipelines are unreliable or untrustworthy, leaders, employees, and customers are impacted. Tools for dataops observability can be critical for these organizations, especially when citizen data scientists use data visualization and data prep tools as part of their daily jobs.
Chris Cooney, developer advocate at Coralogix, says, “Observability is more than a few graphs rendered on a dashboard. It’s an engineering practice spanning the entire stack, enabling teams to make better decisions.”
Observability in dataops versus devops
It’s common for devops teams to use several monitoring tools to cover the infrastructure, networks, applications, services, and databases. Dataops is similar: same motivations, different tools. Eduardo Silva, founder and CEO of Calyptia, says, “You need to have systems in place to help make sense of that data, and no single tool will suffice. As a result, you need to ensure that your pipelines can route data to a wide variety of destinations.”
Silva recommends vendor-neutral, open source solutions. This approach is worth considering, especially since most organizations utilize multiple data lakes, databases, and data integration platforms. A dataops observability capability built into one of these data platforms may be easy to configure and deploy but may not provide holistic data observability capabilities that work across platforms.
What capabilities are needed? Ashwin Rajeev, cofounder and CTO of Acceldata.io, says, “Enterprise data observability must help overcome the bottlenecks associated with building and operating reliable data pipelines.”
Rajeev elaborates, “Data must be efficiently delivered on time every time by using the proper instrumentation with APIs and SDKs. Tools should have proper navigation and drill-down that allows for comparisons. It should help dataops teams rapidly identify bottlenecks and trends for faster troubleshooting and performance tuning to predict and prevent incidents.”
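What instrumentation might look like at the code level can be sketched in a few lines of Python. The example below is an assumption-laden illustration rather than any vendor's SDK: the `instrumented` decorator, the in-memory `METRICS` list, and the `drop_null_prices` step are hypothetical names, and a real pipeline would ship these measurements to an observability platform instead of keeping them in a list.

```python
import time
from functools import wraps

# Hypothetical in-memory sink for illustration only.
METRICS = []


def instrumented(step_name):
    """Wrap a pipeline step to record its status, row counts, and duration."""
    def decorator(func):
        @wraps(func)
        def wrapper(rows):
            start = time.perf_counter()
            status = "error"
            result = None
            try:
                result = func(rows)
                status = "ok"
                return result
            finally:
                # Runs on success and on failure, so every attempt is recorded.
                METRICS.append({
                    "step": step_name,
                    "status": status,
                    "rows_in": len(rows),
                    "rows_out": len(result) if result is not None else None,
                    "seconds": time.perf_counter() - start,
                })
        return wrapper
    return decorator


@instrumented("drop_null_prices")
def drop_null_prices(rows):
    """Example transformation step: discard rows that have no price."""
    return [row for row in rows if row.get("price") is not None]


cleaned = drop_null_prices([{"price": 9.5}, {"price": None}, {}])
print(METRICS[-1]["step"], METRICS[-1]["rows_in"], METRICS[-1]["rows_out"])
```

Recording rows in, rows out, and elapsed time for every step is what makes the comparisons and drill-downs Rajeev describes possible, because a slow or lossy step shows up as an outlier against its own history.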
Dataops tools with code and low-code capabilities
One aspect of dataops observability is operations: the reliability and on-time delivery from source to data management platform to consumption. A second concern is data quality. Armon Petrossian, cofounder and CEO of Coalesce, says, “Data observability in dataops involves ensuring that business and engineering teams have access to properly cleansed, managed, and transformed data so that organizations can truly make data-driven business and technical decisions. With the current evolution in data applications, to best prepare data pipelines, organizations need to focus on tools that offer the flexibility of a code-first approach but are GUI-based to enable enterprise scale, because not everyone is a software engineer, after all.”
So dataops, and thus data observability, must have capabilities that appeal to coders who consume APIs and develop robust, real-time data pipelines. But non-coders also need data quality and troubleshooting tools to work with their data prep and visualization efforts.
“In the same way that devops relies extensively on low-code automation-first tooling, so too does dataops,” adds Gavish. “As a critical component of the dataops life cycle, data observability solutions must be easy to implement and deploy across multiple data environments.”
Monitoring distributed data pipelines
For many large enterprises, reliable data pipelines and applications aren’t easy to implement. “Even with the help of such observability platforms, teams in large enterprises struggle to preempt many incidents,” says Ramanathan Srikumar, chief solutions officer at Mphasis. “A key issue is that the data does not provide adequate insights into transactions that flow through multiple clouds and legacy environments.”
Hillary Ashton, chief product officer at Teradata, agrees. “Modern data ecosystems are inherently distributed, which creates the difficult task of managing data health across the entire life cycle.”
And then she shares the bottom line: “If you can’t trust your data, you’ll never become data driven.”
Ashton recommends, “For a highly reliable data pipeline, companies need a 360-degree view integrating operational, technical, and business metadata by looking at telemetry data. The view allows for identifying and correcting issues such as data freshness, missing records, changes to schemas, and unknown errors. Embedding machine learning in the process can also help automate these tasks.”
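Three of the issues Ashton lists (data freshness, missing records, and schema changes) can be illustrated with a short Python sketch. This is a simplified example under stated assumptions, not Teradata's approach: the `check_batch` function, the `loaded_at` timestamp column, and the thresholds are hypothetical, and each batch is assumed to be a list of dictionaries.

```python
from datetime import datetime, timedelta, timezone


def check_batch(rows, expected_columns, expected_min_rows, max_age, now=None):
    """Return a list of human-readable issues found in one loaded batch."""
    now = now or datetime.now(timezone.utc)
    issues = []

    # Missing records: the batch is smaller than it normally is.
    if len(rows) < expected_min_rows:
        issues.append(
            f"missing records: got {len(rows)}, expected at least {expected_min_rows}"
        )

    # Schema change: columns were added or dropped since the last known schema.
    seen = set()
    for row in rows:
        seen.update(row.keys())
    expected = set(expected_columns)
    if rows and seen != expected:
        added = sorted(seen - expected)
        dropped = sorted(expected - seen)
        issues.append(f"schema change: added={added}, dropped={dropped}")

    # Freshness: the newest record is older than the allowed age.
    # "loaded_at" is an assumed column holding timezone-aware timestamps.
    timestamps = [row["loaded_at"] for row in rows if "loaded_at" in row]
    if timestamps and now - max(timestamps) > max_age:
        issues.append(f"stale data: newest record loaded at {max(timestamps)}")

    return issues


# Example: one stale row that is also missing the expected "currency" column.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
batch = [{"order_id": 1, "price": 9.5, "loaded_at": now - timedelta(hours=30)}]
for issue in check_batch(
    batch,
    expected_columns=["order_id", "price", "currency", "loaded_at"],
    expected_min_rows=2,
    max_age=timedelta(hours=24),
    now=now,
):
    print(issue)
```

Hand-set thresholds like these are where the machine learning Ashton mentions could help, by learning expected volumes and arrival times from history instead of relying on fixed values.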
We’ve come a long way from using Unix commands to parse log files for data integration issues. Today’s data observability tools are a lot more sophisticated, but providing the business with reliable data pipelines and high-quality data processing remains a challenge for many organizations. Accept the challenge and partner with business leaders on an agile and incremental implementation because data visualizations and ML models built on untrustworthy data can lead to erroneous and potentially harmful decisions.