Importing, transforming, and validating data from unmanaged external sources is a messy, complex process. A data exchange platform can help.

What has always fascinated me about Moore's law is that for more than half a century, the technological computing innovations we take for granted, from the PC to smart watches to self-driving cars, hinged on solving one small, specific problem: shrinking the distance between transistors on a chip. As our software-powered world becomes more and more data-driven, unlocking and unblocking the coming decades of innovation hinges on data: how we collect it, exchange it, consolidate it, and use it. In a way, the speed, ease, and accuracy of data exchange has become the new Moore's law.

TL;DR: Safely and efficiently importing myriad data file types from thousands or even millions of different unmanaged external sources is a pervasive, growing problem. Most organizations struggle with file import because traditional ETL (extract, transform, and load) and iPaaS (integration platform-as-a-service) solutions are designed to transfer data only between tightly managed IT systems and databases. Below, I'll explain what data import is and the common problems companies face in taming unmanaged files. I'll discuss how emerging data exchange platforms are designed to solve these problems, and how these platforms work both on their own and in tandem with traditional ETL solutions to make them faster and more agile.

Six data file exchange challenges

Data files often require data mapping, review, cleanup, and validation, and they may need human oversight before they can be imported into managed databases and business systems. Data files present developers and IT teams with a variety of challenges:

Onboarding customers: The need to load customer data into the software applications that customers use can introduce delays or complications that decrease customer satisfaction and trigger churn.

Uploading files: Applications that allow customers, prospects, employees, partners, or vendors to upload data files can also introduce time delays, errors, and complaints from end users. Some users who can't complete what should be a simple task will leave and never return.

Orchestrating data workflows: Businesses often need to orchestrate complex data workflows across diverse stakeholders, systems, and processes while delivering seamless data exchange experiences that provide the highest business value for all participants.

Migrating data: Preparing data for large IT migration projects is a complex undertaking that nearly always introduces data errors, versioning issues, time delays, and frustration. Moving data from legacy systems to a new business system requires extensive data review between business stakeholders and implementation experts. Data from old systems needs to be prepared for import into the new system, which often means emailing Excel files back and forth for review and cleanup.

Automating file imports: Most businesses need to periodically collect data from partners, agents, or remote employees, or aggregate data from remote departments or divisions. The volume and complexity of that data are constantly growing, turning collection, import, and processing into cumbersome and error-prone tasks. Files might be emailed, dropped into a shared folder, or sent via FTP, and they often require dedicated resources for mapping, formatting, cleaning, and review before they can be combined with other data.
Reviewing data manually: Data imports frequently require manual review, with exception handling and approvals on both the sending and receiving ends. Users need to be able to quickly upload a file, look through it, fill in any blanks, and make simple mapping decisions. The receiving side may need to review exceptions, review the data in consolidated form, or even send requests back to users to fix or update certain parts of the data. This human-in-the-loop component requires an entirely new approach to managing data exchange.

Data import workarounds vs. a purpose-built data exchange solution

Most IT teams rely on a range of workarounds to bring data files into their business, usually with significant data quality issues and at a high cost. Businesses attempt to solve these data file issues by hiring outside IT services teams, using end-user templates and rules, or building a custom solution. Beyond the direct costs of the personnel and maintenance these workarounds require, the opportunity cost of lost and delayed revenue vastly increases the impact of poor data import. A purpose-built data exchange solution streamlines, accelerates, and secures data import processes, improving business velocity and delivering rapid and sustained ROI.

Data file exchange is a critical component of a modern data integration architecture. The right solution will:

Reduce data errors
Accelerate timely decision-making
Reduce in-house development time and cost
Increase data usability
Accelerate time to value
Improve security and compliance

Build vs. buy (or a mixture of both)

In addition to building a file importer from scratch, companies can draw on several open-source libraries and commercial solutions to complete their enterprise data integration architecture. Building is always a long-term commitment: it entails developing new features as file import needs change (such as adding new languages or navigating regulatory requirements that come with supporting a new customer), on top of supporting and maintaining the tool over time.

Some companies opt to buy a CSV import tool, choosing among the many options that have emerged in recent years. These tools offer basic functionality but are typically limited to a narrowly defined use case and cannot address the varied and evolving needs of the enterprise.

The third option is a "build with" approach that provides the functionality and scalability of software, together with the flexibility to meet an organization's specific business needs. An API-based file import platform enables developers to build fully customizable data file import, using code to drive business and data logic without having to maintain the underlying plumbing.

Whether an organization DIYs it, outsources it, or builds with a platform, there are certain basic functions that any data exchange solution needs to support.

Data parsing is the process of taking the information in a file and breaking it into discrete parts. A data parsing feature transforms a file into an array of discrete data and streamlines this process for end users.

Along with parsing, proper data structuring ensures that incoming data is received into the system and labeled appropriately. APIs expect data in a specific format and will fail without it.

Data validation involves checking the data to ensure it matches an expected format or value, preventing issues from occurring down the line and eliminating the need for end users to remove and re-upload data.
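To make the parsing, structuring, and validation steps concrete, here is a minimal sketch in Python. The schema, field names, and rules (a hypothetical customer-contact upload with name, email, and signup_date columns) are illustrative assumptions, not the API of any particular product.

```python
import csv
import io
import re
from datetime import datetime

# Hypothetical target schema for a customer-contact upload.
REQUIRED_FIELDS = ["name", "email", "signup_date"]
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def parse(raw_text):
    """Parsing and structuring: break the raw file into rows and
    label every cell with its column header."""
    reader = csv.DictReader(io.StringIO(raw_text))
    return [dict(row) for row in reader]

def validate(records):
    """Validation: check each structured record against expected
    formats before anything reaches a downstream API."""
    errors = []
    for i, rec in enumerate(records, start=2):  # row 1 is the header
        for field in REQUIRED_FIELDS:
            if not (rec.get(field) or "").strip():
                errors.append((i, field, "missing value"))
        if rec.get("email") and not EMAIL_RE.match(rec["email"].strip()):
            errors.append((i, "email", "not a valid email address"))
        if rec.get("signup_date"):
            try:
                datetime.strptime(rec["signup_date"].strip(), "%Y-%m-%d")
            except ValueError:
                errors.append((i, "signup_date", "expected YYYY-MM-DD"))
    return errors

raw = "name,email,signup_date\nAda Lovelace,ada@example.com,2024-03-01\nGrace Hopper,not-an-email,03/01/2024\n"
for row, field, problem in validate(parse(raw)):
    print(f"row {row}, {field}: {problem}")
```

In a real importer these checks would typically run as the user maps and edits the file, so problems surface before submission rather than as a failed API call downstream.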
After validation, data mapping and matching refer to taking previously unknown source data and matching it to a known target. Without data mapping, imports will fail when data elements, such as column headings, do not match exactly.

Data transformation involves making changes to data as it flows into the system to ensure it meets an expected or desired value. Rather than sending data back to users with an error message, the data undergoes small, systematic tweaks that make it usable.

Data in / data out refers to all the ways data can be moved into and out of the tool. It can be as simple as downloading and uploading files or as complex as automating imports and posting exports to an external API. Data ingress and egress should align with an organization's operational needs.

Performance at scale and support for collaboration among multiple users are imperative. What suffices in the short term can swiftly devolve into a sluggish system unless you consider future requirements.

Security, compliance, and access controls ensure that the data import solution functions smoothly, aligns with regulatory requirements, safeguards data integrity, and increases transparency. These elements form the foundation of a trustworthy and dependable file import tool.

ETL + data import = stronger together

Data exchange and import solutions are designed to work seamlessly alongside traditional integration solutions. ETL tools integrate structured systems and databases and manage the ongoing transfer and synchronization of data records between those systems. Adding a data-file exchange solution next to an ETL tool enables teams to import and exchange variable, unmanaged data files seamlessly. The two systems can be implemented on separate, independent, parallel tracks, or the data-file exchange solution can feed restructured, cleaned, and validated data into the ETL tool for further consolidation in downstream enterprise systems.

A data exchange platform integrated with a traditional ETL tool offers several advantages in managing and transferring data:

Data collection from many (small or large) sources
Any source
Human-in-the-loop
Data collaboration
Ephemeral data integration
Intelligent and scalable data cleaning and validation
Secure gate for external data
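As a rough illustration of how the mapping, transformation, and hand-off steps described above can fit together, the sketch below (plain Python, with hypothetical header synonyms, transformations, and staging path) matches unfamiliar column headings to a known target schema, applies small normalizing tweaks, and writes the cleaned records to a staging file that a downstream ETL job could consolidate.

```python
import csv
from pathlib import Path

# Hypothetical target schema and the header variants we are willing to match.
HEADER_SYNONYMS = {
    "email": {"email", "e-mail", "email address"},
    "name": {"name", "full name", "customer"},
    "signup_date": {"signup_date", "signup", "date joined"},
}

def map_headers(source_headers):
    """Mapping/matching: pair unknown source columns with known target
    fields instead of failing when headings do not match exactly."""
    mapping = {}
    for header in source_headers:
        key = header.strip().lower()
        for target, synonyms in HEADER_SYNONYMS.items():
            if key in synonyms:
                mapping[header] = target
    return mapping

def transform(record):
    """Transformation: small, systematic tweaks so data arrives usable
    rather than being bounced back to the user with an error."""
    cleaned = {k: (v or "").strip() for k, v in record.items()}
    if "email" in cleaned:
        cleaned["email"] = cleaned["email"].lower()
    return cleaned

def stage_for_etl(source_rows, staging_path="staging/contacts_clean.csv"):
    """Hand-off: write restructured, cleaned records to a (hypothetical)
    staging location that an ETL job can consolidate downstream."""
    mapping = map_headers(source_rows[0].keys())
    cleaned = [
        transform({mapping[h]: v for h, v in row.items() if h in mapping})
        for row in source_rows
    ]
    path = Path(staging_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(HEADER_SYNONYMS))
        writer.writeheader()
        writer.writerows(cleaned)
    return len(cleaned)

rows = [{"Full Name": "Ada Lovelace", "E-Mail": "ADA@example.com", "Date Joined": "2024-03-01"}]
print(stage_for_etl(rows), "record(s) staged")
```

The point of the hand-off is that the data exchange layer absorbs the variability of external files, while the ETL pipeline only ever sees restructured, cleaned, and validated data.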
Combining a data exchange platform with an ETL tool will create a modern data integration and management ecosystem that enables companies to make better use of all of their data and start reaping the benefits of the new Moore's law.

David Boskovic is founder and CEO of Flatfile.