by Anais Dotis-Georgiou

How Apache Arrow speeds big data processing

feature
Nov 07, 2023 | 6 mins
Analytics, Data Management, SQL

Apache Arrow defines an in-memory columnar data format that accelerates processing on modern CPU and GPU hardware, and enables lightning-fast data access between systems.


Working with big data can be a challenge, thanks to the performance overhead of moving data between different tools and systems as part of the data processing pipeline. Because programming languages, file formats, and network protocols have different ways of representing the same data in memory, data must be serialized and deserialized into a different representation at potentially every step in a pipeline, which makes working with large amounts of data slower and more costly in terms of hardware.

Apache Arrow solves this problem, making analytics workloads more efficient for modern CPU and GPU hardware. A framework that defines an in-memory columnar data format that every processing engine can use, Apache Arrow does for OLAP (online analytical processing) workloads what ODBC/JDBC did for OLTP (online transaction processing) workloads by creating a common interface for different systems working with analytics data.

Apache Arrow has started to gain major adoption in the developer community and is poised to change the big data ecosystem for good.

Apache Arrow benefits

The primary benefit of adopting Arrow is performance. With Arrow, there is no need to serialize and deserialize data as it moves between different tools and languages, because everything can use the Arrow format. This is especially useful at scale, when you need multiple servers to process data.

Consider the following example of performance gains from Ray, a Python framework for managing distributed computing:

[Figure: serialization and deserialization performance for NumPy arrays, Apache Arrow vs. Pickle. Source: Apache Arrow blog]

Converting data to the Apache Arrow format is clearly faster than using a Python alternative like Pickle, and the gains are even greater for deserialization, which is orders of magnitude faster.

Apache Arrow’s column-based format also makes processing and manipulating data faster, because the format is designed for modern CPUs and GPUs, which can process the data in parallel and take advantage of SIMD (single instruction, multiple data) instructions for vectorized processing.
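For example, a minimal sketch using pyarrow's compute module (assuming the pyarrow package is installed; the data is made up) shows aggregations running as whole-column, vectorized kernels rather than Python-level loops over rows:

```python
import pyarrow as pa
import pyarrow.compute as pc

# Build an in-memory columnar table; each column is a contiguous Arrow array.
table = pa.table({
    "sensor": ["a", "a", "b", "b"],
    "temperature": [21.5, 22.1, 19.8, 20.4],
})

# Compute kernels operate on whole columns at once, letting Arrow use
# vectorized (SIMD-friendly) code paths internally.
mean_temp = pc.mean(table["temperature"])
print(mean_temp.as_py())  # 20.95
```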

Apache Arrow also provides zero-copy reads, which reduce memory requirements in situations where you want to transform and manipulate the same underlying data in different ways.
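A small illustration of this with pyarrow (again a sketch, with made-up data): slicing a table produces views over the same underlying buffers rather than copies.

```python
import pyarrow as pa

table = pa.table({"values": list(range(1_000_000))})

# slice() is zero-copy: the new tables reference windows into the same
# underlying Arrow buffers instead of duplicating the data.
first_half = table.slice(0, 500_000)
second_half = table.slice(500_000)

print(first_half.num_rows, second_half.num_rows)  # 500000 500000
```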

Another benefit is that Apache Arrow integrates well with Apache Parquet, a columnar file format focused on persistence to disk. Combined, Arrow and Parquet make managing the life cycle and movement of data between RAM and disk much easier and more efficient.
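A minimal sketch of that round trip with pyarrow (the file name and data are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# An in-memory Arrow table (RAM) ...
table = pa.table({"city": ["Berlin", "Austin"], "population": [3_700_000, 965_000]})

# ... persisted as a columnar Parquet file on disk ...
pq.write_table(table, "cities.parquet")

# ... and read back into Arrow memory with vectorized, column-oriented I/O.
round_tripped = pq.read_table("cities.parquet")
assert round_tripped.equals(table)
```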

Apache Arrow’s ecosystem presents an additional benefit, as more functionality and features are being added over time and performance continues to improve. In many cases, companies are donating entire projects to Apache Arrow and contributing heavily to the project itself. Apache Arrow benefits almost all companies because it makes moving data between systems easier. And when a project adds Apache Arrow support, it becomes easier for developers to migrate to or adopt that technology as well.

Apache Arrow features and components

There are four key features and components of the Apache Arrow project: the Arrow columnar data format, Arrow Flight, Arrow Flight SQL, and Arrow DataFusion.

The Arrow columnar format is the core of the project and defines the actual specification for how data should be structured in memory. From a performance perspective, the key features delivered by this format are:

  • Sequential data access
  • Constant-time random access
  • SIMD and vectorized processing support
  • Zero-copy reads
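To make the format concrete, here is a minimal sketch with pyarrow that builds an in-memory record batch; storing each column in its own contiguous array is what makes the properties above possible:

```python
import pyarrow as pa

# Each column is stored as its own contiguous Arrow array.
ids = pa.array([1, 2, 3], type=pa.int64())
scores = pa.array([0.9, 0.7, 0.8], type=pa.float64())

batch = pa.RecordBatch.from_arrays([ids, scores], names=["id", "score"])

print(batch.schema)        # id: int64, score: double
print(batch.column(1))     # the full "score" column, readable sequentially
print(batch.column(1)[1])  # constant-time random access to a single value
```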

Arrow Flight is an RPC (remote procedure call) framework added to Apache Arrow to allow easy transfer of large amounts of data across networks without the overhead of serialization and deserialization. The compression provided by Arrow also means that less bandwidth is consumed compared to less-optimized protocols. Many projects use Arrow Flight to enable distributed computing for analytics and data science workloads.
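As a rough sketch of what a Flight consumer looks like in Python, using pyarrow.flight (the server location and ticket contents below are hypothetical and depend entirely on the Flight service you connect to):

```python
import pyarrow.flight as flight

# Connect to a (hypothetical) Arrow Flight service.
client = flight.FlightClient("grpc://localhost:8815")

# A ticket identifies a stream of data on the server; its contents are
# service-specific, so this value is purely illustrative.
ticket = flight.Ticket(b"sensor_readings")

# do_get streams Arrow record batches over the wire without row-by-row
# serialization; read_all() assembles them into a single Arrow table.
reader = client.do_get(ticket)
table = reader.read_all()
print(table.num_rows)
```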

An extension of Arrow Flight, Arrow Flight SQL interacts directly with SQL databases. It is still considered experimental and features are being added rapidly. A JDBC (Java Database Connectivity) driver was recently added to the project, allowing any database that supports JDBC or ODBC (Microsoft Open Database Connectivity) to communicate with Arrow data through Flight SQL.
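On the Python side, one way to query a Flight SQL endpoint is through the ADBC Flight SQL driver. The sketch below assumes the adbc-driver-flightsql package is installed and that a Flight SQL service is listening at the URI shown; the endpoint, table, and column names are all assumptions.

```python
import adbc_driver_flightsql.dbapi as flightsql

# Connect to a (hypothetical) Arrow Flight SQL endpoint.
conn = flightsql.connect("grpc://localhost:32010")
cursor = conn.cursor()

# The query runs on the server; results stream back as Arrow data rather
# than being re-serialized row by row. Table and column names are made up.
cursor.execute("SELECT name, value FROM metrics LIMIT 10")
table = cursor.fetch_arrow_table()
print(table.schema)

cursor.close()
conn.close()
```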

Finally, DataFusion is a query execution framework that was donated to Apache Arrow in 2019. DataFusion includes a query optimizer and execution engine with support for SQL and DataFrame APIs. It is commonly used for creating data pipelines, ETL (extract, transform, and load) processes, and databases.
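A brief sketch using DataFusion's Python bindings (the datafusion package; the file, table, and column names are illustrative):

```python
from datafusion import SessionContext

# DataFusion session: query planner, optimizer, and execution engine.
ctx = SessionContext()

# Register a Parquet file as a queryable table.
ctx.register_parquet("events", "events.parquet")

# Plan, optimize, and execute SQL; results come back as Arrow record batches.
df = ctx.sql("SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id")
for batch in df.collect():
    print(batch.num_rows)
```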

Apache Arrow projects of note

Many projects are adding integrations with Apache Arrow to make adopting their tools easier, or embedding components of Apache Arrow directly into their projects to avoid duplicating work. The following are some of them:

  • InfluxDB 3.0. InfluxDB’s new columnar storage engine (previously known as InfluxDB IOx) uses the Apache Arrow format for representing data and moving data to and from Parquet. It also uses DataFusion to add SQL support to InfluxDB.
  • Apache Parquet. Parquet is a file format for storing columnar data used by many projects for persistence. Parquet supports vectorized reads and writes to and from Apache Arrow.
  • Dask. A parallel computing framework, Dask makes it easy to scale Python code horizontally. It uses Apache Arrow to access Parquet files.
  • Ray. Ray is a framework that allows data scientists to process data, train machine learning models, and serve those models in production using a unified tool. It relies on Apache Arrow to move data between components with minimal overhead.
  • Pandas. One of the most popular data analysis tools in the Python ecosystem, Pandas is able to read data stored in Parquet files by using Apache Arrow behind the scenes (see the sketch after this list).
  • Turbodbc. Turbodbc is a Python module that allows data scientists to efficiently access data stored in relational databases via the ODBC interface. Apache Arrow makes this more efficient by allowing the data to be transferred in batches rather than as single records.
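As a concrete example of the pandas integration noted above, here is a minimal sketch (the file name is illustrative) in which pandas delegates the Parquet read to pyarrow:

```python
import pandas as pd

# pandas hands the read off to pyarrow, which parses the columnar Parquet
# file into Arrow memory and then converts it into a DataFrame.
df = pd.read_parquet("cities.parquet", engine="pyarrow")
print(df.head())
```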

The push to eliminate lock-in effects by improving interoperability is happening in many different areas of software development today. For example, we see it in the observability and monitoring space with projects like OpenTelemetry, as well as in the big data ecosystem with projects like Apache Arrow.

With Apache Arrow, developers not only save time by not having to reinvent the wheel. They also gain valuable access to the entire ecosystem of data processing tools that also use Apache Arrow, which can make adoption by new users significantly easier.

Anais Dotis-Georgiou is a developer advocate for InfluxData with a passion for making data beautiful with the use of data analytics, AI, and machine learning. She takes the data that she collects and applies a mix of research, exploration, and engineering to translate the data into something of function, value, and beauty. When she is not behind a screen, you can find her outside drawing, stretching, boarding, or chasing after a soccer ball.

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.