OpenAI previews Realtime API for speech-to-speech apps

Realtime API supports multi-model text and speech experiences including natural speech-to-speech conversations using preset voices already supported in the API.

Image of a person typing on a keyboard. Text, speech, typing.

Credit: Tero Vesalainen/Shutterstock

OpenAI has introduced a public beta of the Realtime API, an API that allows paid developers to build low-latency, multi-modal experiences including text and speech in apps.

Introduced October 1, the Realtime API, similar to the OpenAI ChatGPT Advanced Voice Mode, supports natural speech-to-speech conversations using preset voices that the API already supports. OpenAI also is introducing audio input and output in the Chat Completions API to support use cases that do not need the low-latency benefits of the Realtime API. Developers can pass text or audio inputs into GPT-4o and have the model respond with text, audio, or both.

With the Realtime API and the audio support in the Chat Completions API, developers do not have to link together multiple models to power voice experiences. They can build natural conversational experiences with just one API call, OpenAI said. Previously, creating a similar voice experience had developers transcribing an automatic speech recognition model such as Whisper, passing text to a text model for inference or reasoning, and playing the model’s output using a text-to-speech model. This approach often resulted in loss of emotion, emphasis, and accents, plus latency.

With the Chat Completions API, developers can deal with the entire process with one API call, though it remains slower than human conversation. The Realtime API improves latency by streaming audio inputs and outputs directly, enabling more natural conversational experiences, OpenAI said. The Realtime API also can handle interruptions automatically, like ChatGPT’s advanced voice mode.

The Realtime API enables development of a persistent WebSocket connection to exchange messages with GPT-4o. The API backs function calling, which makes it possible for voice assistants to respond to user requests by pulling in new context or triggering actions. Also, the Realtime API leverages multiple layers of safety protections to mitigate the risk of API abuse, including automated monitoring and human review of flagged model inputs and outputs.

The Realtime API uses text tokens and audio tokens. Text input costs $5 per 1M tokens and text output costs $20 per 1M tokens. Audio input costs $100 per 1M tokens and audio output costs $200 per 1M tokens.

OpenAI said plans for improving the Realtime API include adding support for vision and video, increasing rate limits, adding support for prompt caching, and expanding model support to GPT-4o mini. The company said it would also integrate support for the Realtime API into the OpenAI Python and Node.js SDKs.

OpenAI previews Realtime API for speech-to-speech apps

Realtime API supports multi-model text and speech experiences including natural speech-to-speech conversations using preset voices already supported in the API.

More from this author

Microsoft unveils imaging APIs for Windows Copilot Runtime

Microsoft extends Entra ID to WSL, WinGet

F# 9 adds nullable reference types

Akka distributed computing platform adds Java SDK

Spin 3.0 supports polyglot development using Wasm components

Go language evolving for future hardware, AI workloads

JDK 24: The new features in Java 24

Rust Foundation moves forward on C++ and Rust interoperability

Show me more

What is Rust? Safe, fast, and easy software development

Kotlin for Java developers: Classes and coroutines

Azure AI Foundry tools for changes in AI applications

Building Python wheels to distribute your programs

Creating a pip install-able Python package

How to get better web requests in Python with httpx