Home Artificial Intelligence OpenAI updates API with model distillation, prompt caching abilities

by Anirban Ghoshal

Senior Writer

OpenAI updates API with model distillation, prompt caching abilities

news

Oct 03, 20244 mins

APIsGenerative AI

The updates could help OpenAI compete better with rivals such as Anthropic, Google, and AWS which already offer similar capabilities.

Shutterstock Germany - News - OpenAI Finance September 2024

Credit: Shutterstock - Mehaniq

In what can only be seen as OpenAI’s efforts to catch up with rivals, the ChatGPT-maker released several updates to its API to help ease the development of generative AI-based applications.

These updates, introduced during its DevDay conference this week, include capabilities such as model distillation and prompt caching, which are already offered by rivals.

Model distillation to help reduce costs of gen AI applications

Model distillation, a derivative of knowledge distillation, is a technique used in large language model training. The technique is used to teach a smaller model desired or required knowledge from a larger model.

Model distillation is preferred by developers as it can maintain the performance of a model underpinning an application while reducing the computation requirements and in turn costs.

The rationale is that smaller models, which use less compute, are able to perform like a larger model in a specified field of knowledge or expertise.

Several experts claim that model distillation can be used effectively in real-time natural language processing tasks or in industry sectors such as finance and healthcare that need the model to have domain expertise.

The model distillation capability introduced inside OpenAI API includes three components — Stored Completions, Evals, and Fine-tuning — all of which can be accessed via the API.

In order to distill a model using the OpenAI API, developers need to create an evaluation, either manually or using the Evals component, which is in beta, to measure the performance of the smaller model.

The idea is to continuously monitor the model after distilling it to ensure that it is performing as desired, OpenAI explained.

Post creating the evaluation, developers can use Stored Completions to create a dataset of outputs from the larger model on the desired topic on which the smaller model is to be trained.

Stored Completions, according to OpenAI, is a new free feature inside the API that can be used to automatically capture and store input-output pairs generated by any of the LLMs provided by the company, like GPT-4o or o1-preview.

Once the dataset is created using Stored Completions, it can be reviewed, filtered, and then used to fine-tune the smaller model or can be used as an evaluation dataset.

After this, developers can conduct an evaluation of the smaller model to see if it is performing optimally or is close to the larger model, the company said.

Rivals Google, Anthropic, and AWS already offer model distillation capabilities.

While Google previously offered the capability to create distilled models for PaLM and currently offers the capability to use Gemini to distill smaller models, AWS provides access to Llama 3.1-405B for synthetic data generation and distillation to fine-tune smaller models.

Model distillation as a feature inside OpenAI API is generally available, the company said, adding that any of its larger models can be used to distill smaller models.

Prompt Caching to reduce latency in gen AI applications

Alongside the distillation ability, OpenAI has also made available prompt caching capability for the latest versions of GPT-4o, GPT-4o mini, o1-preview, and o1-mini, as well as fine-tuned versions of those models.

Prompt caching is a technique used in the gen AI-based application development process that allows the model to understand natural language faster by storing and reusing contexts that are repetitively used while making API calls.

“Many developers use the same context repeatedly across multiple API calls when building AI applications, like when making edits to a codebase or having long, multi-turn conversations with a chatbot,” OpenAI explained, adding that the rationale is to reduce token consumption when sending a request to the LLM.

What that means is that when a new request comes in, the LLM checks if some parts of the request are cached. In case it is cached, it uses the cached version, otherwise it runs the full request.

OpenAI’s new prompt caching capability works on the same fundamental principle, which could help developers save on cost and time.

“By reusing recently seen input tokens, developers can get a 50% discount and faster prompt processing times,” OpenAI said.

Additionally, OpenAI has introduced a public beta of the Realtime API, an API that allows developers to build low-latency, multi-modal experiences including text and speech in apps.

by Anirban Ghoshal

Senior Writer

Anirban Ghoshal is a senior writer covering enterprise software for CIO.com and databases and cloud and AI infrastructure for InfoWorld.

Topics

About

Policies

Our Network

More

OpenAI updates API with model distillation, prompt caching abilities

The updates could help OpenAI compete better with rivals such as Anthropic, Google, and AWS which already offer similar capabilities.

Model distillation to help reduce costs of gen AI applications

Prompt Caching to reduce latency in gen AI applications

More from this author

Microsoft rebrands Azure AI Studio to Azure AI Foundry

Snowflake bares its agentic AI plans by showcasing its Intelligence platform

Is now the right time to invest in implementing agentic AI?

Hugging Face pitches HUGS as an alternative to Nvidia’s NIM for open models

How developers can automate ‘computer use’ with Anthropic’s new LLM

Musk’s xAI unveils a new API service for Grok models

SingleStore acquires BryteFlow to boost data ingestion capabilities

Explained: How Salesforce Agentforce’s Atlas reasoning engine works to power AI agents

Show me more

What is Rust? Safe, fast, and easy software development

Kotlin for Java developers: Classes and coroutines

Azure AI Foundry tools for changes in AI applications

Building Python wheels to distribute your programs

Creating a pip install-able Python package

How to get better web requests in Python with httpx

OpenAI updates API with model distillation, prompt caching abilities

The updates could help OpenAI compete better with rivals such as Anthropic, Google, and AWS which already offer similar capabilities.

Model distillation to help reduce costs of gen AI applications

Prompt Caching to reduce latency in gen AI applications

Related content

Microsoft unveils imaging APIs for Windows Copilot Runtime

Designing the APIs that accidentally power businesses

AI accelerating API development, IBM says

WSO2 API managers manage AI APIs

More from this author

Microsoft rebrands Azure AI Studio to Azure AI Foundry

Snowflake bares its agentic AI plans by showcasing its Intelligence platform

Is now the right time to invest in implementing agentic AI?

Hugging Face pitches HUGS as an alternative to Nvidia’s NIM for open models

How developers can automate ‘computer use’ with Anthropic’s new LLM

Musk’s xAI unveils a new API service for Grok models

SingleStore acquires BryteFlow to boost data ingestion capabilities

Explained: How Salesforce Agentforce’s Atlas reasoning engine works to power AI agents

Show me more

What is Rust? Safe, fast, and easy software development

Kotlin for Java developers: Classes and coroutines

Azure AI Foundry tools for changes in AI applications

Building Python wheels to distribute your programs

Creating a pip install-able Python package

How to get better web requests in Python with httpx