Vercel

Vercel's AI Gateway Adds Realtime Voice and Speech API Capabilities


Executive Summary

Vercel has expanded its AI Gateway with audio and voice capabilities, available in beta through AI SDK 7. Developers can now add realtime voice conversations, text-to-speech, and speech-to-text transcription to their applications. The audio features sit behind the existing AI Gateway API and get the same provider routing, observability, and spend controls as text and image models.

Key Takeaways

* New Audio Modalities: The AI Gateway now supports three audio functions:

  * Realtime Voice: Enables live, low-latency, two-way conversational agents where users can interrupt the AI (barge-in).

  * Text-to-Speech (TTS): Converts text input into spoken audio files.

  * Speech-to-Text (Transcription): Transcribes recorded audio files into text.

* Unified Management: All audio calls are routed through the AI Gateway and inherit its core benefits, including cross-provider API key management, observability, budget controls, and bring-your-own-key support.

* Initial Model Support: The feature launches with support for audio models from OpenAI and xAI.

* Developer Tooling: The new capabilities are accessible through AI SDK 7. For web clients, a `useRealtime` React hook simplifies managing WebSocket connections, microphone capture, and audio playback.

* Availability: The new audio features are currently in beta.
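
To make the barge-in behavior mentioned under Realtime Voice concrete, here is a minimal TypeScript sketch of the turn-taking logic a voice client might run. It is an illustration only: the types `AgentState` and `VoiceEvent` and the functions `step` and `run` are hypothetical names invented for this example and are not part of AI SDK 7, the `useRealtime` hook, or the AI Gateway API.

```typescript
// Hypothetical sketch of barge-in handling. None of these names come from
// AI SDK 7 or AI Gateway; they only illustrate the idea.
type AgentState = "idle" | "listening" | "speaking";

type VoiceEvent =
  | { type: "userSpeechStart" }
  | { type: "userSpeechEnd" }
  | { type: "agentAudioStart" }
  | { type: "agentAudioEnd" };

interface Transition {
  state: AgentState;
  cancelPlayback: boolean; // true only when the user interrupted agent audio
}

// Pure transition function: given the current state and an event,
// return the next state and whether playback should be cancelled.
function step(state: AgentState, event: VoiceEvent): Transition {
  switch (event.type) {
    case "userSpeechStart":
      // Barge-in: if the agent is speaking, stop its audio and listen.
      return { state: "listening", cancelPlayback: state === "speaking" };
    case "userSpeechEnd":
      return { state: state === "listening" ? "idle" : state, cancelPlayback: false };
    case "agentAudioStart":
      // Agent audio that arrives while the user is talking is ignored.
      return { state: state === "listening" ? "listening" : "speaking", cancelPlayback: false };
    case "agentAudioEnd":
      return { state: state === "speaking" ? "idle" : state, cancelPlayback: false };
  }
}

// Replays a sequence of events and counts how often the user barged in.
function run(events: VoiceEvent[]): { state: AgentState; interruptions: number } {
  let state: AgentState = "idle";
  let interruptions = 0;
  for (const event of events) {
    const next = step(state, event);
    if (next.cancelPlayback) interruptions += 1;
    state = next.state;
  }
  return { state, interruptions };
}
```

In a real client, the `cancelPlayback` flag would be the point at which buffered agent audio is discarded and the service is told the response was cut short. A hook that manages the connection, microphone, and playback would presumably hide this bookkeeping from application code.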

Strategic Importance

This update positions AI Gateway as a multimodal AI management layer that extends beyond text and images to meet growing demand for voice-enabled applications. By putting all modalities behind a single API and management console, Vercel simplifies the workflow for developers building complex AI features.
