How-to · Beginner · 3 min read

Audio input to LLM explained

Quick answer
Audio input to a large language model (LLM) is typically handled by first converting the audio into text using a speech-to-text model like Whisper. This text transcription is then fed into the LLM for understanding or generation tasks, enabling multimodal interaction with voice data.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the openai Python package and set your API key as an environment variable to access speech-to-text and LLM services.

bash
pip install "openai>=1.0"
export OPENAI_API_KEY="your-api-key"

Step by step

This example shows how to transcribe an audio file using OpenAI's whisper-1 model, then send the transcription text to a gpt-4o-mini chat model for further processing.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Step 1: Transcribe audio to text
with open("audio_sample.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

print("Transcription:", transcript.text)

# Step 2: Use transcription as input to LLM
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": transcript.text}]
)

print("LLM response:", response.choices[0].message.content)
output
Transcription: Hello, this is a test audio input.
LLM response: Hi! How can I assist you with this audio today?

Common variations

  • Use asynchronous calls with asyncio for non-blocking transcription and chat.
  • Stream audio transcription or LLM responses for real-time applications.
  • Switch models, e.g., a larger chat model such as gpt-4o for higher-quality answers, or a different speech-to-text model.
python
import asyncio
import os

from openai import AsyncOpenAI

async def transcribe_and_chat():
    # AsyncOpenAI exposes awaitable versions of the same endpoints
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

    # Step 1: Transcribe audio to text (awaited, non-blocking)
    with open("audio_sample.mp3", "rb") as audio_file:
        transcript = await client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file
        )

    print("Transcription:", transcript.text)

    # Step 2: Stream the chat completion token by token
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": transcript.text}],
        stream=True
    )

    print("LLM response (streaming):", end=" ")
    async for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)
    print()

asyncio.run(transcribe_and_chat())
output
Transcription: Hello, this is a test audio input.
LLM response (streaming): Hi! How can I assist you with this audio today?

Troubleshooting

  • If you get Invalid API key, verify your OPENAI_API_KEY environment variable is set correctly.
  • If audio transcription fails, check that the audio file format is supported (mp3, wav, m4a, etc.) and under 25MB for API use.
  • For slow responses, consider using smaller models like gpt-4o-mini or batching requests.
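The format and size checks above can also be done client-side before uploading, which fails fast instead of burning an API round trip. A minimal sketch — the helper name and the extension list are illustrative; adjust them to the formats the API actually accepts:

python
import os

# Limits mirroring the troubleshooting notes above
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB API upload limit

def check_audio_file(path: str) -> None:
    """Raise ValueError if the file looks unusable for transcription."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported audio format: {ext}")
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        raise ValueError(f"File is {size} bytes; the API limit is {MAX_BYTES}")

Call check_audio_file("audio_sample.mp3") right before opening the file for transcription; for files over the limit, split or re-encode the audio first.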

Key Takeaways

  • Convert audio to text using speech-to-text models like whisper-1 before sending to an LLM.
  • Use the transcription text as input to chat or completion models for multimodal AI applications.
  • Leverage streaming and async calls for real-time audio processing and response generation.
Verified 2026-04 · whisper-1, gpt-4o-mini