How-to · Beginner · 3 min read

Audio input to LLM explained

Quick answer
Audio input to a large language model (LLM) is typically handled by first converting the audio into text using a speech-to-text model like Whisper. This text transcription is then fed into the LLM for understanding or generation tasks, enabling multimodal interaction with voice data.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the openai Python package and set your API key as an environment variable to access speech-to-text and LLM services.

bash
pip install "openai>=1.0"
export OPENAI_API_KEY="your-api-key"

Step by step

This example shows how to transcribe an audio file using OpenAI's whisper-1 model, then send the transcription text to a gpt-4o-mini chat model for further processing.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Step 1: Transcribe audio to text
with open("audio_sample.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

print("Transcription:", transcript.text)

# Step 2: Use transcription as input to LLM
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": transcript.text}]
)

print("LLM response:", response.choices[0].message.content)
output
Transcription: Hello, this is a test audio input.
LLM response: Hi! How can I assist you with this audio today?

Common variations

  • Use asynchronous calls with asyncio for non-blocking transcription and chat.
  • Stream audio transcription or LLM responses for real-time applications.
  • Switch models, e.g., a larger chat model such as gpt-4o for higher-quality answers, or a different speech-to-text model.
python
import asyncio
import os

from openai import AsyncOpenAI

async def transcribe_and_chat():
    # AsyncOpenAI exposes awaitable versions of the same endpoints
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

    # Step 1: Transcribe audio to text (awaited, non-blocking)
    with open("audio_sample.mp3", "rb") as audio_file:
        transcript = await client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file
        )

    print("Transcription:", transcript.text)

    # Step 2: Stream the chat completion token by token
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": transcript.text}],
        stream=True
    )

    print("LLM response (streaming):", end=" ")
    async for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)
    print()

asyncio.run(transcribe_and_chat())
output
Transcription: Hello, this is a test audio input.
LLM response (streaming): Hi! How can I assist you with this audio today?

Troubleshooting

  • If you get Invalid API key, verify your OPENAI_API_KEY environment variable is set correctly.
  • If audio transcription fails, check that the audio file format is supported (mp3, wav, m4a, etc.) and under 25MB for API use.
  • For slow responses, consider using smaller models like gpt-4o-mini or batching requests.
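The format and size checks above can also be done client-side before uploading, which fails fast instead of burning an API round trip. A minimal sketch — the helper name and the extension list are illustrative; adjust them to the formats the API actually accepts:

python
import os

# Limits mirroring the troubleshooting notes above
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB API upload limit

def check_audio_file(path: str) -> None:
    """Raise ValueError if the file looks unusable for transcription."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported audio format: {ext}")
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        raise ValueError(f"File is {size} bytes; the API limit is {MAX_BYTES}")

Call check_audio_file("audio_sample.mp3") right before opening the file for transcription; for files over the limit, split or re-encode the audio first.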

Key Takeaways

  • Convert audio to text using speech-to-text models like whisper-1 before sending to an LLM.
  • Use the transcription text as input to chat or completion models for multimodal AI applications.
  • Leverage streaming and async calls for real-time audio processing and response generation.
Verified 2026-04 · whisper-1, gpt-4o-mini