Multimodal models with audio support
Quick answer
Multimodal models with audio support, such as gpt-4o and its audio-enabled variants, can process audio inputs for transcription and audio understanding. Use the OpenAI SDK to transcribe audio files, then embed the transcription text in chat messages for multimodal tasks. Note that Anthropic's claude-3-5-sonnet-20241022 accepts text and images but not raw audio, so audio must be transcribed first.
Prerequisites
- Python 3.8+
- OpenAI API key (and an Anthropic API key if using Claude)
- pip install "openai>=1.0" or pip install "anthropic>=0.20"
Setup
Install the required Python SDKs and set your API keys as environment variables.
- For OpenAI: pip install openai
- For Anthropic: pip install anthropic
Set environment variables in your shell:
export OPENAI_API_KEY="your_openai_key"
export ANTHROPIC_API_KEY="your_anthropic_key"
You can also install both SDKs in one command: pip install openai anthropic
Step by step
Use OpenAI's whisper-1 model for audio transcription and gpt-4o for chat over the transcribed text. The same transcription text can also be sent to Anthropic's claude-3-5-sonnet-20241022, which accepts text and images but not raw audio.
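Before making the first call, it can help to verify that the required environment variables are actually set, since missing keys otherwise surface as authentication errors. This small helper is illustrative, not part of either SDK:

```python
import os

def missing_api_keys(required=("OPENAI_API_KEY",)):
    """Return the names of required API-key environment variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

missing = missing_api_keys(("OPENAI_API_KEY", "ANTHROPIC_API_KEY"))
if missing:
    print("Set these environment variables first:", ", ".join(missing))
```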
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Transcribe audio using Whisper
with open("audio_sample.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print("Transcription:", transcript.text)

# Chat over the transcription (the audio context is passed as text)
messages = [
    {"role": "user", "content": "Here is the audio transcription: " + transcript.text},
    {"role": "user", "content": "Summarize the main points."},
]
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)
print("Summary:", response.choices[0].message.content)
Output
Transcription: Hello, this is a sample audio for transcription.
Summary: The audio introduces a sample for transcription demonstration.
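Since Claude does not accept raw audio, the Anthropic equivalent of the summary step sends the Whisper transcription as plain text through the Messages API. The sketch below only builds the request dictionary; the commented lines show how it would be passed to anthropic.Anthropic().messages.create, assuming ANTHROPIC_API_KEY is configured:

```python
def build_summary_request(transcription, model="claude-3-5-sonnet-20241022"):
    """Build keyword arguments for an Anthropic Messages API call that summarizes text."""
    return {
        "model": model,
        "max_tokens": 256,
        "messages": [
            {
                "role": "user",
                "content": "Summarize this audio transcription: " + transcription,
            }
        ],
    }

request = build_summary_request("Hello, this is a sample audio for transcription.")
# response = anthropic.Anthropic().messages.create(**request)
# print(response.content[0].text)
```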
Common variations
You can stream chat responses in real time by setting stream=True in OpenAI chat calls. Anthropic's claude-3-5-sonnet-20241022 can also work with transcribed audio: send the transcription text through the Messages API. (The computer-use-2024-10-22 beta flag enables computer-use tools, not audio input.)
Example for streaming chat with OpenAI:
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

messages = [{"role": "user", "content": "Summarize this audio transcription: Hello world."}]
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
print()
Output
The audio transcription "Hello world" is a simple greeting.
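The streaming loop above just concatenates delta fragments as they arrive; the same pattern can be factored into a helper and exercised with any iterable of chunks. The fake_chunk objects here are stand-ins mimicking the shape of the SDK's streaming chunks, used only for illustration:

```python
from types import SimpleNamespace

def collect_stream(stream):
    """Concatenate the content deltas from a chat-completion stream into one string."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        parts.append(delta)
    return "".join(parts)

# Stand-in chunks with the same attribute shape as OpenAI streaming chunks.
def fake_chunk(text):
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

print(collect_stream([fake_chunk("Hello, "), fake_chunk("world."), fake_chunk(None)]))
# prints: Hello, world.
```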
Troubleshooting
- If audio transcription fails, verify the audio file format is supported (mp3, wav, m4a, etc.) and under 25MB for API calls.
- For Anthropic, remember that raw audio is not a supported input type: transcribe the audio first (for example with whisper-1) and send the resulting text in the message content.
- If streaming responses hang, check your network connection and SDK version compatibility.
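A pre-flight check along these lines can catch format and size problems before the API call. The extension list and 25 MB limit follow OpenAI's documented constraints at the time of writing, so verify them against the current docs:

```python
import os

SUPPORTED_EXTENSIONS = {".flac", ".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".ogg", ".wav", ".webm"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB upload limit for the transcription endpoint

def check_audio_file(path):
    """Return a list of problems that would cause the transcription API to reject the file."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if not os.path.exists(path):
        problems.append("file not found")
    elif os.path.getsize(path) > MAX_BYTES:
        problems.append("file exceeds the 25 MB limit")
    return problems
```

Run it before opening the file for upload, and surface the returned problems instead of letting the API call fail.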
Key Takeaways
- Use OpenAI's whisper-1 model for accurate audio transcription via API.
- Combine transcription text with chat models like gpt-4o for multimodal audio understanding.
- Anthropic's claude-3-5-sonnet-20241022 can reason over transcribed audio as text, but it does not accept raw audio input.
- Streaming APIs enable real-time chat responses over transcribed audio.
- Always verify audio format and size limits to avoid transcription errors.