Multimodal models with audio support
Quick answer
Multimodal models with audio support, such as gpt-4o and its audio-enabled variants, can process audio inputs for transcription and audio understanding. Use the OpenAI SDK to transcribe audio files, then embed the transcription text in chat messages for multimodal tasks. Note that Anthropic's claude-3-5-sonnet-20241022 accepts text and images but not raw audio, so audio must be transcribed first.
Prerequisites
- Python 3.8+
- OpenAI API key (and an Anthropic API key if using Claude)
- pip install "openai>=1.0" or pip install "anthropic>=0.20"
Setup
Install the required Python SDKs and set your API keys as environment variables.
- For OpenAI: pip install openai
- For Anthropic: pip install anthropic
Set environment variables in your shell:
export OPENAI_API_KEY="your_openai_key"
export ANTHROPIC_API_KEY="your_anthropic_key"
You can also install both SDKs in one command: pip install openai anthropic
Step by step
Use OpenAI's whisper-1 model for audio transcription and gpt-4o for chat over the transcribed text. The same transcription text can also be sent to Anthropic's claude-3-5-sonnet-20241022, which accepts text and images but not raw audio.
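Before making the first call, it can help to verify that the required environment variables are actually set, since missing keys otherwise surface as authentication errors. This small helper is illustrative, not part of either SDK:

```python
import os

def missing_api_keys(required=("OPENAI_API_KEY",)):
    """Return the names of required API-key environment variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

missing = missing_api_keys(("OPENAI_API_KEY", "ANTHROPIC_API_KEY"))
if missing:
    print("Set these environment variables first:", ", ".join(missing))
```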
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Transcribe audio using Whisper
with open("audio_sample.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print("Transcription:", transcript.text)

# Chat over the transcription (the audio context is passed as text)
messages = [
    {"role": "user", "content": "Here is the audio transcription: " + transcript.text},
    {"role": "user", "content": "Summarize the main points."},
]
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)
print("Summary:", response.choices[0].message.content)
Output
Transcription: Hello, this is a sample audio for transcription.
Summary: The audio introduces a sample for transcription demonstration.
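Since Claude does not accept raw audio, the Anthropic equivalent of the summary step sends the Whisper transcription as plain text through the Messages API. The sketch below only builds the request dictionary; the commented lines show how it would be passed to anthropic.Anthropic().messages.create, assuming ANTHROPIC_API_KEY is configured:

```python
def build_summary_request(transcription, model="claude-3-5-sonnet-20241022"):
    """Build keyword arguments for an Anthropic Messages API call that summarizes text."""
    return {
        "model": model,
        "max_tokens": 256,
        "messages": [
            {
                "role": "user",
                "content": "Summarize this audio transcription: " + transcription,
            }
        ],
    }

request = build_summary_request("Hello, this is a sample audio for transcription.")
# response = anthropic.Anthropic().messages.create(**request)
# print(response.content[0].text)
```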
Common variations
You can stream chat responses in real time by setting stream=True in OpenAI chat calls. Anthropic's claude-3-5-sonnet-20241022 can also work with transcribed audio: send the transcription text through the Messages API. (The computer-use-2024-10-22 beta flag enables computer-use tools, not audio input.)
Example for streaming chat with OpenAI:
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

messages = [{"role": "user", "content": "Summarize this audio transcription: Hello world."}]
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
print()
Output
The audio transcription "Hello world" is a simple greeting.
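The streaming loop above just concatenates delta fragments as they arrive; the same pattern can be factored into a helper and exercised with any iterable of chunks. The fake_chunk objects here are stand-ins mimicking the shape of the SDK's streaming chunks, used only for illustration:

```python
from types import SimpleNamespace

def collect_stream(stream):
    """Concatenate the content deltas from a chat-completion stream into one string."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        parts.append(delta)
    return "".join(parts)

# Stand-in chunks with the same attribute shape as OpenAI streaming chunks.
def fake_chunk(text):
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

print(collect_stream([fake_chunk("Hello, "), fake_chunk("world."), fake_chunk(None)]))
# prints: Hello, world.
```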
Troubleshooting
- If audio transcription fails, verify the audio file format is supported (mp3, wav, m4a, etc.) and under 25MB for API calls.
- For Anthropic, remember that raw audio is not a supported input type: transcribe the audio first (for example with whisper-1) and send the resulting text in the message content.
- If streaming responses hang, check your network connection and SDK version compatibility.
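A pre-flight check along these lines can catch format and size problems before the API call. The extension list and 25 MB limit follow OpenAI's documented constraints at the time of writing, so verify them against the current docs:

```python
import os

SUPPORTED_EXTENSIONS = {".flac", ".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".ogg", ".wav", ".webm"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB upload limit for the transcription endpoint

def check_audio_file(path):
    """Return a list of problems that would cause the transcription API to reject the file."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if not os.path.exists(path):
        problems.append("file not found")
    elif os.path.getsize(path) > MAX_BYTES:
        problems.append("file exceeds the 25 MB limit")
    return problems
```

Run it before opening the file for upload, and surface the returned problems instead of letting the API call fail.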
Key Takeaways
- Use OpenAI's whisper-1 model for accurate audio transcription via API.
- Combine transcription text with chat models like gpt-4o for multimodal audio understanding.
- Anthropic's claude-3-5-sonnet-20241022 can reason over transcribed audio as text, but it does not accept raw audio input.
- Streaming APIs enable real-time chat responses over transcribed audio.
- Always verify audio format and size limits to avoid transcription errors.