Multimodal AI use cases
Quick answer
Multimodal AI combines multiple data types like text, images, audio, and video to enable applications such as image captioning, video summarization, voice assistants, and content moderation. Models like gpt-4o and gemini-2.5-pro support multimodal inputs for richer, context-aware AI experiences.
Prerequisites
- Python 3.8+
- An OpenAI API key (the free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python SDK and set your API key as an environment variable to access multimodal models like gpt-4o.
pip install "openai>=1.0"

Step by step
Use the OpenAI SDK to send multimodal inputs combining text and images to gpt-4o. The example below shows how to send an image URL with a text prompt for captioning.
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Image inputs go in a content list alongside the text prompt,
# each part tagged with its type.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        ],
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)

print(response.choices[0].message.content)

Output
A cute cat sitting on a windowsill with sunlight streaming in.
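When the image is a local file rather than a URL, a common pattern is to embed it as a base64 data URL, which the same image_url field accepts. The helper below is an illustrative sketch, not part of the SDK:

```python
import base64
import mimetypes


def image_data_url(path: str) -> str:
    """Encode a local image file as a data URL usable in an image_url field."""
    mime, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    # Fall back to image/jpeg if the extension is unrecognized.
    return f"data:{mime or 'image/jpeg'};base64,{b64}"


# Use the data URL exactly like a remote URL in the message content:
# {"type": "image_url", "image_url": {"url": image_data_url("cat.jpg")}}
```

This avoids hosting the file anywhere; note that base64 inflates the payload by roughly a third, so size limits bite sooner.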
Common variations
You can extend multimodal use cases with audio inputs for transcription or video frames for summarization. Gemini models such as gemini-2.5-pro accept video natively through Google's API; the OpenAI Chat Completions API has no video content type, so a common workaround is to send sampled frames as images. Async calls and streaming outputs are also supported for real-time applications.
import asyncio
import os

from openai import AsyncOpenAI


async def main():
    # The async client is AsyncOpenAI; the synchronous OpenAI client
    # cannot be awaited.
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

    # Chat Completions has no video content type, so send sampled
    # frames as individual images alongside the prompt.
    frame_urls = [
        "https://example.com/frame1.jpg",
        "https://example.com/frame2.jpg",
    ]
    content = [{"type": "text", "text": "Summarize these video frames."}]
    content += [
        {"type": "image_url", "image_url": {"url": url}} for url in frame_urls
    ]

    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    print(response.choices[0].message.content)


asyncio.run(main())

Output
The video shows a person cooking a meal step-by-step in a kitchen.
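For real-time applications, pass stream=True and the reply arrives incrementally as chunks, each carrying a delta with a fragment of text. The assembly helper below is a hypothetical convenience, not an SDK function:

```python
def collect_stream_text(stream) -> str:
    """Concatenate the text fragments from a Chat Completions stream."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk's delta content is None
            parts.append(delta)
    return "".join(parts)


# With a configured client, streaming looks like:
# stream = client.chat.completions.create(
#     model="gpt-4o", messages=messages, stream=True
# )
# print(collect_stream_text(stream))
```

In an interactive UI you would print each delta as it arrives instead of collecting them, which is what makes streaming feel responsive.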
Troubleshooting
- If you receive errors about unsupported input types, verify your model supports multimodal inputs like images or video.
- Ensure your API key has access to the multimodal models.
- For large files, use URLs or chunk inputs to avoid size limits.
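For text inputs that exceed size limits (long transcripts, for example), one workable pattern is to split the input into bounded chunks, summarize each, then combine the partial summaries in a final request. The splitter below is an illustrative sketch:

```python
def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking on
    whitespace where possible so words are not cut mid-token."""
    chunks = []
    while len(text) > max_chars:
        cut = text.rfind(" ", 0, max_chars)
        if cut <= 0:  # no space in range; hard-cut at the limit
            cut = max_chars
        chunks.append(text[:cut].strip())
        text = text[cut:].strip()
    if text:
        chunks.append(text)
    return chunks


# Each chunk can then be sent as its own user message, and the partial
# summaries combined in one last summarization call.
```

Character counts are only a proxy for token limits; for precise budgeting, count tokens with the model's tokenizer instead.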
Key Takeaways
- Use multimodal models like gpt-4o and gemini-2.5-pro for combined text, image, audio, and video inputs.
- Common use cases include image captioning, video summarization, voice assistants, and content moderation.
- Always check model documentation for supported input types and size limits.
- Async and streaming APIs enable real-time multimodal applications.
- Set API keys securely via environment variables and use the latest SDK patterns.