How-to · Beginner · 4 min read

Multimodal AI use cases

Quick answer
Multimodal AI combines multiple data types like text, images, audio, and video to enable applications such as image captioning, video summarization, voice assistants, and content moderation. Models like gpt-4o and gemini-2.5-pro support multimodal inputs for richer, context-aware AI experiences.

Prerequisites

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quoted so the shell does not treat >= as redirection)

Setup

Install the openai Python SDK and set your API key as an environment variable to access multimodal models like gpt-4o.

bash
pip install "openai>=1.0"
export OPENAI_API_KEY="your-key-here"

Step by step

Use the OpenAI SDK to send multimodal inputs combining text and images to gpt-4o. The example below shows how to send an image URL with a text prompt for captioning.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        ],
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages
)

print(response.choices[0].message.content)
output
A cute cat sitting on a windowsill with sunlight streaming in.
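For images that are not hosted anywhere, the same `image_url` slot also accepts a base64 data URL. A minimal sketch, assuming a local file; the helper name `to_data_url` and the path `cat.jpg` are illustrative, not part of the SDK:

```python
import base64
import mimetypes

def to_data_url(path: str) -> str:
    """Encode a local image file as a data URL for the image_url field."""
    mime = mimetypes.guess_type(path)[0] or "image/jpeg"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# Drops into the same slot as a hosted URL:
# {"type": "image_url", "image_url": {"url": to_data_url("cat.jpg")}}
```

This avoids uploading the file to a public server just to caption it.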

Common variations

You can extend multimodal use cases to audio transcription or video summarization. Models like gemini-2.5-pro accept native audio and video input through their own APIs; the OpenAI chat API takes images, so video is typically handled by sending extracted frames. Async calls and streaming outputs are also supported for real-time applications.

python
import asyncio
import os

from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    # The chat API accepts images, not raw video: send extracted
    # frames as image URLs and ask the model to summarize them.
    frame_urls = [
        "https://example.com/frame1.jpg",
        "https://example.com/frame2.jpg",
    ]
    content = [{"type": "text", "text": "Summarize these video frames."}]
    content += [
        {"type": "image_url", "image_url": {"url": url}} for url in frame_urls
    ]
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    print(response.choices[0].message.content)

asyncio.run(main())
output
The video shows a person cooking a meal step-by-step in a kitchen.
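Streaming returns the reply incrementally instead of all at once. A sketch, assuming OPENAI_API_KEY is set; the helpers `collect_stream` and `stream_caption` are illustrative names, and the API call is wrapped in a function so nothing runs on import:

```python
import os

def collect_stream(chunks) -> str:
    """Concatenate the text deltas from a streamed chat completion."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts)

def stream_caption(image_url: str) -> str:
    # SDK import kept local so the pure helper above has no extra deps
    from openai import OpenAI

    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    stream = client.chat.completions.create(
        model="gpt-4o",
        stream=True,  # yields incremental chunks instead of one response
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe the image."},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    )
    return collect_stream(stream)
```

In a real-time UI you would print each delta as it arrives rather than joining them at the end.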

Troubleshooting

  • If you receive errors about unsupported input types, verify your model supports multimodal inputs like images or video.
  • Ensure your API key has access to the multimodal models.
  • For large files, use URLs or chunk inputs to avoid size limits.
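For the last point, a common pattern is to thin out and batch frame lists so each request stays under the size limit. A minimal sketch; the helper names and the sampling/batch sizes are illustrative:

```python
def sample_frames(frames: list, every_n: int = 5) -> list:
    """Keep every n-th frame to cut the payload size."""
    return frames[::every_n]

def batch_frames(frames: list, batch_size: int = 10) -> list:
    """Split a frame list into fixed-size batches, one request each."""
    return [frames[i:i + batch_size] for i in range(0, len(frames), batch_size)]
```

Each batch then becomes one request, and the per-batch summaries can be summarized once more in a final call.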

Key Takeaways

  • Use multimodal models like gpt-4o and gemini-2.5-pro for combined text, image, audio, and video inputs.
  • Common use cases include image captioning, video summarization, voice assistants, and content moderation.
  • Always check model documentation for supported input types and size limits.
  • Async and streaming APIs enable real-time multimodal applications.
  • Set API keys securely via environment variables and use the latest SDK patterns.
Verified 2026-04 · gpt-4o, gemini-2.5-pro