How to · Intermediate · 3 min read

How to build video analysis with AI

Quick answer
Use multimodal AI models like gpt-4o or gemini-2.5-pro that support video or frame input to analyze video content. Extract frames from videos, send them to the model for object detection, scene description, or action recognition, then aggregate results for comprehensive video analysis.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key with access to gpt-4o
  • pip install "openai>=1.0" opencv-python
  • ffmpeg installed (OpenCV relies on it to decode most video formats)

Setup

Install required Python packages and ensure ffmpeg is installed on your system for video frame extraction.

bash
pip install openai opencv-python

Step by step

This example samples one frame per second (every 30 frames of a 30 fps video), sends each sampled frame to gpt-4o as a base64-encoded image for object detection and scene description, then prints the analysis.

python
import base64
import os

import cv2
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

VIDEO_PATH = "sample_video.mp4"
FRAME_INTERVAL = 30  # Analyze one frame every 30 frames (~1 sec at 30 fps)

# Extract frames from video
cap = cv2.VideoCapture(VIDEO_PATH)
frame_count = 0

while True:
    ret, frame = cap.read()
    if not ret:
        break
    if frame_count % FRAME_INTERVAL == 0:
        # Encode frame as JPEG, then base64 for the API
        ret2, buffer = cv2.imencode(".jpg", frame)
        if ret2:
            b64 = base64.b64encode(buffer).decode("utf-8")

            # Send the frame as an image_url content part (base64 data URL);
            # chat.completions.create has no files/modalities parameters
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": "Analyze this image for objects and scene description."},
                            {
                                "type": "image_url",
                                "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                            },
                        ],
                    }
                ],
            )
            analysis = response.choices[0].message.content
            print(f"Frame {frame_count}: {analysis}\n")
    frame_count += 1

cap.release()
output
Frame 0: Detected objects: person, car; Scene: urban street with pedestrians.
Frame 30: Detected objects: dog, tree; Scene: park area with greenery.
Frame 60: Detected objects: bicycle, traffic light; Scene: intersection with vehicles waiting.
...
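
Vision-capable chat models accept images as base64 data URLs, and that encoding step is easy to get wrong. A minimal helper, sketched under the assumption that the frame has already been JPEG-encoded (e.g. by cv2.imencode):

```python
import base64

def jpeg_to_data_url(jpeg_bytes: bytes) -> str:
    # Wrap raw JPEG bytes in the data-URL form the chat API accepts for images
    b64 = base64.b64encode(jpeg_bytes).decode("utf-8")
    return f"data:image/jpeg;base64,{b64}"

# With OpenCV, pass the encoded buffer as bytes: jpeg_to_data_url(buffer.tobytes())
```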

Common variations

  • Use gemini-2.5-pro or claude-3-5-sonnet-20241022 for alternative multimodal video frame analysis.
  • Implement asynchronous calls to process frames in parallel for faster throughput.
  • Use streaming APIs to analyze live video feeds frame-by-frame.
  • Combine frame-level analysis with temporal models for action recognition and event detection.
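
The parallel-processing variation can be sketched with asyncio. Here analyze_frame is a hypothetical placeholder for the real model call (e.g. via AsyncOpenAI), and a semaphore caps in-flight requests to stay under rate limits:

```python
import asyncio

# Placeholder standing in for a real API call; swap the body for your client call
async def analyze_frame(index: int, data_url: str) -> str:
    await asyncio.sleep(0)  # simulates network latency
    return f"analysis of frame {index}"

async def analyze_all(frames, max_concurrent: int = 5):
    sem = asyncio.Semaphore(max_concurrent)

    async def worker(i, url):
        async with sem:  # limit concurrent requests
            return await analyze_frame(i, url)

    # gather preserves input order even though the calls overlap
    return await asyncio.gather(*(worker(i, u) for i, u in frames))

results = asyncio.run(analyze_all([(0, "u0"), (30, "u1"), (60, "u2")]))
```

Because gather returns results in submission order, the per-frame analyses still line up with their frame indices.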

Troubleshooting

  • If frames fail to encode, verify opencv-python installation and video codec support.
  • If the API returns errors on image input, confirm the model supports vision and that you send each frame as a base64 data URL inside an image_url content part (chat.completions.create has no files or modalities parameters).
  • For slow processing, reduce frame extraction rate or use batch processing.
  • Check API key and environment variables if authentication fails.
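
On reducing the frame extraction rate: rather than assuming 30 fps, the interval can be derived from the video's actual frame rate (readable via cap.get(cv2.CAP_PROP_FPS)). A minimal sketch of the arithmetic, with frame_interval as an illustrative helper name:

```python
def frame_interval(fps: float, seconds_between_samples: float = 1.0) -> int:
    # Number of source frames to skip so ~one frame per sampling window is analyzed;
    # guard against fps == 0, which OpenCV reports when metadata is missing
    return max(1, round(fps * seconds_between_samples))

# e.g. fps = cap.get(cv2.CAP_PROP_FPS); FRAME_INTERVAL = frame_interval(fps, 2.0)
```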

Key Takeaways

  • Extract video frames and send them as images to multimodal models like gpt-4o for analysis.
  • Send frames as base64 data URLs in image_url content parts of chat messages; the OpenAI SDK has no files or modalities parameters for this.
  • Process frames asynchronously or in batches to improve video analysis speed.
  • Combine frame-level insights with temporal reasoning for richer video understanding.
Verified 2026-04 · gpt-4o, gemini-2.5-pro, claude-3-5-sonnet-20241022