How-to · Intermediate · 3 min read

How to use Whisper with speaker diarization

Quick answer
The Whisper API does not natively support speaker diarization, so combine whisper-1 transcription with a separate diarization tool such as pyannote.audio or SpeechBrain: run diarization to get speaker-labeled time segments, then transcribe each segment (or align one full transcript to the segments) and attach the speaker labels.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key
  • pip install openai>=1.0
  • pip install pyannote.audio or speechbrain for diarization

Setup

Install the OpenAI Python SDK and a speaker diarization library such as pyannote.audio, and set your OpenAI API key as an environment variable. Note that the pyannote diarization model is gated on Hugging Face: accept its terms on the model page and create an access token before loading the pipeline.

bash
pip install openai pyannote.audio torch torchaudio

Step by step

This example shows how to transcribe audio with Whisper and perform speaker diarization with pyannote.audio. It first diarizes the audio to get speaker segments, then transcribes each segment separately with Whisper, assigning speaker labels.

python
import os
import subprocess

from openai import OpenAI
from pyannote.audio import Pipeline

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Load the pyannote speaker diarization pipeline.
# The model is gated on Hugging Face: accept its terms and pass an access token.
diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token=os.environ.get("HF_TOKEN"),
)

# Path to audio file
AUDIO_FILE = "audio.wav"

# Step 1: Perform speaker diarization
# This returns speaker segments with start/end times
speaker_segments = diarization_pipeline(AUDIO_FILE)

transcripts = []

# Step 2: Transcribe each speaker segment with Whisper
for turn, _, speaker in speaker_segments.itertracks(yield_label=True):
    start = turn.start
    end = turn.end
    # Extract the segment audio with ffmpeg (re-encoding, so cuts are sample-accurate)
    segment_file = f"segment_{start:.2f}_{end:.2f}.wav"
    subprocess.run(
        ["ffmpeg", "-y", "-i", AUDIO_FILE, "-ss", str(start), "-to", str(end), segment_file],
        check=True,
        capture_output=True,
    )

    # Open segment audio file
    with open(segment_file, "rb") as f:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=f
        )

    os.remove(segment_file)  # clean up the temporary segment file
    transcripts.append(f"{speaker}: {transcription.text.strip()}")

# Output combined diarized transcript
print("\n".join(transcripts))
output
SPEAKER_00: Hello everyone, welcome to the meeting.
SPEAKER_01: Thanks, glad to be here.
SPEAKER_00: Let's start with the project updates.
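Transcribing one segment per speaker turn makes many API calls. An alternative is to transcribe the whole file once (for example with `response_format="verbose_json"`, which returns segment timestamps) and then assign a speaker to each transcript segment by time overlap with the diarization turns. A minimal sketch of that alignment step, with hypothetical segment data standing in for real API and pipeline output:

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(whisper_segments, diarization_turns):
    """whisper_segments: [{'start', 'end', 'text'}];
    diarization_turns: [(start, end, speaker)].
    Labels each transcript segment with the speaker it overlaps most."""
    labeled = []
    for seg in whisper_segments:
        best = max(
            diarization_turns,
            key=lambda t: overlap(seg["start"], seg["end"], t[0], t[1]),
        )
        labeled.append((best[2], seg["text"]))
    return labeled

# Hypothetical data illustrating the shape of the inputs
segments = [
    {"start": 0.0, "end": 2.5, "text": "Hello everyone."},
    {"start": 2.6, "end": 4.0, "text": "Thanks, glad to be here."},
]
turns = [(0.0, 2.4, "SPEAKER_00"), (2.4, 4.2, "SPEAKER_01")]

for speaker, text in assign_speakers(segments, turns):
    print(f"{speaker}: {text}")
```

This trades per-turn accuracy for a single transcription call; it works well when turns are longer than Whisper's segments.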

Common variations

  • Use speechbrain for diarization as an alternative to pyannote.audio.
  • Run diarization asynchronously or batch process multiple files.
  • Use local Whisper models with openai-whisper package for offline transcription.
  • Adjust diarization pipeline parameters for accuracy vs speed trade-offs.
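A small optimization that combines well with any of these variations: merge consecutive turns from the same speaker before extracting and transcribing segments, which reduces the number of ffmpeg cuts and API calls. A hedged sketch, where the `(start, end, speaker)` tuple format mirrors what `itertracks` yields and `max_gap` is an illustrative threshold:

```python
def merge_turns(turns, max_gap=0.5):
    """Merge consecutive (start, end, speaker) turns when the speaker is
    the same and the silence between them is at most max_gap seconds."""
    merged = []
    for start, end, speaker in turns:
        if merged and merged[-1][2] == speaker and start - merged[-1][1] <= max_gap:
            # Extend the previous turn instead of starting a new one
            merged[-1] = (merged[-1][0], end, speaker)
        else:
            merged.append((start, end, speaker))
    return merged

turns = [
    (0.0, 1.0, "SPEAKER_00"),
    (1.2, 2.0, "SPEAKER_00"),  # same speaker, 0.2 s gap -> merged
    (2.1, 3.0, "SPEAKER_01"),
]
print(merge_turns(turns))
# → [(0.0, 2.0, 'SPEAKER_00'), (2.1, 3.0, 'SPEAKER_01')]
```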

Troubleshooting

  • If diarization segments overlap or are inaccurate, try tuning the diarization model or use higher quality audio.
  • Ensure ffmpeg is installed and accessible for audio segment extraction.
  • Check your OpenAI API key and usage limits if transcription fails.
  • For large audio files, split into smaller chunks before diarization to avoid memory issues.
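For the large-file case, computing chunk boundaries up front keeps memory bounded; each chunk can then be cut with ffmpeg and diarized independently. A small overlap between chunks avoids cutting a word or speaker turn exactly at a boundary. A sketch with illustrative chunk-length and overlap values:

```python
def chunk_boundaries(duration, chunk_len=300.0, overlap=5.0):
    """Return (start, end) pairs covering `duration` seconds in chunks of
    chunk_len seconds, each overlapping the previous chunk by `overlap`."""
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_len, duration)
        bounds.append((start, end))
        if end >= duration:
            break
        start = end - overlap
    return bounds

# An 11-minute (660 s) file in 5-minute chunks with 5 s of overlap
print(chunk_boundaries(660))
# → [(0.0, 300.0), (295.0, 595.0), (590.0, 660.0)]
```

Speaker labels are assigned per chunk, so a post-processing pass is needed to reconcile labels across chunk boundaries (e.g. by matching speakers in the overlap region).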

Key Takeaways

  • Whisper API alone does not support speaker diarization; combine it with a diarization tool like pyannote.audio.
  • Segment audio by speaker, then transcribe each segment with Whisper to assign speaker labels.
  • Use ffmpeg or similar tools to extract audio segments for diarization and transcription.
  • Ensure environment setup includes OpenAI API key and diarization model dependencies.
  • Tune diarization parameters and audio quality for best speaker separation results.
Verified 2026-04 · whisper-1, pyannote/speaker-diarization