How to use Whisper with speaker diarization
Quick answer
The OpenAI Whisper API (whisper-1) does not natively support speaker diarization. Combine Whisper transcription with a diarization tool such as pyannote.audio or speechbrain to segment the audio by speaker, then align the transcripts accordingly.
Prerequisites
- Python 3.8+
- OpenAI API key
- pip install openai>=1.0
- pip install pyannote.audio or speechbrain for diarization
Setup
Install the OpenAI Python SDK and a speaker diarization library such as pyannote.audio. Set your OpenAI API key as an environment variable. Note that pyannote.audio's pretrained pipelines are gated on Hugging Face: you must accept the model's terms there and authenticate with a Hugging Face access token.
pip install openai pyannote.audio torch torchaudio
Step by step
This example shows how to transcribe audio with Whisper and perform speaker diarization with pyannote.audio. It first diarizes the audio to get speaker segments, then transcribes each segment separately with Whisper, assigning speaker labels.
import os
from openai import OpenAI
from pyannote.audio import Pipeline
# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Load pyannote speaker diarization pipeline (requires huggingface token setup)
# HF_TOKEN is your Hugging Face access token (set it in your environment)
diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token=os.environ["HF_TOKEN"]
)
# Path to audio file
AUDIO_FILE = "audio.wav"
# Step 1: Perform speaker diarization
# This returns speaker segments with start/end times
speaker_segments = diarization_pipeline(AUDIO_FILE)
transcripts = []
# Step 2: Transcribe each speaker segment with Whisper
for turn, _, speaker in speaker_segments.itertracks(yield_label=True):
    start = turn.start
    end = turn.end
    # Extract segment audio with ffmpeg (re-encode rather than "-c copy",
    # which cannot cut accurately at arbitrary timestamps)
    segment_file = f"segment_{start:.2f}_{end:.2f}.wav"
    os.system(f'ffmpeg -y -i "{AUDIO_FILE}" -ss {start} -to {end} "{segment_file}"')
    # Open the segment and transcribe it with Whisper
    with open(segment_file, "rb") as f:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=f
        )
    transcripts.append(f"{speaker}: {transcription.text.strip()}")
# Output combined diarized transcript
print("\n".join(transcripts))
Output
Speaker1: Hello everyone, welcome to the meeting.
Speaker2: Thanks, glad to be here.
Speaker1: Let's start with the project updates.
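Diarization often produces many short turns, so the same speaker can appear on several consecutive lines. A small post-processing sketch that merges consecutive lines sharing a speaker label (it assumes the "Speaker: text" format built by the loop above):

```python
def merge_turns(lines):
    """Merge consecutive 'Speaker: text' lines that share a speaker label."""
    merged = []
    for line in lines:
        speaker, text = line.split(": ", 1)
        if merged and merged[-1][0] == speaker:
            # Same speaker as the previous line: append the text
            merged[-1] = (speaker, merged[-1][1] + " " + text)
        else:
            merged.append((speaker, text))
    return [f"{s}: {t}" for s, t in merged]

print(merge_turns([
    "Speaker1: Hello everyone,",
    "Speaker1: welcome to the meeting.",
    "Speaker2: Thanks, glad to be here.",
]))
```

Run this on the transcripts list before printing to get one line per speaker turn.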
Common variations
- Use speechbrain for diarization as an alternative to pyannote.audio.
- Run diarization asynchronously or batch process multiple files.
- Use local Whisper models with the openai-whisper package for offline transcription.
- Adjust diarization pipeline parameters for accuracy vs. speed trade-offs.
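Another variation is to transcribe the whole file once with segment timestamps and assign speakers by time overlap, rather than cutting the audio per turn. The overlap assignment itself is plain arithmetic; a sketch with hypothetical timestamp data standing in for Whisper segments and pyannote turns:

```python
def assign_speakers(whisper_segments, speaker_turns):
    """Label each transcript segment with the speaker whose turn overlaps it most."""
    labeled = []
    for seg_start, seg_end, text in whisper_segments:
        best_speaker, best_overlap = "unknown", 0.0
        for turn_start, turn_end, speaker in speaker_turns:
            # Length of the intersection of the two time intervals
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append(f"{best_speaker}: {text}")
    return labeled

# Hypothetical (start, end, text) segments and (start, end, speaker) turns
segments = [(0.0, 2.5, "Hello everyone."), (2.6, 4.0, "Thanks, glad to be here.")]
turns = [(0.0, 2.5, "Speaker1"), (2.5, 4.1, "Speaker2")]
print(assign_speakers(segments, turns))
```

This avoids one API call per turn and keeps Whisper's sentence context intact, at the cost of relying on timestamp accuracy.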
Troubleshooting
- If diarization segments overlap or are inaccurate, try tuning the diarization model or use higher quality audio.
- Ensure ffmpeg is installed and accessible for audio segment extraction.
- Check your OpenAI API key and usage limits if transcription fails.
- For large audio files, split into smaller chunks before diarization to avoid memory issues.
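The chunk boundaries for splitting a large file can be computed up front and then passed to ffmpeg one pair at a time. A minimal sketch; the 10-minute default chunk size is an arbitrary choice:

```python
def chunk_bounds(duration_s: float, chunk_s: float = 600.0):
    """Yield (start, end) pairs covering the full duration in fixed-size chunks."""
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        yield (start, end)
        start = end

# e.g. a 25-minute file split into 10-minute chunks
print(list(chunk_bounds(1500.0)))
# → [(0.0, 600.0), (600.0, 1200.0), (1200.0, 1500.0)]
```

Each (start, end) pair can be fed to the same ffmpeg extraction command used in the main example.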
Key Takeaways
- Whisper API alone does not support speaker diarization; combine it with a diarization tool like pyannote.audio.
- Segment audio by speaker, then transcribe each segment with Whisper to assign speaker labels.
- Use ffmpeg or similar tools to extract audio segments for diarization and transcription.
- Ensure environment setup includes OpenAI API key and diarization model dependencies.
- Tune diarization parameters and audio quality for best speaker separation results.