
How to use WhisperX

Quick answer
Install WhisperX via pip, load your audio file, and run transcription followed by forced alignment to get word-level timestamps. WhisperX extends OpenAI Whisper with precise word timing and speaker diarization.

PREREQUISITES

  • Python 3.8+
  • pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
  • pip install whisperx
  • ffmpeg installed and in system PATH

Setup

Install WhisperX and its dependencies, including PyTorch. ffmpeg must be installed separately and available on your system PATH, since it handles audio decoding.

bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
pip install whisperx
# Install ffmpeg separately if not installed (e.g., brew install ffmpeg on macOS or apt install ffmpeg on Ubuntu)

Step by step

Load an audio file and run WhisperX transcription with forced alignment to get word-level timestamps.

python
import whisperx

device = "cuda"  # or "cpu"

# Load the transcription model
model = whisperx.load_model("large", device)

# Load audio and transcribe
audio = whisperx.load_audio("audio.mp3")
result = model.transcribe(audio)

# Load an alignment model for the detected language, then align for word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
aligned_result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# Print word-level timestamps
for word in aligned_result["word_segments"]:
    print(f"{word['word']}: start {word['start']:.2f}s, end {word['end']:.2f}s")
output
Hello: start 0.00s, end 0.50s
world: start 0.51s, end 1.00s
...
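The word_segments above are plain Python dicts, so post-processing needs no extra libraries. As a minimal sketch (assuming only the word/start/end keys shown above; to_caption_lines is a name invented for this example), here is a helper that groups words into caption lines:

```python
def to_caption_lines(word_segments, max_words=7):
    """Group word-level timestamps into caption lines of at most max_words words.

    Returns a list of (start, end, text) tuples, one per caption line.
    """
    lines = []
    for i in range(0, len(word_segments), max_words):
        chunk = word_segments[i:i + max_words]
        text = " ".join(w["word"] for w in chunk)
        lines.append((chunk[0]["start"], chunk[-1]["end"], text))
    return lines

words = [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "world", "start": 0.51, "end": 1.0},
]
print(to_caption_lines(words, max_words=2))
# → [(0.0, 1.0, 'Hello world')]
```

The same pattern extends naturally to writing SRT or VTT subtitle files, since each tuple already carries the timing a cue needs.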

Common variations

  • Use different Whisper models like small, medium, or large-v2 for speed vs accuracy trade-offs.
  • Run on CPU by setting device="cpu" if no GPU is available.
  • Use whisperx.DiarizationPipeline together with whisperx.assign_word_speakers to add speaker labels for multi-speaker audio (this requires a Hugging Face token for the pyannote models).
python
import whisperx

device = "cpu"
model = whisperx.load_model("small", device, compute_type="int8")  # int8 keeps CPU memory use low
audio = whisperx.load_audio("audio.mp3")
result = model.transcribe(audio)

align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
aligned_result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# Optional: speaker diarization (pyannote models require a Hugging Face token)
pipeline = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)  # placeholder token
diarize_segments = pipeline(audio)
result_with_speakers = whisperx.assign_word_speakers(diarize_segments, aligned_result)

print(result_with_speakers["word_segments"])
print(diarize_segments)
output
[{'word': 'Hello', 'start': 0.0, 'end': 0.5, 'speaker': 'SPEAKER_00', ...}, ...]
(a table of diarization turns with start, end, and speaker columns)
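Under the hood, matching diarization turns to words comes down to interval overlap: each word gets the speaker whose turn covers most of it. A toy sketch of that idea (not WhisperX's actual implementation; assign_speakers and the input shapes here are illustrative):

```python
def assign_speakers(word_segments, speaker_turns):
    """Label each word with the speaker whose turn overlaps it most (naive sketch)."""
    labeled = []
    for w in word_segments:
        best, best_overlap = None, 0.0
        for turn in speaker_turns:
            # Overlap of [w.start, w.end] with [turn.start, turn.end]; negative means disjoint
            overlap = min(w["end"], turn["end"]) - max(w["start"], turn["start"])
            if overlap > best_overlap:
                best, best_overlap = turn["speaker"], overlap
        labeled.append({**w, "speaker": best})
    return labeled

words = [{"word": "Hello", "start": 0.0, "end": 0.5},
         {"word": "world", "start": 0.51, "end": 1.0}]
turns = [{"speaker": "SPEAKER_00", "start": 0.0, "end": 1.2}]
print(assign_speakers(words, turns))
# → both words labeled 'SPEAKER_00'
```

Words falling outside every turn keep speaker None, which mirrors how unassigned words can surface in real diarization output.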

Troubleshooting

  • If you get ffmpeg not found errors, ensure ffmpeg is installed and in your system PATH.
  • For CUDA errors, verify your GPU drivers and CUDA toolkit are properly installed.
  • If transcription is slow, try smaller models or run on CPU if GPU memory is limited.
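The first two checks above can be automated before running any transcription. A small environment probe using only the standard library (check_environment is a helper written for this guide, not part of WhisperX):

```python
import importlib.util
import shutil

def check_environment():
    """Report whether ffmpeg is on PATH and whether a CUDA-capable torch is installed."""
    report = {"ffmpeg": shutil.which("ffmpeg") is not None}
    if importlib.util.find_spec("torch") is not None:
        import torch
        report["torch"] = True
        report["cuda"] = torch.cuda.is_available()
    else:
        report["torch"] = False
        report["cuda"] = False
    return report

print(check_environment())
```

If "ffmpeg" is False, fix your PATH or reinstall ffmpeg; if "cuda" is False, fall back to device="cpu" rather than debugging mid-transcription.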

Key Takeaways

  • Install WhisperX with PyTorch and ffmpeg for full functionality.
  • Use forced alignment to get precise word-level timestamps.
  • Choose model size based on your accuracy and speed needs.
  • Enable speaker diarization for multi-speaker audio transcription.
  • Check system dependencies like ffmpeg and CUDA to avoid runtime errors.
Verified 2026-04 · large, small, large-v2