
How to use WhisperX

Quick answer
Install WhisperX via pip, load your audio file, and run transcription followed by forced alignment to get word-level timestamps. WhisperX extends OpenAI Whisper with precise word timing and speaker diarization.

PREREQUISITES

  • Python 3.8+
  • pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
  • pip install whisperx
  • ffmpeg installed and in system PATH

Setup

Install WhisperX and its dependencies, including PyTorch. ffmpeg must be installed separately and available on your system PATH, since it handles audio decoding.

bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
pip install whisperx
# Install ffmpeg separately if not installed (e.g., brew install ffmpeg on macOS or apt install ffmpeg on Ubuntu)

Step by step

Load an audio file and run WhisperX transcription with forced alignment to get word-level timestamps.

python
import whisperx

device = "cuda"  # or "cpu"

# Load the transcription model
model = whisperx.load_model("large", device)

# Load audio and transcribe
audio = whisperx.load_audio("audio.mp3")
result = model.transcribe(audio)

# Load an alignment model for the detected language, then align for word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
aligned_result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# Print word-level timestamps
for word in aligned_result["word_segments"]:
    print(f"{word['word']}: start {word['start']:.2f}s, end {word['end']:.2f}s")
output
Hello: start 0.00s, end 0.50s
world: start 0.51s, end 1.00s
...
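The word_segments above are plain Python dicts, so post-processing needs no extra libraries. As a minimal sketch (assuming only the word/start/end keys shown above; to_caption_lines is a name invented for this example), here is a helper that groups words into caption lines:

```python
def to_caption_lines(word_segments, max_words=7):
    """Group word-level timestamps into caption lines of at most max_words words.

    Returns a list of (start, end, text) tuples, one per caption line.
    """
    lines = []
    for i in range(0, len(word_segments), max_words):
        chunk = word_segments[i:i + max_words]
        text = " ".join(w["word"] for w in chunk)
        lines.append((chunk[0]["start"], chunk[-1]["end"], text))
    return lines

words = [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "world", "start": 0.51, "end": 1.0},
]
print(to_caption_lines(words, max_words=2))
# → [(0.0, 1.0, 'Hello world')]
```

The same pattern extends naturally to writing SRT or VTT subtitle files, since each tuple already carries the timing a cue needs.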

Common variations

  • Use different Whisper models like small, medium, or large-v2 for speed vs accuracy trade-offs.
  • Run on CPU by setting device="cpu" if no GPU is available.
  • Use whisperx.DiarizationPipeline together with whisperx.assign_word_speakers to add speaker labels for multi-speaker audio (this requires a Hugging Face token for the pyannote models).
python
import whisperx

device = "cpu"
model = whisperx.load_model("small", device, compute_type="int8")  # int8 keeps CPU memory use low
audio = whisperx.load_audio("audio.mp3")
result = model.transcribe(audio)

align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
aligned_result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# Optional: speaker diarization (pyannote models require a Hugging Face token)
pipeline = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)  # placeholder token
diarize_segments = pipeline(audio)
result_with_speakers = whisperx.assign_word_speakers(diarize_segments, aligned_result)

print(result_with_speakers["word_segments"])
print(diarize_segments)
output
[{'word': 'Hello', 'start': 0.0, 'end': 0.5, 'speaker': 'SPEAKER_00', ...}, ...]
(a table of diarization turns with start, end, and speaker columns)
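Under the hood, matching diarization turns to words comes down to interval overlap: each word gets the speaker whose turn covers most of it. A toy sketch of that idea (not WhisperX's actual implementation; assign_speakers and the input shapes here are illustrative):

```python
def assign_speakers(word_segments, speaker_turns):
    """Label each word with the speaker whose turn overlaps it most (naive sketch)."""
    labeled = []
    for w in word_segments:
        best, best_overlap = None, 0.0
        for turn in speaker_turns:
            # Overlap of [w.start, w.end] with [turn.start, turn.end]; negative means disjoint
            overlap = min(w["end"], turn["end"]) - max(w["start"], turn["start"])
            if overlap > best_overlap:
                best, best_overlap = turn["speaker"], overlap
        labeled.append({**w, "speaker": best})
    return labeled

words = [{"word": "Hello", "start": 0.0, "end": 0.5},
         {"word": "world", "start": 0.51, "end": 1.0}]
turns = [{"speaker": "SPEAKER_00", "start": 0.0, "end": 1.2}]
print(assign_speakers(words, turns))
# → both words labeled 'SPEAKER_00'
```

Words falling outside every turn keep speaker None, which mirrors how unassigned words can surface in real diarization output.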

Troubleshooting

  • If you get ffmpeg not found errors, ensure ffmpeg is installed and in your system PATH.
  • For CUDA errors, verify your GPU drivers and CUDA toolkit are properly installed.
  • If transcription is slow, try smaller models or run on CPU if GPU memory is limited.
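The first two checks above can be automated before running any transcription. A small environment probe using only the standard library (check_environment is a helper written for this guide, not part of WhisperX):

```python
import importlib.util
import shutil

def check_environment():
    """Report whether ffmpeg is on PATH and whether a CUDA-capable torch is installed."""
    report = {"ffmpeg": shutil.which("ffmpeg") is not None}
    if importlib.util.find_spec("torch") is not None:
        import torch
        report["torch"] = True
        report["cuda"] = torch.cuda.is_available()
    else:
        report["torch"] = False
        report["cuda"] = False
    return report

print(check_environment())
```

If "ffmpeg" is False, fix your PATH or reinstall ffmpeg; if "cuda" is False, fall back to device="cpu" rather than debugging mid-transcription.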

Key Takeaways

  • Install WhisperX with PyTorch and ffmpeg for full functionality.
  • Use forced alignment to get precise word-level timestamps.
  • Choose model size based on your accuracy and speed needs.
  • Enable speaker diarization for multi-speaker audio transcription.
  • Check system dependencies like ffmpeg and CUDA to avoid runtime errors.
Verified 2026-04 · large, small, large-v2