High severity intermediate · Fix: 5-15 min

AssertionError or IndexError in word_timestamps

faster_whisper.transcribe.AssertionError/IndexError in alignment logic

What this error means
faster-whisper's word_timestamps alignment fails when audio chunk boundaries don't align with decoded token boundaries, causing timestamp-to-word mapping to break.

Stack trace

traceback
Traceback (most recent call last):
  File "faster_whisper/transcribe.py", line 487, in _get_word_timestamps
    assert len(words) == len(timestamps), f"word count {len(words)} != timestamp count {len(timestamps)}"
AssertionError: word count 42 != timestamp count 41

Or:

Traceback (most recent call last):
  File "faster_whisper/transcribe.py", line 512, in align_words_to_chunks
    word_index = next(idx for idx, token_id in enumerate(token_ids) if tokens[idx].start >= chunk_start)
StopIteration: No token found at chunk boundary
QUICK FIX
Resample audio to 16kHz and keep VAD off (vad_filter defaults to False): audio = librosa.load(file, sr=16000)[0]; segments, info = model.transcribe(audio, word_timestamps=True, vad_filter=False, language='en')

Why it happens

faster-whisper's word-level timestamp alignment depends on an exact correspondence between decoded tokens and their time boundaries. That correspondence can break when audio has silence at chunk edges, when VAD (voice activity detection) removes frames, or when long audio is split at offsets that do not fall on frame boundaries. The decoder then produces N tokens while the alignment logic finds M timestamps, and the assertion fails.

Detection

transcribe() returns a lazy generator, so the error surfaces while you iterate over the segments, not at the call itself. Wrap that iteration in try/except and log the failing file, its sample rate and duration, and how many words were produced before the failure. Monitor for audio files with heavy silence, background noise, or non-standard sample rates.
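The detection step can be sketched as a small wrapper around the segment iteration. `collect_words` is a hypothetical helper, not part of faster-whisper, and the set of caught exceptions is an assumption based on the traces shown earlier. It relies only on segments exposing `.words` whose items have `.word`, `.start`, and `.end`.

```python
import logging

logger = logging.getLogger("word_timestamps")


def collect_words(segments):
    """Consume a segments iterable and return (words, error).

    Alignment failures are logged and returned instead of raised, so a
    batch job can record the bad file and move on.
    """
    words = []
    try:
        for segment in segments:
            # segment.words can be None when word timestamps are disabled
            for word in (segment.words or []):
                words.append((word.word, word.start, word.end))
    except (AssertionError, IndexError, RuntimeError) as exc:
        # RuntimeError is included because a StopIteration raised inside
        # a generator is converted to RuntimeError (PEP 479).
        logger.warning("alignment failed after %d words: %r", len(words), exc)
        return words, exc
    return words, None
```

Typical use would be `words, err = collect_words(segments)` right after `model.transcribe(...)`, logging the file name whenever `err` is not `None`.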

Causes & fixes

1

VAD (voice activity detection) filtering is enabled with vad_filter=True, which removes silent frames, but the token count doesn't adjust accordingly

✓ Fix

Pass language='en' and leave VAD off with vad_filter=False in transcribe() (this is the default), or keep VAD but make it less aggressive through vad_parameters, for example vad_parameters=dict(min_silence_duration_ms=3000, speech_pad_ms=600): model.transcribe(audio, word_timestamps=True, vad_filter=False, language='en')
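A minimal sketch of both options for this fix. `transcribe_words` is a hypothetical wrapper, `model` is assumed to be an already loaded `WhisperModel`, and the VAD values are illustrative rather than recommended settings; `min_silence_duration_ms` and `speech_pad_ms` are options of faster-whisper's VAD configuration.

```python
def transcribe_words(model, audio, use_vad=False):
    """Call model.transcribe() with word timestamps and optional gentle VAD."""
    kwargs = dict(word_timestamps=True, language="en", vad_filter=use_vad)
    if use_vad:
        # Illustrative values: only cut silences of at least 3 s and keep
        # 600 ms of padding around each detected speech chunk.
        kwargs["vad_parameters"] = dict(
            min_silence_duration_ms=3000,
            speech_pad_ms=600,
        )
    segments, info = model.transcribe(audio, **kwargs)
    return segments, info
```

Starting with `use_vad=False` and only enabling VAD once word timestamps are stable on your own audio keeps the two variables separate while debugging.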

2

Audio sample rate mismatch: a waveform loaded at 48kHz is passed as an array and treated as 16kHz, causing timing calculations to be off by 3x

✓ Fix

Resample audio to 16kHz before transcription using librosa or scipy: audio = librosa.load(file, sr=16000)[0] before passing to transcribe()
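Before passing a raw array to transcribe(), it can help to confirm the file's actual sample rate. The sketch below reads it from the WAV header with the standard-library `wave` module. `needs_resampling` is a hypothetical helper and only handles PCM WAV files.

```python
import wave

TARGET_SR = 16000  # rate assumed for waveform (array) input


def needs_resampling(path, target_sr=TARGET_SR):
    """Return (sample_rate, needs_resample) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        sample_rate = wf.getframerate()
    return sample_rate, sample_rate != target_sr
```

If the second value is True, resample (for example with `librosa.load(path, sr=16000)`) before calling transcribe().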

3

Splitting long audio at an offset that is not aligned to a frame boundary (transcribe() does not take a seek_point argument; the split is something you do to the audio yourself)

✓ Fix

If you cut the audio yourself, make sure each cut lands on a frame boundary (offset_samples % 160 == 0 for 16kHz audio, i.e. a multiple of 10 ms). Otherwise, transcribe the whole file and split afterwards using the timestamps of the already-transcribed segments
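The frame-boundary check comes down to integer arithmetic. A small sketch, assuming 16kHz audio and the 160-sample step (10 ms) used in the condition above; `snap_to_frame` and `split_on_frames` are hypothetical helpers.

```python
HOP_SAMPLES = 160  # 10 ms at 16 kHz


def snap_to_frame(offset_samples, hop=HOP_SAMPLES):
    """Round a sample offset down to the nearest frame boundary."""
    return (offset_samples // hop) * hop


def split_on_frames(audio, offset_samples, hop=HOP_SAMPLES):
    """Split a 1-D sequence of samples at a frame-aligned offset."""
    cut = snap_to_frame(offset_samples, hop)
    return audio[:cut], audio[cut:]
```

Rounding down keeps the cut inside the audio that precedes the requested offset, so no samples are dropped or duplicated between the two parts.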

4

Corrupted or truncated audio file with missing frames or invalid WAV headers

✓ Fix

Validate audio integrity before transcription: use soundfile or audioread to confirm frame count, and re-encode the file: ffmpeg -i input.wav -acodec pcm_s16le -ar 16000 output.wav
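A truncated PCM WAV file can be spotted without third-party packages by comparing the frame count declared in the header with the frames that can actually be read. This is a sketch using the standard-library `wave` module; `wav_is_complete` is a hypothetical helper and does not cover compressed or non-WAV formats.

```python
import wave


def wav_is_complete(path):
    """Return True if the file holds as many frames as its header declares."""
    try:
        with wave.open(path, "rb") as wf:
            declared = wf.getnframes()
            frame_size = wf.getnchannels() * wf.getsampwidth()
            # readframes() returns fewer bytes when the data is cut short
            actual = len(wf.readframes(declared)) // frame_size
    except (wave.Error, EOFError):
        # Raised for missing or invalid RIFF/WAVE headers
        return False
    return declared > 0 and actual == declared
```

Files that fail this check are candidates for the ffmpeg re-encode shown above, or for exclusion from the batch.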

Code: broken vs fixed

Broken - triggers the error
python
import faster_whisper
import librosa

model = faster_whisper.WhisperModel('large-v3', device='cuda', compute_type='float16')
audio_file = 'sample.wav'  # 48kHz file

# sr=None keeps the native 48kHz rate, but arrays passed to transcribe() are treated as 16kHz
audio, sr = librosa.load(audio_file, sr=None)

# vad_filter is False by default; it is switched on here to reproduce cause 1
segments, info = model.transcribe(audio, word_timestamps=True, vad_filter=True, language='en')

# segments is a lazy generator, so the error can surface here, during iteration
for segment in segments:
    for word in segment.words:
        print(f"{word.word} @ {word.start:.2f}s")
Fixed - works correctly
python
import faster_whisper
import librosa

model = faster_whisper.WhisperModel('large-v3', device='cuda', compute_type='float16')
audio_file = 'sample.wav'

# FIX: Resample to 16kHz and keep VAD filtering off for alignment stability
audio, sr = librosa.load(audio_file, sr=16000)  # Explicitly resample to 16kHz (mono float32)
segments, info = model.transcribe(
    audio,
    word_timestamps=True,  # the parameter is word_timestamps, not word_level
    language='en',
    vad_filter=False,  # Keep VAD off to prevent frame count mismatches
    beam_size=5
)

for segment in segments:
    for word in segment.words:
        print(f"{word.word} @ {word.start:.2f}s - {word.end:.2f}s")

print("Word-level timestamps extracted successfully.")
Resample audio to the exact 16kHz expected by faster-whisper and disable VAD filtering to prevent frame misalignment between decoded tokens and timestamp arrays.

Workaround

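If you can't resample before transcription, post-process timestamps by rescaling them: scale_factor = actual_sample_rate / 16000, then divide word.start and word.end by scale_factor. A 48kHz array read as 16kHz reports times 3x too long, so a reported 3.0s corresponds to 1.0s of real audio. Recognition accuracy is likely to suffer on audio fed at the wrong rate, so treat this as a last resort. Alternatively, disable word-level timestamps entirely (word_timestamps=False) and use only segment-level times, which are more robust to alignment issues.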

Prevention

Always resample input audio to 16kHz before passing a waveform to transcribe(). Check that each file decodes and reports a plausible duration, for example with librosa.get_duration(). Use vad_filter=False in production for word-level timestamps unless you have tested VAD behavior with your specific audio domain. For critical applications, consider OpenAI's Whisper API (whisper-1), which handles audio preprocessing internally; test its word timestamps on your own audio before relying on them.

Python 3.9+ · faster-whisper >=0.9.0 · tested on 0.10.x
Verified 2026-04 · faster-whisper>=0.10.0