Workflow Advanced hard · 10 min step

Audio document RAG

What you will learn

Transcribe audio documents, embed transcripts with speaker/timestamp context, then retrieve with query expansion and reranking to preserve speaker intent.

Step 4 of 8: Document Preprocessing and Embedding: handling specialized input modalities before retrieval

Why this matters

Transcription strips metadata from audio documents: speaker identity, emphasis, and pauses. If that metadata is not preserved, RAG conflates speakers, loses temporal context, and the reranker cannot distinguish high-confidence segments from background noise. The downstream LLM then generates inaccurate attributions or contextually wrong answers.

Explanation

Audio RAG requires three sequential operations: (1) transcribe audio with speaker identification and timestamps, (2) chunk transcripts by speaker turn + semantic boundaries to preserve conversational context, (3) embed chunks with speaker/timestamp as metadata, then retrieve using multi-query expansion (to handle different phrasings of questions about speakers) and reranking (to surface high-confidence speech segments).

Why audio is different: Text RAG assumes clean documents. Audio introduces speaker diarization errors, noise-dependent confidence scores, and implicit context (tone, interruptions) that text embeddings cannot capture. A "yes" from Speaker A transcribed at 0.62 confidence is far less trustworthy than the same word at 0.95 confidence, yet both chunks have identical embeddings because their text is identical. Reranking on confidence metadata corrects this; multi-query expansion handles "What did Alice say about X?" vs "Find Alice's response to X."

Implementation pattern: Use Whisper or an equivalent for transcription and a diarization model such as Pyannote for speaker labels (Whisper alone does not identify speakers). Chunk on speaker boundaries (not arbitrary token limits), and store confidence scores and timestamps as metadata alongside embeddings. Before retrieval, apply query expansion: the code below uses multi-query expansion, and a HyDE-style "hypothetical speaker response" is an alternative. Follow retrieval with Cohere or a similar reranker, weighted toward high-confidence segments.
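Chunking on speaker boundaries can be sketched as follows. This is a minimal illustration rather than the tutorial's pipeline: the `Segment` dataclass, the `chunk_by_speaker_turn` helper, and the `max_gap_sec` parameter are hypothetical names, and the sample data stands in for diarized transcript segments.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    text: str
    start: float  # seconds from file start
    end: float


def chunk_by_speaker_turn(segments, max_gap_sec=1.0):
    """Merge consecutive segments from the same speaker into one turn.

    A new turn starts when the speaker changes or when the silence between
    two segments exceeds max_gap_sec (an assumed, tunable value).
    """
    turns = []
    for seg in segments:
        last = turns[-1] if turns else None
        same_turn = (
            last is not None
            and last.speaker == seg.speaker
            and seg.start - last.end <= max_gap_sec
        )
        if same_turn:
            # Extend the previous turn instead of creating a new chunk
            turns[-1] = Segment(last.speaker, f"{last.text} {seg.text}", last.start, seg.end)
        else:
            turns.append(seg)
    return turns


segments = [
    Segment("Speaker_0", "Welcome to the show.", 0.2, 1.8),
    Segment("Speaker_0", "Today we talk about RAG.", 1.9, 4.5),
    Segment("Speaker_1", "Thanks for having me.", 4.6, 6.0),
    Segment("Speaker_0", "Let's begin.", 6.1, 7.0),
]
turns = chunk_by_speaker_turn(segments)
for turn in turns:
    print(f"{turn.speaker} [{turn.start}-{turn.end}s]: {turn.text}")
```

Each resulting turn keeps its own start and end time, so the timestamps can be stored as metadata next to the embedding.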

Code

Illustrative only - not runnable without valid API keys (OpenAI, Cohere, Hugging Face) and a local audio file

python

# pip install openai-whisper pyannote.audio torch torchaudio langchain langchain-core langchain-openai cohere python-dotenv

import os

import cohere
import whisper
from dotenv import load_dotenv
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
from pyannote.audio import Pipeline

load_dotenv()

# Step 1: Transcribe audio with timestamps
print("[STEP 1] Transcribing audio...")
whisper_model = whisper.load_model('base')
audio_file = 'interview.mp3'
result = whisper_model.transcribe(audio_file, language='en', verbose=False)

# Step 2: Run speaker diarization
print("[STEP 2] Running speaker diarization...")
diarization_pipeline = Pipeline.from_pretrained(
    'pyannote/speaker-diarization-3.1',
    use_auth_token=os.getenv('HF_TOKEN')
)
diarization = diarization_pipeline(audio_file)

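# Step 3: Merge transcription + diarization into speaker turns with metadata
print("[STEP 3] Merging transcript with speaker labels...")
import math

# Fraction of a Whisper segment that must fall inside a speaker span to be assigned to it
MIN_OVERLAP_RATIO = 0.8
# Maps raw pyannote labels (e.g. 'SPEAKER_00') to 'Speaker_0', 'Speaker_1', ... in order of appearance
speaker_names = {}
speaker_turns = []
for span, _track, speaker_id in diarization.itertracks(yield_label=True):
    start_ts, end_ts = span.start, span.end
    matched = []
    for seg in result['segments']:
        seg_len = seg['end'] - seg['start']
        overlap = min(seg['end'], end_ts) - max(seg['start'], start_ts)
        # Overlap ratio instead of strict containment, so near-boundary segments are not dropped.
        # Segments that straddle two speakers still match neither span and need splitting.
        if seg_len > 0 and overlap / seg_len >= MIN_OVERLAP_RATIO:
            matched.append(seg)
    if matched:
        text = ' '.join(seg['text'].strip() for seg in matched)
        # Whisper segments have no 'confidence' key; exp(avg_logprob) is a rough proxy in (0, 1]
        confidence = sum(math.exp(seg['avg_logprob']) for seg in matched) / len(matched)
        speaker = speaker_names.setdefault(speaker_id, f'Speaker_{len(speaker_names)}')
        speaker_turns.append({
            'speaker': speaker,
            'text': text,
            'start_time': round(start_ts, 2),
            'end_time': round(end_ts, 2),
            'confidence': confidence,
            'duration_sec': round(end_ts - start_ts, 2)
        })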

print(f"Extracted {len(speaker_turns)} speaker turns")
for turn in speaker_turns[:3]:
    print(f"  {turn['speaker']} [{turn['start_time']}-{turn['end_time']}s, conf={turn['confidence']:.2f}]: {turn['text'][:60]}...")

# Step 4: Create documents for embedding with metadata preservation
print("\n[STEP 4] Creating documents with metadata...")
from langchain_core.documents import Document

documents = []
for idx, turn in enumerate(speaker_turns):
    doc = Document(
        page_content=turn['text'],
        metadata={
            'speaker': turn['speaker'],
            'start_time': turn['start_time'],
            'end_time': turn['end_time'],
            'confidence': turn['confidence'],
            'duration_sec': turn['duration_sec'],
            'source': audio_file,
            'turn_id': idx
        }
    )
    documents.append(doc)

print(f"Created {len(documents)} documents")

# Step 5: Embed documents
print("\n[STEP 5] Embedding documents...")
embeddings = OpenAIEmbeddings(model='text-embedding-3-small')
vectorstore = InMemoryVectorStore.from_documents(
    documents,
    embeddings
)
print(f"Embedded {len(documents)} documents into vector store")

# Step 6: Multi-query retrieval to handle speaker-specific questions
print("\n[STEP 6] Setting up multi-query retriever...")
from langchain_openai import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever

llm = ChatOpenAI(model='gpt-4o-mini', temperature=0)
# from_llm builds the query-generation chain with LangChain's default multi-query prompt
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={'k': 5}),
    llm=llm
)

# Step 7: Rerank results by confidence + semantic relevance
print("\n[STEP 7] Configuring reranker...")
co = cohere.ClientV2(api_key=os.getenv('COHERE_API_KEY'))

def rerank_with_confidence(query, docs, top_k=3):
    """Rerank retrieved docs by Cohere score + confidence metadata"""
    doc_texts = [d.page_content for d in docs]
    # The v2 client takes plain-string documents; rank_fields is not a v2 parameter
    rerank_result = co.rerank(
        model='rerank-english-v3.0',
        query=query,
        documents=doc_texts,
        top_n=top_k
    )
    reranked_docs = []
    for result in rerank_result.results:
        doc = docs[result.index]
        doc.metadata['rerank_score'] = result.relevance_score
        doc.metadata['confidence_weighted_score'] = (
            result.relevance_score * doc.metadata.get('confidence', 0.9)
        )
        reranked_docs.append(doc)
    return sorted(
        reranked_docs,
        key=lambda d: d.metadata['confidence_weighted_score'],
        reverse=True
    )

# Step 8: Example query with speaker context
print("\n[STEP 8] Testing retrieval with speaker query...")
query = "What did Speaker_0 say about the main topic?"
print(f"\nQuery: {query}")

initial_retrieval = multi_query_retriever.invoke(query)
final_results = rerank_with_confidence(query, initial_retrieval, top_k=3)

print(f"\nTop {len(final_results)} results (reranked):")
for i, doc in enumerate(final_results, 1):
    print(f"\n  [{i}] {doc.metadata['speaker']} @ {doc.metadata['start_time']}s (conf={doc.metadata['confidence']:.2f})")
    print(f"      Rerank score: {doc.metadata.get('rerank_score', 0):.3f}")
    print(f"      Text: {doc.page_content[:80]}...")

Output

[STEP 1] Transcribing audio...
[STEP 2] Running speaker diarization...
[STEP 3] Merging transcript with speaker labels...
Extracted 5 speaker turns
  Speaker_0 [0.2-4.5s, conf=0.95]: Hello, welcome to our discussion about machine learning in production...
  Speaker_1 [4.6-8.3s, conf=0.92]: Thanks for having me. I've worked on several large-scale RAG systems...
  Speaker_0 [8.4-12.1s, conf=0.94]: Great. What are the biggest challenges you've encountered...

[STEP 4] Creating documents with metadata...
Created 5 documents

[STEP 5] Embedding documents...
Embedded 5 documents into vector store

[STEP 6] Setting up multi-query retriever...

[STEP 7] Configuring reranker...

[STEP 8] Testing retrieval with speaker query...
Query: What did Speaker_0 say about the main topic?

Top 3 results (reranked):

  [1] Speaker_0 @ 0.2s (conf=0.95)
      Rerank score: 0.923
      Text: Hello, welcome to our discussion about machine learning in production...

  [2] Speaker_0 @ 8.4s (conf=0.94)
      Rerank score: 0.856
      Text: Great. What are the biggest challenges you've encountered...

  [3] Speaker_1 @ 4.6s (conf=0.92)
      Rerank score: 0.701
      Text: Thanks for having me. I've worked on several large-scale RAG systems...

Your options

Recommended

End-to-end speech recognition (Whisper) + speaker diarization (Pyannote)

When you control audio quality and can tolerate 1-3% speaker confusion rate. Best for podcasts, interviews, recorded meetings with clear speakers.

Pros

Accurate speaker attribution, preserves timestamps, works offline, no API costs, easy metadata extraction

Cons

Two-stage pipeline (transcription then diarization) can accumulate errors; Pyannote requires GPU for speed; doesn't handle overlapping speakers well

# pip install openai-whisper pyannote.audio torch torchaudio
from pyannote.audio import Pipeline
import whisper

diarization = Pipeline.from_pretrained('pyannote/speaker-diarization-3.1', use_auth_token='YOUR_HF_TOKEN')
model = whisper.load_model('base')
result = model.transcribe('audio.mp3')
diarization_result = diarization('audio.mp3')

API-based (OpenAI Whisper API + custom speaker detection)

When latency is acceptable, audio quality varies, and you want managed infrastructure. Suitable for production SaaS where you bill per audio minute.

Pros

Managed scaling, minimal on-premise compute, handles various audio codecs automatically

Cons

API costs scale linearly with audio duration; no speaker diarization in Whisper API; network dependency; data leaves your infrastructure

# pip install openai
from openai import OpenAI
client = OpenAI()
with open('audio.mp3', 'rb') as f:
    transcript = client.audio.transcriptions.create(
        model='whisper-1',
        file=f,
        language='en'
    )
print(transcript.text)

Pre-segmented audio (external platform handles diarization)

When audio is already transcribed with speaker labels by a service (e.g., Otter.ai, Rev.com). You focus on RAG, not transcription.

Pros

Highest accuracy diarization, minimal compute, clean metadata input

Cons

Expensive per-minute cost, vendor lock-in, no control over speaker thresholds, slower time-to-insight

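# Parse pre-segmented JSON from Otter.ai or similar
# (the 'speakers' / 'speaker' / 'text' / 'start_time' keys are illustrative; adapt to your vendor's export)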
import json
with open('transcript.json') as f:
    data = json.load(f)
for turn in data['speakers']:
    speaker = turn['speaker']
    text = turn['text']
    timestamp = turn['start_time']
    print(f"[{timestamp}] {speaker}: {text}")

Validation step

After reranking, inspect the top result: (1) verify the speaker metadata matches the query intent (if query specifies "Speaker_0", top result should be Speaker_0), (2) confirm rerank_score > 0.7 (below this indicates weak semantic match), (3) check that confidence_weighted_score heavily weights high-confidence segments (conf > 0.9 should rank above conf < 0.75 even with slightly lower rerank score), (4) manually spot-check that returned text actually answers the question (not just mentions speaker name), (5) verify start_time and end_time metadata are present and non-zero (indicates successful diarization merge).
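The automatable checks above can be collected into one helper. This is a sketch under stated assumptions: `validate_top_result` is a hypothetical function, it takes a plain metadata dict shaped like the `Document.metadata` built earlier, and it treats an `end_time` of zero or one not after `start_time` as a failed merge. The manual spot-check in (4) still has to be done by a person.

```python
def validate_top_result(metadata, expected_speaker=None, min_rerank_score=0.7):
    """Return a list of problems found in the top result; an empty list means all checks passed."""
    problems = []

    # (1) Speaker metadata should match the speaker named in the query, if any
    if expected_speaker is not None and metadata.get("speaker") != expected_speaker:
        problems.append(
            f"speaker is {metadata.get('speaker')!r}, expected {expected_speaker!r}"
        )

    # (2) A rerank score at or below the threshold indicates a weak semantic match
    if metadata.get("rerank_score", 0.0) <= min_rerank_score:
        problems.append("rerank_score at or below threshold: weak semantic match")

    # (5) Timestamps must be present and plausible
    start, end = metadata.get("start_time"), metadata.get("end_time")
    if start is None or end is None:
        problems.append("start_time/end_time missing: diarization merge may have failed")
    elif end <= 0 or end <= start:
        problems.append("timestamps are zero or inverted")

    return problems


good = {"speaker": "Speaker_0", "rerank_score": 0.923, "start_time": 0.2, "end_time": 4.5}
bad = {"speaker": "Speaker_1", "rerank_score": 0.41}

print(validate_top_result(good, expected_speaker="Speaker_0"))
for problem in validate_top_result(bad, expected_speaker="Speaker_0"):
    print("-", problem)
```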

At scale

At scale (>10 hrs audio, >500 speaker turns), four bottlenecks emerge: (1) Pyannote diarization becomes GPU-memory constrained on files >2 hrs; break the audio into 30-min chunks and stitch speaker IDs post-hoc, (2) Cohere reranker API costs scale linearly with the number of candidates reranked: at 500 turns, reranking every retrieved set becomes expensive; cache rerank results by query hash or batch rerank offline on predictable query patterns, (3) the in-memory vector store is brute-force and non-persistent, so it becomes impractical once the corpus grows past roughly a thousand documents; switch to Pinecone/Weaviate with metadata filtering on speaker and a confidence threshold before retrieval to reduce rerank set size, (4) the multi-query retriever multiplies API costs because every query triggers an extra LLM call; cache query expansions or fall back to static query templates for common patterns like "What did [speaker] say about [topic]?"
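Caching rerank results by query hash could look like the sketch below. The `RerankCache` class and the `word_overlap_rerank` stand-in are hypothetical: the stand-in replaces the paid reranker call so the example runs offline, and the in-process dict would be an external store in a real deployment.

```python
import hashlib


class RerankCache:
    """Memoize rerank results by a hash of the normalized query and candidate IDs."""

    def __init__(self, rerank_fn):
        self._rerank_fn = rerank_fn
        self._store = {}
        self.misses = 0  # number of times the underlying reranker was actually called

    @staticmethod
    def _key(query, doc_ids):
        # Normalize case and whitespace so trivially different queries share a key
        normalized = " ".join(query.lower().split())
        payload = normalized + "|" + ",".join(sorted(str(d) for d in doc_ids))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def rerank(self, query, docs):
        """docs maps doc_id -> text; returns whatever rerank_fn returns."""
        key = self._key(query, docs.keys())
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._rerank_fn(query, docs)
        return self._store[key]


def word_overlap_rerank(query, docs):
    """Stand-in for a paid reranker: score = number of shared lowercase words."""
    q_words = set(query.lower().split())
    scored = [(doc_id, len(q_words & set(text.lower().split()))) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


cache = RerankCache(word_overlap_rerank)
docs = {"t1": "latency is the main challenge", "t2": "thanks for having me"}
first = cache.rerank("What about LATENCY", docs)
second = cache.rerank("what   about latency", docs)
print(first, cache.misses)
```

Keying on the candidate IDs as well as the query keeps a cached ranking from being served for a different retrieved set.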

Rollback plan

If reranking produces the wrong speaker or low-confidence segments: (1) raise the confidence threshold for document inclusion (currently all turns are included; filter out conf < 0.80 after diarization), (2) re-run diarization with a stricter speaker-overlap threshold to force harder speaker boundaries (for example, reducing a 0.5s setting to 0.2s; check which hyperparameters and defaults your Pyannote pipeline version actually exposes), (3) check whether multi-query decomposition is generating off-topic variations (log all generated queries and inspect them); if so, provide an explicit few-shot prompt in MultiQueryRetriever that keeps the speaker identity in every variation, (4) if confidence metadata is missing (metadata['confidence'] key absent), fall back to a score derived from the segment-level avg_logprob values Whisper returns during transcription.
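Steps (1) and (4) can be combined into a small filter. This is a sketch with hypothetical helper names (`turn_confidence`, `filter_turns`); using `exp(avg_logprob)` as a confidence proxy is a heuristic assumption, not a calibrated probability.

```python
import math


def turn_confidence(turn):
    """Return the stored confidence, or fall back to a proxy from Whisper's avg_logprob."""
    if "confidence" in turn:
        return turn["confidence"]
    if "avg_logprob" in turn:
        # avg_logprob is a mean token log-probability; exp() maps it into (0, 1]
        return math.exp(turn["avg_logprob"])
    return 0.0  # nothing to go on: treat as untrusted


def filter_turns(turns, min_confidence=0.80):
    """Keep turns at or above the threshold; also report how many were dropped."""
    kept = [t for t in turns if turn_confidence(t) >= min_confidence]
    return kept, len(turns) - len(kept)


turns = [
    {"speaker": "Speaker_0", "text": "Welcome.", "confidence": 0.95},
    {"speaker": "Speaker_1", "text": "Yes.", "confidence": 0.62},
    {"speaker": "Speaker_0", "text": "Go on.", "avg_logprob": -0.1},
    {"speaker": "Speaker_1", "text": "Hmm.", "avg_logprob": -0.5},
]
kept, dropped = filter_turns(turns)
print(f"kept {len(kept)} turns, dropped {dropped}")
```

Logging the dropped count makes it easy to notice when a threshold change removes most of a recording.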

Debug symptoms

All retrieved documents show same speaker despite multi-query expansion, or rerank_score is uniformly low (< 0.5)

Diagnosis

Multi-query retriever is not generating speaker-aware query variations, or embeddings are not capturing speaker semantics. More likely: diarization failed silently (all turns assigned to Speaker_0); or speaker metadata was not preserved in Document.metadata during creation.

Fix

Add logging: print(retrieved_docs[0].metadata) to verify that the speaker field exists and varies across documents. Inspect the raw speaker labels before the merge step, for example with print(diarization.labels()) or by iterating diarization.itertracks(yield_label=True). Ensure Document() is initialized with a metadata dict, not page_content only.
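A quick way to check whether the speaker field exists and varies is to count labels across the retrieved metadata. The helper names below (`speaker_distribution`, `diagnose`) are hypothetical, and the input is a list of plain metadata dicts rather than LangChain `Document` objects.

```python
from collections import Counter


def speaker_distribution(metadatas):
    """Count speaker labels, bucketing documents without a speaker field."""
    return Counter(m.get("speaker", "<missing>") for m in metadatas)


def diagnose(counts):
    """Map a label distribution to the likely failure described above."""
    if "<missing>" in counts:
        return "speaker metadata was not preserved on some documents"
    if len(counts) == 1:
        return "only one speaker label present: diarization may have failed silently"
    return "speaker labels look varied"


retrieved = [
    {"speaker": "Speaker_0", "start_time": 0.2},
    {"speaker": "Speaker_0", "start_time": 8.4},
    {"speaker": "Speaker_0", "start_time": 15.0},
]
counts = speaker_distribution(retrieved)
print(dict(counts), "->", diagnose(counts))
```

Running the same count over the full document set, not just one retrieval, separates a diarization failure from a query that happens to match a single speaker.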

Retrieval returns the correct speaker but the wrong time period (e.g., Speaker_0 at 45s when the question implies the beginning of the call)

Diagnosis

Confidence-weighted scoring in reranker is not strong enough; high-confidence segments at wrong timestamp are outranking correct segments. Or query expansion generated temporal keywords that don't exist in query.

Fix

Strengthen the confidence weighting: change `result.relevance_score * doc.metadata['confidence']` to `result.relevance_score * (doc.metadata['confidence'] ** 2)` so that low-confidence segments are penalized quadratically rather than linearly. Add explicit time filtering post-retrieval if the query contains temporal markers ("at the beginning," "near the end").
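Post-retrieval time filtering might be sketched like this. The marker phrases, the `fraction` parameter, and both function names are assumptions chosen for illustration; a production system would likely need a richer temporal parser.

```python
START_MARKERS = ("at the beginning", "at the start", "opening")
END_MARKERS = ("near the end", "at the end", "closing")


def temporal_window(query, total_duration_sec, fraction=0.2):
    """Return a (start, end) window in seconds if the query has a temporal marker, else None."""
    q = query.lower()
    if any(marker in q for marker in START_MARKERS):
        return (0.0, total_duration_sec * fraction)
    if any(marker in q for marker in END_MARKERS):
        return (total_duration_sec * (1 - fraction), total_duration_sec)
    return None


def filter_by_time(docs, window):
    """Keep docs whose [start_time, end_time] overlaps the window; pass through if no window."""
    if window is None:
        return docs
    lo, hi = window
    return [d for d in docs if d["start_time"] < hi and d["end_time"] > lo]


docs = [
    {"speaker": "Speaker_0", "start_time": 5.0, "end_time": 10.0},
    {"speaker": "Speaker_0", "start_time": 45.0, "end_time": 50.0},
]
window = temporal_window("What did Speaker_0 say at the beginning?", total_duration_sec=100.0)
early = filter_by_time(docs, window)
print(window, early)
```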

Code crashes at diarization step with CUDA out of memory or hangs indefinitely

Diagnosis

Audio file too long (>2 hrs) for Pyannote on single GPU, or model checkpoint download blocked. Pyannote processes entire audio in memory.

Fix

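Chunk audio into 30-min segments using pydub, which slices in milliseconds: `audio[i*1800000:(i+1)*1800000].export(f'chunk_{i}.wav', format='wav')`. Run diarization on each chunk separately, then reconcile speaker IDs across chunks. Labels are assigned independently per chunk, so Speaker_2 in chunk 1 is not necessarily the same person as Speaker_2 in chunk 2; match speakers across chunks (for example by comparing voice embeddings) before reusing a label. Or use faster-whisper + simple speaker detection (e.g., voice print clustering) instead of Pyannote.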
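After per-chunk diarization, the turn timestamps have to be shifted back onto the full recording's timeline and the chunk-local labels reconciled. The sketch below assumes turns are plain dicts; `to_global_timeline`, `relabel`, and the `label_map` contents are hypothetical, and the mapping itself would come from a voice-matching step that is not shown.

```python
CHUNK_SEC = 30 * 60  # 30-minute chunks, matching the 1,800,000 ms pydub slices


def to_global_timeline(chunk_index, turns, chunk_sec=CHUNK_SEC):
    """Shift chunk-relative times to file-relative times and namespace the speaker labels."""
    offset = chunk_index * chunk_sec
    return [
        {
            **turn,
            # Labels are only meaningful within a chunk until they are reconciled
            "speaker": f"chunk{chunk_index}:{turn['speaker']}",
            "start_time": turn["start_time"] + offset,
            "end_time": turn["end_time"] + offset,
        }
        for turn in turns
    ]


def relabel(turns, label_map):
    """Apply a chunk-local -> global speaker mapping; unmapped labels are left as-is."""
    return [{**t, "speaker": label_map.get(t["speaker"], t["speaker"])} for t in turns]


chunk0 = [{"speaker": "Speaker_0", "start_time": 10.0, "end_time": 14.0, "text": "Hello."}]
chunk1 = [{"speaker": "Speaker_1", "start_time": 12.5, "end_time": 20.0, "text": "As I said."}]

merged = to_global_timeline(0, chunk0) + to_global_timeline(1, chunk1)
# Hypothetical result of voice matching: both chunk-local labels are the same person
label_map = {"chunk0:Speaker_0": "Speaker_A", "chunk1:Speaker_1": "Speaker_A"}
stitched = relabel(merged, label_map)
for turn in stitched:
    print(turn["speaker"], turn["start_time"], turn["end_time"])
```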

Production upgrade path

The tutorial version above uses InMemoryVectorStore. For production: (1) Replace it with a Pinecone or Weaviate index with metadata filtering; add speaker_id as an indexed field to enable fast filtering ('speaker'='Speaker_0') before retrieval, which shrinks the rerank set (by roughly 70% when the target speaker accounts for about 30% of turns), (2) Move reranking to an async batch job: store retrieved candidates in a queue, rerank offline with Cohere, cache rerank scores by (query_hash, doc_id), and serve cached scores from a latency-optimized store (Redis), (3) Add a speaker embedding layer: fine-tune a voice embedding model (e.g., speaker-encoder) on your audio corpus to cluster segments by actual speaker voice, then use voice similarity as a secondary filter alongside diarization labels (this handles diarization errors), (4) Implement query-time speaker detection: if the query mentions a name, run entity extraction + speaker mapping ("Alice" → Speaker_2) once at query time, then filter metadata before retrieval; this avoids generating extra multi-query variations when the speaker is explicit, (5) Add confidence-aware chunking: don't just chunk on speaker boundaries; also segment low-confidence regions (conf < 0.75) separately and mark them as 'disputed_speech' so the LLM can add an uncertainty caveat in its response.
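Query-time speaker detection, item (4), can be approximated with a lookup table before reaching for full entity extraction. Everything named here is hypothetical: the `SPEAKER_ALIASES` mapping would be maintained per recording, and the docs are plain dicts standing in for stored documents.

```python
import re

# Hypothetical per-recording mapping from participant names to diarization labels
SPEAKER_ALIASES = {"alice": "Speaker_2", "bob": "Speaker_0"}


def detect_speaker(query, aliases=SPEAKER_ALIASES):
    """Return the speaker label a query refers to, or None if no speaker is explicit."""
    explicit = re.search(r"\bSpeaker_\d+\b", query)
    if explicit:
        return explicit.group(0)
    # Splitting on letters turns "Alice's" into "alice" and "s"
    for word in re.findall(r"[a-z]+", query.lower()):
        if word in aliases:
            return aliases[word]
    return None


def filter_by_speaker(docs, speaker):
    """Restrict candidates to one speaker before retrieval or reranking."""
    if speaker is None:
        return docs
    return [d for d in docs if d["metadata"].get("speaker") == speaker]


docs = [
    {"text": "Latency was our main issue.", "metadata": {"speaker": "Speaker_2"}},
    {"text": "What about cost?", "metadata": {"speaker": "Speaker_0"}},
]
speaker = detect_speaker("Find Alice's response to the latency question")
candidates = filter_by_speaker(docs, speaker)
print(speaker, [d["text"] for d in candidates])
```

With a managed vector store, the same label would be passed as a metadata filter on the query instead of filtering in Python.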

Common gotcha

The most common failure is in merging transcription segments with diarization speaker spans by timestamp overlap. A transcription segment at [2.1-3.9s] and a speaker span at [2.0-3.6s] only partially overlap, so a strict containment test drops the text or assigns it to the wrong speaker. Assign a segment to a speaker only when at least 80% of its duration falls inside that speaker's span, and handle the remaining partial overlaps by splitting transcription segments. Additionally, confirm that every tool in the pipeline reports time relative to file start in seconds: Whisper and Pyannote both do, but audio-slicing tools such as pydub work in milliseconds, and off-by-factor-of-1000 errors silently produce wrong speaker assignments.
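The overlap rule can be made concrete with a small assignment function. This is a sketch, not the tutorial's merge step: `assign_speaker` and `overlap_sec` are hypothetical helpers, and segments and spans are plain dicts with times in seconds.

```python
def overlap_sec(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speaker(segment, speaker_spans, min_ratio=0.8):
    """Pick the speaker span covering most of the segment.

    Returns (speaker, covered_ratio, is_confident). When is_confident is False,
    the segment straddles a speaker change and should be split, not assigned whole.
    """
    seg_len = segment["end"] - segment["start"]
    best_speaker, best_overlap = None, 0.0
    for span in speaker_spans:
        ov = overlap_sec(segment["start"], segment["end"], span["start"], span["end"])
        if ov > best_overlap:
            best_speaker, best_overlap = span["speaker"], ov
    ratio = best_overlap / seg_len if seg_len > 0 else 0.0
    return best_speaker, ratio, ratio >= min_ratio


spans = [
    {"speaker": "Speaker_0", "start": 2.0, "end": 3.6},
    {"speaker": "Speaker_1", "start": 3.6, "end": 5.0},
]
mostly_inside = {"start": 2.1, "end": 3.9, "text": "I agree with that."}
straddling = {"start": 3.0, "end": 4.6, "text": "Right, but what about cost?"}

for seg in (mostly_inside, straddling):
    speaker, ratio, confident = assign_speaker(seg, spans)
    print(f"{seg['text']!r}: {speaker}, coverage={ratio:.2f}, confident={confident}")
```

The first segment is about 83% covered by one span and is assigned; the second is split roughly 60/40 across two speakers and is flagged for splitting.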

Experienced dev note

Audio RAG appears straightforward (transcribe → embed → retrieve) but fails 80% of the time due to diarization errors or poor speaker attribution in retrieval. The key insight: speaker metadata is more important than raw text similarity. A query "What did Alice say?" should return any Alice segment with high confidence, even if text is generic. This means: (1) prioritize confidence metadata in reranking over semantic score, (2) use explicit speaker filtering in retriever (metadata_filter={'speaker': 'Speaker_0'}) before reranking if speaker is mentioned, (3) expect 3-5% speaker confusion even with good diarization; design LLM prompt to ask for timestamp confirmation when ambiguous. At scale, batch transcription + diarization offline (asynchronous) and store speaker-turn documents in production database with speaker index; don't compute on retrieval request. Finally: test on real meeting audio (overlapping speech, side conversations), not clean podcast data: Pyannote's accuracy drops 15-20% on overlapping speech.

Check your understanding

In a multi-speaker RAG system, why does Cohere reranking weighted by confidence metadata produce better speaker attribution than semantic similarity alone, and what failure mode do you risk if you weight confidence_weighted_score by multiplication rather than additive bonus?

Answer hint

Semantic similarity (embedding distance) is speaker-agnostic: "yes" from Speaker_0 and Speaker_1 have similar embeddings. Confidence metadata encodes diarization certainty (0.95 = high certainty this is Speaker_0; 0.62 = model is guessing). Multiplication heavily penalizes low-confidence results (0.7 relevance × 0.62 confidence = 0.43), risking rejection of correct-speaker low-confidence segments. An additive bonus, or a tunable exponent on confidence, gives more control over how far confidence can override relevance. The real trap: rerank_score and confidence depend on different factors (one semantic, one acoustic), so multiplying them can create pathological behavior where high-confidence wrong-speaker segments beat low-confidence correct-speaker segments.
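A worked example makes the trade-off visible. The scores below are invented for illustration, and the `bonus_weight` of 0.2 is an arbitrary assumption rather than a recommended value.

```python
def multiplicative(relevance, confidence):
    """Confidence scales relevance: a low confidence can erase a strong match."""
    return relevance * confidence


def additive(relevance, confidence, bonus_weight=0.2):
    """Confidence adds a bounded bonus: relevance stays the dominant term."""
    return relevance + bonus_weight * confidence


# (label, relevance, confidence) with made-up values
candidates = [
    ("correct speaker, noisy audio", 0.90, 0.62),
    ("wrong speaker, clean audio", 0.65, 0.95),
]

by_product = max(candidates, key=lambda c: multiplicative(c[1], c[2]))
by_sum = max(candidates, key=lambda c: additive(c[1], c[2]))

for label, rel, conf in candidates:
    print(f"{label}: product={multiplicative(rel, conf):.3f}, sum={additive(rel, conf):.3f}")
print("multiplicative winner:", by_product[0])
print("additive winner:", by_sum[0])
```

With these numbers the product ranks the wrong-speaker segment first (about 0.62 vs 0.56), while the additive form keeps the more relevant correct-speaker segment on top (about 1.02 vs 0.84).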
