Audio document RAG
Why this matters
Audio documents lose metadata (speaker identity, emphasis, pauses) during transcription. Without preservation, RAG conflates speakers, loses temporal context, and the reranker cannot distinguish high-confidence segments from background noise. The downstream LLM then generates inaccurate attributions or contextually wrong answers.
Explanation
Why audio is different: Text RAG assumes clean documents. Audio introduces speaker diarization errors, transcription confidence that varies with background noise, and implicit context (tone, interruptions) that embeddings cannot capture. A chunk that says "yes" from Speaker A with 0.62 confidence is fundamentally different from the same chunk with 0.95 confidence, but both have identical embeddings. Reranking on confidence metadata corrects this; multi-query retrieval covers different phrasings of the same request, such as "What did Alice say about X?" vs "Find Alice's response to X."
Implementation pattern: Use Whisper or equivalent for transcription with speaker labels, chunk on speaker boundaries (not arbitrary token limits), store confidence scores and timestamps as metadata alongside embeddings, then apply HyDE-style query expansion ("hypothetical speaker response") before retrieval, followed by Cohere or similar reranker weighted toward high-confidence segments.
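The "chunk on speaker boundaries" step can be sketched in plain Python. This is a minimal illustration, not the tutorial's pipeline: the function name merge_speaker_turns, the max_gap_sec parameter, and the input dict layout (speaker, start, end, text, confidence) are assumptions chosen for the example.

```python
from typing import Dict, List


def merge_speaker_turns(segments: List[Dict], max_gap_sec: float = 1.0) -> List[Dict]:
    """Merge consecutive segments from the same speaker into one turn.

    Hypothetical helper: a new chunk starts whenever the speaker changes
    or the silence between two segments exceeds max_gap_sec.
    """
    turns: List[Dict] = []
    for seg in segments:
        prev = turns[-1] if turns else None
        same_speaker = prev is not None and prev["speaker"] == seg["speaker"]
        if same_speaker and seg["start"] - prev["end"] <= max_gap_sec:
            # Duration-weighted average so long segments dominate the confidence.
            prev_dur = prev["end"] - prev["start"]
            seg_dur = seg["end"] - seg["start"]
            total = prev_dur + seg_dur
            if total > 0:
                prev["confidence"] = (
                    prev["confidence"] * prev_dur + seg["confidence"] * seg_dur
                ) / total
            prev["text"] = f'{prev["text"]} {seg["text"]}'.strip()
            prev["end"] = seg["end"]
        else:
            turns.append(dict(seg))  # copy so the input list is not mutated
    return turns


segments = [
    {"speaker": "Speaker_0", "start": 0.0, "end": 2.0, "text": "Hello,", "confidence": 0.9},
    {"speaker": "Speaker_0", "start": 2.2, "end": 4.2, "text": "welcome.", "confidence": 0.7},
    {"speaker": "Speaker_1", "start": 4.5, "end": 6.0, "text": "Thanks.", "confidence": 0.95},
]
turns = merge_speaker_turns(segments)
for t in turns:
    print(t["speaker"], t["start"], t["end"], t["text"])
```

Chunking this way keeps each embedded document attributable to exactly one speaker, at the cost of uneven chunk lengths.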
Code
# pip install openai-whisper pyannote.audio torch torchaudio langchain langchain-core langchain-openai cohere python-dotenv
import os
import whisper
from pyannote.audio import Pipeline
from langchain_openai import OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
import cohere
from dotenv import load_dotenv
load_dotenv()
# Step 1: Transcribe audio with timestamps
print("[STEP 1] Transcribing audio...")
whisper_model = whisper.load_model('base')
audio_file = 'interview.mp3'
result = whisper_model.transcribe(audio_file, language='en', verbose=False)
# Step 2: Run speaker diarization
print("[STEP 2] Running speaker diarization...")
diarization_pipeline = Pipeline.from_pretrained(
    'pyannote/speaker-diarization-3.1',
    use_auth_token=os.getenv('HF_TOKEN')
)
diarization = diarization_pipeline(audio_file)
# Step 3: Merge transcription + diarization into speaker turns with metadata
print("[STEP 3] Merging transcript with speaker labels...")
import math
# pyannote labels look like 'SPEAKER_00'; map them to 'Speaker_0', 'Speaker_1', ...
speaker_names = {
    label: f'Speaker_{i}'
    for i, label in enumerate(sorted(diarization.labels()))
}
speaker_turns = []
for segment, track, speaker_id in diarization.itertracks(yield_label=True):
    start_ts = segment.start
    end_ts = segment.end
    # Keep Whisper segments that fall entirely inside this speaker span.
    # Strict containment drops segments that straddle a boundary (see "Common gotcha").
    matched = [
        seg for seg in result['segments']
        if seg['start'] >= start_ts and seg['end'] <= end_ts
    ]
    if matched:
        text = ' '.join(seg['text'].strip() for seg in matched)
        # Whisper segments have no 'confidence' key; exp(avg_logprob) serves as
        # a rough 0-1 proxy, averaged over the matched segments.
        confidence = sum(math.exp(seg['avg_logprob']) for seg in matched) / len(matched)
        speaker_turns.append({
            'speaker': speaker_names[speaker_id],
            'text': text,
            'start_time': round(start_ts, 2),
            'end_time': round(end_ts, 2),
            'confidence': confidence,
            'duration_sec': round(end_ts - start_ts, 2)
        })
print(f"Extracted {len(speaker_turns)} speaker turns")
for turn in speaker_turns[:3]:
    print(f"  {turn['speaker']} [{turn['start_time']}-{turn['end_time']}s, conf={turn['confidence']:.2f}]: {turn['text'][:60]}...")
# Step 4: Create documents for embedding with metadata preservation
print("\n[STEP 4] Creating documents with metadata...")
from langchain_core.documents import Document
documents = []
for idx, turn in enumerate(speaker_turns):
    doc = Document(
        page_content=turn['text'],
        metadata={
            'speaker': turn['speaker'],
            'start_time': turn['start_time'],
            'end_time': turn['end_time'],
            'confidence': turn['confidence'],
            'duration_sec': turn['duration_sec'],
            'source': audio_file,
            'turn_id': idx
        }
    )
    documents.append(doc)
print(f"Created {len(documents)} documents")
# Step 5: Embed documents
print("\n[STEP 5] Embedding documents...")
embeddings = OpenAIEmbeddings(model='text-embedding-3-small')
vectorstore = InMemoryVectorStore.from_documents(
    documents,
    embeddings
)
print(f"Embedded {len(documents)} documents into vector store")
# Step 6: Multi-query retrieval to handle speaker-specific questions
print("\n[STEP 6] Setting up multi-query retriever...")
from langchain_openai import ChatOpenAI
# Import path as used in langchain 0.x releases; adjust if your version differs
from langchain.retrievers.multi_query import MultiQueryRetriever
llm = ChatOpenAI(model='gpt-4o-mini', temperature=0)
# from_llm builds the query-rewriting chain using the default multi-query prompt
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={'k': 5}),
    llm=llm
)
# Step 7: Rerank results by confidence + semantic relevance
print("\n[STEP 7] Configuring reranker...")
co = cohere.ClientV2(api_key=os.getenv('COHERE_API_KEY'))
def rerank_with_confidence(query, docs, top_k=3):
    """Rerank retrieved docs by Cohere score + confidence metadata"""
    doc_texts = [d.page_content for d in docs]
    rerank_result = co.rerank(
        model='rerank-english-v3.0',
        query=query,
        documents=doc_texts,
        top_n=top_k
    )
    reranked_docs = []
    for result in rerank_result.results:
        # result.index points back into the list passed as `documents`
        doc = docs[result.index]
        doc.metadata['rerank_score'] = result.relevance_score
        doc.metadata['confidence_weighted_score'] = (
            result.relevance_score * doc.metadata.get('confidence', 0.9)
        )
        reranked_docs.append(doc)
    return sorted(
        reranked_docs,
        key=lambda d: d.metadata['confidence_weighted_score'],
        reverse=True
    )
# Step 8: Example query with speaker context
print("\n[STEP 8] Testing retrieval with speaker query...")
query = "What did Speaker_0 say about the main topic?"
print(f"\nQuery: {query}")
initial_retrieval = multi_query_retriever.invoke(query)
final_results = rerank_with_confidence(query, initial_retrieval, top_k=3)
print(f"\nTop {len(final_results)} results (reranked):")
for i, doc in enumerate(final_results, 1):
    print(f"\n [{i}] {doc.metadata['speaker']} @ {doc.metadata['start_time']}s (conf={doc.metadata['confidence']:.2f})")
    print(f" Rerank score: {doc.metadata.get('rerank_score', 0):.3f}")
    print(f" Text: {doc.page_content[:80]}...")
Example output
[STEP 1] Transcribing audio...
[STEP 2] Running speaker diarization...
[STEP 3] Merging transcript with speaker labels...
Extracted 5 speaker turns
Speaker_0 [0.2-4.5s, conf=0.95]: Hello, welcome to our discussion about machine learning in production...
Speaker_1 [4.6-8.3s, conf=0.92]: Thanks for having me. I've worked on several large-scale RAG systems...
Speaker_0 [8.4-12.1s, conf=0.94]: Great. What are the biggest challenges you've encountered...
[STEP 4] Creating documents with metadata...
Created 5 documents
[STEP 5] Embedding documents...
Embedded 5 documents into vector store
[STEP 6] Setting up multi-query retriever...
[STEP 7] Configuring reranker...
[STEP 8] Testing retrieval with speaker query...
Query: What did Speaker_0 say about the main topic?
Top 3 results (reranked):
[1] Speaker_0 @ 0.2s (conf=0.95)
Rerank score: 0.923
Text: Hello, welcome to our discussion about machine learning in production...
[2] Speaker_0 @ 8.4s (conf=0.94)
Rerank score: 0.856
Text: Great. What are the biggest challenges you've encountered...
[3] Speaker_1 @ 4.6s (conf=0.92)
Rerank score: 0.701
Text: Thanks for having me. I've worked on several large-scale RAG systems...
Your options
End-to-end speech recognition (Whisper) + speaker diarization (Pyannote)
When you control audio quality and can tolerate 1-3% speaker confusion rate. Best for podcasts, interviews, recorded meetings with clear speakers.
Pros
Accurate speaker attribution, preserves timestamps, works offline, no API costs, easy metadata extraction
Cons
Two-stage pipeline (transcription then diarization) can accumulate errors; Pyannote requires GPU for speed; doesn't handle overlapping speakers well
# pip install openai-whisper pyannote.audio torch torchaudio
from pyannote.audio import Pipeline
import whisper
diarization = Pipeline.from_pretrained('pyannote/speaker-diarization-3.1', use_auth_token='YOUR_HF_TOKEN')
model = whisper.load_model('base')
result = model.transcribe('audio.mp3')
diarization_result = diarization('audio.mp3')
API-based (OpenAI Whisper API + custom speaker detection)
When latency is acceptable, audio quality varies, and you want managed infrastructure. Suitable for production SaaS where you bill per audio minute.
Pros
Managed scaling, minimal on-premise compute, handles various audio codecs automatically
Cons
API costs scale linearly with audio duration; no speaker diarization in Whisper API; network dependency; data leaves your infrastructure
# pip install openai
from openai import OpenAI
client = OpenAI()
with open('audio.mp3', 'rb') as f:
    transcript = client.audio.transcriptions.create(
        model='whisper-1',
        file=f,
        language='en'
    )
print(transcript.text)
Pre-segmented audio (external platform handles diarization)
When audio is already transcribed with speaker labels by a service (e.g., Otter.ai, Rev.com). You focus on RAG, not transcription.
Pros
Highest accuracy diarization, minimal compute, clean metadata input
Cons
Expensive per-minute cost, vendor lock-in, no control over speaker thresholds, slower time-to-insight
# Parse pre-segmented JSON from Otter.ai or similar
# Key names below are illustrative; adjust them to your vendor's export schema
import json
with open('transcript.json') as f:
    data = json.load(f)
for turn in data['speakers']:
    speaker = turn['speaker']
    text = turn['text']
    timestamp = turn['start_time']
    print(f"[{timestamp}] {speaker}: {text}")
Validation step
After reranking, inspect the top result: (1) verify the speaker metadata matches the query intent (if query specifies "Speaker_0", top result should be Speaker_0), (2) confirm rerank_score > 0.7 (below this indicates weak semantic match), (3) check that confidence_weighted_score heavily weights high-confidence segments (conf > 0.9 should rank above conf < 0.75 even with slightly lower rerank score), (4) manually spot-check that returned text actually answers the question (not just mentions speaker name), (5) verify start_time and end_time metadata are present and non-zero (indicates successful diarization merge).
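The automatable parts of this checklist (checks 1, 2, and 5) can be sketched as a small function over the metadata dicts of the reranked results. The function name validate_top_result and its parameters are hypothetical; checks 3 and 4 still need a human or a separate evaluation set. The timestamp check here tests that end_time comes after start_time rather than that both are non-zero, since a first turn can legitimately start at 0.0.

```python
from typing import Dict, List, Optional


def validate_top_result(
    results: List[Dict],
    expected_speaker: Optional[str] = None,
    min_rerank: float = 0.7,
) -> List[str]:
    """Return human-readable problems found in the top-ranked result's metadata."""
    if not results:
        return ["no results returned"]
    top = results[0]
    problems: List[str] = []
    # Check 1: speaker in metadata matches the speaker named in the query.
    if expected_speaker is not None and top.get("speaker") != expected_speaker:
        problems.append(
            f"top speaker is {top.get('speaker')!r}, expected {expected_speaker!r}"
        )
    # Check 2: rerank score must exceed the threshold (0.7 in the text above).
    if top.get("rerank_score", 0.0) <= min_rerank:
        problems.append(f"rerank_score is not above {min_rerank}")
    # Check 5: timestamps survived the diarization merge.
    start, end = top.get("start_time"), top.get("end_time")
    if start is None or end is None:
        problems.append("start_time or end_time missing")
    elif end <= start:
        problems.append("end_time is not after start_time")
    return problems


good = [{"speaker": "Speaker_0", "rerank_score": 0.923, "start_time": 0.2, "end_time": 4.5}]
bad = [{"speaker": "Speaker_1", "rerank_score": 0.5, "start_time": 4.6}]
print(validate_top_result(good, expected_speaker="Speaker_0"))
print(validate_top_result(bad, expected_speaker="Speaker_0"))
```

With LangChain documents, pass `[d.metadata for d in final_results]` as the results argument.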
At scale
At scale (>10 hrs audio, >500 speaker turns), four bottlenecks emerge: (1) Pyannote diarization becomes GPU-memory constrained on files >2 hrs; break the audio into 30-min chunks and stitch speaker IDs post-hoc, (2) Cohere reranker API costs scale linearly with the number of documents reranked; at 500 turns, reranking every retrieved set becomes expensive, so cache rerank results by query hash or batch-rerank offline for predictable query patterns, (3) the in-memory vector store keeps everything in process memory and scans every vector, so it becomes impractical beyond roughly 1,000 documents; switch to Pinecone or Weaviate and filter on speaker and a confidence threshold before retrieval to reduce the rerank set size, (4) the multi-query retriever multiplies API costs because each query triggers an LLM call to generate variations; cache query expansions or fall back to static query templates for common patterns like "What did [speaker] say about [topic]?"
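Caching rerank results by query hash can be sketched as a thin wrapper around whatever rerank call you use. This is an illustration under assumptions: RerankCache and toy_rerank are hypothetical names, the cache is an in-process dict (a production setup would more likely use an external store), and toy_rerank is a stand-in word-overlap scorer, not the Cohere API.

```python
import hashlib
from typing import Callable, Dict, List


class RerankCache:
    """Memoize rerank scores keyed by a hash of the query and candidate doc IDs."""

    def __init__(self, rerank_fn: Callable[[str, Dict[str, str]], Dict[str, float]]):
        self._rerank_fn = rerank_fn
        self._store: Dict[str, Dict[str, float]] = {}
        self.misses = 0  # number of times the underlying reranker was called

    @staticmethod
    def _key(query: str, doc_ids: List[str]) -> str:
        # Normalize the query so trivial variations hit the same entry.
        payload = query.strip().lower() + "|" + ",".join(sorted(doc_ids))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def scores(self, query: str, docs: Dict[str, str]) -> Dict[str, float]:
        key = self._key(query, list(docs))
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._rerank_fn(query, docs)
        return self._store[key]


def toy_rerank(query: str, docs: Dict[str, str]) -> Dict[str, float]:
    """Stand-in scorer: fraction of query words that appear in each document."""
    q_words = set(query.lower().split())
    return {
        doc_id: len(q_words & set(text.lower().split())) / max(len(q_words), 1)
        for doc_id, text in docs.items()
    }


cache = RerankCache(toy_rerank)
docs = {"t0": "welcome to machine learning", "t1": "thanks for having me"}
first = cache.scores("machine learning", docs)
second = cache.scores("Machine Learning ", docs)  # same key after normalization
print(first, cache.misses)
```

Including the candidate doc IDs in the key avoids serving stale scores when the retrieved set changes for the same query.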
Rollback plan
If reranking produces the wrong speaker or low-confidence segments: (1) raise the confidence bar for document inclusion (currently all turns are included; filter out conf < 0.80 after diarization), (2) re-run diarization with a stricter speaker-overlap threshold (for example, reducing a 0.5s setting to 0.2s to force harder speaker boundaries; check the pipeline's actual hyperparameters for your Pyannote version), (3) check whether multi-query expansion is generating off-topic variations (log all generated queries and inspect them); if so, give MultiQueryRetriever an explicit few-shot prompt that keeps the speaker identity in every variation, (4) if confidence metadata is missing (the metadata['confidence'] key is absent), fall back to a segment-level proxy stored during transcription, such as Whisper's avg_logprob.
Debug symptoms
All retrieved documents show same speaker despite multi-query expansion, or rerank_score is uniformly low (< 0.5)
Diagnosis
Multi-query retriever is not generating speaker-aware query variations, or embeddings are not capturing speaker semantics. More likely: diarization failed silently (all turns assigned to Speaker_0); or speaker metadata was not preserved in Document.metadata during creation.
Fix
Add logging: print(initial_retrieval[0].metadata) to verify that the speaker field exists and varies across results. Inspect the raw diarization output before the merge step (for example, print diarization.labels() or iterate diarization.itertracks(yield_label=True)) to confirm that more than one speaker label was produced. Ensure Document() is initialized with a metadata dict, not page_content only.
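A quick way to run that check over a whole result set is to count speaker labels across the metadata dicts. The helper name speaker_report and the sample data are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, List


def speaker_report(metadatas: List[Dict]) -> Dict:
    """Summarize speaker labels across retrieved-document metadata dicts."""
    missing = sum(1 for m in metadatas if "speaker" not in m)
    counts = Counter(m["speaker"] for m in metadatas if "speaker" in m)
    return {
        "counts": dict(counts),
        "missing_speaker_key": missing,
        # One label across several docs suggests diarization collapsed to one speaker.
        "single_speaker": len(counts) == 1 and len(metadatas) > 1,
    }


retrieved = [
    {"speaker": "Speaker_0", "start_time": 0.2},
    {"speaker": "Speaker_0", "start_time": 8.4},
    {"start_time": 4.6},  # speaker key lost during Document creation
]
report = speaker_report(retrieved)
print(report)
```

With LangChain documents, build the input with `[d.metadata for d in initial_retrieval]`.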
Retrieval returns the correct speaker but the wrong time period (e.g., Speaker_0 at 45s when the question implies the beginning of the call)
Diagnosis
Confidence-weighted scoring in reranker is not strong enough; high-confidence segments at wrong timestamp are outranking correct segments. Or query expansion generated temporal keywords that don't exist in query.
Fix
Increase the confidence weighting: change `result.relevance_score * doc.metadata['confidence']` to `result.relevance_score * (doc.metadata['confidence'] ** 2)` to penalize low-confidence segments more steeply (squaring gives a quadratic penalty, not an exponential one). Add explicit time filtering post-retrieval if the query contains temporal markers ("at the beginning," "near the end").
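Both fixes can be sketched without any retrieval library. The function names, the keyword list, and the choice of a 20% window at either end of the recording are assumptions for illustration; real queries will need a richer set of temporal markers.

```python
from typing import Dict, List, Optional, Tuple


def weighted_score(relevance: float, confidence: float, power: int = 2) -> float:
    """Relevance scaled by confidence raised to a power (2 = squared penalty)."""
    return relevance * confidence ** power


def time_window(query: str, total_sec: float, fraction: float = 0.2) -> Optional[Tuple[float, float]]:
    """Map a temporal marker in the query to a (start, end) window in seconds."""
    q = query.lower()
    if "at the beginning" in q or "at the start" in q:
        return (0.0, total_sec * fraction)
    if "near the end" in q or "at the end" in q:
        return (total_sec * (1 - fraction), total_sec)
    return None  # no temporal marker found: do not filter


def filter_by_time(docs: List[Dict], window: Optional[Tuple[float, float]]) -> List[Dict]:
    """Keep docs whose [start_time, end_time] intersects the window."""
    if window is None:
        return docs
    lo, hi = window
    return [d for d in docs if d["start_time"] < hi and d["end_time"] > lo]


docs = [
    {"speaker": "Speaker_0", "start_time": 5.0, "end_time": 10.0},
    {"speaker": "Speaker_0", "start_time": 45.0, "end_time": 50.0},
]
window = time_window("What did Speaker_0 say at the beginning of the call?", total_sec=100.0)
early = filter_by_time(docs, window)
print(window, early)
print(weighted_score(0.8, 0.5))
```

Filtering by time before scoring keeps the confidence weighting from having to compensate for a signal (position in the recording) that neither score encodes.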
Code crashes at diarization step with CUDA out of memory or hangs indefinitely
Diagnosis
Audio file too long (>2 hrs) for Pyannote on single GPU, or model checkpoint download blocked. Pyannote processes entire audio in memory.
Fix
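Chunk the audio into 30-min segments using pydub (which slices in milliseconds): `audio[i*1800000:(i+1)*1800000].export()`, run diarization on each chunk separately, then reconcile speaker IDs across chunks. Labels are assigned independently per run, so Speaker_2 in chunk 1 is not necessarily Speaker_2 in chunk 2; match speakers across chunks (for example, by comparing voice embeddings) and relabel so each person keeps one ID. Or use faster-whisper + simple speaker detection (e.g., voice print clustering) instead of Pyannote.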
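The chunk arithmetic is easy to get wrong at the last chunk and when converting chunk-local timestamps back to file time. The sketch below computes the millisecond ranges and the offset conversion in plain Python; the function names are hypothetical, and the pydub slicing itself is shown only as a comment.

```python
from typing import List, Tuple

CHUNK_MS = 30 * 60 * 1000  # 30 minutes in milliseconds (1,800,000)


def chunk_ranges_ms(total_ms: int, chunk_ms: int = CHUNK_MS) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) pairs covering the audio; the last one may be shorter."""
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]


def to_global_seconds(chunk_index: int, local_sec: float, chunk_ms: int = CHUNK_MS) -> float:
    """Convert a timestamp inside a chunk to seconds from the start of the file."""
    return chunk_index * chunk_ms / 1000.0 + local_sec


ranges = chunk_ranges_ms(75 * 60 * 1000)  # a 75-minute recording
for i, (start_ms, end_ms) in enumerate(ranges):
    # With pydub this would be: audio[start_ms:end_ms].export(f"chunk_{i}.wav", format="wav")
    print(i, start_ms, end_ms)
print(to_global_seconds(2, 12.5))
```

Diarization output for each chunk starts at zero, so every segment timestamp needs the offset applied before it is merged with the full-file transcript.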
Production upgrade path
The tutorial version above uses InMemoryVectorStore. For production: (1) Replace it with a Pinecone or Weaviate index with metadata filtering; add speaker_id as an indexed field to enable fast filtering (speaker = 'Speaker_0') before retrieval, reducing rerank set size by roughly 70%, (2) Move reranking to an async batch job: store retrieved candidates in a queue, rerank offline with Cohere, cache rerank scores by (query_hash, doc_id), and serve cached scores from a latency-optimized store (Redis), (3) Add a speaker embedding layer: fine-tune a voice embedding model (e.g., speaker-encoder) on your audio corpus to cluster segments by actual speaker voice, then use voice similarity as a secondary filter alongside diarization labels (this handles diarization errors), (4) Implement query-time speaker detection: if the query mentions a name, run entity extraction + speaker mapping ("Alice" → Speaker_2) once at query time, then filter metadata before retrieval; this avoids generating 10 multi-query variations when the speaker is explicit, (5) Add confidence-aware chunking: don't just chunk on speaker boundaries; also segment low-confidence regions (conf < 0.75) separately and mark them as 'disputed_speech' so the LLM can add an uncertainty caveat in its response.
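Item (4), query-time speaker detection, can be sketched with a simple name lookup. This is a minimal stand-in for real entity extraction: the name_to_label mapping is assumed to be maintained by hand (or produced elsewhere), and resolve_speaker and filter_by_speaker are hypothetical helper names.

```python
import re
from typing import Dict, List, Optional


def resolve_speaker(query: str, name_to_label: Dict[str, str]) -> Optional[str]:
    """Return the diarization label for the first known name found in the query."""
    for name, label in name_to_label.items():
        # Word boundaries prevent "Alice" from matching inside "malice".
        if re.search(rf"\b{re.escape(name)}\b", query, flags=re.IGNORECASE):
            return label
    return None


def filter_by_speaker(docs: List[Dict], label: Optional[str]) -> List[Dict]:
    """Keep only docs from the resolved speaker; pass everything through if none."""
    if label is None:
        return docs
    return [d for d in docs if d.get("speaker") == label]


name_to_label = {"Alice": "Speaker_2", "Bob": "Speaker_0"}  # assumed mapping
docs = [
    {"speaker": "Speaker_2", "text": "Pricing should stay flat."},
    {"speaker": "Speaker_0", "text": "I disagree on pricing."},
]
label = resolve_speaker("What did alice say about pricing?", name_to_label)
candidates = filter_by_speaker(docs, label)
print(label, candidates)
```

In a hosted vector store the same label would be passed as a metadata filter in the query, so the filtering happens before any embeddings are compared.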
Common gotcha
The most common failure is merging transcription segments with diarization speaker spans by strict timestamp containment. A transcription segment at [2.0-3.6s] and a speaker span at [2.1-3.5s] overlap almost entirely, yet the segment is not fully inside the span, so its text is dropped or assigned to the wrong speaker. Assign a segment to a speaker when the overlap covers at least 80% of the segment (overlap ratio >= 0.8), and handle smaller overlaps by splitting the transcription segment at the speaker boundary. Additionally, confirm that both timelines use the same units and origin: Whisper and Pyannote both report seconds from file start, but tools such as pydub index in milliseconds, and off-by-factor-of-1000 errors silently produce wrong speaker assignments.
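An overlap-based merge along these lines can be sketched as follows. The function names and dict layout are assumptions. One limitation is deliberate: without word-level timestamps the text itself cannot be divided, so a split segment keeps its full text on every piece and is flagged as partial for later handling.

```python
from typing import Dict, List, Tuple


def overlap_sec(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    segments: List[Dict],
    spans: List[Tuple[float, float, str]],
    min_ratio: float = 0.8,
) -> List[Dict]:
    """Attach a speaker to each transcription segment by overlap ratio."""
    out: List[Dict] = []
    for seg in segments:
        dur = seg["end"] - seg["start"]
        if dur <= 0:
            continue
        hits = [
            (overlap_sec(seg["start"], seg["end"], s, e), s, e, spk)
            for s, e, spk in spans
        ]
        hits = [h for h in hits if h[0] > 0]
        if not hits:
            # Keep the text rather than dropping it; speaker is unknown.
            out.append({**seg, "speaker": None, "partial": False})
            continue
        best = max(hits, key=lambda h: h[0])
        if best[0] / dur >= min_ratio:
            out.append({**seg, "speaker": best[3], "partial": False})
        else:
            # Split at speaker boundaries: one piece per overlapping span.
            for _, s, e, spk in sorted(hits, key=lambda h: h[1]):
                out.append({
                    "start": max(seg["start"], s),
                    "end": min(seg["end"], e),
                    "text": seg["text"],
                    "speaker": spk,
                    "partial": True,
                })
    return out


segments = [
    {"start": 2.0, "end": 3.6, "text": "Sure, that works."},
    {"start": 4.0, "end": 6.0, "text": "And then we... right, exactly."},
]
spans = [(2.1, 3.5, "Speaker_0"), (3.8, 5.0, "Speaker_0"), (5.0, 6.5, "Speaker_1")]
assigned = assign_speakers(segments, spans)
for a in assigned:
    print(a["start"], a["end"], a["speaker"], a["partial"])
```

The first segment overlaps its span for 1.4 of 1.6 seconds (ratio 0.875), so it is assigned whole even though strict containment would have dropped it.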
Experienced dev note
Audio RAG appears straightforward (transcribe → embed → retrieve) but fails 80% of the time due to diarization errors or poor speaker attribution in retrieval. The key insight: speaker metadata is more important than raw text similarity. A query "What did Alice say?" should return any Alice segment with high confidence, even if the text is generic. This means: (1) prioritize confidence metadata in reranking over semantic score, (2) apply an explicit speaker filter in the retriever before reranking if a speaker is mentioned (for example, a metadata filter such as {'speaker': 'Speaker_0'}; the exact syntax depends on the vector store), (3) expect 3-5% speaker confusion even with good diarization; design the LLM prompt to ask for timestamp confirmation when ambiguous. At scale, batch transcription + diarization offline (asynchronously) and store speaker-turn documents in a production database with a speaker index; don't compute them on the retrieval request. Finally: test on real meeting audio (overlapping speech, side conversations), not clean podcast data: Pyannote's accuracy drops 15-20% on overlapping speech.
Check your understanding
In a multi-speaker RAG system, why does Cohere reranking weighted by confidence metadata produce better speaker attribution than semantic similarity alone, and what failure mode do you risk if you weight confidence_weighted_score by multiplication rather than additive bonus?
Show answer hint
Semantic similarity (embedding distance) is speaker-agnostic: "yes" from Speaker_0 and "yes" from Speaker_1 have nearly identical embeddings. Confidence metadata encodes how certain the pipeline is about a segment (0.95 = high certainty; 0.62 = the model is guessing); in the tutorial code it is taken from the transcription step. Multiplication amplifies the penalty on low-confidence results (0.7 relevance × 0.62 confidence = 0.43), risking rejection of correct-speaker low-confidence segments. An additive bonus, or an explicit exponent on confidence, gives more control over how strongly confidence counts. The real trap: rerank_score and confidence depend on different factors (one semantic, one acoustic), so multiplying them can create pathological behavior where high-confidence wrong-speaker segments beat low-confidence correct-speaker segments.
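The failure mode is easy to see with two candidate segments. The numbers below are made up for illustration, and the additive form with a bonus_weight of 0.1 is one possible alternative, not a recommended setting.

```python
def multiplicative(relevance: float, confidence: float) -> float:
    """Score used in the tutorial code: relevance scaled by confidence."""
    return relevance * confidence


def additive(relevance: float, confidence: float, bonus_weight: float = 0.1) -> float:
    """Alternative: relevance plus a small confidence bonus."""
    return relevance + bonus_weight * confidence


# Illustrative candidates: the correct speaker was transcribed with low
# confidence; the wrong speaker has a weaker match but a clean recording.
correct = {"relevance": 0.70, "confidence": 0.62}
wrong = {"relevance": 0.50, "confidence": 0.95}

mult_correct = multiplicative(**correct)  # 0.70 * 0.62 = 0.434
mult_wrong = multiplicative(**wrong)      # 0.50 * 0.95 = 0.475
add_correct = additive(**correct)         # 0.70 + 0.062 = 0.762
add_wrong = additive(**wrong)             # 0.50 + 0.095 = 0.595

print("multiplicative winner:", "correct" if mult_correct > mult_wrong else "wrong")
print("additive winner:", "correct" if add_correct > add_wrong else "wrong")
```

Under multiplication the wrong-speaker segment ranks first; under the additive bonus the ordering follows relevance, with confidence acting only as a tie-breaker.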