How-to · Intermediate · 3 min read

Cerebras for RAG pipelines

Quick answer
Use the Cerebras API through its OpenAI-compatible interface to build RAG pipelines that combine vector search with chat.completions.create calls: first run a vector similarity search over your document embeddings, then pass the retrieved context in the chat messages to the llama3.3-70b model for generation.

PREREQUISITES

  • Python 3.8+
  • CEREBRAS_API_KEY environment variable set
  • pip install "openai>=1.0" numpy (quote the version specifier so the shell does not treat >= as a redirection)

Setup

Install the openai Python package (v1+) to access the Cerebras API via its OpenAI-compatible endpoint. Set your API key in the environment variable CEREBRAS_API_KEY. You will also need a vector store or embeddings for your documents to perform retrieval.

bash
pip install openai numpy
output
Collecting openai
Collecting numpy
Installing collected packages: numpy, openai
Successfully installed numpy-1.25.2 openai-1.0.0
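If the key is not already in your environment, export it for the current shell session (the value below is a placeholder, not a real key):

```shell
export CEREBRAS_API_KEY="csk-your-key-here"
```

Add the line to your shell profile (e.g. ~/.bashrc) to make it persistent.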

Step by step

This example demonstrates a simple RAG pipeline using the Cerebras API. It performs a vector similarity search on precomputed embeddings, retrieves the top documents, and sends them as context to the llama3.3-70b chat model for answer generation.

python
import os
import numpy as np
from openai import OpenAI

# Initialize Cerebras client with OpenAI-compatible SDK
client = OpenAI(api_key=os.environ["CEREBRAS_API_KEY"], base_url="https://api.cerebras.ai/v1")

# Example document embeddings (precomputed, shape: num_docs x embedding_dim)
doc_embeddings = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.1, 0.5],
    [0.3, 0.7, 0.2]
])
doc_texts = [
    "Document about AI and machine learning.",
    "Information on natural language processing.",
    "Details on retrieval-augmented generation techniques."
]

# Query embedding (example)
query_embedding = np.array([0.2, 0.1, 0.4])

# Compute cosine similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Retrieve top-k documents
k = 2
similarities = [cosine_similarity(query_embedding, d) for d in doc_embeddings]
top_k_indices = np.argsort(similarities)[-k:][::-1]

# Prepare context from top documents
context = "\n".join(doc_texts[i] for i in top_k_indices)

# Create chat completion with context
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": f"Context:\n{context}\n\nAnswer the question: What is retrieval-augmented generation?"}
]

response = client.chat.completions.create(
    model="llama3.3-70b",
    messages=messages,
    max_tokens=512
)

print("Answer:", response.choices[0].message.content)
output
Answer: Retrieval-augmented generation (RAG) is a technique that combines retrieval of relevant documents with generative language models to produce more accurate and context-aware responses. It uses vector search to find pertinent information and then generates answers based on that context.
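For a larger corpus, the per-document similarity loop above can be replaced with a single matrix product: normalize the document matrix once, then one matmul scores every document at once. A sketch on the same toy embeddings (top_k_cosine is a helper name introduced here for illustration):

```python
import numpy as np

doc_embeddings = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.1, 0.5],
    [0.3, 0.7, 0.2],
])
query_embedding = np.array([0.2, 0.1, 0.4])

def top_k_cosine(query, docs, k=2):
    """Vectorized cosine similarity: normalize rows once, then one matmul."""
    docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    sims = docs_n @ query_n              # shape: (num_docs,)
    return np.argsort(sims)[-k:][::-1]   # indices of top-k, best first

top_k_indices = top_k_cosine(query_embedding, doc_embeddings)
```

The normalized document matrix can be cached and reused across queries, so each retrieval costs one matrix-vector product.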

Common variations

  • Use asynchronous calls with asyncio and an AsyncOpenAI client (await client.chat.completions.create(...)) for non-blocking RAG pipelines.
  • Switch to smaller Cerebras models like llama3.1-8b for faster inference with less memory.
  • Integrate external vector databases (e.g., Pinecone, FAISS) for scalable retrieval instead of in-memory embeddings.
  • Stream responses by setting stream=True in chat.completions.create to handle large outputs efficiently.
python
import asyncio
import os

from openai import AsyncOpenAI

# Async calls require the AsyncOpenAI client; the synchronous client
# used above cannot be awaited.
async_client = AsyncOpenAI(
    api_key=os.environ["CEREBRAS_API_KEY"],
    base_url="https://api.cerebras.ai/v1",
)

async def async_rag():
    # `messages` comes from the walkthrough above
    response = await async_client.chat.completions.create(
        model="llama3.3-70b",
        messages=messages,
        max_tokens=512,
        stream=True
    )
    async for chunk in response:
        print(chunk.choices[0].delta.content or '', end='', flush=True)

# asyncio.run(async_rag())
output
Retrieval-augmented generation (RAG) is a technique that combines retrieval of relevant documents with generative language models to produce more accurate and context-aware responses...
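Streaming also works with the synchronous client from the walkthrough. A sketch that wraps the call in a small helper (stream_answer is a name introduced here, not part of the SDK), so tokens print as they arrive and the full answer is still returned:

```python
def stream_answer(client, messages, model="llama3.3-70b", max_tokens=512):
    """Stream a chat completion, printing tokens as they arrive.

    Returns the full assembled answer. `client` is the OpenAI-compatible
    client configured for https://api.cerebras.ai/v1 as in the walkthrough.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
        stream=True,
    )
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)
        parts.append(delta)
    return "".join(parts)

# answer = stream_answer(client, messages)
```

Keeping the token-assembly logic in one function makes it easy to reuse between the sync and async variants.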

Troubleshooting

  • If you get authentication errors, verify your CEREBRAS_API_KEY environment variable is set correctly.
  • For slow responses, try smaller models like llama3.1-8b or reduce max_tokens.
  • If vector search returns irrelevant documents, check your embedding quality and similarity metric.
  • Ensure your network allows HTTPS requests to https://api.cerebras.ai/v1.

Key Takeaways

  • Use Cerebras API's OpenAI-compatible client with base_url set to https://api.cerebras.ai/v1.
  • Combine vector similarity search with chat completions for effective RAG pipelines.
  • Leverage streaming and async calls for scalable and responsive applications.
  • Choose model size based on latency and resource constraints.
  • Validate API key and network connectivity to avoid common errors.
Verified 2026-04 · llama3.3-70b, llama3.1-8b