How-to · Intermediate · 3 min read

Cerebras for RAG pipelines

Quick answer
Use the Cerebras API through its OpenAI-compatible interface to build RAG pipelines that combine vector search with chat.completions.create calls: first run a vector similarity search over your document embeddings, then pass the retrieved context in the chat messages to the llama3.3-70b model for generation.

PREREQUISITES

  • Python 3.8+
  • CEREBRAS_API_KEY environment variable set
  • pip install "openai>=1.0" numpy (quote the version specifier so the shell does not treat >= as a redirection)

Setup

Install the openai Python package (v1+) to access the Cerebras API via its OpenAI-compatible endpoint. Set your API key in the environment variable CEREBRAS_API_KEY. You will also need a vector store or embeddings for your documents to perform retrieval.

bash
pip install openai numpy
output
Collecting openai
Collecting numpy
Installing collected packages: numpy, openai
Successfully installed numpy-1.25.2 openai-1.0.0
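If the key is not already in your environment, export it for the current shell session (the value below is a placeholder, not a real key):

```shell
export CEREBRAS_API_KEY="csk-your-key-here"
```

Add the line to your shell profile (e.g. ~/.bashrc) to make it persistent.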

Step by step

This example demonstrates a simple RAG pipeline using the Cerebras API. It performs a vector similarity search on precomputed embeddings, retrieves the top documents, and sends them as context to the llama3.3-70b chat model for answer generation.

python
import os
import numpy as np
from openai import OpenAI

# Initialize Cerebras client with OpenAI-compatible SDK
client = OpenAI(api_key=os.environ["CEREBRAS_API_KEY"], base_url="https://api.cerebras.ai/v1")

# Example document embeddings (precomputed, shape: num_docs x embedding_dim)
doc_embeddings = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.1, 0.5],
    [0.3, 0.7, 0.2]
])
doc_texts = [
    "Document about AI and machine learning.",
    "Information on natural language processing.",
    "Details on retrieval-augmented generation techniques."
]

# Query embedding (example)
query_embedding = np.array([0.2, 0.1, 0.4])

# Compute cosine similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Retrieve top-k documents
k = 2
similarities = [cosine_similarity(query_embedding, d) for d in doc_embeddings]
top_k_indices = np.argsort(similarities)[-k:][::-1]

# Prepare context from top documents
context = "\n".join(doc_texts[i] for i in top_k_indices)

# Create chat completion with context
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": f"Context:\n{context}\n\nAnswer the question: What is retrieval-augmented generation?"}
]

response = client.chat.completions.create(
    model="llama3.3-70b",
    messages=messages,
    max_tokens=512
)

print("Answer:", response.choices[0].message.content)
output
Answer: Retrieval-augmented generation (RAG) is a technique that combines retrieval of relevant documents with generative language models to produce more accurate and context-aware responses. It uses vector search to find pertinent information and then generates answers based on that context.
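For a larger corpus, the per-document similarity loop above can be replaced with a single matrix product: normalize the document matrix once, then one matmul scores every document at once. A sketch on the same toy embeddings (top_k_cosine is a helper name introduced here for illustration):

```python
import numpy as np

doc_embeddings = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.1, 0.5],
    [0.3, 0.7, 0.2],
])
query_embedding = np.array([0.2, 0.1, 0.4])

def top_k_cosine(query, docs, k=2):
    """Vectorized cosine similarity: normalize rows once, then one matmul."""
    docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    sims = docs_n @ query_n              # shape: (num_docs,)
    return np.argsort(sims)[-k:][::-1]   # indices of top-k, best first

top_k_indices = top_k_cosine(query_embedding, doc_embeddings)
```

The normalized document matrix can be cached and reused across queries, so each retrieval costs one matrix-vector product.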

Common variations

  • Use asynchronous calls with asyncio and an AsyncOpenAI client (await client.chat.completions.create(...)) for non-blocking RAG pipelines.
  • Switch to smaller Cerebras models like llama3.1-8b for faster inference with less memory.
  • Integrate external vector databases (e.g., Pinecone, FAISS) for scalable retrieval instead of in-memory embeddings.
  • Stream responses by setting stream=True in chat.completions.create to handle large outputs efficiently.
python
import asyncio
import os

from openai import AsyncOpenAI

# Async calls require the AsyncOpenAI client; the synchronous client
# used above cannot be awaited.
async_client = AsyncOpenAI(
    api_key=os.environ["CEREBRAS_API_KEY"],
    base_url="https://api.cerebras.ai/v1",
)

async def async_rag():
    # `messages` comes from the walkthrough above
    response = await async_client.chat.completions.create(
        model="llama3.3-70b",
        messages=messages,
        max_tokens=512,
        stream=True
    )
    async for chunk in response:
        print(chunk.choices[0].delta.content or '', end='', flush=True)

# asyncio.run(async_rag())
output
Retrieval-augmented generation (RAG) is a technique that combines retrieval of relevant documents with generative language models to produce more accurate and context-aware responses...
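Streaming also works with the synchronous client from the walkthrough. A sketch that wraps the call in a small helper (stream_answer is a name introduced here, not part of the SDK), so tokens print as they arrive and the full answer is still returned:

```python
def stream_answer(client, messages, model="llama3.3-70b", max_tokens=512):
    """Stream a chat completion, printing tokens as they arrive.

    Returns the full assembled answer. `client` is the OpenAI-compatible
    client configured for https://api.cerebras.ai/v1 as in the walkthrough.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
        stream=True,
    )
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)
        parts.append(delta)
    return "".join(parts)

# answer = stream_answer(client, messages)
```

Keeping the token-assembly logic in one function makes it easy to reuse between the sync and async variants.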

Troubleshooting

  • If you get authentication errors, verify your CEREBRAS_API_KEY environment variable is set correctly.
  • For slow responses, try smaller models like llama3.1-8b or reduce max_tokens.
  • If vector search returns irrelevant documents, check your embedding quality and similarity metric.
  • Ensure your network allows HTTPS requests to https://api.cerebras.ai/v1.

Key Takeaways

  • Use Cerebras API's OpenAI-compatible client with base_url set to https://api.cerebras.ai/v1.
  • Combine vector similarity search with chat completions for effective RAG pipelines.
  • Leverage streaming and async calls for scalable and responsive applications.
  • Choose model size based on latency and resource constraints.
  • Validate API key and network connectivity to avoid common errors.
Verified 2026-04 · llama3.3-70b, llama3.1-8b