How to use Llama on Cerebras
Quick answer
Use the OpenAI Python SDK with the base_url set to Cerebras's API endpoint and specify a Llama model like llama3.3-70b. Instantiate the client with your Cerebras API key from os.environ, then call chat.completions.create with your messages to interact with the model.
Prerequisites
- Python 3.8+
- Cerebras API key set in the environment variable CEREBRAS_API_KEY
- pip install openai>=1.0
Setup
Install the openai Python package and set your Cerebras API key as an environment variable. Cerebras uses an OpenAI-compatible API with a custom base_url.
pip install openai

output

Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
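Before moving on, you can confirm the package is importable from Python. A minimal stdlib-only sketch; the helper name openai_installed is illustrative, not part of any SDK:

```python
import importlib.metadata

def openai_installed() -> bool:
    """Report whether the openai package is installed in this environment."""
    try:
        importlib.metadata.version("openai")
        return True
    except importlib.metadata.PackageNotFoundError:
        return False

print("openai installed:", openai_installed())
```

If this prints False, rerun the pip install step before continuing.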
Step by step
Use the OpenAI client with base_url set to Cerebras's API endpoint. Specify the Llama model name and send chat messages. The example below sends a prompt and prints the assistant's reply.
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["CEREBRAS_API_KEY"],
    base_url="https://api.cerebras.ai/v1"
)

response = client.chat.completions.create(
    model="llama3.3-70b",
    messages=[{"role": "user", "content": "Hello, how do I use Llama on Cerebras?"}]
)
print("Assistant reply:", response.choices[0].message.content)

output
Assistant reply: Hello! You can use Llama models on Cerebras by connecting via their OpenAI-compatible API endpoint and sending chat requests as shown.
Common variations
Use smaller Llama models such as llama3.1-8b by changing the model parameter. For streaming responses, add stream=True and iterate over the returned chunks. For async code, switch to the AsyncOpenAI client and await the call, as in the example below.
import os
import asyncio
from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI(
        api_key=os.environ["CEREBRAS_API_KEY"],
        base_url="https://api.cerebras.ai/v1"
    )
    # With stream=True, create() returns an async iterator of chunks.
    stream = await client.chat.completions.create(
        model="llama3.1-8b",
        messages=[{"role": "user", "content": "Stream a response from Llama on Cerebras."}],
        stream=True
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)

asyncio.run(main())

output
Streaming assistant reply text appears token by token in the console.
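The streaming loop itself does not require async code. Below is a synchronous sketch under the same assumptions (endpoint URL and model name as above); collect_stream is a hypothetical helper that just concatenates the text deltas, and the network call is guarded behind the API-key check so the helper can be read and tested on its own:

```python
import os

def collect_stream(chunks):
    """Concatenate text deltas from a chat-completion stream, printing as it goes."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)
        parts.append(delta)
    return "".join(parts)

# Only reach the network when a key is configured (endpoint assumed as above).
if os.environ.get("CEREBRAS_API_KEY"):
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["CEREBRAS_API_KEY"],
        base_url="https://api.cerebras.ai/v1"
    )
    # stream=True makes create() return an iterator of chunks.
    stream = client.chat.completions.create(
        model="llama3.1-8b",
        messages=[{"role": "user", "content": "Say hello."}],
        stream=True
    )
    collect_stream(stream)
```

The sync and async variants differ only in the client class and the for/async for loop; the chunk shape is the same.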
Troubleshooting
If you get authentication errors, verify your CEREBRAS_API_KEY environment variable is set correctly. For network errors, check your internet connection and that https://api.cerebras.ai/v1 is reachable. If the model name is invalid, confirm you are using a supported Llama model like llama3.3-70b or llama3.1-8b.
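Many authentication failures come down to the environment variable itself. A small pre-flight check you could run before the first request; check_setup is an illustrative helper, not part of any SDK:

```python
import os

def check_setup(env=None):
    """Return a list of configuration problems found before calling the API."""
    env = os.environ if env is None else env
    problems = []
    key = env.get("CEREBRAS_API_KEY", "")
    if not key:
        problems.append("CEREBRAS_API_KEY is not set")
    elif key != key.strip():
        # Whitespace from copy-pasting the key is an easy mistake to miss.
        problems.append("CEREBRAS_API_KEY has surrounding whitespace")
    return problems

print(check_setup({"CEREBRAS_API_KEY": "example-key"}))  # []
```

An empty list means the key at least exists and is clean; it does not prove the key is valid, which only a live request can confirm.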
Key Takeaways
- Use the OpenAI SDK with Cerebras's base_url to access Llama models.
- Set your Cerebras API key in the environment variable CEREBRAS_API_KEY.
- Specify supported Llama model names like llama3.3-70b for chat completions.
- Streaming and async calls are supported with the OpenAI SDK.
- Check environment variables and network connectivity if errors occur.