How-to · Beginner · 3 min read

How to use Mistral with LiteLLM

Quick answer
To use Mistral models with LiteLLM, install the litellm Python package, set your MISTRAL_API_KEY environment variable, and call litellm.completion() with a "mistral/"-prefixed model name and an OpenAI-style list of chat messages. LiteLLM translates this single, OpenAI-compatible interface into the request format the Mistral API expects, so the same code works unchanged across providers.

PREREQUISITES

  • Python 3.8+
  • pip install litellm
  • A Mistral API key, exported as the MISTRAL_API_KEY environment variable

Setup

Install the litellm package via pip and make your Mistral API key available as the MISTRAL_API_KEY environment variable. LiteLLM reads the key from the environment when it routes requests to the Mistral API.

bash
pip install litellm
export MISTRAL_API_KEY="your-api-key"
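Since a missing key is the most common setup failure, it can help to fail fast before making any calls. A minimal sketch (the helper name require_mistral_key is ours, not part of LiteLLM):

```python
import os

def require_mistral_key() -> str:
    """Return the Mistral API key, raising a clear error if it is unset."""
    key = os.environ.get("MISTRAL_API_KEY")
    if not key:
        raise RuntimeError("Set MISTRAL_API_KEY before calling Mistral via LiteLLM")
    return key
```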

Step by step

Call litellm.completion() with the model string "mistral/mistral-large-latest" and your prompt wrapped in an OpenAI-style messages list. The example below sends a single user message and prints the model's reply.

python
import os
from litellm import completion

# LiteLLM reads the key from the environment; uncomment to set it in code
# os.environ["MISTRAL_API_KEY"] = "your-api-key"

# The "mistral/" prefix tells LiteLLM to route the call to the Mistral API
response = completion(
    model="mistral/mistral-large-latest",
    messages=[
        {"role": "user", "content": "Explain the benefits of using LiteLLM with Mistral models."}
    ],
)

print("Response:", response.choices[0].message.content)
output
Response: LiteLLM gives you a single OpenAI-compatible interface for Mistral models, so you can switch providers or model sizes without rewriting your application code. ...
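LiteLLM's routing is driven entirely by the provider prefix in the model string. A small helper (the function name mistral_model is ours, for illustration) makes the convention explicit:

```python
def mistral_model(name: str) -> str:
    """Prefix a Mistral model name with the provider tag LiteLLM expects."""
    return name if name.startswith("mistral/") else f"mistral/{name}"

# mistral_model("mistral-large-latest") -> "mistral/mistral-large-latest"
```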

Common variations

  • Streaming output: Pass stream=True to completion() and iterate over the returned chunks for token-by-token output.
  • Async usage: Call litellm.acompletion() with the same arguments inside async code.
  • Different Mistral models: Swap the model string for other checkpoints, e.g. "mistral/mistral-small-latest".
python
from litellm import completion

# Streaming generation: chunks arrive as the model produces them
stream = completion(
    model="mistral/mistral-small-latest",
    messages=[{"role": "user", "content": "Hello, world!"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
output
Hello! How can I help you today?
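When streaming, each chunk carries only a text delta, so the caller assembles the full reply. A sketch using a simulated list of deltas in place of a live stream:

```python
from typing import Iterable, Optional

def collect_stream(deltas: Iterable[Optional[str]]) -> str:
    """Join streamed text deltas (skipping empty/None chunks) into the full reply."""
    return "".join(d for d in deltas if d)

# Simulated deltas; with a real stream you would pass chunk.choices[0].delta.content
print(collect_stream(["Hel", "lo", None, ", world!"]))  # prints "Hello, world!"
```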

Troubleshooting

  • If you see AuthenticationError, verify that MISTRAL_API_KEY is set in the environment and that the key is valid.
  • For model-not-found errors, check that the model string includes the "mistral/" prefix and matches a model the Mistral API currently offers.
  • For RateLimitError, add backoff-and-retry logic around the call (litellm's completion() also accepts a num_retries argument).
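For transient failures such as rate limits, a simple exponential-backoff wrapper is often enough. This sketch is generic Python, not part of LiteLLM itself:

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 0.5):
    """Call fn(), retrying with exponential backoff; re-raise after the last attempt."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))

# Usage (hypothetical): with_retries(lambda: completion(model=..., messages=...))
```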

Key Takeaways

  • Use the litellm Python package to call Mistral models through one OpenAI-compatible completion() interface.
  • Prefix model names with "mistral/" and export MISTRAL_API_KEY before calling.
  • Pass stream=True for token-by-token streaming output.
  • Swap the model string to use different Mistral model sizes.
  • Check your API key and model string first when troubleshooting errors.
Verified 2026-04 · mistral-large-latest, mistral-small-latest