How-to · Beginner · 3 min read

How to use Mistral with LiteLLM

Quick answer
To use Mistral models with LiteLLM, install the litellm Python package, set your MISTRAL_API_KEY environment variable, and call litellm.completion() with a "mistral/"-prefixed model name and an OpenAI-style list of chat messages. LiteLLM translates this single, OpenAI-compatible interface into the request format the Mistral API expects, so the same code works unchanged across providers.

PREREQUISITES

  • Python 3.8+
  • pip install litellm
  • A Mistral API key, exported as the MISTRAL_API_KEY environment variable

Setup

Install the litellm package via pip and make your Mistral API key available as the MISTRAL_API_KEY environment variable. LiteLLM reads the key from the environment when it routes requests to the Mistral API.

bash
pip install litellm
export MISTRAL_API_KEY="your-api-key"
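Since a missing key is the most common setup failure, it can help to fail fast before making any calls. A minimal sketch (the helper name require_mistral_key is ours, not part of LiteLLM):

```python
import os

def require_mistral_key() -> str:
    """Return the Mistral API key, raising a clear error if it is unset."""
    key = os.environ.get("MISTRAL_API_KEY")
    if not key:
        raise RuntimeError("Set MISTRAL_API_KEY before calling Mistral via LiteLLM")
    return key
```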

Step by step

Call litellm.completion() with the model string "mistral/mistral-large-latest" and your prompt wrapped in an OpenAI-style messages list. The example below sends a single user message and prints the model's reply.

python
import os
from litellm import completion

# LiteLLM reads the key from the environment; uncomment to set it in code
# os.environ["MISTRAL_API_KEY"] = "your-api-key"

# The "mistral/" prefix tells LiteLLM to route the call to the Mistral API
response = completion(
    model="mistral/mistral-large-latest",
    messages=[
        {"role": "user", "content": "Explain the benefits of using LiteLLM with Mistral models."}
    ],
)

print("Response:", response.choices[0].message.content)
output
Response: LiteLLM gives you a single OpenAI-compatible interface for Mistral models, so you can switch providers or model sizes without rewriting your application code. ...
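LiteLLM's routing is driven entirely by the provider prefix in the model string. A small helper (the function name mistral_model is ours, for illustration) makes the convention explicit:

```python
def mistral_model(name: str) -> str:
    """Prefix a Mistral model name with the provider tag LiteLLM expects."""
    return name if name.startswith("mistral/") else f"mistral/{name}"

# mistral_model("mistral-large-latest") -> "mistral/mistral-large-latest"
```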

Common variations

  • Streaming output: Pass stream=True to completion() and iterate over the returned chunks for token-by-token output.
  • Async usage: Call litellm.acompletion() with the same arguments inside async code.
  • Different Mistral models: Swap the model string for other checkpoints, e.g. "mistral/mistral-small-latest".
python
from litellm import completion

# Streaming generation: chunks arrive as the model produces them
stream = completion(
    model="mistral/mistral-small-latest",
    messages=[{"role": "user", "content": "Hello, world!"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
output
Hello! How can I help you today?
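When streaming, each chunk carries only a text delta, so the caller assembles the full reply. A sketch using a simulated list of deltas in place of a live stream:

```python
from typing import Iterable, Optional

def collect_stream(deltas: Iterable[Optional[str]]) -> str:
    """Join streamed text deltas (skipping empty/None chunks) into the full reply."""
    return "".join(d for d in deltas if d)

# Simulated deltas; with a real stream you would pass chunk.choices[0].delta.content
print(collect_stream(["Hel", "lo", None, ", world!"]))  # prints "Hello, world!"
```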

Troubleshooting

  • If you see AuthenticationError, verify that MISTRAL_API_KEY is set in the environment and that the key is valid.
  • For model-not-found errors, check that the model string includes the "mistral/" prefix and matches a model the Mistral API currently offers.
  • For RateLimitError, add backoff-and-retry logic around the call (litellm's completion() also accepts a num_retries argument).
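For transient failures such as rate limits, a simple exponential-backoff wrapper is often enough. This sketch is generic Python, not part of LiteLLM itself:

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 0.5):
    """Call fn(), retrying with exponential backoff; re-raise after the last attempt."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))

# Usage (hypothetical): with_retries(lambda: completion(model=..., messages=...))
```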

Key Takeaways

  • Use the litellm Python package to call Mistral models through one OpenAI-compatible completion() interface.
  • Prefix model names with "mistral/" and export MISTRAL_API_KEY before calling.
  • Pass stream=True for token-by-token streaming output.
  • Swap the model string to use different Mistral model sizes.
  • Check your API key and model string first when troubleshooting errors.
Verified 2026-04 · mistral-large-latest, mistral-small-latest