How-to · Beginner · 3 min read

How to add models to LiteLLM proxy

Quick answer
To add models to the LiteLLM proxy, list them under `model_list` in the proxy's `config.yaml`, giving each entry a `model_name` alias and the `litellm_params` (provider model, API key or base URL) LiteLLM needs to reach it. Then restart the proxy server so the new models are served through its OpenAI-compatible API.

PREREQUISITES

  • Python 3.8+
  • LiteLLM with proxy extras installed (pip install 'litellm[proxy]')
  • Basic knowledge of YAML configuration files
  • API keys or endpoints for the model providers you want to route to

Setup LiteLLM proxy

Install LiteLLM with the proxy extras via pip and export any provider API keys as environment variables. The proxy runs locally and routes requests to multiple models defined in a single configuration file.

bash
pip install 'litellm[proxy]'
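
If your config references provider keys via `os.environ/<VAR>`, it helps to confirm they are exported before launching the proxy. A minimal sketch (the variable name `OPENAI_API_KEY` is just an example; check whichever variables your own config references):

```python
import os

def missing_keys(required=("OPENAI_API_KEY",)):
    # Return the names of any required provider keys not set in the environment.
    return [k for k in required if not os.environ.get(k)]

print(missing_keys())  # an empty list means you are ready to launch the proxy
```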

Step by step model addition

Edit the `config.yaml` file used by your LiteLLM proxy to add new models under `model_list`. Each entry needs a `model_name` (the alias clients will request) and `litellm_params` telling LiteLLM which provider model to call and how to reach it. After saving, restart the proxy to load the models.

yaml
config.yaml example:

model_list:
  - model_name: llama-3b
    litellm_params:
      model: ollama/llama3              # assumes a local Ollama server
      api_base: http://localhost:11434
  - model_name: gptq-4bit
    litellm_params:
      model: ollama/llama3:8b-instruct-q4_0   # any quantized model tag you have pulled
      api_base: http://localhost:11434

bash
# Start or restart the LiteLLM proxy (serves on port 4000 by default)
litellm --config config.yaml

# Python example to query the proxy
import requests

url = 'http://localhost:4000/v1/chat/completions'  # proxy default port
headers = {'Content-Type': 'application/json'}  # add an Authorization header if you set a master key
data = {
    'model': 'llama-3b',
    'messages': [{'role': 'user', 'content': 'Hello LiteLLM!'}]
}
response = requests.post(url, json=data, headers=headers)
print(response.json())
output
{"id": "chatcmpl-xxx", "choices": [{"message": {"role": "assistant", "content": "Hello LiteLLM! How can I assist you today?"}}]}
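
The response follows the OpenAI chat-completions shape, so extracting the assistant's text is a single dictionary walk. A small sketch (`first_reply` is a hypothetical helper name, not part of LiteLLM):

```python
def first_reply(response_json):
    # OpenAI-format responses nest the text under choices[0].message.content.
    return response_json["choices"][0]["message"]["content"]

sample = {"choices": [{"message": {"role": "assistant", "content": "Hello LiteLLM!"}}]}
print(first_reply(sample))  # prints: Hello LiteLLM!
```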

Common variations

You can route to models from different providers (OpenAI, Anthropic, Ollama, Hugging Face, and others) by changing the provider prefix in `litellm_params.model`. For asynchronous usage, query the proxy API with async HTTP clients like httpx. You can also set per-model options such as `api_base` or a default `temperature` in `litellm_params`.

python
import asyncio
import httpx

async def query_litellm():
    async with httpx.AsyncClient() as client:
        data = {
            'model': 'gptq-4bit',
            'messages': [{'role': 'user', 'content': 'Async query example'}]
        }
        response = await client.post('http://localhost:4000/v1/chat/completions', json=data)
        print(response.json())

asyncio.run(query_litellm())
output
{"id": "chatcmpl-yyy", "choices": [{"message": {"role": "assistant", "content": "This is an async response from LiteLLM proxy."}}]}

Troubleshooting

  • If the proxy fails to start, verify your config.yaml syntax and litellm_params values.
  • Check that the API keys and api_base endpoints your config references are set and reachable.
  • Restart the proxy with the --detailed_debug flag to inspect runtime errors.
  • Ensure no port conflicts on 4000 or your configured proxy port.
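
To confirm which model aliases the proxy actually loaded, you can hit its OpenAI-compatible /v1/models endpoint. A sketch, assuming the proxy is running on its default port 4000 (`extract_model_ids` and `list_proxy_models` are hypothetical helper names):

```python
import requests

def extract_model_ids(models_payload):
    # /v1/models returns OpenAI-shaped JSON: {"data": [{"id": "<alias>"}, ...]}
    return [m["id"] for m in models_payload.get("data", [])]

def list_proxy_models(base_url="http://localhost:4000"):
    resp = requests.get(f"{base_url}/v1/models", timeout=5)
    resp.raise_for_status()
    return extract_model_ids(resp.json())

# print(list_proxy_models())  # should include every alias from your config.yaml
```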

Key Takeaways

  • Add models by listing them under model_list in the proxy's config.yaml, each with a model_name alias and litellm_params.
  • Restart the LiteLLM proxy server after config changes to load new models.
  • Query models served by the LiteLLM proxy through its OpenAI-compatible HTTP API, locally or remotely.
  • Support for many providers and async querying adds flexibility.
  • Check logs and config syntax if the proxy fails to load models or start.
Verified 2026-04 · llama-3b, gptq-4bit