Custom LLM wrapper: any API
Why this matters
Not every LLM lives on OpenAI or Anthropic. If you're using a proprietary API, local model via custom inference server, or specialized domain model, you need to wrap it so llama-index can use it natively in indexes, queries, and agents without hacky workarounds.
Explanation
What it is: A custom LLM wrapper is a Python class that inherits from llama_index.core.llms.LLM, either directly or through the CustomLLM convenience base class, and translates llama-index's standard interface (like complete() and chat()) into calls to your target API. This lets you plug any LLM backend into llama-index's entire ecosystem (indexing, retrieval, agents) without modifying core llama-index code.
How it works mechanically: You subclass LLM, implement complete() (text in, text out) and chat() (message-based), translate llama-index's arguments into your API's request format, call the external API, and convert its reply into CompletionResponse and ChatResponse objects. LLM is an abstract base class that also declares streaming and async variants; subclassing CustomLLM instead supplies defaults for most of them, leaving the metadata property, complete(), and stream_complete() as the required pieces. The wrapper sits between llama-index and your API, converting requests and responses in both directions. Inside llama-index, you assign your custom LLM to Settings.llm, and every downstream component uses it automatically.
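The two-way translation described above can be sketched without llama-index installed. This is a minimal illustration, not the library's implementation: Message is a hypothetical stand-in for ChatMessage, build_chat_payload and parse_chat_response are made-up helper names, and the request and response shapes assume an OpenAI-style schema that your API may not share.

```python
from dataclasses import dataclass


# Hypothetical stand-in for llama-index's ChatMessage, so the sketch runs on its own.
@dataclass
class Message:
    role: str
    content: str


def build_chat_payload(messages, model, temperature=0.7, max_tokens=512):
    """Translate framework-style messages into a request body (assumed OpenAI-style schema)."""
    return {
        "model": model,
        "messages": [{"role": m.role, "content": m.content} for m in messages],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def parse_chat_response(data):
    """Translate the API's JSON body back into (text, finish_reason)."""
    try:
        choice = data["choices"][0]
        return choice["message"]["content"], choice.get("finish_reason")
    except (KeyError, IndexError, TypeError) as exc:
        # Fail with a readable error instead of a bare KeyError deep in a query.
        raise ValueError(f"Unexpected response shape: {data!r}") from exc


payload = build_chat_payload([Message("user", "Hello")], model="custom-model-v1")
text, reason = parse_chat_response(
    {"choices": [{"message": {"content": "Hi there"}, "finish_reason": "stop"}]}
)
print(payload["messages"])
print(text, reason)
```

Keeping the translation in small pure functions like these makes the wrapper easy to unit-test without a network connection.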
When to use it: Use custom LLM wrappers when you have a proprietary inference service, local quantized model via vLLM or text-generation-webui, domain-specific fine-tuned models, or rate-limited academic APIs. The investment pays off when that integration is used across multiple indexes, retrievers, or agents.
Analogy
A custom LLM wrapper is like writing an adapter for an unusual power plug. Your laptop charger (llama-index) expects a standard socket, but your hotel room in Japan has a different outlet (your API). The adapter translates between them so your laptop charges without you rewriting the charger.
Code
from typing import Any, Sequence

import httpx
from llama_index.core.llms import (
    ChatMessage,
    ChatResponse,
    CompletionResponse,
    CompletionResponseGen,
    CustomLLM,
    LLMMetadata,
)
from llama_index.core.llms.callbacks import llm_chat_callback, llm_completion_callback


class CustomAPILLM(CustomLLM):
    """Wrapper for a custom inference API.

    CustomLLM is llama-index's convenience subclass of LLM. It supplies default
    async and streaming-chat methods, so only metadata, complete(), and
    stream_complete() are required. chat() is overridden here so that chat
    requests go to the API's chat endpoint.
    """

    model: str = "custom-model-v1"
    api_base: str = "https://api.example.com"
    api_key: str = ""
    temperature: float = 0.7
    max_tokens: int = 512

    @property
    def metadata(self) -> LLMMetadata:
        """Return metadata about the LLM."""
        return LLMMetadata(
            model_name=self.model,
            context_window=4096,
            num_output=self.max_tokens,
        )

    def _post(self, path: str, payload: dict) -> dict:
        """POST a JSON payload to the API and return the decoded JSON body."""
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }
        with httpx.Client(timeout=30.0) as client:
            response = client.post(f"{self.api_base}{path}", json=payload, headers=headers)
            response.raise_for_status()
            return response.json()

    @llm_completion_callback()
    def complete(self, prompt: str, formatted: bool = False, **kwargs: Any) -> CompletionResponse:
        """Generate text completion from a single prompt."""
        payload = {
            "model": self.model,
            "prompt": prompt,
            "temperature": kwargs.get("temperature", self.temperature),
            "max_tokens": kwargs.get("max_tokens", self.max_tokens),
        }
        data = self._post("/v1/completions", payload)
        choice = data["choices"][0]
        return CompletionResponse(
            text=choice["text"],
            raw=data,
            additional_kwargs={"finish_reason": choice.get("finish_reason")},
        )

    @llm_completion_callback()
    def stream_complete(self, prompt: str, formatted: bool = False, **kwargs: Any) -> CompletionResponseGen:
        """Non-streaming fallback: yield the full completion as a single chunk."""
        response = self.complete(prompt, formatted=formatted, **kwargs)
        yield CompletionResponse(text=response.text, delta=response.text, raw=response.raw)

    @llm_chat_callback()
    def chat(self, messages: Sequence[ChatMessage], **kwargs: Any) -> ChatResponse:
        """Generate a chat response from a list of messages."""
        formatted_messages = [
            # role is a MessageRole enum; send its plain string value to the API
            {"role": message.role.value, "content": message.content}
            for message in messages
        ]
        payload = {
            "model": self.model,
            "messages": formatted_messages,
            "temperature": kwargs.get("temperature", self.temperature),
            "max_tokens": kwargs.get("max_tokens", self.max_tokens),
        }
        data = self._post("/v1/chat/completions", payload)
        choice = data["choices"][0]
        return ChatResponse(
            message=ChatMessage(role="assistant", content=choice["message"]["content"]),
            raw=data,
            additional_kwargs={"finish_reason": choice.get("finish_reason")},
        )
import os
import tempfile

from faiss import IndexFlatL2
from llama_index.core import Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.faiss import FaissVectorStore

llm = CustomAPILLM(
    api_base="https://api.example.com",
    api_key="your-api-key-here",
    model="custom-model-v1",
    temperature=0.5,
    max_tokens=256,
)
Settings.llm = llm
# Assumption: a local 384-dimensional embedding model, chosen to match IndexFlatL2(384) below.
# Without an explicit embed_model, llama-index falls back to OpenAI embeddings (1536 dimensions).
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

with tempfile.TemporaryDirectory() as tmpdir:
    sample_file = os.path.join(tmpdir, "sample.txt")
    with open(sample_file, "w") as f:
        f.write("Artificial intelligence is transforming how we work and live.")
    documents = SimpleDirectoryReader(tmpdir).load_data()

    faiss_index = IndexFlatL2(384)
    vector_store = FaissVectorStore(faiss_index=faiss_index)
    # The vector store is passed through a StorageContext, not as a direct keyword argument.
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
    )
    query_engine = index.as_query_engine()
    result = query_engine.query("What is artificial intelligence?")
    print("Query Result:")
    print(result)
Output: the line "Query Result:" followed by the text generated by your custom API. query() returns a Response object, and printing it shows the generated answer. Actual output depends on your API's response; the API endpoint and credentials are placeholders.
What just happened?
You defined a CustomAPILLM class on top of llama-index's LLM interface, implemented complete() and chat() methods that translate llama-index's request format into HTTP calls to your custom API, set it globally via Settings.llm, loaded sample documents, built a vector index, and ran a query through it. The generation step of the query went through your custom wrapper transparently; indexing and retrieval rely on the embedding model, not the LLM.
Common gotcha
Forgetting to implement the metadata property causes downstream errors in agents or multi-step chains that introspect LLM capabilities. Also, if you don't handle API timeouts or rate limits, llama-index will hang or crash. Always wrap API calls in try-except with sensible backoff and timeouts, especially in production retrieval loops.
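As a rough illustration of the backoff advice, here is a standard-library sketch. call_with_backoff is a hypothetical helper, not part of llama-index or httpx, and the simulated flaky_call stands in for a real HTTP request. In an httpx-based wrapper you would likely pass httpx.TimeoutException and httpx.ConnectError as retry_on.

```python
import random
import time


def call_with_backoff(fn, max_attempts=4, base_delay=0.5,
                      retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call fn(), retrying on transient errors with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * 2 ** (attempt - 1)  # 0.5s, 1s, 2s, ...
            sleep(delay + random.uniform(0, delay / 2))  # jitter avoids synchronized retries


# Simulated flaky API call: fails twice with a timeout, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


# sleep is stubbed out here so the demonstration finishes instantly.
result = call_with_backoff(flaky_call, sleep=lambda seconds: None)
print(result, attempts["count"])
```

Only errors listed in retry_on are retried; anything else, such as a 401 from bad credentials, propagates immediately, which is usually what you want for non-transient failures.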
Error recovery
httpx.ConnectError
KeyError on response parsing
AttributeError: 'dict' object has no attribute 'role'
401 Unauthorized
Experienced dev note
Custom LLM wrappers are tempting to use for quick integrations, but they're a high-maintenance pattern. Before writing one, ask: (1) Does this API have a public Python SDK? Use that instead. (2) Will this API change its endpoints or response format? Budget time for maintenance. (3) Is this a one-off integration or used in 5+ places in your codebase? Only custom-wrap if it's the latter. Also, implement retry logic and exponential backoff in your wrapper from day one: llama-index will call your LLM thousands of times in production, and without retries a single transient network error fails the whole query. Use tenacity or a similar retry library inside complete() and chat().
Check your understanding
Why does implementing both complete() and chat() matter, and what breaks if you only implement complete()?
Answer hint
A correct answer explains that llama-index components make different assumptions: some use complete() (text in, text out), others use chat() (message history with roles). If your wrapper subclasses LLM directly and omits one of them, Python refuses to instantiate the class, because both are abstract methods. If it builds on CustomLLM and omits chat(), chat requests fall back to flattening the messages into a single prompt for complete(), which discards the role structure that chat-tuned models rely on. Most modern agents and multi-turn retrieval flows are chat-based, so a wrapper without a proper chat() serves them poorly.
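To make the difference between the two interfaces concrete, here is a small sketch of what happens when a chat request is routed through a completion-only backend. The function names and the "role: content" prompt format are hypothetical, chosen for illustration; they are not llama-index's own implementation.

```python
def messages_to_prompt(messages):
    """Flatten (role, content) pairs into one prompt string (hypothetical format)."""
    lines = [f"{role}: {content}" for role, content in messages]
    lines.append("assistant:")  # cue the model to continue as the assistant
    return "\n".join(lines)


def complete(prompt):
    """Stand-in for a completion-only backend; it just reports what it received."""
    return f"[completion for {len(prompt)} chars]"


def chat_via_complete(messages):
    """Fallback chat: the backend sees one flat string, not structured messages."""
    return complete(messages_to_prompt(messages))


history = [("system", "Be brief."), ("user", "What is FAISS?")]
prompt = messages_to_prompt(history)
print(prompt)
print(chat_via_complete(history))
```

The backend can no longer tell a system instruction from user text except by reading the labels inside the string, which is why a native chat() implementation is preferable when the API offers a chat endpoint.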
Use llama_index.core.llms.LLM (not the deprecated llama_index.llms.base.LLM). The metadata property is required since 0.10.0; earlier versions did not enforce it. ChatMessage and ChatResponse objects have stable APIs in 0.12.x.