Code Advanced hard · 8 min

Custom LLM wrapper: any API

What you will learn

Build a custom LLM class to integrate any API into llama-index by inheriting from LLM and implementing required methods.

Why this matters

Not every LLM lives on OpenAI or Anthropic. If you're using a proprietary API, a local model behind a custom inference server, or a specialized domain model, you need to wrap it so llama-index can use it natively in indexes, queries, and agents without hacky workarounds.

Skip if: the model is already supported by llama-index (OpenAI, Anthropic, Ollama, HuggingFace Inference API, etc.), or you only need to make one isolated API call, in which case use the client library directly. Custom wrappers add maintenance burden; write one only when the integration is reused across your codebase.

Explanation

What it is: A custom LLM wrapper is a Python class that inherits from llama_index.core.llms.LLM and translates llama-index's standard interface (like complete() and chat()) into calls to your target API. This lets you plug any LLM backend into llama-index's entire ecosystem (indexing, retrieval, agents) without modifying core llama-index code.

How it works mechanically: You subclass LLM, implement complete() (text in, text out) and chat() (message-based), translate llama-index's arguments into your API's request format, and convert the API's responses into CompletionResponse and ChatResponse objects. The wrapper sits between llama-index and your API, converting requests and responses in both directions. LLM also declares streaming and async variants of these methods as abstract; the CustomLLM base class in llama_index.core.llms (itself an LLM subclass) supplies defaults for most of them, so you only have to write complete(), stream_complete(), and metadata. Once you assign your custom LLM to Settings.llm, every downstream component uses it by default.
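The two-way translation described above can be sketched without llama-index at all. In this minimal, hypothetical example, build_payload, parse_completion, and the Completion dataclass are illustrative names (not llama-index APIs), and the response is assumed to follow an OpenAI-style choices[0].text layout; your API's schema may differ.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Completion:
    """Stand-in for llama-index's CompletionResponse (hypothetical, for illustration)."""
    text: str
    raw: dict[str, Any] = field(default_factory=dict)


def build_payload(prompt: str, defaults: dict[str, Any], **overrides: Any) -> dict[str, Any]:
    """Outbound direction: merge wrapper defaults with per-call overrides."""
    payload = {"prompt": prompt, **defaults}
    # Per-call keyword arguments win over the defaults set on the wrapper.
    payload.update({k: v for k, v in overrides.items() if v is not None})
    return payload


def parse_completion(data: dict[str, Any]) -> Completion:
    """Inbound direction: pull the generated text out of the assumed schema."""
    return Completion(text=data["choices"][0]["text"], raw=data)


defaults = {"model": "custom-model-v1", "temperature": 0.7, "max_tokens": 512}
payload = build_payload("Hello", defaults, temperature=0.2)

# A hand-written response standing in for what the API would return.
fake_api_response = {"choices": [{"text": "Hi there", "finish_reason": "stop"}]}
result = parse_completion(fake_api_response)
print(payload["temperature"], result.text)
```

Keeping both directions in small pure functions like these makes the wrapper easy to unit-test without network access; complete() then only has to glue them to the HTTP call.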

When to use it: Use custom LLM wrappers when you have a proprietary inference service, a local quantized model served via vLLM or text-generation-webui, a domain-specific fine-tuned model, or a rate-limited academic API. The investment pays off when that integration is used across multiple indexes, retrievers, or agents.

Analogy

A custom LLM wrapper is like writing an adapter for an unusual power plug. Your laptop charger (llama-index) expects a standard socket, but your hotel room in Japan has a different outlet (your API). The adapter translates between them so your laptop charges without you rewriting the charger.

Code

Illustrative only - not runnable without a valid API key

python

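from typing import Any

import httpx
from llama_index.core.llms import (
    ChatMessage,
    ChatResponse,
    CompletionResponse,
    CompletionResponseGen,
    CustomLLM,
    LLMMetadata,
)
from llama_index.core.llms.callbacks import llm_completion_callback


# CustomLLM is an LLM subclass that supplies chat and async defaults. Subclassing
# LLM directly would also require every streaming and async method to be written,
# otherwise the class cannot be instantiated.
class CustomAPILLM(CustomLLM):
    """Wrapper for a custom inference API."""

    model: str = "custom-model-v1"
    api_base: str = "https://api.example.com"
    api_key: str = ""
    temperature: float = 0.7
    max_tokens: int = 512

    @property
    def metadata(self) -> LLMMetadata:
        """Return metadata about the LLM."""
        return LLMMetadata(
            model_name=self.model,
            context_window=4096,  # Set this to your model's real context length.
            num_output=self.max_tokens,
            is_chat_model=True,  # The API exposes a chat endpoint (see chat()).
        )

    @llm_completion_callback()
    def complete(
        self,
        prompt: str,
        formatted: bool = False,
        **kwargs: Any,
    ) -> CompletionResponse:
        """Generate text completion from a single prompt."""
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }
        payload = {
            "model": self.model,
            "prompt": prompt,
            "temperature": kwargs.get("temperature", self.temperature),
            "max_tokens": kwargs.get("max_tokens", self.max_tokens),
        }

        with httpx.Client() as client:
            response = client.post(
                f"{self.api_base}/v1/completions",
                json=payload,
                headers=headers,
                timeout=30.0,
            )
            response.raise_for_status()
            data = response.json()

        # Assumes an OpenAI-style response; adjust to your API's schema.
        text = data["choices"][0]["text"]
        return CompletionResponse(
            text=text,
            raw=data,
            additional_kwargs={"finish_reason": data["choices"][0].get("finish_reason")},
        )

    @llm_completion_callback()
    def stream_complete(
        self,
        prompt: str,
        formatted: bool = False,
        **kwargs: Any,
    ) -> CompletionResponseGen:
        """Minimal stream: yields the whole completion as a single chunk."""

        def gen() -> CompletionResponseGen:
            response = self.complete(prompt, formatted=formatted, **kwargs)
            yield CompletionResponse(
                text=response.text, delta=response.text, raw=response.raw
            )

        return gen()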

    def chat(
        self,
        messages: list[ChatMessage],
        **kwargs,
    ) -> ChatResponse:
        """Generate a chat response from a list of messages."""
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }

        # ChatMessage.role is a MessageRole enum; send its plain string value.
        formatted_messages = [
            {
                "role": message.role.value,
                "content": message.content,
            }
            for message in messages
        ]

        payload = {
            "model": self.model,
            "messages": formatted_messages,
            "temperature": kwargs.get("temperature", self.temperature),
            "max_tokens": kwargs.get("max_tokens", self.max_tokens),
        }

        with httpx.Client() as client:
            response = client.post(
                f"{self.api_base}/v1/chat/completions",
                json=payload,
                headers=headers,
                timeout=30.0,
            )
            response.raise_for_status()
            data = response.json()

        message_content = data["choices"][0]["message"]["content"]
        return ChatResponse(
            message=ChatMessage(role="assistant", content=message_content),
            raw=data,
            additional_kwargs={"finish_reason": data["choices"][0].get("finish_reason")},
        )


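# Extra packages assumed: llama-index-vector-stores-faiss, faiss-cpu,
# llama-index-embeddings-huggingface.
import os
import tempfile

from faiss import IndexFlatL2
from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.faiss import FaissVectorStore

llm = CustomAPILLM(
    api_base="https://api.example.com",
    api_key="your-api-key-here",
    model="custom-model-v1",
    temperature=0.5,
    max_tokens=256,
)

Settings.llm = llm
# The LLM wrapper does not produce embeddings. Without an explicit embed_model,
# llama-index falls back to OpenAI embeddings, which need their own API key.
# BAAI/bge-small-en-v1.5 emits 384-dimensional vectors, matching the FAISS index.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

with tempfile.TemporaryDirectory() as tmpdir:
    sample_file = os.path.join(tmpdir, "sample.txt")
    with open(sample_file, "w") as f:
        f.write("Artificial intelligence is transforming how we work and live.")

    documents = SimpleDirectoryReader(tmpdir).load_data()

    faiss_index = IndexFlatL2(384)  # Must equal the embedding dimension.
    vector_store = FaissVectorStore(faiss_index=faiss_index)
    # A vector store is attached through a StorageContext; a bare vector_store
    # keyword passed to from_documents is not picked up.
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
    )

    query_engine = index.as_query_engine()
    result = query_engine.query("What is artificial intelligence?")
    print("Query Result:")
    print(result)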

Output

Query Result:
(The text your custom API generated for this query. The query engine returns a response object, and print(result) displays its text.
Note: Actual output depends on your API's response. The API endpoint and credentials are placeholders.)

What just happened?

You defined a CustomAPILLM class that plugs into llama-index's LLM interface, implemented complete() and chat() methods that translate llama-index's request format into HTTP calls to your custom API, set it globally via Settings.llm, loaded sample documents, built a vector index, and ran a query through it. Indexing and retrieval are driven by the embedding model; the final answer-generation step called your custom wrapper transparently.

Common gotcha

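Leaving out the metadata property stops the class from being instantiated, because the property is abstract. Filling it with inaccurate values (for example, a context_window larger than the model really supports) causes subtler downstream errors in agents or multi-step chains that introspect LLM capabilities. Also, not handling API timeouts or rate limits means llama-index will hang or crash: always wrap API calls in try-except with sensible backoff and timeouts, especially in production retrieval loops.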
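The backoff advice above can be sketched with the standard library alone. This is a minimal illustration, not a hardened implementation: call_with_backoff is a hypothetical helper, the retried exception types are placeholders (with httpx you would more likely retry on its transport and timeout errors), and the delays are arbitrary defaults.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    retry_on: tuple[type[BaseException], ...] = (ConnectionError, TimeoutError),
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying transient failures with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the original error.
            # Delay doubles each attempt (base, 2*base, 4*base, ...) plus up to 10% jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")


# Demo: a fake request that fails twice, then succeeds. The sleeps are recorded
# in a list instead of actually waiting.
calls = {"count": 0}
delays: list[float] = []


def flaky_request() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


outcome = call_with_backoff(flaky_request, sleep=delays.append)
print(outcome, calls["count"], len(delays))
```

Inside the wrapper, the HTTP call in complete() or chat() would be passed as fn, so retries stay invisible to the rest of llama-index.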

Error recovery

httpx.ConnectError

Your API endpoint is unreachable or the domain is wrong. Fix: verify the api_base URL and network connectivity. Use curl or Postman to test the endpoint directly before running llama-index code.

KeyError on response parsing

Your API returns a different JSON structure than expected (e.g., 'output' instead of 'choices'). Fix: print the raw response with print(data) inside the method to see the actual structure, then adjust the parsing logic to match your API's schema.
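One way to make such schema mismatches easier to diagnose is a small extraction helper that reports what it actually received. The sketch below is hypothetical: extract_text is not a llama-index function, and the three layouts it checks (completion-style, chat-style, and a flat 'output' field) are assumptions about what an API might return.

```python
from typing import Any


def extract_text(data: dict[str, Any]) -> str:
    """Return the generated text from a few plausible response layouts (all assumed)."""
    choices = data.get("choices")
    if isinstance(choices, list) and choices and isinstance(choices[0], dict):
        first = choices[0]
        # Completion-style: {"choices": [{"text": "..."}]}
        if isinstance(first.get("text"), str):
            return first["text"]
        # Chat-style: {"choices": [{"message": {"content": "..."}}]}
        message = first.get("message")
        if isinstance(message, dict) and isinstance(message.get("content"), str):
            return message["content"]
    # Flat layout: {"output": "..."}
    if isinstance(data.get("output"), str):
        return data["output"]
    # Report the keys that are present instead of raising a bare KeyError.
    raise ValueError(f"Unrecognized response schema; top-level keys: {sorted(data)}")


print(extract_text({"choices": [{"text": "from completions"}]}))
print(extract_text({"choices": [{"message": {"content": "from chat"}}]}))
print(extract_text({"output": "from a flat field"}))
```

Once you know which layout your API really uses, it is usually cleaner to keep only that branch and let everything else raise.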

AttributeError: 'dict' object has no attribute 'role'

You're passing raw dicts to chat() instead of ChatMessage objects. Fix: build each message as a ChatMessage (for example, ChatMessage(role="user", content="...")) before calling chat(), so that the method can read the .role and .content attributes.

401 Unauthorized

API key is missing, expired, or malformed. Fix: verify the key is correct, confirm the Authorization header format matches your API's spec (Bearer token vs. API-Key header vs. query param), and test with curl: curl -H "Authorization: Bearer YOUR_KEY" https://api.example.com/v1/chat/completions.
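The three authentication styles mentioned above differ only in where the key travels. A minimal sketch, assuming a hypothetical build_auth helper; the header name X-API-Key and the query parameter name api_key are placeholders, so check your API's documentation for the real ones.

```python
from typing import Literal

AuthScheme = Literal["bearer", "api-key-header", "query-param"]


def build_auth(api_key: str, scheme: AuthScheme) -> tuple[dict[str, str], dict[str, str]]:
    """Return (headers, query_params) carrying the key for the chosen scheme."""
    if not api_key:
        # Fail early and locally rather than waiting for the server's 401.
        raise ValueError("api_key is empty")
    if scheme == "bearer":
        return {"Authorization": f"Bearer {api_key}"}, {}
    if scheme == "api-key-header":
        return {"X-API-Key": api_key}, {}  # Header name is an assumption.
    if scheme == "query-param":
        return {}, {"api_key": api_key}  # Parameter name is an assumption.
    raise ValueError(f"Unknown auth scheme: {scheme}")


headers, params = build_auth("secret-123", "bearer")
print(headers, params)
```

With httpx, the two dictionaries would be passed as the headers= and params= arguments of client.post().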

Experienced dev note

Custom LLM wrappers are tempting to use for quick integrations, but they're a high-maintenance pattern. Before writing one, ask: (1) Does this API have a public Python SDK? Use that instead. (2) Will this API change its endpoints or response format? Budget time for maintenance. (3) Is this a one-off integration or used in 5+ places in your codebase? Only custom-wrap if it's the latter. Also, implement retry logic and exponential backoff in your wrapper from day one: llama-index will call your LLM thousands of times in production, and transient network errors will silently kill queries. Use tenacity or a similar retry library inside complete() and chat().

Check your understanding

Why does implementing both complete() and chat() matter, and what breaks if you only implement complete()?

Answer hint

A correct answer explains that llama-index components make different assumptions: some call complete() (text in, text out), others call chat() (message history with roles). Both are abstract on LLM, so a direct subclass that implements only complete() cannot be instantiated at all. The CustomLLM base class works around this by deriving chat() from complete(), flattening the messages into a single prompt, which loses any native chat formatting your API supports. Agents and chat engines rely on chat(), so implement it natively whenever your API has a chat endpoint.
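The relationship between the two methods can be illustrated by a fallback that serves a chat request using only a completion function. This is a sketch under stated assumptions, not llama-index's actual implementation: Message is a stand-in for ChatMessage, and the "role: content" prompt layout is one arbitrary choice among many.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Message:
    """Stand-in for llama-index's ChatMessage (hypothetical, for illustration)."""
    role: str
    content: str


def messages_to_prompt(messages: list[Message]) -> str:
    """Flatten a chat history into one prompt string (the layout is an assumption)."""
    lines = [f"{m.role}: {m.content}" for m in messages]
    lines.append("assistant:")  # Cue the model to answer as the assistant.
    return "\n".join(lines)


def chat_via_complete(messages: list[Message], complete: Callable[[str], str]) -> Message:
    """Serve a chat request using only a text-completion function."""
    text = complete(messages_to_prompt(messages))
    return Message(role="assistant", content=text.strip())


history = [Message("system", "Be brief."), Message("user", "What is FAISS?")]
# A lambda stands in for the real completion call.
reply = chat_via_complete(history, complete=lambda prompt: " A vector search library. ")
print(reply.content)
```

The flattening step is where information is lost: role boundaries become plain text, which is why a native chat() is preferable when the API offers one.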

VERSION llama-index-core >= 0.10.0 requires inheriting from llama_index.core.llms.LLM (not deprecated llama_index.llms.base.LLM). The metadata property is required since 0.10.0; earlier versions did not enforce it. ChatMessage and ChatResponse objects have stable APIs in 0.12.x.

Now that you can wrap any LLM, learn how to use custom LLM wrappers in agents to give them multi-step reasoning and tool use.
