How to use llama.cpp with LangChain
Quick answer
Use the llama-cpp-python package to load a local GGUF model as a Llama instance. Then integrate it with LangChain by wrapping Llama in a custom LangChain LLM subclass, or use a community adapter, to run local inference seamlessly.

Prerequisites
- Python 3.8+
- pip install llama-cpp-python langchain
- A local GGUF model file downloaded (e.g., llama-3.1-8b.gguf)
Setup
Install the llama-cpp-python package for local llama.cpp bindings and langchain for chaining. Download a GGUF format llama model from Hugging Face or other sources.
- Install the packages:

pip install llama-cpp-python langchain

- Download a GGUF model file, e.g., llama-3.1-8b.gguf

Output:

Collecting llama-cpp-python
Collecting langchain
Successfully installed llama-cpp-python-0.1.0 langchain-0.0.200
Step by step
This example shows how to load a local llama.cpp GGUF model and create a LangChain-compatible LLM wrapper for chat completions.
from typing import Any, List, Optional

from llama_cpp import Llama
from langchain.llms.base import LLM


class LlamaCppLLM(LLM):
    """Minimal LangChain wrapper around a local llama.cpp model."""

    model_path: str
    client: Any = None  # declared as a field so the Pydantic-based LLM class allows assignment

    def __init__(self, **kwargs: Any) -> None:
        super().__init__(**kwargs)
        self.client = Llama(model_path=self.model_path)

    def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any) -> str:
        output = self.client(prompt, max_tokens=128, stop=stop)
        return output["choices"][0]["text"]

    @property
    def _identifying_params(self) -> dict:
        return {"model_path": self.model_path}

    @property
    def _llm_type(self) -> str:
        return "llama-cpp"


# Usage example
model_path = "./models/llama-3.1-8b.gguf"
llm = LlamaCppLLM(model_path=model_path)
prompt = "Translate English to French: 'Hello, how are you?'"
response = llm.invoke(prompt)  # invoke() replaces the deprecated llm(prompt) call
print("Response:", response)

Output:
Response: Bonjour, comment ça va ?
Common variations
You can use the llama_cpp.Llama client directly for chat-style completions with create_chat_completion. For streaming, llama-cpp-python yields incremental chunks when you pass stream=True, but wiring that into LangChain requires custom integration. You can also use different GGUF models by changing the model_path.
from llama_cpp import Llama
client = Llama(model_path="./models/llama-3.1-8b.gguf")
messages = [
{"role": "user", "content": "Write a short poem about AI."}
]
response = client.create_chat_completion(messages=messages, max_tokens=100)
print(response['choices'][0]['message']['content'])

Output:
AI whispers softly, In circuits and code it dreams, Future's bright beacon.
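The streaming case mentioned above works by consuming the generator that create_chat_completion returns when stream=True; each chunk carries an incremental "delta" piece of the message. The assemble_stream helper below is an illustrative sketch, not part of llama-cpp-python, and is demonstrated with synthetic chunks in the same shape the library yields (a real run would iterate over client.create_chat_completion(messages=..., stream=True) instead):

```python
from typing import Dict, Iterable, List


def assemble_stream(chunks: Iterable[Dict]) -> str:
    """Join the incremental 'delta' pieces yielded by a streamed chat completion.

    Each chunk is a dict like {"choices": [{"delta": {"content": "..."}}]};
    the first and last chunks may omit "content" (role marker / stop chunk).
    """
    pieces: List[str] = []
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        text = delta.get("content")
        if text:
            pieces.append(text)
    return "".join(pieces)


# Synthetic chunks standing in for a real streamed response:
fake_chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "AI whispers "}}]},
    {"choices": [{"delta": {"content": "softly."}}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]
print(assemble_stream(fake_chunks))  # -> AI whispers softly.
```

In a real integration you would forward each piece to a LangChain callback handler (or print it) as it arrives, rather than buffering the whole response.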
Troubleshooting
- If you get FileNotFoundError, verify that model_path points to a valid GGUF model file.
- For performance issues, ensure you have sufficient RAM, and consider the n_gpu_layers parameter of Llama to offload layers to the GPU.
- If you see ImportError, confirm that llama-cpp-python is installed correctly and your Python version is 3.8 or higher.
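A small pre-flight check catches the first issue before Llama is even constructed. check_model below is a hypothetical helper (not part of llama-cpp-python or LangChain) that validates the path and file extension up front:

```python
from pathlib import Path


def check_model(model_path: str) -> Path:
    """Validate a model path before handing it to Llama(model_path=...).

    Raises FileNotFoundError for a missing file, and warns on an
    unexpected extension, since llama.cpp loads GGUF-format models.
    """
    path = Path(model_path)
    if not path.is_file():
        raise FileNotFoundError(f"No model file at {path.resolve()}")
    if path.suffix != ".gguf":
        print(f"Warning: {path.name} does not end in .gguf; llama.cpp expects GGUF models")
    return path


# Usage: check_model("./models/llama-3.1-8b.gguf") before constructing Llama.
```

If the check passes, the same string can be passed straight to Llama or to the LlamaCppLLM wrapper.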
Key takeaways
- Use llama-cpp-python to run local GGUF llama models efficiently in Python.
- Wrap Llama in a LangChain LLM subclass for seamless integration.
- Adjust model_path to switch between different local llama.cpp models.
- Check the model file path and environment if you encounter loading errors.
- The core langchain package has no built-in llama.cpp wrapper, but the langchain-community package ships one (langchain_community.llms.LlamaCpp) if you prefer not to write your own.