How to use llama.cpp with LangChain
Quick answer
Use the llama-cpp-python package to load a local GGUF model as a Llama instance. Then integrate it with LangChain by wrapping Llama in a custom LangChain LLM subclass, or use a community adapter, to run local inference seamlessly.

Prerequisites
- Python 3.8+
- pip install llama-cpp-python langchain
- A local GGUF model file downloaded (e.g., llama-3.1-8b.gguf)
Setup
Install the llama-cpp-python package for local llama.cpp bindings and langchain for chaining. Download a GGUF format llama model from Hugging Face or other sources.
- Install the packages:

pip install llama-cpp-python langchain

- Download a GGUF model file, e.g., llama-3.1-8b.gguf

Output:

Collecting llama-cpp-python
Collecting langchain
Successfully installed llama-cpp-python-0.1.0 langchain-0.0.200
Step by step
This example shows how to load a local llama.cpp GGUF model and create a LangChain-compatible LLM wrapper for chat completions.
from typing import Any, List, Optional

from llama_cpp import Llama
from langchain.llms.base import LLM


class LlamaCppLLM(LLM):
    """Minimal LangChain wrapper around a local llama.cpp model."""

    model_path: str
    client: Any = None  # declared as a field so the Pydantic-based LLM class allows assignment

    def __init__(self, **kwargs: Any) -> None:
        super().__init__(**kwargs)
        self.client = Llama(model_path=self.model_path)

    def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any) -> str:
        output = self.client(prompt, max_tokens=128, stop=stop)
        return output["choices"][0]["text"]

    @property
    def _identifying_params(self) -> dict:
        return {"model_path": self.model_path}

    @property
    def _llm_type(self) -> str:
        return "llama-cpp"


# Usage example
model_path = "./models/llama-3.1-8b.gguf"
llm = LlamaCppLLM(model_path=model_path)
prompt = "Translate English to French: 'Hello, how are you?'"
response = llm.invoke(prompt)  # invoke() replaces the deprecated llm(prompt) call
print("Response:", response)

Output:
Response: Bonjour, comment ça va ?
Common variations
You can use the llama_cpp.Llama client directly for chat-style completions with create_chat_completion. For streaming, llama-cpp-python yields incremental chunks when you pass stream=True, but wiring that into LangChain requires custom integration. You can also use different GGUF models by changing the model_path.
from llama_cpp import Llama
client = Llama(model_path="./models/llama-3.1-8b.gguf")
messages = [
{"role": "user", "content": "Write a short poem about AI."}
]
response = client.create_chat_completion(messages=messages, max_tokens=100)
print(response['choices'][0]['message']['content'])

Output:
AI whispers softly, In circuits and code it dreams, Future's bright beacon.
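The streaming case mentioned above works by consuming the generator that create_chat_completion returns when stream=True; each chunk carries an incremental "delta" piece of the message. The assemble_stream helper below is an illustrative sketch, not part of llama-cpp-python, and is demonstrated with synthetic chunks in the same shape the library yields (a real run would iterate over client.create_chat_completion(messages=..., stream=True) instead):

```python
from typing import Dict, Iterable, List


def assemble_stream(chunks: Iterable[Dict]) -> str:
    """Join the incremental 'delta' pieces yielded by a streamed chat completion.

    Each chunk is a dict like {"choices": [{"delta": {"content": "..."}}]};
    the first and last chunks may omit "content" (role marker / stop chunk).
    """
    pieces: List[str] = []
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        text = delta.get("content")
        if text:
            pieces.append(text)
    return "".join(pieces)


# Synthetic chunks standing in for a real streamed response:
fake_chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "AI whispers "}}]},
    {"choices": [{"delta": {"content": "softly."}}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]
print(assemble_stream(fake_chunks))  # -> AI whispers softly.
```

In a real integration you would forward each piece to a LangChain callback handler (or print it) as it arrives, rather than buffering the whole response.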
Troubleshooting
- If you get FileNotFoundError, verify that model_path points to a valid GGUF model file.
- For performance issues, ensure you have sufficient RAM, and consider the n_gpu_layers parameter of Llama to offload layers to the GPU.
- If you see ImportError, confirm that llama-cpp-python is installed correctly and your Python version is 3.8 or higher.
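A small pre-flight check catches the first issue before Llama is even constructed. check_model below is a hypothetical helper (not part of llama-cpp-python or LangChain) that validates the path and file extension up front:

```python
from pathlib import Path


def check_model(model_path: str) -> Path:
    """Validate a model path before handing it to Llama(model_path=...).

    Raises FileNotFoundError for a missing file, and warns on an
    unexpected extension, since llama.cpp loads GGUF-format models.
    """
    path = Path(model_path)
    if not path.is_file():
        raise FileNotFoundError(f"No model file at {path.resolve()}")
    if path.suffix != ".gguf":
        print(f"Warning: {path.name} does not end in .gguf; llama.cpp expects GGUF models")
    return path


# Usage: check_model("./models/llama-3.1-8b.gguf") before constructing Llama.
```

If the check passes, the same string can be passed straight to Llama or to the LlamaCppLLM wrapper.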
Key takeaways
- Use llama-cpp-python to run local GGUF llama models efficiently in Python.
- Wrap Llama in a LangChain LLM subclass for seamless integration.
- Adjust model_path to switch between different local llama.cpp models.
- Check the model file path and environment if you encounter loading errors.
- The core langchain package has no built-in llama.cpp wrapper, but the langchain-community package ships one (langchain_community.llms.LlamaCpp) if you prefer not to write your own.