llama.cpp supported model architectures
Quick answer
The llama.cpp Python bindings (llama-cpp-python) support loading and running local GGUF quantized models based on Meta's LLaMA architecture, including llama-3.1-8b and llama-3.3-70b. They support both plain text completions and chat-style completions via create_chat_completion.

Prerequisites
- Python 3.8+
- pip install llama-cpp-python
- Download GGUF format LLaMA models
Setup
Install the llama-cpp-python package and download GGUF quantized LLaMA models from Hugging Face or other trusted sources. Ensure Python 3.8 or newer is installed.
pip install llama-cpp-python

Output:
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.1.0-cp38-cp38-manylinux1_x86_64.whl (10 MB)
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.1.0
Step by step
Load a GGUF LLaMA model and generate text completions or chat completions using the Python API.
from llama_cpp import Llama
import os
model_path = os.path.expanduser('~/models/llama-3.1-8b.Q4_K_M.gguf')
llm = Llama(model_path=model_path, n_ctx=2048, n_gpu_layers=10)
# Simple text generation
output = llm('Hello, llama.cpp! Tell me about supported models.', max_tokens=50)
print('Text generation output:', output['choices'][0]['text'])
# Chat completion example
messages = [
{"role": "user", "content": "What model architectures does llama.cpp support?"}
]
chat_output = llm.create_chat_completion(messages=messages)
print('Chat completion output:', chat_output['choices'][0]['message']['content'])

Output:
Text generation output: llama.cpp supports Meta's LLaMA architectures in GGUF format, including 7B, 13B, 30B, and 70B models.
Chat completion output: llama.cpp supports LLaMA model architectures in GGUF format, enabling local inference with quantized models such as llama-3.1-8b and llama-3.3-70b.
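Both response types use the OpenAI-style `choices` layout: text completions put the result under `choices[0]['text']`, while chat completions nest it under `choices[0]['message']['content']`. A small helper can normalize the two shapes (the `extract_text` name is ours for illustration, not part of llama-cpp-python):

```python
def extract_text(response: dict) -> str:
    """Return the generated text from either a text-completion or a
    chat-completion response dict (OpenAI-style 'choices' layout)."""
    choice = response['choices'][0]
    if 'message' in choice:
        # Chat completion: text lives under message.content
        return choice['message']['content']
    # Plain text completion: text lives directly under 'text'
    return choice['text']

# Usage with sample response shapes:
completion = {'choices': [{'text': 'hello from completion'}]}
chat = {'choices': [{'message': {'role': 'assistant',
                                 'content': 'hello from chat'}}]}
print(extract_text(completion))  # hello from completion
print(extract_text(chat))        # hello from chat
```

This keeps downstream code identical whether you call `llm(...)` or `llm.create_chat_completion(...)`.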
Common variations
You can adjust n_ctx for context length, n_gpu_layers for GPU acceleration, or use different GGUF models like llama-3.3-70b.Q8_0.gguf. Async usage is not supported natively, but blocking calls can be wrapped with asyncio.
from llama_cpp import Llama
import os
model_path = os.path.expanduser('~/models/llama-3.3-70b.Q8_0.gguf')
llm = Llama(model_path=model_path, n_ctx=4096, n_gpu_layers=20)
response = llm('Explain llama.cpp model support in detail.', max_tokens=100)
print(response['choices'][0]['text'])

Output:
llama.cpp supports GGUF quantized LLaMA models including 7B, 13B, 30B, and 70B variants. It allows flexible context sizes and GPU acceleration for faster inference.
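The asyncio wrapping mentioned above can be sketched as follows. Since Llama calls are blocking, running them in a worker thread keeps the event loop responsive; the stub function below stands in for a real `llm(prompt, ...)` call (any blocking callable works the same way). Note that asyncio.to_thread requires Python 3.9+.

```python
import asyncio

def blocking_generate(prompt: str) -> str:
    # Stand-in for a blocking llm(prompt, max_tokens=...) call.
    return f"response to: {prompt}"

async def generate_async(prompt: str) -> str:
    # Run the blocking call in a worker thread so the event
    # loop stays free to serve other coroutines (Python 3.9+).
    return await asyncio.to_thread(blocking_generate, prompt)

result = asyncio.run(generate_async('Explain llama.cpp model support.'))
print(result)  # response to: Explain llama.cpp model support.
```

On Python 3.8, `loop.run_in_executor(None, blocking_generate, prompt)` achieves the same effect.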
Troubleshooting
- If you get FileNotFoundError, verify the GGUF model path is correct.
- If inference is slow, increase n_gpu_layers or run on a machine with a compatible GPU.
- Ensure the model is in GGUF format; older formats like .bin are not supported.
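A quick pre-flight check can catch both the missing-file and wrong-format cases before loading: GGUF files begin with the 4-byte magic `b'GGUF'`. A minimal sketch (the `check_gguf` helper name is ours, not part of llama-cpp-python):

```python
import os

def check_gguf(path: str) -> None:
    """Raise a descriptive error if path is missing or not a GGUF file."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"model file not found: {path}")
    with open(path, 'rb') as f:
        magic = f.read(4)  # GGUF files start with the bytes b'GGUF'
    if magic != b'GGUF':
        raise ValueError(f"not a GGUF file (magic {magic!r}): {path}")

# Usage before constructing Llama:
# check_gguf(os.path.expanduser('~/models/llama-3.1-8b.Q4_K_M.gguf'))
```

Running this before constructing `Llama` turns an opaque load failure into a clear error message.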
Key Takeaways
- llama.cpp supports Meta LLaMA GGUF quantized models from 7B to 70B parameters.
- Use the Llama class with model_path pointing to a GGUF file for local inference.
- Supports both text generation and chat completions with flexible context and GPU acceleration.
- Ensure models are in GGUF format and paths are correct to avoid errors.