llama.cpp supported model architectures
Quick answer
The llama.cpp Python bindings (llama-cpp-python) support loading and running local GGUF quantized models based on Meta's LLaMA architecture, including llama-3.1-8b and llama-3.3-70b. They support both plain text completions and chat-style completions via create_chat_completion.

Prerequisites
- Python 3.8+
- pip install llama-cpp-python
- Download GGUF format LLaMA models
Setup
Install the llama-cpp-python package and download GGUF quantized LLaMA models from Hugging Face or other trusted sources. Ensure Python 3.8 or newer is installed.
pip install llama-cpp-python

Output:
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.1.0-cp38-cp38-manylinux1_x86_64.whl (10 MB)
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.1.0
Step by step
Load a GGUF LLaMA model and generate text completions or chat completions using the Python API.
from llama_cpp import Llama
import os
model_path = os.path.expanduser('~/models/llama-3.1-8b.Q4_K_M.gguf')
llm = Llama(model_path=model_path, n_ctx=2048, n_gpu_layers=10)
# Simple text generation
output = llm('Hello, llama.cpp! Tell me about supported models.', max_tokens=50)
print('Text generation output:', output['choices'][0]['text'])
# Chat completion example
messages = [
{"role": "user", "content": "What model architectures does llama.cpp support?"}
]
chat_output = llm.create_chat_completion(messages=messages)
print('Chat completion output:', chat_output['choices'][0]['message']['content'])

Output:
Text generation output: llama.cpp supports Meta's LLaMA architectures in GGUF format, including 7B, 13B, 30B, and 70B models.
Chat completion output: llama.cpp supports LLaMA model architectures in GGUF format, enabling local inference with quantized models such as llama-3.1-8b and llama-3.3-70b.
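Both response types use the OpenAI-style `choices` layout: text completions put the result under `choices[0]['text']`, while chat completions nest it under `choices[0]['message']['content']`. A small helper can normalize the two shapes (the `extract_text` name is ours for illustration, not part of llama-cpp-python):

```python
def extract_text(response: dict) -> str:
    """Return the generated text from either a text-completion or a
    chat-completion response dict (OpenAI-style 'choices' layout)."""
    choice = response['choices'][0]
    if 'message' in choice:
        # Chat completion: text lives under message.content
        return choice['message']['content']
    # Plain text completion: text lives directly under 'text'
    return choice['text']

# Usage with sample response shapes:
completion = {'choices': [{'text': 'hello from completion'}]}
chat = {'choices': [{'message': {'role': 'assistant',
                                 'content': 'hello from chat'}}]}
print(extract_text(completion))  # hello from completion
print(extract_text(chat))        # hello from chat
```

This keeps downstream code identical whether you call `llm(...)` or `llm.create_chat_completion(...)`.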
Common variations
You can adjust n_ctx for context length, n_gpu_layers for GPU acceleration, or use different GGUF models like llama-3.3-70b.Q8_0.gguf. Async usage is not supported natively, but blocking calls can be wrapped with asyncio.
from llama_cpp import Llama
import os
model_path = os.path.expanduser('~/models/llama-3.3-70b.Q8_0.gguf')
llm = Llama(model_path=model_path, n_ctx=4096, n_gpu_layers=20)
response = llm('Explain llama.cpp model support in detail.', max_tokens=100)
print(response['choices'][0]['text'])

Output:
llama.cpp supports GGUF quantized LLaMA models including 7B, 13B, 30B, and 70B variants. It allows flexible context sizes and GPU acceleration for faster inference.
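The asyncio wrapping mentioned above can be sketched as follows. Since Llama calls are blocking, running them in a worker thread keeps the event loop responsive; the stub function below stands in for a real `llm(prompt, ...)` call (any blocking callable works the same way). Note that asyncio.to_thread requires Python 3.9+.

```python
import asyncio

def blocking_generate(prompt: str) -> str:
    # Stand-in for a blocking llm(prompt, max_tokens=...) call.
    return f"response to: {prompt}"

async def generate_async(prompt: str) -> str:
    # Run the blocking call in a worker thread so the event
    # loop stays free to serve other coroutines (Python 3.9+).
    return await asyncio.to_thread(blocking_generate, prompt)

result = asyncio.run(generate_async('Explain llama.cpp model support.'))
print(result)  # response to: Explain llama.cpp model support.
```

On Python 3.8, `loop.run_in_executor(None, blocking_generate, prompt)` achieves the same effect.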
Troubleshooting
- If you get FileNotFoundError, verify the GGUF model path is correct.
- If inference is slow, increase n_gpu_layers or run on a machine with a compatible GPU.
- Ensure the model is in GGUF format; older formats like .bin are not supported.
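A quick pre-flight check can catch both the missing-file and wrong-format cases before loading: GGUF files begin with the 4-byte magic `b'GGUF'`. A minimal sketch (the `check_gguf` helper name is ours, not part of llama-cpp-python):

```python
import os

def check_gguf(path: str) -> None:
    """Raise a descriptive error if path is missing or not a GGUF file."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"model file not found: {path}")
    with open(path, 'rb') as f:
        magic = f.read(4)  # GGUF files start with the bytes b'GGUF'
    if magic != b'GGUF':
        raise ValueError(f"not a GGUF file (magic {magic!r}): {path}")

# Usage before constructing Llama:
# check_gguf(os.path.expanduser('~/models/llama-3.1-8b.Q4_K_M.gguf'))
```

Running this before constructing `Llama` turns an opaque load failure into a clear error message.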
Key Takeaways
- llama.cpp supports Meta LLaMA GGUF quantized models from 7B to 70B parameters.
- Use the Llama class with model_path pointing to a GGUF file for local inference.
- Supports both text generation and chat completions with flexible context and GPU acceleration.
- Ensure models are in GGUF format and paths are correct to avoid errors.