llama.cpp batch size tuning
Quick answer
To tune batch size in
llama.cpp, set the n_batch parameter when creating the Llama instance. n_batch controls how many prompt tokens are processed per evaluation step: larger values improve prompt-processing throughput but require more VRAM. Find the optimal size by testing against your hardware limits.
Prerequisites
- Python 3.8+
- pip install llama-cpp-python
- A GGUF model file for llama.cpp
Setup
Install the llama-cpp-python package and download a compatible GGUF model file for llama.cpp. Ensure your environment has sufficient VRAM to handle larger batch sizes.
pip install llama-cpp-python
output
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.1.0-cp38-cp38-manylinux1_x86_64.whl (10 MB)
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.1.0
Step by step
Set the n_batch parameter when initializing the Llama model. It controls how many tokens of the prompt are evaluated at once; larger values speed up prompt processing but consume more memory. Note that create_chat_completion handles one conversation per call, so multiple prompts are processed in a loop.
from llama_cpp import Llama
import os
# Path to your GGUF model file
model_path = "./models/llama-3.1-8b.Q4_K_M.gguf"
# Initialize Llama with a custom prompt-processing batch size
# (the default n_batch is 512 tokens)
llm = Llama(model_path=model_path, n_ctx=2048, n_batch=256)
# Prepare multiple prompts (create_chat_completion handles one
# conversation per call, so loop over them)
prompts = [
    "Hello, how are you?",
    "Explain batch size tuning.",
    "What is llama.cpp?",
]
# Generate and print a completion for each prompt
for i, prompt in enumerate(prompts):
    response = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}]
    )
    print(f"Response {i+1}: {response['choices'][0]['message']['content']}\n")
output
Response 1: I'm doing well, thank you! How can I assist you today?
Response 2: Batch size tuning in llama.cpp involves adjusting the n_batch parameter to optimize throughput and memory usage.
Response 3: llama.cpp is a lightweight C++ implementation for running LLaMA models efficiently on local hardware.
Common variations
You can tune n_batch to match your hardware. The default is 512 tokens per batch; on GPUs with limited VRAM, drop to 128 or 256, while CPUs and high-VRAM GPUs can often handle 1024 or more. Also adjust n_ctx for context length. Use create_chat_completion for chat-style prompts or call llm() directly for single-prompt generation.
from llama_cpp import Llama
# Smaller batch size for low VRAM
llm_small_batch = Llama(model_path=model_path, n_ctx=2048, n_batch=128)
# Larger batch size for high VRAM
llm_large_batch = Llama(model_path=model_path, n_ctx=2048, n_batch=1024)
# Single prompt generation
output = llm_small_batch("Explain batch size in llama.cpp", max_tokens=50)
print(output["choices"][0]["text"])
output
Batch size in llama.cpp controls how many tokens are processed simultaneously, improving throughput but increasing memory usage.
Troubleshooting
- If you get CUDA out of memory errors, reduce n_batch or n_ctx.
- If inference is slow, try increasing n_batch to better utilize GPU parallelism.
- Ensure your GGUF model file matches your llama-cpp-python version.
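One pragmatic way to handle the out-of-memory case is to retry model initialization with progressively smaller batch sizes until one fits. The sketch below is illustrative: init_with_fallback and its batch-size ladder are hypothetical names, not part of llama-cpp-python.

```python
def init_with_fallback(make_llm, batch_sizes=(512, 256, 128, 64)):
    """Try each n_batch value in order, returning the first model that
    loads successfully along with the n_batch value that worked."""
    for n_batch in batch_sizes:
        try:
            return make_llm(n_batch), n_batch
        except Exception:
            # llama-cpp-python raises if the model fails to load/allocate
            continue
    raise RuntimeError("No n_batch setting fit in available memory")

# Usage sketch (assumes llama-cpp-python and a local GGUF model):
# from llama_cpp import Llama
# llm, n_batch = init_with_fallback(
#     lambda nb: Llama(model_path=model_path, n_ctx=2048, n_batch=nb)
# )
```

Because model loading is slow, this is best done once at startup rather than per request.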
Key Takeaways
- Set n_batch in Llama to control the prompt-processing batch size.
- Larger batch sizes improve throughput but require more VRAM.
- Test batch sizes incrementally to find the optimal balance for your hardware.
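The incremental testing above can be sketched as a sweep over a ladder of candidate batch sizes, timing prompt processing at each one. candidate_batch_sizes is an illustrative helper, and the commented-out timing loop assumes llama-cpp-python plus a local GGUF model path:

```python
import time

def candidate_batch_sizes(start=64, cap=1024):
    """Doubling ladder of n_batch values to try: 64, 128, ..., cap."""
    sizes = []
    n = start
    while n <= cap:
        sizes.append(n)
        n *= 2
    return sizes

# Timing sweep sketch (assumes llama-cpp-python and a local GGUF model):
# from llama_cpp import Llama
# long_prompt = "Benchmark prompt. " * 100  # make prompt processing dominate
# for n_batch in candidate_batch_sizes():
#     llm = Llama(model_path=model_path, n_ctx=2048,
#                 n_batch=n_batch, verbose=False)
#     t0 = time.perf_counter()
#     llm(long_prompt, max_tokens=1)  # time mostly reflects prompt eval
#     print(f"n_batch={n_batch}: {time.perf_counter() - t0:.2f}s")
```

Pick the smallest n_batch beyond which the timings stop improving; larger values only cost memory.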