How-to · Intermediate · 3 min read

Fix slow Llama inference

Quick answer
Speed up Llama inference by loading the model in 4-bit precision with BitsAndBytesConfig and setting device_map="auto" so weights are placed on the GPU automatically. Batching inputs and using the fast tokenizer further reduce per-call overhead and improve throughput.
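To see why quantization helps, a rough back-of-envelope for an 8B-parameter model (weights only; activations, KV cache, and quantization block overhead are ignored):

```python
# Rough weight-memory estimate for an 8B-parameter model.
# Ignores activations, KV cache, and quantization block overhead.
params = 8_000_000_000

def weight_gb(bits_per_param: float) -> float:
    """Approximate weight memory in GiB at a given precision."""
    return params * bits_per_param / 8 / 1024**3

print(f"fp16 : {weight_gb(16):.1f} GiB")  # ~14.9 GiB
print(f"int8 : {weight_gb(8):.1f} GiB")   # ~7.5 GiB
print(f"4-bit: {weight_gb(4):.1f} GiB")   # ~3.7 GiB
```

At 4 bits the weights fit comfortably on a single consumer GPU, which is the main reason the quantized load below works where a full-precision load would not.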

PREREQUISITES

  • Python 3.8+
  • pip install transformers>=4.30.0
  • pip install bitsandbytes
  • CUDA-enabled GPU recommended

Setup

Install the required packages: transformers for model loading and bitsandbytes for 4-bit quantization support. A CUDA-enabled GPU is strongly recommended, since bitsandbytes quantization targets CUDA.

bash
pip install transformers bitsandbytes
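Before loading the model, a quick sanity check that the optional pieces are importable can save a confusing stack trace later. A minimal sketch using only the standard library plus the packages installed above:

```python
import importlib.util

def check_env() -> dict:
    """Report which acceleration prerequisites are present."""
    report = {"bitsandbytes": importlib.util.find_spec("bitsandbytes") is not None}
    if importlib.util.find_spec("torch") is not None:
        import torch
        report["cuda"] = torch.cuda.is_available()
    else:
        report["cuda"] = False
    return report

print(check_env())  # e.g. {'bitsandbytes': True, 'cuda': True} on a working GPU box
```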

Step by step

Use BitsAndBytesConfig to load the Llama model in 4-bit precision and enable automatic device mapping to utilize GPU efficiently. Batch your inputs to reduce overhead and speed up inference.

python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Configure 4-bit quantization
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

# Load tokenizer and model with quantization and device map
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto"
)

# Llama tokenizers ship without a pad token, so set one before batching
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # left-pad so generation continues from the prompt

# Prepare batch inputs
texts = ["Hello, how are you?", "What is the capital of France?"]
inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)

# Generate outputs
outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.pad_token_id)

# Decode and print results
for i, output in enumerate(outputs):
    print(f"Input: {texts[i]}")
    print(f"Output: {tokenizer.decode(output, skip_special_tokens=True)}\n")
output
Input: Hello, how are you?
Output: Hello, how are you? I am doing well, thank you!

Input: What is the capital of France?
Output: What is the capital of France? The capital of France is Paris.

Common variations

  • Use load_in_8bit=True instead of 4-bit if 4-bit quantization causes instability.
  • For CPU-only environments, skip quantization and device_map="auto"; prefer torch_dtype=torch.float32 (or bfloat16 on CPUs that support it), since float16 is poorly supported on CPU.
  • Use a hosted OpenAI-compatible API that serves Llama models, such as Groq, when local inference remains too slow.
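The 8-bit variation only changes the quantization config; the rest of the loading code stays the same. A sketch, assuming the same transformers and bitsandbytes versions installed above:

```python
from transformers import BitsAndBytesConfig

# 8-bit quantization: roughly double the memory of 4-bit,
# but often more numerically stable.
quant_config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# Passed exactly like the 4-bit config:
# model = AutoModelForCausalLM.from_pretrained(
#     model_name, quantization_config=quant_config_8bit, device_map="auto"
# )
```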

Troubleshooting

  • If inference is still slow, verify your GPU drivers and CUDA installation are up to date.
  • Check that bitsandbytes is installed correctly; reinstall if you encounter import errors.
  • Reduce batch size if you run out of GPU memory.
  • Use torch.cuda.empty_cache() between runs to free GPU memory.
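For the batch-size reduction above, a tiny helper that splits prompts into fixed-size chunks makes it easy to retry with a smaller batch. A plain-Python sketch; batch_size is whatever fits your GPU:

```python
def batched(items, batch_size):
    """Yield successive fixed-size chunks; the last chunk may be smaller."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

prompts = ["a", "b", "c", "d", "e"]
for chunk in batched(prompts, 2):
    # tokenize and generate on each smaller chunk here
    print(chunk)  # ['a', 'b'] then ['c', 'd'] then ['e']
```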

Key Takeaways

  • Use 4-bit quantization with BitsAndBytesConfig to speed up Llama inference.
  • Enable device_map="auto" to leverage GPU acceleration automatically.
  • Batch inputs to reduce overhead and improve throughput.
  • Ensure CUDA and GPU drivers are properly installed and updated.
  • Consider cloud APIs for Llama models if local inference remains slow.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct