How-to · Intermediate · 3 min read

How to run CodeLlama locally

Quick answer
Run CodeLlama locally by installing transformers, accelerate, and bitsandbytes for quantized inference, then load the model with AutoModelForCausalLM and AutoTokenizer. Use BitsAndBytesConfig to enable 4-bit or 8-bit quantization so the model fits in local GPU memory.

PREREQUISITES

  • Python 3.8+
  • pip install transformers>=4.30.0 accelerate bitsandbytes torch
  • A CUDA-capable GPU (required by bitsandbytes quantization, and strongly recommended for speed)

Setup

Install the required Python packages to run CodeLlama locally: transformers for model loading, bitsandbytes for quantization, accelerate for automatic device placement (device_map="auto"), and torch for tensor operations. A CUDA-enabled GPU is recommended for faster inference.

bash
pip install transformers accelerate bitsandbytes torch

Step by step

Load and run CodeLlama locally with quantization for efficient memory use. The example below loads the 7B instruct model in 4-bit mode and generates a simple completion.

python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig
import torch

# Configure 4-bit quantization
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16
)

# Load tokenizer and model
model_name = "codellama/CodeLlama-7b-Instruct-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto"
)

# Prepare prompt
prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate output
outputs = model.generate(**inputs, max_new_tokens=50)

# Decode and print
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

output (example; exact text may vary)
def fibonacci(n):
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)
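The example above feeds the instruct model a raw completion prompt, which works for code continuation. For instruction-style requests ("write a function that…"), wrapping the request in the [INST] markers the instruct variants were trained on tends to help. A minimal sketch of that wrapper (the format follows the Llama-2-family convention; verify the exact template against the model card, since the tokenizer adds the BOS token itself):

```python
def format_instruct_prompt(instruction: str) -> str:
    """Wrap a user instruction in the [INST] ... [/INST] markers
    used by Llama-2-family instruct models (BOS is added by the tokenizer)."""
    return f"[INST] {instruction.strip()} [/INST]"

prompt = format_instruct_prompt("Write a Python function that returns the n-th Fibonacci number.")
# Pass `prompt` to tokenizer(...) exactly as in the example above.
print(prompt)
```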

Common variations

  • Use 8-bit quantization by setting load_in_8bit=True (instead of load_in_4bit) in BitsAndBytesConfig; it uses more memory than 4-bit but can preserve slightly more output quality.
  • Run inference on CPU by removing both quantization_config and device_map="auto" (bitsandbytes quantization requires a CUDA GPU), but expect much slower generation.
  • Use larger CodeLlama models like 13B or 34B with sufficient GPU memory.
  • For streaming output, pass a transformers TextStreamer (or TextIteratorStreamer, for async use) to model.generate via the streamer argument.

Troubleshooting

  • If you get CUDA out-of-memory errors, lower max_new_tokens, use 4-bit rather than 8-bit quantization, or try a smaller model.
  • Ensure your GPU drivers and CUDA toolkit are up to date for compatibility.
  • If bitsandbytes fails to install or import, check that you are on a supported platform (CUDA on Linux is best supported) and that pip is up to date; building from source is a fallback.
  • For tokenizer or download errors, verify the model id is correct and that you are online; some Llama-family repos also require accepting a license on Hugging Face and logging in with huggingface-cli login.
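Several of the failures above come down to missing or mismatched packages. A stdlib-only check like the following can narrow things down before touching any model code (the package list is this guide's assumed stack):

```python
import importlib.util
import sys

def check_environment() -> dict:
    """Report the Python version and whether each required package is importable."""
    report = {"python": sys.version.split()[0]}
    for pkg in ("torch", "transformers", "bitsandbytes", "accelerate"):
        report[pkg] = importlib.util.find_spec(pkg) is not None
    return report

report = check_environment()
print(report)

# If torch is present, also report whether a CUDA device is visible.
if report["torch"]:
    import torch
    print("CUDA available:", torch.cuda.is_available())
```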

Key Takeaways

  • Use transformers and bitsandbytes to run CodeLlama locally with quantization.
  • 4-bit quantization offers significant memory savings with minimal quality loss.
  • A CUDA-enabled GPU is recommended for practical local inference speed.
  • Adjust quantization and device settings based on your hardware capabilities.
Verified 2026-04 · codellama/CodeLlama-7b-Instruct-hf