How to run Llama with Modal
Quick answer
Use Modal to deploy Llama models with GPU support by defining an @app.function that contains the Llama inference code. Invoke the function remotely with run_inference.remote() to scale inference on Modal's serverless infrastructure.
Prerequisites
- Python 3.8+
- pip install modal llama-cpp-python
- A Llama GGUF model file downloaded locally
- A Modal account with the CLI configured
Setup
Install the required packages and set up your Modal environment. You need modal for serverless deployment and llama-cpp-python to load and run GGUF Llama models.
Download a Llama GGUF model file (e.g., llama-3.1-8b.Q4_K_M.gguf) from Hugging Face or other sources.
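If you want to script the download step, the huggingface_hub client provides hf_hub_download; below is a minimal sketch (the repo id and filename in the comment are illustrative examples, not values this guide prescribes):

```python
def download_gguf(repo_id: str, filename: str, local_dir: str = "./models") -> str:
    """Fetch a single GGUF file from a Hugging Face repo and return its local path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)

# Example (substitute the repo and filename of the model you want):
# download_gguf("TheBloke/Llama-2-7B-GGUF", "llama-2-7b.Q4_K_M.gguf")
```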
pip install modal llama-cpp-python

Output:
Collecting modal
Collecting llama-cpp-python
Successfully installed modal-1.x llama-cpp-python-0.x
Step by step
Create a Modal app that runs Llama inference on a GPU-enabled function. The example below loads the Llama model and runs a simple prompt, returning the generated text.
import modal

app = modal.App("llama-modal-app")

# Container image with llama-cpp-python installed. The GGUF file must also
# be available inside the container (e.g. baked into the image or attached
# via a modal.Volume).
image = modal.Image.debian_slim().pip_install("llama-cpp-python")

@app.function(gpu="A10G", image=image)
def run_inference(prompt: str) -> str:
    # Import inside the function so it resolves in the container image.
    from llama_cpp import Llama

    llm = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf", n_ctx=2048, n_gpu_layers=30)
    output = llm.create_completion(prompt=prompt)
    return output["choices"][0]["text"]

if __name__ == "__main__":
    with app.run():
        result = run_inference.remote("Hello, how are you?")
        print("Llama response:", result)

Output:
Llama response: Hello! I'm doing well, thank you. How can I assist you today?
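create_completion returns an OpenAI-style dict; a minimal sketch of its shape and of the extraction used above, with a mocked response since no model is loaded here:

```python
# Mocked response illustrating the dict shape llama-cpp-python returns
# (fields abbreviated; real responses also carry id, model, usage, etc.).
mock_output = {
    "choices": [{"text": " Hello! How can I assist you today?", "finish_reason": "stop"}],
}

def extract_text(output: dict) -> str:
    # Same access pattern as in the Modal function above.
    return output["choices"][0]["text"]

print(extract_text(mock_output))  # -> " Hello! How can I assist you today?"
```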
Common variations
- Use a different GPU type by changing gpu="A10G" to another GPU Modal supports (e.g. "T4", "A100", "H100").
- Run inference asynchronously by defining the function as async def and awaiting calls with run_inference.remote.aio().
- Change the model path to use other GGUF models.
- Tune n_ctx and n_gpu_layers for performance.
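For the async variation, the fan-out pattern looks like the sketch below; a stand-in coroutine replaces the real `await run_inference.remote.aio(prompt)` call, since invoking Modal requires a configured account:

```python
import asyncio

async def fake_remote(prompt: str) -> str:
    # Stand-in for `await run_inference.remote.aio(prompt)`.
    await asyncio.sleep(0)
    return f"echo: {prompt}"

async def main() -> list:
    prompts = ["Hello", "What is Modal?", "Summarize GGUF."]
    # Launch all prompts concurrently and collect results in order.
    return await asyncio.gather(*(fake_remote(p) for p in prompts))

results = asyncio.run(main())
print(results)
```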
Troubleshooting
- If you see FileNotFoundError, verify the model path is correct and the GGUF file is actually present inside the container, not just on your local machine.
- For GPU allocation errors, ensure your Modal account has GPU quota and the specified GPU type is available.
- If inference is slow, increase n_gpu_layers so more layers are offloaded to the GPU, reduce n_ctx, or use a more powerful GPU. Also check that llama-cpp-python was built with CUDA support; the default PyPI wheel is CPU-only.
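To make the FileNotFoundError case fail fast with an actionable message, a small preflight check before constructing Llama can help (a sketch; the helper name is ours, not part of either library):

```python
from pathlib import Path

def check_model_path(model_path: str) -> Path:
    # Fail early with a clear message instead of a deep llama.cpp error.
    p = Path(model_path)
    if not p.is_file():
        raise FileNotFoundError(
            f"GGUF model not found at {p.resolve()}; download it and make sure "
            "it is included in the Modal image or an attached volume."
        )
    return p
```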
Key Takeaways
- Use @app.function with the gpu and image parameters to run Llama on GPUs.
- Load GGUF models with llama_cpp.Llama inside the Modal function for inference.
- Deploy and invoke the function remotely with run_inference.remote() for scalable serverless inference.
- Adjust model parameters and GPU types to balance performance and cost.
- Ensure your Modal environment has access to the model files and GPU quota.