How to run Llama with Modal
Quick answer
Use Modal to deploy Llama models with GPU support by defining an @app.function that contains the Llama inference code. Invoke the function remotely with run_inference.remote() to scale inference on Modal's serverless infrastructure.
Prerequisites
- Python 3.8+
- pip install modal llama-cpp-python
- A Llama GGUF model file downloaded locally
- A Modal account with the CLI configured
Setup
Install the required packages and set up your Modal environment. You need modal for serverless deployment and llama-cpp-python to load and run GGUF Llama models.
Download a Llama GGUF model file (e.g., llama-3.1-8b.Q4_K_M.gguf) from Hugging Face or other sources.
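If you want to script the download step, the huggingface_hub client provides hf_hub_download; below is a minimal sketch (the repo id and filename in the comment are illustrative examples, not values this guide prescribes):

```python
def download_gguf(repo_id: str, filename: str, local_dir: str = "./models") -> str:
    """Fetch a single GGUF file from a Hugging Face repo and return its local path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)

# Example (substitute the repo and filename of the model you want):
# download_gguf("TheBloke/Llama-2-7B-GGUF", "llama-2-7b.Q4_K_M.gguf")
```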
pip install modal llama-cpp-python

Output:
Collecting modal
Collecting llama-cpp-python
Successfully installed modal-1.x llama-cpp-python-0.x
Step by step
Create a Modal app that runs Llama inference on a GPU-enabled function. The example below loads the Llama model and runs a simple prompt, returning the generated text.
import modal

app = modal.App("llama-modal-app")

# Container image with llama-cpp-python installed. The GGUF file must also
# be available inside the container (e.g. baked into the image or attached
# via a modal.Volume).
image = modal.Image.debian_slim().pip_install("llama-cpp-python")

@app.function(gpu="A10G", image=image)
def run_inference(prompt: str) -> str:
    # Import inside the function so it resolves in the container image.
    from llama_cpp import Llama

    llm = Llama(model_path="./models/llama-3.1-8b.Q4_K_M.gguf", n_ctx=2048, n_gpu_layers=30)
    output = llm.create_completion(prompt=prompt)
    return output["choices"][0]["text"]

if __name__ == "__main__":
    with app.run():
        result = run_inference.remote("Hello, how are you?")
        print("Llama response:", result)

Output:
Llama response: Hello! I'm doing well, thank you. How can I assist you today?
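create_completion returns an OpenAI-style dict; a minimal sketch of its shape and of the extraction used above, with a mocked response since no model is loaded here:

```python
# Mocked response illustrating the dict shape llama-cpp-python returns
# (fields abbreviated; real responses also carry id, model, usage, etc.).
mock_output = {
    "choices": [{"text": " Hello! How can I assist you today?", "finish_reason": "stop"}],
}

def extract_text(output: dict) -> str:
    # Same access pattern as in the Modal function above.
    return output["choices"][0]["text"]

print(extract_text(mock_output))  # -> " Hello! How can I assist you today?"
```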
Common variations
- Use a different GPU type by changing gpu="A10G" to another GPU Modal supports (e.g. "T4", "A100", "H100").
- Run inference asynchronously by defining the function as async def and awaiting calls with run_inference.remote.aio().
- Change the model path to use other GGUF models.
- Tune n_ctx and n_gpu_layers for performance.
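For the async variation, the fan-out pattern looks like the sketch below; a stand-in coroutine replaces the real `await run_inference.remote.aio(prompt)` call, since invoking Modal requires a configured account:

```python
import asyncio

async def fake_remote(prompt: str) -> str:
    # Stand-in for `await run_inference.remote.aio(prompt)`.
    await asyncio.sleep(0)
    return f"echo: {prompt}"

async def main() -> list:
    prompts = ["Hello", "What is Modal?", "Summarize GGUF."]
    # Launch all prompts concurrently and collect results in order.
    return await asyncio.gather(*(fake_remote(p) for p in prompts))

results = asyncio.run(main())
print(results)
```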
Troubleshooting
- If you see FileNotFoundError, verify the model path is correct and the GGUF file is actually present inside the container, not just on your local machine.
- For GPU allocation errors, ensure your Modal account has GPU quota and the specified GPU type is available.
- If inference is slow, increase n_gpu_layers so more layers are offloaded to the GPU, reduce n_ctx, or use a more powerful GPU. Also check that llama-cpp-python was built with CUDA support; the default PyPI wheel is CPU-only.
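To make the FileNotFoundError case fail fast with an actionable message, a small preflight check before constructing Llama can help (a sketch; the helper name is ours, not part of either library):

```python
from pathlib import Path

def check_model_path(model_path: str) -> Path:
    # Fail early with a clear message instead of a deep llama.cpp error.
    p = Path(model_path)
    if not p.is_file():
        raise FileNotFoundError(
            f"GGUF model not found at {p.resolve()}; download it and make sure "
            "it is included in the Modal image or an attached volume."
        )
    return p
```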
Key Takeaways
- Use @app.function with the gpu and image parameters to run Llama on GPUs.
- Load GGUF models with llama_cpp.Llama inside the Modal function for inference.
- Deploy and invoke the function remotely with run_inference.remote() for scalable serverless inference.
- Adjust model parameters and GPU types to balance performance and cost.
- Ensure your Modal environment has access to the model files and GPU quota.