Model as singleton: load once
Why this matters
In production APIs or batch processing, reloading a model from disk on every call can waste 3-8 seconds per request and multiply infrastructure costs, by as much as 10x in the worst case. A singleton pattern keeps the model in GPU/CPU memory across requests, cutting the per-request overhead from seconds to milliseconds.
Explanation
A singleton is a pattern in which a resource (in this case, a transformer model) is loaded exactly once and reused across all subsequent calls. The key is that the model object persists in memory between inference requests rather than being re-read from disk each time.
The mechanics: Python executes a module's body only once per process, on first import; later imports reuse the cached module. If you load your model at module level (outside any function), it stays resident in memory. When a request handler or function calls the model, it uses the same in-memory object every time. The tokenizer is cached the same way: it is usually much cheaper to load than the model, but reloading it on every request is still wasted work. Device placement (device_map='auto', which requires the accelerate package) puts the weights on the GPU when one is available, and because the same object is reused, they stay there, avoiding repeated CPU-to-GPU transfers.
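The "executed once" behaviour described above can be checked without loading a real model. The sketch below writes a small throwaway module to a temporary directory and imports it twice; the module name fake_model_store and its _load_model function are hypothetical stand-ins, not part of any library.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Source of a hypothetical module that "loads a model" at import time.
MODULE_SOURCE = '''
LOAD_COUNT = 0

def _load_model():
    global LOAD_COUNT
    LOAD_COUNT += 1
    return {"weights": [0.1, 0.2, 0.3]}  # stand-in for real weights

MODEL = _load_model()  # module level: runs when the module body executes
'''


def import_twice():
    """Import the same module twice and report how often its body ran."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "fake_model_store.py").write_text(MODULE_SOURCE)
        sys.path.insert(0, tmp)
        try:
            first = importlib.import_module("fake_model_store")
            # The second import is served from the sys.modules cache.
            second = importlib.import_module("fake_model_store")
            return first.LOAD_COUNT, first.MODEL is second.MODEL
        finally:
            # Clean up so the demo leaves no trace in the interpreter.
            sys.path.remove(tmp)
            sys.modules.pop("fake_model_store", None)


load_count, same_object = import_twice()
print(load_count, same_object)  # 1 True
```

The loader runs once even though the module is imported twice, and both imports see the same MODEL object. This holds per process: each worker process in a multi-process server has its own copy.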
This is critical in production because runtimes such as FastAPI, Django, or AWS Lambda all invoke your inference code many times within the same process. Without a singleton, each request would trigger a fresh from_pretrained() call, reloading gigabytes of weights. With a singleton, the model is loaded once per process (at startup or on first use) and shared across all requests that process handles.
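To make the difference concrete, here is a minimal sketch that counts loads for a handler that reloads per request versus one that shares a preloaded object. The load_model function is a hypothetical stand-in for from_pretrained() that sleeps briefly to mimic disk I/O; the handler names are also illustrative.

```python
import time

LOAD_CALLS = 0  # how many times the (fake) model has been loaded


def load_model():
    """Hypothetical stand-in for from_pretrained(); sleeps to mimic disk I/O."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    time.sleep(0.05)
    return lambda prompt: prompt.upper()  # toy "model"


def handle_request_naive(prompt):
    model = load_model()  # reloads on every request
    return model(prompt)


SHARED_MODEL = load_model()  # loaded once, at startup


def handle_request_shared(prompt):
    return SHARED_MODEL(prompt)  # reuses the in-memory object


def count_loads(handler, prompts):
    """Return how many loads the handler triggered while serving prompts."""
    before = LOAD_CALLS
    for prompt in prompts:
        handler(prompt)
    return LOAD_CALLS - before


prompts = ["a", "b", "c"]
naive_loads = count_loads(handle_request_naive, prompts)
shared_loads = count_loads(handle_request_shared, prompts)
print(naive_loads, shared_loads)  # 3 0
```

With a real model the sleep would be replaced by seconds of weight loading, so the naive handler's cost grows linearly with traffic while the shared handler pays it once.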
Analogy
It's like loading a heavy textbook into your mind at the start of a shift. You don't re-read the entire book for each question a student asks; you keep it in working memory and reference it instantly.
Code
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import Optional


class ModelSingleton:
    """Process-wide holder for one model and its tokenizer."""

    _instance: Optional["ModelSingleton"] = None
    _model = None
    _tokenizer = None

    def __new__(cls):
        # Create the instance only once; later calls return the same object.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        # __init__ runs on every ModelSingleton() call, so guard the load.
        if self._model is None:
            print("Loading model and tokenizer (first call only)...")
            model_name = "gpt2"
            # Half precision on GPU; float32 on CPU, where float16 is poorly supported.
            dtype = torch.float16 if torch.cuda.is_available() else torch.float32
            self._tokenizer = AutoTokenizer.from_pretrained(model_name)
            self._model = AutoModelForCausalLM.from_pretrained(
                model_name,
                device_map="auto",  # requires the accelerate package
                torch_dtype=dtype,
            )
            print("Model loaded and cached in memory.")
        else:
            print("Model already in memory, reusing...")

    @property
    def model(self):
        return self._model

    @property
    def tokenizer(self):
        return self._tokenizer

    def generate(self, prompt: str, max_tokens: int = 20) -> str:
        inputs = self._tokenizer(prompt, return_tensors="pt").to(self._model.device)
        # max_new_tokens counts generated tokens only; max_length would also count the prompt.
        outputs = self._model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            pad_token_id=self._tokenizer.eos_token_id,
        )
        return self._tokenizer.decode(outputs[0], skip_special_tokens=True)


if __name__ == "__main__":
    print("\n=== Request 1 ===")
    singleton1 = ModelSingleton()
    result1 = singleton1.generate("The future of AI is", max_tokens=15)
    print(f"Output: {result1}")

    print("\n=== Request 2 (same model instance) ===")
    singleton2 = ModelSingleton()
    result2 = singleton2.generate("Machine learning is", max_tokens=15)
    print(f"Output: {result2}")

    print("\n=== Verification ===")
    print(f"singleton1 is singleton2: {singleton1 is singleton2}")
    print(f"Same model object: {singleton1.model is singleton2.model}")

Output (generated text may differ):

=== Request 1 ===
Loading model and tokenizer (first call only)...
Model loaded and cached in memory.
Output: The future of AI is to create a new and better world

=== Request 2 (same model instance) ===
Model already in memory, reusing...
Output: Machine learning is a subset of artificial intelligence that

=== Verification ===
singleton1 is singleton2: True
Same model object: True
What just happened?
The first call to <code>ModelSingleton()</code> created the instance in <code>__new__</code> and then ran <code>__init__</code>, which loaded the tokenizer and model from disk and stored them on that instance. The second call went through <code>__new__</code> again, which returned the existing instance. <code>__init__</code> still ran, but its <code>self._model is None</code> check failed, so it skipped the load and printed the reuse message. Both requests used the identical model object, verified by the identity check <code>is</code>. No model reloading occurred between requests.
Common gotcha
The most common mistake is forgetting that __init__ runs every time you call the constructor, even if __new__ returns an existing instance. You MUST check whether self._model is None inside __init__, or the model will be reloaded on every instantiation. Another gotcha: if your model is on GPU and you don't set torch_dtype=torch.bfloat16 or torch.float16, it typically loads in full precision (float32), using about 2x the VRAM of half precision. On large models this can end in a CUDA out-of-memory error, and because the singleton loads lazily, that error surfaces at the first request rather than at deploy time, which makes debugging harder.
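The __init__ gotcha is easy to reproduce with plain Python objects. In this minimal sketch, the class names Unguarded and Guarded are illustrative, and object() stands in for an expensive model load.

```python
class Unguarded:
    """Singleton via __new__, but __init__ has no guard."""

    _instance = None
    init_calls = 0

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        # Runs on every Unguarded() call, replacing the "model" each time.
        type(self).init_calls += 1
        self.model = object()  # stand-in for an expensive load


class Guarded:
    """Same singleton, with the load guarded inside __init__."""

    _instance = None
    _model = None
    load_calls = 0

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        if self._model is None:  # only the first call loads
            type(self).load_calls += 1
            self._model = object()


a = Unguarded()
first_model = a.model
b = Unguarded()
print(a is b, b.model is first_model, Unguarded.init_calls)  # True False 2

g1 = Guarded()
g2 = Guarded()
print(g1 is g2, Guarded.load_calls)  # True 1
```

In the unguarded version the instance is shared but the model object is silently replaced on the second call, which is exactly the reload the singleton was meant to prevent.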
Error recovery
RuntimeError: CUDA out of memory
AttributeError: 'NoneType' object has no attribute 'to'
ImportError: cannot import name 'AutoModelForCausalLM'
TypeError: __init__() got an unexpected keyword argument 'device_map'
Experienced dev note
In real production (FastAPI, AWS Lambda), prefer module-level globals to a Python class singleton. Define model and tokenizer at the top of your module, outside any function. Lambda reuses the execution environment between warm invocations, so those globals persist; a cold start still pays the full load. The class adds complexity without giving you anything the module system doesn't already provide. Also: measure actual latency. A 7B model on a cheap GPU can reload in 2-3 seconds; a 70B model takes 15+ seconds. Know your break-even point. If your API averages 10 requests/minute, that is 14,400 requests/day; at 2.5 seconds per reload, a singleton saves roughly 10 hours of compute/day per instance. If you get 1 request/hour, it saves about a minute a day, which may not justify keeping the model resident. Plan accordingly.
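The break-even reasoning can be worked through with a few lines of arithmetic. This sketch assumes every request would otherwise trigger a full reload and uses 2.5 seconds per reload (the midpoint of the 2-3 second range quoted above); the function name reload_hours_per_day is illustrative, and the result scales linearly with whatever reload time you measure.

```python
def reload_hours_per_day(requests_per_minute: float, reload_seconds: float) -> float:
    """Hours per day spent reloading if every request reloads the model."""
    requests_per_day = requests_per_minute * 60 * 24
    return requests_per_day * reload_seconds / 3600


# Assumed reload time: 2.5 seconds per request.
busy = reload_hours_per_day(10, 2.5)      # 10 requests/minute
idle = reload_hours_per_day(1 / 60, 2.5)  # 1 request/hour

print(f"{busy:.1f} hours/day")         # 10.0 hours/day
print(f"{idle * 60:.1f} minutes/day")  # 1.0 minutes/day
```

Under these assumptions, a busy endpoint recovers hours of compute per day, while a once-an-hour endpoint recovers about a minute, which has to be weighed against the cost of keeping the model in memory.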
Check your understanding
You have a singleton model that generates text. A colleague claims that between two API requests to the same instance, the model weights are copied to the GPU twice. Why does this NOT happen, and what would you change in the code to make it happen?
Answer hint
A correct answer identifies that the model is already on the GPU after the first request and stays there because the same object is reused, so the weights are never copied a second time. To force a second copy, you would have to move the model to the CPU after each generation (so the next request has to move it back), or discard the singleton by resetting _instance and creating a new one. The key insight is that object identity (checked with 'is') confirms both requests use the same in-memory object, so no new load or device transfer occurs.