Code Advanced hard · 8 min

Model as singleton: load once

What you will learn

Cache your model in memory as a module-level singleton to avoid reloading 8GB weights on every inference request.

Why this matters

In production APIs or batch processing, reloading a model from disk on every call wastes 3-8 seconds per call and can multiply infrastructure costs by up to 10x. A singleton pattern keeps the model in GPU/CPU memory across requests, so each request pays only for inference instead of paying several seconds of load time first.

Skip if: Do NOT use the singleton pattern if you need to swap models frequently (A/B testing different architectures), if memory is severely constrained (edge devices), or if you're running isolated notebooks where only one call happens per session.

Explanation

A singleton is a pattern in which a resource (here, a transformer model) is loaded exactly once and reused across all subsequent calls. The key is that the model object persists in memory between inference requests rather than being re-read from disk each time.

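The mechanics: Python executes a module's top-level code only once per process, on first import; later imports reuse the cached module object. If you load your model at module level (outside any function), it stays resident in memory for the life of the process. When a request handler or function calls the model, it uses the same in-memory object every time. The tokenizer is cached the same way: it is usually much cheaper to load than the weights, but rebuilding it on every request is still wasted work. Device placement (device_map='auto') puts the model on the GPU when one is available, and because the same object is reused, the weights are not transferred from CPU to GPU again on later calls.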
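The "executed once" behaviour described above can be checked without any real model. The following is a minimal sketch: the module name `_singleton_demo_store` and the `_load_model` stand-in are hypothetical, and a plain dictionary takes the place of the weights.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Source of a hypothetical module that "loads a model" at import time.
MODULE_SOURCE = '''
LOAD_COUNT = 0

def _load_model():
    global LOAD_COUNT
    LOAD_COUNT += 1
    return {"weights": "stand-in for real weights"}

MODEL = _load_model()  # top-level code: runs when the module is first imported
'''

def import_twice():
    """Write the module to a temp dir and import it two times."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "_singleton_demo_store.py").write_text(MODULE_SOURCE)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            first = importlib.import_module("_singleton_demo_store")
            # The second import is served from the sys.modules cache.
            second = importlib.import_module("_singleton_demo_store")
        finally:
            sys.path.remove(tmp)
    return first, second

first, second = import_twice()
print(first is second)              # True
print(first.LOAD_COUNT)             # 1
print(first.MODEL is second.MODEL)  # True
```

The cache is per process, so each worker process of a multi-process server would still perform its own single load.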

This is critical in production because long-running servers (FastAPI, Django) and warm AWS Lambda execution environments invoke your inference code many times within the same process. Without a singleton, each request would trigger a fresh from_pretrained() call, reloading gigabytes of weights. With a singleton, the model is loaded once per process and shared across all requests that process handles.
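When a server handles requests on several threads, two requests can arrive before the first load has finished. One way to keep the "load once" guarantee is a lock around a lazy loader. This is a sketch under assumptions: `_load_model`, `get_model`, and `handle_request` are hypothetical names, and a short sleep stands in for a real from_pretrained() call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

_model = None
_lock = threading.Lock()
load_count = 0  # how many times the expensive load actually ran

def _load_model():
    """Stand-in for an expensive from_pretrained() call (hypothetical)."""
    global load_count
    load_count += 1
    time.sleep(0.05)  # simulate slow disk I/O
    return object()

def get_model():
    """Return the shared model, loading it on first use only."""
    global _model
    if _model is None:          # fast path once the model is loaded
        with _lock:
            if _model is None:  # re-check: another thread may have loaded it
                _model = _load_model()
    return _model

def handle_request(prompt: str) -> int:
    """Hypothetical request handler; returns the id of the model it used."""
    return id(get_model())

# Simulate 32 requests arriving on 8 worker threads.
with ThreadPoolExecutor(max_workers=8) as pool:
    ids = list(pool.map(handle_request, [f"prompt {i}" for i in range(32)]))

print(len(set(ids)))  # 1
print(load_count)     # 1
```

Every simulated request sees the same object, and the loader runs a single time even though the first few requests race for it.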

Analogy

It's like loading a heavy textbook into your mind at the start of a shift. You don't re-read the entire book for each question a student asks; you keep it in working memory and reference it instantly.

Code

python

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import Optional

class ModelSingleton:
    _instance: Optional['ModelSingleton'] = None
    _model = None
    _tokenizer = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        # __init__ runs on every ModelSingleton() call, so guard the expensive load.
        if self._model is None:
            print("Loading model and tokenizer (first call only)...")
            model_name = "gpt2"
            self._tokenizer = AutoTokenizer.from_pretrained(model_name)
            self._model = AutoModelForCausalLM.from_pretrained(
                model_name,
                device_map="auto",
                torch_dtype=torch.float16
            )
            print("Model loaded and cached in memory.")
        else:
            print("Model already in memory, reusing...")

    @property
    def model(self):
        return self._model

    @property
    def tokenizer(self):
        return self._tokenizer

    def generate(self, prompt: str, max_tokens: int = 20) -> str:
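        inputs = self._tokenizer(prompt, return_tensors="pt").to(self._model.device)
        # Note: max_length caps prompt + generated tokens together;
        # use max_new_tokens instead to cap only the generated part.
        outputs = self._model.generate(
            **inputs,
            max_length=max_tokens,
            pad_token_id=self._tokenizer.eos_token_id,
        )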
        return self._tokenizer.decode(outputs[0], skip_special_tokens=True)

if __name__ == "__main__":
    print("\n=== Request 1 ===")
    singleton1 = ModelSingleton()
    result1 = singleton1.generate("The future of AI is", max_tokens=15)
    print(f"Output: {result1}")

    print("\n=== Request 2 (same model instance) ===")
    singleton2 = ModelSingleton()
    result2 = singleton2.generate("Machine learning is", max_tokens=15)
    print(f"Output: {result2}")

    print("\n=== Verification ===")
    print(f"singleton1 is singleton2: {singleton1 is singleton2}")
    print(f"Same model object: {singleton1.model is singleton2.model}")

Output

=== Request 1 ===
Loading model and tokenizer (first call only)...
Model loaded and cached in memory.
Output: The future of AI is to create a new and better world

=== Request 2 (same model instance) ===
Model already in memory, reusing...
Output: Machine learning is a subset of artificial intelligence that

=== Verification ===
singleton1 is singleton2: True
Same model object: True

What just happened?

The first call to ModelSingleton() created the instance in __new__ and then ran __init__, which loaded the tokenizer and model from disk and stored them on that instance (shadowing the class-level None defaults). The second call went through __new__ again, which returned the existing instance. __init__ still ran, but the self._model is None check was false, so it skipped loading and printed the reuse message. Both requests used the identical model object, as the identity checks with is confirm. No model reloading occurred between requests.

Common gotcha

The most common mistake is forgetting that __init__ runs every time you call the constructor, even if __new__ returns an existing instance. You MUST check if self._model is None inside __init__, or the model will be reloaded on every instantiation. Another gotcha: if your model is on GPU and you don't set torch_dtype=torch.bfloat16 or torch.float16, it is loaded in full precision (float32), which takes roughly twice the VRAM of a 16-bit load. On large models the first instantiation then fails with a CUDA out-of-memory error, and the singleton never gets created.
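The first gotcha is easy to reproduce with a counter in place of the real load. In this sketch the class names `UnguardedSingleton` and `GuardedSingleton` are illustrative, and `object()` stands in for the model.

```python
class UnguardedSingleton:
    _instance = None
    loads = 0

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        # No guard: this "load" repeats on every constructor call.
        type(self).loads += 1
        self.model = object()


class GuardedSingleton:
    _instance = None
    loads = 0
    model = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        # Guard: only load when nothing has been loaded yet.
        if self.model is None:
            type(self).loads += 1
            self.model = object()


u1, u2 = UnguardedSingleton(), UnguardedSingleton()
g1, g2 = GuardedSingleton(), GuardedSingleton()

print(u1 is u2, UnguardedSingleton.loads)  # True 2
print(g1 is g2, GuardedSingleton.loads)    # True 1
```

Both classes hand back a single instance, but only the guarded one keeps the load count at one.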

Error recovery

RuntimeError: CUDA out of memory

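The model is loading in float32. Add torch_dtype=torch.bfloat16 to from_pretrained(). If still OOM, load in 8-bit by passing quantization_config=BitsAndBytesConfig(load_in_8bit=True) to from_pretrained() (this requires the bitsandbytes package).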
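To see why the dtype matters, a rough estimate of weight memory is parameter count times bytes per parameter. The sketch below covers weights only; activations, the KV cache, and quantization overhead come on top, and the helper name `weight_memory_gb` is illustrative.

```python
# Approximate storage cost per parameter for common load formats.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Estimate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Example: a 7-billion-parameter model.
for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(7e9, dtype):.1f} GB")
```

For a 7B model this gives about 28 GB in float32, 14 GB in a 16-bit format, 7 GB in 8-bit, and 3.5 GB in 4-bit.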

AttributeError: 'NoneType' object has no attribute 'to'

The model is None when generate() is called. Ensure __init__ loaded it by checking if _model is None before loading. This happens if __init__ is skipped due to incorrect singleton logic.

ImportError: cannot import name 'AutoModelForCausalLM'

You have an old transformers version. Update (quote the requirement so the shell does not treat >= as a redirect): pip install --upgrade "transformers>=5.0.0"

TypeError: __init__() got an unexpected keyword argument 'device_map'

You're on transformers < 4.26. Upgrade to 5.5.x: pip install transformers==5.5.0

Experienced dev note

In real production (FastAPI, AWS Lambda), don't use a Python class singleton; use module-level globals instead. Define model and tokenizer at the top of your module, outside any function. Lambda reuses the execution environment between warm invocations, so those globals persist. Classes add unnecessary overhead and complexity. Also: measure actual latency. A 7B model on a cheap GPU can reload in 2-3 seconds; a 70B model takes 15+ seconds. Know your break-even point. If your API averages 10 requests/minute, that is 14,400 requests/day; at about 2.5 seconds per avoided reload, the singleton saves roughly 10 hours of compute/day per instance. If you get 1 request/hour, it saves only about a minute a day. Plan accordingly.
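A back-of-the-envelope version of the break-even estimate, assuming every request would otherwise pay one full reload and that a reload costs a fixed 2.5 seconds (both are simplifying assumptions, and the function name is illustrative):

```python
def reload_minutes_saved_per_day(requests_per_day: float, reload_seconds: float) -> float:
    """Minutes of reload time avoided per day by keeping the model resident.

    Ignores the single load the singleton itself still needs.
    """
    return requests_per_day * reload_seconds / 60

busy = reload_minutes_saved_per_day(requests_per_day=10 * 60 * 24, reload_seconds=2.5)
quiet = reload_minutes_saved_per_day(requests_per_day=24, reload_seconds=2.5)

print(f"10 requests/minute: {busy:.0f} minutes/day ({busy / 60:.1f} hours)")
print(f"1 request/hour:     {quiet:.0f} minute/day")
```

Plugging in your own measured reload time and traffic gives the figure to compare against the cost of keeping the model's memory reserved all day.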

Check your understanding

You have a singleton model that generates text. Two API requests hit the same instance. Why are the model weights NOT copied to the GPU a second time, and what would you change in the code to make that happen?

Answer hint

A correct answer identifies that the model is already on the GPU after the first request and stays there because the same object is reused, so the weights are never copied twice. To make it happen, you'd need to move the model to CPU after generation or delete the singleton instance and create a new one. The key insight is that object identity (checked with 'is') means both requests hold the same object, whose weights already live on the device, so no new device transfer occurs.

VERSION In transformers < 4.26, the device_map parameter did not exist, so you had to call .to(device) manually. In 4.26-4.40, device_map was available but experimental. In 5.0+, device_map='auto' is the standard and recommended pattern. This code requires transformers >= 5.0.0.

Once your model is cached as a singleton, the next optimization is quantization: loading an 8-bit or 4-bit version to halve or quarter VRAM usage while maintaining inference speed.
