High severity · HTTP 404 · beginner · Fix: 2-5 min

ModelNotFoundError

openai.NotFoundError or huggingface_hub.ModelNotFoundError (Model llama-2-7b no longer available)

What this error means
Llama 2 models (llama-2-7b, llama-2-13b, llama-2-70b) are deprecated and have been removed from hosted inference platforms. Requests for those model IDs return 404 Not Found, so you need to migrate to a current model such as Llama 3.2 or Llama 3.3.

Stack trace

traceback
ModelNotFoundError: Could not find model: llama-2-7b. Llama 2 models are deprecated. Please use llama-3.2-1b-instruct, llama-3.2-3b-instruct, llama-3.1-8b-instruct, or llama-3.3-70b-instruct instead.

To fix: update your model parameter from 'llama-2-7b' (or 'meta-llama/Llama-2-7b') to 'meta-llama/Llama-3.2-3b-instruct' or 'meta-llama/Llama-3.3-70b-instruct'.
QUICK FIX
Replace model_id='llama-2-7b' with model_id='meta-llama/Llama-3.2-3b-instruct' (3B on-device) or 'meta-llama/Llama-3.3-70b-instruct' (70B best quality), then re-run your inference call.

Why it happens

Meta deprecated Llama 2 in 2024, and hosted inference providers (Replicate, Together AI, Groq, HuggingFace Inference API) removed it from their APIs in 2025. Llama 2 was superseded by the Llama 3 family (3.1, 3.2, 3.3), which offers better instruction following, longer context windows (8K–128K tokens, versus 4K for Llama 2), and improved reasoning. If you request a llama-2-* model ID on a platform that has removed it, the API returns 404 Not Found because the model endpoint no longer exists.

Detection

Check the model_id parameter in your production code for any reference to 'llama-2-7b', 'llama-2-13b', or 'llama-2-70b'. Run `grep -rniE 'llama-?2' . --include='*.py'` to find deprecated model IDs before deployment; the case-insensitive pattern also catches 'meta-llama/Llama-2-7b-hf' and the Ollama name 'llama2', which a plain search for 'llama-2' would miss. Then test the new model ID in a staging environment so the error is caught before it reaches production.
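The same check can be run from Python, for example as a pre-deployment test. The following is a minimal sketch, not an official tool: the function names `find_deprecated_ids` and `scan_tree` and the regular expression are assumptions made for illustration, and the pattern only looks for Llama 2 style names.

```python
import re
from pathlib import Path

# Assumed pattern: "llama-2..." or "llama2...", case-insensitive.
# It does not match Llama 3.x IDs such as "llama-3.2-3b-instruct".
DEPRECATED = re.compile(r"\bllama-?2\b[\w.:-]*", re.IGNORECASE)


def find_deprecated_ids(source: str):
    """Return (line_number, matched_id) pairs for Llama 2 style model IDs."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for match in DEPRECATED.finditer(line):
            hits.append((line_no, match.group(0)))
    return hits


def scan_tree(root: str = "."):
    """Yield (path, line_number, matched_id) for every .py file under root."""
    for path in Path(root).rglob("*.py"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for line_no, model_id in find_deprecated_ids(text):
            yield path, line_no, model_id


if __name__ == "__main__":
    for path, line_no, model_id in scan_tree("."):
        print(f"{path}:{line_no}: {model_id}")
```

Running the file from the project root prints one line per hit, so an empty output means no Llama 2 references were found in the scanned Python files.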

Causes & fixes

1

Code still references llama-2-7b or llama-2-13b model ID that was removed from the inference platform

✓ Fix

Replace model_id='llama-2-7b' with model_id='meta-llama/Llama-3.2-3b-instruct' (for on-device) or model_id='meta-llama/Llama-3.3-70b-instruct' (for best quality). Check your inference provider's documentation for the exact model ID format (HuggingFace, Groq, Together AI, Replicate, or Ollama each use slightly different paths).

2

Using an old inference SDK or client that still references deprecated Llama 2 endpoint URLs

✓ Fix

Upgrade to the latest SDK version: `pip install --upgrade openai replicate together` and verify your model_id matches the latest available models on your chosen platform. Deprecated endpoints (llama-2-7b-chat, llama-2-70b-chat) no longer resolve.

3

Running local inference with Ollama but 'llama2' model not pulled locally

✓ Fix

If you need Llama 2 locally, run `ollama pull llama2` to download it. Otherwise, pull a current model instead, such as `ollama pull llama3.2:3b` or `ollama pull llama3.3:70b`, and update your Python code to reference the new local model name.

4

Using HuggingFace Transformers with a gated Llama 2 model without accepting the Meta license agreement

✓ Fix

If you must use Llama 2, accept the license agreement at https://huggingface.co/meta-llama/Llama-2-7b-hf, wait for access to be granted, and authenticate with `huggingface_hub.login()`. Better: switch to Llama 3.2 or 3.3. Note that the meta-llama repositories for these models on HuggingFace are also gated, so the same license acceptance and login steps apply to them.

Code: broken vs fixed

Broken - triggers the error
python
import os
from openai import OpenAI

# THIS FAILS — llama-2-7b is deprecated
client = OpenAI(
    api_key=os.environ.get('TOGETHER_API_KEY'),
    base_url='https://api.together.xyz/v1'
)

response = client.chat.completions.create(
    model='llama-2-7b-chat',  # ❌ DEPRECATED — 404 Not Found
    messages=[
        {'role': 'user', 'content': 'What is machine learning?'}
    ],
    temperature=0.7,
    max_tokens=256
)

print(response.choices[0].message.content)
Fixed - works correctly
python
import os
from openai import OpenAI

# FIXED — use Llama 3.2 or 3.3
client = OpenAI(
    api_key=os.environ.get('TOGETHER_API_KEY'),
    base_url='https://api.together.xyz/v1'
)

response = client.chat.completions.create(
    model='meta-llama/Llama-3.2-3b-instruct',  # ✅ current model; confirm the exact ID format with your provider
    messages=[
        {'role': 'user', 'content': 'What is machine learning?'}
    ],
    temperature=0.7,
    max_tokens=256
)

print('Response:', response.choices[0].message.content)
Changed the `model` argument from the deprecated 'llama-2-7b-chat' to 'meta-llama/Llama-3.2-3b-instruct', the recommended replacement, verified against major inference platforms (Together AI, Groq, Replicate, HuggingFace) as of April 2026. The exact spelling of the model ID varies by provider, so check your provider's model list before deploying.

Workaround

If you absolutely cannot change your model ID immediately, you can catch the 404 error and fall back to a newer model at runtime: wrap the inference call in a try/except block, catch ModelNotFoundError or NotFoundError, and retry with model_id='meta-llama/Llama-3.3-70b-instruct'. However, this is a temporary patch: migrate to Llama 3.2/3.3 in your codebase as soon as possible.
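The runtime fallback described above might look like the following minimal sketch. It is written against a generic callable rather than a specific SDK so that it stays self-contained; the helper name `complete_with_fallback` and its parameters are hypothetical, not part of any library.

```python
from typing import Callable, Sequence, Tuple, Type


def complete_with_fallback(
    call: Callable[[str], str],
    model_ids: Sequence[str],
    not_found: Tuple[Type[BaseException], ...],
) -> Tuple[str, str]:
    """Try each model ID in order and return (model_id_used, result).

    `call` is any function that takes a model ID and returns the completion
    text. Only the exception types listed in `not_found` trigger a fallback;
    every other error propagates unchanged.
    """
    last_error = None
    for model_id in model_ids:
        try:
            return model_id, call(model_id)
        except not_found as exc:
            # This model is unavailable, so move on to the next candidate.
            last_error = exc
    raise RuntimeError(f"No model available among {list(model_ids)}") from last_error
```

With the OpenAI SDK v1+, `not_found` would be `(openai.NotFoundError,)` and `call` a small function that wraps `client.chat.completions.create(model=model_id, ...)` and returns `response.choices[0].message.content`. Logging the returned `model_id_used` makes it visible when the fallback path is being taken, which helps ensure the patch stays temporary.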

Prevention

Adopt a model versioning strategy: (1) store model_id as a configuration parameter (environment variable or config file) rather than hardcoding it; (2) subscribe to Meta's Llama release announcements and your inference platform's deprecation notices; (3) test your inference pipeline monthly against the latest available models; (4) use feature detection in your code to fall back gracefully if a specific model is unavailable. For production, use Llama 3.3 70B (best quality) or Llama 3.2 3B (on-device or cost-sensitive) as your primary model. Both were stable and well supported when this page was last verified, but any model can be retired, so keep watching deprecation notices.
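Point (1) can be as small as the sketch below, which reads the model ID from the environment and falls back to a default. The variable name `LLM_MODEL_ID` and the helper `get_model_id` are assumptions chosen for illustration; use whatever naming your configuration system already follows.

```python
import os
from typing import Mapping, Optional

# Default used when no override is configured (taken from this page's
# recommendation; confirm the exact ID format with your provider).
DEFAULT_MODEL_ID = "meta-llama/Llama-3.3-70b-instruct"


def get_model_id(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured model ID, or the default if none is set.

    `env` defaults to os.environ; passing a plain dict makes the
    function easy to test.
    """
    source = os.environ if env is None else env
    model_id = source.get("LLM_MODEL_ID", "").strip()
    return model_id or DEFAULT_MODEL_ID
```

Inference code then calls `get_model_id()` instead of embedding a literal string, so a future model migration becomes a configuration change rather than a code change.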

Python 3.8+ · openai>=1.0.0 (OpenAI SDK v1+), or an equivalent current version of together, replicate, huggingface_hub, or ollama, depending on your provider · tested on openai==1.54.0+, together==1.0.0+, huggingface_hub==0.21.0+, ollama==0.3.0+
Verified 2026-04 · llama-3.2-1b-instruct, llama-3.2-3b-instruct, llama-3.3-70b-instruct