ModelNotFoundError
openai.NotFoundError (HTTP 404) or a provider-specific ModelNotFoundError (Model llama-2-7b no longer available)
Stack trace
ModelNotFoundError: Could not find model: llama-2-7b. Llama 2 models are deprecated. Please use llama-3.2-1b-instruct, llama-3.2-3b-instruct, llama-3.1-8b-instruct, or llama-3.3-70b-instruct instead. To fix: Update your model parameter from 'llama-2-7b' to 'meta-llama/Llama-2-7b' → 'meta-llama/Llama-3.2-3b-instruct' or 'meta-llama/Llama-3.3-70b-instruct'.
Why it happens
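Llama 2 has been superseded by the Llama 3 family (3.1, 3.2, 3.3), which offers better instruction-following, longer context windows (up to 128K tokens in Llama 3.1 and 3.2, versus 4K in Llama 2), and improved reasoning. As the newer models rolled out, hosted inference platforms such as Replicate, Together AI, Groq, and the HuggingFace Inference API have been retiring Llama 2 endpoints on their own schedules; whether and when a given model ID was removed depends on the provider, so check its deprecation notices. Once an endpoint is retired, a request for a llama-2-* model ID returns 404 Not Found because the model endpoint no longer exists, and the client surfaces that response as this error.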
Detection
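Check the model ID your production code passes to the inference client for any reference to 'llama-2-7b', 'llama-2-13b', or 'llama-2-70b', including the -chat variants and Ollama's 'llama2' spelling. Run `grep -rniE 'llama-?2' . --include='*.py'` to find deprecated model IDs before deployment; widen the --include pattern if model IDs also live in config files. Then test against the new model ID in a staging environment so the error is caught before it reaches production.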
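The grep check above can also run as a small Python function, for example in a CI step. This is a minimal sketch, not part of any SDK: the function name find_legacy_model_ids and the regular expression are illustrative assumptions, and the scan covers only *.py files.

```python
import re
from pathlib import Path

# Matches "llama-2-7b", "Llama-2-70b-chat", and Ollama-style "llama2",
# but not "llama-3.2" or "llama3.2". The pattern is an assumption made
# for this sketch; adjust it to the ID spellings used in your codebase.
LEGACY_PATTERN = re.compile(r"llama-?2(?![\d.])", re.IGNORECASE)


def find_legacy_model_ids(root: str) -> list[tuple[str, int, str]]:
    """Return (path, line_number, line) for each line mentioning a Llama 2 ID."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip it rather than abort the scan
        for number, line in enumerate(text.splitlines(), start=1):
            if LEGACY_PATTERN.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

Calling find_legacy_model_ids('.') from the repository root returns one tuple per offending line; a CI job could fail the build whenever the list is non-empty.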
Causes & fixes
Code still references llama-2-7b or llama-2-13b model ID that was removed from the inference platform
Replace model_id='llama-2-7b' with model_id='meta-llama/Llama-3.2-3b-instruct' (for on-device) or model_id='meta-llama/Llama-3.3-70b-instruct' (for best quality). Check your inference provider's documentation for the exact model ID format (HuggingFace, Groq, Together AI, Replicate, or Ollama each use slightly different paths).
Using an old inference SDK or client that still references deprecated Llama 2 endpoint URLs
Upgrade to the latest SDK version: `pip install --upgrade openai replicate together` and verify your model_id matches the latest available models on your chosen platform. Deprecated endpoints (llama-2-7b-chat, llama-2-70b-chat) no longer resolve.
Running local inference with Ollama but 'llama2' model not pulled locally
If you still need Llama 2 locally, run `ollama pull llama2` to download it. Otherwise, pull a newer model such as `ollama pull llama3.2:3b` or `ollama pull llama3.3:70b`, and update both your pull command and your Python code to reference the new local model name.
Using HuggingFace Transformers with a gated Llama 2 model without accepting the Meta license agreement
If you must use Llama 2, accept the license agreement at https://huggingface.co/meta-llama/Llama-2-7b-hf, create a HuggingFace access token, and authenticate with `huggingface_hub.login()`. Switching to Llama 3.2 or 3.3 is usually the better long-term choice, but note that the meta-llama repositories for those models are gated as well, so the same license-acceptance and login steps apply.
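For the first cause above, the ID swap can be centralized in one helper instead of being edited at every call site. The following is a sketch under stated assumptions: suggest_replacement is a hypothetical name, and the replacement IDs are the examples given in this article, not an official mapping from any provider.

```python
import re

# Detects Llama 2 style IDs such as "llama-2-7b", "meta-llama/Llama-2-13b",
# or Ollama's "llama2". Illustrative pattern, not a provider-defined rule.
LEGACY_ID = re.compile(r"llama-?2(?![\d.])", re.IGNORECASE)

# Example replacement IDs taken from this article; confirm the exact
# spelling in your provider's model catalog before relying on them.
SMALL_MODEL = "meta-llama/Llama-3.2-3b-instruct"
LARGE_MODEL = "meta-llama/Llama-3.3-70b-instruct"


def suggest_replacement(model_id: str, prefer_quality: bool = False) -> str:
    """Return model_id unchanged unless it looks like a Llama 2 ID."""
    if not LEGACY_ID.search(model_id):
        return model_id
    # A 70B Llama 2 ID maps to the large replacement; smaller ones map to
    # the small replacement unless the caller asks for best quality.
    if prefer_quality or "70b" in model_id.lower():
        return LARGE_MODEL
    return SMALL_MODEL
```

Keeping the mapping in one place means the next model retirement needs a single edit rather than a search across the codebase.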
Code: broken vs fixed
import os
from openai import OpenAI

# THIS FAILS: llama-2-7b is deprecated
client = OpenAI(
    api_key=os.environ.get('TOGETHER_API_KEY'),
    base_url='https://api.together.xyz/v1'
)
response = client.chat.completions.create(
    model='llama-2-7b-chat',  # ❌ DEPRECATED: 404 Not Found
    messages=[
        {'role': 'user', 'content': 'What is machine learning?'}
    ],
    temperature=0.7,
    max_tokens=256
)
print(response.choices[0].message.content)

import os
from openai import OpenAI

# FIXED: use a Llama 3.2 or 3.3 model
client = OpenAI(
    api_key=os.environ.get('TOGETHER_API_KEY'),
    base_url='https://api.together.xyz/v1'
)
response = client.chat.completions.create(
    # ✅ Example ID; confirm the exact spelling in your provider's model catalog
    model='meta-llama/Llama-3.2-3b-instruct',
    messages=[
        {'role': 'user', 'content': 'What is machine learning?'}
    ],
    temperature=0.7,
    max_tokens=256
)
print('Response:', response.choices[0].message.content)

Workaround
If you cannot change your model ID immediately, you can catch the 404 error and fall back to a newer model at runtime: wrap the inference call in a try/except block, catch the not-found exception your client raises (openai.NotFoundError with the OpenAI SDK, or your provider's ModelNotFoundError), and retry with model_id='meta-llama/Llama-3.3-70b-instruct'. Treat this as a temporary patch: every failed first attempt adds latency, and the fallback model may differ in cost and output. Migrate to Llama 3.2/3.3 in your codebase as soon as possible.
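The try/except fallback described above might look like the sketch below. It is deliberately client-agnostic: create_with_fallback is a hypothetical helper, and the caller supplies both the request function and the exception types that mean "model not found".

```python
from collections.abc import Callable, Sequence
from typing import Any


def create_with_fallback(
    create: Callable[..., Any],
    models: Sequence[str],
    not_found_errors: tuple[type[BaseException], ...],
    **request_kwargs: Any,
) -> tuple[str, Any]:
    """Try each model ID in order and return (model_id_used, response)."""
    if not models:
        raise ValueError("models must contain at least one model ID")
    last_error = None
    for model_id in models:
        try:
            return model_id, create(model=model_id, **request_kwargs)
        except not_found_errors as exc:
            # Only "model not found" errors trigger the fallback; auth,
            # rate-limit, and network errors propagate unchanged.
            last_error = exc
    # Every candidate was rejected: re-raise the last not-found error.
    raise last_error


# Usage with the OpenAI-compatible client from the example above
# (assumed setup, shown as comments so the sketch needs no network):
#   from openai import NotFoundError
#   used, response = create_with_fallback(
#       client.chat.completions.create,
#       ['llama-2-7b-chat', 'meta-llama/Llama-3.3-70b-instruct'],
#       (NotFoundError,),
#       messages=[{'role': 'user', 'content': 'What is machine learning?'}],
#   )
```

Returning the model ID that actually served the request makes it easy to log how often the fallback path fires, which is a useful signal for scheduling the real migration.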
Prevention
Adopt a model versioning strategy: (1) store model_id as a configuration parameter (environment variable or config file) rather than hardcoding it; (2) subscribe to Meta's Llama release announcements and your inference platform's deprecation notices; (3) test your inference pipeline monthly against the latest available models; (4) use feature detection in your code to fall back gracefully if a specific model is unavailable. For production, Llama 3.3 70B (best quality) or Llama 3.2 3B (on-device or cost-sensitive) are reasonable primary models today. No hosted model is guaranteed to stay available, so keep monitoring provider notices even after migrating.
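Points (1) and (4) can be combined in a few lines. The sketch below assumes a hypothetical environment variable LLM_MODEL_ID holding a comma-separated preference list, and it takes the set of available model IDs as an argument; with an OpenAI-compatible client that set might come from client.models.list() where the provider supports it, which is an assumption to verify per platform.

```python
import os
from collections.abc import Iterable, Mapping

# Hypothetical variable name; use whatever your deployment config defines.
MODEL_ENV_VAR = "LLM_MODEL_ID"

# Defaults taken from this article's recommendations (example IDs).
DEFAULT_CANDIDATES = (
    "meta-llama/Llama-3.3-70b-instruct",
    "meta-llama/Llama-3.2-3b-instruct",
)


def candidate_models(env: Mapping[str, str] = os.environ) -> list[str]:
    """Read a comma-separated preference list from the environment."""
    raw = env.get(MODEL_ENV_VAR, "")
    configured = [part.strip() for part in raw.split(",") if part.strip()]
    return configured or list(DEFAULT_CANDIDATES)


def pick_available_model(candidates: Iterable[str], available: Iterable[str]) -> str:
    """Return the first candidate the platform currently serves."""
    available_ids = set(available)
    for model_id in candidates:
        if model_id in available_ids:
            return model_id
    raise RuntimeError("None of the configured model IDs is currently available")
```

Running pick_available_model at startup, rather than on the first user request, surfaces a retired model ID as a deployment failure instead of a production 404.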