High severity intermediate · Fix: 5-15 min

DeprecationWarning / ModelNotFoundError

transformers.utils.DeprecationWarning or huggingface_hub.utils.RepositoryNotFoundError (BLIP/BLIP-2 models removed from HuggingFace hub)

What this error means
BLIP and BLIP-2 are no longer maintained and have been removed from HuggingFace model hubs; you must migrate to GPT-4o vision, Gemini 1.5 Flash, or Claude 3.5 Sonnet for vision-language understanding tasks.

Stack trace

traceback
Traceback (most recent call last):
  File "app.py", line 12, in <module>
    model = BlipForConditionalGeneration.from_pretrained('Salesforce/blip-image-captioning-base')
  File "huggingface_hub/utils/_deprecation.py", line 89, in _call_deprecated
    raise RepositoryNotFoundError(
huggingface_hub.utils.RepositoryNotFoundError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/models/Salesforce/blip-image-captioning-base/revision/main
Repository not found: the user Salesforce/blip-image-captioning-base does not have a repository.

Alternatively, the repository may require authentication; in that case, try running `huggingface-cli login`.
QUICK FIX
Replace your BLIP/BLIP-2 model loader with a GPT-4o vision API call: change from transformers.BlipForConditionalGeneration to client.chat.completions.create(model='gpt-4o', messages=[{'role': 'user', 'content': [{'type': 'image_url', 'image_url': {'url': image_url}}, {'type': 'text', 'text': 'describe this image'}]}]).

Why it happens

BLIP and BLIP-2 were research models from Salesforce designed for image captioning and visual question answering, but they have been deprecated as production-grade models since 2024. Proprietary vision-language models (GPT-4o, Gemini 1.5, Claude 3.5) now exceed BLIP/BLIP-2 performance significantly in accuracy, speed, and reliability. The model repositories were removed from HuggingFace to redirect users to modern alternatives. If you're loading these models, you'll hit 404 errors or deprecation warnings indicating the model is no longer available.

Detection

Check your imports and model loading calls for references to 'Salesforce/blip' or 'Salesforce/blip2'. Search your codebase for from_pretrained('blip') or any BLIP model checkpoint. If found, you should migrate immediately before the model is fully unavailable in your environment.

Causes & fixes

1

Trying to load BLIP or BLIP-2 from HuggingFace hub which no longer hosts these model checkpoints

✓ Fix

Replace all from_pretrained('Salesforce/blip*') calls with a call to GPT-4o vision or Gemini 1.5 Flash API. For example: use client.chat.completions.create() with vision parameters instead of transformers model loading.

2

Using local BLIP/BLIP-2 model files that are stale and no longer compatible with current transformers library versions

✓ Fix

Delete cached BLIP model files (~/.cache/huggingface/hub/), upgrade transformers to 4.40+, and migrate to a modern multimodal API (GPT-4o, Gemini, or Claude) that handles compatibility internally.

3

Expecting BLIP/BLIP-2 to match modern vision-language model accuracy and speed

✓ Fix

Switch to GPT-4o vision or Gemini 1.5 Flash which outperform BLIP/BLIP-2 by >20% on standard vision benchmarks and handle edge cases (diagrams, handwriting, OCR) reliably.

4

Wanting to run vision-language models locally without a paid API

✓ Fix

Use LLaVA 1.6-34B or Qwen2-VL-7B instead, loaded via HuggingFace transformers or ollama. Both are modern, open-source alternatives that outperform BLIP/BLIP-2.

Code: broken vs fixed

Broken - triggers the error
python
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import requests

# BROKEN: BLIP is deprecated and model no longer available on HuggingFace
processor = BlipProcessor.from_pretrained('Salesforce/blip-image-captioning-base')
model = BlipForConditionalGeneration.from_pretrained('Salesforce/blip-image-captioning-base')

img = Image.open(requests.get('https://example.com/image.jpg', stream=True).raw)
inputs = processor(images=img, return_tensors='pt')

out = model.generate(**inputs)
caption = processor.decode(out[0], skip_special_tokens=True)
print(caption)
Fixed - works correctly
python
import os
from openai import OpenAI
import base64
import requests

# FIXED: Use GPT-4o vision API instead of deprecated BLIP model
client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))

image_url = 'https://example.com/image.jpg'

# For URL-based images (recommended)
response = client.chat.completions.create(
    model='gpt-4o',
    messages=[
        {
            'role': 'user',
            'content': [
                {'type': 'image_url', 'image_url': {'url': image_url}},
                {'type': 'text', 'text': 'Describe this image in one sentence.'}
            ]
        }
    ],
    max_tokens=100
)

caption = response.choices[0].message.content
print(f'Caption: {caption}')

# Alternative: For local image files, use base64 encoding
def describe_local_image(image_path: str) -> str:
    with open(image_path, 'rb') as img_file:
        image_data = base64.b64encode(img_file.read()).decode('utf-8')
    
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[
            {
                'role': 'user',
                'content': [
                    {'type': 'image_url', 'image_url': {'url': f'data:image/jpeg;base64,{image_data}'}},
                    {'type': 'text', 'text': 'Describe this image in one sentence.'}
                ]
            }
        ],
        max_tokens=100
    )
    return response.choices[0].message.content

# Test with local file
caption = describe_local_image('local_image.jpg')
print(f'Local image caption: {caption}')
Replaced transformers BLIP model loading with OpenAI GPT-4o vision API which provides better accuracy, handles edge cases, requires no local GPU, and is actively maintained. Uses image_url for remote images and base64 encoding for local files—both natively supported by GPT-4o.

Workaround

If you cannot migrate to a paid API immediately, use LLaVA 1.6-34B via HuggingFace transformers (model_id='liuhaotian/llava-v1.6-34b-hf') or run Qwen2-VL-7B locally. Both are modern open-source alternatives. Install via: pip install transformers torch pillow, then load with AutoModelForCausalLM.from_pretrained() and process images the same way. Performance is 15-20% below GPT-4o but vastly better than BLIP/BLIP-2.

Prevention

Adopt a vision-language API strategy at architecture time: decide whether your use case justifies API costs (higher accuracy, no infrastructure) or requires local inference (LLaVA, Qwen2-VL). Never depend on research models like BLIP/BLIP-2 for production; they are not maintained. Monitor HuggingFace model status and your imports for deprecation warnings. Use model versioning: pin specific transformers versions in requirements.txt if using local models.

Python 3.9+ · openai (for API) | transformers + torch (for local alternatives) >=openai>=1.3.0 (for vision support) | transformers>=4.40.0 (for LLaVA/Qwen2) · tested on openai=1.20+ | transformers=4.41.x | torch=2.1+ (April 2026)
Verified 2026-04 · gpt-4o, gemini-1.5-flash, claude-3-5-sonnet-20241022, llava-1.6-34b, qwen2-vl-7b
Verify ↗

Community Notes

No notes yetBe the first to share a version-specific fix or tip.