Concept Intermediate · 3 min read

What is a model stealing attack?

Quick answer
A model stealing attack is a security exploit in which an attacker queries a proprietary AI model extensively and uses the collected responses to reconstruct an unauthorized copy of it. The attack compromises intellectual property and can enable misuse of the replicated model.

How it works

In a model stealing attack, an adversary sends numerous inputs to a target AI model and collects its outputs. By analyzing these input-output pairs, the attacker trains a surrogate model that mimics the original's behavior, much like reverse-engineering a software program by observing its responses to varied inputs.

For example, if the target is a text classification model, the attacker queries it with many text samples and records the predicted labels. Using this dataset, the attacker trains a new model that approximates the original's decision boundaries.

Concrete example

Below is a simplified Python example demonstrating a model stealing attack using the OpenAI API to query a proprietary model and then training a surrogate model with the collected data.

python
import os
from openai import OpenAI
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample inputs to query the target model
inputs = [
    "I love this product!",
    "This is the worst service ever.",
    "Not bad, could be better.",
    "Absolutely fantastic experience.",
    "I will never buy this again."
]

# Query the proprietary model to get labels
labels = []
for text in inputs:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
            f"Classify the sentiment as positive, negative, or neutral. "
            f"Answer with one word only: {text}"}]
    )
    # Constraining the prompt to one word keeps the reply aligned with label_map
    label = response.choices[0].message.content.strip().lower().rstrip(".")
    labels.append(label)

# Vectorize inputs
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(inputs)

# Encode labels (simple mapping)
label_map = {"positive": 1, "negative": 0, "neutral": 2}
y = np.array([label_map.get(l, 2) for l in labels])

# Train surrogate model
surrogate = RandomForestClassifier()
surrogate.fit(X, y)

# Surrogate model can now predict sentiment without querying the original
print(surrogate.predict(vectorizer.transform(["I hate waiting in line."])))
output
[0]

When to use it

While model stealing attacks are unethical and illegal when performed without consent, understanding them is crucial for AI developers and security professionals. Use knowledge of these attacks to:

  • Design robust defenses such as output perturbation or rate limiting.
  • Evaluate the risk of exposing models via APIs.
  • Develop watermarking techniques to detect stolen models.

Do not use model stealing techniques to replicate proprietary models without authorization, as this violates intellectual property rights and AI ethics.
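The output perturbation defense listed above can be sketched in a few lines. This is an illustrative example, not a production defense: the function name `perturb_probs` and the noise scale `epsilon` are assumptions, and real deployments tune the noise to balance accuracy against extraction resistance.

```python
import random

def perturb_probs(probs, epsilon=0.05, seed=None):
    """Return a noised copy of a probability vector.

    Adding small random noise to the scores an API returns makes it
    harder for an attacker to recover the model's exact decision
    boundaries from repeated queries, at a small cost in fidelity.
    """
    rng = random.Random(seed)
    # Jitter each score, clamp at zero, then renormalize to sum to 1
    noisy = [max(p + rng.uniform(-epsilon, epsilon), 0.0) for p in probs]
    total = sum(noisy)
    return [p / total for p in noisy]

# The API would return perturb_probs(model_scores) instead of raw scores
print(perturb_probs([0.7, 0.2, 0.1], seed=42))
```

Returning only the top label (rather than full probabilities), or rounding scores coarsely, serves the same goal: each query leaks less information about the decision boundary.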

Key terms

Model stealing attack: An attack that reconstructs a proprietary AI model by querying it extensively to create a copy.
Surrogate model: A model trained by an attacker to mimic the behavior of the target model using collected input-output pairs.
Query: An input sent to an AI model to receive an output or prediction.
Watermarking: Techniques to embed identifiable patterns in models to detect unauthorized copies.
Rate limiting: Restricting the number of queries to an API to prevent abuse or attacks.
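Rate limiting is commonly implemented as a token bucket. The sketch below is a minimal illustration under assumed names (`TokenBucket`, `allow`); real API gateways track buckets per client key and persist state across processes.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: a client starts with
    `capacity` tokens, each query spends one, and tokens refill
    at `rate` per second. Queries beyond the budget are refused,
    slowing down large-scale extraction attempts."""

    def __init__(self, capacity=100, rate=1.0):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens based on elapsed time, capped at capacity
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(capacity=3, rate=0.0)  # no refill, for demonstration
print([bucket.allow() for _ in range(5)])   # first 3 allowed, rest refused
```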

Key takeaways

  • Model stealing attacks extract proprietary AI models by querying them repeatedly and training surrogates.
  • Defenses include limiting query rates, adding noise to outputs, and watermarking models.
  • Understanding these attacks helps protect AI intellectual property and maintain ethical AI deployment.
Verified 2026-04 · gpt-4o-mini