
How to measure prompt consistency

Quick answer
Measure prompt consistency by sending the same prompt multiple times to an LLM like gpt-4o and comparing the outputs using metrics such as semantic similarity or exact match. Use embeddings or token-level overlap to quantify how consistently the model responds to identical prompts.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quote the specifier so the shell does not treat `>` as redirection)
  • pip install numpy scipy

Setup

Install the required Python packages and set your environment variable for the OpenAI API key.

  • Install the OpenAI SDK and dependencies:

```bash
pip install openai numpy scipy
```
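With the packages installed, export your API key so the SDK can pick it up; `OPENAI_API_KEY` is the environment variable the OpenAI Python client reads by default:

```shell
# Set the API key for the current shell session
export OPENAI_API_KEY="sk-..."  # replace with your actual key
```

Add the line to your shell profile (e.g. `~/.bashrc`) if you want it to persist across sessions.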

Step by step

This example sends the same prompt multiple times to gpt-4o and calculates the average cosine similarity between the embeddings of the responses to measure consistency.

```python
import os

import numpy as np
from openai import OpenAI
from scipy.spatial.distance import cosine

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompt = "Explain the concept of prompt consistency in AI."  # fixed prompt
num_trials = 5


def get_completion(prompt):
    """Request one chat completion for the prompt."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()


def get_embedding(text):
    """Embed a completion so responses can be compared semantically."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return np.array(response.data[0].embedding)


# Collect multiple completions for the identical prompt
completions = [get_completion(prompt) for _ in range(num_trials)]

# Embed each completion
embeddings = [get_embedding(text) for text in completions]

# Pairwise cosine similarities across all unique response pairs
similarities = []
for i in range(num_trials):
    for j in range(i + 1, num_trials):
        sim = 1 - cosine(embeddings[i], embeddings[j])  # cosine similarity
        similarities.append(sim)

# Average similarity is the consistency score (closer to 1.0 = more consistent)
consistency_score = np.mean(similarities)

print(f"Prompt consistency score (average cosine similarity): {consistency_score:.4f}")
print("Sample completions:")
for i, c in enumerate(completions, 1):
    print(f"{i}: {c}\n")
```
Output (example run; actual responses and the exact score will vary):

```text
Prompt consistency score (average cosine similarity): 0.9200
Sample completions:
1: Prompt consistency refers to how reliably an AI model produces similar outputs when given the same input prompt.
2: Prompt consistency is the measure of how consistently an AI model responds to the same prompt across multiple attempts.
3: It describes the stability of an AI's output when the same prompt is repeated.
4: Prompt consistency means the AI generates similar answers for identical prompts over repeated queries.
5: It is the degree to which an AI model's responses remain stable for the same prompt.
```

Common variations

You can measure prompt consistency using different approaches:

  • Use exact string match or token overlap (e.g., BLEU, ROUGE) for deterministic tasks.
  • Use semantic similarity with embeddings for more flexible, natural language outputs.
  • Try different models like claude-3-5-sonnet-20241022 or gemini-2.5-pro to compare consistency.
  • Run trials concurrently with asynchronous API calls to speed up data collection.
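As a sketch of the token-overlap variation, here is a minimal pairwise score using Jaccard overlap of whitespace tokens (not BLEU or ROUGE, which weight n-grams; the function names `token_jaccard` and `overlap_consistency` are illustrative, not from any library):

```python
def token_jaccard(a, b):
    """Jaccard overlap between the whitespace-token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty strings are trivially identical
    return len(ta & tb) / len(ta | tb)


def overlap_consistency(completions):
    """Average pairwise Jaccard overlap across all unique completion pairs."""
    pairs = [
        token_jaccard(completions[i], completions[j])
        for i in range(len(completions))
        for j in range(i + 1, len(completions))
    ]
    return sum(pairs) / len(pairs) if pairs else 1.0


# Two identical answers and one divergent one pull the average down
print(overlap_consistency([
    "the model is consistent",
    "the model is consistent",
    "outputs vary a lot",
]))
```

This is stricter than embedding similarity: paraphrases that share few surface tokens score low even when their meaning matches, which makes it better suited to deterministic tasks with a single expected answer.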

Troubleshooting

If you observe low consistency scores:

  • Check whether the model temperature is high (the default is often 1.0); reduce it to 0 or near 0, and consider the API's `seed` parameter (where supported) for best-effort determinism.
  • Ensure the prompt is exactly the same each time without hidden characters or whitespace differences.
  • Verify API rate limits or errors that might cause incomplete or truncated responses.
  • Use embeddings from the same model version to avoid embedding space mismatches.
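To guard against the hidden-character and whitespace pitfalls above, you can normalize prompts before sending them. This sketch (the helper name `normalize_prompt` is illustrative) drops Unicode "format" characters such as the zero-width space and collapses whitespace runs; which characters to strip is ultimately a judgment call for your data:

```python
import unicodedata


def normalize_prompt(text):
    """Collapse whitespace runs and drop zero-width/format characters."""
    # Remove Unicode category "Cf" (format) characters, e.g. U+200B zero-width space
    cleaned = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    # Collapse tabs, newlines, and repeated spaces into single spaces
    return " ".join(cleaned.split())


print(normalize_prompt("Explain  prompt\u200b consistency.\n"))
```

Running every prompt through the same normalizer before each trial ensures that byte-level differences do not masquerade as model inconsistency.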

Key Takeaways

  • Send the same prompt multiple times and compare outputs to quantify consistency.
  • Use embedding cosine similarity for semantic-level consistency measurement.
  • Lower model temperature improves prompt consistency by reducing randomness.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022, gemini-2.5-pro, text-embedding-3-small