
How to test prompts systematically

Quick answer
Test prompts systematically by scripting the runs: vary the inputs in a controlled way, log every output, and compare results across runs. Batch the same prompts against models such as gpt-4o or claude-3-5-sonnet-20241022, then refine the prompts iteratively based on what the logs show.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the openai Python package and set your API key as an environment variable to authenticate requests.

bash
pip install "openai>=1.0"
export OPENAI_API_KEY="your-api-key"

Step by step

This example demonstrates how to test multiple prompt variations systematically by sending batch requests to gpt-4o and logging the outputs for comparison.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompts = [
    "Explain quantum computing in simple terms.",
    "Explain quantum computing as if to a 5-year-old.",
    "Summarize quantum computing in one sentence."
]

results = []
for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    output = response.choices[0].message.content
    results.append((prompt, output))

for i, (prompt, output) in enumerate(results, 1):
    print(f"Prompt {i}: {prompt}\nResponse:\n{output}\n{'-'*40}")
output
Prompt 1: Explain quantum computing in simple terms.
Response:
Quantum computing uses the principles of quantum mechanics to process information in ways classical computers cannot, enabling faster problem-solving for certain tasks.
----------------------------------------
Prompt 2: Explain quantum computing as if to a 5-year-old.
Response:
Imagine a magic box that can try many answers at once to solve puzzles super fast. That's what quantum computers do!
----------------------------------------
Prompt 3: Summarize quantum computing in one sentence.
Response:
Quantum computing harnesses quantum bits to perform complex calculations more efficiently than classical computers.
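
Printing is fine for a quick look, but comparing runs over time is easier when each result is written to a file. The following is a minimal sketch rather than part of the original example: `run_and_log`, the `generate` callable, and the CSV column names are all hypothetical. `generate` stands in for any function that takes a prompt string and returns the model's text, such as a thin wrapper around the `client.chat.completions.create` call above.

```python
import csv
import io
from datetime import datetime, timezone
from typing import Callable, Iterable, TextIO


def run_and_log(prompts: Iterable[str],
                generate: Callable[[str], str],
                out: TextIO,
                model: str = "gpt-4o") -> int:
    """Run each prompt through `generate` and write one CSV row per result.

    `generate` is a hypothetical stand-in for a real model call.
    Returns the number of result rows written (header excluded).
    """
    writer = csv.writer(out)
    writer.writerow(["timestamp", "model", "prompt", "output", "output_words"])
    count = 0
    for prompt in prompts:
        output = generate(prompt)
        writer.writerow([
            datetime.now(timezone.utc).isoformat(),
            model,
            prompt,
            output,
            len(output.split()),  # crude length metric for side-by-side comparison
        ])
        count += 1
    return count


if __name__ == "__main__":
    # Stub generator so the sketch runs without an API key.
    def fake_generate(prompt: str) -> str:
        return f"(stub reply to: {prompt})"

    buffer = io.StringIO()
    rows = run_and_log(["Explain quantum computing in simple terms."],
                       fake_generate, buffer)
    print(f"{rows} row(s) logged")
    print(buffer.getvalue())
```

Passing the model call in as a function keeps the logging logic testable offline. To write to disk instead of memory, open a file with `open("results.csv", "w", newline="")` and pass it as `out`.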

Common variations

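You can also run tests asynchronously, stream responses, or compare against another model such as claude-3-5-sonnet-20241022 (an Anthropic model, so it needs an Anthropic API key and endpoint rather than the OpenAI ones used here). Adjust max_tokens and temperature to explore output diversity. The example below sends two prompts concurrently.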

python
import asyncio
import os
from openai import AsyncOpenAI

# The async client exposes the same methods as OpenAI, but they are awaitable.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def test_prompt_async(prompt):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    prompts = [
        "Describe AI ethics.",
        "Describe AI ethics simply."
    ]
    tasks = [test_prompt_async(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    for prompt, output in zip(prompts, results):
        print(f"Prompt: {prompt}\nResponse: {output}\n{'-'*30}")

asyncio.run(main())
output
Prompt: Describe AI ethics.
Response: AI ethics involves principles guiding the responsible development and use of artificial intelligence to ensure fairness, transparency, and safety.
------------------------------
Prompt: Describe AI ethics simply.
Response: AI ethics means making sure computers do the right thing and don’t hurt people.
------------------------------
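
To explore output diversity in a controlled way, it can help to generate every combination of prompt, temperature, and max_tokens up front instead of editing values by hand. This is a sketch under assumptions: `build_test_matrix` is a hypothetical helper, and the temperature and max_tokens values shown are arbitrary illustrations. Each resulting dict uses only the `model`, `messages`, `temperature`, and `max_tokens` keyword arguments of `client.chat.completions.create`.

```python
from itertools import product
from typing import Dict, List, Sequence


def build_test_matrix(prompts: Sequence[str],
                      temperatures: Sequence[float],
                      max_tokens_options: Sequence[int],
                      model: str = "gpt-4o") -> List[Dict]:
    """Return one request-kwargs dict per (prompt, temperature, max_tokens) combination."""
    matrix = []
    for prompt, temperature, max_tokens in product(
            prompts, temperatures, max_tokens_options):
        matrix.append({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": temperature,
            "max_tokens": max_tokens,
        })
    return matrix


if __name__ == "__main__":
    cases = build_test_matrix(
        prompts=["Describe AI ethics.", "Describe AI ethics simply."],
        temperatures=[0.0, 0.7],       # illustrative values, not recommendations
        max_tokens_options=[50, 200],
    )
    print(len(cases))  # 2 prompts x 2 temperatures x 2 limits = 8
    # Each case could then be sent as: client.chat.completions.create(**case)
```

Because the grid grows multiplicatively, a few values per parameter is usually enough to start with.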

Troubleshooting

If you receive rate limit errors, implement exponential backoff retries. For unexpected outputs, verify prompt clarity and test with different temperature settings. Ensure environment variables are correctly set to avoid authentication failures.

python
import os
import time
from openai import OpenAI, RateLimitError

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompt = "Explain blockchain technology."

for attempt in range(3):
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )
        print(response.choices[0].message.content)
        break
    except RateLimitError as e:
        # Retry only on rate limits; other errors (e.g. a bad API key) surface immediately.
        print(f"Attempt {attempt + 1} failed: {e}")
        time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
else:
    print("Failed after 3 attempts.")
output
Blockchain is a distributed ledger technology that records transactions in linked, cryptographically secured blocks shared across a network of computers.
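
When many test calls need the same protection, the retry loop above can be factored into a reusable helper. The sketch below is one possible shape, not an official API: `call_with_backoff` and its parameters are hypothetical names, and the added random jitter is an assumption on top of the plain doubling delay shown earlier. The `sleep` argument is injectable so the helper can be exercised without real waiting.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(fn: Callable[[], T],
                      retry_on: Tuple[Type[BaseException], ...] = (Exception,),
                      max_attempts: int = 3,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn`, retrying on `retry_on` with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt; jitter spreads out simultaneous retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky() -> str:
        # Simulated call that fails twice, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise TimeoutError("simulated transient failure")
        return "ok"

    # No-op sleep keeps the demo instant.
    print(call_with_backoff(flaky, retry_on=(TimeoutError,), sleep=lambda s: None))
    # prints "ok"
```

With the OpenAI client, `fn` could be a lambda wrapping `client.chat.completions.create(...)` and `retry_on` could be `(RateLimitError,)`, so that authentication errors are not retried.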

Key Takeaways

  • Automate prompt testing by batching inputs and logging outputs for systematic comparison.
  • Use asynchronous calls and different models to explore prompt behavior variations efficiently.
  • Adjust parameters like temperature and max_tokens to test prompt robustness and output diversity.
  • Implement retry logic to handle API rate limits and ensure reliable testing.
  • Clear, specific prompts reduce ambiguity and improve test result consistency.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022