
How to build a safe AI agent

To build a safe AI agent, pair a capable model such as gpt-4o with a constrained system prompt, input validation, and output filtering. Add further safety layers, such as content moderation, user intent checks, and rate limiting, to reduce harmful or unintended behavior. No single layer is sufficient on its own, so combine them.

Prerequisites

Python 3.8+
OpenAI API key (free tier works)
pip install "openai>=1.0"

Setup

Install the OpenAI Python SDK and set your API key as an environment variable to securely authenticate requests.

bash

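# Quote the requirement so the shell does not treat ">" as a redirect
pip install "openai>=1.0"

# Placeholder value: replace with your own key
export OPENAI_API_KEY="your-api-key"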

Step by step

This example shows a minimal agent that wraps a gpt-4o call with keyword-based input validation and regex-based output filtering, so that unsafe requests and responses are blocked before they reach the user.

python

import os
from openai import OpenAI
import re

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

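# Simple input validation to reject unsafe prompts.
# This is a naive substring check: it also flags harmless words such as
# "hackathon", so treat it as a coarse first-pass filter only.
def is_input_safe(user_input):
    unsafe_keywords = ["hack", "exploit", "malware", "illegal"]
    return not any(word in user_input.lower() for word in unsafe_keywords)

# Basic output filter to detect harmful content.
# Whole-word, case-insensitive matching: any response that merely mentions
# one of these words is blocked, so expect false positives.
def is_output_safe(output):
    unsafe_patterns = [r"\bviolence\b", r"\bhate\b", r"\bterrorism\b"]
    return not any(re.search(pattern, output, re.IGNORECASE) for pattern in unsafe_patterns)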

user_prompt = input("Enter your query for the AI agent: ")

if not is_input_safe(user_prompt):
    print("Input rejected due to unsafe content.")
else:
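    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            # Example system prompt (wording is illustrative): it constrains
            # the model before any user text is seen.
            {"role": "system", "content": "You are a helpful assistant. Refuse requests for harmful or illegal content."},
            {"role": "user", "content": user_prompt},
        ],
    )
    # message.content can be None, so fall back to an empty string.
    answer = response.choices[0].message.content or ""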

    if is_output_safe(answer):
        print("AI agent response:\n", answer)
    else:
        print("Response blocked due to unsafe content detected.")

output (illustrative; actual responses will vary)

Enter your query for the AI agent: What is the weather today?
AI agent response:
 The weather today is sunny with a high of 75°F.
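
The keyword lists above are only a coarse first pass. One way to add the content moderation layer mentioned at the top of this guide is to send text to a hosted moderation classifier. The following is a minimal sketch, assuming the OpenAI Moderations endpoint (`client.moderations.create`) and the `omni-moderation-latest` model. The helper name `is_flagged` and the fail-closed behavior are illustrative choices, not part of the original example.

```python
def is_flagged(client, text, model="omni-moderation-latest"):
    """Return True if the moderation endpoint flags `text`.

    `client` is expected to be an openai.OpenAI instance, or any object
    exposing the same `moderations.create` method (handy for testing).
    """
    try:
        response = client.moderations.create(model=model, input=text)
    except Exception:
        # Assumption: fail closed, i.e. treat an API error as "flagged"
        # rather than letting unchecked text through.
        return True
    # Each result carries a boolean `flagged` field.
    return any(result.flagged for result in response.results)
```

A possible use is to call `is_flagged(client, user_prompt)` before the chat completion and `is_flagged(client, answer)` after it, alongside or instead of the keyword checks.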

Common variations

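Common variations include asynchronous calls for higher throughput, streaming responses so that output can be moderated as it arrives, and switching to another safety-trained model such as claude-3-5-sonnet-20241022 (check the provider's model list for current IDs). Model-side safety training complements the runtime checks above; it does not replace them.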

python

import os
import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

system_prompt = "You are a helpful and safe assistant."
user_input = "Explain how to build a safe AI agent."

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system=system_prompt,
    messages=[{"role": "user", "content": user_input}]
)

print(response.content[0].text)

output

To build a safe AI agent, ensure you implement strict content filters, monitor user inputs, and use models with robust safety training like Claude 3.5 Sonnet.
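
The streaming variation mentioned above can be sketched without any API call. The hypothetical helper `moderated_stream` below takes any iterable of text chunks and stops as soon as the accumulated text matches one of the regex patterns from the earlier output filter. It is a sketch of the idea, not a complete moderation system.

```python
import re

# Same illustrative patterns as the earlier output filter.
UNSAFE_PATTERNS = [r"\bviolence\b", r"\bhate\b", r"\bterrorism\b"]

def moderated_stream(chunks, patterns=UNSAFE_PATTERNS):
    """Yield text chunks until the accumulated text matches an unsafe pattern."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    seen = ""
    for chunk in chunks:
        # Check the accumulated text so a word split across chunks is still
        # detected once its final piece arrives.
        seen += chunk
        if any(p.search(seen) for p in compiled):
            yield "\n[response stopped: unsafe content detected]"
            return
        yield chunk

# Usage with a hard-coded list standing in for a real stream.
for piece in moderated_stream(["Safe ", "text ", "only."]):
    print(piece, end="")
```

With the OpenAI SDK, the chunks would typically come from a `stream=True` chat completion, reading `chunk.choices[0].delta.content` and skipping `None` values. Text that has already been yielded cannot be recalled, so this limits exposure rather than preventing it.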

Troubleshooting

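If the agent returns unsafe or biased content, tighten the system prompt and add stricter output filters, such as a moderation check on every response.
If API calls fail, verify that the OPENAI_API_KEY (or ANTHROPIC_API_KEY) environment variable is set in the shell that runs the script, and check network connectivity.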
For rate limit errors, implement exponential backoff and limit request frequency.
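
The exponential backoff suggested above could look like the following minimal sketch. The helper name `call_with_backoff` and its parameters are hypothetical; the `sleep` argument is injectable only so the helper can be exercised without real waiting.

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(), retrying with exponential backoff and jitter on failure."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # The delay ceiling doubles each attempt (1s, 2s, 4s, ...),
            # capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random time up to the ceiling so that
            # many clients do not retry in lockstep.
            sleep(random.uniform(0, delay))
```

With the OpenAI SDK, one option is to wrap the request in a lambda and pass `retry_on=(openai.RateLimitError,)` so that only rate limit errors are retried and other failures surface immediately.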

Key Takeaways

Always validate user inputs to block unsafe or malicious queries before sending to the model.
Filter and moderate model outputs to prevent harmful or biased responses from reaching users.
Use models with strong safety training and combine prompt engineering with runtime safety checks.
Implement rate limiting and error handling to maintain reliable and secure AI agent operation.
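
The rate limiting in the last takeaway can be sketched as a per-user sliding window. The class name `SlidingWindowLimiter` and its parameters are hypothetical, and the state lives in memory, so this assumes a single-process agent.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `max_calls` per `window` seconds for each key (e.g. a user ID)."""

    def __init__(self, max_calls, window, clock=time.monotonic):
        self.max_calls = max_calls
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without waiting
        self._calls = {}    # key -> deque of call timestamps

    def allow(self, key):
        now = self.clock()
        timestamps = self._calls.setdefault(key, deque())
        # Drop timestamps that have fallen out of the window.
        while timestamps and now - timestamps[0] >= self.window:
            timestamps.popleft()
        if len(timestamps) >= self.max_calls:
            return False  # over the limit: reject without recording the call
        timestamps.append(now)
        return True
```

An agent would call `limiter.allow(user_id)` before each model request and return a "too many requests" message when it yields `False`.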

Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022
