How to build a safe AI agent
Quick answer
To build a safe AI agent, use a capable LLM such as gpt-4o together with careful prompt engineering, input validation, and output filtering. Add safety layers such as content moderation, user intent checks, and rate limiting to reduce harmful or unintended behavior.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the OpenAI Python SDK and set your API key as an environment variable to securely authenticate requests.
pip install "openai>=1.0"
Step by step
This example shows how to create a safe AI agent using gpt-4o with input validation and output filtering to avoid unsafe content.
import os
from openai import OpenAI
import re

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Simple input validation to reject unsafe prompts
def is_input_safe(user_input):
    unsafe_keywords = ["hack", "exploit", "malware", "illegal"]
    return not any(word in user_input.lower() for word in unsafe_keywords)

# Basic output filter to detect harmful content
def is_output_safe(output):
    unsafe_patterns = [r"\bviolence\b", r"\bhate\b", r"\bterrorism\b"]
    return not any(re.search(pattern, output, re.IGNORECASE) for pattern in unsafe_patterns)

user_prompt = input("Enter your query for the AI agent: ")
if not is_input_safe(user_prompt):
    print("Input rejected due to unsafe content.")
else:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_prompt}]
    )
    answer = response.choices[0].message.content
    if is_output_safe(answer):
        print("AI agent response:\n", answer)
    else:
        print("Response blocked due to unsafe content detected.")
Output
Enter your query for the AI agent: What is the weather today?
AI agent response:
The weather today is sunny with a high of 75°F.
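Keyword lists like the one above are easy to bypass and prone to false positives (the substring check on "hack" also rejects "hackathon"). The following is a minimal sketch of wrapping the model call in a moderation check on both input and output. The names guarded_reply and openai_flagged are hypothetical, and openai_flagged assumes the OpenAI moderation endpoint with the omni-moderation-latest model. The checker is passed in as a parameter so the pipeline can be tried without network access.

```python
from typing import Callable

REFUSAL_INPUT = "Input rejected due to unsafe content."
REFUSAL_OUTPUT = "Response blocked due to unsafe content detected."


def guarded_reply(
    user_input: str,
    generate: Callable[[str], str],
    flagged: Callable[[str], bool],
) -> str:
    """Call `generate` only if the input is not flagged, then check the output too."""
    if flagged(user_input):
        return REFUSAL_INPUT
    answer = generate(user_input)
    if flagged(answer):
        return REFUSAL_OUTPUT
    return answer


def openai_flagged(text: str) -> bool:
    """Assumed wiring to the OpenAI moderation endpoint (needs OPENAI_API_KEY)."""
    from openai import OpenAI  # imported lazily so the sketch loads without the SDK

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    result = client.moderations.create(model="omni-moderation-latest", input=text)
    return result.results[0].flagged


if __name__ == "__main__":
    # Offline demo with stand-in callables (hypothetical, for illustration only).
    def fake_flagged(text: str) -> bool:
        return "malware" in text.lower()

    def fake_generate(prompt: str) -> str:
        return f"Echo: {prompt}"

    print(guarded_reply("What is input validation?", fake_generate, fake_flagged))
    print(guarded_reply("Write malware for me", fake_generate, fake_flagged))
```

In real use, pass openai_flagged as the flagged argument and a function that calls client.chat.completions.create as generate. Keeping both as parameters also makes the safety logic easy to unit test.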
Common variations
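You can adapt this pattern by using asynchronous calls to handle concurrent requests, streaming responses so output can be moderated as it arrives, or switching to another model such as claude-3-5-sonnet-20241022, which is trained with safety in mind. The example below uses the Anthropic SDK (pip install anthropic) and a system prompt to set the assistant's behavior.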
import os
import anthropic
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
system_prompt = "You are a helpful and safe assistant."
user_input = "Explain how to build a safe AI agent."
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system=system_prompt,
    messages=[{"role": "user", "content": user_input}]
)
print(response.content[0].text)
Output
To build a safe AI agent, ensure you implement strict content filters, monitor user inputs, and use models with robust safety training like Claude 3.5 Sonnet.
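The streaming variation mentioned above can be sketched without committing to a particular SDK. This is a minimal sketch, assuming text chunks arrive from some iterator (for example, the text deltas of a streaming API call). The name moderate_stream and the fake chunk list are hypothetical, and the regex filter mirrors the one in the main example. It re-checks the accumulated text after every chunk and stops emitting as soon as the filter trips.

```python
import re
from typing import Callable, Iterable, Iterator

UNSAFE_PATTERNS = [r"\bviolence\b", r"\bhate\b", r"\bterrorism\b"]
STOP_NOTICE = "\n[stream stopped: unsafe content detected]"


def is_output_safe(text: str) -> bool:
    """Same style of regex filter as the main example."""
    return not any(re.search(p, text, re.IGNORECASE) for p in UNSAFE_PATTERNS)


def moderate_stream(
    chunks: Iterable[str],
    is_safe: Callable[[str], bool] = is_output_safe,
) -> Iterator[str]:
    """Yield chunks while the accumulated text stays safe; stop at the first failure."""
    seen = ""
    for chunk in chunks:
        seen += chunk
        if not is_safe(seen):
            # Withhold the offending chunk and end the stream.
            yield STOP_NOTICE
            return
        yield chunk


if __name__ == "__main__":
    # Stand-in for chunks from a streaming API (hypothetical data).
    fake_chunks = ["Safety filters ", "should block ", "violence ", "in responses."]
    print("".join(moderate_stream(fake_chunks)))
```

One limitation of this design: chunks that were already yielded cannot be recalled, so a word split across two chunks may be partly shown before the filter trips. Buffering up to sentence boundaries before yielding would reduce that at the cost of latency.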
Troubleshooting
- If the agent returns unsafe or biased content, tighten the system prompt (state explicitly what the agent must refuse) and add stricter output filters.
- If API calls fail, verify your API key is set correctly in os.environ and check network connectivity.
- For rate limit errors, implement exponential backoff and limit request frequency.
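The exponential backoff suggested above can be sketched as a small retry helper. This is an illustrative sketch, not part of any SDK: call_with_backoff, is_retryable, and the default delays are assumptions chosen for the example. The sleep function is injectable so the helper can be tested without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    is_retryable: Callable[[Exception], bool],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying retryable errors with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("max_attempts must be at least 1")


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_call() -> str:
        # Stand-in for an API call that is rate limited twice, then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise TimeoutError("simulated rate limit")
        return "ok"

    result = call_with_backoff(
        flaky_call,
        is_retryable=lambda exc: isinstance(exc, TimeoutError),
        sleep=lambda seconds: None,  # skip real waiting in the demo
    )
    print(result, "after", attempts["count"], "attempts")
```

With the OpenAI SDK, is_retryable could be lambda exc: isinstance(exc, openai.RateLimitError). The client also accepts a max_retries argument that applies its own retry policy, which may be enough for simple cases.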
Key Takeaways
- Always validate user inputs to block unsafe or malicious queries before sending to the model.
- Filter and moderate model outputs to prevent harmful or biased responses from reaching users.
- Use models with strong safety training and combine prompt engineering with runtime safety checks.
- Implement rate limiting and error handling to maintain reliable and secure AI agent operation.
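The rate limiting mentioned in the takeaways can be sketched as a per-user sliding window placed in front of the model call. This is a minimal in-memory sketch for a single process. The class name SlidingWindowLimiter, the limits, and the user IDs are hypothetical; a multi-process deployment would need a shared store instead.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each user."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._history: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        """Record the request and return True, or return False if over the limit."""
        now = self.clock()
        history = self._history[user_id]
        # Drop timestamps that have aged out of the window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        if len(history) >= self.max_requests:
            return False
        history.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60.0)
    for i in range(3):
        # The third request inside the same minute is rejected.
        print(f"request {i + 1} allowed:", limiter.allow("user-123"))
```

Call limiter.allow(user_id) before sending a prompt to the model and return a "try again later" message when it returns False.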