How-to · Intermediate · 3 min read

How to implement output filtering

Quick answer
Implement output filtering by validating and sanitizing AI model responses with rule-based checks, regular expressions, or a secondary AI model that detects harmful or manipulated content. Remove or flag suspicious outputs before presenting them to users.
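As a minimal sketch of the rule-based approach, a regex filter might look like the following. The patterns are illustrative examples, not an exhaustive blocklist:

```python
import re

# Illustrative patterns for common injection payloads (not exhaustive)
UNSAFE_PATTERNS = [
    re.compile(r"<script\b", re.IGNORECASE),         # embedded HTML script tags
    re.compile(r"\brm\s+-rf\b"),                     # destructive shell commands
    re.compile(r";\s*DROP\s+TABLE", re.IGNORECASE),  # SQL injection fragments
]

def is_output_safe(text: str) -> bool:
    """Return True only if no unsafe pattern matches the model output."""
    return not any(pattern.search(text) for pattern in UNSAFE_PATTERNS)

print(is_output_safe("Hello! How can I help?"))            # True
print(is_output_safe("Sure: <script>alert(1)</script>"))   # False
```

Compiled patterns make the check cheap enough to run on every response.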

Prerequisites

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the openai Python package and set your API key as an environment variable for secure access.

bash
pip install "openai>=1.0"

Step by step

This example shows how to call gpt-4o to generate text and then apply output filtering using simple keyword blocking to prevent prompt injection or harmful content.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Keywords that suggest injected commands or unsafe markup
blocked_keywords = ["\"; DROP TABLE", "rm -rf", "sudo", "eval(", "<script>"]

def output_filter(text: str) -> bool:
    """Return True if the output is safe, False if it should be blocked."""
    lower_text = text.lower()
    return not any(keyword.lower() in lower_text for keyword in blocked_keywords)

# Generate text from the model
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a safe greeting message."}]
)
# message.content can be None (e.g. for refusals), so fall back to an empty string
output_text = response.choices[0].message.content or ""

# Apply output filtering
if output_filter(output_text):
    print("Filtered output:", output_text)
else:
    print("Output blocked due to unsafe content.")
output
Filtered output: Hello! How can I assist you today?

Common variations

Output filtering can be enhanced by using:

  • Regular expressions for pattern matching.
  • Secondary AI models trained to detect prompt injection or harmful content.
  • Asynchronous calls for real-time filtering in streaming outputs.
  • Different models like claude-3-5-sonnet-20241022 with similar filtering logic.
python
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

blocked_phrases = ["rm -rf", "sudo", "<script>"]

def is_safe(text: str) -> bool:
    return not any(phrase in text.lower() for phrase in blocked_phrases)

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": "Generate a safe response."}]
)

output = message.content[0].text

if is_safe(output):
    print("Safe output:", output)
else:
    print("Blocked unsafe output.")
output
Safe output: Hello! I'm here to help you with your questions.
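The streaming variation can be sketched as a generator that accumulates chunks into a buffer and re-checks the whole buffer before emitting each piece, so a blocked phrase split across chunk boundaries is still caught. Here a plain list (`fake_stream`) stands in for an API stream; it is not part of either SDK:

```python
from typing import Iterable, Iterator

blocked_phrases = ["rm -rf", "<script>"]

def filter_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield chunks only while the accumulated text stays safe."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Scan the full buffer so phrases split across chunks are detected
        if any(phrase in buffer.lower() for phrase in blocked_phrases):
            yield "[output blocked]"
            return
        yield chunk

# Stand-in for a streamed API response
fake_stream = ["Hello", ", how ", "can I help?"]
print("".join(filter_stream(fake_stream)))  # Hello, how can I help?
```

Note that text emitted before a phrase completes has already reached the user; stricter designs hold chunks back in a small lag buffer before releasing them.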

Troubleshooting

If your output filtering blocks legitimate content, refine your keyword list or use more sophisticated NLP techniques like semantic similarity checks. If harmful content bypasses filters, consider layered filtering with multiple detection methods or human review for high-risk applications.
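Layered filtering can be sketched as a chain of independent checks where output passes only if every layer approves. The two layers below are illustrative; in practice one layer might wrap a secondary model or a semantic-similarity check:

```python
import re
from typing import Callable, List

Check = Callable[[str], bool]  # each layer returns True when the text passes

def keyword_layer(text: str) -> bool:
    return not any(k in text.lower() for k in ("rm -rf", "sudo"))

def regex_layer(text: str) -> bool:
    return re.search(r"<script\b", text, re.IGNORECASE) is None

LAYERS: List[Check] = [keyword_layer, regex_layer]

def passes_all_layers(text: str) -> bool:
    """Block if any layer rejects, so the filter fails closed."""
    return all(layer(text) for layer in LAYERS)

print(passes_all_layers("Here is a friendly greeting."))  # True
print(passes_all_layers("Try sudo rm -rf /"))             # False
```

Because each layer is just a callable, new detection methods can be appended to `LAYERS` without touching the dispatch logic.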

Key Takeaways

  • Use rule-based keyword or pattern matching to block suspicious outputs effectively.
  • Leverage secondary AI models for more nuanced detection of prompt injection attempts.
  • Always validate and sanitize model outputs before exposing them to end users.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022