How-to · Beginner · 3 min read

How to filter harmful outputs from an LLM

Quick answer
Use the moderation endpoints provided by LLM APIs (such as OpenAI's) to detect and block harmful outputs. Implement post-processing filters or guardrails that analyze model responses and reject or sanitize unsafe content before presenting it to users.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quote the version spec so your shell does not treat > as a redirect)

Setup

Install the openai Python package and set your API key as an environment variable.

  • Run pip install openai to install the SDK.
  • Set your API key in your shell: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows).
bash
pip install openai

Step by step

This example shows how to generate text with gpt-4o-mini and then filter the output using OpenAI's moderation endpoint to detect harmful content. If the output is flagged, it is rejected.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Generate text from the model
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a short story about a hero."}]
)
output_text = response.choices[0].message.content
print("Model output:", output_text)

# Use moderation endpoint to check for harmful content
moderation_response = client.moderations.create(
    model="omni-moderation-latest",
    input=output_text
)

# Check if flagged
if moderation_response.results[0].flagged:
    print("Output flagged as harmful. Rejecting content.")
else:
    print("Output passed moderation.")
    print(output_text)
output
Model output: Once upon a time, a brave hero saved the village from danger.
Output passed moderation.
Once upon a time, a brave hero saved the village from danger.
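The boolean flagged field is coarse; a moderation result also carries per-category booleans (and confidence scores), which are useful for logging why content was rejected. A minimal sketch of extracting the flagged category names, shown here on a plain dict for illustration (with the real SDK you could obtain one via something like moderation_response.results[0].categories.model_dump(), assuming the pydantic-style SDK objects):

```python
def flagged_categories(categories: dict[str, bool]) -> list[str]:
    """Return the names of all categories marked True in a moderation result."""
    return sorted(name for name, hit in categories.items() if hit)

# Hypothetical category payload for illustration:
example = {"harassment": False, "violence": True, "self-harm": False}
print(flagged_categories(example))  # → ['violence']
```

Logging the category names alongside the rejection makes it much easier to tune thresholds or review false positives later.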

Common variations

You can use other models like claude-3-5-haiku-20241022. Anthropic does not expose a standalone moderation endpoint, so the example below pairs it with a custom keyword filter instead. Both SDKs also support async calls and streaming outputs.

python
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Generate text
response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=200,
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": "Tell me a joke."}]
)
output_text = response.content[0].text  # content is a list of blocks; take the first text block
print("Claude output:", output_text)

# Simple keyword filter example
harmful_keywords = ["hate", "violence", "terror"]
if any(word in output_text.lower() for word in harmful_keywords):
    print("Output contains harmful keywords. Rejecting.")
else:
    print("Output is safe.")
output
Claude output: Why did the scarecrow win an award? Because he was outstanding in his field!
Output is safe.
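Plain substring checks like the one above can misfire: "hate" is a substring of "whatever", so harmless text gets rejected. A word-boundary regex avoids that class of false positive. A sketch using the same keyword list as above:

```python
import re

harmful_keywords = ["hate", "violence", "terror"]
# \b anchors ensure keywords only match as whole words, case-insensitively.
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, harmful_keywords)) + r")\b",
    re.IGNORECASE,
)

def contains_harmful_keyword(text: str) -> bool:
    """True if any keyword appears as a whole word in text."""
    return bool(pattern.search(text))

print(contains_harmful_keyword("Whatever happens, stay calm."))  # → False
print(contains_harmful_keyword("They spread hate online."))      # → True
```

Keyword lists remain a blunt instrument either way; treat them as a supplementary layer on top of a provider moderation endpoint, not a replacement for one.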

Troubleshooting

  • If the moderation endpoint returns errors, verify your API key and model name.
  • False positives can occur; tune your filters or use human review for critical applications.
  • Ensure your environment variables are correctly set to avoid authentication failures.
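Transient errors (rate limits, timeouts) are also common in production; authentication failures are not worth retrying, but transient ones are. A generic retry-with-backoff sketch (the helper name and defaults are illustrative, not part of either SDK):

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff between failures.

    In production you would catch only transient exceptions
    (e.g. rate-limit or timeout errors), not auth failures.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(base_delay * (2 ** attempt))
```

You would wrap the moderation call, e.g. with_retries(lambda: client.moderations.create(model="omni-moderation-latest", input=output_text)).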

Key Takeaways

  • Use official moderation endpoints from your LLM provider to detect harmful content automatically.
  • Implement post-generation filters or keyword checks as an additional safety layer.
  • Always reject or sanitize flagged outputs before showing them to end users.
Verified 2026-04 · gpt-4o-mini, claude-3-5-haiku-20241022, omni-moderation-latest