How-to · Beginner · 3 min read

How to filter harmful outputs from an LLM

Quick answer
Use the moderation endpoints provided by LLM APIs (such as OpenAI's) to detect and block harmful outputs. Implement post-processing filters or guardrails that analyze model responses and reject or sanitize unsafe content before presenting it to users.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quote the version spec so your shell does not treat > as a redirect)

Setup

Install the openai Python package and set your API key as an environment variable.

  • Run pip install openai to install the SDK.
  • Set your API key in your shell: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows).
bash
pip install openai

Step by step

This example shows how to generate text with gpt-4o-mini and then filter the output using OpenAI's moderation endpoint to detect harmful content. If the output is flagged, it is rejected.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Generate text from the model
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a short story about a hero."}]
)
output_text = response.choices[0].message.content
print("Model output:", output_text)

# Use moderation endpoint to check for harmful content
moderation_response = client.moderations.create(
    model="omni-moderation-latest",
    input=output_text
)

# Check if flagged
if moderation_response.results[0].flagged:
    print("Output flagged as harmful. Rejecting content.")
else:
    print("Output passed moderation.")
    print(output_text)
output
Model output: Once upon a time, a brave hero saved the village from danger.
Output passed moderation.
Once upon a time, a brave hero saved the village from danger.
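The boolean flagged field is coarse; a moderation result also carries per-category booleans (and confidence scores), which are useful for logging why content was rejected. A minimal sketch of extracting the flagged category names, shown here on a plain dict for illustration (with the real SDK you could obtain one via something like moderation_response.results[0].categories.model_dump(), assuming the pydantic-style SDK objects):

```python
def flagged_categories(categories: dict[str, bool]) -> list[str]:
    """Return the names of all categories marked True in a moderation result."""
    return sorted(name for name, hit in categories.items() if hit)

# Hypothetical category payload for illustration:
example = {"harassment": False, "violence": True, "self-harm": False}
print(flagged_categories(example))  # → ['violence']
```

Logging the category names alongside the rejection makes it much easier to tune thresholds or review false positives later.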

Common variations

You can use other models like claude-3-5-haiku-20241022. Anthropic does not expose a standalone moderation endpoint, so the example below pairs it with a custom keyword filter instead. Both SDKs also support async calls and streaming outputs.

python
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Generate text
response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=200,
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": "Tell me a joke."}]
)
output_text = response.content[0].text  # content is a list of blocks; take the first text block
print("Claude output:", output_text)

# Simple keyword filter example
harmful_keywords = ["hate", "violence", "terror"]
if any(word in output_text.lower() for word in harmful_keywords):
    print("Output contains harmful keywords. Rejecting.")
else:
    print("Output is safe.")
output
Claude output: Why did the scarecrow win an award? Because he was outstanding in his field!
Output is safe.
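Plain substring checks like the one above can misfire: "hate" is a substring of "whatever", so harmless text gets rejected. A word-boundary regex avoids that class of false positive. A sketch using the same keyword list as above:

```python
import re

harmful_keywords = ["hate", "violence", "terror"]
# \b anchors ensure keywords only match as whole words, case-insensitively.
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, harmful_keywords)) + r")\b",
    re.IGNORECASE,
)

def contains_harmful_keyword(text: str) -> bool:
    """True if any keyword appears as a whole word in text."""
    return bool(pattern.search(text))

print(contains_harmful_keyword("Whatever happens, stay calm."))  # → False
print(contains_harmful_keyword("They spread hate online."))      # → True
```

Keyword lists remain a blunt instrument either way; treat them as a supplementary layer on top of a provider moderation endpoint, not a replacement for one.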

Troubleshooting

  • If the moderation endpoint returns errors, verify your API key and model name.
  • False positives can occur; tune your filters or use human review for critical applications.
  • Ensure your environment variables are correctly set to avoid authentication failures.
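Transient errors (rate limits, timeouts) are also common in production; authentication failures are not worth retrying, but transient ones are. A generic retry-with-backoff sketch (the helper name and defaults are illustrative, not part of either SDK):

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff between failures.

    In production you would catch only transient exceptions
    (e.g. rate-limit or timeout errors), not auth failures.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(base_delay * (2 ** attempt))
```

You would wrap the moderation call, e.g. with_retries(lambda: client.moderations.create(model="omni-moderation-latest", input=output_text)).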

Key Takeaways

  • Use official moderation endpoints from your LLM provider to detect harmful content automatically.
  • Implement post-generation filters or keyword checks as an additional safety layer.
  • Always reject or sanitize flagged outputs before showing them to end users.
Verified 2026-04 · gpt-4o-mini, claude-3-5-haiku-20241022, omni-moderation-latest