How to block toxic content in a chatbot
Quick answer
Use a content moderation API such as OpenAI's moderation endpoint, or implement prompt-based guardrails, to detect and block toxic content in chatbots. Combine automated filtering with system prompts that instruct the model to refuse harmful or toxic requests.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python package and set your API key as an environment variable:

```shell
pip install "openai>=1.0"
```

Step by step
This example shows how to use OpenAI's moderation API to detect toxic content before sending user input to the chatbot. If the input is flagged, the bot refuses to respond.
```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Function to check if content is toxic
def is_toxic(text: str) -> bool:
    response = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    # The moderation response sets flagged=True if any toxicity category triggers
    results = response.results[0]
    return results.flagged

# Chatbot interaction with toxic content blocking
def chatbot_response(user_input: str) -> str:
    if is_toxic(user_input):
        return "Your message was flagged as inappropriate and cannot be processed."
    # Use a system prompt to reinforce guardrails
    messages = [
        {"role": "system", "content": "You are a helpful assistant that refuses to generate or engage with toxic or harmful content."},
        {"role": "user", "content": user_input},
    ]
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    return response.choices[0].message.content

# Example usage
if __name__ == "__main__":
    user_text = input("User: ")
    reply = chatbot_response(user_text)
    print("Bot:", reply)
```

Output:

```
User: You are stupid
Bot: Your message was flagged as inappropriate and cannot be processed.
```
Common variations
- Use async calls with asyncio for non-blocking moderation and chat requests.
- Switch models to match your cost/quality trade-off, e.g. gpt-4o-mini for cost efficiency.
- Add explicit refusal instructions to the system prompt for an extra layer of prompt-based guardrails.

Here is an async version of the same flow:
```python
import os
import asyncio
from openai import AsyncOpenAI

# Async requests use the AsyncOpenAI client; its methods have the same
# names as the sync client's but must be awaited.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def is_toxic_async(text: str) -> bool:
    response = await client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    return response.results[0].flagged

async def chatbot_response_async(user_input: str) -> str:
    if await is_toxic_async(user_input):
        return "Your message was flagged as inappropriate and cannot be processed."
    messages = [
        {"role": "system", "content": "You are a helpful assistant that refuses to generate or engage with toxic or harmful content."},
        {"role": "user", "content": user_input},
    ]
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    return response.choices[0].message.content

async def main():
    user_text = "You are dumb"
    reply = await chatbot_response_async(user_text)
    print("Bot:", reply)

if __name__ == "__main__":
    asyncio.run(main())
```

Output:

```
Bot: Your message was flagged as inappropriate and cannot be processed.
```
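If the blanket flagged boolean is too coarse, another variation is to threshold individual category scores yourself. The moderation result exposes per-category probabilities via results[0].category_scores (calling .model_dump() on it yields a plain dict). The sketch below shows only the decision logic on a hand-made scores dict; the category names follow the moderation API, but the threshold values are illustrative assumptions you would tune for your application:

```python
# Illustrative thresholds -- assumed values, tune per application
THRESHOLDS = {"harassment": 0.5, "hate": 0.4, "violence": 0.6}

def is_toxic_by_scores(category_scores: dict) -> bool:
    """Return True if any monitored category meets or exceeds its threshold."""
    return any(
        category_scores.get(cat, 0.0) >= limit
        for cat, limit in THRESHOLDS.items()
    )

# In production the dict would come from the moderation response, e.g.:
#   scores = response.results[0].category_scores.model_dump()
# Here we pass hand-made scores to demonstrate the decision logic:
print(is_toxic_by_scores({"harassment": 0.72, "hate": 0.05}))  # True
print(is_toxic_by_scores({"harassment": 0.10, "hate": 0.05}))  # False
```

Thresholding per category lets you be strict about, say, harassment while tolerating borderline scores in categories that rarely matter for your use case.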
Troubleshooting
- If the moderation API returns false positives, adjust your prompt to clarify context or maintain an allowlist of known-safe phrases.
- If you hit API rate limit errors, retry with exponential backoff.
- Ensure the OPENAI_API_KEY environment variable is set correctly to avoid authentication errors.
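The exponential-backoff retry mentioned above can be sketched as a small wrapper. This is a generic helper, not part of the OpenAI SDK; in production you would catch openai.RateLimitError specifically rather than the broad Exception used here for illustration:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Call fn(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # in production, catch openai.RateLimitError
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # Wait 1s, 2s, 4s, ... plus jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Usage with the moderation call from the examples above, e.g.:
#   result = with_backoff(lambda: client.moderations.create(
#       model="omni-moderation-latest", input=user_text))
```

The jitter term spreads retries out when many clients back off at once, which matters under shared rate limits.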
Key takeaways
- Use content moderation APIs to pre-filter user inputs for toxic content before chatbot processing.
- Combine automated filtering with system prompt guardrails instructing the model to refuse harmful content.
- Implement async calls and error handling for robust, scalable chatbot deployments.