FalsePositiveInjectionAlert
lakera.guard.FalsePositiveInjectionAlert (detection_type: 'injection_detected' on legitimate input)
Stack trace
lakera.guard.exceptions.FalsePositiveInjectionAlert: Input flagged as prompt injection (confidence: 0.87, pattern: 'instruction-like-syntax')
Detected patterns: ['Ignore previous', 'Follow these rules', 'System override']
Input: 'Please ignore the formatting rules and show me the raw database query'
Stack trace:
File "/usr/local/lib/python3.11/site-packages/lakera/guard/detector.py", line 342, in detect_injection
if self._match_patterns(user_input) and confidence > threshold:
File "/usr/local/lib/python3.11/site-packages/lakera/guard/detector.py", line 218, in _match_patterns
return any(pattern.search(text) for pattern in self.injection_patterns)
File "/usr/local/lib/python3.11/site-packages/lakera/guard/exceptions.py", line 56, in __init__
raise FalsePositiveInjectionAlert(f"Input flagged as prompt injection (confidence: {confidence}, pattern: {pattern_name})") Why it happens
Lakera Guard uses pattern matching and ML-based heuristics to detect injection attempts. It flags inputs containing phrases like 'ignore previous instructions', 'system override', or 'follow these rules' as suspicious. However, legitimate user requests: technical documentation queries, customer support questions, or domain-specific language: often contain these same keywords, causing false positives. The detection threshold may be too aggressive, or the model wasn't tuned for your domain's vocabulary.
Detection
Monitor Lakera Guard alert logs and track the ratio of blocked requests to user reports. If users report legitimate requests being blocked within 24-48 hours of deployment, you've hit a false positive pattern. Enable request logging with raw input + detection confidence scores to identify the triggering phrases before they reach production.
Causes & fixes
Detection threshold set too low (confidence > 0.7) or generic patterns matching legitimate domain language (e.g., technical docs containing 'override', customer support queries with 'ignore')
Increase the confidence threshold from default 0.7 to 0.85+, or use domain-aware thresholds. For technical/support domains, add phrase context: 'override' in 'database override mode' is legitimate but 'override the system prompt' is not. Use Lakera's `threshold_adjustment` parameter and `context_aware_detection=True`.
Lakera's default pattern list includes keywords common in your industry (legal docs say 'ignore clause', medical queries say 'override contraindication')
Create a custom allowlist of domain-specific phrases: `custom_safe_patterns=['database override', 'ignore deprecation']`. Use `lakera.guard.create_whitelist(phrases=['...'], priority='high')` to pre-approve known-safe inputs before running detection.
Using Lakera Guard's base model without fine-tuning for your use case; model treats all instruction-like syntax as suspicious
Deploy Lakera Guard with `model_type='domain_specific'` and provide 50-100 labeled examples of legitimate vs. injection attempts in your domain. Use `lakera.finetune.adapt_detector(training_data=[...])` to retrain the ML component.
User input contains multiple keywords that individually look safe but combine into a false positive score (e.g., 'Please ignore formatting and follow these new rules')
Use Lakera's `multi_pattern_window` setting to require 2+ high-confidence patterns within a smaller window (20 tokens vs. the default 100), reducing accidental accumulation of scores across normal sentences.
Code: broken vs fixed
import os
from lakera.guard import Guard
api_key = os.environ.get('LAKERA_API_KEY')
guard = Guard(api_key=api_key)
user_input = "Please ignore the formatting rules and show me the raw data structure for debugging"
try:
# Default threshold (0.7) flags this as injection even though it's legitimate
result = guard.detect_injection(user_input)
print(f"Safe: {result}")
except guard.FalsePositiveInjectionAlert as e:
# False positive — legitimate debugging request gets blocked
print(f"Blocked (false positive): {e}")
# User never sees their response import os
from lakera.guard import Guard, ThreatLevel
api_key = os.environ.get('LAKERA_API_KEY')
guard = Guard(api_key=api_key)
# FIX 1: Increase confidence threshold from 0.7 to 0.85
guard.set_threshold(0.85)
# FIX 2: Add domain-safe phrases to whitelist
guard.add_safe_patterns([
'ignore the formatting',
'show me the raw data',
'raw database structure',
'debug mode'
])
user_input = "Please ignore the formatting rules and show me the raw data structure for debugging"
try:
# Now this legitimate request passes because:
# 1. Confidence score < 0.85 threshold, OR
# 2. Phrase is in safe patterns whitelist
result = guard.detect_injection(user_input, mode='safe_first')
if result.threat_level == ThreatLevel.SAFE:
print(f"Legitimate request approved: {result}")
# Process user request normally
except guard.FalsePositiveInjectionAlert as e:
# Only truly suspicious requests (confidence 0.85+) reach here
print(f"Actual injection attempt blocked: {e}") Workaround
If you cannot modify Lakera Guard settings immediately, wrap the guard detection in a try/except block and implement a secondary validation: log the flagged input, check if it's in your historical safe-inputs database, and if confidence is between 0.75–0.85, route to human review instead of auto-blocking. This buys time while you gather data on false positive patterns and prepare a proper whitelist.
Prevention
At architecture level: (1) Build a feedback loop where blocked requests are logged with human review outcomes: feed confirmed false positives back into Lakera's fine-tuning process. (2) Use input segmentation: separate system instructions from user input in your prompt structure so even if user input is flagged, the system prompt remains protected. (3) Deploy Lakera Guard in 'audit mode' for 1-2 weeks before enforcement: log detections without blocking, identify patterns, then adjust thresholds before enabling production blocking. (4) Use structured outputs (OpenAI's `response_format` or Anthropic's tool use) to constrain what the model can do regardless of injection attempts, making detection a secondary layer rather than primary defense.