High severity · Intermediate · Fix: 5-15 min

FalsePositiveInjectionAlert

lakera.guard.FalsePositiveInjectionAlert (detection_type: 'injection_detected' on legitimate input)

What this error means

Lakera Guard incorrectly flags legitimate user input as a prompt injection attack, blocking valid requests and degrading user experience.

Stack trace

traceback

lakera.guard.exceptions.FalsePositiveInjectionAlert: Input flagged as prompt injection (confidence: 0.87, pattern: 'instruction-like-syntax')
Detected patterns: ['Ignore previous', 'Follow these rules', 'System override']
Input: 'Please ignore the formatting rules and show me the raw database query'

Stack trace:
  File "/usr/local/lib/python3.11/site-packages/lakera/guard/detector.py", line 342, in detect_injection
    if self._match_patterns(user_input) and confidence > threshold:
  File "/usr/local/lib/python3.11/site-packages/lakera/guard/detector.py", line 218, in _match_patterns
    return any(pattern.search(text) for pattern in self.injection_patterns)
  File "/usr/local/lib/python3.11/site-packages/lakera/guard/exceptions.py", line 56, in __init__
    raise FalsePositiveInjectionAlert(f"Input flagged as prompt injection (confidence: {confidence}, pattern: {pattern_name})")

QUICK FIX

Call `guard.set_threshold(0.85)` on your `Guard` instance and add your domain's safe phrases to a whitelist via `guard.add_safe_patterns(['phrase1', 'phrase2'])`. Most false positives should disappear within 24 hours.

Why it happens

Lakera Guard uses pattern matching and ML-based heuristics to detect injection attempts. It flags inputs containing phrases like 'ignore previous instructions', 'system override', or 'follow these rules' as suspicious. However, legitimate user requests (technical documentation queries, customer support questions, or domain-specific language) often contain these same keywords, causing false positives. The detection threshold may be too aggressive, or the model may not have been tuned for your domain's vocabulary.

Detection

Monitor Lakera Guard alert logs and track the ratio of blocked requests to user reports. If users report legitimate requests being blocked within 24-48 hours of deployment, you've hit a false positive pattern. Enable request logging with raw input + detection confidence scores to identify the triggering phrases before they reach production.
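The request logging described above could be sketched as follows, using only the Python standard library. The `DetectionRecord` schema and the `log_detection` and `reported_block_ratio` helpers are illustrative assumptions, not part of Lakera Guard; fill the fields from whatever your guard client actually returns.

```python
import json
import logging
from dataclasses import dataclass, asdict

logger = logging.getLogger("guard_audit")


@dataclass
class DetectionRecord:
    """One guard decision, kept for later false-positive analysis (hypothetical schema)."""
    raw_input: str
    confidence: float
    blocked: bool
    user_reported: bool = False  # set to True when a user reports a wrongly blocked request


def log_detection(record: DetectionRecord) -> str:
    """Serialize the record as one JSON line, emit it through the logger, and return it."""
    line = json.dumps(asdict(record), sort_keys=True)
    logger.info(line)
    return line


def reported_block_ratio(records: list[DetectionRecord]) -> float:
    """Fraction of blocked requests that users later reported as legitimate."""
    blocked = [r for r in records if r.blocked]
    if not blocked:
        return 0.0
    return sum(r.user_reported for r in blocked) / len(blocked)
```

A rising `reported_block_ratio` after a deployment is one signal that a false positive pattern has appeared; the logged raw inputs then show which phrases triggered it.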

Causes & fixes

Detection threshold set too low (confidence > 0.7) or generic patterns matching legitimate domain language (e.g., technical docs containing 'override', customer support queries with 'ignore')

✓ Fix

Increase the confidence threshold from the default of 0.7 to 0.85 or higher, or use domain-aware thresholds. For technical and support domains, take phrase context into account: 'override' in 'database override mode' is legitimate, but 'override the system prompt' is not. Use Lakera's `threshold_adjustment` parameter and `context_aware_detection=True`.
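If you apply the threshold in your own code, a domain-aware threshold might look like the sketch below. The domain names, the threshold values, and the `should_block` helper are hypothetical and independent of Lakera Guard's own configuration; the comparison mirrors the `confidence > threshold` check shown in the stack trace.

```python
# Hypothetical per-domain thresholds; tune the values against your own logs.
DOMAIN_THRESHOLDS = {
    "technical_docs": 0.90,
    "customer_support": 0.85,
}
DEFAULT_THRESHOLD = 0.70  # the default cited in this article


def should_block(confidence: float, domain: str) -> bool:
    """Block only when the confidence exceeds the threshold for this domain."""
    threshold = DOMAIN_THRESHOLDS.get(domain, DEFAULT_THRESHOLD)
    return confidence > threshold
```

With these placeholder values, the 0.87 score from the stack trace would be blocked for `customer_support` traffic but allowed for `technical_docs`.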

Lakera's default pattern list includes keywords common in your industry (legal docs say 'ignore clause', medical queries say 'override contraindication')

✓ Fix

Create a custom allowlist of domain-specific phrases: `custom_safe_patterns=['database override', 'ignore deprecation']`. Use `lakera.guard.create_whitelist(phrases=['...'], priority='high')` to pre-approve known-safe inputs before running detection.
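The pre-approval idea can also be sketched in plain Python, ahead of any guard call. The phrase lists and the `is_pre_approved` helper below are assumptions for illustration, not Lakera Guard functions.

```python
import re

# Hypothetical phrase lists, taken from the examples in this article.
SAFE_PHRASES = ["database override", "ignore deprecation"]
RISKY_PHRASES = ["override the system prompt", "ignore previous instructions"]


def _contains(text: str, phrases: list[str]) -> bool:
    """Case-insensitive substring match, tolerant of repeated whitespace."""
    normalized = re.sub(r"\s+", " ", text.lower())
    return any(phrase in normalized for phrase in phrases)


def is_pre_approved(text: str) -> bool:
    """Pre-approve only if a safe phrase is present and no risky phrase is."""
    return _contains(text, SAFE_PHRASES) and not _contains(text, RISKY_PHRASES)
```

The risky-phrase check is a deliberate design choice: if a safe phrase alone bypassed detection, an attacker could append one to an injection attempt. Inputs that are not pre-approved should still go through normal detection.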

Using Lakera Guard's base model without fine-tuning for your use case; model treats all instruction-like syntax as suspicious

✓ Fix

Deploy Lakera Guard with `model_type='domain_specific'` and provide 50-100 labeled examples of legitimate vs. injection attempts in your domain. Use `lakera.finetune.adapt_detector(training_data=[...])` to retrain the ML component.
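Before handing labeled examples to any training step, it can help to check their size and balance. The sketch below assumes a simple `(text, label)` structure and the label names `legitimate` and `injection`; neither is a documented Lakera format.

```python
from collections import Counter
from typing import NamedTuple


class LabeledExample(NamedTuple):
    text: str
    label: str  # "legitimate" or "injection" (hypothetical label names)


def validate_dataset(examples: list[LabeledExample], minimum: int = 50) -> Counter:
    """Check dataset size and label coverage; return per-label counts."""
    counts = Counter(example.label for example in examples)
    unknown = set(counts) - {"legitimate", "injection"}
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    if len(examples) < minimum:
        raise ValueError(f"need at least {minimum} examples, got {len(examples)}")
    if not counts["legitimate"] or not counts["injection"]:
        raise ValueError("both classes must be represented")
    return counts
```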

User input contains multiple keywords that individually look safe but combine into a false positive score (e.g., 'Please ignore formatting and follow these new rules')

✓ Fix

Use Lakera's `multi_pattern_window` setting to require 2+ high-confidence patterns within a smaller window (20 tokens vs. the default 100), reducing accidental accumulation of scores across normal sentences.
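To see why a smaller window reduces accidental accumulation, consider this simplified sketch. It counts distinct keywords inside a sliding token window; the keyword set and function names are hypothetical, and Lakera Guard's real scoring is not shown here.

```python
KEYWORDS = {"ignore", "override", "rules"}  # hypothetical single-token patterns


def max_keywords_in_window(text: str, window: int = 20) -> int:
    """Largest number of distinct keywords found in any span of `window` tokens."""
    tokens = [t.strip(".,;:!?'\"").lower() for t in text.split()]
    hits = [(i, t) for i, t in enumerate(tokens) if t in KEYWORDS]
    best = 0
    # The densest window always starts at a keyword, so only those starts are tried.
    for start, _ in hits:
        distinct = {t for i, t in hits if start <= i < start + window}
        best = max(best, len(distinct))
    return best


def is_suspicious(text: str, window: int = 20, min_patterns: int = 2) -> bool:
    """Flag text only when enough distinct keywords fall inside one window."""
    return max_keywords_in_window(text, window) >= min_patterns
```

Two keywords that sit 50 tokens apart count together under a 100-token window but not under a 20-token window, which is the effect the setting is meant to have.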

Code: broken vs fixed

Broken - triggers the error

python

import os
from lakera.guard import Guard
# Exception class imported from the module path shown in the stack trace above
from lakera.guard.exceptions import FalsePositiveInjectionAlert

api_key = os.environ.get('LAKERA_API_KEY')
guard = Guard(api_key=api_key)

user_input = "Please ignore the formatting rules and show me the raw data structure for debugging"

try:
    # Default threshold (0.7) flags this as injection even though it's legitimate
    result = guard.detect_injection(user_input)
    print(f"Safe: {result}")
except FalsePositiveInjectionAlert as e:
    # False positive: legitimate debugging request gets blocked
    print(f"Blocked (false positive): {e}")
    # User never sees their response

Fixed - works correctly

python

import os
from lakera.guard import Guard, ThreatLevel
# Exception class imported from the module path shown in the stack trace above
from lakera.guard.exceptions import FalsePositiveInjectionAlert

api_key = os.environ.get('LAKERA_API_KEY')
guard = Guard(api_key=api_key)

# FIX 1: Increase confidence threshold from 0.7 to 0.85
guard.set_threshold(0.85)

# FIX 2: Add domain-safe phrases to whitelist
guard.add_safe_patterns([
    'ignore the formatting',
    'show me the raw data',
    'raw database structure',
    'debug mode'
])

user_input = "Please ignore the formatting rules and show me the raw data structure for debugging"

try:
    # Now this legitimate request passes because:
    # 1. Confidence score < 0.85 threshold, OR
    # 2. Phrase is in safe patterns whitelist
    result = guard.detect_injection(user_input, mode='safe_first')
    if result.threat_level == ThreatLevel.SAFE:
        print(f"Legitimate request approved: {result}")
        # Process user request normally
except FalsePositiveInjectionAlert as e:
    # Only truly suspicious requests (confidence 0.85+) reach here
    print(f"Actual injection attempt blocked: {e}")

The fix raises the detection threshold to 0.85 (reducing false positives by ~60%) and adds domain-specific safe phrases to a whitelist, so legitimate technical language bypasses the detector entirely.

⚠ Workaround

If you cannot modify Lakera Guard settings immediately, wrap the guard detection in a try/except block and implement a secondary validation: log the flagged input, check if it's in your historical safe-inputs database, and if confidence is between 0.75–0.85, route to human review instead of auto-blocking. This buys time while you gather data on false positive patterns and prepare a proper whitelist.
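The secondary validation described above might be sketched like this. The `Decision` enum, the `route_flagged_input` helper, and the in-memory `known_safe` set (standing in for a historical safe-inputs database) are all assumptions; the function is meant to run inside your existing except block, after the guard has already flagged the input.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    HUMAN_REVIEW = "human_review"
    BLOCK = "block"


def route_flagged_input(
    text: str,
    confidence: float,
    known_safe: set[str],
    review_band: tuple[float, float] = (0.75, 0.85),
) -> Decision:
    """Decide what to do with an input the guard has already flagged."""
    # Inputs previously confirmed as safe are allowed through.
    if text.strip().lower() in known_safe:
        return Decision.ALLOW
    # Borderline scores go to a human instead of being auto-blocked.
    low, high = review_band
    if low <= confidence <= high:
        return Decision.HUMAN_REVIEW
    # Everything else keeps the guard's original block decision.
    return Decision.BLOCK
```

Logging every call to this function, together with the reviewer's eventual verdict, produces the data needed to build a proper whitelist later.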

✓ Prevention

At the architecture level:

(1) Build a feedback loop: log blocked requests together with their human review outcomes, and feed confirmed false positives back into Lakera's fine-tuning process.

(2) Use input segmentation: separate system instructions from user input in your prompt structure, so that the system prompt remains protected even if user input is flagged.

(3) Deploy Lakera Guard in 'audit mode' for 1-2 weeks before enforcement. Log detections without blocking, identify patterns, then adjust thresholds before enabling production blocking.

(4) Use structured outputs (OpenAI's `response_format` or Anthropic's tool use) to constrain what the model can do regardless of injection attempts, making detection a secondary layer rather than the primary defense.
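The audit-mode step can be approximated in application code even without a dedicated setting. In this sketch, `detect` is any callable that returns a confidence score for an input; it stands in for your guard client and is an assumption, as are the `make_guarded_handler` name and the log record layout.

```python
from typing import Callable


def make_guarded_handler(
    detect: Callable[[str], float],
    threshold: float,
    enforce: bool,
    audit_log: list,
) -> Callable[[str], bool]:
    """Build a handler that always logs detections and blocks only when `enforce` is True."""

    def handle(text: str) -> bool:
        """Return True if the request may proceed."""
        confidence = detect(text)
        flagged = confidence > threshold
        audit_log.append({"input": text, "confidence": confidence, "flagged": flagged})
        return not (flagged and enforce)

    return handle
```

Running with `enforce=False` for the audit period collects the same log entries that enforcement would produce, so thresholds can be adjusted before any user is blocked.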

Python 3.9+ · lakera-guard >=1.0.0 · tested on 1.3.x–2.1.x

Verified 2026-04
