Code Advanced medium · 6 min

Input validation before chain invocation

What you will learn

Validate and sanitize user inputs with Pydantic schemas before passing them to LangChain chains to prevent injection attacks and runtime errors.

Why this matters

An LLM's output is only as reliable as the input it receives. Invalid, malicious, or malformed inputs can cause chain failures, expose secrets through error messages, or enable prompt injection. Production systems should validate data before the chain ever sees it.

Skip if: you are building internal tool-to-tool pipelines where all inputs are generated by code you control and checked by a type checker, or prototype notebooks where you are the only user. However, the moment user input enters, even from a trusted internal API, validation becomes mandatory.

Explanation

Input validation in LangChain happens before invocation of the model, not during or after. You define a Pydantic model that describes the shape, types, and constraints of your input. Because an LCEL chain is a sequence of Runnables, you can place a validation step (optionally after a RunnablePassthrough) at the front of the chain, so that invoke() rejects bad data before it reaches your LLM.

Mechanically: (1) define a Pydantic BaseModel with fields and validators, (2) create a validation function that instantiates this model (raising ValidationError on failure), (3) wrap this function as a Runnable (piping a plain function into a chain with | coerces it to a RunnableLambda automatically), (4) compose it into your chain with the | operator before the prompt and LLM. When invoke() is called, the validation step runs first and either passes clean data downstream or raises an exception back to the caller.

Use this pattern whenever your chain accepts user input, API requests, or external data. It's especially critical for chains that call tools or generate SQL/code, where garbage input can cause security issues or system damage.

Analogy

Like a security checkpoint at an airport: you don't hand your passport to the gate agent and hope it's valid. You check it at the desk first. If it's fake or expired, the process stops there before you reach the plane.

Code

python

from pydantic import BaseModel, field_validator, ValidationError
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import re

MAX_QUERY_LENGTH = 500  # single source of truth for the length limit

class QueryInput(BaseModel):
    user_query: str

    @field_validator('user_query')
    @classmethod
    def validate_query_length(cls, v: str) -> str:
        v = v.strip()  # normalize first so padding does not count toward the limit
        if not v:
            raise ValueError('Query cannot be empty')
        if len(v) > MAX_QUERY_LENGTH:
            raise ValueError(f'Query exceeds {MAX_QUERY_LENGTH} characters')
        return v
    
    @field_validator('user_query')
    @classmethod
    def validate_no_sql_injection(cls, v):
        dangerous_patterns = [r'DROP\s+TABLE', r'DELETE\s+FROM', r'INSERT\s+INTO', r'UPDATE\s+']
        for pattern in dangerous_patterns:
            if re.search(pattern, v, re.IGNORECASE):
                raise ValueError('Query contains suspicious SQL syntax')
        return v

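def validate_input(data: dict) -> dict:
    """Validate raw input and return only the fields the prompt needs."""
    try:
        validated = QueryInput(**data)
    except ValidationError as e:
        # Surface only the first message; the full error text can echo raw input.
        raise ValueError(f'Input validation failed: {e.errors()[0]["msg"]}') from e
    return {'user_query': validated.user_query}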

prompt = ChatPromptTemplate.from_template(
    'Answer this question briefly: {user_query}'
)
llm = ChatOpenAI(model='gpt-4o-mini', temperature=0)
output_parser = StrOutputParser()

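# Piping a plain function into a chain coerces it to a RunnableLambda,
# so validation runs first and the prompt only ever sees clean data.
chain = (
    RunnablePassthrough()
    | validate_input
    | prompt
    | llm
    | output_parser
)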

try:
    result = chain.invoke({'user_query': 'What is the capital of France?'})
    print('Valid input result:')
    print(result)
except ValueError as e:
    print(f'Error: {e}')

try:
    result = chain.invoke({'user_query': ''})
except ValueError as e:
    print(f'\nCaught validation error: {e}')

try:
    result = chain.invoke({'user_query': 'DELETE FROM users; SELECT *'})
except ValueError as e:
    print(f'\nCaught injection attempt: {e}')

Output

Valid input result:
The capital of France is Paris.

Caught validation error: Input validation failed: Value error, Query cannot be empty

Caught injection attempt: Input validation failed: Value error, Query contains suspicious SQL syntax

What just happened?

The code created a Pydantic model with two validators: the first rejects empty or over-long queries, and the second rejects queries containing common destructive SQL statements. When you invoke the chain with a valid question, the validate_input function passes it through to the LLM. When you invoke it with an empty string, the first validator raises an error before the chain touches the LLM. When you invoke it with SQL injection syntax, the regex validator raises an error instead. In both failure cases the chain never spends an LLM call on bad data. Keep in mind that a keyword blocklist like this is a coarse filter, not a complete injection defense.

Common gotcha

Developers often put validation inside the prompt template or after the LLM, assuming the LLM will 'handle' bad input. It won't: by then you have already paid the token cost and exposed the model to the injection attempt. The validation function must run before the chain reaches the prompt. Also, don't swallow validation errors inside the chain. Let them propagate out of invoke() so that your API handler or FastAPI app layer can catch them and return a 400 response.

Error recovery

ValidationError

Pydantic raises this when a field fails validation. Root cause: the input dict doesn't match the model schema. Fix: ensure all required fields are present and match the types defined in BaseModel. Check your @field_validator decorators for typos.
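When diagnosing a ValidationError, it can help to look at which field failed and why rather than at the message alone. The sketch below assumes Pydantic v2 and uses a pared-down QueryInput with no custom validators; describe_errors is a hypothetical helper name.

```python
from pydantic import BaseModel, ValidationError

class QueryInput(BaseModel):
    user_query: str

def describe_errors(data: dict) -> list[tuple[str, str]]:
    """Return (field path, error type) pairs, or an empty list if data is valid."""
    try:
        QueryInput(**data)
        return []
    except ValidationError as e:
        # Each entry in e.errors() is a dict with 'loc', 'type', and 'msg' keys.
        return [(".".join(str(p) for p in err["loc"]), err["type"]) for err in e.errors()]

missing = describe_errors({})                      # required field absent
wrong_type = describe_errors({"user_query": 123})  # int is not coerced to str in v2
valid = describe_errors({"user_query": "hello"})
```

Here missing reports the type "missing" for user_query, and wrong_type reports "string_type", which separates a schema mismatch from a failure inside one of your own validators.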

ValueError in chain.invoke()

Raised by your validate_input function when it catches ValidationError. Root cause: bad user input. Fix: catch ValueError at the API boundary and return a 400 Bad Request with the error message to the user. Do not retry the chain or feed the rejected input back into it.
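Catching the error at the API boundary might look like the following framework-free sketch. handle_request is a hypothetical handler that returns a status code and body as a tuple, and run_chain is a stand-in for chain.invoke() so the example runs without an LLM; in a FastAPI app the same mapping would typically live in the route or an exception handler.

```python
from pydantic import BaseModel, Field, ValidationError

class QueryInput(BaseModel):
    user_query: str = Field(min_length=1, max_length=500)

def run_chain(data: dict) -> str:
    # Hypothetical stand-in for chain.invoke(): validates first, then "answers".
    try:
        validated = QueryInput(**data)
    except ValidationError as e:
        raise ValueError(f"Input validation failed: {e.errors()[0]['msg']}") from e
    return f"Answer to: {validated.user_query}"

def handle_request(payload: dict) -> tuple[int, dict]:
    """Return (HTTP status, JSON-style body) for one request."""
    try:
        return 200, {"answer": run_chain(payload)}
    except ValueError as e:
        # Bad input is a client error: report it and do not retry the chain.
        return 400, {"error": str(e)}

ok_status, ok_body = handle_request({"user_query": "What is LCEL?"})
bad_status, bad_body = handle_request({"user_query": ""})
```

The valid request yields status 200, while the empty query yields status 400 with the validation message in the body.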

Chain stops executing

If validation fails, the downstream LLM and parser never run. Root cause: this is intentional, because failed validation means the chain should not proceed. Fix: your error handling layer (FastAPI middleware, or a try-except in main) should catch the error and return it to the user.

Experienced dev note

The mental model that saves time: validation is both a security layer and an efficiency layer. Senior teams validate early because (1) you avoid paying for LLM tokens on garbage input, (2) you stop many malicious inputs before they reach the model, and (3) you fail fast with a clear error message instead of a confusing LLM response. In production, put validation in a separate Runnable class so you can unit-test it independently of the LLM. Also, Pydantic v2 validators use @field_validator, not the old @validator. Don't mix the two styles.
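Testing validation independently of the LLM could look like the sketch below. It keeps the validation logic in a plain function (wrapping it in a Runnable is a separate, thin step) and uses the standard unittest module. The not_blank validator and the test names are illustrative, and Pydantic v2 is assumed.

```python
import unittest
from pydantic import BaseModel, ValidationError, field_validator

class QueryInput(BaseModel):
    user_query: str

    @field_validator("user_query")
    @classmethod
    def not_blank(cls, v: str) -> str:
        v = v.strip()
        if not v:
            raise ValueError("Query cannot be empty")
        return v

def validate_input(data: dict) -> dict:
    # Pure function: no LLM and no network, so it is cheap to test.
    return QueryInput(**data).model_dump()

class ValidateInputTests(unittest.TestCase):
    def test_strips_whitespace(self):
        self.assertEqual(validate_input({"user_query": "  hi  "}), {"user_query": "hi"})

    def test_rejects_blank_query(self):
        with self.assertRaises(ValidationError):
            validate_input({"user_query": "   "})

    def test_rejects_missing_field(self):
        with self.assertRaises(ValidationError):
            validate_input({})

# Run the suite programmatically so the sketch works in a script or notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ValidateInputTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

None of these tests needs an API key or a mocked model, which is the main payoff of keeping validation separate from the chain.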

Check your understanding

Why would putting validation inside the prompt template (e.g., 'If the query is too long, reject it') be insufficient for production? What does that miss?

Answer hint

A correct answer explains that the LLM is non-deterministic and can be tricked to ignore instructions (prompt injection), and that you pay for LLM tokens regardless of whether the input is valid. Pre-chain validation prevents both.

VERSION Pydantic v2 (used in langchain 1.2.x) changed the validator syntax: use @field_validator instead of @validator. LangChain 0.3.x LCEL composes chains with the | operator and runs them with invoke(); older versions used chain.run(), which is now deprecated.

After input validation, learn how to structure complex prompts with ChatPromptTemplate to handle validated multi-field inputs and control LLM output format.
