Code Advanced hard · 8 min

Mocking LLM calls in tests

What you will learn

Replace real LLM API calls with controlled mock responses in LangGraph tests to avoid cost, latency, and flakiness caused by API failures.

Why this matters

Unless you intervene, LangGraph agents make real LLM calls during development and testing. Those calls cost money, add latency, and make tests flaky when the API is down. Mocking lets you test agent logic, state transitions, and error handling in isolation without touching external services, which is critical for CI/CD pipelines and fast local iteration.

Skip if: you are testing actual LLM behavior (e.g., prompt quality, model reasoning, token usage). In those cases, use a staging model or integration tests with a fixed budget. Also skip mocking for final acceptance testing against real models.

Explanation

Mocking LLM calls means replacing the actual ChatOpenAI (or similar) invocation with a function that returns a predetermined response. In LangGraph, this is done by patching the LLM's invoke or batch methods before your graph runs. Mechanically, you use Python's unittest.mock.patch or pytest-mock to intercept the LLM call, capture its arguments, and return a fake AIMessage with your chosen content. The graph executes normally, with no idea the LLM was mocked, and you control every response, error condition, and edge case. Use it in unit and integration tests where you are validating agent behavior (tool selection, state updates, routing logic), not LLM quality. Pair mocking with integration tests (monthly or pre-release) that run against real models to catch prompt drift.
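
The "intercept, capture, return" mechanics can be sketched with the standard library alone. FakeChatModel and FakeAIMessage below are hypothetical stand-ins for ChatOpenAI and AIMessage, used only so the sketch runs without LangChain installed or an API key; answer_question plays the role of a graph node.

```python
from dataclasses import dataclass
from unittest.mock import patch


@dataclass
class FakeAIMessage:
    """Hypothetical stand-in for langchain_core.messages.AIMessage."""
    content: str


class FakeChatModel:
    """Hypothetical stand-in for a chat model such as ChatOpenAI."""

    def invoke(self, messages):
        raise RuntimeError("real API call attempted")


def answer_question(question: str) -> str:
    """Plays the role of a graph node: builds a model and calls invoke."""
    llm = FakeChatModel()
    response = llm.invoke([question])
    return response.content


def run_with_mock():
    canned = FakeAIMessage(content="The answer is 42.")
    # Patch the method on the class so every instance returns the canned reply.
    with patch.object(FakeChatModel, "invoke", return_value=canned) as mock_invoke:
        result = answer_question("What is the meaning of life?")
        # call_args records the arguments of the most recent call.
        sent_messages = mock_invoke.call_args.args[0]
    return result, sent_messages, mock_invoke.call_count
```

Inspecting call_args is what lets a test assert on the prompt the node built, not only on what the node did with the reply.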

Analogy

Like a stunt double in film: the stunt double (mock) performs the dangerous jump (expensive API call) in a controlled environment (test). The real actor (production code) never knows the difference. When you need to film the final take (acceptance test), the real actor steps in.

Code

python

from unittest.mock import patch
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage
from typing import TypedDict

class AgentState(TypedDict):
    messages: list
    final_answer: str

def create_agent_graph():
    graph = StateGraph(AgentState)
    
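    def call_llm(state):
        # Building the client makes no network call, but it may still need
        # OPENAI_API_KEY to be set; a dummy value is enough once invoke is patched.
        llm = ChatOpenAI(model="gpt-4o", temperature=0)
        response = llm.invoke(state["messages"])
        return {"messages": state["messages"] + [response]}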
    
    def process_answer(state):
        last_msg = state["messages"][-1]
        return {"final_answer": last_msg.content}
    
    graph.add_node("llm", call_llm)
    graph.add_node("answer", process_answer)
    graph.add_edge(START, "llm")
    graph.add_edge("llm", "answer")
    graph.add_edge("answer", END)
    
    return graph.compile(checkpointer=MemorySaver())

def test_agent_with_mock_llm():
    graph = create_agent_graph()
    
    mock_response = AIMessage(content="The answer is 42.")
    
    with patch("langchain_openai.chat_models.ChatOpenAI.invoke", return_value=mock_response) as mock_invoke:
        result = graph.invoke(
            {"messages": [HumanMessage(content="What is the meaning of life?")]},
            config={"configurable": {"thread_id": "test_1"}}
        )
        
        assert result["final_answer"] == "The answer is 42."
        assert mock_invoke.called
        print(f"✓ Test passed. Mock was called {mock_invoke.call_count} time(s).")
        print(f"✓ Final answer: {result['final_answer']}")

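def test_agent_with_mock_error():
    graph = create_agent_graph()
    
    # Simulate an API failure: every call to invoke raises.
    with patch("langchain_openai.chat_models.ChatOpenAI.invoke", side_effect=Exception("API rate limited")):
        try:
            graph.invoke(
                {"messages": [HumanMessage(content="Test")]},
                config={"configurable": {"thread_id": "test_2"}}
            )
        except Exception as e:
            print(f"✓ Error handling test passed. Caught: {type(e).__name__}: {e}")
        else:
            # Without this branch the test would pass silently if nothing was raised.
            raise AssertionError("Expected the mocked LLM error to propagate")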

if __name__ == "__main__":
    test_agent_with_mock_llm()
    test_agent_with_mock_error()

Output

✓ Test passed. Mock was called 1 time(s).
✓ Final answer: The answer is 42.
✓ Error handling test passed. Caught: Exception: API rate limited

What just happened?

The code defined a simple LangGraph agent that calls an LLM. In the first test, we patched <code>ChatOpenAI.invoke</code> to return a fake response without making a real API call. The graph executed normally, received the mock response, processed it, and stored the final answer. We verified that the mock was called and printed its call count. In the second test, we patched the same method to raise an exception, simulating an API failure. The exception propagated out of <code>graph.invoke</code>, and the test caught it. No real OpenAI calls were made; no money was spent.

Common gotcha

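Patching at the wrong import path. This bites when you replace the class itself rather than one of its methods. If your graph module does from langchain_openai import ChatOpenAI and you then patch langchain_openai.ChatOpenAI, the patch has no effect on your graph: the module already holds its own reference to the real class, so the real API is still called. When replacing a name, always patch where the object is used, not where it's defined (that is, patch ChatOpenAI in the module that defines your graph node). Patching a method on the class is different: langchain_openai.ChatOpenAI and langchain_openai.chat_models.ChatOpenAI are the same class object, so patching invoke through either path affects every reference, whatever import style the graph uses. The method must still be the one the node actually calls: patching invoke does nothing for a node that calls batch.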
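
The difference between replacing a name and patching a method on the class can be checked with a small stdlib-only experiment. The in-memory modules fake_llm_lib (playing the library) and fake_agent (playing your graph module) are hypothetical and exist only for this sketch.

```python
import sys
import types
from unittest.mock import MagicMock, patch

# "fake_llm_lib" plays the library that defines the chat model class.
lib = types.ModuleType("fake_llm_lib")
exec(
    "class ChatModel:\n"
    "    def invoke(self, messages):\n"
    "        return 'real'\n",
    lib.__dict__,
)
sys.modules["fake_llm_lib"] = lib

# "fake_agent" plays your graph module; it binds its own reference on import.
agent = types.ModuleType("fake_agent")
sys.modules["fake_agent"] = agent
exec(
    "from fake_llm_lib import ChatModel\n"
    "def run():\n"
    "    return ChatModel().invoke(['hi'])\n",
    agent.__dict__,
)


def demo():
    results = {}
    # 1. Replace the class name where it is defined: the agent module still
    #    holds its own reference, so the real method runs.
    with patch("fake_llm_lib.ChatModel", MagicMock()):
        results["name_patched_in_library"] = agent.run()
    # 2. Replace the class name where it is used: the agent sees the mock.
    fake_cls = MagicMock()
    fake_cls.return_value.invoke.return_value = "mocked"
    with patch("fake_agent.ChatModel", fake_cls):
        results["name_patched_in_agent"] = agent.run()
    # 3. Patch the method on the class object: every reference is affected.
    with patch("fake_llm_lib.ChatModel.invoke", return_value="mocked"):
        results["method_patched_on_class"] = agent.run()
    return results
```

Only the first variant lets the real method run, which is the failure mode to watch for when a test unexpectedly reaches the network.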

Error recovery

AttributeError: 'MagicMock' object has no attribute 'content'

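Your code received a mock that lacks the attributes of a real message. A plain MagicMock creates attributes on access, so this error usually comes from a mock built with a restrictive spec that does not define <code>content</code>. Return an actual <code>AIMessage(content="...")</code>, for example <code>mock_invoke.return_value = AIMessage(content="text")</code>, so downstream nodes receive a real message object.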

AssertionError: assert False

The mock was never called, meaning the patch did not intercept the call. Verify that the patch target names the class and method the node actually invokes (a patch on invoke will not catch a node that calls batch). Use <code>print(mock_invoke.call_count)</code> to debug; if it's 0, your patch didn't intercept the call.

TypeError: invoke() got unexpected keyword arguments

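The mock is too strict: the replacement callable does not accept the arguments your code passes. This typically happens when <code>side_effect=</code> is a function with a narrower signature than the real method. Use <code>return_value=</code> instead of <code>side_effect=</code> for fixed responses, or give the <code>side_effect</code> function a <code>*args, **kwargs</code> signature. Better: use <code>MagicMock()</code> with no restrictions and set <code>.return_value</code> explicitly.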
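
A minimal sketch of the signature problem, using only unittest.mock. The helper names narrow_reply, flexible_reply, and call_like_a_node are hypothetical, and the config keyword is an assumption about how a caller might invoke the model.

```python
from unittest.mock import MagicMock


def narrow_reply(messages):
    """Too strict: rejects any extra keyword arguments the caller passes."""
    return f"echo: {messages[-1]}"


def flexible_reply(*args, **kwargs):
    """Accepts any call shape and reads the first positional argument."""
    messages = args[0]
    return f"echo: {messages[-1]}"


def call_like_a_node(mock_invoke):
    # Mimics a caller that passes a config keyword alongside the messages.
    return mock_invoke(["hi"], config={"tags": ["test"]})


def demo_signatures():
    strict = MagicMock(side_effect=narrow_reply)
    flexible = MagicMock(side_effect=flexible_reply)
    try:
        call_like_a_node(strict)
        strict_outcome = "ok"
    except TypeError:
        strict_outcome = "TypeError"
    return strict_outcome, call_like_a_node(flexible)
```

The mock forwards whatever it receives to the side_effect function, so the function's signature decides whether the call succeeds.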

Experienced dev note

Prefer mocking at the invoke/batch method level over replacing the ChatOpenAI class. Replacing the class only works if you patch the name in the module that uses it; beginners often patch it in langchain_openai instead and get confused when the graph still calls the real API. Also, use @pytest.mark.parametrize to run the same test with multiple mock responses (success, retry, malformed JSON, etc.). This gives you edge-case coverage without writing ten separate test functions. Finally, in CI/CD, split the suite with markers (e.g., pytest -m unit for mocked tests, pytest -m integration for real API calls) so you can run cheap mocked tests on every commit and expensive integration tests only on staging.
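
A sketch of the parametrize-plus-marker pattern, assuming pytest is installed. FakeChatModel, FakeAIMessage, and classify are hypothetical stand-ins for a real model and node, and the unit marker is a custom marker that would need to be registered in your pytest configuration.

```python
import json
from dataclasses import dataclass
from unittest.mock import patch

import pytest


@dataclass
class FakeAIMessage:
    """Hypothetical stand-in for langchain_core.messages.AIMessage."""
    content: str


class FakeChatModel:
    """Hypothetical stand-in for a chat model such as ChatOpenAI."""

    def invoke(self, messages):
        raise RuntimeError("real API call attempted")


def classify(question: str) -> str:
    """Hypothetical node logic: parse a JSON reply, fall back on bad output."""
    reply = FakeChatModel().invoke([question]).content
    try:
        return json.loads(reply)["label"]
    except (json.JSONDecodeError, KeyError):
        return "unknown"


@pytest.mark.unit  # custom marker (assumed); select with: pytest -m unit
@pytest.mark.parametrize(
    "reply, expected",
    [
        ('{"label": "billing"}', "billing"),  # well-formed reply
        ('{"other": 1}', "unknown"),          # valid JSON, missing key
        ("not json at all", "unknown"),       # malformed reply
    ],
)
def test_classify_handles_replies(reply, expected):
    canned = FakeAIMessage(content=reply)
    with patch.object(FakeChatModel, "invoke", return_value=canned):
        assert classify("Where is my invoice?") == expected
```

Each tuple becomes its own test case in the pytest report, so a failure points at the exact reply shape that broke the node.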

Check your understanding

If you mock ChatOpenAI.invoke to return different responses on successive calls (first call returns 'action', second returns 'final_answer'), how would your graph behave differently compared to mocking it to always return the same response? What would you need to do in the mock setup to test a multi-turn agent interaction?

Answer hint

A correct answer recognizes that mocking can use <code>side_effect</code> with a list to return different values per call, simulating a real agent loop. It would mention setting up the mock with <code>side_effect=[first_response, second_response]</code> and understanding that each invocation consumes one item from the list, mirroring the graph's multi-step execution.
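
The scripted-responses idea can be sketched as follows. FakeChatModel, FakeAIMessage, and run_agent_loop are hypothetical stand-ins for a real model and a looping graph; only the side_effect behavior comes from unittest.mock.

```python
from dataclasses import dataclass
from unittest.mock import patch


@dataclass
class FakeAIMessage:
    """Hypothetical stand-in for langchain_core.messages.AIMessage."""
    content: str


class FakeChatModel:
    """Hypothetical stand-in for a chat model such as ChatOpenAI."""

    def invoke(self, messages):
        raise RuntimeError("real API call attempted")


def run_agent_loop(question: str, max_turns: int = 5) -> list:
    """Hypothetical loop: call the model until it stops requesting an action."""
    llm = FakeChatModel()
    transcript = [question]
    for _ in range(max_turns):
        reply = llm.invoke(transcript)
        transcript.append(reply.content)
        if not reply.content.startswith("action:"):
            break
    return transcript


def run_scripted():
    # Each call to invoke consumes the next item in the list.
    scripted = [
        FakeAIMessage(content="action: search"),
        FakeAIMessage(content="final_answer: 42"),
    ]
    with patch.object(FakeChatModel, "invoke", side_effect=scripted) as mock_invoke:
        transcript = run_agent_loop("What is the meaning of life?")
    return transcript, mock_invoke.call_count
```

If the loop called the model a third time, the exhausted list would raise StopIteration, which is a useful signal that the agent took more turns than the test scripted.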

VERSION: Patch paths are specific to langchain >= 0.2.0. In langchain < 0.2.0, ChatOpenAI was at langchain.chat_models.openai.ChatOpenAI. Also, langgraph 0.2.x requires from langgraph.graph import START, END rather than string literals.

Test langgraph state persistence and recovery using mocked checkpoints: saving and replaying agent decisions from a frozen state without re-calling the LLM.
