Code Advanced hard · 8 min

Circuit breaker for external calls

What you will learn
Implement a circuit breaker pattern in LangGraph to fail fast when external services degrade, preventing cascading failures.

Why this matters

In production, external APIs fail, time out, or degrade. A circuit breaker stops your agent from hammering a broken service, saves costs, and surfaces real errors instead of hanging. This is the difference between a recoverable outage and a cascading system failure.

Skip if: Don't use a circuit breaker for synchronous, always-critical operations that must never fail silently (e.g., payment verification). Don't use it if the service is internal and you control its deployment: use retry logic instead. Don't use it for streaming responses where you need real-time data regardless of degradation.

Explanation

What it is: A circuit breaker is a state machine that wraps external calls. It has three states: CLOSED (normal operation), OPEN (failing fast without attempting calls), and HALF_OPEN (probing to see if the service recovered). When failure thresholds are exceeded, the breaker trips to OPEN, rejecting calls immediately. After a timeout, it enters HALF_OPEN to test the service.
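
The three states and their transitions can be written down as a small lookup table. This is a minimal sketch, not part of the lesson's implementation: the names `State`, `TRANSITIONS`, `next_state`, and the event labels are hypothetical, chosen only to mirror the description above.

```python
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

# (current state, event) -> next state. Event labels are illustrative names.
TRANSITIONS = {
    (State.CLOSED, "threshold_reached"): State.OPEN,
    (State.OPEN, "timeout_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "probe_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "probe_failed"): State.OPEN,
}

def next_state(state: State, event: str) -> State:
    # Any (state, event) pair not listed leaves the state unchanged.
    return TRANSITIONS.get((state, event), state)
```

Note that there is no direct path from OPEN back to CLOSED: recovery always goes through a HALF_OPEN probe.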

How it works mechanically: In LangGraph, you wrap a node that calls an external service with circuit breaker logic. The breaker tracks failure count and timestamps. Before each call, it checks the current state. If OPEN and the timeout hasn't elapsed, it raises an exception or returns a fallback immediately: no API call made. If the call fails, increment the counter; if it succeeds, reset the counter. When transitioning from OPEN to HALF_OPEN, allow a single test call. A successful test resets to CLOSED; a failed test resets the timer and stays OPEN.

When to use it: Use a circuit breaker when calling third-party APIs (LLM providers, search engines, payment gateways) that may degrade or fail. Combine it with retry logic (retries happen before the breaker opens) and fallbacks (which run when the breaker is open). This is essential in agent loops, where repeated external calls can amplify the impact of a downed service.
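
One way to layer retries, breaker, and fallback is sketched below, under stated assumptions: `retry_with_backoff` and `guarded_call` are hypothetical helpers, `breaker_call` stands for any callable with the shape of a breaker's `call` method, and only `ConnectionError` is treated as retryable. The retries run inside the breaker, so the breaker records one failure per exhausted retry sequence rather than one per attempt.

```python
import time

def retry_with_backoff(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return func()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # retries exhausted: the breaker counts this as one failure
            sleep(base_delay * 2 ** attempt)

def guarded_call(breaker_call, func, fallback, sleep=time.sleep):
    """Run the retried call through the breaker; return `fallback` if it is rejected or fails."""
    try:
        return breaker_call(lambda: retry_with_backoff(func, sleep=sleep))
    except (RuntimeError, ConnectionError):
        # RuntimeError: the breaker rejected the call; ConnectionError: retries ran out.
        return fallback
```

The `sleep` parameter exists only so the delay can be stubbed out in tests; in normal use the default `time.sleep` applies.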

Analogy

A circuit breaker is like an electrical circuit breaker in your home. When current spikes (repeated failures), the breaker flips to OFF (OPEN state), cutting power immediately instead of letting dangerous current flow through every outlet (nodes). After a cooldown, you manually check if the problem is fixed; if it is, you flip it back ON (CLOSED). If the problem persists, you flip it back OFF.

Code

Illustrative only: the external service is simulated, so no API key is needed, but the langgraph package must be installed
python
import time
from dataclasses import dataclass
from enum import Enum
from typing import TypedDict

from langgraph.graph import StateGraph, START, END

class CircuitState(Enum):
    CLOSED = "closed"        # normal operation: calls pass through
    OPEN = "open"            # failing fast: calls are rejected
    HALF_OPEN = "half_open"  # probing: one test call is allowed

@dataclass
class CircuitBreaker:
    failure_threshold: int = 3  # consecutive failures before the breaker opens
    timeout: float = 5.0        # seconds to stay OPEN before allowing a probe

    state: CircuitState = CircuitState.CLOSED
    failure_count: int = 0
    last_failure_time: float = 0.0

    def call(self, func, *args, **kwargs):
        # Monotonic clock: unaffected by system clock adjustments
        current_time = time.monotonic()

        if self.state == CircuitState.OPEN:
            if current_time - self.last_failure_time > self.timeout:
                # Cooldown elapsed: let this call through as a probe
                self.state = CircuitState.HALF_OPEN
            else:
                # Fail fast without touching the service
                raise RuntimeError(f"Circuit breaker is OPEN. Retry after {self.timeout}s.")

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = current_time
            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
            raise

        # Any success resets the count; a successful probe also closes the breaker
        self.failure_count = 0
        self.state = CircuitState.CLOSED
        return result

class AgentState(TypedDict):
    query: str
    result: str
    attempts: int

breaker = CircuitBreaker(failure_threshold=2, timeout=3.0)
call_count = 0

def unreliable_service(query: str) -> str:
    global call_count
    call_count += 1
    if call_count <= 2:
        raise ConnectionError("Service unavailable")
    return f"Response to: {query}"

def call_external_api(state: AgentState) -> AgentState:
    try:
        result = breaker.call(unreliable_service, state["query"])
        return {**state, "result": result, "attempts": state["attempts"] + 1}
    except RuntimeError as e:
        return {**state, "result": f"Fallback: Circuit breaker open. {str(e)}", "attempts": state["attempts"] + 1}
    except ConnectionError:
        return {**state, "result": "Fallback: Service failed", "attempts": state["attempts"] + 1}

def should_retry(state: AgentState) -> str:
    if state["attempts"] < 4 and "Fallback" in state["result"]:
        return "retry"
    return "end"

graph = StateGraph(AgentState)
graph.add_node("call_api", call_external_api)
graph.add_node("end_node", lambda x: x)

graph.add_edge(START, "call_api")
graph.add_conditional_edges("call_api", should_retry, {"retry": "call_api", "end": "end_node"})
graph.add_edge("end_node", END)

compiled_graph = graph.compile()

initial_state = {"query": "What is AI?", "result": "", "attempts": 0}
final_state = compiled_graph.invoke(initial_state)

print(f"Final result: {final_state['result']}")
print(f"Circuit breaker state: {breaker.state.value}")
print(f"Attempts: {final_state['attempts']}")
Output
Final result: Fallback: Circuit breaker open. Circuit breaker is OPEN. Retry after 3.0s.
Circuit breaker state: open
Attempts: 4

What just happened?

The code simulates an unreliable service that fails twice, which opens the circuit breaker (the threshold of 2 is met). On the third and fourth attempts, the breaker is OPEN and rejects the call immediately without invoking the service, raising a RuntimeError. The node catches the exception and returns a fallback message; the conditional edge then routes back to call_api until attempts reaches 4, at which point the graph ends. The breaker stays OPEN because the 3-second timeout hasn't elapsed.

Common gotcha

The most common mistake is misjudging when the breaker changes state. Developers often expect it to switch from OPEN to HALF_OPEN the moment the timeout elapses, but the transition only happens when a call is attempted after the timeout. If no calls arrive, the recorded state stays OPEN no matter how much time passes: the breaker is lazy, not proactive. Also, in a distributed system, each instance keeps its own breaker state; you need a shared backend (Redis, a database) to coordinate circuit state across replicas.
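
Because the transition is lazy, a dashboard that reads the stored state can show OPEN long after the cooldown has passed. A minimal sketch of a read-only helper that reports the state the breaker would act on is shown here; `effective_state` is a hypothetical name, and `now` must come from the same clock the breaker uses for `last_failure_time`.

```python
def effective_state(state: str, last_failure_time: float, timeout: float, now: float) -> str:
    """Return the state a lazy breaker would act on at time `now` (read-only)."""
    if state == "open" and now - last_failure_time > timeout:
        return "half_open"  # the next call would be let through as a probe
    return state
```

With the lesson's breaker, the call would look something like `effective_state(breaker.state.value, breaker.last_failure_time, breaker.timeout, now)`; the helper never mutates the breaker itself.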

Error recovery

RuntimeError (Circuit breaker is OPEN)
This is intentional: the breaker is protecting your system. Catch it in the node and return a fallback result, or add a delay and retry in a separate node after the timeout window expires.
ConnectionError (Service unavailable)
The underlying service failed. The circuit breaker will count this failure. If it's a transient error, the next attempt after the timeout enters HALF_OPEN and has a chance to succeed. If that probe fails, the timer restarts and the breaker stays OPEN for another timeout window.
Threshold not triggering
Ensure failure_threshold matches your tolerance. If you set threshold=100 but the service fails 3 times, the breaker won't open. Start conservative (threshold=3–5) and increase if you see too many false positives.
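
The "add a delay and retry in a separate node" option could look roughly like this. It is a sketch under assumptions: `seconds_until_probe` and `make_wait_node` are hypothetical helpers, the breaker object is assumed to expose `last_failure_time` and `timeout` attributes, and `clock` must be the same time source the breaker uses.

```python
import time

def seconds_until_probe(last_failure_time: float, timeout: float, now: float) -> float:
    """Remaining cooldown before an OPEN breaker will let a probe through."""
    return max(0.0, last_failure_time + timeout - now)

def make_wait_node(breaker, clock, sleep=time.sleep):
    """Build a node that sleeps out the cooldown, then passes state through unchanged."""
    def wait_for_cooldown(state: dict) -> dict:
        remaining = seconds_until_probe(breaker.last_failure_time, breaker.timeout, clock())
        if remaining > 0:
            # Small margin, because the breaker's elapsed-time check is strictly greater-than.
            sleep(remaining + 0.01)
        return state
    return wait_for_cooldown
```

Blocking a node with a sleep is only reasonable for short cooldowns; for longer ones, returning the fallback and ending the run is usually the safer choice.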

Experienced dev note

In production, you'll want a dedicated circuit breaker library like `pybreaker` instead of rolling your own: it handles edge cases like concurrent calls, synchronization, and metrics. (`tenacity` is a retry library, not a circuit breaker; use it for the retry side.) However, understanding the state machine here is critical: the circuit breaker is not a retry mechanism; it's a fast-fail mechanism. Combine it with exponential backoff retries *before* hitting the breaker. Also, monitor the breaker state and alert when it opens: an OPEN breaker is a signal that a downstream service is degraded, and your on-call needs to know. In LangGraph specifically, use the circuit breaker at the node level (as shown), not globally, so each external service can have its own tolerance level.
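
The advice to alert when the breaker opens can be sketched as a thin wrapper that compares the state before and after each call. Everything here is an assumption for illustration: `call_and_report` and `on_open` are hypothetical names, and the breaker is assumed to expose a `state` attribute and a `call` method like the lesson's class.

```python
import logging

logger = logging.getLogger("circuit_breaker")

def call_and_report(breaker, func, *args, on_open=None, **kwargs):
    """Run func through breaker.call and report any state change this call caused."""
    before = breaker.state
    try:
        return breaker.call(func, *args, **kwargs)
    finally:
        after = breaker.state
        if after != before:
            logger.warning("breaker state changed: %s -> %s", before, after)
            # Accept either an Enum member with .value or a plain string state.
            if on_open is not None and getattr(after, "value", after) == "open":
                on_open(breaker)
```

A failed probe goes OPEN, HALF_OPEN, then OPEN again within one call, so this wrapper does not report it; it only sees changes visible across a whole call.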

Check your understanding

Your agent calls an external LLM service through a circuit breaker with threshold=3 and timeout=10s. The service fails 3 times in 2 seconds, opening the breaker. Your agent immediately attempts a 4th call. What happens, and how long until the breaker allows a test call?

Show answer hint

A correct answer must explain that the 4th call immediately raises RuntimeError without invoking the service (fast-fail), and that the breaker allows a test call (HALF_OPEN) only once 10 seconds have elapsed since the last failure. The timeout is measured from the timestamp of the most recent failure (here, the 3rd failure, which is also the moment the breaker opened), not from the first failure. The rejected 4th call does not restart the timer.

VERSION Circuit breaker state management (using node state vs. closure variables) is compatible with langgraph 0.2.x. In 0.1.x, MessageGraph did not support arbitrary state; this example requires 0.2.x's StateGraph.
NEXT

Learn how to implement adaptive retry logic with exponential backoff: the complement to circuit breakers that safely retest failing services before the breaker opens.
