Circuit breaker for external calls
Why this matters
In production, external APIs fail, time out, or degrade. A circuit breaker stops your agent from hammering a broken service, saves costs, and surfaces real errors instead of hanging. This is the difference between a recoverable outage and a cascading system failure.
Explanation
What it is: A circuit breaker is a state machine that wraps external calls. It has three states: CLOSED (normal operation), OPEN (failing fast without attempting calls), and HALF_OPEN (probing to see if the service recovered). When failure thresholds are exceeded, the breaker trips to OPEN, rejecting calls immediately. After a timeout, it enters HALF_OPEN to test the service.
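The three states and their transitions can be sketched as a small lookup table. This is only an illustration of the state machine described above: the `State` enum, the event names, and `next_state` are hypothetical and not part of any library.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# (current state, event) -> next state. Event names are illustrative assumptions.
TRANSITIONS = {
    (State.CLOSED, "threshold_reached"): State.OPEN,
    (State.OPEN, "timeout_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "probe_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "probe_failed"): State.OPEN,
}


def next_state(state: State, event: str) -> State:
    """Return the next state, or stay in place if the event does not apply."""
    return TRANSITIONS.get((state, event), state)


# Walk one full trip-and-recover cycle.
state = State.CLOSED
for event in ["threshold_reached", "timeout_elapsed", "probe_succeeded"]:
    state = next_state(state, event)
    print(f"{event} -> {state.value}")
# threshold_reached -> open
# timeout_elapsed -> half_open
# probe_succeeded -> closed
```

Events that do not apply to the current state (for example, a timeout while CLOSED) leave the state unchanged.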
How it works mechanically: In LangGraph, you put the circuit breaker logic inside the node that calls the external service. The breaker tracks the failure count and the timestamp of the last failure. Before each call, it checks the current state. If the state is OPEN and the timeout hasn't elapsed, it raises an exception (or returns a fallback) immediately, and no API call is made. If the call fails, the breaker increments the counter; if it succeeds, it resets the counter. Once the timeout has elapsed, the next call moves the breaker from OPEN to HALF_OPEN and goes through as a single test call. A successful test returns the breaker to CLOSED; a failed test restarts the timer and puts the breaker back in OPEN.
When to use it: Use a circuit breaker when calling third-party APIs (LLM providers, search engines, payment gateways) that may degrade or fail. Combine it with retry logic (retries happen before the breaker opens) and fallbacks (which run when the breaker is open). This is essential in agent loops, where repeated external calls could amplify the impact of a downed service.
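One way to combine the three pieces is sketched below: retries with exponential backoff run while the breaker is still closed, and the fallback is returned as soon as the breaker rejects a call. `TinyBreaker`, `CircuitOpenError`, and `call_with_retry` are hypothetical names for this sketch; the breaker is deliberately stripped down (no HALF_OPEN recovery) to keep the focus on the interaction.

```python
import time


class CircuitOpenError(RuntimeError):
    """Hypothetical error type raised when the breaker rejects a call."""


class TinyBreaker:
    """Stripped-down breaker: counts consecutive failures, never recovers."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.failure_count = 0

    def call(self, func, *args):
        if self.failure_count >= self.failure_threshold:
            raise CircuitOpenError("circuit open")
        try:
            result = func(*args)
        except Exception:
            self.failure_count += 1
            raise
        self.failure_count = 0
        return result


def call_with_retry(breaker, func, *args, max_attempts=3, base_delay=0.01, fallback=None):
    """Retry with exponential backoff; stop at once when the breaker is open."""
    for attempt in range(max_attempts):
        try:
            return breaker.call(func, *args)
        except CircuitOpenError:
            break  # fail fast: retrying an open breaker is pointless
        except Exception:
            if attempt < max_attempts - 1:
                time.sleep(base_delay * 2 ** attempt)
    return fallback


service_calls = 0


def flaky_service(query: str) -> str:
    global service_calls
    service_calls += 1
    raise ConnectionError("service unavailable")


breaker = TinyBreaker(failure_threshold=3)
answer = call_with_retry(breaker, flaky_service, "q", max_attempts=5, fallback="cached answer")
print(answer, service_calls)  # cached answer 3
```

Although five attempts were allowed, the service is only hit three times: the fourth attempt is rejected by the breaker and the fallback is returned.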
Analogy
A circuit breaker is like an electrical circuit breaker in your home. When current spikes (repeated failures), the breaker flips to OFF (OPEN state), cutting power immediately instead of letting dangerous current flow through every outlet (nodes). After a cooldown, you manually check if the problem is fixed; if it is, you flip it back ON (CLOSED). If the problem persists, you flip it back OFF.
Code
```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class CircuitBreaker:
    failure_threshold: int = 3
    timeout: float = 5.0
    state: CircuitState = CircuitState.CLOSED
    failure_count: int = 0
    last_failure_time: float = 0.0

    def call(self, func, *args, **kwargs):
        current_time = time.time()
        if self.state == CircuitState.OPEN:
            if current_time - self.last_failure_time > self.timeout:
                # Timeout elapsed: let this call through as a probe.
                self.state = CircuitState.HALF_OPEN
            else:
                # Fail fast without touching the service.
                raise RuntimeError(f"Circuit breaker is OPEN. Retry after {self.timeout}s.")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = current_time
            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
            raise
        # A successful probe closes the circuit; any success resets the counter.
        if self.state == CircuitState.HALF_OPEN:
            self.state = CircuitState.CLOSED
        self.failure_count = 0
        return result


class AgentState(TypedDict):
    query: str
    result: str
    attempts: int


breaker = CircuitBreaker(failure_threshold=2, timeout=3.0)
call_count = 0


def unreliable_service(query: str) -> str:
    """Simulated external service that fails on its first two calls."""
    global call_count
    call_count += 1
    if call_count <= 2:
        raise ConnectionError("Service unavailable")
    return f"Response to: {query}"


def call_external_api(state: AgentState) -> AgentState:
    """Node that calls the service through the breaker and records a fallback on failure."""
    try:
        result = breaker.call(unreliable_service, state["query"])
        return {**state, "result": result, "attempts": state["attempts"] + 1}
    except RuntimeError as e:
        # Raised by the breaker itself: the service was not called.
        return {**state, "result": f"Fallback: Circuit breaker open. {e}", "attempts": state["attempts"] + 1}
    except ConnectionError:
        # Raised by the service: the call was made and failed.
        return {**state, "result": "Fallback: Service failed", "attempts": state["attempts"] + 1}


def should_retry(state: AgentState) -> str:
    # Loop back while the result is a fallback, up to 3 attempts in total.
    if state["attempts"] < 3 and "Fallback" in state["result"]:
        return "retry"
    return "end"


graph = StateGraph(AgentState)
graph.add_node("call_api", call_external_api)
graph.add_node("end_node", lambda x: x)
graph.add_edge(START, "call_api")
graph.add_conditional_edges("call_api", should_retry, {"retry": "call_api", "end": "end_node"})
graph.add_edge("end_node", END)
compiled_graph = graph.compile()

initial_state = {"query": "What is AI?", "result": "", "attempts": 0}
final_state = compiled_graph.invoke(initial_state)
print(f"Final result: {final_state['result']}")
print(f"Circuit breaker state: {breaker.state.value}")
print(f"Attempts: {final_state['attempts']}")
```

Output:

```
Final result: Fallback: Circuit breaker open. Circuit breaker is OPEN. Retry after 3.0s.
Circuit breaker state: open
Attempts: 3
```
What just happened?
The code simulates an unreliable service that fails on its first two calls, so the circuit breaker opens after the second failure (threshold of 2 met). On the third attempt, the breaker is OPEN and rejects the call immediately, raising a RuntimeError without invoking the service. The node's `except RuntimeError` branch catches this and writes a fallback message into the state; the conditional edge only decides whether to loop again. The breaker stays OPEN because the 3-second timeout hasn't elapsed.
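The run above never reaches HALF_OPEN. The recovery path can be sketched with a manually advanced clock, so no real waiting is needed. `FakeClock` and `ClockedBreaker` are hypothetical: the breaker follows the same logic as the `CircuitBreaker` in this section, with the clock passed in as an assumption made for the demo.

```python
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class FakeClock:
    """Manually advanced clock (assumption for the demo, replaces time.time)."""

    def __init__(self):
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


class ClockedBreaker:
    """Same state logic as the section's breaker, with an injectable clock."""

    def __init__(self, clock, failure_threshold=2, timeout=3.0):
        self.clock = clock
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0.0

    def call(self, func, *args, **kwargs):
        now = self.clock()
        if self.state == CircuitState.OPEN:
            if now - self.last_failure_time > self.timeout:
                self.state = CircuitState.HALF_OPEN  # allow a probe
            else:
                raise RuntimeError("Circuit breaker is OPEN.")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = now
            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
            raise
        if self.state == CircuitState.HALF_OPEN:
            self.state = CircuitState.CLOSED
        self.failure_count = 0
        return result


def down(query):
    raise ConnectionError("Service unavailable")


def up(query):
    return f"Response to: {query}"


clock = FakeClock()
breaker = ClockedBreaker(clock)

for _ in range(2):  # two failures trip the breaker
    try:
        breaker.call(down, "q")
    except ConnectionError:
        pass
print(breaker.state.value)    # open

clock.now += 3.5              # move past the 3-second timeout
print(breaker.state.value)    # open (nothing has called the breaker yet)

print(breaker.call(up, "q"))  # Response to: q (the probe succeeds)
print(breaker.state.value)    # closed
```

Had the probe failed instead, the except branch would have set the state back to OPEN and restarted the timer from the probe's timestamp.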
Common gotcha
The most common mistake is misunderstanding when the circuit breaker changes state. Developers often expect the breaker to switch from OPEN to HALF_OPEN as soon as the timeout expires, but it only transitions when a call is actually attempted after the timeout. If no call arrives after the timeout has elapsed, the stored state stays OPEN indefinitely: the breaker is lazy, not proactive. Keep this in mind if you read `breaker.state` for dashboards or alerts. Also, in a distributed system, each instance has its own breaker state; you need a shared backend (Redis, a database) to coordinate circuit state across replicas.
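The shared-backend idea can be sketched by moving the breaker's counters out of the object and into a store that every replica reads and writes. In the sketch below, `InMemoryStore` is a stand-in for Redis or a database, and both it and `SharedBreaker` are hypothetical; a real backend would also need atomic updates, which this load-modify-save version does not provide.

```python
import time


class InMemoryStore:
    """Stand-in for a shared backend such as Redis (hypothetical interface)."""

    def __init__(self):
        self._data = {}

    def load(self, key):
        return dict(self._data.get(key, {"failures": 0, "opened_at": None}))

    def save(self, key, record):
        self._data[key] = dict(record)


class SharedBreaker:
    """Breaker whose state lives in the store, not in the instance."""

    def __init__(self, store, key, failure_threshold=2, timeout=3.0, clock=time.monotonic):
        self.store = store
        self.key = key
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.clock = clock

    def call(self, func, *args):
        record = self.store.load(self.key)
        now = self.clock()
        opened_at = record["opened_at"]
        if opened_at is not None and now - opened_at <= self.timeout:
            raise RuntimeError("Circuit breaker is OPEN.")
        try:
            result = func(*args)
        except Exception:
            # NOTE: load-modify-save is not atomic; assumed acceptable for a sketch.
            record["failures"] += 1
            if record["failures"] >= self.failure_threshold:
                record["opened_at"] = now
            self.store.save(self.key, record)
            raise
        self.store.save(self.key, {"failures": 0, "opened_at": None})
        return result


def down(query):
    raise ConnectionError("Service unavailable")


store = InMemoryStore()
replica_a = SharedBreaker(store, "search-api")  # two "replicas" sharing one store
replica_b = SharedBreaker(store, "search-api")

for _ in range(2):  # failures seen only by replica A
    try:
        replica_a.call(down, "q")
    except ConnectionError:
        pass

try:
    replica_b.call(lambda q: "ok", "q")
    rejected = False
except RuntimeError:
    rejected = True
print("replica_b rejected:", rejected)  # replica_b rejected: True
```

Replica B never saw a failure itself, yet it fails fast because the open state is read from the shared record.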
Error recovery
RuntimeError (Circuit breaker is OPEN)
ConnectionError (Service unavailable)
Threshold not triggering
Experienced dev note
In production, you'll want a dedicated circuit breaker library such as `pybreaker` instead of rolling your own: such libraries handle edge cases like concurrent calls, synchronization, and metrics. Note that `tenacity` is a retry library, not a circuit breaker; it complements a breaker but does not replace one. Understanding the state machine here is still critical: the circuit breaker is not a retry mechanism; it's a fast-fail mechanism. Combine it with exponential backoff retries *before* hitting the breaker. Also, monitor the breaker state and alert when it opens: an OPEN breaker is a signal that a downstream service is degraded, and your on-call needs to know. In LangGraph specifically, scope the breaker to the node that calls a given service (as shown) rather than sharing one breaker across all external calls, so that each service can have its own tolerance level.
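As a rough sketch of two of these points, per-service breakers and alerting when one opens, the breaker below guards its counters with a lock and invokes a callback the first time it trips. `MonitoredBreaker` and its `on_open` hook are hypothetical names for illustration, not the API of any library; a production library would also limit HALF_OPEN to a single probe, which this sketch does not do.

```python
import threading
import time
from typing import Callable, Optional


class MonitoredBreaker:
    """Hypothetical breaker with locked state updates and an on-open hook."""

    def __init__(self, name: str, failure_threshold: int = 3, timeout: float = 5.0,
                 on_open: Optional[Callable[[str], None]] = None):
        self.name = name
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.on_open = on_open
        self._lock = threading.Lock()
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func, *args, **kwargs):
        with self._lock:
            opened_at = self._opened_at
            if opened_at is not None and time.monotonic() - opened_at <= self.timeout:
                raise RuntimeError(f"{self.name}: circuit breaker is OPEN")
        try:
            result = func(*args, **kwargs)  # run the call outside the lock
        except Exception:
            self._record_failure()
            raise
        with self._lock:
            self._failures = 0
            self._opened_at = None
        return result

    def _record_failure(self):
        with self._lock:
            self._failures += 1
            tripped = self._failures >= self.failure_threshold
            just_opened = tripped and self._opened_at is None
            if tripped:
                self._opened_at = time.monotonic()  # open, or restart the timer
        if just_opened and self.on_open:
            self.on_open(self.name)  # e.g. page on-call or emit a metric


alerts = []
# One breaker per service, each with its own tolerance (illustrative values).
breakers = {
    "llm": MonitoredBreaker("llm", failure_threshold=2, on_open=alerts.append),
    "search": MonitoredBreaker("search", failure_threshold=5, on_open=alerts.append),
}


def down():
    raise ConnectionError("Service unavailable")


for name in ("llm", "search"):
    for _ in range(2):
        try:
            breakers[name].call(down)
        except ConnectionError:
            pass

print(alerts)  # ['llm']
```

Two failures are enough to open the `llm` breaker and trigger its alert, while the more tolerant `search` breaker stays closed.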
Check your understanding
Your agent calls an external LLM service through a circuit breaker with threshold=3 and timeout=10s. The service fails 3 times in 2 seconds, opening the breaker. Your agent immediately attempts a 4th call. What happens, and how long until the breaker allows a test call?
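Answer hint
A correct answer must explain that the 4th call immediately raises RuntimeError without invoking the service (fast-fail), and that the breaker allows a test call (HALF_OPEN) only once more than 10 seconds have elapsed since the most recent failure. Here the third failure is also the moment the breaker opened, so the two timestamps coincide. The rejected 4th call never reaches the service, so it does not count as a failure and does not restart the timer. The distinction matters later: if a HALF_OPEN test call fails, the timer restarts from that failure, not from when the breaker first opened.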