What is continuous batching in LLM serving
Continuous batching is an LLM serving technique that dynamically groups incoming inference requests into batches in real time to maximize GPU utilization and reduce latency. It continuously collects requests until a batch is full or a timeout occurs, then processes them together in a single model call.
How it works
Continuous batching works by collecting incoming LLM inference requests as they arrive and grouping them into batches on the fly, rather than waiting for a fixed-size batch to fill. Imagine a grocery store checkout where customers arrive at random: instead of ringing up each customer individually, the cashier waits briefly to group several customers' items into one transaction. Similarly, continuous batching accumulates requests until either a maximum batch size is reached or a short timeout expires, then sends the batch to the model for inference. This balances latency (no request waits too long) against throughput (the GPU serves as many requests per model call as possible).
Concrete example
Here is a simplified Python example illustrating continuous batching logic for LLM serving:
```python
import time
import queue

class ContinuousBatcher:
    def __init__(self, max_batch_size=8, max_wait_time=0.05):
        self.max_batch_size = max_batch_size
        self.max_wait_time = max_wait_time  # batching window, in seconds
        self.request_queue = queue.Queue()

    def add_request(self, request):
        self.request_queue.put(request)

    def get_batch(self):
        """Collect up to max_batch_size requests, waiting at most max_wait_time."""
        batch = []
        start_time = time.time()
        while len(batch) < self.max_batch_size:
            try:
                # Wait only for the time remaining in the batching window.
                timeout = max(0, self.max_wait_time - (time.time() - start_time))
                req = self.request_queue.get(timeout=timeout)
                batch.append(req)
            except queue.Empty:
                break
        return batch

# Simulated usage
batcher = ContinuousBatcher(max_batch_size=4, max_wait_time=0.1)

# Simulate incoming requests
for i in range(10):
    batcher.add_request(f"request_{i}")

batch = batcher.get_batch()
print("Batch processed:", batch)
# Batch processed: ['request_0', 'request_1', 'request_2', 'request_3']
```
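To make the "continuous" part concrete, a batcher like the one above is typically driven by a serving loop that repeatedly pulls a batch and hands it to the model. The sketch below is a minimal, self-contained illustration under assumptions: `run_model` is a hypothetical stand-in for a real LLM forward pass, and the loop drains a pre-filled queue rather than handling live traffic.

```python
import queue
import threading
import time

def run_model(batch):
    # Hypothetical stand-in for a real batched LLM inference call.
    return [f"output_for_{req}" for req in batch]

def serving_loop(request_queue, max_batch_size=4, max_wait_time=0.05, stop=None):
    """Repeatedly collect a batch and run inference until stopped and drained."""
    results = []
    while not stop.is_set() or not request_queue.empty():
        batch = []
        start = time.time()
        while len(batch) < max_batch_size:
            # Wait only for the time remaining in the batching window.
            remaining = max(0.0, max_wait_time - (time.time() - start))
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        if batch:
            results.extend(run_model(batch))
    return results

q = queue.Queue()
stop = threading.Event()
for i in range(6):
    q.put(f"request_{i}")
stop.set()  # no more requests after this initial burst

outputs = serving_loop(q, max_batch_size=4, max_wait_time=0.05, stop=stop)
print(len(outputs))  # 6: two batches of 4 and 2 requests
```

In a real server the loop would run on its own thread while request handlers call `put` concurrently; `queue.Queue` is already thread-safe, so no extra locking is needed for this pattern.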
When to use it
Use continuous batching when you need to serve LLM inference requests with low latency and high throughput, especially in production environments with unpredictable request arrival rates. It is ideal for APIs where requests come sporadically but benefit from GPU batch processing. Avoid continuous batching if your workload requires strict per-request latency guarantees or if requests are extremely sparse, as batching may add unwanted delay.
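The trade-off above can be quantified with back-of-the-envelope numbers. The figures below are illustrative assumptions, not measurements; the key assumed premise is that a forward pass costs roughly the same whether it serves 1 request or 8, which is common when decoding is memory-bandwidth-bound.

```python
# Illustrative assumptions (not measured values):
inference_time = 0.100  # seconds per model call, assumed ~constant in batch size
max_wait_time = 0.050   # batching window in seconds
batch_size = 8

# Unbatched: one model call per request.
throughput_unbatched = 1 / inference_time         # 10 requests/s
# Batched: up to batch_size requests share one call.
throughput_batched = batch_size / inference_time  # 80 requests/s
# Worst-case extra latency a request pays waiting for the batch to form:
worst_case_extra_latency = max_wait_time          # 0.05 s

print(throughput_unbatched, throughput_batched, worst_case_extra_latency)
```

Under these assumptions, batching multiplies throughput by the batch size while adding at most the batching window to any single request's latency, which is why the window is usually kept to tens of milliseconds.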
Key terms
| Term | Definition |
|---|---|
| Continuous batching | Dynamically grouping incoming LLM requests into batches in real time to optimize throughput and latency. |
| Batch size | The number of requests processed together in one model inference call. |
| Timeout | Maximum wait time before processing a batch even if it is not full. |
| Throughput | Number of requests processed per unit time. |
| Latency | Time delay from request arrival to response delivery. |
Key takeaways
- Continuous batching improves GPU utilization by dynamically grouping LLM requests as they arrive.
- It balances latency and throughput by using a max batch size and a timeout to trigger inference.
- Ideal for production LLM APIs with variable request rates needing efficient serving.
- Avoid continuous batching if ultra-low latency per request is critical or requests are very sparse.