What is continuous batching in LLM serving
Continuous batching is an LLM serving technique that dynamically groups incoming inference requests into batches in real time to maximize GPU utilization and reduce latency. It continuously collects requests until a batch is full or a timeout occurs, then processes them together in a single model call.
How it works
Continuous batching works by collecting incoming LLM inference requests as they arrive and grouping them into batches on the fly, rather than waiting for a fixed-size batch to fill. Imagine a grocery store checkout where customers arrive at random: instead of ringing up each customer individually, the cashier waits briefly to group several customers' items into one transaction. Similarly, continuous batching accumulates requests until either a maximum batch size is reached or a short timeout expires, then sends the batch to the model for inference. This balances latency (no request waits too long) against throughput (the GPU serves as many requests per model call as possible).
Concrete example
Here is a simplified Python example illustrating continuous batching logic for LLM serving:
```python
import time
import queue

class ContinuousBatcher:
    def __init__(self, max_batch_size=8, max_wait_time=0.05):
        self.max_batch_size = max_batch_size
        self.max_wait_time = max_wait_time  # batching window, in seconds
        self.request_queue = queue.Queue()

    def add_request(self, request):
        self.request_queue.put(request)

    def get_batch(self):
        """Collect up to max_batch_size requests, waiting at most max_wait_time."""
        batch = []
        start_time = time.time()
        while len(batch) < self.max_batch_size:
            try:
                # Wait only for the time remaining in the batching window.
                timeout = max(0, self.max_wait_time - (time.time() - start_time))
                req = self.request_queue.get(timeout=timeout)
                batch.append(req)
            except queue.Empty:
                break
        return batch

# Simulated usage
batcher = ContinuousBatcher(max_batch_size=4, max_wait_time=0.1)

# Simulate incoming requests
for i in range(10):
    batcher.add_request(f"request_{i}")

batch = batcher.get_batch()
print("Batch processed:", batch)
# Batch processed: ['request_0', 'request_1', 'request_2', 'request_3']
```
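To make the "continuous" part concrete, a batcher like the one above is typically driven by a serving loop that repeatedly pulls a batch and hands it to the model. The sketch below is a minimal, self-contained illustration under assumptions: `run_model` is a hypothetical stand-in for a real LLM forward pass, and the loop drains a pre-filled queue rather than handling live traffic.

```python
import queue
import threading
import time

def run_model(batch):
    # Hypothetical stand-in for a real batched LLM inference call.
    return [f"output_for_{req}" for req in batch]

def serving_loop(request_queue, max_batch_size=4, max_wait_time=0.05, stop=None):
    """Repeatedly collect a batch and run inference until stopped and drained."""
    results = []
    while not stop.is_set() or not request_queue.empty():
        batch = []
        start = time.time()
        while len(batch) < max_batch_size:
            # Wait only for the time remaining in the batching window.
            remaining = max(0.0, max_wait_time - (time.time() - start))
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        if batch:
            results.extend(run_model(batch))
    return results

q = queue.Queue()
stop = threading.Event()
for i in range(6):
    q.put(f"request_{i}")
stop.set()  # no more requests after this initial burst

outputs = serving_loop(q, max_batch_size=4, max_wait_time=0.05, stop=stop)
print(len(outputs))  # 6: two batches of 4 and 2 requests
```

In a real server the loop would run on its own thread while request handlers call `put` concurrently; `queue.Queue` is already thread-safe, so no extra locking is needed for this pattern.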
When to use it
Use continuous batching when you need to serve LLM inference requests with low latency and high throughput, especially in production environments with unpredictable request arrival rates. It is ideal for APIs where requests come sporadically but benefit from GPU batch processing. Avoid continuous batching if your workload requires strict per-request latency guarantees or if requests are extremely sparse, as batching may add unwanted delay.
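The trade-off above can be quantified with back-of-the-envelope numbers. The figures below are illustrative assumptions, not measurements; the key assumed premise is that a forward pass costs roughly the same whether it serves 1 request or 8, which is common when decoding is memory-bandwidth-bound.

```python
# Illustrative assumptions (not measured values):
inference_time = 0.100  # seconds per model call, assumed ~constant in batch size
max_wait_time = 0.050   # batching window in seconds
batch_size = 8

# Unbatched: one model call per request.
throughput_unbatched = 1 / inference_time         # 10 requests/s
# Batched: up to batch_size requests share one call.
throughput_batched = batch_size / inference_time  # 80 requests/s
# Worst-case extra latency a request pays waiting for the batch to form:
worst_case_extra_latency = max_wait_time          # 0.05 s

print(throughput_unbatched, throughput_batched, worst_case_extra_latency)
```

Under these assumptions, batching multiplies throughput by the batch size while adding at most the batching window to any single request's latency, which is why the window is usually kept to tens of milliseconds.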
Key terms
| Term | Definition |
|---|---|
| Continuous batching | Dynamically grouping incoming LLM requests into batches in real time to optimize throughput and latency. |
| Batch size | The number of requests processed together in one model inference call. |
| Timeout | Maximum wait time before processing a batch even if it is not full. |
| Throughput | Number of requests processed per unit time. |
| Latency | Time delay from request arrival to response delivery. |
Key takeaways
- Continuous batching improves GPU utilization by dynamically grouping LLM requests as they arrive.
- It balances latency and throughput by using a max batch size and a timeout to trigger inference.
- Ideal for production LLM APIs with variable request rates needing efficient serving.
- Avoid continuous batching if ultra-low latency per request is critical or requests are very sparse.