How-to · Beginner · 3 min read

How to measure LLM latency in production

Quick answer
Measure LLM latency in production by recording timestamps before and after each API call with a high-resolution timer such as Python's time.perf_counter(). Log the duration for every request so you can monitor response times and catch performance regressions early.

Prerequisites

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the OpenAI Python SDK and set your API key as an environment variable for secure access.

bash
pip install "openai>=1.0"

Step by step

Use Python's time.perf_counter() to capture start and end times around the client.chat.completions.create() call. Calculate the difference to get latency in seconds. Log or print this value for monitoring.

python
import os
import time
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

messages = [{"role": "user", "content": "Say hello"}]

start_time = time.perf_counter()
response = client.chat.completions.create(model="gpt-4o", messages=messages)
end_time = time.perf_counter()

latency = end_time - start_time
print(f"LLM latency: {latency:.3f} seconds")
print("Response content:", response.choices[0].message.content)
output
LLM latency: 0.842 seconds
Response content: Hello!

Common variations

You can measure latency for concurrent requests with asyncio, or use streaming responses to capture partial latencies such as time to first token (TTFT). Different models, such as gpt-4o-mini, have different latency profiles, so adjust your timing code and baselines accordingly.

python
import os
import time
import asyncio
from openai import AsyncOpenAI

# Async calls require the AsyncOpenAI client; the sync OpenAI client
# has no awaitable methods (and there is no acreate() in openai>=1.0).
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def measure_latency():
    messages = [{"role": "user", "content": "Hello async"}]
    start = time.perf_counter()
    response = await client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    end = time.perf_counter()
    print(f"Async LLM latency: {end - start:.3f} seconds")
    print("Response content:", response.choices[0].message.content)

asyncio.run(measure_latency())
output
Async LLM latency: 0.523 seconds
Response content: Hello async!
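For streaming, the metric that usually matters is time to first token (TTFT) rather than total latency. A minimal sketch of a timing helper follows; the helper name first_token_latency is my own, and the demo uses a fake generator with artificial delays instead of a real API call, but the same function works on the iterator returned by client.chat.completions.create(..., stream=True).

```python
import time

def first_token_latency(chunks):
    """Return (time_to_first_chunk, total_time, collected_chunks).

    Works with any iterable, e.g. an OpenAI streaming response.
    """
    start = time.perf_counter()
    ttft = None
    collected = []
    for chunk in chunks:
        if ttft is None:
            # First chunk arrived: record time to first token.
            ttft = time.perf_counter() - start
        collected.append(chunk)
    total = time.perf_counter() - start
    return ttft, total, collected

# Demo with a fake stream that delays 50 ms before each chunk.
def fake_stream():
    for token in ["Hel", "lo", "!"]:
        time.sleep(0.05)
        yield token

ttft, total, tokens = first_token_latency(fake_stream())
print(f"TTFT: {ttft:.3f}s, total: {total:.3f}s, tokens: {tokens}")
```

With a real streaming response you would read the text from each chunk's choices[0].delta.content instead of collecting raw strings.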

Troubleshooting

If latency measurements are inconsistent, check for network issues or API rate limits, and use retries with exponential backoff to handle transient errors. Note that time.perf_counter() is a monotonic, high-resolution clock, so system clock adjustments do not affect it; avoid time.time() for interval measurement, since NTP corrections can skew the result.
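The retry-with-backoff idea above can be sketched as a small wrapper; call_with_retries and the flaky stub are illustrative names of mine, not part of the OpenAI SDK (which also ships its own built-in retry logic you may prefer in practice).

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base_delay=0.5):
    """Call fn(), retrying on exception with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the last error
            # Back off: base_delay, 2x, 4x, ... plus up to 100 ms of jitter.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))

# Demo: a stub that fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return "ok"

result = call_with_retries(flaky, base_delay=0.01)
print(result, "after", calls["n"], "attempts")
```

When timing production requests, record whether a measurement includes retries; otherwise backoff sleeps will inflate your latency numbers.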

Key Takeaways

  • Use high-resolution timers like time.perf_counter() to measure LLM API call latency precisely.
  • Log latency per request in production to monitor performance trends and detect regressions early.
  • Consider async timing and streaming responses for more complex latency measurement scenarios.
Verified 2026-04 · gpt-4o, gpt-4o-mini