How-to · Beginner · 3 min read

How to measure LLM latency in production

Quick answer
Measure LLM latency in production by recording timestamps before and after each API call with a high-resolution timer such as Python's time.perf_counter(). Log the duration for every request so you can monitor response times and catch performance regressions early.

Prerequisites

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the OpenAI Python SDK and set your API key as an environment variable for secure access.

bash
pip install "openai>=1.0"

Step by step

Use Python's time.perf_counter() to capture start and end times around the client.chat.completions.create() call. Calculate the difference to get latency in seconds. Log or print this value for monitoring.

python
import os
import time
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

messages = [{"role": "user", "content": "Say hello"}]

start_time = time.perf_counter()
response = client.chat.completions.create(model="gpt-4o", messages=messages)
end_time = time.perf_counter()

latency = end_time - start_time
print(f"LLM latency: {latency:.3f} seconds")
print("Response content:", response.choices[0].message.content)
output
LLM latency: 0.842 seconds
Response content: Hello!

Common variations

You can measure latency for concurrent requests with asyncio, or use streaming responses to capture partial latencies such as time to first token (TTFT). Different models, such as gpt-4o-mini, have different latency profiles, so adjust your timing code and baselines accordingly.

python
import os
import time
import asyncio
from openai import AsyncOpenAI

# Async calls require the AsyncOpenAI client; the sync OpenAI client
# has no awaitable methods (and there is no acreate() in openai>=1.0).
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def measure_latency():
    messages = [{"role": "user", "content": "Hello async"}]
    start = time.perf_counter()
    response = await client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    end = time.perf_counter()
    print(f"Async LLM latency: {end - start:.3f} seconds")
    print("Response content:", response.choices[0].message.content)

asyncio.run(measure_latency())
output
Async LLM latency: 0.523 seconds
Response content: Hello async!
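For streaming, the metric that usually matters is time to first token (TTFT) rather than total latency. A minimal sketch of a timing helper follows; the helper name first_token_latency is my own, and the demo uses a fake generator with artificial delays instead of a real API call, but the same function works on the iterator returned by client.chat.completions.create(..., stream=True).

```python
import time

def first_token_latency(chunks):
    """Return (time_to_first_chunk, total_time, collected_chunks).

    Works with any iterable, e.g. an OpenAI streaming response.
    """
    start = time.perf_counter()
    ttft = None
    collected = []
    for chunk in chunks:
        if ttft is None:
            # First chunk arrived: record time to first token.
            ttft = time.perf_counter() - start
        collected.append(chunk)
    total = time.perf_counter() - start
    return ttft, total, collected

# Demo with a fake stream that delays 50 ms before each chunk.
def fake_stream():
    for token in ["Hel", "lo", "!"]:
        time.sleep(0.05)
        yield token

ttft, total, tokens = first_token_latency(fake_stream())
print(f"TTFT: {ttft:.3f}s, total: {total:.3f}s, tokens: {tokens}")
```

With a real streaming response you would read the text from each chunk's choices[0].delta.content instead of collecting raw strings.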

Troubleshooting

If latency measurements are inconsistent, check for network issues or API rate limits, and use retries with exponential backoff to handle transient errors. Note that time.perf_counter() is a monotonic, high-resolution clock, so system clock adjustments do not affect it; avoid time.time() for interval measurement, since NTP corrections can skew the result.
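The retry-with-backoff idea above can be sketched as a small wrapper; call_with_retries and the flaky stub are illustrative names of mine, not part of the OpenAI SDK (which also ships its own built-in retry logic you may prefer in practice).

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base_delay=0.5):
    """Call fn(), retrying on exception with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the last error
            # Back off: base_delay, 2x, 4x, ... plus up to 100 ms of jitter.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))

# Demo: a stub that fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return "ok"

result = call_with_retries(flaky, base_delay=0.01)
print(result, "after", calls["n"], "attempts")
```

When timing production requests, record whether a measurement includes retries; otherwise backoff sleeps will inflate your latency numbers.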

Key Takeaways

  • Use high-resolution timers like time.perf_counter() to measure LLM API call latency precisely.
  • Log latency per request in production to monitor performance trends and detect regressions early.
  • Consider async timing and streaming responses for more complex latency measurement scenarios.
Verified 2026-04 · gpt-4o, gpt-4o-mini