
How to track LLM quality in production

Quick answer
Track LLM quality in production by logging user interactions and model outputs, then measuring metrics like accuracy, perplexity, and user satisfaction. Use automated monitoring tools and periodic human reviews to detect drift and maintain performance.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quoted so the shell doesn't interpret >=)
  • Basic knowledge of logging and monitoring

Set up logging and monitoring

Start by integrating logging in your application to capture LLM inputs, outputs, and metadata such as timestamps and user IDs. Use centralized logging platforms like Datadog, WandB, or LangSmith for traceability. Set up monitoring dashboards to track key metrics over time.

python
import os
from openai import OpenAI
import logging

logging.basicConfig(
    filename='llm_logs.log',
    level=logging.INFO,
    format='%(asctime)s %(message)s',  # include timestamps for traceability
)
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

messages = [{"role": "user", "content": "Explain RAG."}]
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)

output = response.choices[0].message.content
logging.info(f"Input: {messages[0]['content']}")
logging.info(f"Output: {output}")
print(output)
output
RAG stands for Retrieval-Augmented Generation, a technique that combines retrieval of documents with generation by an LLM.

Step-by-step quality metrics

Measure LLM quality using metrics such as:

  • Perplexity: How well the model predicts the next token (requires access to model internals or proxy metrics).
  • Accuracy / F1: For tasks with ground truth labels, compare outputs to expected answers.
  • User feedback: Collect explicit ratings or implicit signals like task success.
  • Latency and error rates: Monitor response times and failures.

Combine automated tests with periodic human evaluation to catch subtle errors or drift.

python
def evaluate_response(response_text, expected_answer):
    # Simple exact match accuracy example
    return response_text.strip().lower() == expected_answer.strip().lower()

# Example usage
user_input = "What is RAG?"
expected = "Retrieval-Augmented Generation"
actual = output  # from previous code
accuracy = evaluate_response(actual, expected)
print(f"Accuracy: {accuracy}")
output
Accuracy: False
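
Exact match is strict and fails on verbose answers, as the False result above shows. A token-overlap F1 score gives partial credit instead; the `token_f1` helper below is an illustrative sketch, not a standard library function.

python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1: partial credit when the answer is phrased differently."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Count tokens shared between prediction and reference (with multiplicity)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("RAG stands for Retrieval-Augmented Generation",
               "Retrieval-Augmented Generation"))  # ~0.57 instead of a flat False

A threshold on this score (say, 0.5) is a common way to turn it back into a pass/fail signal for dashboards.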

Common variations

You can enhance tracking by:

  • Using async calls to handle high throughput.
  • Streaming outputs to monitor partial responses in real time.
  • Switching models (e.g., gpt-4o-mini or claude-3-5-sonnet-20241022) to compare quality.
  • Integrating with observability tools like Langfuse or AgentOps for automatic tracing.
python
import asyncio
import os
from openai import AsyncOpenAI

# Streaming requires the async client; the sync client from earlier cannot be awaited.
async_client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def async_llm_call():
    stream = await async_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Explain RAG."}],
        stream=True,
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or '', end='', flush=True)

asyncio.run(async_llm_call())
output
Retrieval-Augmented Generation (RAG) is a technique that combines document retrieval with language model generation to improve accuracy.

Troubleshooting common issues

  • Missing logs: Ensure logging is enabled and file permissions allow writes.
  • Metric drift: If quality degrades, revise your prompts, switch models, or fine-tune with fresh data.
  • High latency: Use smaller models or batch requests.
  • API errors: Check API keys, rate limits, and network connectivity.
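
For transient API errors, a retry with exponential backoff often resolves rate limits and connectivity hiccups. The wrapper below is a generic sketch (the delay schedule and bare `Exception` handling are illustrative, not an OpenAI SDK feature):

python
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0):
    """Retry a flaky call with exponential backoff: 1s, 2s, 4s, ..."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt)

# Usage, assuming `client` and `messages` from the earlier setup:
# response = call_with_retries(lambda: client.chat.completions.create(
#     model="gpt-4o-mini", messages=messages))

In production you would typically catch only retryable exceptions (rate limits, timeouts) and let authentication errors fail fast.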

Key Takeaways

  • Log all LLM inputs and outputs centrally for traceability.
  • Use automated metrics like accuracy and perplexity combined with human reviews.
  • Leverage streaming and async calls for real-time monitoring at scale.
  • Integrate observability tools like Langfuse or AgentOps for automatic quality tracking.
  • Monitor latency and error rates to maintain production reliability.
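
The drift detection mentioned above can be sketched as a rolling-window comparison of a quality metric against a baseline. The class, window size, and threshold below are illustrative; tune them to your traffic.

python
from collections import deque

class DriftMonitor:
    """Flag drift when the rolling mean of a metric drops below a baseline threshold."""
    def __init__(self, baseline, window=100, tolerance=0.1):
        self.baseline = baseline    # e.g. accuracy on a held-out eval set
        self.tolerance = tolerance  # allowed relative drop before alerting
        self.scores = deque(maxlen=window)

    def record(self, score):
        self.scores.append(score)

    def drifted(self):
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet
        mean = sum(self.scores) / len(self.scores)
        return mean < self.baseline * (1 - self.tolerance)

monitor = DriftMonitor(baseline=0.9, window=5, tolerance=0.1)
for s in [0.95, 0.9, 0.6, 0.6, 0.6]:
    monitor.record(s)
print(monitor.drifted())  # mean 0.73 < 0.81, so True

Feed it per-request scores (F1, user ratings, task success) and wire `drifted()` into your alerting.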
Verified 2026-04 · gpt-4o-mini, claude-3-5-sonnet-20241022