How to monitor LLM quality in production
PREREQUISITES
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup monitoring environment
Install necessary libraries and set environment variables to access your LLM API. Prepare logging and storage for model outputs and evaluation metrics.
pip install openai pandas numpy

Step by step monitoring code
This example demonstrates how to call an LLM, log outputs, and compute a simple quality metric (e.g., response length consistency) as a proxy for quality monitoring.
import os

from openai import OpenAI
import pandas as pd

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample prompts to monitor
prompts = [
    "Explain quantum computing in simple terms.",
    "Summarize the latest AI research breakthroughs.",
    "Write a poem about spring.",
]

# Collect responses
records = []
for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    # Simple metric: response length in words
    length = len(text.split())
    records.append({"prompt": prompt, "response": text, "length": length})

# Store results for trend analysis
df = pd.DataFrame(records)
df.to_csv("llm_monitoring_log.csv", index=False)
print(df)

Example output (responses truncated):

                                             prompt                                           response  length
0        Explain quantum computing in simple terms.  Quantum computing is a type of computation th...      45
1   Summarize the latest AI research breakthroughs.  Recent AI research has focused on improving l...      50
2                        Write a poem about spring.  Spring is the season of bloom and light, wher...      40
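The logged lengths become useful once you compare them across the batch. As a sketch, a z-score check using only the standard library flags responses whose word count drifts sharply from the batch mean (the 2.0 threshold is an arbitrary starting point, not a recommended value):

```python
import statistics

def length_outliers(lengths, z_threshold=2.0):
    """Return indices of responses whose word count deviates strongly
    from the batch mean, measured in standard deviations."""
    if len(lengths) < 2:
        return []  # not enough data to estimate spread
    mean = statistics.mean(lengths)
    stdev = statistics.stdev(lengths)
    if stdev == 0:
        return []  # all responses identical in length
    return [i for i, n in enumerate(lengths) if abs(n - mean) / stdev > z_threshold]
```

Running it on the `length` column, e.g. `length_outliers(df["length"].tolist())`, surfaces responses worth manual review.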
Common variations
You can extend monitoring by integrating human feedback loops, by using reference-based metrics such as ROUGE or BLEU, or by issuing calls asynchronously for real-time monitoring. Other models, such as claude-3-5-haiku-20241022 or gemini-2.0-flash, can be monitored the same way.
from anthropic import Anthropic
import os

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

prompts = ["Explain blockchain technology."]
for prompt in prompts:
    message = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=200,
        system="You are a helpful assistant.",
        messages=[{"role": "user", "content": prompt}],
    )
    # message.content is a list of content blocks; take the text of the first
    print(message.content[0].text)

Example output (truncated):

Blockchain technology is a decentralized ledger system that records transactions across many computers...
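For real-time monitoring with asynchronous calls, the fan-out can be sketched with asyncio. Here `fake_llm_call` is a stand-in coroutine, not a real client method; in practice you would replace it with an async client such as `AsyncOpenAI`.

```python
import asyncio

async def fake_llm_call(prompt):
    """Placeholder coroutine; swap in a real async client call in production."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"response to: {prompt}"

async def monitor_batch(prompts):
    """Send all prompts concurrently so monitoring keeps pace with live traffic."""
    return await asyncio.gather(*(fake_llm_call(p) for p in prompts))
```

Calling `asyncio.run(monitor_batch(prompts))` returns responses in the same order as the input prompts, so they can be zipped back to their prompts for logging.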
Troubleshooting monitoring issues
If you notice inconsistent metrics or missing logs, verify API key permissions and network connectivity. For metric anomalies, check if prompts or model versions changed. Use logging to capture errors and retry failed API calls.
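The retry advice above can be implemented as a small wrapper with exponential backoff; this is one minimal sketch, and the attempt count and delays are arbitrary defaults to tune for your rate limits.

```python
import time

def call_with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

For example, wrap the API call as `call_with_retries(lambda: client.chat.completions.create(...))` so transient network failures don't leave gaps in the monitoring log.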
Key Takeaways
- Automate LLM quality monitoring using both quantitative metrics and human feedback.
- Log model inputs and outputs consistently for trend analysis and error diagnosis.
- Use continuous evaluation pipelines to detect model drift and maintain production reliability.