How to log and analyze LLM outputs
Quick answer
Use the chat.completions.create method from the OpenAI SDK to capture LLM outputs programmatically. Log these outputs to files or databases, then analyze them with Python tools like pandas or visualization libraries to identify patterns, errors, or biases.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0" (quoted so the shell does not treat >= as redirection)
- pip install pandas matplotlib
Setup
Install the required Python packages and set your environment variable for the OpenAI API key.
- Install OpenAI SDK and analysis libraries:
pip install openai pandas matplotlib

Step by step
This example shows how to call an LLM, log the output to a CSV file, and then analyze the logged data with pandas and matplotlib.
import os
import csv
from openai import OpenAI
import pandas as pd
import matplotlib.pyplot as plt
# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Define prompt and call model
messages = [{"role": "user", "content": "Explain the benefits of logging LLM outputs."}]
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
# Extract text output
output_text = response.choices[0].message.content
print("LLM output:", output_text)
# Log output to CSV file
log_file = "llm_outputs.csv"
with open(log_file, mode="a", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow([messages[0]["content"], output_text])
# Analyze logged outputs
# Load CSV into pandas DataFrame
try:
    df = pd.read_csv(log_file, header=None, names=["prompt", "response"])
    print(f"Logged {len(df)} entries.")
    # Simple analysis: response length distribution
    df["response_length"] = df["response"].apply(len)
    df["response_length"].hist(bins=10)
    plt.title("Distribution of LLM response lengths")
    plt.xlabel("Response length (characters)")
    plt.ylabel("Frequency")
    plt.show()
except FileNotFoundError:
    print("No log file found for analysis.")

Output
LLM output: Logging LLM outputs helps track model behavior, debug issues, and improve performance.
Logged 1 entries.
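Beyond the length histogram above, the same DataFrame approach can surface likely failures, such as empty or truncated responses. A minimal sketch on illustrative sample rows (the prompts, responses, and the 10-character threshold are all assumptions for demonstration):

```python
import pandas as pd

# Illustrative sample of logged prompt/response pairs
df = pd.DataFrame({
    "prompt": ["Explain logging.", "Summarize this.", "Translate to French."],
    "response": ["Logging helps track model behavior and debug issues.", "", "Bonjour"],
})

# Flag empty or suspiciously short responses as potential failures
df["response_length"] = df["response"].str.len()
df["flagged"] = df["response_length"] < 10
print(df[df["flagged"]][["prompt", "response_length"]])
```

In a real workflow you would load the rows from your CSV log instead of constructing them inline, and tune the threshold to your task.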
Common variations
You can adapt logging for asynchronous calls, streaming outputs, or different models like claude-3-5-sonnet-20241022. For example, use the Anthropic SDK for Claude models or add timestamps and metadata to logs for richer analysis.
import os
from anthropic import Anthropic
import csv
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
messages = [{"role": "user", "content": "Explain the benefits of logging LLM outputs."}]
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    system="You are a helpful assistant.",
    messages=messages
)
# message.content is a list of content blocks, not a string; take the first block's text
output_text = message.content[0].text
print("Claude output:", output_text)
# Append to CSV log
log_file = "claude_llm_outputs.csv"
with open(log_file, mode="a", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow([messages[0]["content"], output_text])

Output
Claude output: Logging outputs from LLMs enables better debugging, auditing, and model improvement.
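The variations above mention adding timestamps and metadata to logs. A minimal sketch of a richer CSV row, using a placeholder response string in place of a live API call (the file name and column order are illustrative):

```python
import csv
from datetime import datetime, timezone

# Placeholder values standing in for a real API call and its response
prompt = "Explain the benefits of logging LLM outputs."
output_text = "Logging helps track model behavior."  # would come from the API
model = "gpt-4o-mini"

log_file = "llm_outputs_with_metadata.csv"
with open(log_file, mode="a", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    # Each row: ISO-8601 UTC timestamp, model name, prompt, response
    writer.writerow([
        datetime.now(timezone.utc).isoformat(),
        model,
        prompt,
        output_text,
    ])
```

Recording the model name per row pays off later, when logs from several models end up in the same file and you want to compare them.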
Troubleshooting
- If you see empty or missing outputs, verify your API key and model name.
- For encoding errors when writing logs, ensure your file uses UTF-8 encoding.
- If logs grow too large, consider rotating files or using a database for storage.
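For the database option mentioned above, a minimal sketch using the standard library's sqlite3; the database file name, table name, and columns are illustrative:

```python
import sqlite3
from datetime import datetime, timezone

# Illustrative schema: one table holding timestamped prompt/response pairs
conn = sqlite3.connect("llm_logs.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS llm_outputs (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           logged_at TEXT NOT NULL,
           prompt TEXT NOT NULL,
           response TEXT NOT NULL
       )"""
)

def log_output(prompt: str, response: str) -> None:
    # Parameterized insert avoids quoting and encoding issues in logged text
    conn.execute(
        "INSERT INTO llm_outputs (logged_at, prompt, response) VALUES (?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), prompt, response),
    )
    conn.commit()

log_output("Explain logging.", "Logging helps debug and audit model behavior.")
count = conn.execute("SELECT COUNT(*) FROM llm_outputs").fetchone()[0]
print(f"Stored {count} entries.")
```

Unlike a flat CSV, SQLite handles concurrent readers and lets you filter or aggregate with SQL instead of loading the whole log into memory.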
Key takeaways
- Always log both prompts and LLM responses for full context during analysis.
- Use structured formats like CSV or JSON for easy parsing and querying.
- Analyze logs with Python libraries such as pandas and matplotlib to identify trends and issues.
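The takeaways above can be sketched with JSON Lines, one self-describing record per line; the file name and field names here are illustrative, and the response string is a placeholder rather than a live API result:

```python
import json
from datetime import datetime, timezone

log_file = "llm_outputs.jsonl"

def log_record(prompt: str, response: str, model: str) -> None:
    # One JSON object per line; loadable later with pandas.read_json(lines=True)
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "prompt": prompt,
        "response": response,
        "response_length": len(response),
    }
    with open(log_file, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_record("Explain logging.", "Logging aids debugging and auditing.", "gpt-4o-mini")
```

JSON Lines keeps the append-only simplicity of CSV while tolerating newlines and commas inside responses, which CSV only handles via quoting.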