Team knowledge management
Why this matters
When a team fine-tunes models, different members run experiments with different configs, data, and hardware. Without a shared knowledge system, you lose track of which hyperparameters worked, why a training run failed, and whether someone already tested that idea. A shared log prevents redundant work and speeds up iteration.
Explanation
What it is: A lightweight system that logs training runs, hyperparameters, model checkpoints, and results to a structured store (JSON, YAML, or SQLite) that all team members can query and contribute to. It's not a full ML experiment tracker like Weights & Biases, but a minimal in-repo knowledge base.
How it works: Your training script (for example, one that wraps SFTTrainer) writes run metadata (learning rate, batch size, adapter rank, data path, commit hash) to a log file. After training, you log validation metrics and the checkpoint location. Team members can read this log, search by model/dataset combination, and see what has been tried. The log is version-controlled with Git, so history is preserved and changes are auditable.
Analogy
It's like a shared lab notebook: every experiment (fine-tuning run) gets an entry with date, conditions (hyperparams), and results (metrics). Instead of asking 'Did anyone try rank=32?', you flip through the notebook and see exactly when it was tried, what happened, and the checkpoint location.
Code
import json
import os
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, Iterator, List, Optional


class TrainingLog:
    """Append-only JSONL log of fine-tuning runs (one JSON object per line)."""

    def __init__(self, log_file: str = "training_log.jsonl"):
        self.log_file = log_file
        # Create an empty log on first use so the file always exists.
        Path(log_file).touch(exist_ok=True)

    def get_current_git_hash(self) -> str:
        """Return the current commit hash, or "unknown" if it cannot be read."""
        try:
            result = subprocess.run(
                ["git", "rev-parse", "HEAD"],
                capture_output=True,
                text=True,
            )
        except OSError:  # for example, git is not installed
            return "unknown"
        return result.stdout.strip() if result.returncode == 0 else "unknown"

    def log_experiment(
        self,
        model_name: str,
        dataset_name: str,
        hyperparams: Dict[str, Any],
        checkpoint_path: str,
        metrics: Optional[Dict[str, float]] = None,
        status: str = "completed",
    ) -> None:
        """Append one run record to the log."""
        entry = {
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated).
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model_name": model_name,
            "dataset_name": dataset_name,
            "hyperparams": hyperparams,
            "checkpoint_path": checkpoint_path,
            "metrics": metrics or {},
            "status": status,
            "git_hash": self.get_current_git_hash(),
        }
        # One complete JSON object per line keeps the file valid JSONL.
        with open(self.log_file, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")

    def _iter_entries(self) -> Iterator[Dict[str, Any]]:
        """Yield each logged run, skipping blank or malformed lines."""
        if not os.path.exists(self.log_file):
            return
        with open(self.log_file, "r", encoding="utf-8") as f:
            for line_no, line in enumerate(f, 1):
                if not line.strip():
                    continue
                try:
                    yield json.loads(line)
                except json.JSONDecodeError:
                    # A partial line (e.g. from an interrupted write) should
                    # not make every query fail.
                    print(
                        f"Skipping malformed line {line_no} in {self.log_file}",
                        file=sys.stderr,
                    )

    def query_by_model_and_dataset(
        self, model_name: str, dataset_name: str
    ) -> List[Dict[str, Any]]:
        """Return all runs for one model/dataset combination."""
        return [
            entry
            for entry in self._iter_entries()
            if entry["model_name"] == model_name
            and entry["dataset_name"] == dataset_name
        ]

    def query_by_hyperparams(self, key: str, value: Any) -> List[Dict[str, Any]]:
        """Return all runs whose hyperparameter `key` equals `value`."""
        return [
            entry
            for entry in self._iter_entries()
            if entry["hyperparams"].get(key) == value
        ]

    def get_all_runs(self) -> List[Dict[str, Any]]:
        """Return every run in the log, oldest first."""
        return list(self._iter_entries())

    def print_summary(self) -> None:
        """Print the total run count and details of the last three runs."""
        runs = self.get_all_runs()
        if not runs:
            print("No training runs logged yet.")
            return
        print(f"Total runs: {len(runs)}")
        for i, run in enumerate(runs[-3:], 1):
            print(f"\nRun {i}:")
            print(f" Model: {run['model_name']}")
            print(f" Dataset: {run['dataset_name']}")
            print(f" Status: {run['status']}")
            print(f" LR: {run['hyperparams'].get('learning_rate', 'N/A')}")
            print(f" Rank: {run['hyperparams'].get('lora_r', 'N/A')}")
            # Metrics are only printed once a run has reported some.
            if run['metrics']:
                print(f" Best Loss: {run['metrics'].get('best_loss', 'N/A')}")

log = TrainingLog("training_log.jsonl")

# Two completed mistral runs with different LoRA settings.
log.log_experiment(
    model_name="mistral-7b",
    dataset_name="company_docs_v2",
    hyperparams={
        "learning_rate": 2e-4,
        "batch_size": 8,
        "lora_r": 16,
        "lora_alpha": 32,
        "num_epochs": 3,
    },
    checkpoint_path="./checkpoints/mistral_v1",
    metrics={"best_loss": 1.23, "eval_accuracy": 0.87},
    status="completed",
)
log.log_experiment(
    model_name="mistral-7b",
    dataset_name="company_docs_v2",
    hyperparams={
        "learning_rate": 5e-4,
        "batch_size": 16,
        "lora_r": 32,
        "lora_alpha": 64,
        "num_epochs": 2,
    },
    checkpoint_path="./checkpoints/mistral_v2",
    metrics={"best_loss": 1.15, "eval_accuracy": 0.89},
    status="completed",
)

# One llama run that is still in progress, so it has no metrics yet.
log.log_experiment(
    model_name="llama-13b",
    dataset_name="company_docs_v2",
    hyperparams={
        "learning_rate": 1e-4,
        "batch_size": 8,
        "lora_r": 16,
        "lora_alpha": 32,
        "num_epochs": 4,
    },
    checkpoint_path="./checkpoints/llama_v1",
    status="running",
)

print("=== All Runs (Last 3) ===")
log.print_summary()

print("\n=== Query: mistral-7b + company_docs_v2 ===")
results = log.query_by_model_and_dataset("mistral-7b", "company_docs_v2")
for r in results:
    print(
        f"LR={r['hyperparams']['learning_rate']}, "
        f"Loss={r['metrics'].get('best_loss', 'N/A')}, "
        f"Status={r['status']}"
    )

print("\n=== Query: Rank=16 ===")
rank_results = log.query_by_hyperparams("lora_r", 16)
for r in rank_results:
    print(f"{r['model_name']} on {r['dataset_name']}: {r['status']}")

Expected output when starting from an empty training_log.jsonl:

=== All Runs (Last 3) ===
Total runs: 3

Run 1:
 Model: mistral-7b
 Dataset: company_docs_v2
 Status: completed
 LR: 0.0002
 Rank: 16
 Best Loss: 1.23

Run 2:
 Model: mistral-7b
 Dataset: company_docs_v2
 Status: completed
 LR: 0.0005
 Rank: 32
 Best Loss: 1.15

Run 3:
 Model: llama-13b
 Dataset: company_docs_v2
 Status: running
 LR: 0.0001
 Rank: 16

=== Query: mistral-7b + company_docs_v2 ===
LR=0.0002, Loss=1.23, Status=completed
LR=0.0005, Loss=1.15, Status=completed

=== Query: Rank=16 ===
mistral-7b on company_docs_v2: completed
llama-13b on company_docs_v2: running
What just happened?
The code created a TrainingLog class that appends structured JSON lines to a file. Each line is a complete training run record. We logged three experiments (two completed mistral runs and one in-progress llama run), then queried the log by model+dataset and by hyperparameter. Each query scanned the JSONL file line by line and returned the matching records. The Git commit hash was captured automatically to provide an audit trail.
Common gotcha
JSONL format (one JSON object per line) is critical: if you write the entire log as a single JSON array and then append to it, the file stops being valid JSON and you can corrupt it or lose previous entries. Always append one complete JSON object per line. Similarly, if several team members or processes write to the log at the same time, you can get interleaved partial lines. In production, use file locks or a simple SQLite database, which serializes writers for you.
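One way to guard against interleaved writes is to hold an exclusive file lock for the duration of each append. The sketch below is a minimal illustration, not part of the TrainingLog class above: the helper name append_entry_locked is hypothetical, and it relies on fcntl.flock, which is only available on POSIX systems (Linux, macOS) and is advisory, so it only protects against writers that also take the lock.

```python
import fcntl  # POSIX only; not available on Windows
import json
import os
from typing import Any, Dict


def append_entry_locked(log_file: str, entry: Dict[str, Any]) -> None:
    """Append one JSON object as a single line while holding an exclusive lock.

    Hypothetical helper for illustration. The lock is advisory: every writer
    must go through this function for it to have any effect.
    """
    # Serialize first so the time spent holding the lock is as short as possible.
    line = json.dumps(entry) + "\n"
    with open(log_file, "a", encoding="utf-8") as f:
        # Blocks until no other process or thread holds the lock on this file.
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)
        try:
            f.write(line)
            f.flush()
            os.fsync(f.fileno())  # push the line to disk before unlocking
        finally:
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)
```

A log_experiment-style method could call this helper instead of writing to the file directly. It does not help with merge conflicts when two people commit different versions of the log to Git; those still have to be resolved the usual way.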
Error recovery
FileNotFoundError on query
json.JSONDecodeError when reading
git command returns 'unknown'
Experienced dev note
The real power here is not the logging itself: it's searchability and version control. When someone asks 'Did we try rank=64 with this dataset?', you run one query instead of asking in Slack. And because the log is version-controlled, you can diff it to see what changed between runs. In production, consider a simple SQLite backend (replace the JSONL file with a .db file using sqlite3) once you exceed roughly 500 runs, where scanning the whole JSONL file for every query starts to become slow and unwieldy. But start with JSONL because it's Git-friendly and human-readable.
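A minimal sketch of what such a SQLite backend might look like, using only the standard-library sqlite3 module. The table name runs, the index, and the helper names open_log, insert_run, and query_runs are assumptions made for illustration. Hyperparameters and metrics are stored as JSON text here so the schema does not need to change whenever a new hyperparameter appears; other layouts are equally reasonable.

```python
import json
import sqlite3
from typing import Any, Dict, List

# Hypothetical schema: one row per training run.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT NOT NULL,
    model_name TEXT NOT NULL,
    dataset_name TEXT NOT NULL,
    hyperparams TEXT NOT NULL,      -- JSON-encoded dict
    checkpoint_path TEXT NOT NULL,
    metrics TEXT NOT NULL,          -- JSON-encoded dict
    status TEXT NOT NULL,
    git_hash TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_runs_model_dataset
    ON runs (model_name, dataset_name);
"""


def open_log(db_path: str) -> sqlite3.Connection:
    """Open (or create) the run database and make sure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # rows can be converted to dicts
    conn.executescript(SCHEMA)
    return conn


def insert_run(conn: sqlite3.Connection, entry: Dict[str, Any]) -> None:
    """Insert one run record shaped like the JSONL entries above."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO runs (timestamp, model_name, dataset_name, hyperparams,"
            " checkpoint_path, metrics, status, git_hash)"
            " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (
                entry["timestamp"],
                entry["model_name"],
                entry["dataset_name"],
                json.dumps(entry["hyperparams"]),
                entry["checkpoint_path"],
                json.dumps(entry.get("metrics", {})),
                entry["status"],
                entry.get("git_hash", "unknown"),
            ),
        )


def query_runs(
    conn: sqlite3.Connection, model_name: str, dataset_name: str
) -> List[Dict[str, Any]]:
    """Return all runs for one model/dataset pair, oldest first."""
    rows = conn.execute(
        "SELECT * FROM runs WHERE model_name = ? AND dataset_name = ?"
        " ORDER BY timestamp",
        (model_name, dataset_name),
    ).fetchall()
    results = []
    for row in rows:
        record = dict(row)
        # Decode the JSON text columns back into dicts.
        record["hyperparams"] = json.loads(record["hyperparams"])
        record["metrics"] = json.loads(record["metrics"])
        results.append(record)
    return results
```

The trade-off is that a .db file is binary, so Git can store it but cannot produce a readable diff of it. One option is to keep the JSONL file as the version-controlled record and treat the database as a query cache that can be rebuilt from it.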
Check your understanding
Your teammate ran a fine-tune at 2am and then immediately re-ran it with slightly different hyperparams at 3am, thinking the first one failed. How would you use the log to find both runs and compare their metrics without asking them what they did?
Answer hint
A correct answer would describe querying by both model and dataset (or by specific hyperparameters), then comparing the 'metrics' and 'timestamp' fields between the two entries. The key is that both runs are in the log and you can reconstruct exactly what happened without external communication.
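The comparison described in the hint could be sketched roughly as follows. This is one possible approach rather than the only one: the helper names load_runs and diff_dicts are hypothetical, and the sort assumes every timestamp in the log is an ISO 8601 string written in the same format and timezone, so that string order matches chronological order.

```python
import json
from typing import Any, Dict, List, Tuple


def load_runs(log_file: str, model_name: str, dataset_name: str) -> List[Dict[str, Any]]:
    """Read a JSONL log and return the matching runs, oldest first."""
    runs = []
    with open(log_file, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            entry = json.loads(line)
            if entry["model_name"] == model_name and entry["dataset_name"] == dataset_name:
                runs.append(entry)
    # Assumes uniformly formatted ISO 8601 timestamps (see lead-in).
    return sorted(runs, key=lambda run: run["timestamp"])


def diff_dicts(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(set(old) | set(new))
        if old.get(key) != new.get(key)
    }


def compare_last_two(log_file: str, model_name: str, dataset_name: str) -> None:
    """Print what changed between the two most recent matching runs."""
    runs = load_runs(log_file, model_name, dataset_name)
    if len(runs) < 2:
        print("Fewer than two matching runs in the log.")
        return
    first, second = runs[-2], runs[-1]
    print(f"{first['timestamp']} ({first['status']}) -> "
          f"{second['timestamp']} ({second['status']})")
    print("Hyperparameter changes:", diff_dicts(first["hyperparams"], second["hyperparams"]))
    print("Metric changes:", diff_dicts(first["metrics"], second["metrics"]))
```

Calling compare_last_two("training_log.jsonl", "mistral-7b", "company_docs_v2") would then show the 2am and 3am entries side by side, including whether the first run really failed according to its status field.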