Code Advanced hard · 8 min

Team knowledge management

What you will learn

Store and retrieve training artifacts, hyperparameters, and results across team members using structured metadata and version tracking.

Why this matters

When a team fine-tunes models, different members run experiments with different configs, data, and hardware. Without a shared knowledge system, you lose track of which hyperparameters worked, why a training run failed, and whether someone already tested that idea. A shared log prevents redundant work and accelerates iteration.

Skip if: You're prototyping alone on a single machine for under a week, or your entire workflow lives within a managed platform (HuggingFace Hub with team access, Lambda Labs, or similar). However, the moment you have two people fine-tuning the same model architecture, you need this.

Explanation

What it is: A lightweight system that logs training runs, hyperparameters, model checkpoints, and results to a structured store (JSON, YAML, or SQLite) that all team members can query and contribute to. It's not a full ML experiment tracker like Weights & Biases, but a minimal in-repo knowledge base.

How it works mechanically: Before training starts, your training script (the one that calls SFTTrainer) writes metadata (learning rate, batch size, adapter rank, data path, commit hash) to a log file. After training, it logs validation metrics and the checkpoint location. Team members can read this log, search by model/dataset combo, and see what's been tried. The log is version-controlled (Git), so history is preserved and changes are auditable.

When to use it: Use this when you have 2+ people fine-tuning the same base model, need to prevent duplicate experiments, or need to understand why a training approach failed. It's especially valuable for teams that iterate on the same dataset or adapter configuration across multiple runs.
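The before/after logging flow described above can be sketched as two appended records that share a run ID. This is a minimal sketch under stated assumptions, not the lesson's TrainingLog class: the `run_id` field, the `event` values, and the helper names `append_event` and `latest_state` are introduced here for illustration, and the paths and metric values are placeholders.

```python
import json
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path

def append_event(log_file, run_id, event, **fields):
    """Append one JSON object per line; earlier lines are never rewritten."""
    record = {
        "run_id": run_id,
        "event": event,  # hypothetical values: "started" or "finished"
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **fields,
    }
    with open(log_file, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def latest_state(log_file):
    """Merge the events of each run so later events override earlier fields."""
    runs = {}
    with open(log_file, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                runs.setdefault(record["run_id"], {}).update(record)
    return runs

log_file = Path(tempfile.mkdtemp()) / "training_log.jsonl"
run_id = uuid.uuid4().hex

# Before training: record what is about to run.
append_event(log_file, run_id, "started",
             hyperparams={"learning_rate": 2e-4, "lora_r": 16},
             data_path="data/train.jsonl")  # placeholder path

# After training: record the outcome under the same run_id.
append_event(log_file, run_id, "finished",
             metrics={"eval_loss": 1.23},  # illustrative number only
             checkpoint_path="./checkpoints/example")  # placeholder path

state = latest_state(log_file)[run_id]
print(state["event"], state["hyperparams"], state["metrics"])
```

Writing the "started" record first means a crashed run still leaves a trace in the log, which a single write after training would not.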

Analogy

It's like a shared lab notebook: every experiment (fine-tuning run) gets an entry with date, conditions (hyperparams), and results (metrics). Instead of asking 'Did anyone try rank=32?', you flip through the notebook and see exactly when it was tried, what happened, and the checkpoint location.

Code

python

import json
import os
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional, Dict, Any

class TrainingLog:
    """Append-only JSONL log of fine-tuning runs (one JSON object per line)."""

    def __init__(self, log_file: str = "training_log.jsonl"):
        self.log_file = log_file
        # Create an empty log on first use; the parent directory must already exist.
        if not os.path.exists(log_file):
            Path(log_file).touch()
    
    def get_current_git_hash(self) -> str:
        """Return the current commit hash, or "unknown" if it cannot be determined."""
        try:
            result = subprocess.run(
                ["git", "rev-parse", "HEAD"],
                capture_output=True,
                text=True,
                cwd=os.getcwd()
            )
            # A non-zero return code usually means we are not inside a Git repo.
            return result.stdout.strip() if result.returncode == 0 else "unknown"
        except Exception:
            # git is not installed or could not be executed.
            return "unknown"
    
    def log_experiment(
        self,
        model_name: str,
        dataset_name: str,
        hyperparams: Dict[str, Any],
        checkpoint_path: str,
        metrics: Optional[Dict[str, float]] = None,
        status: str = "completed"
    ) -> None:
        """Append one run record to the log file."""
        entry = {
            # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model_name": model_name,
            "dataset_name": dataset_name,
            "hyperparams": hyperparams,
            "checkpoint_path": checkpoint_path,
            "metrics": metrics or {},
            "status": status,
            "git_hash": self.get_current_git_hash()
        }
        # Append mode keeps every earlier record intact.
        with open(self.log_file, "a") as f:
            f.write(json.dumps(entry) + "\n")
    
    def query_by_model_and_dataset(self, model_name: str, dataset_name: str):
        results = []
        if not os.path.exists(self.log_file):
            return results
        with open(self.log_file, "r") as f:
            for line in f:
                if line.strip():
                    entry = json.loads(line)
                    if entry["model_name"] == model_name and entry["dataset_name"] == dataset_name:
                        results.append(entry)
        return results
    
    def query_by_hyperparams(self, key: str, value: Any):
        results = []
        if not os.path.exists(self.log_file):
            return results
        with open(self.log_file, "r") as f:
            for line in f:
                if line.strip():
                    entry = json.loads(line)
                    if entry["hyperparams"].get(key) == value:
                        results.append(entry)
        return results
    
    def get_all_runs(self):
        results = []
        if not os.path.exists(self.log_file):
            return results
        with open(self.log_file, "r") as f:
            for line in f:
                if line.strip():
                    results.append(json.loads(line))
        return results
    
    def print_summary(self):
        runs = self.get_all_runs()
        if not runs:
            print("No training runs logged yet.")
            return
        print(f"Total runs: {len(runs)}")
        for i, run in enumerate(runs[-3:], 1):
            print(f"\nRun {i}:")
            print(f"  Model: {run['model_name']}")
            print(f"  Dataset: {run['dataset_name']}")
            print(f"  Status: {run['status']}")
            print(f"  LR: {run['hyperparams'].get('learning_rate', 'N/A')}")
            print(f"  Rank: {run['hyperparams'].get('lora_r', 'N/A')}")
            if run['metrics']:
                print(f"  Best Loss: {run['metrics'].get('best_loss', 'N/A')}")

log = TrainingLog("training_log.jsonl")

log.log_experiment(
    model_name="mistral-7b",
    dataset_name="company_docs_v2",
    hyperparams={
        "learning_rate": 2e-4,
        "batch_size": 8,
        "lora_r": 16,
        "lora_alpha": 32,
        "num_epochs": 3
    },
    checkpoint_path="./checkpoints/mistral_v1",
    metrics={"best_loss": 1.23, "eval_accuracy": 0.87},
    status="completed"
)

log.log_experiment(
    model_name="mistral-7b",
    dataset_name="company_docs_v2",
    hyperparams={
        "learning_rate": 5e-4,
        "batch_size": 16,
        "lora_r": 32,
        "lora_alpha": 64,
        "num_epochs": 2
    },
    checkpoint_path="./checkpoints/mistral_v2",
    metrics={"best_loss": 1.15, "eval_accuracy": 0.89},
    status="completed"
)

log.log_experiment(
    model_name="llama-13b",
    dataset_name="company_docs_v2",
    hyperparams={
        "learning_rate": 1e-4,
        "batch_size": 8,
        "lora_r": 16,
        "lora_alpha": 32,
        "num_epochs": 4
    },
    checkpoint_path="./checkpoints/llama_v1",
    status="running"
)

print("=== All Runs (Last 3) ===")
log.print_summary()

print("\n=== Query: mistral-7b + company_docs_v2 ===")
results = log.query_by_model_and_dataset("mistral-7b", "company_docs_v2")
for r in results:
    print(f"LR={r['hyperparams']['learning_rate']}, Loss={r['metrics'].get('best_loss', 'N/A')}, Status={r['status']}")

print("\n=== Query: Rank=16 ===")
rank_results = log.query_by_hyperparams("lora_r", 16)
for r in rank_results:
    print(f"{r['model_name']} on {r['dataset_name']}: {r['status']}")

Output

=== All Runs (Last 3) ===
Total runs: 3

Run 1:
  Model: mistral-7b
  Dataset: company_docs_v2
  Status: completed
  LR: 0.0002
  Rank: 16
  Best Loss: 1.23

Run 2:
  Model: mistral-7b
  Dataset: company_docs_v2
  Status: completed
  LR: 0.0005
  Rank: 32
  Best Loss: 1.15

Run 3:
  Model: llama-13b
  Dataset: company_docs_v2
  Status: running
  LR: 0.0001
  Rank: 16
  Best Loss: N/A

=== Query: mistral-7b + company_docs_v2 ===
LR=0.0002, Loss=1.23, Status=completed
LR=0.0005, Loss=1.15, Status=completed

=== Query: Rank=16 ===
mistral-7b on company_docs_v2: completed
llama-13b on company_docs_v2: running

What just happened?

The code created a TrainingLog class that appends structured JSON lines to a file. Each line is a complete training run record. We logged three experiments (two completed mistral runs, one in-progress llama run), then queried the log by model+dataset and by hyperparameter. Each query read the JSONL file line by line and returned the matching records. The Git commit hash was captured automatically for the audit trail.

Common gotcha

JSONL format (one JSON object per line) is critical: if you accidentally write the entire log as a single JSON array and then append to it, you'll corrupt the file or lose previous entries. Always append one complete JSON object per line. Similarly, if several processes write to the same log file at once (for example, on a shared filesystem), you can get interleaved partial lines. In production, use file locks or a simple SQLite database, which serializes writers for you.
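One way to serialize appends is a lock file created with an exclusive-create flag. This is a sketch that assumes all writers share one filesystem and cooperate by using the same helper; the names `lock_file` and `append_entry_locked`, the `.lock` suffix, and the timeout values are illustrative choices. It does not help with separate Git clones, where concurrent appends show up as merge conflicts instead.

```python
import json
import os
import tempfile
import time
from contextlib import contextmanager

@contextmanager
def lock_file(lock_path, timeout=10.0, poll=0.05):
    """Hold an exclusive lock by creating lock_path; wait if it already exists."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if another writer already holds the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)

def append_entry_locked(log_file, entry):
    """Append one complete JSON line while holding the lock."""
    with lock_file(str(log_file) + ".lock"):
        with open(log_file, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())  # push the line to disk before releasing the lock

demo_log = os.path.join(tempfile.mkdtemp(), "training_log.jsonl")
append_entry_locked(demo_log, {"model_name": "mistral-7b", "status": "completed"})
append_entry_locked(demo_log, {"model_name": "llama-13b", "status": "running"})

with open(demo_log, "r", encoding="utf-8") as f:
    print(sum(1 for _ in f), "lines written")
```

A process that dies while holding the lock leaves the lock file behind, so a real setup would also need a way to detect and clear stale locks.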

Error recovery

FileNotFoundError when creating or writing the log

The query methods return an empty list when the log file is missing, so this error comes from __init__ or log_experiment. The __init__ method creates an empty file, but it fails if the file path is wrong or the parent directory doesn't exist. Check that the log_file path is writable and the directory exists.

json.JSONDecodeError when reading

A previous write was interrupted and left a partial/malformed JSON line in the file. Manually inspect training_log.jsonl for any incomplete lines (ones that don't end with }). Delete the broken line or restore from backup.
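To locate broken lines without crashing, a tolerant reader can skip malformed lines and report their line numbers. This is a sketch only: `read_log_tolerant` is a hypothetical helper, not part of the TrainingLog class above, and the demo file is fabricated to show a line cut off mid-write.

```python
import json
import tempfile
from pathlib import Path

def read_log_tolerant(log_file):
    """Return (entries, bad_lines); bad_lines holds (line_number, raw_text) pairs."""
    entries, bad_lines = [], []
    with open(log_file, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            text = line.strip()
            if not text:
                continue  # ignore blank lines
            try:
                entries.append(json.loads(text))
            except json.JSONDecodeError:
                bad_lines.append((line_number, text))
    return entries, bad_lines

# Demo on a throwaway file whose second line was cut off mid-write.
demo = Path(tempfile.mkdtemp()) / "training_log.jsonl"
demo.write_text(
    '{"model_name": "a", "status": "completed"}\n'
    '{"model_name": "b", "stat\n'
    '{"model_name": "c", "status": "running"}\n',
    encoding="utf-8",
)

entries, bad_lines = read_log_tolerant(demo)
print(len(entries), "valid entries")
for line_number, text in bad_lines:
    print(f"line {line_number} is malformed: {text!r}")
```

The reported line numbers tell you exactly which lines to delete or restore from Git history.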

git command returns 'unknown'

Git is not installed or the code is not in a Git repository. This is non-fatal; the code catches the exception and logs 'unknown'. If you need commit tracking, initialize a Git repo or remove the git_hash field.
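A commit hash only identifies the code if the working tree had no uncommitted changes, so it can help to record that as well. The sketch below assumes the standard `git rev-parse HEAD` and `git status --porcelain` commands; the function name `git_state` and the `git_dirty` field are illustrative additions, not part of the lesson's log format.

```python
import subprocess

def git_state(repo_dir="."):
    """Return the commit hash and a dirty flag; fall back gracefully outside a repo."""
    def run(*args):
        try:
            result = subprocess.run(
                ["git", *args], capture_output=True, text=True, cwd=repo_dir
            )
        except OSError:
            return None  # git executable not found or directory missing
        return result.stdout if result.returncode == 0 else None

    head = run("rev-parse", "HEAD")
    status = run("status", "--porcelain")
    return {
        "git_hash": head.strip() if head else "unknown",
        # True when there are uncommitted changes, None when it cannot be determined.
        "git_dirty": bool(status.strip()) if status is not None else None,
    }

state = git_state()
print(state)
```

A run logged with `git_dirty` set to True is a hint that the recorded hash alone may not reproduce the result.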

Experienced dev note

The real power here is not the logging: it's searchability and version control. When someone asks 'Did we try rank=64 with this dataset?', you run one query instead of asking on Slack. And because the log is version-controlled, you can diff it to see which runs were added or changed between commits. In production, consider a simple SQLite backend (replace the JSONL file with a .db file using sqlite3) once the log grows large; a rough rule of thumb is around 500 runs, since every JSONL query rescans the whole file. But start with JSONL because it's Git-friendly and human-readable.
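The SQLite option mentioned above might look like the following sketch, which uses only the standard-library sqlite3 module. The table layout, the dedicated `lora_r` column, and the function names are assumptions chosen for illustration; the two sample rows reuse the values from the lesson's example runs.

```python
import json
import sqlite3

def open_run_db(path=":memory:"):
    """Open (or create) the run database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS runs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            model_name TEXT NOT NULL,
            dataset_name TEXT NOT NULL,
            status TEXT NOT NULL,
            lora_r INTEGER,
            hyperparams TEXT NOT NULL,  -- full dict stored as JSON text
            metrics TEXT NOT NULL       -- full dict stored as JSON text
        )""")
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_model_dataset ON runs (model_name, dataset_name)"
    )
    return conn

def insert_run(conn, entry):
    """Insert one run; 'with conn' commits on success and rolls back on error."""
    with conn:
        conn.execute(
            "INSERT INTO runs (model_name, dataset_name, status, lora_r, hyperparams, metrics) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (
                entry["model_name"],
                entry["dataset_name"],
                entry["status"],
                entry["hyperparams"].get("lora_r"),
                json.dumps(entry["hyperparams"]),
                json.dumps(entry.get("metrics", {})),
            ),
        )

def query_by_rank(conn, rank):
    """Return (model_name, dataset_name, status) for runs with the given LoRA rank."""
    return conn.execute(
        "SELECT model_name, dataset_name, status FROM runs WHERE lora_r = ? ORDER BY id",
        (rank,),
    ).fetchall()

conn = open_run_db()  # in-memory database for the demo; pass a .db path in practice
insert_run(conn, {"model_name": "mistral-7b", "dataset_name": "company_docs_v2",
                  "status": "completed", "hyperparams": {"lora_r": 16},
                  "metrics": {"best_loss": 1.23}})
insert_run(conn, {"model_name": "mistral-7b", "dataset_name": "company_docs_v2",
                  "status": "completed", "hyperparams": {"lora_r": 32},
                  "metrics": {"best_loss": 1.15}})
print(query_by_rank(conn, 16))
```

The trade-off is that a binary .db file does not diff or merge in Git the way JSONL does, which is why the note above suggests starting with JSONL.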

Check your understanding

Your teammate ran a fine-tune at 2am and then immediately re-ran it with slightly different hyperparams at 3am, thinking the first one failed. How would you use the log to find both runs and compare their metrics without asking them what they did?

Answer hint

A correct answer would describe querying by both model and dataset (or by specific hyperparameters), then comparing the 'metrics' and 'timestamp' fields between the two entries. The key is that both runs are in the log and you can reconstruct exactly what happened without external communication.
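As a sketch of that comparison, the snippet below filters entries by a time window and lists only the hyperparameters that differ between two runs. The helpers `runs_between` and `diff_hyperparams` are hypothetical, and the three entries (timestamps, learning rates, losses) are made-up sample data standing in for what a query would return.

```python
from datetime import datetime

def runs_between(entries, start_iso, end_iso):
    """Keep entries whose timestamp falls inside [start, end].

    The bounds must use the same timezone style (naive or aware) as the log.
    """
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    return [
        e for e in entries
        if start <= datetime.fromisoformat(e["timestamp"]) <= end
    ]

def diff_hyperparams(run_a, run_b):
    """Map each differing hyperparameter to its (value_in_a, value_in_b) pair."""
    a, b = run_a["hyperparams"], run_b["hyperparams"]
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }

# Hypothetical entries for illustration only.
entries = [
    {"timestamp": "2025-01-10T02:00:00", "status": "completed",
     "hyperparams": {"learning_rate": 2e-4, "lora_r": 16},
     "metrics": {"best_loss": 1.30}},
    {"timestamp": "2025-01-10T03:00:00", "status": "completed",
     "hyperparams": {"learning_rate": 3e-4, "lora_r": 16},
     "metrics": {"best_loss": 1.25}},
    {"timestamp": "2025-01-10T15:00:00", "status": "running",
     "hyperparams": {"learning_rate": 1e-4, "lora_r": 32},
     "metrics": {}},
]

night_runs = runs_between(entries, "2025-01-10T01:00:00", "2025-01-10T04:00:00")
first, second = night_runs
changed = diff_hyperparams(first, second)
print("changed hyperparams:", changed)
print("losses:", first["metrics"].get("best_loss"), second["metrics"].get("best_loss"))
```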

VERSION The logging code in this lesson uses only the Python standard library, so the logging mechanism itself does not depend on a particular trl or transformers release. The lesson targets trl >= 1.0 and transformers >= 5.0; if you integrate with SFTTrainer.train(), check the trainer API of the versions you have installed.

Once your team is tracking experiments, you'll need to automate the logging so every SFTTrainer run writes to the log automatically: that's integrating knowledge management directly into your training pipeline.
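One possible shape for that automation is a context manager wrapped around the training call, so that a record is written whether the run completes or fails. This is a sketch under assumptions: `logged_run` and `fake_train` are hypothetical names, `fake_train` stands in for a real trainer call, and the `error` and `duration_s` fields are additions to the lesson's record format.

```python
import json
import tempfile
import time
from contextlib import contextmanager
from datetime import datetime, timezone
from pathlib import Path

@contextmanager
def logged_run(log_file, **meta):
    """Yield a metrics dict to fill in; append one record when the block exits."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "metrics": {},
        "status": "interrupted",  # overwritten below unless the process is interrupted
        **meta,
    }
    started = time.monotonic()
    try:
        yield entry["metrics"]
        entry["status"] = "completed"
    except Exception as exc:
        entry["status"] = "failed"
        entry["error"] = f"{type(exc).__name__}: {exc}"
        raise  # let the caller see the original failure
    finally:
        entry["duration_s"] = round(time.monotonic() - started, 3)
        with open(log_file, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")

def fake_train():
    """Stand-in for a real training call; returns an illustrative metric."""
    return {"eval_loss": 1.0}

log_file = Path(tempfile.mkdtemp()) / "training_log.jsonl"

with logged_run(log_file, model_name="mistral-7b", hyperparams={"lora_r": 16}) as metrics:
    metrics.update(fake_train())

try:
    with logged_run(log_file, model_name="mistral-7b", hyperparams={"lora_r": 64}):
        raise RuntimeError("simulated out-of-memory")
except RuntimeError:
    pass  # the failure is already recorded in the log

records = [json.loads(line) for line in log_file.read_text(encoding="utf-8").splitlines()]
for record in records:
    print(record["status"], record.get("error", ""))
```

Recording the error message alongside the hyperparameters is what lets a teammate later answer why a run failed without asking the person who launched it.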
