Tool Advanced hard · 8 min best_practice

SBOM for ML containers

What you will learn

Generate and manage Software Bill of Materials (SBOM) for ML container images to track dependencies, vulnerabilities, and licensing compliance.

Why this matters

ML containers bundle Python packages, system libraries, CUDA runtimes, and model weights, creating opaque dependency chains. Without an SBOM, you cannot audit what's in your image for security compliance, licensing violations, or vulnerability patching. In production, a supply-chain attack on a transitive dependency (e.g., a compromised data preprocessing library) can go undetected. Compliance frameworks and customer audits (for example, NIST secure-software guidance and SOC 2 assessments) increasingly expect SBOMs for containerized workloads, although the exact requirements vary by industry and contract.

Skip if: your containers are local, development-only images that never leave your machine; SBOM generation adds little value there. You can also skip it if (1) your ML pipeline runs entirely on-prem with no external audit requirements, (2) you rebuild images daily and track changes manually, or (3) your organization has no compliance or security scanning policies. However, once code touches a registry, CI/CD pipeline, or shared team environment, an SBOM becomes non-negotiable.

Explanation

A Software Bill of Materials (SBOM) is a machine-readable inventory of all software components in a container image: OS packages, Python wheels, compiled libraries, and their versions. For ML containers, this is critical because: (1) a handful of direct Python requirements can pull in 50 or more transitive packages; (2) vulnerabilities in CUDA and cuDNN are disclosed regularly; (3) model weights or data preprocessing code may carry GPL/AGPL licensing whose obligations extend to your entire deployment. Tools like Syft (open-source) or Trivy (which can also emit SBOMs) scan your image and produce output in standard formats (SPDX, CycloneDX). You integrate SBOM generation into your Docker build pipeline, either as a post-build step after pushing to the registry or as part of your CI/CD scanning gate. The SBOM is versioned alongside your image digest, so you can audit "what was in production on 2026-03-15" months later.
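To make "machine-readable inventory" concrete, here is a minimal sketch that reads an SPDX JSON document and lists each package with its version. The sample document is hypothetical and heavily trimmed; it only assumes the SPDX 2.3 JSON field names `packages`, `name`, and `versionInfo`.

```python
import json

# Hypothetical, trimmed-down SPDX 2.3 JSON document; a real SBOM lists far more packages.
SAMPLE_SPDX = """
{
  "spdxVersion": "SPDX-2.3",
  "name": "my-ml",
  "packages": [
    {"name": "torch", "versionInfo": "2.0.0"},
    {"name": "numpy", "versionInfo": "1.24.4"},
    {"name": "libssl3", "versionInfo": "3.0.2"}
  ]
}
"""


def list_components(spdx_text):
    """Return sorted (name, version) pairs for every package in an SPDX JSON document."""
    doc = json.loads(spdx_text)
    return sorted(
        (pkg["name"], pkg.get("versionInfo", "unknown"))
        for pkg in doc.get("packages", [])
    )


components = list_components(SAMPLE_SPDX)
for name, version in components:
    print(f"{name}=={version}")
```

In practice you would pass the contents of the sbom.spdx.json file produced by your scanner instead of the inline sample.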

Configuration

dockerfile

FROM nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04 AS build

# Remove the apt lists in the same layer so they never reach the image
RUN apt-get update && apt-get install -y \
    python3 python3-pip python3-dev && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

FROM nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04 AS runtime

RUN apt-get update && apt-get install -y \
    python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Ubuntu 22.04 ships Python 3.10, so pip installs into this dist-packages path
COPY --from=build /usr/local/lib/python3.10/dist-packages /usr/local/lib/python3.10/dist-packages
COPY app/ /app
WORKDIR /app

ENTRYPOINT ["python3", "model.py"]

Why this order?

Multi-stage build isolates dependencies: the build stage installs everything needed for compilation; the runtime stage contains only what's needed to run the model. This can cut the final image size considerably (40-60% is plausible when the build stage carries compilers and headers) and leaves fewer components for the SBOM to inventory and for you to patch. The pip install happens before COPY app/ to maximize Docker layer caching: if only your code changes, the rebuild is fast.

Wrong vs Right

Wrong way

dockerfile

FROM nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip \
    && pip install torch torchvision numpy pandas scikit-learn \
    && apt-get clean

COPY . /app
WORKDIR /app
CMD ["python3", "train.py"]

# No SBOM generated, and package versions are unpinned, so two builds can contain different dependencies.
# A vulnerability in an installed version (e.g., torch==2.0.0) goes unnoticed until someone scans the image by hand.

Right way

bash

# In CI/CD: build and push under the full registry name
IMAGE=ghcr.io/myorg/my-ml:latest
docker build -t "$IMAGE" .
docker push "$IMAGE"

# Generate SBOM immediately post-push
syft "$IMAGE" -o spdx-json > sbom.spdx.json
syft "$IMAGE" -o cyclonedx-json > sbom.cyclonedx.json

# Vulnerability check: fail the job on High or Critical findings
grype sbom:./sbom.spdx.json --fail-on high

# Record the SBOM digest for traceability
echo "SBOM SHA256: $(sha256sum sbom.spdx.json | cut -d' ' -f1)"

# Store in OCI artifact repository or commit to VCS
git add sbom.spdx.json && git commit -m "SBOM for my-ml:${GIT_SHA}"

# Alternative: let BuildKit generate attestations at build time
# (--push stores the attestations with the image in the registry)
docker buildx build --provenance=true --sbom=true -t "$IMAGE" --push .

Tool vitals

Primary command

bash

syft <image> -o spdx-json > sbom.spdx.json

Config file Dockerfile (multi-stage build; the SBOM itself is generated in CI)

Verify

bash

# Confirm the file parses as SPDX JSON, then count the packages it lists
jq -e '.spdxVersion' sbom.spdx.json && jq '.packages | length' sbom.spdx.json

Integration notes

The SBOM feeds into your supply-chain security workflow: (1) Grype scans the SBOM for known vulnerabilities and fails CI/CD if High/Critical findings appear; (2) license scanners (FOSSA, LicenseFinder) parse the SBOM to detect GPL/AGPL packages that violate corporate policy; (3) artifact attestation (Sigstore) can sign the SBOM and attach it to the image as supply-chain evidence; (4) Kubernetes admission controllers (Kubewarden, OPA) can block image deployment if the SBOM attestation is missing or stale. Connect the SBOM to your supply-chain protection policy (SLSA) so that only images with verified, scanned SBOMs enter production.
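The license check in step (2) can be sketched in a few lines. This is a minimal illustration, not how FOSSA or LicenseFinder work internally; the blocked prefixes, the package names, and the use of the SPDX `licenseDeclared` field are assumptions for the example.

```python
# Assumed policy: SPDX license identifiers starting with these prefixes are blocked.
# "LGPL-..." does not start with "GPL-", so it is not matched by this policy.
BLOCKED_PREFIXES = ("GPL-", "AGPL-")


def find_license_violations(packages):
    """Return names of packages whose declared license starts with a blocked prefix."""
    violations = []
    for pkg in packages:
        license_id = pkg.get("licenseDeclared", "NOASSERTION")
        if license_id.startswith(BLOCKED_PREFIXES):
            violations.append(pkg["name"])
    return violations


# Hypothetical package entries in SPDX style.
packages = [
    {"name": "torch", "licenseDeclared": "BSD-3-Clause"},
    {"name": "example-preproc", "licenseDeclared": "AGPL-3.0-only"},
    {"name": "example-lgpl-lib", "licenseDeclared": "LGPL-2.1-or-later"},
    {"name": "example-unknown"},
]

violations = find_license_violations(packages)
print(violations)
```

A prefix match will miss compound expressions such as "MIT OR GPL-2.0-only" and packages with no declared license, so treat it as a first-pass filter rather than a compliance verdict.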

Migration path

If you outgrow open-source SBOM scanning, migrate to: (1) Snyk Container for continuous monitoring with weekly rescans, (2) Anchore Enterprise for policy-as-code and runtime vulnerability correlation, (3) Aqua for runtime enforcement of SBOM policies. None of these require code changes: they consume your SPDX/CycloneDX SBOM directly. Ensure your SBOM format is standards-compliant (SPDX 2.3+, CycloneDX 1.4+) so you can swap tools without regenerating.
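One way to enforce the format floor mentioned above is a small check on the SBOM's self-declared version. The sketch below assumes the standard top-level JSON fields (`spdxVersion` such as "SPDX-2.3" for SPDX; `bomFormat` and `specVersion` for CycloneDX); the function names are illustrative.

```python
# Minimum versions taken from the text above.
MINIMUMS = {"SPDX": (2, 3), "CycloneDX": (1, 4)}


def parse_version(text):
    """Turn a dotted version string like '2.3' into a comparable tuple (2, 3)."""
    return tuple(int(part) for part in text.split("."))


def check_sbom_format(doc):
    """Return (format, version, ok) for a parsed SPDX or CycloneDX JSON document."""
    if "spdxVersion" in doc:
        fmt = "SPDX"
        version = parse_version(doc["spdxVersion"].split("-", 1)[1])
    elif doc.get("bomFormat") == "CycloneDX":
        fmt = "CycloneDX"
        version = parse_version(doc["specVersion"])
    else:
        raise ValueError("unrecognized SBOM format")
    return fmt, version, version >= MINIMUMS[fmt]


print(check_sbom_format({"spdxVersion": "SPDX-2.2"}))
print(check_sbom_format({"bomFormat": "CycloneDX", "specVersion": "1.5"}))
```

Running this in CI before uploading the SBOM keeps older-format documents from reaching downstream tools that expect the newer schema.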

Cost model

Syft is free (open-source). Grype is free. If using Syft in CI/CD at scale (scanning 100+ images daily), you may hit rate limits on public image registries without authentication. Use registry credentials in your CI runner to avoid throttling. Commercial alternatives charge per scan or per seat: roughly $0.10/scan after the free tier for Snyk and roughly $500/mo for small teams on Anchore Enterprise. These figures are approximate, so check current vendor pricing. Sigstore attestations are free.

Common gotcha

Syft scans the built image's filesystem, not your Dockerfile. It identifies Python packages from their installed metadata (for example, .dist-info directories), so packages copied in without that metadata, or vendored by hand, can be missing from the SBOM. (pip's --no-cache-dir does not affect this; it only disables pip's download cache.) More critically, SBOM generation is asynchronous to your build. If you push an image and generate the SBOM 5 minutes later, a CD pipeline may already have pulled the image before the SBOM exists. In production, you must generate the SBOM *before* marking the image as ready for deployment. A BuildKit SBOM attestation (docker buildx build --attest type=sbom --push, or the shorthand --sbom=true) attaches the SBOM to the image in the registry at build time, which prevents the image and SBOM from diverging.
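The "SBOM before ready" rule can be expressed as a simple gate keyed on the image digest. This is a sketch under stated assumptions: `sbom_index`, `ready_images`, and the digests are hypothetical stand-ins for whatever metadata store your pipeline uses.

```python
class SbomMissingError(Exception):
    """Raised when an image has no SBOM recorded for its digest."""


def mark_ready(image_digest, sbom_index, ready_images):
    """Add the digest to ready_images only if an SBOM exists for that exact digest."""
    if image_digest not in sbom_index:
        raise SbomMissingError(f"no SBOM recorded for {image_digest}")
    ready_images.add(image_digest)
    return sbom_index[image_digest]


# Hypothetical index: image digest -> identifier of the SBOM generated for it.
sbom_index = {"sha256:aaa111": "sbom-for-aaa111"}
ready_images = set()

mark_ready("sha256:aaa111", sbom_index, ready_images)

try:
    mark_ready("sha256:bbb222", sbom_index, ready_images)
    blocked = False
except SbomMissingError:
    blocked = True

print(ready_images, blocked)
```

Keying on the digest rather than the tag matters here: a tag such as latest can be re-pushed, while a digest identifies exactly the bytes the SBOM was generated from.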

Team adoption

Mandate SBOM generation as a gate in your CI/CD pipeline: no image is promoted for deployment without a signed SBOM attestation. Create a team dashboard (e.g., Grafana + attestation API) that shows SBOM coverage, so teams shipping without an SBOM are immediately visible. Run weekly grype scans on your entire registry and publish a vulnerability report. Set a policy: teams must remediate High vulns within 7 days, Critical within 24 hours. For large teams, create a shared "sbom-tool" Makefile target so every team uses identical SBOM configuration. This avoids drift where some teams generate SBOMs with Syft and others with Trivy, in different formats and versions.
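The remediation policy above (High within 7 days, Critical within 24 hours) is easy to check mechanically. The sketch below assumes each finding is a dict with `id`, `severity`, and `detected_at` keys; that shape and the example IDs are hypothetical, not the output format of any particular scanner.

```python
from datetime import datetime, timedelta

# Remediation windows from the policy in the text.
SLA = {"Critical": timedelta(hours=24), "High": timedelta(days=7)}


def overdue_findings(findings, now):
    """Return IDs of findings that have been open longer than their severity's SLA."""
    overdue = []
    for finding in findings:
        limit = SLA.get(finding["severity"])
        if limit is not None and now - finding["detected_at"] > limit:
            overdue.append(finding["id"])
    return overdue


now = datetime(2026, 3, 15, 12, 0)
findings = [
    {"id": "EXAMPLE-1", "severity": "Critical", "detected_at": datetime(2026, 3, 14, 11, 0)},
    {"id": "EXAMPLE-2", "severity": "High", "detected_at": datetime(2026, 3, 10, 12, 0)},
    {"id": "EXAMPLE-3", "severity": "High", "detected_at": datetime(2026, 3, 1, 12, 0)},
    {"id": "EXAMPLE-4", "severity": "Medium", "detected_at": datetime(2026, 1, 1, 0, 0)},
]

overdue = overdue_findings(findings, now)
print(overdue)
```

Severities without an entry in SLA (Medium here) are never reported as overdue, which matches a policy that only sets deadlines for High and Critical.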

Experienced dev note

Most teams generate an SBOM but never *use* it; it sits in a bucket unused. The real power is in a grype --fail-on medium gate in your CI/CD pipeline combined with a baseline SBOM diff. Store your baseline SBOM for the main branch, then on PRs generate a new SBOM and compare the two package lists (Syft has no built-in diff command, so use a small script or a dedicated SBOM diff tool). This surfaces new or changed dependencies, and the vulnerabilities they bring, *before* merge, not days later. Additionally, pin your SBOM format version in your policy: teams using mixed SPDX 2.2/2.3 with CycloneDX 1.3/1.4 create parsing chaos downstream. Consider standardizing on CycloneDX 1.4+ if you integrate with Kubernetes supply-chain security (it has a richer package type taxonomy).
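A baseline-versus-PR comparison like the one described above can be a short script. This sketch assumes both SBOMs are SPDX JSON already parsed into dicts, and compares packages by name and `versionInfo` only; the sample data is illustrative.

```python
def package_versions(doc):
    """Map package name -> version for an SPDX-style document."""
    return {
        pkg["name"]: pkg.get("versionInfo", "unknown")
        for pkg in doc.get("packages", [])
    }


def diff_sboms(baseline, candidate):
    """Report packages added, removed, or changed between two SBOM documents."""
    old, new = package_versions(baseline), package_versions(candidate)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(
            (name, old[name], new[name])
            for name in set(old) & set(new)
            if old[name] != new[name]
        ),
    }


baseline = {"packages": [
    {"name": "torch", "versionInfo": "2.0.0"},
    {"name": "numpy", "versionInfo": "1.24.4"},
    {"name": "pandas", "versionInfo": "2.0.3"},
]}
candidate = {"packages": [
    {"name": "torch", "versionInfo": "2.1.0"},
    {"name": "numpy", "versionInfo": "1.24.4"},
    {"name": "requests", "versionInfo": "2.31.0"},
]}

result = diff_sboms(baseline, candidate)
print(result)
```

Because packages are keyed by name, an image that contains two versions of the same package collapses to one entry; key on (name, version) pairs instead if that case matters to you.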

Check your understanding

You have two images: image-v1 with torch==2.0.0 was deployed 30 days ago, image-v2 with torch==2.1.0 is in staging. A critical CVE is announced for torch==2.0.1. Your SBOM for image-v1 shows torch==2.0.0. Why is this insufficient to declare image-v1 safe, and what data point would you need from your SBOM to be certain?

Answer hint

torch==2.0.0 may depend on older versions of the CUDA runtime or numpy that have transitive CVEs. Also, an advisory that names torch==2.0.1 may define an affected range that includes 2.0.0, so compare the SBOM's version against the advisory's range rather than the headline version. An SBOM tells you direct and transitive dependencies: in CycloneDX, check the externalReferences field and the dependencies array. The real gotcha: torch wheels bundle or pull in native libraries (cuBLAS, cuDNN), so the SBOM must capture the bundled CUDA version as a separate component, not just the Python package. If your SBOM lacks CUDA component granularity, you cannot reason about native library vulns.
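A quick way to test for the CUDA granularity described above is to look for native GPU components listed as their own SBOM entries. This sketch matches on name substrings, which is a heuristic assumption rather than a standard; the package entries are illustrative.

```python
# Substrings that suggest a native CUDA-related component (heuristic, not exhaustive).
NATIVE_HINTS = ("cuda", "cudnn", "cublas")


def native_cuda_components(packages):
    """Return sorted (name, version) pairs for packages that look like CUDA libraries."""
    return sorted(
        (pkg["name"], pkg.get("versionInfo", "unknown"))
        for pkg in packages
        if any(hint in pkg["name"].lower() for hint in NATIVE_HINTS)
    )


# Illustrative package entries; names and versions are examples only.
with_cuda = [
    {"name": "torch", "versionInfo": "2.0.0"},
    {"name": "nvidia-cudnn-cu11", "versionInfo": "8.5.0.96"},
    {"name": "nvidia-cublas-cu11", "versionInfo": "11.10.3.66"},
]
without_cuda = [{"name": "torch", "versionInfo": "2.0.0"}]

found = native_cuda_components(with_cuda)
missing = native_cuda_components(without_cuda)
print(found, missing)
```

An empty result for an image you know runs on GPU is a signal that the SBOM is too coarse to reason about native library vulnerabilities.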
