Critical severity · Intermediate · Fix: 15-30 min

PodEvicted

kubernetes.client.rest.ApiException: PodEvicted due to GPU memory pressure

What this error means

The pod was evicted because GPU memory demand on its node exceeded what the node could supply, so your ML workload terminated unexpectedly.

Stack trace

traceback

Warning  Evicted  Pod  The node was low on resource: nvidia.com/gpu-memory. Container <container-name> was using more GPU memory than allowed, pod evicted.

QUICK FIX

Set explicit GPU memory requests and limits in pod specs and isolate GPU workloads to nodes with sufficient GPU memory.

Why it happens

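GPU memory on a node is finite. When several pods compete for it, or one pod uses more than its limit, pods on that node are evicted to free resources. GPU memory is not one of the kubelet's built-in eviction signals (those cover node memory, disk, and PIDs), and the stock NVIDIA device plugin advertises whole GPUs as nvidia.com/gpu rather than GPU memory. The nvidia.com/gpu-memory resource shown on this page therefore assumes a device plugin or GPU-sharing add-on that advertises GPU memory as an extended resource and reports usage against its allocatable amount.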

Detection

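Watch Kubernetes events for the reason 'Evicted', and look for pods whose status shows phase 'Failed' with reason 'Evicted'. To see the pressure building before a pod is terminated, check node GPU memory usage metrics, for example from the NVIDIA DCGM exporter or whichever GPU metrics exporter runs on your nodes.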
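Evicted pods stay in the API with phase Failed and reason Evicted until they are cleaned up, so they can be listed directly. The following is a minimal sketch, assuming a working kubeconfig; the helper names `is_evicted` and `list_evicted_pods` are illustrative and not part of the kubernetes client.

```python
def is_evicted(pod) -> bool:
    """True when the pod object reports phase Failed with reason Evicted."""
    status = pod.status
    return status.phase == "Failed" and status.reason == "Evicted"


def list_evicted_pods():
    """Return (namespace, name, message) for every evicted pod in the cluster."""
    # Imported lazily so is_evicted() can be used without the client installed.
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()
    # status.phase is a supported pod field selector; the reason is filtered client-side.
    pods = v1.list_pod_for_all_namespaces(field_selector="status.phase=Failed")
    return [
        (p.metadata.namespace, p.metadata.name, p.status.message)
        for p in pods.items
        if is_evicted(p)
    ]
```

Calling `list_evicted_pods()` and printing each tuple gives a quick view of which workloads were evicted and the kubelet's stated reason in the message field.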

Causes & fixes

Pod requests or limits for GPU memory are not set or underestimated, allowing pods to consume excessive GPU memory.

✓ Fix

Set GPU memory requests and limits in each container's resources block, sized from measured peak usage rather than a guess. For extended resources such as these, the request must equal the limit.

Multiple GPU-intensive pods scheduled on the same node exceed the node's total GPU memory capacity.

✓ Fix

Use node affinity or taints/tolerations to isolate GPU workloads or limit pod concurrency per GPU node.
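As a rough illustration of the isolation approach, the sketch below builds a pod manifest that only lands on dedicated GPU nodes. It uses nodeSelector, the simplest form of node affinity, plus a toleration. The label `gpu-pool=true` and the taint key `nvidia.com/gpu` with effect NoSchedule are assumptions about how the GPU nodes were set up, and `gpu_pod_manifest` is a hypothetical helper.

```python
def gpu_pod_manifest(name: str, image: str, pool_label: str = "gpu-pool") -> dict:
    """Build a pod manifest pinned to tainted GPU nodes.

    Assumptions (hypothetical, adjust to your cluster):
      - GPU nodes carry the label  <pool_label>=true
      - GPU nodes carry the taint  nvidia.com/gpu:NoSchedule
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Only nodes with the GPU pool label are eligible.
            "nodeSelector": {pool_label: "true"},
            # Allow scheduling onto nodes that repel non-GPU workloads.
            "tolerations": [
                {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
            ],
            "containers": [
                {
                    "name": "gpu-container",
                    "image": image,
                    "resources": {"limits": {"nvidia.com/gpu": "1"}},
                }
            ],
        },
    }
```

The returned dict can be passed as the `body` argument of `create_namespaced_pod`, which accepts plain dict manifests as well as `V1Pod` objects.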

The NVIDIA device plugin version is outdated and does not report GPU memory usage correctly, causing improper eviction decisions.

✓ Fix

Upgrade the NVIDIA device plugin to the latest stable version that supports GPU memory metrics.

Pod or container is leaking GPU memory or running inefficient GPU workloads causing unexpected memory spikes.

✓ Fix

Profile and optimize GPU workloads to reduce memory usage and prevent leaks.
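One lightweight way to start profiling is to sample per-GPU memory with nvidia-smi and check whether usage only ever climbs. The sketch below assumes nvidia-smi is on PATH inside the container or on the node. The function names and the "steady growth" heuristic are illustrative and are not a substitute for a real profiler.

```python
import subprocess


def parse_gpu_memory(csv_text: str) -> list[tuple[int, int, int]]:
    """Parse 'index, used, total' rows (MiB) into integer tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def sample_gpu_memory() -> list[tuple[int, int, int]]:
    """Query nvidia-smi once; requires the NVIDIA driver tools on PATH."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)


def grows_steadily(used_samples: list[int], min_growth_mib: int = 1) -> bool:
    """Heuristic leak signal: every sample is larger than the previous one."""
    return len(used_samples) >= 3 and all(
        later - earlier >= min_growth_mib
        for earlier, later in zip(used_samples, used_samples[1:])
    )
```

Collecting the `used` value from `sample_gpu_memory()` at a fixed interval and passing the series to `grows_steadily` flags workloads whose GPU memory never plateaus, which is a reasonable cue to inspect them more closely.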

Code: broken vs fixed

Broken - triggers the error

python

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-pod"),
    spec=client.V1PodSpec(
        containers=[client.V1Container(
            name="gpu-container",
            image="my-gpu-image",
            resources=client.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "1"}  # Missing GPU memory limits
            )
        )]
    )
)
v1.create_namespaced_pod(namespace="default", body=pod)  # This pod may get evicted due to GPU memory pressure

Fixed - works correctly

python

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-pod"),
    spec=client.V1PodSpec(
        containers=[client.V1Container(
            name="gpu-container",
            image="my-gpu-image",
            resources=client.V1ResourceRequirements(
                # Extended resources: the request must equal the limit. Assumes a node
                # advertises "nvidia.com/gpu-memory"; otherwise the pod stays Pending.
                limits={"nvidia.com/gpu": "1", "nvidia.com/gpu-memory": "8Gi"},
                requests={"nvidia.com/gpu": "1", "nvidia.com/gpu-memory": "8Gi"},
            )
        )]
    )
)
v1.create_namespaced_pod(namespace="default", body=pod)  # Pod now requests and limits GPU memory
print("Pod created with explicit GPU memory requests and limits.")

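The fixed version sets explicit GPU memory requests and limits, so the scheduler only places the pod on a node with that much GPU memory unclaimed. This reduces the risk of eviction under GPU memory pressure, but it does not help if the workload itself uses more than the 8Gi it asked for.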

⚠ Workaround

Catch pod eviction events and implement a retry mechanism with exponential backoff to reschedule pods on nodes with available GPU memory.
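The retry loop can be sketched independently of the cluster calls. In the example below, `create_pod` and `wait_for_outcome` are hypothetical callables you supply (for instance, wrappers around `create_namespaced_pod` and a status poll). The delay values are arbitrary starting points.

```python
import time


def run_with_backoff(create_pod, wait_for_outcome, max_attempts=5,
                     base_delay=2.0, max_delay=60.0, sleep=time.sleep):
    """Recreate a pod after eviction, doubling the wait between attempts.

    create_pod()        -> submits the pod (hypothetical callable you provide)
    wait_for_outcome()  -> returns "Evicted" or any other terminal state
    Returns the number of attempts used; raises RuntimeError when they run out.
    """
    for attempt in range(1, max_attempts + 1):
        create_pod()
        outcome = wait_for_outcome()
        if outcome != "Evicted":
            return attempt
        if attempt < max_attempts:
            # 2s, 4s, 8s, ... capped at max_delay
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay)
    raise RuntimeError(f"pod evicted on all {max_attempts} attempts")
```

An evicted pod keeps its name until it is deleted, so `create_pod` should either delete the failed pod first or rely on `generateName` to avoid a name conflict. The `sleep` parameter is injectable so the loop can be tested without real waiting.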

✓ Prevention

Use device plugin metrics together with a scheduler extender to enforce strict GPU memory accounting, and isolate GPU workloads on nodes with sufficient capacity.
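The accounting itself is simple arithmetic: what a node advertises as allocatable minus what its pods have already claimed. The sketch below is not a scheduler extender, only an illustration of that check. `gpu_headroom` is a hypothetical helper, and it handles plain integer quantities only.

```python
def gpu_headroom(allocatable: dict, pod_limits: list[dict],
                 resource: str = "nvidia.com/gpu") -> int:
    """Return allocatable minus already-claimed units of one extended resource.

    allocatable: a node's status.allocatable as a plain dict of strings
    pod_limits:  the resources.limits dict of each container already on the node
    Only plain integer quantities are handled; suffixed values such as "8Gi"
    are not parsed.
    """
    total = int(allocatable.get(resource, "0"))
    claimed = sum(int(limits.get(resource, "0")) for limits in pod_limits)
    return total - claimed
```

A result of zero or less means the node has no unclaimed units left, so any further GPU pod placed there would have to share or wait.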

Python 3.9+ · kubernetes >=12.0.0 · tested on 22.0.0

Verified 2026-04
