PodEvicted
kubernetes.client.rest.ApiException: PodEvicted due to GPU memory pressure
Stack trace
Warning Evicted Pod The node was low on resource: nvidia.com/gpu-memory. Container <container-name> was using more GPU memory than allowed, pod evicted.
Why it happens
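GPU memory on a node is finite. When several pods compete for the same GPU memory, or a pod uses more GPU memory than its limit allows, pods can be evicted to free resources. Note that the kubelet's built-in eviction signals cover node memory, disk, and process IDs, and the stock NVIDIA device plugin advertises only whole GPUs as nvidia.com/gpu. A nvidia.com/gpu-memory resource, and evictions based on it, therefore appear only in clusters running a device plugin or scheduler extension that tracks GPU memory and reports usage above the node's allocatable amount.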
Detection
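Watch Kubernetes events with reason 'Evicted' (an evicted pod ends up with status.phase 'Failed' and status.reason 'Evicted'). Then check the node's GPU memory usage in the period before the pod was terminated, using the metrics exposed by your GPU device plugin or by a GPU metrics exporter such as NVIDIA DCGM exporter.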
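As a rough illustration of the detection step, the sketch below filters pod objects for the status the kubelet sets on eviction. The helper names evicted_pods and list_evicted_pods are hypothetical, and matching on the text "gpu" in the eviction message is an assumption about how the message is worded in your cluster.

```python
def evicted_pods(pods, keyword="gpu"):
    """Return (namespace, name, message) for evicted pods whose message mentions keyword."""
    found = []
    for pod in pods:
        status = pod.status
        # Evicted pods are left in phase "Failed" with reason "Evicted".
        if status.phase != "Failed" or status.reason != "Evicted":
            continue
        message = status.message or ""
        # Assumption: a GPU-related eviction mentions the keyword in its message.
        if keyword.lower() in message.lower():
            found.append((pod.metadata.namespace, pod.metadata.name, message))
    return found


def list_evicted_pods(keyword="gpu"):
    """Query the cluster selected by the current kubeconfig context."""
    # Imported here so the filter above can be used without a cluster.
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()
    return evicted_pods(v1.list_pod_for_all_namespaces().items, keyword)
```

Calling list_evicted_pods() needs read access to pods in all namespaces. Passing keyword="" returns every evicted pod regardless of the resource named in the message.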
Causes & fixes
Pod requests or limits for GPU memory are not set or underestimated, allowing pods to consume excessive GPU memory.
Set GPU memory requests and limits in the container's resources to match the workload's real peak usage, so the node is not overcommitted.
Multiple GPU-intensive pods scheduled on the same node exceed the node's total GPU memory capacity.
Use node affinity or taints/tolerations to isolate GPU workloads or limit pod concurrency per GPU node.
The NVIDIA device plugin version is outdated and does not report GPU memory usage correctly, causing improper eviction decisions.
Upgrade the NVIDIA device plugin to the latest stable version that supports GPU memory metrics.
Pod or container is leaking GPU memory or running inefficient GPU workloads causing unexpected memory spikes.
Profile and optimize GPU workloads to reduce memory usage and prevent leaks.
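One way to start profiling for leaks is to sample GPU memory on the node over time and look for steady growth. The sketch below parses the CSV output of nvidia-smi and applies a simple heuristic. The function names and the 256 MiB growth threshold are assumptions chosen for illustration, not values from any tool.

```python
import subprocess

# nvidia-smi query that prints one CSV row per GPU, values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse nvidia-smi CSV rows into {gpu_index: (used_mib, total_mib)}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        # Raises ValueError if a field is not numeric, e.g. "[N/A]".
        usage[int(index)] = (int(used), int(total))
    return usage


def sample_gpu_memory():
    """Run nvidia-smi on this machine and return the parsed usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


def looks_like_leak(samples, min_growth_mib=256):
    """Heuristic: used memory never drops and grows by at least min_growth_mib."""
    if len(samples) < 2:
        return False
    never_drops = all(later >= earlier for earlier, later in zip(samples, samples[1:]))
    return never_drops and samples[-1] - samples[0] >= min_growth_mib
```

A steady climb across many samples is only a hint. Framework caching allocators can hold memory without leaking it, so confirm with the profiler that belongs to your framework.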
Code: broken vs fixed
# Broken: only a whole GPU is requested; no GPU memory request or limit is set.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-pod"),
    spec=client.V1PodSpec(
        containers=[client.V1Container(
            name="gpu-container",
            image="my-gpu-image",
            resources=client.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "1"}  # Missing GPU memory limits
            )
        )]
    )
)
v1.create_namespaced_pod(namespace="default", body=pod)  # This pod may get evicted due to GPU memory pressure

# Fixed: GPU memory is requested and limited explicitly.
# Note: this only works if the cluster advertises nvidia.com/gpu-memory as an
# extended resource; the stock NVIDIA device plugin exposes nvidia.com/gpu only.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-pod"),
    spec=client.V1PodSpec(
        containers=[client.V1Container(
            name="gpu-container",
            image="my-gpu-image",
            resources=client.V1ResourceRequirements(
                # Requests and limits must match for extended resources.
                limits={"nvidia.com/gpu": "1", "nvidia.com/gpu-memory": "8Gi"},  # Added GPU memory limit
                requests={"nvidia.com/gpu": "1", "nvidia.com/gpu-memory": "8Gi"}  # Added GPU memory request
            )
        )]
    )
)
v1.create_namespaced_pod(namespace="default", body=pod)  # Pod now requests and limits GPU memory
print("Pod created with GPU memory limits to prevent eviction.")
Workaround
Catch pod eviction events and implement a retry mechanism with exponential backoff to reschedule pods on nodes with available GPU memory.
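The retry idea can be sketched as follows. This is a minimal illustration, not a complete controller: submit and was_evicted are hypothetical callables you would supply (for example, one that creates the pod and waits for it to finish, and one that checks for status.reason 'Evicted'), and the default delays are arbitrary.

```python
import time


def backoff_delays(base=1.0, factor=2.0, max_delay=60.0, retries=5):
    """Yield one delay per retry: base, base*factor, ... capped at max_delay."""
    delay = base
    for _ in range(retries):
        yield min(delay, max_delay)
        delay *= factor


def run_with_retry(submit, was_evicted, retries=5, base=1.0, factor=2.0,
                   max_delay=60.0, sleep=time.sleep):
    """Call submit(); while the result counts as evicted, wait and resubmit.

    Returns the first result that was not evicted. Raises RuntimeError if the
    result is still evicted after the given number of retries.
    """
    result = submit()
    for delay in backoff_delays(base, factor, max_delay, retries):
        if not was_evicted(result):
            return result
        sleep(delay)  # Wait longer after each consecutive eviction.
        result = submit()
    if was_evicted(result):
        raise RuntimeError(f"still evicted after {retries} retries")
    return result
```

Backoff only spaces out the attempts. Where the new pod lands is still up to the scheduler, so this helps most when GPU memory pressure on the cluster is temporary.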
Prevention
Use device plugin metrics together with a scheduler extender to enforce strict GPU memory accounting, and isolate GPU workloads on nodes with sufficient capacity.
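The isolation part can be sketched with a node selector and a toleration. The example below builds a plain pod manifest as a dictionary. The node label gpu-pool=dedicated and the taint key nvidia.com/gpu are assumptions about how your GPU nodes are labeled and tainted, and the function name is hypothetical.

```python
def gpu_pod_manifest(name, image, gpus=1,
                     node_label=("gpu-pool", "dedicated"),
                     taint_key="nvidia.com/gpu"):
    """Build a pod manifest pinned to dedicated GPU nodes.

    node_label and taint_key are illustrative; use the label and taint
    actually applied to the GPU nodes in your cluster.
    """
    label_key, label_value = node_label
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Schedule only onto nodes carrying the GPU pool label.
            "nodeSelector": {label_key: label_value},
            # Tolerate the taint that keeps non-GPU workloads off those nodes.
            "tolerations": [
                {"key": taint_key, "operator": "Exists", "effect": "NoSchedule"}
            ],
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Whole-GPU limit; extended resource values are strings.
                    "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
                }
            ],
        },
    }
```

The returned dictionary can be passed as the body argument of CoreV1Api.create_namespaced_pod. Requesting whole GPUs keeps each pod on its own device as long as GPU sharing (time-slicing or similar) is not enabled on the node.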