绿色记忆:Kubernetes GPU Sharing

GPU sharing in Kubernetes depends on what the NVIDIA device plugin advertises to the scheduler, what isolation the underlying mechanism really provides, and what the installed hardware can support. This note uses a real production scheduling stall to walk through the GPU inventory, the practical differences between Time-Slicing and MIG, the constraints imposed by the current cluster hardware, and the rollout that expanded schedulable GPU slots from 2 to 11.

Background

GPUs are expensive and scarce. Kubernetes treats them as exclusive resources by default, so one Pod usually occupies one whole physical GPU even when the workload uses only a small fraction of the card. As concurrent GPU-backed services grew, especially for LLM inference, NLP, and PII detection, that default model turned into both wasted capacity and a concrete scheduling problem.

The production cluster had 8 nodes, 2 of them with GPUs. An NLP inference service was scaled to multiple replicas, each requesting nvidia.com/gpu: 1. Both physical GPUs were already occupied, so new replicas sat in Pending for more than 28 days with repeated events such as 0/8 nodes are available: 8 Insufficient nvidia.com/gpu. That failure forced a deeper evaluation of GPU sharing options.

NVIDIA exposes two mainstream sharing paths for Kubernetes: Time-Slicing and MIG. They solve different problems. They also depend on very different hardware assumptions, which means the hardware survey has to come first.

Cluster Survey

Node And GPU Inventory

All 8 nodes in the cluster were Ready and running on Tencent Cloud TKE. GPU nodes were identified through the node label nvidia-device-enable=enable, then a privileged Pod entered the host namespace and ran nvidia-smi -q to confirm the exact models:

Instance Type	GPU Model	Architecture	Memory	Compute Capability	Driver	MIG Support
PNV5b.8XLARGE96	NVIDIA L20	Ada Lovelace	46068 MiB	8.9	570.158.01	No
GN7.2XLARGE32	Tesla T4	Turing	15360 MiB	7.5	570.158.01	No

Both nodes were on the same driver release and neither card supported MIG. That distinction matters. The L20 has a higher Compute Capability than Ampere, but MIG support does not track Compute Capability monotonically. The Ada Lovelace L-series does not support MIG, while Ampere parts such as A100 and A30, and Hopper parts such as H100 and H20, do.
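Because MIG support follows the product line rather than the Compute Capability number, a capability check is better written as an explicit model lookup than as a numeric comparison. The sketch below is illustrative only: the function name supports_mig is hypothetical, and the model list covers just the parts named in this note.

```shell
# Sketch: MIG capability keyed by GPU model, not by Compute Capability.
# The model list mirrors the parts named in this note; it is not exhaustive.
supports_mig() {
  case "$1" in
    A100|A30|H100|H20) echo yes ;;
    L20|T4)            echo no ;;
    *)                 echo unknown ;;
  esac
}

supports_mig L20    # prints "no" even though its Compute Capability is 8.9
supports_mig A100   # prints "yes" at Compute Capability 8.0
```

Returning unknown for unlisted models is deliberate: it forces a look at NVIDIA's supported-GPU list instead of guessing from the architecture name.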

Existing Device Plugin State

The cluster was running the TKE-provided nvidia-device-plugin:v0.14.5 as a DaemonSet with the startup arguments --mig-strategy=single --fail-on-init-error=false --pass-device-specs=true. There was no --config-file flag and no Time-Slicing ConfigMap. Each GPU node registered exactly one nvidia.com/gpu resource with Kubernetes, so the cluster had 2 schedulable GPU slots in total, both already consumed. The DaemonSet metadata included meta.helm.sh/release-name: nvidia-gpu, which showed that it had originally been installed through Helm even though Helm CLI was not present in the current environment.

Mechanisms

Time-Slicing

Time-Slicing is an oversubscription feature implemented by the NVIDIA k8s-device-plugin. An administrator defines a replica count for each GPU resource in a ConfigMap, and the device plugin advertises that GPU to Kubernetes as multiple schedulable resources. Under the hood, workloads share the same physical GPU and CUDA time-slices execution across processes.

Kubernetes itself does not understand the semantics of GPU sharing. It only sees whatever extended resources the plugin exposes, such as nvidia.com/gpu or nvidia.com/gpu.shared. From the Pod's point of view, the declaration pattern does not change: GPU resources belong in resources.limits, and if requests is also present, it must match the limit value.
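As a minimal illustration of that declaration pattern, the shell helper below prints a Pod manifest that asks for one GPU through resources.limits. The helper name render_gpu_pod, the Pod name, and the image argument are placeholders assumed for the example, not values taken from the cluster.

```shell
# Hypothetical helper: prints a minimal Pod manifest that requests one GPU.
# $1 = Pod name, $2 = container image (both chosen by the caller).
render_gpu_pod() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: $2
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # requests, if set, must equal this value
EOF
}

# Usage against a real cluster (not executed here):
#   render_gpu_pod gpu-smoke-test <cuda-image> | kubectl apply -f -
```

The same manifest works whether the advertised nvidia.com/gpu unit is a whole card or a Time-Slicing replica, which is exactly why the scheduler cannot tell the two apart.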

The tradeoff is blunt. Time-Slicing does not isolate memory, so all replicas on the same GPU share one physical memory pool. It also does not guarantee a proportional share of compute. Asking for multiple shared GPUs does not mean the workload gets a linear share of throughput. NVIDIA recommends failRequestsGreaterThanOne: true so that a request larger than 1 fails with UnexpectedAdmissionError instead of creating the false impression of an exclusive quota.

The upside is operational simplicity. On existing non-MIG hardware, it usually takes only a ConfigMap plus a device-plugin restart. The downside is weak isolation, a shared fault domain, and limited observability. In Time-Slicing mode, DCGM-Exporter cannot reliably attribute metrics to individual containers and mostly reports at the physical GPU level.

MIG

MIG, or Multi-Instance GPU, is NVIDIA's hardware partitioning model introduced on Ampere-class GPUs and later architectures that support it. A physical GPU is divided into GPU Instances, each with its own SM slices, memory partition, cache and bandwidth share, DMA engines, and hardware fault boundary. That is the isolation Time-Slicing cannot provide.

How MIG appears in Kubernetes depends on the configured strategy. Under single, the resource name stays as nvidia.com/gpu, but each advertised unit maps to one same-profile MIG instance. Under mixed, each MIG profile is exposed as its own resource type such as nvidia.com/mig-1g.12gb. That model is far cleaner for isolation, but it depends on MIG-capable hardware. Changing MIG profiles also requires node-level maintenance. On Hopper, GPU reset support makes this less disruptive than it was on Ampere, but it is still not a zero-touch change.
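The naming rule can be condensed into a small lookup. This is only a sketch of the rule as described above; the function name mig_resource_name is hypothetical, and the profile string is whatever the node actually exposes.

```shell
# Sketch of the naming rule: which extended resource a Pod has to request.
# $1 = MIG strategy (none|single|mixed), $2 = MIG profile such as 1g.12gb.
mig_resource_name() {
  case "$1" in
    none|single) echo "nvidia.com/gpu" ;;
    mixed)       echo "nvidia.com/mig-$2" ;;
    *)           echo "unknown strategy: $1" >&2; return 1 ;;
  esac
}

mig_resource_name single 1g.12gb   # prints nvidia.com/gpu
mig_resource_name mixed 1g.12gb    # prints nvidia.com/mig-1g.12gb
```

Under single, existing workloads keep their manifests unchanged; under mixed, every Pod spec has to name the profile it wants.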

Comparison

Dimension	Time-Slicing	MIG
Memory isolation	None, all replicas share physical memory	Hardware-level, per-instance
Fault domain	Shared within one physical GPU	Isolated at the instance level
Kubernetes resource shape	nvidia.com/gpu or nvidia.com/gpu.shared	single uses nvidia.com/gpu; mixed uses nvidia.com/mig-*
Workload changes	Often none when renameByDefault=false	mixed requires Pods to request explicit MIG resources
Hardware support	Broad, works on existing full-GPU resources	Only on MIG-capable GPU models
Metric attribution	Mostly physical-GPU level	Can be modeled around MIG resources
Operational complexity	Low, usually just ConfigMap plus plugin restart	Moderate, requires lifecycle management for GPU Instances
Composition	Can be applied to full GPUs and to mixed MIG resources	Can serve as the lower-level partitioning layer before Time-Slicing

H20 And MIG

The H20 is a Hopper GH100 part with Compute Capability 9.0 and 96 GB of HBM3. NVIDIA's MIG documentation lists it as supporting up to 7 MIG instances. That makes it a relevant medium-term target even though it was not present in the current cluster.

Typical H20 partition shapes look like this:

Profile	SM Share	Memory	Instances Per Card	Typical Use
1g.12gb	1/7	12GB	7	Inference for models up to roughly 7B
2g.24gb	2/7	24GB	3	Mid-size models around 13B
3g.47gb	3/7	47GB	2	Models around 30B
4g.47gb	4/7	47GB	1	Single larger model instance
7g.94gb	7/7	94GB	1	Full-card style allocation for 70B+

Time-Slicing and MIG can be combined. A common pattern is to partition the H20 with MIG first, then oversubscribe a specific MIG resource with Time-Slicing. In that case the ConfigMap needs migStrategy: mixed and the resource name must target the MIG profile directly:

sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/mig-1g.12gb
      replicas: 2
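When the two layers are stacked, the number of schedulable units is the product of the MIG instance count and the Time-Slicing replica count. The helper below is a back-of-the-envelope sketch; advertised_slots is a hypothetical name, and the figures come from the 1g.12gb row of the profile table.

```shell
# Advertised slots = MIG instances of the profile x Time-Slicing replicas x cards.
# $1 = MIG instances per card, $2 = replicas from the ConfigMap, $3 = cards (default 1).
advertised_slots() {
  echo $(( $1 * $2 * ${3:-1} ))
}

advertised_slots 7 2   # prints 14: seven 1g.12gb instances, each shared twice
```

Memory isolation still holds between MIG instances, but the two replicas that share one instance compete for the same 12 GB partition.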

Configuration Reference

Time-Slicing ConfigMap By GPU Model

The ConfigMap can hold multiple keys, with each key representing one node configuration. any acts as the fallback. Other keys are selected through node labels:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: kube-system
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: true
        resources:
        - name: nvidia.com/gpu
          replicas: 2
  l20: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: true
        resources:
        - name: nvidia.com/gpu
          replicas: 8
  t4: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: true
        resources:
        - name: nvidia.com/gpu
          replicas: 3

With renameByDefault: false, the resource name stays as nvidia.com/gpu. Where GPU Feature Discovery is running, the product label it manages then picks up a -SHARED suffix, for example nvidia.com/gpu.product=Tesla-T4-SHARED, which makes it possible to distinguish shared and non-shared nodes with selectors. Replica counts were chosen from measured per-process memory usage, described later in the rollout record.

Device Plugin DaemonSets By GPU Model

The bare v0.14.5 DaemonSet cannot pick a config per node from a node label the way operator-managed deployments do, because it runs without the config-manager sidecar. The practical solution was to run two DaemonSets, each with its own --config-file and its own node selector:

# Label nodes by GPU model
kubectl label node <l20-node> nvidia.com/device-plugin.config=l20
kubectl label node <t4-node> nvidia.com/device-plugin.config=t4

# Patch the existing DaemonSet so it runs only on T4 nodes and uses the t4 config.
# args/3 is the --config-file argument appended after the three original arguments.
kubectl patch daemonset nvidia-device-plugin-daemonset -n kube-system --type=json -p='[
  {"op":"replace","path":"/spec/template/spec/containers/0/args/3",
   "value":"--config-file=/etc/nvidia/time-slicing-config/t4"},
  {"op":"add","path":"/spec/template/spec/nodeSelector/nvidia.com~1device-plugin.config",
   "value":"t4"}
]'

kubectl rollout restart daemonset/nvidia-device-plugin-daemonset -n kube-system

A dedicated L20 DaemonSet then reused the same ConfigMap but pointed at the l20 key:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: <l20-daemonset-name>
  namespace: kube-system
spec:
  selector:
    matchLabels:
      <label-key>: <label-value>
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        <label-key>: <label-value>
    spec:
      nodeSelector:
        nvidia-device-enable: enable
        nvidia.com/device-plugin.config: l20
      tolerations:
      - operator: Exists
      priorityClassName: system-node-critical
      containers:
      - name: nvidia-device-plugin-ctr
        image: sgccr.ccs.tencentyun.com/tkeimages/nvidia-device-plugin:v0.14.5
        command: [nvidia-device-plugin]
        args:
        - --fail-on-init-error=false
        - --mig-strategy=single
        - --pass-device-specs=true
        - --config-file=/etc/nvidia/time-slicing-config/l20
        env:
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: utility,compute
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
        securityContext:
          capabilities:
            drop: [ALL]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: time-slicing-config
          mountPath: /etc/nvidia/time-slicing-config
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: time-slicing-config
        configMap:
          name: time-slicing-config

MIG On Future H20 Nodes

On Hopper hardware, MIG reconfiguration is less painful than it was on earlier generations because GPU reset support is better. The operational sequence is still node maintenance first, GPU reconfiguration second, scheduler re-entry last:

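kubectl drain <h20-node> --ignore-daemonsets --delete-emptydir-data

# SSH into the H20 node
# Enable MIG mode
sudo nvidia-smi -mig 1
# Create seven GPU Instances from profile ID 19; -C also creates their Compute Instances
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
# List the resulting MIG devices
nvidia-smi -L

kubectl uncordon <h20-node>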

From the Kubernetes side, the three MIG strategies remain single, mixed, and none. single keeps the traditional nvidia.com/gpu resource shape when all instances on a node share one profile. mixed exposes explicit nvidia.com/mig-* resources and requires workloads to request them directly. For a new MIG deployment, Helm is the cleaner path:

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --version=0.17.1 \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set migStrategy=single \
  --set gfd.enabled=true

Rollout Record

Step 1: Inspect The Baseline

# Check the current DaemonSet arguments
kubectl get daemonset nvidia-device-plugin-daemonset -n kube-system -o yaml | grep -A 15 "containers:"

# Check whether any Time-Slicing ConfigMap already exists
kubectl get configmap -n kube-system | grep nvidia

# Check current GPU slot counts
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu"

At baseline there was no ConfigMap, no --config-file, and only one schedulable GPU slot per GPU node.

Step 2: Create The ConfigMap And Patch The DaemonSet

The first pass used the any key with replicas=2. The goal at that stage was not model-specific tuning. It was to verify that the plugin picked up Time-Slicing at all.

kubectl apply -f time-slicing-config.yaml

kubectl patch daemonset nvidia-device-plugin-daemonset -n kube-system --type=json -p='[
  {"op":"add","path":"/spec/template/spec/volumes/-",
   "value":{"name":"time-slicing-config","configMap":{"name":"time-slicing-config"}}},
  {"op":"add","path":"/spec/template/spec/containers/0/volumeMounts/-",
   "value":{"name":"time-slicing-config","mountPath":"/etc/nvidia/time-slicing-config"}},
  {"op":"add","path":"/spec/template/spec/containers/0/args/-",
   "value":"--config-file=/etc/nvidia/time-slicing-config/any"}
]'

kubectl rollout restart daemonset/nvidia-device-plugin-daemonset -n kube-system
kubectl rollout status daemonset/nvidia-device-plugin-daemonset -n kube-system --timeout=120s

One operational detail matters here: the device plugin does not watch ConfigMap updates automatically. Editing the ConfigMap alone is not enough. A DaemonSet restart is required before a new Time-Slicing configuration takes effect.
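One way to make that restart less manual is to fingerprint the config and store the digest as a Pod-template annotation, so the DaemonSet rolls only when the content really changes. This is a sketch of a common convention rather than anything the plugin itself provides: config_checksum is a hypothetical helper, and the annotation key checksum/config is an arbitrary choice.

```shell
# Sketch: fingerprint the rendered config so that a change forces a rollout.
# Reads the config on stdin and prints its SHA-256 digest.
config_checksum() {
  sha256sum | cut -d' ' -f1
}

# Hypothetical usage against the cluster (not executed here):
#   sum=$(kubectl get configmap time-slicing-config -n kube-system \
#           -o jsonpath='{.data}' | config_checksum)
#   kubectl patch daemonset nvidia-device-plugin-daemonset -n kube-system --type=merge \
#     -p "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"checksum/config\":\"$sum\"}}}}}"
```

Unlike kubectl rollout restart, re-running this with an unchanged ConfigMap produces the same annotation value and therefore no new rollout.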

Step 3: Verify The First Expansion

The restart finished in about 6 seconds. After that, node capacity showed the first slot expansion:

kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU-CAP:.status.capacity.nvidia\.com/gpu,GPU-ALLOC:.status.allocatable.nvidia\.com/gpu"
# L20 node 2 2
# T4 node 2 2

The plugin logs also confirmed that the Time-Slicing configuration had loaded:

kubectl logs -n kube-system <device-plugin-pod> --tail=5
# "timeSlicing": {"failRequestsGreaterThanOne": true, "resources": [{"replicas": 2}]}
# Registered device plugin for 'nvidia.com/gpu' with Kubelet

The workload side told the more useful story. The blocked object was not just one Pod. It was a rolling update that had been stalled for 28 days. The Deployment had already created a new ReplicaSet, but the first replacement Pod could not schedule because no GPU was free. Because a RollingUpdate removes old Pods only as new ones become ready, within its maxUnavailable budget, the entire release froze in the middle. Once Time-Slicing expanded capacity, scheduling completed within 43 seconds and the Deployment resumed immediately:
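The freeze follows from how the default rolling-update bounds resolve on a small Deployment: maxUnavailable of 25% rounds down and maxSurge of 25% rounds up. The helper below is a sketch of that arithmetic only; rolling_update_bounds is a hypothetical name, and the replica count of 2 is an assumed value because the actual count is not recorded here.

```shell
# Default RollingUpdate bounds: maxUnavailable=25% rounds down, maxSurge=25% rounds up.
# $1 = desired replicas. Prints "maxUnavailable maxSurge".
rolling_update_bounds() {
  unavailable=$(( $1 * 25 / 100 ))
  surge=$(( ($1 * 25 + 99) / 100 ))
  echo "$unavailable $surge"
}

rolling_update_bounds 2   # prints "0 1"
```

With maxUnavailable at 0, no old Pod may be removed until a surge Pod is Ready, and the surge Pod itself needs a free GPU. On a fully occupied cluster that is a deadlock until capacity appears.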

kubectl describe pod <pending-pod> -n <ns> | grep -A 3 "Events:"
# Warning FailedScheduling (x1303 over 4d12h) 0/8 nodes are available: 8 Insufficient nvidia.com/gpu.
# Normal Scheduled 43s Successfully assigned <pod> to <gpu-node>

Step 4: Tune Replica Counts By GPU Model

After the mechanism worked, the next step was to size replicas from actual memory usage rather than intuition. A privileged Pod was used to measure the live memory footprint on both GPU nodes:

GPU	Total Memory	Measured Per-Process Usage	Final Replicas	Theoretical Headroom Per Slot
NVIDIA L20	46068 MiB	4621 MiB (about 10%)	8	about 1137 MiB
Tesla T4	15360 MiB	4401 MiB (about 29%)	3	about 719 MiB

The L20 made the underutilization obvious. With replicas=2, each slot effectively had about 22 GB available while the measured process used only about 4.6 GB. That was far too conservative. Raising the L20 to 8 slots pushed theoretical memory utilization much closer to a useful level.
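The sizing arithmetic is simple enough to script, which helps when the measured footprint changes. The functions below are a sketch with hypothetical names; they assume every process has roughly the measured footprint and that memory is the only constraint, which Time-Slicing itself does not enforce.

```shell
# Sizing sketch, all values in MiB, integer arithmetic only.
# max_replicas TOTAL PER_PROCESS -> most processes that fit in memory
max_replicas() { echo $(( $1 / $2 )); }

# headroom_per_slot TOTAL PER_PROCESS REPLICAS -> spare MiB per slot in an even split
headroom_per_slot() { echo $(( $1 / $3 - $2 )); }

# utilization_pct TOTAL PER_PROCESS REPLICAS -> rounded percent of memory in use
utilization_pct() { echo $(( ($2 * $3 * 100 + $1 / 2) / $1 )); }

max_replicas 46068 4621          # prints 9, the hard ceiling for the L20
headroom_per_slot 46068 4621 8   # prints 1137
utilization_pct 46068 4621 8     # prints 80
```

Choosing 8 rather than the ceiling of 9 leaves about 1.1 GiB of slack per slot for allocation spikes, since nothing stops one process from growing into its neighbours' share.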

This stage also exposed a version-specific limitation. The v0.14.5 plugin cannot point --config-file at a directory for dynamic per-node selection. On this version, doing so crashes the Pod:

# Pod CrashLoopBackOff, log output:
# E unable to load config: unable to finalize config: unable to parse config file:
# read error: read /etc/nvidia/time-slicing-config: is a directory

That selection mechanism depends on the config-manager sidecar used by GPU Operator deployments. The bare device plugin does not have it. In practice, that forced the two-DaemonSet layout: one bound to T4 nodes and one bound to L20 nodes, each with an explicit config file target.

The final GPU slot layout looked like this:

kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU-CAP:.status.capacity.nvidia\.com/gpu,GPU-ALLOC:.status.allocatable.nvidia\.com/gpu"
# L20 node 8 8
# T4 node 3 3
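To turn that listing into a single cluster-wide figure, the allocatable column can be summed with awk. This is a sketch: sum_gpu_slots is a hypothetical helper, and the node names in the sample input are placeholders shaped like the real output.

```shell
# Sums the GPU-ALLOC column (third field) of the custom-columns output on stdin.
# Skips the header row and nodes that report <none>.
sum_gpu_slots() {
  awk 'NR > 1 && $3 ~ /^[0-9]+$/ { total += $3 } END { print total + 0 }'
}

# Sample input shaped like the kubectl output; node names are placeholders.
sum_gpu_slots <<'EOF'
NAME       GPU-CAP   GPU-ALLOC
cpu-node   <none>    <none>
l20-node   8         8
t4-node    3         3
EOF
# prints 11
```

Against a live cluster, the kubectl get nodes command shown above can be piped straight into sum_gpu_slots.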

Outstanding Issues

One risk remains around ownership. The original DaemonSet is TKE-managed. During cluster upgrades or node-group operations, the control plane may reconcile that DaemonSet back to its original form and wipe out manual patches. The current mitigation is documentation and repeatability. The cleaner long-term answer is to move to TKE's native GPU sharing feature or deploy GPU Operator and stop patching the managed object directly.
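Until ownership is sorted out, a cheap drift check can at least detect a silent reconcile. The sketch below only inspects text on stdin; has_config_file_arg is a hypothetical helper, and wiring it into a cron job or CI step is left as an assumption.

```shell
# Drift check sketch: succeeds only if the args on stdin still carry --config-file.
has_config_file_arg() {
  grep -q -e '--config-file='
}

# Hypothetical usage (not executed here):
#   kubectl get daemonset nvidia-device-plugin-daemonset -n kube-system \
#     -o jsonpath='{.spec.template.spec.containers[0].args}' | has_config_file_arg \
#     || echo "device plugin DaemonSet lost its Time-Slicing patch"
```

A failed check means the node is back to advertising one GPU per card, so it is worth alerting on before the next scale-up rather than after.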

Observability is also still weak. The cluster already runs nvidia-gpu-exporter, but in Time-Slicing mode the metrics still aggregate at the physical GPU level. Per-Pod memory and compute attribution remains limited. That is one reason MIG is still the better long-term target when the hardware supports it.

There is also a side effect on privileged development pods that bypass the device plugin entirely by not requesting nvidia.com/gpu. During this rollout, the containerd runc runtime was configured to use nvidia-container-runtime as its binary, which is a common step when setting up GPU sharing. The containerd flag privileged_without_host_devices defaults to false in that configuration. With the nvidia runtime active and that flag false, privileged containers that have no device plugin allocation are blocked from /dev/nvidiactl by an eBPF cgroup program, even though the device files are visible in ls /dev. The result is Failed to initialize NVML: Unknown Error. Setting privileged_without_host_devices = true in containerd config and restarting containerd resolves it. Any cluster that runs time-slicing and also operates privileged dev pods outside the device plugin should check this flag.

Final State

Item	Before	After
Total GPU slots	2 full physical GPUs	11 slots (L20 x 8 + T4 x 3)
L20 utilization by memory	about 10% (1 process on 46 GB)	about 80% at theoretical full slot usage
T4 utilization by memory	about 29% (1 process on 15 GB)	about 86% at theoretical full slot usage
Pending Pods	1 Pod stuck for 28 days	0
Blocked Rolling Update	Frozen for 28 days	Completed, new version fully ready
DaemonSet count	1 generic DaemonSet	2 model-specific DaemonSets
Memory and fault isolation	None	Still none under Time-Slicing
Container-level GPU metrics	None	Still limited, pending MIG-capable hardware
