Kubernetes GPU Sharing
GPU sharing in Kubernetes depends on what the NVIDIA device plugin advertises to the scheduler, what isolation the underlying mechanism really provides, and what the installed hardware can support. This note uses a real production scheduling stall to walk through the GPU inventory, the practical differences between Time-Slicing and MIG, the constraints imposed by the current cluster hardware, and the rollout that expanded schedulable GPU slots from 2 to 11.
GPUs are expensive and scarce. Kubernetes treats them as exclusive resources by default, so one Pod usually occupies one whole physical GPU even when the workload uses only a small fraction of the card. As concurrent GPU-backed services grew, especially for LLM inference, NLP, and PII detection, that default model turned into both wasted capacity and a concrete scheduling problem.
The production cluster had 8 nodes, 2 of them with GPUs. An NLP inference service was scaled to multiple replicas, each requesting nvidia.com/gpu: 1. Both physical GPUs were already occupied, so new replicas sat in Pending for more than 28 days with repeated events such as 0/8 nodes are available: 8 Insufficient nvidia.com/gpu. That failure forced a deeper evaluation of GPU sharing options.
NVIDIA exposes two mainstream sharing paths for Kubernetes: Time-Slicing and MIG. They solve different problems. They also depend on very different hardware assumptions, which means the hardware survey has to come first.
All 8 nodes in the cluster were Ready and running on Tencent Cloud TKE. GPU nodes were identified through the node label nvidia-device-enable=enable, then a privileged Pod entered the host namespace and ran nvidia-smi -q to confirm the exact models:
| Instance Type | GPU Model | Architecture | Memory | Compute Capability | Driver | MIG Support |
|---|---|---|---|---|---|---|
| PNV5b.8XLARGE96 | NVIDIA L20 | Ada Lovelace | 46068 MiB | 8.9 | 570.158.01 | No |
| GN7.2XLARGE32 | Tesla T4 | Turing | 15360 MiB | 7.5 | 570.158.01 | No |
Both nodes were on the same driver release and neither card supported MIG. That distinction matters. The L20 has a higher Compute Capability than Ampere, but MIG support does not track Compute Capability monotonically. The Ada Lovelace L-series does not support MIG, while Ampere parts such as A100 and A30, and Hopper parts such as H100 and H20, do.
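Rather than inferring MIG support from the architecture name or Compute Capability, the driver can be asked directly. The sketch below classifies one line of output from nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader. It assumes the driver prints Enabled or Disabled on MIG-capable cards and a placeholder such as [N/A] on everything else; classify_mig_mode is a hypothetical helper name, not something from the original rollout.

```shell
#!/usr/bin/env bash
# Classify a single mig.mode.current value reported by nvidia-smi.
# Assumption: MIG-capable GPUs report Enabled or Disabled, and any other
# value (for example [N/A]) means the card has no MIG support.
classify_mig_mode() {
  case "$1" in
    Enabled|Disabled) echo "mig-capable" ;;
    *)                echo "no-mig" ;;
  esac
}

# Example wiring on a GPU host (not executed here):
#   nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader |
#     while read -r mode; do classify_mig_mode "$mode"; done
```

On the L20 and T4 nodes described above, this check would be expected to land in the no-mig branch.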
The cluster was running the TKE-provided nvidia-device-plugin:v0.14.5 as a DaemonSet with the startup arguments --mig-strategy=single --fail-on-init-error=false --pass-device-specs=true. There was no --config-file flag and no Time-Slicing ConfigMap. Each GPU node registered exactly one nvidia.com/gpu resource with Kubernetes, so the cluster had 2 schedulable GPU slots in total, both already consumed. The DaemonSet metadata included meta.helm.sh/release-name: nvidia-gpu, which showed that it had originally been installed through Helm even though Helm CLI was not present in the current environment.
Time-Slicing is an oversubscription feature implemented by the NVIDIA k8s-device-plugin. An administrator defines a replica count for each GPU resource in a ConfigMap, and the device plugin advertises that GPU to Kubernetes as multiple schedulable resources. Under the hood, workloads share the same physical GPU and CUDA time-slices execution across processes.
Kubernetes itself does not understand the semantics of GPU sharing. It only sees whatever extended resources the plugin exposes, such as nvidia.com/gpu or nvidia.com/gpu.shared. From the Pod's point of view, the declaration pattern does not change: GPU resources belong in resources.limits, and if requests is also present, it must match the limit value.
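As a minimal sketch of that declaration pattern, the helper below prints a Pod manifest that places the GPU resource under resources.limits only. The function name gpu_pod_manifest, the Pod name, and the image argument are illustrative placeholders, not values from the cluster described here.

```shell
#!/usr/bin/env bash
# Print a Pod manifest that requests one unit of an extended GPU resource.
# Arguments: <pod-name> <image> [resource-name, default nvidia.com/gpu]
gpu_pod_manifest() {
  local name="$1" image="$2" resource="${3:-nvidia.com/gpu}"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: ${image}
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        ${resource}: 1
EOF
}

# Example wiring (not executed here):
#   gpu_pod_manifest gpu-smoke-test <cuda-image> | kubectl apply -f -
```

Because requests is omitted, Kubernetes defaults it to the limit, which satisfies the rule that the two must match for extended resources.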
The tradeoff is blunt. Time-Slicing does not isolate memory, so all replicas on the same GPU share one physical memory pool. It also does not guarantee a proportional share of compute. Asking for multiple shared GPUs does not mean the workload gets a linear share of throughput. NVIDIA recommends failRequestsGreaterThanOne: true so that a request larger than 1 fails with UnexpectedAdmissionError instead of creating the false impression of an exclusive quota.
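With failRequestsGreaterThanOne: true in place, it can help to catch oversized requests before they reach the cluster. The following is a rough lint sketch, not part of the original rollout: it scans a manifest on stdin for nvidia.com/gpu quantities above 1. The helper name gpu_requests_over_one is hypothetical, and the line-based matching assumes the key and value sit on one line as in ordinary block-style YAML.

```shell
#!/usr/bin/env bash
# Report nvidia.com/gpu quantities greater than 1 in a manifest read from stdin.
# Exit status is 0 when at least one oversized request is found, 1 otherwise.
gpu_requests_over_one() {
  awk '$1 == "nvidia.com/gpu:" {
         v = $2
         gsub(/[^0-9]/, "", v)          # strip quotes around the quantity
         if (v + 0 > 1) {
           print "line " NR ": nvidia.com/gpu: " v
           found = 1
         }
       }
       END { exit (found ? 0 : 1) }'
}

# Example wiring (not executed here):
#   gpu_requests_over_one < deployment.yaml && echo "would hit UnexpectedAdmissionError"
```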
The upside is operational simplicity. On existing non-MIG hardware, it usually takes only a ConfigMap plus a device-plugin restart. The downside is weak isolation, a shared fault domain, and limited observability. In Time-Slicing mode, DCGM-Exporter cannot reliably attribute metrics to individual containers and mostly reports at the physical GPU level.
MIG, or Multi-Instance GPU, is NVIDIA's hardware partitioning model introduced on Ampere-class GPUs and later architectures that support it. A physical GPU is divided into GPU Instances, each with its own SM slices, memory partition, cache and bandwidth share, DMA engines, and hardware fault boundary. That is the isolation Time-Slicing cannot provide.
How MIG appears in Kubernetes depends on the configured strategy. Under single, the resource name stays as nvidia.com/gpu, but each advertised unit maps to one same-profile MIG instance. Under mixed, each MIG profile is exposed as its own resource type such as nvidia.com/mig-1g.12gb. That model is far cleaner for isolation, but it depends on MIG-capable hardware. Changing MIG profiles also requires node-level maintenance. On Hopper, GPU reset support makes this less disruptive than it was on Ampere, but it is still not a zero-touch change.
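The mapping from strategy to resource name can be summarized in a few lines. This is only an illustration of the naming rule described above; mig_resource_name is a hypothetical helper, and the none branch reflects the plugin exposing whole GPUs only.

```shell
#!/usr/bin/env bash
# Print the extended resource name a Pod would request under a MIG strategy.
# Arguments: <strategy: none|single|mixed> [profile, e.g. 1g.12gb]
mig_resource_name() {
  local strategy="$1" profile="${2:-}"
  case "$strategy" in
    none|single) echo "nvidia.com/gpu" ;;
    mixed)       echo "nvidia.com/mig-${profile}" ;;
    *)           echo "unknown strategy: ${strategy}" >&2; return 1 ;;
  esac
}
```

For example, mig_resource_name mixed 1g.12gb prints nvidia.com/mig-1g.12gb, while mig_resource_name single prints nvidia.com/gpu, which is why single needs no workload changes and mixed does.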
| Dimension | Time-Slicing | MIG |
|---|---|---|
| Memory isolation | None, all replicas share physical memory | Hardware-level, per-instance |
| Fault domain | Shared within one physical GPU | Isolated at the instance level |
| Kubernetes resource shape | nvidia.com/gpu or nvidia.com/gpu.shared | single uses nvidia.com/gpu; mixed uses nvidia.com/mig-* |
| Workload changes | Often none when renameByDefault=false | mixed requires Pods to request explicit MIG resources |
| Hardware support | Broad, works on existing full-GPU resources | Only on MIG-capable GPU models |
| Metric attribution | Mostly physical-GPU level | Can be modeled around MIG resources |
| Operational complexity | Low, usually just ConfigMap plus plugin restart | Moderate, requires lifecycle management for GPU Instances |
| Composition | Can be applied to full GPUs and to mixed MIG resources | Can serve as the lower-level partitioning layer before Time-Slicing |
The H20 is a Hopper GH100 part with Compute Capability 9.0 and 96 GB of HBM3. NVIDIA's MIG documentation lists it as supporting up to 7 MIG instances. That makes it a relevant medium-term target even though it was not present in the current cluster.
Typical H20 partition shapes look like this:
| Profile | SM Share | Memory | Instances Per Card | Typical Use |
|---|---|---|---|---|
| 1g.12gb | 1/7 | 12GB | 7 | Inference for models up to roughly 7B |
| 2g.24gb | 2/7 | 24GB | 3 | Mid-size models around 13B |
| 3g.47gb | 3/7 | 47GB | 2 | Models around 30B |
| 4g.47gb | 4/7 | 47GB | 1 | Single larger model instance |
| 7g.94gb | 7/7 | 94GB | 1 | Full-card style allocation for 70B+ |
Time-Slicing and MIG can be combined. A common pattern is to partition the H20 with MIG first, then oversubscribe a specific MIG resource with Time-Slicing. In that case the ConfigMap needs migStrategy: mixed and the resource name must target the MIG profile directly:
```yaml
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/mig-1g.12gb
      replicas: 2
```
The ConfigMap can hold multiple keys, with each key representing one node configuration. any acts as the fallback. Other keys are selected through node labels:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: kube-system
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: true
        resources:
        - name: nvidia.com/gpu
          replicas: 2
  l20: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: true
        resources:
        - name: nvidia.com/gpu
          replicas: 8
  t4: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: true
        resources:
        - name: nvidia.com/gpu
          replicas: 3
```
With renameByDefault: false, the resource name stays as nvidia.com/gpu. Where GPU Feature Discovery is running against the same configuration, the product label it publishes picks up a -SHARED suffix, for example nvidia.com/gpu.product=Tesla-T4-SHARED, which makes it possible to distinguish shared and non-shared nodes with selectors. Replica counts were chosen from measured per-process memory usage, described later in the rollout record.
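A small filter makes the shared versus non-shared distinction concrete. The sketch below reads NAME and PRODUCT columns on stdin and prints the nodes whose product label ends in -SHARED. The helper name list_shared_nodes is hypothetical, and the example assumes the nvidia.com/gpu.product label is actually present on the GPU nodes.

```shell
#!/usr/bin/env bash
# Print node names whose GPU product label carries the -SHARED suffix.
# Input: whitespace-separated "NAME PRODUCT" lines on stdin.
list_shared_nodes() {
  awk '$2 ~ /-SHARED$/ { print $1 }'
}

# Example wiring (not executed here):
#   kubectl get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,PRODUCT:.metadata.labels.nvidia\.com/gpu\.product' \
#     | list_shared_nodes
```

The same label value can be used directly in a Pod nodeSelector when a workload must land on a shared, or a non-shared, card.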
Version v0.14.5 does not support per-node dynamic config selection in the way newer operator-managed deployments do. The practical solution was to run two DaemonSets, each with its own --config-file and its own node selector:
```shell
# Label nodes by GPU model
kubectl label node <l20-node> nvidia.com/device-plugin.config=l20
kubectl label node <t4-node> nvidia.com/device-plugin.config=t4

# Patch the existing DaemonSet so it runs only on T4 nodes and uses the t4 config
kubectl patch daemonset nvidia-device-plugin-daemonset -n kube-system --type=json -p='[
  {"op":"replace","path":"/spec/template/spec/containers/0/args/3",
   "value":"--config-file=/etc/nvidia/time-slicing-config/t4"},
  {"op":"add","path":"/spec/template/spec/nodeSelector/nvidia.com~1device-plugin.config",
   "value":"t4"}
]'
kubectl rollout restart daemonset/nvidia-device-plugin-daemonset -n kube-system
```
A dedicated L20 DaemonSet then reused the same ConfigMap but pointed at the l20 key:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset-l20
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds-l20
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds-l20
    spec:
      nodeSelector:
        nvidia-device-enable: enable
        nvidia.com/device-plugin.config: l20
      tolerations:
      - operator: Exists
      priorityClassName: system-node-critical
      containers:
      - name: nvidia-device-plugin-ctr
        image: sgccr.ccs.tencentyun.com/tkeimages/nvidia-device-plugin:v0.14.5
        command: [nvidia-device-plugin]
        args:
        - --fail-on-init-error=false
        - --mig-strategy=single
        - --pass-device-specs=true
        - --config-file=/etc/nvidia/time-slicing-config/l20
        env:
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: utility,compute
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
        securityContext:
          capabilities:
            drop: [ALL]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: time-slicing-config
          mountPath: /etc/nvidia/time-slicing-config
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: time-slicing-config
        configMap:
          name: time-slicing-config
```
On Hopper hardware, MIG reconfiguration is less painful than it was on earlier generations because GPU reset support is better. The operational sequence is still node maintenance first, GPU reconfiguration second, scheduler re-entry last:
```shell
kubectl drain <h20-node> --ignore-daemonsets --delete-emptydir-data

# SSH into the H20 node
sudo nvidia-smi -mig 1
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
nvidia-smi -L

kubectl uncordon <h20-node>
```
From the Kubernetes side, the three MIG strategies remain single, mixed, and none. single keeps the traditional nvidia.com/gpu resource shape when all instances on a node share one profile. mixed exposes explicit nvidia.com/mig-* resources and requires workloads to request them directly. For a new MIG deployment, Helm is the cleaner path:
```shell
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --version=0.17.1 \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set migStrategy=single \
  --set gfd.enabled=true
```
On the existing cluster, the Time-Slicing rollout started with a baseline check of the plugin arguments, any existing ConfigMap, and the advertised GPU slots:
```shell
# Check the current DaemonSet arguments
kubectl get daemonset nvidia-device-plugin-daemonset -n kube-system -o yaml | grep -A 15 "containers:"

# Check whether any Time-Slicing ConfigMap already exists
kubectl get configmap -n kube-system | grep nvidia

# Check current GPU slot counts
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu"
```
At baseline there was no ConfigMap, no --config-file, and only one schedulable GPU slot per GPU node.
The first pass used the any key with replicas=2. The goal at that stage was not model-specific tuning. It was to verify that the plugin picked up Time-Slicing at all.
```shell
kubectl apply -f time-slicing-config.yaml

kubectl patch daemonset nvidia-device-plugin-daemonset -n kube-system --type=json -p='[
  {"op":"add","path":"/spec/template/spec/volumes/-",
   "value":{"name":"time-slicing-config","configMap":{"name":"time-slicing-config"}}},
  {"op":"add","path":"/spec/template/spec/containers/0/volumeMounts/-",
   "value":{"name":"time-slicing-config","mountPath":"/etc/nvidia/time-slicing-config"}},
  {"op":"add","path":"/spec/template/spec/containers/0/args/-",
   "value":"--config-file=/etc/nvidia/time-slicing-config/any"}
]'

kubectl rollout restart daemonset/nvidia-device-plugin-daemonset -n kube-system
kubectl rollout status daemonset/nvidia-device-plugin-daemonset -n kube-system --timeout=120s
```
One operational detail matters here: the device plugin does not watch ConfigMap updates automatically. Editing the ConfigMap alone is not enough. A DaemonSet restart is required before a new Time-Slicing configuration takes effect.
The restart finished in about 6 seconds. After that, node capacity showed the first slot expansion:
```shell
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU-CAP:.status.capacity.nvidia\.com/gpu,GPU-ALLOC:.status.allocatable.nvidia\.com/gpu"
# L20 node    2    2
# T4 node     2    2
```
The plugin logs also confirmed that the Time-Slicing configuration had loaded:
```shell
kubectl logs -n kube-system <device-plugin-pod> --tail=5
# "timeSlicing": {"failRequestsGreaterThanOne": true, "resources": [{"replicas": 2}]}
# Registered device plugin for 'nvidia.com/gpu' with Kubelet
```
The workload side told the more useful story. The blocked object was not just one Pod. It was a Rolling Update that had been stalled for 28 days. The Deployment had already created a new ReplicaSet, but the first replacement Pod could not schedule because no GPU was free. Since the default Deployment strategy waits for the new ReplicaSet to become ready before shrinking the old one, the entire release froze in the middle. Once Time-Slicing expanded capacity, scheduling completed within 43 seconds and the Deployment resumed immediately:
```shell
kubectl describe pod <pending-pod> -n <ns> | grep -A 3 "Events:"
# Warning  FailedScheduling  (x1303 over 4d12h)  0/8 nodes are available: 8 Insufficient nvidia.com/gpu.
# Normal   Scheduled         43s                 Successfully assigned <pod> to <gpu-node>
```
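The freeze follows from how a percentage-based rolling update budget is rounded: Kubernetes rounds maxSurge up and maxUnavailable down. The sketch below works that arithmetic for the default 25 percent settings. The replica count and strategy of the affected Deployment are not recorded here, so the inputs are assumptions, and rolling_budget is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Compute the effective rolling-update budget for a Deployment.
# Arguments: <replicas> [percent, default 25 for both settings]
# maxSurge is rounded up, maxUnavailable is rounded down.
rolling_budget() {
  local replicas="$1" pct="${2:-25}"
  local surge=$(( (replicas * pct + 99) / 100 ))
  local unavailable=$(( replicas * pct / 100 ))
  echo "maxSurge=${surge} maxUnavailable=${unavailable}"
}

rolling_budget 2   # prints: maxSurge=1 maxUnavailable=0
```

With maxUnavailable at 0, no old Pod may be removed until a new one is Ready, and the new Pod needs a GPU slot that the old Pods are still holding. Any Deployment with three or fewer replicas ends up in this shape under the defaults.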
After the mechanism worked, the next step was to size replicas from actual memory usage rather than intuition. A privileged Pod was used to measure the live memory footprint on both GPU nodes:
| GPU | Total Memory | Measured Per-Process Usage | Final Replicas | Theoretical Headroom Per Slot |
|---|---|---|---|---|
| NVIDIA L20 | 46068 MiB | 4621 MiB (about 10%) | 8 | about 1137 MiB |
| Tesla T4 | 15360 MiB | 4401 MiB (about 29%) | 3 | about 719 MiB |
The L20 made the underutilization obvious. With replicas=2, each slot effectively had about 22 GB available while the measured process used only about 4.6 GB. That was far too conservative. Raising the L20 to 8 slots pushed theoretical memory utilization much closer to a useful level.
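The sizing arithmetic is simple enough to script. The sketch below derives a replica count from total memory, measured per-process usage, and a minimum headroom per slot, and also computes the headroom for a chosen replica count. The 700 MiB floor used in the example is an assumed figure for illustration, not one stated in the rollout, and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Largest replica count that still leaves min_headroom MiB free per slot.
# Arguments: <total MiB> <per-process MiB> <min headroom MiB>
max_replicas() {
  local total_mib="$1" per_process_mib="$2" min_headroom_mib="$3"
  echo $(( total_mib / (per_process_mib + min_headroom_mib) ))
}

# Theoretical headroom per slot (integer MiB) for a chosen replica count.
# Arguments: <total MiB> <replicas> <per-process MiB>
slot_headroom() {
  local total_mib="$1" replicas="$2" per_process_mib="$3"
  echo $(( total_mib / replicas - per_process_mib ))
}

max_replicas 46068 4621 700    # L20 figures with an assumed 700 MiB floor
max_replicas 15360 4401 700    # T4 figures with the same assumed floor
slot_headroom 46068 8 4621     # L20 headroom at 8 replicas
```

With that assumed floor, the calculation yields 8 for the L20 figures and 3 for the T4 figures, which matches the counts chosen in the rollout. Because Time-Slicing does not enforce memory limits, these numbers are a planning aid only and nothing stops a process from exceeding its theoretical slot.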
This stage also exposed a version-specific limitation. The v0.14.5 plugin cannot point --config-file at a directory for dynamic per-node selection. On this version, doing so crashes the Pod:
```shell
# Pod CrashLoopBackOff, log output:
# E unable to load config: unable to finalize config: unable to parse config file:
#   read error: read /etc/nvidia/time-slicing-config: is a directory
```
That selection mechanism depends on the config-manager sidecar used by GPU Operator deployments. The bare device plugin does not have it. In practice, that forced the two-DaemonSet layout: one bound to T4 nodes and one bound to L20 nodes, each with an explicit config file target.
The final GPU slot layout looked like this:
```shell
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU-CAP:.status.capacity.nvidia\.com/gpu,GPU-ALLOC:.status.allocatable.nvidia\.com/gpu"
# L20 node    8    8
# T4 node     3    3
```
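To confirm the cluster-wide total without reading the table by eye, the per-node column can be summed. This is a small sketch with a hypothetical helper name, sum_gpu_slots; it assumes two whitespace-separated columns and treats non-numeric values such as the placeholder kubectl prints for nodes without GPUs as zero.

```shell
#!/usr/bin/env bash
# Sum the GPU column from "NAME GPU" lines read on stdin.
# Non-numeric values (for example <none> on CPU-only nodes) are ignored.
sum_gpu_slots() {
  awk '$2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

# Example wiring (not executed here):
#   kubectl get nodes --no-headers \
#     -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" \
#     | sum_gpu_slots
```

For the layout above, 8 slots on the L20 node plus 3 on the T4 node gives 11.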
One risk remains around ownership. The original DaemonSet is TKE-managed. During cluster upgrades or node-group operations, the control plane may reconcile that DaemonSet back to its original form and wipe out manual patches. The current mitigation is documentation and repeatability. The cleaner long-term answer is to move to TKE's native GPU sharing feature or deploy GPU Operator and stop patching the managed object directly.
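Until the managed object is replaced, a periodic drift check can at least surface a silent revert. The sketch below extracts the --config-file argument from a DaemonSet's container args and fails when it is missing. The helper name config_file_arg is hypothetical, and the check is only one possible signal; it does not cover the nodeSelector or the volume mounts.

```shell
#!/usr/bin/env bash
# Print the --config-file argument found in the args read from stdin.
# Exit status is non-zero when the flag is absent, which suggests the
# DaemonSet was reconciled back to its unpatched form.
config_file_arg() {
  grep -o -- '--config-file=[^",]*'
}

# Example wiring (not executed here):
#   kubectl get daemonset nvidia-device-plugin-daemonset -n kube-system \
#     -o jsonpath='{.spec.template.spec.containers[0].args}' \
#     | config_file_arg || echo "device plugin patch has been reverted"
```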
Observability is also still weak. The cluster already runs nvidia-gpu-exporter, but in Time-Slicing mode the metrics still aggregate at the physical GPU level. Per-Pod memory and compute attribution remains limited. That is one reason MIG is still the better long-term target when the hardware supports it.
There is also a side effect on privileged development pods that bypass the device plugin entirely by not requesting nvidia.com/gpu. During this rollout, the containerd runc runtime was configured to use nvidia-container-runtime as its binary, which is a common step when setting up GPU sharing. The containerd flag privileged_without_host_devices defaults to false in that configuration. With the nvidia runtime active and that flag false, privileged containers that have no device plugin allocation are blocked from /dev/nvidiactl by an eBPF cgroup program, even though the device files are visible in ls /dev. The result is Failed to initialize NVML: Unknown Error. Setting privileged_without_host_devices = true in containerd config and restarting containerd resolves it. Any cluster that runs time-slicing and also operates privileged dev pods outside the device plugin should check this flag.
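A quick way to audit that flag on a node is to scan the containerd configuration for it. The sketch below is an assumption-laden check: it reads a config on stdin and succeeds only when privileged_without_host_devices is set to true on some line. It does not verify which runtime section the setting belongs to, the helper name is hypothetical, and /etc/containerd/config.toml is only the common default path.

```shell
#!/usr/bin/env bash
# Succeed if the containerd config on stdin sets
# privileged_without_host_devices = true on any line.
privileged_without_host_devices_enabled() {
  grep -Eq '^[[:space:]]*privileged_without_host_devices[[:space:]]*=[[:space:]]*true([[:space:]]|$)'
}

# Example wiring on a node (not executed here):
#   privileged_without_host_devices_enabled < /etc/containerd/config.toml \
#     || echo "flag is unset or false"
```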
| Item | Before | After |
|---|---|---|
| Total GPU slots | 2 full physical GPUs | 11 slots (L20 x 8 + T4 x 3) |
| L20 utilization by memory | about 10% (1 process on 46 GB) | about 80% at theoretical full slot usage |
| T4 utilization by memory | about 29% (1 process on 15 GB) | about 86% at theoretical full slot usage |
| Pending Pods | 1 Pod stuck for 28 days | 0 |
| Blocked Rolling Update | Frozen for 28 days | Completed, new version fully ready |
| DaemonSet count | 1 generic DaemonSet | 2 model-specific DaemonSets |
| Memory and fault isolation | None | Still none under Time-Slicing |
| Container-level GPU metrics | None | Still limited, pending MIG-capable hardware |