GPU Nodes Configuration


Overview

This guide covers advanced configuration and optimization for GPU nodes in AWS EKS, including node taints, tolerations, labels, and performance tuning.

GPU Node Configuration

Node Labels

Labels help Kubernetes schedule pods on the correct nodes.

Automatic Labels

EKS automatically adds these labels to GPU nodes:

```yaml
node.kubernetes.io/instance-type: g5.xlarge
beta.kubernetes.io/instance-type: g5.xlarge
topology.kubernetes.io/zone: us-east-1a
topology.kubernetes.io/region: us-east-1
```

Custom Labels

Add custom labels when creating node groups:

cluster-config.yaml

```yaml
managedNodeGroups:
  - name: gpu-nodes
    instanceType: g5.xlarge
    labels:
      workload: gpu
      nvidia.com/gpu: "true"
      gpu-type: a10
      cost-tier: spot
```

Or add labels to existing nodes:

```bash
$ kubectl label nodes <node-name> workload=gpu
$ kubectl label nodes <node-name> gpu-type=a10
```

Node Taints

Taints prevent non-GPU workloads from running on expensive GPU nodes.

Add Taints During Node Group Creation

cluster-config.yaml

```yaml
managedNodeGroups:
  - name: gpu-nodes
    instanceType: g5.xlarge
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
```

Add Taints to Existing Nodes

```bash
$ kubectl taint nodes <node-name> nvidia.com/gpu=true:NoSchedule
```
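To remove the taint later, append `-` to the same expression:

```bash
$ kubectl taint nodes <node-name> nvidia.com/gpu=true:NoSchedule-
```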

Tolerations in Pod Specs

Pods must have matching tolerations to run on tainted nodes:

values.yaml

```yaml
lightningAsr:
  tolerations:
    # Either form matches the nvidia.com/gpu=true:NoSchedule taint;
    # only one is needed. "Exists" matches any value for the key.
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
    - key: nvidia.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
```

Node Selectors

Using Instance Type

Most common approach for AWS:

values.yaml

```yaml
lightningAsr:
  nodeSelector:
    node.kubernetes.io/instance-type: g5.xlarge
```
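To confirm which nodes the selector will match:

```bash
$ kubectl get nodes -l node.kubernetes.io/instance-type=g5.xlarge
```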

Using Custom Labels

values.yaml

```yaml
lightningAsr:
  nodeSelector:
    workload: gpu
    gpu-type: a10
```

Multiple Selectors

Combine multiple selectors for precise placement; a pod is scheduled only on nodes that carry every listed label:

values.yaml

```yaml
lightningAsr:
  nodeSelector:
    node.kubernetes.io/instance-type: g5.xlarge
    topology.kubernetes.io/zone: us-east-1a
    cost-tier: on-demand
```

NVIDIA Device Plugin

The NVIDIA device plugin makes GPUs available to Kubernetes pods.

Installation via GPU Operator

The recommended approach is using the NVIDIA GPU Operator (included in the Smallest Helm chart):

values.yaml

```yaml
gpu-operator:
  enabled: true
  driver:
    enabled: true
  toolkit:
    enabled: true
  devicePlugin:
    enabled: true
```

Manual Installation

Alternatively, install the device plugin directly:

```bash
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
```

Verify Device Plugin

```bash
$ kubectl get pods -n kube-system | grep nvidia-device-plugin
$ kubectl logs -n kube-system -l name=nvidia-device-plugin
```

Check GPU Availability

```bash
$ kubectl get nodes -o json | \
    jq -r '.items[] | select(.status.capacity."nvidia.com/gpu" != null) |
    "\(.metadata.name)\t\(.status.capacity."nvidia.com/gpu")"'
```
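An alternative that needs no `jq`, using kubectl's custom columns (the dots in the resource name are escaped with backslashes):

```bash
$ kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.capacity.nvidia\.com/gpu'
```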

GPU Resource Limits

Request GPU in Pod Spec

The Lightning ASR deployment automatically requests GPU:

```yaml
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
```

Multiple GPUs

For pods that need multiple GPUs:

```yaml
resources:
  limits:
    nvidia.com/gpu: 2
  requests:
    nvidia.com/gpu: 2
```

Smallest Self-Host Lightning ASR is optimized for single GPU per pod. Use multiple pods for scaling rather than multiple GPUs per pod.

GPU Performance Optimization

Enable GPU Persistence Mode

GPU persistence mode keeps the NVIDIA driver loaded even when no clients are attached, reducing CUDA initialization time. When the GPU Operator manages the driver it runs the persistence daemon on each node; related driver and toolkit settings can be tuned through environment variables:

```yaml
gpu-operator:
  enabled: true
  driver:
    enabled: true
    env:
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: "compute,utility"
      - name: NVIDIA_REQUIRE_CUDA
        value: "cuda>=11.8"
  toolkit:
    enabled: true
    env:
      - name: NVIDIA_MPS_ENABLED
        value: "1"
```
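To confirm persistence mode is active, run `nvidia-smi` on a GPU node (for example from a privileged pod with access to the driver):

```bash
$ nvidia-smi -q | grep "Persistence Mode"
```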

Use DaemonSet for GPU Configuration

Create a DaemonSet to configure GPU settings on all GPU nodes:

gpu-config-daemonset.yaml

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-config
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: gpu-config
  template:
    metadata:
      labels:
        name: gpu-config
    spec:
      hostPID: true
      nodeSelector:
        nvidia.com/gpu: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: gpu-config
          image: nvidia/cuda:11.8.0-base-ubuntu22.04
          command:
            - /bin/bash
            - -c
            - |
              # Enable persistence mode
              nvidia-smi -pm 1
              # Disable auto-boost for consistent clock speeds
              nvidia-smi --auto-boost-default=DISABLED
              # Lock application clocks (memory,graphics MHz -- values
              # are GPU-model specific; check supported clocks first)
              nvidia-smi -ac 1215,1410
              sleep infinity
          securityContext:
            privileged: true
          volumeMounts:
            - name: sys
              mountPath: /sys
      volumes:
        - name: sys
          hostPath:
            path: /sys
```

Apply:

```bash
$ kubectl apply -f gpu-config-daemonset.yaml
```
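To verify the DaemonSet rolled out and the clock settings applied (the `name=gpu-config` label comes from the manifest above):

```bash
$ kubectl rollout status daemonset/gpu-config -n kube-system
$ kubectl logs -n kube-system -l name=gpu-config --tail=20
```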

Monitor GPU Utilization

Deploy NVIDIA DCGM exporter for Prometheus metrics:

```bash
$ helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
$ helm repo update

$ helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
    --namespace kube-system \
    --set serviceMonitor.enabled=true
```
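To spot-check that metrics are flowing (this assumes the chart creates a `dcgm-exporter` service on its default port 9400):

```bash
$ kubectl port-forward -n kube-system svc/dcgm-exporter 9400:9400 &
$ curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```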

Multi-GPU Strategies

Strategy 1: One Pod per GPU (Recommended)

Scale horizontally with one pod per GPU:

values.yaml

```yaml
scaling:
  auto:
    enabled: true
  lightningAsr:
    hpa:
      enabled: true
      minReplicas: 1
      maxReplicas: 10

lightningAsr:
  resources:
    limits:
      nvidia.com/gpu: 1
```

Strategy 2: GPU Sharing (Time-Slicing)

Allow multiple pods to share a single GPU (reduces isolation):

```yaml
gpu-operator:
  enabled: true
  devicePlugin:
    config:
      name: time-slicing-config
      default: any
      sharing:
        timeSlicing:
          replicas: 4
```

GPU sharing reduces isolation and can impact performance. Use only if cost is more critical than performance.
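With `replicas: 4`, each physical GPU is advertised four times. A quick way to check the multiplied capacity on a node:

```bash
$ kubectl get node <node-name> -o jsonpath='{.status.capacity.nvidia\.com/gpu}'
```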

Strategy 3: Multi-Instance GPU (MIG)

For A100 and A30 GPUs, use MIG to partition a GPU into isolated instances. MIG mode must first be enabled (`nvidia-smi -mig 1`), then instances are created:

```bash
# Create seven 1g.5gb GPU instances (profile ID 19 on an A100 40GB)
$ nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
```

Configure pods to use MIG instances:

```yaml
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1
```
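To list the GPU instances created by the `mig -cgi` command above:

```bash
$ nvidia-smi mig -lgi
```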

Node Auto-Scaling

Configure Auto-Scaling Groups

When creating node groups, enable auto-scaling:

cluster-config.yaml

```yaml
managedNodeGroups:
  - name: gpu-nodes
    instanceType: g5.xlarge
    minSize: 0
    maxSize: 10
    desiredCapacity: 1
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/smallest-cluster: "owned"
```

Install Cluster Autoscaler

See Cluster Autoscaler for full setup.

Quick enable:

values.yaml

```yaml
cluster-autoscaler:
  enabled: true
  autoDiscovery:
    clusterName: smallest-cluster
  awsRegion: us-east-1
  nodeSelector:
    workload: cpu
```

Run Cluster Autoscaler on CPU nodes, not GPU nodes, to avoid wasting GPU resources.

Cost Optimization

Use Spot Instances

Save up to 70% with Spot instances:

cluster-config.yaml

```yaml
managedNodeGroups:
  - name: gpu-nodes-spot
    # Managed node groups take spot: true plus a list of candidate
    # instance types; EC2 handles Spot allocation across them.
    instanceTypes: ["g5.xlarge", "g5.2xlarge"]
    minSize: 0
    maxSize: 10
    desiredCapacity: 1
    spot: true
    labels:
      capacity-type: spot
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
```

Handle Spot Interruptions

Add pod disruption budget:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: lightning-asr-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: lightning-asr
```

Configure graceful shutdown:

values.yaml

```yaml
lightningAsr:
  terminationGracePeriodSeconds: 120
```

Mixed On-Demand and Spot

Combine both for reliability:

cluster-config.yaml

```yaml
managedNodeGroups:
  - name: gpu-nodes-ondemand
    instanceType: g5.xlarge
    minSize: 1
    maxSize: 3
    labels:
      capacity-type: on-demand

  - name: gpu-nodes-spot
    instanceType: g5.xlarge
    minSize: 0
    maxSize: 10
    spot: true
    labels:
      capacity-type: spot
```

Use node affinity to prefer Spot nodes while still allowing fallback to on-demand:

values.yaml

```yaml
lightningAsr:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: capacity-type
                operator: In
                values:
                  - spot
```

Monitoring GPU Nodes

View GPU Node Status

```bash
$ kubectl get nodes -l nvidia.com/gpu=true
```

Check GPU Allocation

```bash
$ kubectl describe nodes -l nvidia.com/gpu=true | grep -A 5 "Allocated resources"
```

GPU Utilization

Using NVIDIA SMI:

```bash
$ kubectl run nvidia-smi --rm -it --restart=Never \
    --image=nvidia/cuda:11.8.0-base-ubuntu22.04 \
    --overrides='{"spec":{"nodeSelector":{"nvidia.com/gpu":"true"},"tolerations":[{"key":"nvidia.com/gpu","operator":"Exists"}]}}' \
    -- nvidia-smi
```

Troubleshooting

GPU Not Detected

Check NVIDIA device plugin:

```bash
$ kubectl get pods -n kube-system | grep nvidia
$ kubectl logs -n kube-system -l name=nvidia-device-plugin
```

Verify driver on node:

```bash
$ kubectl debug node/<node-name> -it --image=ubuntu
# Inside the debug container, the node filesystem is mounted at /host,
# so the host's driver tools can be used directly:
chroot /host nvidia-smi
```

Pods Not Scheduling on GPU Nodes

Check tolerations:

```bash
$ kubectl describe pod <pod-name> | grep -A 5 Tolerations
```

Check node selector:

```bash
$ kubectl get pod <pod-name> -o jsonpath='{.spec.nodeSelector}'
```

Check node taints:

```bash
$ kubectl describe node <node-name> | grep Taints
```

GPU Out of Memory

Check pod resource limits:

```bash
$ kubectl describe pod <pod-name> | grep -A 5 Limits
```

Monitor GPU memory:

```bash
$ kubectl exec <pod-name> -- nvidia-smi
```

Best Practices

Always use taints and tolerations to prevent non-GPU workloads from running on GPU nodes:

```yaml
taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule
```

Always specify GPU resource requests and limits:

```yaml
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
```

Configure auto-scaling to scale GPU nodes to zero during off-hours:

```yaml
minSize: 0
maxSize: 10
```

Use DCGM exporter and Grafana to monitor GPU metrics:

  • GPU utilization
  • Memory usage
  • Temperature
  • Power consumption

Regularly test your application’s response to spot interruptions:

```bash
$ kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```
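After the drill, return the node to service:

```bash
$ kubectl uncordon <node-name>
```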

What’s Next?