GPU Nodes Configuration


Overview

This guide covers advanced configuration and optimization for GPU nodes in AWS EKS, including node taints, tolerations, labels, and performance tuning.

GPU Node Configuration

Node Labels

Labels help Kubernetes schedule pods on the correct nodes.

Automatic Labels

EKS automatically adds these labels to GPU nodes:

```yaml
node.kubernetes.io/instance-type: g5.xlarge
beta.kubernetes.io/instance-type: g5.xlarge
topology.kubernetes.io/zone: us-east-1a
topology.kubernetes.io/region: us-east-1
```

Custom Labels

Add custom labels when creating node groups:

cluster-config.yaml

```yaml
managedNodeGroups:
  - name: gpu-nodes
    instanceType: g5.xlarge
    labels:
      workload: gpu
      nvidia.com/gpu: "true"
      gpu-type: a10
      cost-tier: spot
```

Or add labels to existing nodes:

```bash
$ kubectl label nodes <node-name> workload=gpu
$ kubectl label nodes <node-name> gpu-type=a10
```

Node Taints

Taints prevent non-GPU workloads from running on expensive GPU nodes.

Add Taints During Node Group Creation

cluster-config.yaml

```yaml
managedNodeGroups:
  - name: gpu-nodes
    instanceType: g5.xlarge
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
```

Add Taints to Existing Nodes

```bash
$ kubectl taint nodes <node-name> nvidia.com/gpu=true:NoSchedule
```
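To remove the taint later, append `-` to the same expression:

```bash
$ kubectl taint nodes <node-name> nvidia.com/gpu=true:NoSchedule-
```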

Tolerations in Pod Specs

Pods must have matching tolerations to run on tainted nodes:

values.yaml

```yaml
lightningAsr:
  tolerations:
    # Either form matches the nvidia.com/gpu=true:NoSchedule taint;
    # only one is needed. "Exists" matches any value for the key.
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
    - key: nvidia.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
```

Node Selectors

Using Instance Type

Most common approach for AWS:

values.yaml

```yaml
lightningAsr:
  nodeSelector:
    node.kubernetes.io/instance-type: g5.xlarge
```
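To confirm which nodes the selector will match:

```bash
$ kubectl get nodes -l node.kubernetes.io/instance-type=g5.xlarge
```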

Using Custom Labels

values.yaml

```yaml
lightningAsr:
  nodeSelector:
    workload: gpu
    gpu-type: a10
```

Multiple Selectors

Combine multiple selectors for precise placement; a pod is scheduled only on nodes that carry every listed label:

values.yaml

```yaml
lightningAsr:
  nodeSelector:
    node.kubernetes.io/instance-type: g5.xlarge
    topology.kubernetes.io/zone: us-east-1a
    cost-tier: on-demand
```

NVIDIA Device Plugin

The NVIDIA device plugin makes GPUs available to Kubernetes pods.

Installation via GPU Operator

The recommended approach is using the NVIDIA GPU Operator (included in the Smallest Helm chart):

values.yaml

```yaml
gpu-operator:
  enabled: true
  driver:
    enabled: true
  toolkit:
    enabled: true
  devicePlugin:
    enabled: true
```

Manual Installation

Alternatively, install the device plugin directly:

```bash
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
```

Verify Device Plugin

```bash
$ kubectl get pods -n kube-system | grep nvidia-device-plugin
$ kubectl logs -n kube-system -l name=nvidia-device-plugin
```

Check GPU Availability

```bash
$ kubectl get nodes -o json | \
    jq -r '.items[] | select(.status.capacity."nvidia.com/gpu" != null) |
    "\(.metadata.name)\t\(.status.capacity."nvidia.com/gpu")"'
```
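An alternative that needs no `jq`, using kubectl's custom columns (the dots in the resource name are escaped with backslashes):

```bash
$ kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.capacity.nvidia\.com/gpu'
```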

GPU Resource Limits

Request GPU in Pod Spec

The Lightning ASR deployment automatically requests GPU:

```yaml
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
```

Multiple GPUs

For pods that need multiple GPUs:

```yaml
resources:
  limits:
    nvidia.com/gpu: 2
  requests:
    nvidia.com/gpu: 2
```

Smallest Self-Host Lightning ASR is optimized for single GPU per pod. Use multiple pods for scaling rather than multiple GPUs per pod.

GPU Performance Optimization

Enable GPU Persistence Mode

GPU persistence mode keeps the NVIDIA driver loaded even when no clients are attached, reducing CUDA initialization time. When the GPU Operator manages the driver it runs the persistence daemon on each node; related driver and toolkit settings can be tuned through environment variables:

```yaml
gpu-operator:
  enabled: true
  driver:
    enabled: true
    env:
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: "compute,utility"
      - name: NVIDIA_REQUIRE_CUDA
        value: "cuda>=11.8"
  toolkit:
    enabled: true
    env:
      - name: NVIDIA_MPS_ENABLED
        value: "1"
```
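To confirm persistence mode is active, run `nvidia-smi` on a GPU node (for example from a privileged pod with access to the driver):

```bash
$ nvidia-smi -q | grep "Persistence Mode"
```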

Use DaemonSet for GPU Configuration

Create a DaemonSet to configure GPU settings on all GPU nodes:

gpu-config-daemonset.yaml

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-config
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: gpu-config
  template:
    metadata:
      labels:
        name: gpu-config
    spec:
      hostPID: true
      nodeSelector:
        nvidia.com/gpu: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: gpu-config
          image: nvidia/cuda:11.8.0-base-ubuntu22.04
          command:
            - /bin/bash
            - -c
            - |
              # Enable persistence mode
              nvidia-smi -pm 1
              # Disable auto-boost for consistent clock speeds
              nvidia-smi --auto-boost-default=DISABLED
              # Lock application clocks (memory,graphics MHz -- values
              # are GPU-model specific; check supported clocks first)
              nvidia-smi -ac 1215,1410
              sleep infinity
          securityContext:
            privileged: true
          volumeMounts:
            - name: sys
              mountPath: /sys
      volumes:
        - name: sys
          hostPath:
            path: /sys
```

Apply:

```bash
$ kubectl apply -f gpu-config-daemonset.yaml
```
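To verify the DaemonSet rolled out and the clock settings applied (the `name=gpu-config` label comes from the manifest above):

```bash
$ kubectl rollout status daemonset/gpu-config -n kube-system
$ kubectl logs -n kube-system -l name=gpu-config --tail=20
```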

Monitor GPU Utilization

Deploy NVIDIA DCGM exporter for Prometheus metrics:

```bash
$ helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
$ helm repo update

$ helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
    --namespace kube-system \
    --set serviceMonitor.enabled=true
```
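To spot-check that metrics are flowing (this assumes the chart creates a `dcgm-exporter` service on its default port 9400):

```bash
$ kubectl port-forward -n kube-system svc/dcgm-exporter 9400:9400 &
$ curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```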

Multi-GPU Strategies

Strategy 1: One Pod per GPU (Recommended)

Scale horizontally with one pod per GPU:

values.yaml

```yaml
scaling:
  auto:
    enabled: true
  lightningAsr:
    hpa:
      enabled: true
      minReplicas: 1
      maxReplicas: 10

lightningAsr:
  resources:
    limits:
      nvidia.com/gpu: 1
```

Strategy 2: GPU Sharing (Time-Slicing)

Allow multiple pods to share a single GPU (reduces isolation):

```yaml
gpu-operator:
  enabled: true
  devicePlugin:
    config:
      name: time-slicing-config
      default: any
      sharing:
        timeSlicing:
          replicas: 4
```

GPU sharing reduces isolation and can impact performance. Use only if cost is more critical than performance.
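With `replicas: 4`, each physical GPU is advertised four times. A quick way to check the multiplied capacity on a node:

```bash
$ kubectl get node <node-name> -o jsonpath='{.status.capacity.nvidia\.com/gpu}'
```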

Strategy 3: Multi-Instance GPU (MIG)

For A100 and A30 GPUs, use MIG to partition a GPU into isolated instances. MIG mode must first be enabled (`nvidia-smi -mig 1`), then instances are created:

```bash
# Create seven 1g.5gb GPU instances (profile ID 19 on an A100 40GB)
$ nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
```

Configure pods to use MIG instances:

```yaml
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1
```
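To list the GPU instances created by the `mig -cgi` command above:

```bash
$ nvidia-smi mig -lgi
```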

Node Auto-Scaling

Configure Auto-Scaling Groups

When creating node groups, enable auto-scaling:

cluster-config.yaml

```yaml
managedNodeGroups:
  - name: gpu-nodes
    instanceType: g5.xlarge
    minSize: 0
    maxSize: 10
    desiredCapacity: 1
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/smallest-cluster: "owned"
```

Install Cluster Autoscaler

See Cluster Autoscaler for full setup.

Quick enable:

values.yaml

```yaml
cluster-autoscaler:
  enabled: true
  autoDiscovery:
    clusterName: smallest-cluster
  awsRegion: us-east-1
  nodeSelector:
    workload: cpu
```

Run Cluster Autoscaler on CPU nodes, not GPU nodes, to avoid wasting GPU resources.

Cost Optimization

Use Spot Instances

Save up to 70% with Spot instances:

cluster-config.yaml

```yaml
managedNodeGroups:
  - name: gpu-nodes-spot
    # Managed node groups take spot: true plus a list of candidate
    # instance types; EC2 handles Spot allocation across them.
    instanceTypes: ["g5.xlarge", "g5.2xlarge"]
    minSize: 0
    maxSize: 10
    desiredCapacity: 1
    spot: true
    labels:
      capacity-type: spot
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
```

Handle Spot Interruptions

Add pod disruption budget:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: lightning-asr-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: lightning-asr
```

Configure graceful shutdown:

values.yaml

```yaml
lightningAsr:
  terminationGracePeriodSeconds: 120
```

Mixed On-Demand and Spot

Combine both for reliability:

cluster-config.yaml

```yaml
managedNodeGroups:
  - name: gpu-nodes-ondemand
    instanceType: g5.xlarge
    minSize: 1
    maxSize: 3
    labels:
      capacity-type: on-demand

  - name: gpu-nodes-spot
    instanceType: g5.xlarge
    minSize: 0
    maxSize: 10
    spot: true
    labels:
      capacity-type: spot
```

Use node affinity to prefer Spot nodes while still allowing fallback to on-demand:

values.yaml

```yaml
lightningAsr:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: capacity-type
                operator: In
                values:
                  - spot
```

Monitoring GPU Nodes

View GPU Node Status

```bash
$ kubectl get nodes -l nvidia.com/gpu=true
```

Check GPU Allocation

```bash
$ kubectl describe nodes -l nvidia.com/gpu=true | grep -A 5 "Allocated resources"
```

GPU Utilization

Using NVIDIA SMI:

```bash
$ kubectl run nvidia-smi --rm -it --restart=Never \
    --image=nvidia/cuda:11.8.0-base-ubuntu22.04 \
    --overrides='{"spec":{"nodeSelector":{"nvidia.com/gpu":"true"},"tolerations":[{"key":"nvidia.com/gpu","operator":"Exists"}]}}' \
    -- nvidia-smi
```

Troubleshooting

GPU Not Detected

Check NVIDIA device plugin:

```bash
$ kubectl get pods -n kube-system | grep nvidia
$ kubectl logs -n kube-system -l name=nvidia-device-plugin
```

Verify driver on node:

```bash
$ kubectl debug node/<node-name> -it --image=ubuntu
# Inside the debug container, the node filesystem is mounted at /host,
# so the host's driver tools can be used directly:
chroot /host nvidia-smi
```

Pods Not Scheduling on GPU Nodes

Check tolerations:

```bash
$ kubectl describe pod <pod-name> | grep -A 5 Tolerations
```

Check node selector:

```bash
$ kubectl get pod <pod-name> -o jsonpath='{.spec.nodeSelector}'
```

Check node taints:

```bash
$ kubectl describe node <node-name> | grep Taints
```

GPU Out of Memory

Check pod resource limits:

```bash
$ kubectl describe pod <pod-name> | grep -A 5 Limits
```

Monitor GPU memory:

```bash
$ kubectl exec <pod-name> -- nvidia-smi
```

Best Practices

Always use taints and tolerations to prevent non-GPU workloads from running on GPU nodes:

```yaml
taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule
```

Always specify GPU resource requests and limits:

```yaml
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
```

Configure auto-scaling to scale GPU nodes to zero during off-hours:

```yaml
minSize: 0
maxSize: 10
```

Use DCGM exporter and Grafana to monitor GPU metrics:

  • GPU utilization
  • Memory usage
  • Temperature
  • Power consumption

Regularly test your application’s response to spot interruptions:

```bash
$ kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```
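After the drill, return the node to service:

```bash
$ kubectl uncordon <node-name>
```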

What’s Next?