---
title: Cluster Autoscaler
description: Automatically scale EKS cluster nodes based on pod resource requirements
---

## Overview

The Cluster Autoscaler automatically adjusts the number of nodes in your EKS cluster based on pending pods and resource utilization. Combined with the HPA, it provides end-to-end autoscaling from application load to infrastructure capacity.

## How It Works

```mermaid
graph TD
    HPA[HPA] -->|Scales Pods| Deployment[Deployment]
    Deployment -->|Creates| Pods[New Pods]
    Pods -->|Status: Pending| CA[Cluster Autoscaler]
    CA -->|Checks| ASG[Auto Scaling Group]
    CA -->|Adds Nodes| ASG
    ASG -->|Provisions| Nodes[EC2 Instances]
    Nodes -->|Registers| K8s[Kubernetes]
    K8s -->|Schedules| Pods

    style CA fill:#0D9373
    style HPA fill:#07C983
```

**Flow**:

1. HPA scales pods based on metrics
2. New pods enter the "Pending" state (insufficient resources)
3. Cluster Autoscaler detects the pending pods
4. Adds nodes to the Auto Scaling Group
5. Pods are scheduled on the new nodes
6. After the scale-down delay, underutilized nodes are removed

## Prerequisites

- An IAM role with autoscaling permissions (see [IAM & IRSA](/waves/self-host/kubernetes-setup/quick-start))
- Node groups tagged for auto-discovery:

  ```
  k8s.io/cluster-autoscaler/smallest-cluster: owned
  k8s.io/cluster-autoscaler/enabled: true
  ```

- An IRSA-enabled service account for the Cluster Autoscaler

## Installation

### Using Helm Chart

The Smallest Self-Host chart includes the Cluster Autoscaler as a dependency:

```yaml values.yaml
cluster-autoscaler:
  enabled: true
  rbac:
    serviceAccount:
      name: cluster-autoscaler
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::YOUR_ACCOUNT_ID:role/cluster-autoscaler-role
  autoDiscovery:
    clusterName: smallest-cluster
  awsRegion: us-east-1
  extraArgs:
    balance-similar-node-groups: true
    skip-nodes-with-system-pods: false
    scale-down-delay-after-add: 5m
    scale-down-unneeded-time: 10m
```

Deploy:

```bash
helm upgrade --install smallest-self-host smallest-self-host/smallest-self-host \
  -f values.yaml \
  --namespace smallest
```

### Standalone Installation

Install the Cluster Autoscaler separately:

```bash
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update

helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=smallest-cluster \
  --set awsRegion=us-east-1 \
  --set rbac.serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::ACCOUNT_ID:role/cluster-autoscaler-role
```

## Configuration

### Auto-Discovery

Auto-discover Auto Scaling Groups by cluster name:

```yaml
autoDiscovery:
  clusterName: smallest-cluster
  tags:
    - k8s.io/cluster-autoscaler/enabled
    - k8s.io/cluster-autoscaler/smallest-cluster
```

### Manual Configuration

Explicitly specify Auto Scaling Groups:

```yaml
autoscalingGroups:
  - name: eks-cpu-nodes
    minSize: 1
    maxSize: 10
  - name: eks-gpu-nodes
    minSize: 0
    maxSize: 20
```

### Scale-Down Configuration

Control when and how nodes are removed:

```yaml
extraArgs:
  scale-down-enabled: true
  scale-down-delay-after-add: 10m
  scale-down-unneeded-time: 10m
  scale-down-utilization-threshold: 0.5
  max-graceful-termination-sec: 600
```

**Parameters**:

* `scale-down-delay-after-add`: how long to wait after adding a node before considering scale-down
* `scale-down-unneeded-time`: how long a node must be underutilized before removal
* `scale-down-utilization-threshold`: CPU/memory utilization below which a node becomes a removal candidate (0.5 = 50%)
* `max-graceful-termination-sec`: maximum time allowed for pod eviction

### Node Group Priorities

Scale specific node groups first with the priority expander:

```yaml
extraArgs:
  expander: priority
expanderPriorities: |
  50:
    - .*-spot-.*
  10:
    - .*-ondemand-.*
```

Priorities:

* Higher number = higher priority
* Regex patterns match node group names
* Useful for preferring spot instances

## Verify Installation

### Check Cluster Autoscaler Pod

```bash
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-cluster-autoscaler
```

### Check Logs

```bash
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-cluster-autoscaler -f
```

Look for:

```
Starting cluster autoscaler
Auto-discovery enabled
Discovered node groups: [eks-gpu-nodes, eks-cpu-nodes]
```

### Verify IAM Permissions

```bash
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-cluster-autoscaler | grep -i "error\|permission"
```

The output should show no permission errors.

## Testing Cluster Autoscaler

### Trigger Scale-Up

Create pods that exceed cluster capacity. Note that `kubectl run` no longer accepts `--replicas` or `--requests`, so use a Deployment:

```bash
kubectl create deployment test-scale-up-1 \
  --image=nginx \
  --replicas=20 \
  --namespace=smallest

kubectl set resources deployment test-scale-up-1 \
  --requests=cpu=1,memory=1Gi \
  --namespace=smallest
```

Watch nodes:

```bash
watch -n 5 'kubectl get nodes'
```

Watch the Cluster Autoscaler:

```bash
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-cluster-autoscaler -f
```

Expected behavior:

1. Pods enter the "Pending" state
2. Cluster Autoscaler detects the pending pods
3. Logs show: "Scale-up: setting group size to X"
4. New nodes appear in `kubectl get nodes`
5. Pods transition to "Running"

### Trigger Scale-Down

Delete the test pods:

```bash
kubectl delete deployment test-scale-up-1 -n smallest
```

After `scale-down-unneeded-time` (default 10 minutes):

1. Cluster Autoscaler marks underutilized nodes
2. Drains their pods gracefully
3. Terminates the EC2 instances
4. Node count decreases

## GPU Node Scaling

### Configure GPU Node Groups

Tag GPU node groups for autoscaling:

```yaml cluster-config.yaml
managedNodeGroups:
  - name: gpu-nodes
    instanceType: g5.xlarge
    minSize: 0
    maxSize: 10
    desiredCapacity: 1
    tags:
      k8s.io/cluster-autoscaler/smallest-cluster: "owned"
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/node-template/label/workload: "gpu"
```

### Prevent Cluster Autoscaler on GPU Nodes

Schedule the Cluster Autoscaler itself on CPU nodes so it does not occupy GPU capacity:

```yaml values.yaml
cluster-autoscaler:
  nodeSelector:
    workload: cpu
  tolerations: []
```

### Scale to Zero

Allow GPU nodes to scale to zero during off-hours:

```yaml
managedNodeGroups:
  - name: gpu-nodes
    minSize: 0
    maxSize: 10
```

The Cluster Autoscaler will:

* Add GPU nodes when Lightning ASR pods are pending
* Remove GPU nodes when all GPU workloads complete

The first startup after scale-to-zero takes longer (node provisioning plus model download).
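For scale-from-zero to work reliably, the autoscaler must be able to predict what a new node will offer before any node in the group exists. On AWS this is done with `node-template` tags on the Auto Scaling Group. A sketch in the same eksctl style as above, where the GPU count and taint values are illustrative assumptions for this node group:

```yaml cluster-config.yaml
managedNodeGroups:
  - name: gpu-nodes
    minSize: 0
    maxSize: 10
    tags:
      # Advertise the resources, labels, and taints a new node will have,
      # so pending pods can be matched against an empty node group.
      k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu: "1"
      k8s.io/cluster-autoscaler/node-template/label/workload: "gpu"
      k8s.io/cluster-autoscaler/node-template/taint/nvidia.com/gpu: "present:NoSchedule"
```

Without these tags, pods requesting `nvidia.com/gpu` may never trigger a scale-up from zero, because the autoscaler cannot tell that the empty group would satisfy the request.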
## Spot Instance Integration

### Mixed Instance Groups

Use spot and on-demand instances together (`instancesDistribution` applies to self-managed eksctl `nodeGroups`):

```yaml cluster-config.yaml
nodeGroups:
  - name: gpu-nodes-mixed
    minSize: 1
    maxSize: 10
    instancesDistribution:
      onDemandBaseCapacity: 1
      onDemandPercentageAboveBaseCapacity: 20
      spotAllocationStrategy: capacity-optimized
      instanceTypes:
        - g5.xlarge
        - g5.2xlarge
        - g4dn.xlarge
```

**Configuration**:

* Base capacity: one on-demand node is always kept
* Additional capacity: 20% on-demand, 80% spot
* Multiple instance types increase spot availability

### Handle Spot Interruptions

Configure the Cluster Autoscaler for spot:

```yaml
extraArgs:
  balance-similar-node-groups: true
  skip-nodes-with-local-storage: false
  max-node-provision-time: 15m
```

Add the AWS Node Termination Handler:

```bash
helm repo add eks https://aws.github.io/eks-charts
helm install aws-node-termination-handler eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining=true
```

## Advanced Configuration

### Multiple Node Groups

Scale different workloads independently:

```yaml
cluster-autoscaler:
  autoscalingGroups:
    - name: cpu-small
      minSize: 2
      maxSize: 10
    - name: cpu-large
      minSize: 0
      maxSize: 5
    - name: gpu-a10
      minSize: 0
      maxSize: 10
    - name: gpu-t4
      minSize: 0
      maxSize: 5
```

### Scale-Up Policies

Control scale-up behavior:

```yaml
extraArgs:
  max-nodes-total: 50
  max-empty-bulk-delete: 10
  new-pod-scale-up-delay: 0s
  scan-interval: 10s
```

### Resource Limits

Prevent runaway scaling:

```yaml
extraArgs:
  cores-total: "0:512"
  memory-total: "0:2048"
  max-nodes-total: 100
```

## Monitoring

### CloudWatch Metrics

View Auto Scaling Group metrics in CloudWatch:

* `GroupDesiredCapacity`
* `GroupInServiceInstances`
* `GroupPendingInstances`
* `GroupTerminatingInstances`

### Kubernetes Events

```bash
kubectl get events -n smallest --sort-by='.lastTimestamp' | grep -i scale
```

### Cluster Autoscaler Status

```bash
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
```

### Grafana Dashboard
Import the Cluster Autoscaler dashboard:

* Dashboard ID: 3831

See [Grafana Dashboards](/waves/self-host/kubernetes-setup/autoscaling/grafana-dashboards).

## Troubleshooting

### Nodes Not Scaling Up

**Check pending pods**:

```bash
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
```

**Check Cluster Autoscaler logs**:

```bash
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-cluster-autoscaler --tail=100
```

**Common issues**:

* Max node count reached (`max-nodes-total`)
* IAM permission denied
* Auto Scaling Group at max capacity
* Node group not tagged properly

### Nodes Not Scaling Down

**Check node utilization**:

```bash
kubectl top nodes
```

**Check for blocking conditions**:

```bash
kubectl describe nodes | grep -i "scale-down-disabled"
```

**Common causes**:

* Pods not backed by a controller, or blocked by a restrictive PodDisruptionBudget
* Pods with local storage
* System pods (unless `skip-nodes-with-system-pods: false`)
* Node utilization above `scale-down-utilization-threshold`

### Permission Errors

**Check the service account**:

```bash
kubectl describe sa cluster-autoscaler -n kube-system
```

**Verify the IAM role**:

```bash
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-cluster-autoscaler | grep AccessDenied
```

Update the IAM policy if needed (see [IAM & IRSA](/waves/self-host/kubernetes-setup/quick-start)).

## Best Practices

Always tag Auto Scaling Groups:

```
k8s.io/cluster-autoscaler/smallest-cluster: owned
k8s.io/cluster-autoscaler/enabled: true
```

Configure appropriate min/max sizes for each node group:

```yaml
gpu-nodes:
  minSize: 0   # Save costs
  maxSize: 10  # Prevent runaway scaling
```

Protect critical workloads during scale-down with a PodDisruptionBudget:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: lightning-asr-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: lightning-asr
```

Track scaling decisions in Grafana and set alerts for scale failures.

Periodically test scale-up and scale-down:

```bash
kubectl scale deployment lightning-asr --replicas=20
```

Watch for proper node addition and removal.

## What's Next?
* Configure pod-level autoscaling
* Set up Prometheus metrics
* Visualize autoscaling behavior