AWS EKS Setup


Overview

This guide walks you through creating an Amazon EKS cluster optimized for running Smallest Self-Host with GPU acceleration.

Prerequisites

1. AWS CLI

Install and configure AWS CLI:

$aws --version
$aws configure

2. eksctl

Install eksctl (EKS cluster management tool):

$brew install eksctl

Verify:

$eksctl version

3. kubectl

Install kubectl:

$brew install kubectl

4. IAM Permissions

Ensure your AWS user/role has permissions to:

  • Create EKS clusters
  • Manage EC2 instances
  • Create IAM roles
  • Manage VPC resources
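
The exact policies depend on your organization, but a minimal sketch of the permission areas above could look like the following IAM policy fragment. It is intentionally broad and illustrative only; scope the Action and Resource lists down for production, and consult the eksctl documentation for the full set of permissions it requires:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EksClusterCreation",
      "Effect": "Allow",
      "Action": [
        "eks:*",
        "ec2:*",
        "cloudformation:*",
        "iam:CreateRole",
        "iam:AttachRolePolicy",
        "iam:PassRole"
      ],
      "Resource": "*"
    }
  ]
}
```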

Cluster Configuration

Option 1: Quick Start with eksctl

Create a cluster with GPU nodes using a single command:

$eksctl create cluster \
> --name smallest-cluster \
> --region us-east-1 \
> --version 1.28 \
> --nodegroup-name cpu-nodes \
> --node-type t3.large \
> --nodes 2 \
> --nodes-min 1 \
> --nodes-max 3 \
> --managed

Then add GPU node group:

$eksctl create nodegroup \
> --cluster smallest-cluster \
> --region us-east-1 \
> --name gpu-nodes \
> --node-type g5.xlarge \
> --nodes 1 \
> --nodes-min 0 \
> --nodes-max 5 \
> --managed \
> --node-labels "workload=gpu,nvidia.com/gpu=true" \
> --node-taints "nvidia.com/gpu=true:NoSchedule"

This creates a cluster with separate CPU and GPU node groups, allowing for cost-effective scaling.
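Because the GPU nodes carry a NoSchedule taint, workloads must tolerate that taint (and should select the GPU label) to land on them. A minimal pod spec sketch, where the pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload              # placeholder name
spec:
  nodeSelector:
    workload: gpu                 # matches the --node-labels above
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule          # matches the --node-taints above
  containers:
    - name: app
      image: your-image:latest    # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1       # request one GPU
```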

Option 2: Using Cluster Config File

Create a cluster configuration file for more control:

cluster-config.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: smallest-cluster
  region: us-east-1
  version: "1.28"

iam:
  withOIDC: true

managedNodeGroups:
  - name: cpu-nodes
    instanceType: t3.large
    minSize: 1
    maxSize: 3
    desiredCapacity: 2
    volumeSize: 50
    ssh:
      allow: false
    labels:
      workload: cpu
    tags:
      Environment: production
      Application: smallest-self-host

  - name: gpu-nodes
    instanceType: g5.xlarge
    minSize: 0
    maxSize: 5
    desiredCapacity: 1
    volumeSize: 100
    ssh:
      allow: false
    labels:
      workload: gpu
      nvidia.com/gpu: "true"
      node.kubernetes.io/instance-type: g5.xlarge
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
    tags:
      Environment: production
      Application: smallest-self-host
      NodeType: gpu
    iam:
      withAddonPolicies:
        autoScaler: true
        ebs: true
        efs: true

addons:
  - name: vpc-cni
  - name: coredns
  - name: kube-proxy
  - name: aws-ebs-csi-driver

Create the cluster:

$eksctl create cluster -f cluster-config.yaml

Cluster creation takes 15-20 minutes. Monitor progress in the AWS CloudFormation console.

GPU Instance Types

Choose the right GPU instance type for your workload:

| Instance Type | GPU     | VRAM  | vCPUs | RAM    | $/hour* | Recommended For        |
| ------------- | ------- | ----- | ----- | ------ | ------- | ---------------------- |
| g5.xlarge     | 1x A10G | 24 GB | 4     | 16 GB  | $1.00   | Development, testing   |
| g5.2xlarge    | 1x A10G | 24 GB | 8     | 32 GB  | $1.21   | Small production       |
| g5.4xlarge    | 1x A10G | 24 GB | 16    | 64 GB  | $1.63   | Medium production      |
| g5.12xlarge   | 4x A10G | 96 GB | 48    | 192 GB | $5.67   | High-volume production |
| p3.2xlarge    | 1x V100 | 16 GB | 8     | 61 GB  | $3.06   | Legacy workloads       |
* Approximate on-demand pricing in us-east-1, subject to change

Recommendation: Start with g5.xlarge for development and testing. Scale to g5.2xlarge or higher for production.

Verify Cluster

Check Cluster Status

$eksctl get cluster --name smallest-cluster --region us-east-1

Verify Node Groups

$eksctl get nodegroup --cluster smallest-cluster --region us-east-1

Configure kubectl

$aws eks update-kubeconfig --name smallest-cluster --region us-east-1

Verify access:

$kubectl get nodes

Expected output:

NAME           STATUS   ROLES    AGE   VERSION
ip-xxx-cpu-1   Ready    <none>   5m    v1.28.x
ip-xxx-cpu-2   Ready    <none>   5m    v1.28.x
ip-xxx-gpu-1   Ready    <none>   5m    v1.28.x

Verify GPU Nodes

Check GPU availability:

$kubectl get nodes -l workload=gpu -o json | \
> jq '.items[].status.capacity'

Look for nvidia.com/gpu in the output:

{
  "cpu": "4",
  "memory": "15944904Ki",
  "nvidia.com/gpu": "1",
  "pods": "29"
}

Install NVIDIA Device Plugin

The NVIDIA device plugin enables GPU scheduling in Kubernetes.

The Smallest Self-Host chart includes the NVIDIA GPU Operator. Enable it in your values:

values.yaml
gpu-operator:
  enabled: true

Manual Installation

If installing separately:

$kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

Verify:

$kubectl get pods -n kube-system | grep nvidia
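
Once the plugin pods are running, you can confirm end-to-end GPU scheduling with a throwaway pod that runs nvidia-smi. The CUDA image tag below is illustrative; pick one compatible with your node's driver version:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

Apply it, then check kubectl logs gpu-smoke-test for the nvidia-smi output table.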

Install EBS CSI Driver

Required for persistent volumes:

Using eksctl

$eksctl create addon \
> --name aws-ebs-csi-driver \
> --cluster smallest-cluster \
> --region us-east-1

Using AWS Console

  1. Navigate to EKS → Clusters → smallest-cluster → Add-ons
  2. Click “Add new”
  3. Select “Amazon EBS CSI Driver”
  4. Click “Add”

Verify EBS CSI Driver

$kubectl get pods -n kube-system -l app=ebs-csi-controller
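
With the driver running, define a StorageClass so PersistentVolumeClaims provision EBS volumes dynamically. A gp3 sketch (the name and parameters are suggestions, not chart requirements):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3                            # suggested name
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer    # provision in the pod's AZ
parameters:
  type: gp3
  encrypted: "true"
```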

Install EFS CSI Driver (Optional)

Recommended for shared model storage across pods.

Create IAM Policy

$curl -o iam-policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-efs-csi-driver/master/docs/iam-policy-example.json
$
$aws iam create-policy \
> --policy-name AmazonEKS_EFS_CSI_Driver_Policy \
> --policy-document file://iam-policy.json

Create IAM Service Account

$eksctl create iamserviceaccount \
> --cluster smallest-cluster \
> --region us-east-1 \
> --namespace kube-system \
> --name efs-csi-controller-sa \
> --attach-policy-arn arn:aws:iam::YOUR_ACCOUNT_ID:policy/AmazonEKS_EFS_CSI_Driver_Policy \
> --approve

Replace YOUR_ACCOUNT_ID with your AWS account ID.

Install EFS CSI Driver

$kubectl apply -k "github.com/kubernetes-sigs/aws-efs-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-1.7"

Verify:

$kubectl get pods -n kube-system -l app=efs-csi-controller
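
To use EFS for shared model storage, create an EFS file system in the cluster's VPC and reference its ID in a StorageClass. Here fs-XXXXXXXX is a placeholder for your file system ID:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-shared              # suggested name
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap      # dynamic provisioning via access points
  fileSystemId: fs-XXXXXXXX     # placeholder: your EFS file system ID
  directoryPerms: "700"
```

EFS volumes support ReadWriteMany, so multiple pods can mount the same model cache concurrently.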

Enable Cluster Autoscaler

See the Cluster Autoscaler guide for detailed setup.

Quick setup:

$eksctl create iamserviceaccount \
> --cluster smallest-cluster \
> --region us-east-1 \
> --namespace kube-system \
> --name cluster-autoscaler \
> --attach-policy-arn arn:aws:iam::aws:policy/AutoScalingFullAccess \
> --approve \
> --override-existing-serviceaccounts

Cost Optimization

Use Spot Instances for GPU Nodes

Reduce costs by up to 70% with Spot instances:

cluster-config.yaml
managedNodeGroups:
  - name: gpu-nodes-spot
    instanceType: g5.xlarge
    minSize: 0
    maxSize: 5
    desiredCapacity: 1
    spot: true
    instancesDistribution:
      maxPrice: 0.50
      instanceTypes: ["g5.xlarge", "g5.2xlarge"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: capacity-optimized

Spot instances can be interrupted with only a two-minute warning. Ensure your application handles graceful shutdowns.
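
One way to handle interruptions is to give pods enough time to flush in-flight work before they are killed. A deployment pod-template fragment sketch, where the preStop command is a placeholder for your own drain logic:

```yaml
spec:
  terminationGracePeriodSeconds: 120   # time to drain before SIGKILL
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            # placeholder: replace with your app's drain/flush command
            command: ["/bin/sh", "-c", "sleep 10"]
```

Pairing this with the AWS Node Termination Handler lets nodes cordon and drain automatically when the Spot interruption notice arrives.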

Right-Size Node Groups

Start small and scale based on metrics:

managedNodeGroups:
  - name: gpu-nodes
    minSize: 0
    maxSize: 10
    desiredCapacity: 1

Set minSize: 0 to scale down to zero during off-hours.

Enable Cluster Autoscaler

Automatically adjust node count based on demand:

values.yaml
cluster-autoscaler:
  enabled: true
  autoDiscovery:
    clusterName: smallest-cluster
  awsRegion: us-east-1

Security Best Practices

Enable Private Endpoint

$eksctl utils update-cluster-endpoints \
> --cluster smallest-cluster \
> --region us-east-1 \
> --private-access=true \
> --public-access=false \
> --approve

Enable Logging

$eksctl utils update-cluster-logging \
> --cluster smallest-cluster \
> --region us-east-1 \
> --enable-types all \
> --approve

Update Security Groups

Restrict inbound access to API server:

$aws ec2 describe-security-groups \
> --filters "Name=tag:aws:eks:cluster-name,Values=smallest-cluster"

Update rules to allow only specific IPs.

Troubleshooting

GPU Nodes Not Ready

Check NVIDIA device plugin:

$kubectl get pods -n kube-system | grep nvidia
$kubectl describe node <gpu-node-name>

Pods Stuck in Pending

Check node capacity:

$kubectl describe pod <pod-name>
$kubectl get nodes -o json | jq '.items[].status.allocatable'

EBS Volumes Not Mounting

Verify EBS CSI driver:

$kubectl get pods -n kube-system -l app=ebs-csi-controller
$kubectl logs -n kube-system -l app=ebs-csi-controller

What’s Next?