EFS Configuration


Overview

Amazon Elastic File System (EFS) provides shared, persistent file storage for Kubernetes pods. This is ideal for storing AI models that can be shared across multiple Lightning ASR pods, eliminating duplicate downloads and reducing startup time.

Benefits of EFS

Shared Storage

Multiple pods can read/write simultaneously (ReadWriteMany)

Automatic Scaling

Storage grows and shrinks automatically

Fast Startup

Models cached once, used by all pods

Cost Effective

Pay only for storage used, no upfront provisioning

Prerequisites

1. EFS CSI Driver

Install the EFS CSI driver (see IAM & IRSA guide)

$kubectl get pods -n kube-system -l app=efs-csi-controller
2. VPC and Subnets

Note your EKS cluster’s VPC ID and subnet IDs:

$aws eks describe-cluster \
> --name smallest-cluster \
> --region us-east-1 \
> --query 'cluster.resourcesVpcConfig.{vpcId:vpcId,subnetIds:subnetIds}'
3. Security Group

Note your cluster security group ID:

$aws eks describe-cluster \
> --name smallest-cluster \
> --region us-east-1 \
> --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId'

Create EFS File System

Using AWS Console

1. Create File System

In the AWS Console, open the EFS service and click “Create file system”

2. Configure File System

  • Name: smallest-models
  • VPC: Select your EKS cluster VPC
  • Availability and Durability: Regional (recommended)
  • Click “Customize”
3. File System Settings

  • Performance mode: General Purpose
  • Throughput mode: Bursting (or Elastic for production)
  • Encryption: Enable encryption at rest
  • Click “Next”
4. Network Access

  • Select all subnets where EKS nodes run
  • Security group: Select cluster security group
  • Click “Next”
5. Review and Create

Review settings and click “Create”

Note the File system ID (e.g., fs-0123456789abcdef)
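If you lose track of the ID later, it can be recovered from the name you assigned, rather than clicking through the console. A sketch, assuming the `smallest-models` name used above:

```shell
# Look up the file system ID by the Name assigned at creation time
aws efs describe-file-systems \
  --region us-east-1 \
  --query "FileSystems[?Name=='smallest-models'].FileSystemId" \
  --output text
```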

Using AWS CLI

$VPC_ID=$(aws eks describe-cluster \
> --name smallest-cluster \
> --region us-east-1 \
> --query 'cluster.resourcesVpcConfig.vpcId' \
> --output text)
$
$SG_ID=$(aws eks describe-cluster \
> --name smallest-cluster \
> --region us-east-1 \
> --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' \
> --output text)
$
$FILE_SYSTEM_ID=$(aws efs create-file-system \
> --region us-east-1 \
> --performance-mode generalPurpose \
> --throughput-mode bursting \
> --encrypted \
> --tags Key=Name,Value=smallest-models \
> --query 'FileSystemId' \
> --output text)
$
$echo "Created EFS: $FILE_SYSTEM_ID"
$
$SUBNET_IDS=$(aws eks describe-cluster \
> --name smallest-cluster \
> --region us-east-1 \
> --query 'cluster.resourcesVpcConfig.subnetIds[*]' \
> --output text)
$
$for subnet in $SUBNET_IDS; do
$ aws efs create-mount-target \
> --file-system-id $FILE_SYSTEM_ID \
> --subnet-id $subnet \
> --security-groups $SG_ID \
> --region us-east-1
$done
$
$echo "EFS File System ID: $FILE_SYSTEM_ID"
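Mount targets take a minute or two to provision, and pods cannot mount the file system until every target reaches the `available` state. One way to watch for this:

```shell
# Mount targets must reach LifeCycleState "available" before pods can mount EFS
aws efs describe-mount-targets \
  --file-system-id $FILE_SYSTEM_ID \
  --region us-east-1 \
  --query 'MountTargets[*].{Subnet:SubnetId,State:LifeCycleState}' \
  --output table
```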

Configure Security Group

Ensure the security group allows NFS traffic (port 2049) from cluster nodes:

$SG_ID=$(aws eks describe-cluster \
> --name smallest-cluster \
> --region us-east-1 \
> --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' \
> --output text)
$
$aws ec2 authorize-security-group-ingress \
> --group-id $SG_ID \
> --protocol tcp \
> --port 2049 \
> --source-group $SG_ID \
> --region us-east-1

If the rule already exists, the command fails with an InvalidPermission.Duplicate error. This is safe to ignore.
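If you prefer an explicit check over relying on the duplicate-rule error, the existing rules can be inspected first (uses the `describe-security-group-rules` command from recent AWS CLI versions; `$SG_ID` as set above):

```shell
# List existing inbound rules on the cluster security group that cover NFS (2049)
aws ec2 describe-security-group-rules \
  --filters Name=group-id,Values=$SG_ID \
  --region us-east-1 \
  --query 'SecurityGroupRules[?FromPort==`2049` && IsEgress==`false`]'
```

An empty result (`[]`) means the NFS rule still needs to be added.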

Deploy with EFS in Helm

Update your values.yaml to enable EFS:

values.yaml
models:
  asrModelUrl: "your-model-url-here"
  volumes:
    aws:
      efs:
        enabled: true
        fileSystemId: "fs-0123456789abcdef"
        namePrefix: "models"

Replace fs-0123456789abcdef with your actual EFS file system ID.

Deploy or Upgrade

$helm upgrade --install smallest-self-host smallest-self-host/smallest-self-host \
> -f values.yaml \
> --namespace smallest

Verify EFS Configuration

Check Storage Class

$kubectl get storageclass

Should show:

NAME                PROVISIONER       RECLAIMPOLICY   VOLUMEBINDINGMODE   AGE
models-aws-efs-sc   efs.csi.aws.com   Delete          Immediate           1m

Check Persistent Volume

$kubectl get pv

Should show:

NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM
models-aws-efs-pv   5Gi        RWX            Retain           Bound    smallest/models-aws-efs-pvc

Check Persistent Volume Claim

$kubectl get pvc -n smallest

Should show:

NAME                 STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS        AGE
models-aws-efs-pvc   Bound    models-aws-efs-pv   5Gi        RWX            models-aws-efs-sc   1m

Verify Mount in Pod

$kubectl get pods -l app=lightning-asr -n smallest
$kubectl exec -it <lightning-asr-pod> -n smallest -- df -h | grep efs

Should show the EFS mount:

fs-0123456789abcdef.efs.us-east-1.amazonaws.com:/ 8.0E 0 8.0E 0% /app/models

Test EFS

Create a test file in one pod and verify it’s visible in another:

Write test file:

$kubectl exec -it <lightning-asr-pod-1> -n smallest -- sh -c "echo 'test' > /app/models/test.txt"

Read from another pod:

$kubectl exec -it <lightning-asr-pod-2> -n smallest -- cat /app/models/test.txt

Should output: test
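Afterwards, remove the test file so it doesn't linger on the shared volume (same pod placeholder as above):

```shell
# Clean up the test file from the shared EFS mount
kubectl exec -it <lightning-asr-pod-1> -n smallest -- rm /app/models/test.txt
```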

How Model Caching Works

With EFS enabled:

  1. First Pod Startup:

    • Pod downloads model from asrModelUrl
    • Saves model to /app/models (EFS mount)
    • Takes 5-10 minutes (one-time download)
  2. Subsequent Pod Startups:

    • Pod checks /app/models for existing model
    • Finds model already downloaded
    • Skips download, loads from EFS
    • Takes 30-60 seconds

This is especially valuable when using autoscaling, as new pods start much faster.
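The startup flow above boils down to a check-then-download pattern. A minimal sketch of that logic, for illustration only: `MODEL_DIR`, `MODEL_URL`, and `ensure_model` are hypothetical names, not the actual Lightning ASR entrypoint.

```shell
#!/bin/sh
# Illustrative cache check: download the model only if it is not already on the
# shared volume. All names here are hypothetical.
MODEL_DIR="${MODEL_DIR:-$(mktemp -d)}"   # in a pod this would be /app/models (the EFS mount)
MODEL_URL="${MODEL_URL:-https://example.com/model.tar.gz}"   # placeholder URL

ensure_model() {
  if [ -f "$MODEL_DIR/.download_complete" ]; then
    echo "cache hit"    # model already on EFS: skip the download
  else
    echo "cache miss"   # first pod: fetch the model, then leave a marker
    mkdir -p "$MODEL_DIR"
    # curl -fsSL "$MODEL_URL" | tar -xz -C "$MODEL_DIR"   # real fetch would go here
    touch "$MODEL_DIR/.download_complete"
  fi
}

ensure_model   # first pod startup: prints "cache miss"
ensure_model   # subsequent startups: prints "cache hit"
```

The marker file makes the check atomic enough for this sketch; a production entrypoint would also need to guard against a pod dying mid-download.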

Performance Tuning

Choose Throughput Mode

Bursting

Best for: development, testing, and variable workloads

  • Throughput scales with storage size
  • 50 MB/s per TB of storage
  • Bursting to 100 MB/s
  • Most cost-effective
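Using the figures above, the baseline throughput Bursting mode grants for a given amount of stored data can be estimated (a simplification that ignores burst credits and AWS-imposed minimums):

```shell
# Bursting baseline: ~50 MB/s per TB of stored data (1 TB = 1024 GB here)
awk -v gb=50 'BEGIN { printf "%.2f MB/s baseline for %d GB stored\n", gb / 1024 * 50, gb }'
```

For a 50 GB model cache this works out to roughly 2.4 MB/s of baseline throughput, which is why sustained heavy reads on a small file system rely on burst credits.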

Enable Lifecycle Management

Automatically move infrequently accessed files to lower-cost storage:

$aws efs put-lifecycle-configuration \
> --file-system-id fs-0123456789abcdef \
> --lifecycle-policies \
> '[{"TransitionToIA":"AFTER_30_DAYS"},{"TransitionToPrimaryStorageClass":"AFTER_1_ACCESS"}]'

Cost Optimization

Monitor EFS Usage

$aws efs describe-file-systems \
> --file-system-id fs-0123456789abcdef \
> --query 'FileSystems[0].SizeInBytes'

Estimate Costs

EFS pricing (us-east-1):

  • Standard storage: ~$0.30/GB/month
  • Infrequent Access: ~$0.025/GB/month
  • Data transfer: Free within same AZ

For 50 GB model:

  • Standard: ~$15/month
  • With IA (after 30 days): ~$1.25/month

Use lifecycle policies to automatically move old models to Infrequent Access storage.
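The same estimate can be derived from the `SizeInBytes` value reported by `describe-file-systems`, using the Standard rate quoted above:

```shell
# Estimate monthly EFS cost from a size in bytes, at ~$0.30/GB-month (Standard)
size_bytes=53687091200   # example: 50 GB, as reported in SizeInBytes.Value
awk -v b="$size_bytes" 'BEGIN {
  gb = b / (1024 * 1024 * 1024)   # bytes -> GB (binary)
  printf "%.0f GB -> ~$%.2f/month (Standard)\n", gb, gb * 0.30
}'
```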

Backup and Recovery

Enable AWS Backup

$aws backup create-backup-plan \
> --backup-plan '{
> "BackupPlanName": "smallest-efs-backup",
> "Rules": [{
> "RuleName": "daily-backup",
> "TargetBackupVaultName": "Default",
> "ScheduleExpression": "cron(0 2 * * ? *)",
> "Lifecycle": {
> "DeleteAfterDays": 30
> }
> }]
> }'
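A backup plan on its own protects nothing: you must also create a backup selection that assigns the EFS file system to the plan. A sketch, where the plan ID comes from the `create-backup-plan` output and the account ID and ARNs are placeholders:

```shell
# Assign the EFS file system to the backup plan (replace the placeholder IDs/ARNs)
aws backup create-backup-selection \
  --backup-plan-id <backup-plan-id> \
  --backup-selection '{
    "SelectionName": "smallest-efs",
    "IamRoleArn": "arn:aws:iam::<account-id>:role/service-role/AWSBackupDefaultServiceRole",
    "Resources": ["arn:aws:elasticfilesystem:us-east-1:<account-id>:file-system/fs-0123456789abcdef"]
  }'
```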

Automatic Backups

File systems created through the console have daily automatic backups enabled by default. Access them via AWS Console → EFS → Backups.

Troubleshooting

Mount Failed

Check EFS CSI driver:

$kubectl get pods -n kube-system -l app=efs-csi-controller
$kubectl logs -n kube-system -l app=efs-csi-controller

Verify security group rules:

$aws ec2 describe-security-groups --group-ids $SG_ID

Ensure port 2049 is open.
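To confirm the NFS endpoint is actually reachable from inside the cluster, a throwaway pod can probe port 2049 (substitute your file system's DNS name; this assumes the busybox image's `nc` build, which may vary):

```shell
# Probe the EFS NFS endpoint (port 2049) from a temporary pod
kubectl run efs-debug --rm -it --restart=Never --image=busybox -n smallest -- \
  sh -c 'nc -w 5 fs-0123456789abcdef.efs.us-east-1.amazonaws.com 2049 < /dev/null && echo "port 2049 reachable"'
```

If this hangs or fails while the security group looks correct, check that a mount target exists in the node's availability zone.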

Slow Performance

Check throughput mode:

$aws efs describe-file-systems \
> --file-system-id fs-0123456789abcdef \
> --query 'FileSystems[0].ThroughputMode'

Consider upgrading to Elastic or Provisioned.

Monitor CloudWatch metrics:

  • PermittedThroughput
  • BurstCreditBalance
  • ClientConnections
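These metrics live in the `AWS/EFS` CloudWatch namespace, keyed by file system ID. For example, a steadily declining minimum burst credit balance warns of impending throttling (the `date -d` syntax below is GNU date):

```shell
# Minimum BurstCreditBalance over the last 24 hours, in hourly buckets
aws cloudwatch get-metric-statistics \
  --namespace AWS/EFS \
  --metric-name BurstCreditBalance \
  --dimensions Name=FileSystemId,Value=fs-0123456789abcdef \
  --start-time "$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 3600 \
  --statistics Minimum \
  --region us-east-1
```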

Permission Denied

Check mount options in PV:

$kubectl get pv models-aws-efs-pv -o yaml

Should include:

mountOptions:
  - tls

Alternative: EBS for Single Pod

If you don’t need shared storage (single replica only):

values.yaml
models:
  volumes:
    aws:
      efs:
        enabled: false

scaling:
  replicas:
    lightningAsr: 1

lightningAsr:
  persistence:
    enabled: true
    storageClass: gp3
    size: 100Gi

EBS volumes are ReadWriteOnce: they can be attached to only one node at a time, so each pod would need its own copy of the model. This prevents horizontal scaling.

What’s Next?