---
title: EFS Configuration
description: Set up Amazon EFS for shared storage in AWS EKS
---

## Overview

Amazon Elastic File System (EFS) provides shared, persistent file storage for Kubernetes pods. This is ideal for storing AI models that can be shared across multiple Lightning ASR pods, eliminating duplicate downloads and reducing startup time.

## Benefits of EFS

* Multiple pods can read/write simultaneously (ReadWriteMany)
* Storage grows and shrinks automatically
* Models are cached once and used by all pods
* Pay only for storage used, no upfront provisioning

## Prerequisites

Install the EFS CSI driver (see the [IAM & IRSA](/waves/self-host/kubernetes-setup/quick-start) guide), then confirm it is running:

```bash
kubectl get pods -n kube-system -l app=efs-csi-controller
```

Note your EKS cluster's VPC ID and subnet IDs:

```bash
aws eks describe-cluster \
  --name smallest-cluster \
  --region us-east-1 \
  --query 'cluster.resourcesVpcConfig.{vpcId:vpcId,subnetIds:subnetIds}'
```

Note your cluster security group ID:

```bash
aws eks describe-cluster \
  --name smallest-cluster \
  --region us-east-1 \
  --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId'
```

## Create EFS File System

### Using AWS Console

1. Go to AWS Console → EFS → Create file system
2. Configure:
   * **Name**: `smallest-models`
   * **VPC**: Select your EKS cluster VPC
   * **Availability and Durability**: Regional (recommended)
3. Click "Customize":
   * **Performance mode**: General Purpose
   * **Throughput mode**: Bursting (or Elastic for production)
   * **Encryption**: Enable encryption at rest
4. Click "Next":
   * Select all subnets where EKS nodes run
   * Security group: Select the cluster security group
5. Click "Next", review the settings, and click "Create"

Note the **File system ID** (e.g., `fs-0123456789abcdef`).

### Using AWS CLI

```bash
VPC_ID=$(aws eks describe-cluster \
  --name smallest-cluster \
  --region us-east-1 \
  --query 'cluster.resourcesVpcConfig.vpcId' \
  --output text)

SG_ID=$(aws eks describe-cluster \
  --name smallest-cluster \
  --region us-east-1 \
  --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' \
  --output text)

FILE_SYSTEM_ID=$(aws efs create-file-system \
  --region us-east-1 \
  --performance-mode generalPurpose \
  --throughput-mode bursting \
  --encrypted \
  --tags Key=Name,Value=smallest-models \
  --query 'FileSystemId' \
  --output text)

echo "Created EFS: $FILE_SYSTEM_ID"

SUBNET_IDS=$(aws eks describe-cluster \
  --name smallest-cluster \
  --region us-east-1 \
  --query 'cluster.resourcesVpcConfig.subnetIds[*]' \
  --output text)

for subnet in $SUBNET_IDS; do
  aws efs create-mount-target \
    --file-system-id $FILE_SYSTEM_ID \
    --subnet-id $subnet \
    --security-groups $SG_ID \
    --region us-east-1
done

echo "EFS File System ID: $FILE_SYSTEM_ID"
```

## Configure Security Group

Ensure the security group allows NFS traffic (port 2049) from cluster nodes:

```bash
SG_ID=$(aws eks describe-cluster \
  --name smallest-cluster \
  --region us-east-1 \
  --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' \
  --output text)

aws ec2 authorize-security-group-ingress \
  --group-id $SG_ID \
  --protocol tcp \
  --port 2049 \
  --source-group $SG_ID \
  --region us-east-1
```

If the rule already exists, the command returns an error; this is safe to ignore.

## Deploy with EFS in Helm

Update your `values.yaml` to enable EFS:

```yaml values.yaml
models:
  asrModelUrl: "your-model-url-here"
  volumes:
    aws:
      efs:
        enabled: true
        fileSystemId: "fs-0123456789abcdef"
        namePrefix: "models"
```

Replace `fs-0123456789abcdef` with your actual EFS file system ID.
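Mount targets can take a minute or two to become available after creation, and pods cannot mount the file system until they are. Before deploying, you can confirm every mount target is in the `available` state — a quick check, assuming the `FILE_SYSTEM_ID` shell variable from the CLI creation step (or substitute your file system ID directly):

```bash
# List each mount target's subnet and lifecycle state.
# Wait until every State reads "available" before deploying.
aws efs describe-mount-targets \
  --file-system-id $FILE_SYSTEM_ID \
  --region us-east-1 \
  --query 'MountTargets[*].{Subnet:SubnetId,State:LifeCycleState}' \
  --output table
```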
### Deploy or Upgrade

```bash
helm upgrade --install smallest-self-host smallest-self-host/smallest-self-host \
  -f values.yaml \
  --namespace smallest
```

## Verify EFS Configuration

### Check Storage Class

```bash
kubectl get storageclass
```

Should show:

```
NAME                PROVISIONER       RECLAIMPOLICY   VOLUMEBINDINGMODE   AGE
models-aws-efs-sc   efs.csi.aws.com   Delete          Immediate           1m
```

### Check Persistent Volume

```bash
kubectl get pv
```

Should show:

```
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM
models-aws-efs-pv   5Gi        RWX            Retain           Bound    smallest/models-aws-efs-pvc
```

### Check Persistent Volume Claim

```bash
kubectl get pvc -n smallest
```

Should show:

```
NAME                 STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS        AGE
models-aws-efs-pvc   Bound    models-aws-efs-pv   5Gi        RWX            models-aws-efs-sc   1m
```

### Verify Mount in Pod

List the Lightning ASR pods, then check the mount inside one of them (replace `<pod-name>` with a pod from the list):

```bash
kubectl get pods -l app=lightning-asr -n smallest
kubectl exec -it <pod-name> -n smallest -- df -h | grep efs
```

Should show the EFS mount:

```
fs-0123456789abcdef.efs.us-east-1.amazonaws.com:/  8.0E  0  8.0E  0%  /app/models
```

## Test EFS

Create a test file in one pod and verify it's visible in another (replace `<pod-name-1>` and `<pod-name-2>` with two different pod names).

### Write a test file

```bash
kubectl exec -it <pod-name-1> -n smallest -- sh -c "echo 'test' > /app/models/test.txt"
```

### Read from another pod

```bash
kubectl exec -it <pod-name-2> -n smallest -- cat /app/models/test.txt
```

Should output: `test`

## How Model Caching Works

With EFS enabled:

1. **First Pod Startup**:
   * Pod downloads the model from `asrModelUrl`
   * Saves the model to `/app/models` (the EFS mount)
   * Takes 5-10 minutes (one-time download)
2. **Subsequent Pod Startups**:
   * Pod checks `/app/models` for an existing model
   * Finds the model already downloaded
   * Skips the download and loads from EFS
   * Takes 30-60 seconds

This is especially valuable when using autoscaling, as new pods start much faster.
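The startup flow above can be sketched as a small shell function. This is an illustrative sketch only, not the actual entrypoint shipped in the Lightning ASR image: the `MODEL_DIR` default, the `ASR_MODEL_URL` variable, and the `.download_complete` marker file are all assumptions.

```shell
#!/bin/sh
# Illustrative sketch of an EFS-backed model cache check.
# MODEL_DIR, ASR_MODEL_URL, and the marker file name are assumptions,
# not the real entrypoint used by the image.
MODEL_DIR="${MODEL_DIR:-/app/models/asr}"

ensure_model() {
  if [ -f "$MODEL_DIR/.download_complete" ]; then
    # Another pod already populated the shared EFS mount.
    echo "cached"
  else
    # First pod: download and unpack, then leave a marker for peers.
    mkdir -p "$MODEL_DIR"
    curl -fsSL "$ASR_MODEL_URL" -o "$MODEL_DIR/model.tar.gz"
    tar -xzf "$MODEL_DIR/model.tar.gz" -C "$MODEL_DIR"
    touch "$MODEL_DIR/.download_complete"
    echo "downloaded"
  fi
}

# A container entrypoint would call ensure_model before starting the server.
```

Because the marker lives on the shared mount, only the first pod pays the download cost; every later pod takes the fast branch.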
## Performance Tuning

### Choose Throughput Mode

#### Bursting

**Best for**: Development, testing, variable workloads

* Throughput scales with storage size
* 50 MB/s per TB of storage
* Bursts to 100 MB/s
* Most cost-effective

#### Elastic

**Best for**: Production with unpredictable load

* Automatically scales throughput
* Up to 3 GB/s for reads
* Up to 1 GB/s for writes
* Pay for throughput used

Update via console or CLI:

```bash
aws efs update-file-system \
  --file-system-id fs-0123456789abcdef \
  --throughput-mode elastic
```

#### Provisioned

**Best for**: Production with consistent high throughput

* Fixed throughput independent of storage size
* Up to 1 GB/s throughput
* Higher cost

```bash
aws efs update-file-system \
  --file-system-id fs-0123456789abcdef \
  --throughput-mode provisioned \
  --provisioned-throughput-in-mibps 100
```

### Enable Lifecycle Management

Automatically move infrequently accessed files to lower-cost storage:

```bash
aws efs put-lifecycle-configuration \
  --file-system-id fs-0123456789abcdef \
  --lifecycle-policies \
  '[{"TransitionToIA":"AFTER_30_DAYS"},{"TransitionToPrimaryStorageClass":"AFTER_1_ACCESS"}]'
```

## Cost Optimization

### Monitor EFS Usage

```bash
aws efs describe-file-systems \
  --file-system-id fs-0123456789abcdef \
  --query 'FileSystems[0].SizeInBytes'
```

### Estimate Costs

EFS pricing (us-east-1):

* **Standard storage**: \~\$0.30/GB/month
* **Infrequent Access**: \~\$0.025/GB/month
* **Data transfer**: Free within the same AZ

For a 50 GB model:

* Standard: \~\$15/month
* With IA (after 30 days): \~\$1.25/month

Use lifecycle policies to automatically move old models to Infrequent Access storage.

## Backup and Recovery

### Enable AWS Backup

```bash
aws backup create-backup-plan \
  --backup-plan '{
    "BackupPlanName": "smallest-efs-backup",
    "Rules": [{
      "RuleName": "daily-backup",
      "TargetBackupVaultName": "Default",
      "ScheduleExpression": "cron(0 2 * * ? *)",
      "Lifecycle": { "DeleteAfterDays": 30 }
    }]
  }'
```

### Automatic Backups

With automatic backups enabled, EFS creates daily point-in-time backups.
Access them via AWS Console → EFS → Backups.

## Troubleshooting

### Mount Failed

**Check the EFS CSI driver**:

```bash
kubectl get pods -n kube-system -l app=efs-csi-controller
kubectl logs -n kube-system -l app=efs-csi-controller
```

**Verify security group rules**:

```bash
aws ec2 describe-security-groups --group-ids $SG_ID
```

Ensure port 2049 is open.

### Slow Performance

**Check the throughput mode**:

```bash
aws efs describe-file-systems \
  --file-system-id fs-0123456789abcdef \
  --query 'FileSystems[0].ThroughputMode'
```

Consider upgrading to Elastic or Provisioned.

**Monitor CloudWatch metrics**:

* `PermittedThroughput`
* `BurstCreditBalance`
* `ClientConnections`

### Permission Denied

**Check mount options** in the PV:

```bash
kubectl get pv models-aws-efs-pv -o yaml
```

Should include:

```yaml
mountOptions:
  - tls
```

## Alternative: EBS for Single Pod

If you don't need shared storage (single replica only):

```yaml values.yaml
models:
  volumes:
    aws:
      efs:
        enabled: false

scaling:
  replicas:
    lightningAsr: 1

lightningAsr:
  persistence:
    enabled: true
    storageClass: gp3
    size: 100Gi
```

An EBS volume can only be attached to one pod at a time, which prevents horizontal scaling.

## What's Next?

* Optimize model storage and caching strategies
* Enable autoscaling with shared model storage