AWS Application Load Balancer: Fix 503 Service Unavailable due to Target Group Resource Exhaustion
Quick Fix Summary
TL;DR: Increase target group capacity or scale out healthy targets immediately.
The ALB cannot route traffic because the target group lacks usable capacity (e.g., no healthy targets, connection limits exceeded, or saturated backends).
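A quick first check is to count how many targets are currently healthy; zero healthy targets explains an immediate 503. A minimal sketch using a standard JMESPath --query (substitute your own target group ARN):
# Count healthy targets; 0 means the ALB has nowhere to send traffic
aws elbv2 describe-target-health --target-group-arn <TARGET_GROUP_ARN> --query "length(TargetHealthDescriptions[?TargetHealth.State=='healthy'])"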
Recovery Steps
Step 1: Verify Target Health and Load Balancer Metrics
Check the health status of targets and review CloudWatch metrics for the ALB and target group to confirm resource exhaustion.
# Describe target health for the specific target group
aws elbv2 describe-target-health --target-group-arn <TARGET_GROUP_ARN>
# Check ALB CloudWatch metrics for HTTPCode_ELB_5XX_Count and TargetConnectionErrorCount
aws cloudwatch get-metric-statistics --namespace AWS/ApplicationELB --metric-name HTTPCode_ELB_5XX_Count --dimensions Name=LoadBalancer,Value=<ALB_ARN_Suffix> --start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s) --period 300 --statistics Sum
Step 2: Scale Out Healthy Targets
Increase the number of healthy instances in your Auto Scaling Group (ASG) or manually register new, healthy targets to the group.
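Before changing capacity, it is worth confirming whether the ASG is already scaling or failing to launch instances; a quick check using the standard Auto Scaling API:
# Review recent scaling activity for errors (e.g., capacity shortages, failed launches)
aws autoscaling describe-scaling-activities --auto-scaling-group-name <ASG_NAME> --max-records 5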
# Set desired capacity for ASG linked to the target group
aws autoscaling set-desired-capacity --auto-scaling-group-name <ASG_NAME> --desired-capacity <NEW_CAPACITY> --honor-cooldown
# Manually register a new EC2 instance to the target group
aws elbv2 register-targets --target-group-arn <TARGET_GROUP_ARN> --targets Id=<INSTANCE_ID>
Step 3: Adjust Target Group Health Check Settings
Temporarily relax the health checks so more targets stay in service, but only if the backend can actually handle the traffic. Focus on increasing the timeout and interval and lowering the healthy threshold.
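Record the current settings first so you can revert once the incident is over; the fields below are standard describe-target-groups output:
# Capture current health check settings for later rollback
aws elbv2 describe-target-groups --target-group-arns <TARGET_GROUP_ARN> --query 'TargetGroups[0].[HealthCheckTimeoutSeconds,HealthCheckIntervalSeconds,HealthyThresholdCount,UnhealthyThresholdCount]'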
# Modify health checks for the target group (example: longer timeout, lower healthy threshold)
aws elbv2 modify-target-group --target-group-arn <TARGET_GROUP_ARN> --health-check-timeout-seconds 10 --health-check-interval-seconds 30 --healthy-threshold-count 3 --unhealthy-threshold-count 2
Step 4: Review and Increase Backend Capacity
Check CPU and memory on the backend targets. If they are saturated, scale vertically (larger instance sizes) or optimize application performance.
# SSH into a backend instance and check resource usage
ssh -i <KEY_PEM> ec2-user@<INSTANCE_IP> 'top -bn1 | head -20'
# Check CloudWatch for EC2 CPU utilization
aws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name CPUUtilization --dimensions Name=InstanceId,Value=<INSTANCE_ID> --start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s) --period 300 --statistics Average
Step 5: Implement Connection Draining and Adjust ALB Timeouts
Enable connection draining (deregistration delay) on the target group to allow in-flight requests to complete during scaling. Increase ALB idle timeout if clients use long-lived connections.
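Check the current values first; both attributes are visible via standard describe calls:
# Show target group attributes (includes deregistration_delay.timeout_seconds)
aws elbv2 describe-target-group-attributes --target-group-arn <TARGET_GROUP_ARN>
# Show load balancer attributes (includes idle_timeout.timeout_seconds)
aws elbv2 describe-load-balancer-attributes --load-balancer-arn <ALB_ARN>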
# Set the deregistration delay (connection draining); it is a target group attribute, not a modify-target-group flag
aws elbv2 modify-target-group-attributes --target-group-arn <TARGET_GROUP_ARN> --attributes Key=deregistration_delay.timeout_seconds,Value=300
# Modify the ALB idle timeout (a load balancer attribute, not a listener setting)
aws elbv2 modify-load-balancer-attributes --load-balancer-arn <ALB_ARN> --attributes Key=idle_timeout.timeout_seconds,Value=60
Step 6: Check Security Group and Network ACL Rules
Ensure the ALB's security group allows outbound traffic to the targets and the targets' security groups allow inbound traffic from the ALB on the health check and application ports.
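Network ACLs are stateless, so they must allow both the inbound request and the return traffic on ephemeral ports. A quick check, assuming <SUBNET_ID> is a placeholder for one of the target subnets:
# List the network ACL associated with a target subnet and review its allow/deny entries
aws ec2 describe-network-acls --filters Name=association.subnet-id,Values=<SUBNET_ID>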
# Describe security groups for ALB and a target instance
aws ec2 describe-security-groups --group-ids <ALB_SG_ID> <TARGET_SG_ID>
Architect's Pro Tip
"This often happens during sudden traffic spikes when Auto Scaling lags. Pre-warm your ASG by proactively scaling based on predictive metrics (e.g., RequestCountPerTarget) rather than just CPU. Also, ensure your health check endpoint is lightweight and doesn't itself fail under load."
Frequently Asked Questions
My targets show as 'healthy' but I still get 503s. What's wrong?
Targets can be healthy but the target group itself may be at capacity. Check the ALB's 'ProcessedBytes' and 'TargetConnectionErrorCount' metrics. The issue might be the backend cannot accept new connections (e.g., max threads, listen queue full) despite passing a simple health check.
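One way to confirm a full accept queue on a target, assuming SSH access and that the application listens on port 8080 (a placeholder):
# On a LISTEN socket, Recv-Q near Send-Q (the configured backlog) suggests queue overflow
ssh -i <KEY_PEM> ec2-user@<INSTANCE_IP> "ss -ltn 'sport = :8080'"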
How do I know if I've hit the target group limits?
AWS has soft limits on targets per ALB and rules per ALB. If you have a very large number of targets (thousands) or complex rule sets, you can exhaust ALB resources. Watch the CloudWatch metrics 'ActiveConnectionCount' and 'TargetConnectionErrorCount', and contact AWS Support to raise the limits if necessary.
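You can list the Elastic Load Balancing quotas that apply to your account with the Service Quotas API; quota names vary, so scan the table for the one you need:
# List ELB quotas (e.g., 'Targets per Application Load Balancer')
aws service-quotas list-service-quotas --service-code elasticloadbalancing --query 'Quotas[].[QuotaName,Value]' --output table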