Troubleshooting HTTP 502 Errors from ALB Despite Target Health Checks Passing
Quick Fix Summary
TL;DR: Check ALB access logs for 502s and verify target response timeouts.
A 502 from the ALB indicates it received an invalid or incomplete response from a registered target, even though the target's health check endpoint may be returning a simple success (e.g., 200 OK). The issue usually lies in how the application behaves on the full request path, which the health check does not exercise.
Diagnosis & Causes
Recovery Steps
Step 1: Verify and Isolate with ALB Access Logs
Enable and examine ALB Access Logs to confirm 502 errors, identify the failing target, and see the upstream response time.
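Once log files have been delivered and downloaded, a filter along these lines can pull out the 502 entries. This is a minimal sketch, not a definitive tool: the function name `alb_502_summary` is hypothetical, and the field positions assume the standard space-separated ALB access log layout (field 5 is `target:port`, field 7 is `target_processing_time`, field 9 is `elb_status_code`, field 10 is `target_status_code`).

```shell
#!/bin/sh
# Sketch: print target, target_processing_time and target_status_code
# for every access log line (read from stdin) whose ELB status code is 502.
# Assumes the standard ALB access log field order.
alb_502_summary() {
  awk '$9 == "502" { print $5, $7, $10 }'
}

# Usage (the file name is a placeholder):
#   gunzip -c <LOG_FILE>.log.gz | alb_502_summary | sort | uniq -c | sort -rn
```

In the output, a `-` in the last column generally means the target returned no usable status code, which points at the ALB generating the 502 after the target closed the connection or sent a malformed response.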
# Enable via Console or CLI:
aws elbv2 modify-load-balancer-attributes --load-balancer-arn <ALB_ARN> --attributes Key=access_logs.s3.enabled,Value=true Key=access_logs.s3.bucket,Value=your-log-bucket
# Query logs (example pattern for 502s):
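# ALB log objects are gzip-compressed, so stream a single object and decompress it:
aws s3 cp s3://your-log-bucket/AWSLogs/.../elasticloadbalancing/.../<LOG_FILE>.log.gz - | gunzip | grep " 502 " | head -20
Step 2: Analyze Target Response Metrics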
Check CloudWatch metrics for the target group to identify latency spikes, request counts, and HTTP 5xx errors originating from the targets.
# Key metrics to graph in CloudWatch:
- TargetResponseTime (p99 > 30s likely times out)
- HTTPCode_Target_5XX_Count
- RequestCount per target
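When querying these metrics from the CLI, the CloudWatch dimension values are the trailing portion of each ARN rather than the full ARN. The sketch below derives them with plain parameter expansion; the function names `tg_dimension` and `alb_dimension` are hypothetical helpers, not part of the AWS CLI.

```shell
#!/bin/sh
# Sketch: turn full ARNs into the values CloudWatch expects for the
# TargetGroup and LoadBalancer dimensions.

# arn:...:targetgroup/<name>/<id>  ->  targetgroup/<name>/<id>
tg_dimension() {
  printf '%s\n' "${1##*:}"
}

# arn:...:loadbalancer/app/<name>/<id>  ->  app/<name>/<id>
alb_dimension() {
  printf '%s\n' "${1##*:loadbalancer/}"
}

# Usage:
#   tg_dimension "<TG_ARN>"
#   alb_dimension "<ALB_ARN>"
```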
# Use AWS CLI to get metric data:
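# TargetGroup must be paired with LoadBalancer; both values are ARN suffixes, not full ARNs:
aws cloudwatch get-metric-statistics --namespace AWS/ApplicationELB --metric-name TargetResponseTime --dimensions Name=TargetGroup,Value=targetgroup/<TG_NAME>/<TG_ID> Name=LoadBalancer,Value=app/<ALB_NAME>/<ALB_ID> --start-time ... --end-time ... --period 300 --statistics Average Maximum
Step 3: Check Application Logs on the Failing Target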
SSH into the instance identified in Step 1 and examine application logs for errors, timeouts, or crashes during the request.
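Alongside the logs, a breakdown of TCP connection states on the target can hint at trouble, for example a growing pile of CLOSE-WAIT sockets when the application is not closing connections. The following is a rough sketch that tallies states from `ss -tan` output; the function name `tcp_state_counts` is hypothetical, and it assumes the state is the first column and the first line is a header.

```shell
#!/bin/sh
# Sketch: count sockets per TCP state from `ss -tan` output on stdin.
# Skips the header line and assumes the state is in column 1.
tcp_state_counts() {
  awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}

# Usage:
#   ss -tan | tcp_state_counts
```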
# For common web servers:
sudo tail -100 /var/log/nginx/error.log
sudo journalctl -u apache2 --since "5 minutes ago" -f
# Check for out-of-memory kills:
sudo dmesg | grep -i "killed process"
# Check connection states:
ss -tnp | grep ESTAB | wc -l
Step 4: Validate ALB Timeout and Target Health Check Configuration
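Ensure the target's HTTP keep-alive timeout is longer than the ALB's idle timeout (default 60s), since a target that closes an idle connection first can trigger a 502. Also confirm the idle timeout exceeds your application's longest processing time (exceeding it yields a 504 rather than a 502), and that the health check path is truly representative.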
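One timeout relationship worth checking here is that the target's keep-alive timeout be longer than the ALB's idle timeout, because a target that closes an idle connection first is a commonly documented source of 502s. The sketch below compares the two values; `check_keepalive` is a hypothetical helper, both arguments are whole seconds, and the keep-alive value has to come from your own web server configuration.

```shell
#!/bin/sh
# Sketch: compare the ALB idle timeout with the target's keep-alive timeout.
# Arguments: <alb_idle_timeout_seconds> <target_keepalive_seconds>
# Returns non-zero when the target would close idle connections first.
check_keepalive() {
  ck_idle=$1
  ck_keepalive=$2
  if [ "$ck_keepalive" -gt "$ck_idle" ]; then
    echo "ok: target keep-alive ${ck_keepalive}s > ALB idle timeout ${ck_idle}s"
  else
    echo "warn: target keep-alive ${ck_keepalive}s <= ALB idle timeout ${ck_idle}s"
    return 1
  fi
}

# Usage (the second argument is a placeholder for your server's setting):
#   idle=$(aws elbv2 describe-load-balancer-attributes --load-balancer-arn <ALB_ARN> \
#     --query "Attributes[?Key=='idle_timeout.timeout_seconds'].Value" --output text)
#   check_keepalive "$idle" <TARGET_KEEPALIVE_SECONDS>
```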
# Describe target group attributes:
aws elbv2 describe-target-group-attributes --target-group-arn <TG_ARN>
# Key attributes:
deregistration_delay.timeout_seconds (default 300)
stickiness.enabled
# Describe health check settings:
aws elbv2 describe-target-groups --target-group-arns <TG_ARN> --query 'TargetGroups[0].HealthCheck'
Step 5: Test Direct Connection to the Target
Bypass the ALB and send a request directly to the target instance's IP and port to rule out network/security group issues.
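Intermittent failures are easy to miss with a single request, so it can help to repeat the direct request and tally the status codes. This is a minimal sketch under stated assumptions: `probe_target` and `tally_codes` are hypothetical helper names, the 31-second limit simply mirrors the `--max-time` value used in this step, and curl reports `000` when it gets no HTTP response at all.

```shell
#!/bin/sh
# Sketch: tally HTTP status codes, one per line on stdin, as "<code>=<count>".
tally_codes() {
  sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# Sketch: send <count> requests straight to the target and tally the results.
# Arguments: <url> <host-header> <count>
probe_target() {
  pt_url=$1
  pt_host=$2
  pt_count=$3
  pt_i=0
  while [ "$pt_i" -lt "$pt_count" ]; do
    curl -s -o /dev/null --max-time 31 -H "Host: $pt_host" -w '%{http_code}\n' "$pt_url"
    pt_i=$((pt_i + 1))
  done | tally_codes
}

# Usage:
#   probe_target "http://<TARGET_PRIVATE_IP>:<PORT>/your-app-path" your-app-domain.com 20
```

If the direct requests all succeed while the same path fails through the ALB, the problem is more likely in connection handling between the ALB and the target than in the request logic itself.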
# From a bastion host or within the VPC:
curl -v -H "Host: your-app-domain.com" http://<TARGET_PRIVATE_IP>:<PORT>/your-health-path
curl -v -H "Host: your-app-domain.com" --max-time 31 http://<TARGET_PRIVATE_IP>:<PORT>/your-app-path
Step 6: Inspect Security Group and NACL Rules
Confirm the target instance's security group allows traffic from the ALB's security group (or IP) on the application port, and that NACLs are not blocking ephemeral return ports.
# Check ALB security group ID:
aws elbv2 describe-load-balancers --load-balancer-arns <ALB_ARN> --query 'LoadBalancers[0].SecurityGroups'
# Verify target instance SG ingress:
aws ec2 describe-security-groups --group-ids <TARGET_SG> --query 'SecurityGroups[0].IpPermissions'
Step 7: Simulate Load and Profile Application
Use a load testing tool to simulate traffic and correlate 502s with application metrics (CPU, memory, thread pools, DB connections).
# Simple load test with vegeta:
echo "GET http://<ALB_DNS>/your-app-path" | vegeta attack -duration=30s -rate=10 | vegeta report
# Monitor target instance metrics during test:
aws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name CPUUtilization --dimensions Name=InstanceId,Value=<INSTANCE_ID> --start-time ... --end-time ... --period 60 --statistics Average Maximum
Architect's Pro Tip
"This often happens when the health check endpoint (e.g., /health) is a simple, cached, or static response that succeeds, but the main application threads are deadlocked, the database connection pool is exhausted, or the target closes keep-alive connections before the ALB's idle timeout (default 60s) expires. Always make your health check exercise the critical dependencies of your main app path."
Frequently Asked Questions
The health check is passing, so why is the ALB sending traffic to a 'bad' target?
Health checks are periodic, low-volume probes. A target can pass a health check and then immediately become unhealthy (e.g., due to a spike in traffic, memory leak, or backend dependency failure) before the next health check runs. The ALB only stops sending traffic after consecutive health check failures.
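As a rough illustration of that window, the time an already-failing target can keep receiving traffic is approximately the health check interval multiplied by the unhealthy threshold. The helper name `detection_window` is hypothetical, and the numbers in the usage comment are example values only; read your own settings from the target group.

```shell
#!/bin/sh
# Sketch: approximate seconds before a failing target is marked unhealthy.
# Arguments: <HealthCheckIntervalSeconds> <UnhealthyThresholdCount>
detection_window() {
  echo $(( $1 * $2 ))
}

# Usage (30 and 2 are example values, not read from your target group):
#   detection_window 30 2
# The real settings can be listed with:
#   aws elbv2 describe-target-groups --target-group-arns <TG_ARN> \
#     --query 'TargetGroups[0].[HealthCheckIntervalSeconds,UnhealthyThresholdCount]' --output text
```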
What's the difference between `HTTPCode_ELB_5XX_Count` and `HTTPCode_Target_5XX_Count` in CloudWatch?
`HTTPCode_ELB_5XX_Count` counts 5xx responses generated by the ALB itself, including 502s (for example, when the target closes the connection or returns a malformed response). `HTTPCode_Target_5XX_Count` counts 5xx responses (e.g., 500, 503) that the target application sends back to the ALB. In this scenario, look for spikes in `HTTPCode_ELB_5XX_Count`, or in `HTTPCode_ELB_502_Count` specifically.
Can Security Groups cause a 502 if health checks pass?
Yes. Health checks use the health check port/path. If your main application listens on a different port that the target's security group does not open to the ALB, or a network ACL blocks the *response traffic* on ephemeral ports (1024-65535), the main request can fail while the health check succeeds. Security groups are stateful, so return traffic is allowed automatically; only NACLs need explicit rules for it.