Troubleshooting HTTP 502 Errors from ALB Despite Target Health Checks Passing

Quick Fix Summary

TL;DR

Check ALB access logs for 502s and verify target response timeouts.

A 502 from the ALB means it received an invalid or incomplete response from a registered target, even though that target's health check endpoint may still return a simple success (e.g., 200 OK). The fault usually lies in how the application behaves on real request paths, which a lightweight health check does not exercise.

Diagnosis & Causes

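The target closes or resets the connection to the ALB before sending a complete response (a target that is merely slow to respond normally produces a 504, not a 502).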

Invalid HTTP response headers or chunked encoding from the target.

Target instance resource exhaustion (CPU, memory, sockets).

Recovery Steps

Step 1: Verify and Isolate with ALB Access Logs

Enable and examine ALB Access Logs to confirm 502 errors, identify the failing target, and see the upstream response time.

bash

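# Enable via Console or CLI (the bucket policy must allow Elastic Load Balancing to write to the bucket):
aws elbv2 modify-load-balancer-attributes --load-balancer-arn <ALB_ARN> --attributes Key=access_logs.s3.enabled,Value=true Key=access_logs.s3.bucket,Value=your-log-bucket
# Download the gzip-compressed log files, then search them (example pattern for 502s):
aws s3 cp s3://your-log-bucket/AWSLogs/.../elasticloadbalancing/.../ ./alb-logs/ --recursive
find ./alb-logs -name '*.gz' -exec zcat {} + | grep " 502 " | head -20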
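
Once the log files are decompressed, a per-target tally makes it easier to see whether a single instance is responsible. The following is a minimal sketch, assuming the standard space-delimited ALB access log layout in which field 5 is `target:port` and field 9 is `elb_status_code`. The function name `alb_502_by_target` is illustrative and not part of any AWS tooling.

```bash
# Count ALB 502 responses per target from decompressed access log lines on stdin.
# Assumed field positions (standard ALB access log format):
#   $5 = target:port, $9 = elb_status_code
alb_502_by_target() {
  awk '$9 == 502 { count[$5]++ }
       END { for (t in count) print count[t], t }' | sort -rn
}
```

Assuming the log files have been downloaded to the current directory, it could be used as `zcat *.gz | alb_502_by_target`. The output lists the targets with the most 502s first.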

Step 2: Analyze Target Response Metrics

Check CloudWatch metrics for the target group to identify latency spikes, request counts, and HTTP 5xx errors originating from the targets.

bash

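# Key metrics to graph in CloudWatch:
#   TargetResponseTime (sustained values near the ALB idle timeout, 60s by default, point to slow targets)
#   HTTPCode_Target_5XX_Count
#   RequestCountPerTarget
# Use AWS CLI to get metric data. Dimension values are ARN suffixes
# (app/<name>/<id> and targetgroup/<name>/<id>), not full ARNs.
# For percentiles, replace --statistics with --extended-statistics p99.
aws cloudwatch get-metric-statistics --namespace AWS/ApplicationELB --metric-name TargetResponseTime --dimensions Name=LoadBalancer,Value=<ALB_ARN_SUFFIX> Name=TargetGroup,Value=<TG_ARN_SUFFIX> --start-time ... --end-time ... --period 300 --statistics Average Maximum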
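
The `--start-time` and `--end-time` arguments take ISO 8601 timestamps. As a convenience, a small helper can generate a window ending now. This is a sketch only: `metric_window` is a hypothetical name, and it assumes a `date` implementation that accepts `-d @EPOCH` (GNU coreutils and BusyBox do; macOS/BSD `date` does not).

```bash
# Print "START END" as ISO 8601 UTC timestamps covering the last N minutes (default 60).
# Assumes `date -d @EPOCH` is supported (GNU or BusyBox date).
metric_window() {
  local now start
  now="$(date -u +%s)"
  start="$(( now - ${1:-60} * 60 ))"
  printf '%s %s\n' \
    "$(date -u -d "@${start}" +%Y-%m-%dT%H:%M:%SZ)" \
    "$(date -u -d "@${now}" +%Y-%m-%dT%H:%M:%SZ)"
}
```

For example, `set -- $(metric_window 60)` places the two timestamps in `$1` and `$2`, which can then be passed as `--start-time "$1" --end-time "$2"`.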

Step 3: Check Application Logs on the Failing Target

SSH into the instance identified in Step 1 and examine application logs for errors, timeouts, or crashes during the request.

bash

# For common web servers:
sudo tail -100 /var/log/nginx/error.log
sudo journalctl -u apache2 --since "5 minutes ago" -f
# Check for out-of-memory kills:
sudo dmesg | grep -i "killed process"
# Check connection states:
ss -tnp | grep ESTAB | wc -l
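
A single count of established connections can hide sockets stuck in other states. The sketch below tallies every TCP state from `ss -tan` output; `tcp_state_counts` is an illustrative name. It assumes the first line of input is the `ss` header and that the state is the first column.

```bash
# Count TCP sockets per state from `ss -tan` output on stdin.
# The first line is assumed to be the column header and is skipped.
tcp_state_counts() {
  awk 'NR > 1 { n[$1]++ }
       END { for (s in n) print s, n[s] }' | sort
}
```

Run it as `ss -tan | tcp_state_counts`. A large and growing CLOSE-WAIT count generally means the application is not closing sockets that the peer has already closed, which is one way socket exhaustion shows up.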

Step 4: Validate ALB Timeout and Target Health Check Configuration

Ensure the ALB's idle timeout is longer than your application's longest processing time, and that the health check path is truly representative.

bash

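# Check the ALB idle timeout (a load balancer attribute, default 60 seconds):
aws elbv2 describe-load-balancer-attributes --load-balancer-arn <ALB_ARN> --query "Attributes[?Key=='idle_timeout.timeout_seconds'].Value" --output text
# Describe target group attributes:
aws elbv2 describe-target-group-attributes --target-group-arn <TG_ARN>
# Key attributes:
#   deregistration_delay.timeout_seconds (default 300)
#   stickiness.enabled
# Describe health check settings (these are top-level fields of the target group):
aws elbv2 describe-target-groups --target-group-arns <TG_ARN> --query 'TargetGroups[0].{Protocol:HealthCheckProtocol,Port:HealthCheckPort,Path:HealthCheckPath,Interval:HealthCheckIntervalSeconds,Timeout:HealthCheckTimeoutSeconds,Healthy:HealthyThresholdCount,Unhealthy:UnhealthyThresholdCount,Matcher:Matcher.HttpCode}'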

Step 5: Test Direct Connection to the Target

Bypass the ALB and send a request directly to the target instance's IP and port to rule out network/security group issues.

bash

# From a bastion host or within the VPC:
curl -v -H "Host: your-app-domain.com" http://<TARGET_PRIVATE_IP>:<PORT>/your-health-path
curl -v -H "Host: your-app-domain.com" --max-time 31 http://<TARGET_PRIVATE_IP>:<PORT>/your-app-path
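
Intermittent failures are easy to miss with a single request. The sketch below repeats a direct request and tallies the status codes. `probe_target` is a hypothetical helper, the default Host header mirrors the placeholder used above, and the 65-second `--max-time` is an assumption chosen to sit just above the default 60-second ALB idle timeout.

```bash
# Send COUNT requests straight to a target and tally the HTTP status codes.
# Usage: probe_target URL [COUNT] [HOST_HEADER]
probe_target() {
  local url="$1" count="${2:-20}" host="${3:-your-app-domain.com}" i=0
  while [ "$i" -lt "$count" ]; do
    # curl reports %{http_code} as 000 when no valid HTTP response arrives
    # (connection reset, refused, or timed out).
    curl -s -o /dev/null --max-time 65 -H "Host: ${host}" -w '%{http_code}\n' "$url"
    i=$((i + 1))
  done | sort | uniq -c
}
```

For example, `probe_target "http://<TARGET_PRIVATE_IP>:<PORT>/your-app-path" 50` prints one line per distinct status code with its count. Any `000` entries suggest the target itself is dropping or resetting connections, independent of the ALB.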

Step 6: Inspect Security Group and NACL Rules

Confirm the target instance's security group allows traffic from the ALB's security group (or IP) on the application port, and that NACLs are not blocking ephemeral return ports.

bash

# Check ALB security group ID:
aws elbv2 describe-load-balancers --load-balancer-arns <ALB_ARN> --query 'LoadBalancers[0].SecurityGroups'
# Verify target instance SG ingress:
aws ec2 describe-security-groups --group-ids <TARGET_SG> --query 'SecurityGroups[0].IpPermissions'

Step 7: Simulate Load and Profile Application

Use a load testing tool to simulate traffic and correlate 502s with application metrics (CPU, memory, thread pools, DB connections).

bash

# Simple load test with vegeta:
echo "GET http://<ALB_DNS>/your-app-path" | vegeta attack -duration=30s -rate=10 | vegeta report
# Monitor target instance metrics during test:
aws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name CPUUtilization --dimensions Name=InstanceId,Value=<INSTANCE_ID> --start-time ... --end-time ... --period 60 --statistics Average Maximum

Architect's Pro Tip

"This often happens when the health check endpoint (e.g., /health) is a simple, cached, or static response that succeeds, but the main application threads are deadlocked, the database connection pool is exhausted, or the target closes keep-alive connections sooner than the ALB's idle timeout (default 60s) allows for. Always make your health check exercise the critical dependencies of your main app path."

Frequently Asked Questions

The health check is passing, so why is the ALB sending traffic to a 'bad' target?

Health checks are periodic, low-volume probes. A target can pass a health check and then immediately become unhealthy (e.g., due to a spike in traffic, memory leak, or backend dependency failure) before the next health check runs. The ALB only stops sending traffic after consecutive health check failures.
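
As a rough worked example of that window: the delay before a failing target is taken out of rotation is approximately the health check interval multiplied by the unhealthy threshold. The helper name below is illustrative, and the calculation ignores the per-check timeout, so treat the result as an estimate.

```bash
# Approximate delay (seconds) before a failing target is marked unhealthy.
# Args: HealthCheckIntervalSeconds UnhealthyThresholdCount
unhealthy_detection_window() {
  echo "$(( $1 * $2 ))"
}

unhealthy_detection_window 30 2   # prints 60
```

With a 30-second interval and a threshold of 2, a target can keep receiving live traffic for about a minute after it starts failing. Shortening the interval narrows this window at the cost of more probe traffic.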

What's the difference between `HTTPCode_ELB_5XX_Count` and `HTTPCode_Target_5XX_Count` in CloudWatch?

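`HTTPCode_ELB_5XX_Count` counts 5xx responses generated by the ALB itself, including 502s (for example, when the target resets the connection or returns a malformed response) and 504s (when the ALB times out waiting for the target). `HTTPCode_Target_5XX_Count` counts 5xx responses (e.g., 500, 503) that the target application sends back to the ALB. In this scenario, look for spikes in `HTTPCode_ELB_5XX_Count`, or in the more specific `HTTPCode_ELB_502_Count`.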
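
The same distinction can be read from the access logs. In the sketch below, a 5xx entry whose `target_status_code` (field 10) is also 5xx is attributed to the target, and anything else (typically `-`, meaning the target sent no response) is attributed to the ALB. `classify_5xx` is a hypothetical name, and the field positions assume the standard ALB access log format.

```bash
# Classify 5xx entries in decompressed ALB access log lines (stdin) by origin.
# Assumed fields: $9 = elb_status_code, $10 = target_status_code ("-" if none).
classify_5xx() {
  awk '$9 ~ /^5/ {
         origin = ($10 ~ /^5/) ? "target" : "elb"
         n[origin " " $9]++
       }
       END { for (k in n) print k, n[k] }' | sort
}
```

Each output line has the form `origin status count`, for example `elb 502 14`. A report dominated by `elb 502` points at connection-level problems between the ALB and the target rather than at application error responses.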

Can Security Groups cause a 502 if health checks pass?

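Rarely, and usually not as a 502. Health checks use the health check port and path. If your main application listens on a different port that the target's security group does not open to the ALB, real requests fail while health checks pass. Security groups are stateful, so *response traffic* for an allowed connection is always permitted; stateless network ACLs are what can block the ephemeral port range (1024-65535). Blocked traffic typically surfaces as a connection timeout (504) rather than a 502, so rule it out, but do not expect it to be the primary cause.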
