CRITICAL

Root Cause Analysis: Why Alibaba Cloud SLB 502 Bad Gateway Happens (Architecture Deep Dive)

Quick Fix Summary

TL;DR

Check backend server health, verify SLB listener configuration, and ensure proper session persistence settings.

A 502 Bad Gateway from Alibaba Cloud SLB indicates the load balancer received an invalid response from a backend server. This is a proxy-level error where SLB cannot fulfill the client request due to backend failure.

Diagnosis & Causes

  • Backend server process crash or timeout.
  • Health check configuration mismatch with application.
  • Backend server overload or resource exhaustion.
  • Network ACL or security group blocking traffic.
  • Race condition during backend server scaling.

Recovery Steps

    Step 1: Diagnose Backend Server Health

    Verify the backend ECS instances or containers are running and responding correctly on the configured port. Use direct connection tests.

    bash
    # Test direct connectivity to backend server (replace IP and Port)
    nc -zv <backend_server_ip> <application_port>
    curl -I http://<backend_server_ip>:<application_port>/health
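
When several backends sit behind the same listener, the direct tests above can be wrapped in a small helper. The following is a minimal sketch, not an official tool: `classify_probe` and `probe_backend` are hypothetical names, the `/health` path is carried over from the example above, and the 5-second timeout is an assumed value.

```shell
#!/usr/bin/env bash
# Classify the result of one direct probe against a backend.
# $1 = curl exit status, $2 = HTTP status code reported by curl ("000" if none).
classify_probe() {
  local curl_exit="$1" http_code="$2"
  if [ "$curl_exit" -ne 0 ] || [ "$http_code" = "000" ]; then
    echo "unreachable"      # refused, reset, or timed out: a likely 502 source
  elif [ "$http_code" -ge 200 ] && [ "$http_code" -lt 400 ]; then
    echo "ok"
  else
    echo "http_error"       # backend answered, but with a 4xx/5xx status
  fi
}

# Probe one backend with a HEAD request, as `curl -I` does above.
# $1 = backend address as ip:port, $2 = path (assumed default: /health).
probe_backend() {
  local backend="$1" path="${2:-/health}" code rc
  code=$(curl -s -o /dev/null -I --max-time 5 -w '%{http_code}' "http://${backend}${path}")
  rc=$?
  echo "${backend} $(classify_probe "$rc" "$code")"
}

# Example (hypothetical addresses):
#   for b in 10.0.0.11:8080 10.0.0.12:8080; do probe_backend "$b"; done
```

A backend reported as `unreachable` while the others report `ok` is the first place to look, since SLB has nothing valid to relay from it.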

    Step 2: Audit SLB Health Check Configuration

    A mismatch between the SLB health check settings and your application's actual health endpoint/behavior is a primary cause. The health check must succeed for SLB to route traffic.

    bash
    # Use Alibaba Cloud CLI to check health check config for your SLB listener
    aliyun slb DescribeHealthStatus --LoadBalancerId <lb-id> --ListenerPort <port>
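    # Check the health check configuration, which is returned with the listener attributes
    # (shown for an HTTP listener; use the matching call for TCP or HTTPS listeners)
    aliyun slb DescribeLoadBalancerHTTPListenerAttribute --LoadBalancerId <lb-id> --ListenerPort <port>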
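
To compare what the application actually returns with what the listener expects, a helper like the one below may be useful. It is a sketch under the assumption that the expected health check codes are configured as a comma-separated list of status classes such as `http_2xx,http_3xx`; `code_is_healthy` is a hypothetical name, and the values in the usage note are examples only.

```shell
#!/usr/bin/env bash
# Return 0 if an HTTP status code falls in one of the expected classes.
# $1 = status code (e.g. 204)
# $2 = comma-separated classes (assumed format, e.g. "http_2xx,http_3xx")
code_is_healthy() {
  local code="$1" expected="$2" class
  local IFS=','
  for class in $expected; do
    # "http_2xx" -> "2xx" -> "2"; compare with the first digit of the code.
    class="${class#http_}"
    [ "${class%xx}" = "${code:0:1}" ] && return 0
  done
  return 1
}
```

For example, `code_is_healthy 401 "http_2xx,http_3xx"` fails, which is the situation where a health endpoint behind authentication is marked unhealthy even though the application itself is running.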

    Step 3: Analyze Backend Server Logs for Timeouts/Crashes

    Inspect application and system logs on the backend servers around the time of the 502 errors. Look for exceptions, restarts, or slow requests.

    bash
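    # Check for upstream errors on the backend (example for an Nginx web server).
    # 502 responses show up in the access log; their causes show up in the error log.
    grep ' 502 ' /var/log/nginx/access.log | tail -n 20
    tail -n 100 /var/log/nginx/error.log | grep -iE "upstream|connect\(\) failed|timed out"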
    # Check for system resource issues (OOM, CPU)
    dmesg -T | tail -50
    journalctl --since "5 minutes ago" -u <your-service>
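
To line up error bursts with restarts or resource spikes, it can help to bucket error responses by minute. The sketch below assumes the backend writes an Nginx access log in the default "combined" format, where the fourth whitespace-separated field holds the timestamp and the ninth holds the status code; `count_status_per_minute` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Count responses with a given status per minute from access-log lines on stdin.
# Assumes Nginx "combined" format: field 4 is "[dd/Mon/yyyy:HH:MM:SS",
# field 9 is the status code.
count_status_per_minute() {
  local status="${1:-502}"
  awk -v status="$status" '
    $9 == status { hits[substr($4, 2, 17)]++ }   # key: dd/Mon/yyyy:HH:MM
    END { for (minute in hits) print minute, hits[minute] }
  ' | sort   # lexical sort; chronological within a single day
}

# Example:
#   count_status_per_minute 502 < /var/log/nginx/access.log
```

If the backend's own access log shows no 502 entries for the affected minutes, the error was most likely generated at the SLB layer because the backend gave no usable response at all.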

    Step 4: Verify Network Security Rules (ACL & Security Groups)

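    Ensure that the security group and any host firewall on the backend servers allow traffic from the internal address range SLB uses for health checks and request forwarding, 100.64.0.0/10 (which already contains 100.96.0.0/11), on both the health check port and the application port.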

    bash
    # Example: Check iptables or firewall-cmd rules on backend server
    iptables -L -n -v | grep -E ":(80|443|8080)"
    # For Alibaba Cloud Security Groups, verify via Console or CLI:
    aliyun ecs DescribeSecurityGroupAttribute --SecurityGroupId <your-sg-id>
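
When a source address shows up in firewall counters or backend logs, it is useful to confirm whether it belongs to the SLB range. The following sketch does the CIDR arithmetic in plain bash; `ip_to_int` and `ip_in_cidr` are hypothetical helper names, and only IPv4 dotted-quad input without leading zeros is handled.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Return 0 if address $1 lies inside CIDR block $2 (e.g. 100.64.0.0/10).
ip_in_cidr() {
  local ip_int net_int prefix mask
  ip_int=$(ip_to_int "$1")
  net_int=$(ip_to_int "${2%/*}")
  prefix="${2#*/}"
  # Build the netmask: the top $prefix bits set, the rest cleared.
  mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  [ $(( ip_int & mask )) -eq $(( net_int & mask )) ]
}
```

For instance, `ip_in_cidr 100.96.0.1 100.64.0.0/10` succeeds, so a single allow rule for 100.64.0.0/10 also covers addresses in 100.96.0.0/11.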

    Step 5: Implement Graceful Shutdown & Connection Draining

    Prevent 502s during deployments or scaling by ensuring that your application handles SIGTERM and finishes in-flight requests, and that SLB stops sending new traffic before the process exits.

    yaml
    # Example: Kubernetes lifecycle hook for graceful termination
    lifecycle:
      preStop:
        exec:
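          command: ["sh", "-c", "sleep 30"] # Give SLB time to stop routing to this pod
    # Set the pod's terminationGracePeriodSeconds above the sleep (the default is 30s);
    # otherwise the container is killed before the application can shut down cleanly.
    # The application should handle SIGTERM by finishing in-flight requests before exiting.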
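
A fixed sleep is a blunt instrument. On plain ECS hosts, or inside a container entrypoint, the same idea can be expressed as "wait until in-flight connections reach zero, up to a deadline". The sketch below illustrates that approach; `wait_for_drain` and `count_established` are hypothetical names, and port 8080 and the 25-second budget in the usage comment are assumed values.

```shell
#!/usr/bin/env bash
# Poll a command that prints the number of in-flight connections until it
# reports 0 or the deadline passes. Returns 0 if drained, 1 on timeout.
# $1 = timeout in seconds, remaining args = command that prints the count.
wait_for_drain() {
  local timeout="$1"; shift
  local deadline=$(( SECONDS + timeout )) active
  while :; do
    active=$("$@")
    [ "${active:-0}" -eq 0 ] && return 0
    [ "$SECONDS" -ge "$deadline" ] && return 1
    sleep 1
  done
}

# Hypothetical counter: established TCP connections on local port 8080.
count_established() {
  ss -Htn state established "( sport = :8080 )" | wc -l
}

# Example use in a wrapper script's SIGTERM handler:
#   trap 'wait_for_drain 25 count_established; exit 0' TERM
```

Passing the counter as a command keeps the polling loop independent of how connections are counted, so it can be swapped for an application-level metric where one exists.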

    Step 6: Monitor SLB Metrics & Set Alerts

    Proactively identify issues by monitoring key SLB metrics such as UnhealthyServerCount, ActiveConnection, and DropConnection.

    bash
    # List the metrics available in the SLB namespace to confirm the exact names
    aliyun cms DescribeMetricMetaList --Namespace "acs_slb_dashboard"
    # Use CloudMonitor to get the unhealthy backend count
    aliyun cms DescribeMetricLast --Namespace "acs_slb_dashboard" --MetricName "UnhealthyServerCount" --Dimensions "{\"instanceId\":\"<lb-id>\",\"port\":\"<listener-port>\"}"
    # Set an alarm for when healthy hosts drop below a threshold.
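
The threshold logic behind such an alarm can be prototyped locally before it is configured in CloudMonitor. The sketch below is illustrative only: `healthy_share_too_low` is a hypothetical helper, the healthy and unhealthy counts are assumed to have been extracted from the monitoring output already, and treating "no backends at all" as an alert is a design assumption.

```shell
#!/usr/bin/env bash
# Return 0 (alert) when the healthy share of backends is below min_percent.
# $1 = healthy count, $2 = unhealthy count, $3 = minimum healthy percentage.
healthy_share_too_low() {
  local healthy="$1" unhealthy="$2" min_percent="$3"
  local total=$(( healthy + unhealthy ))
  # No registered backends at all is treated as an alert condition.
  [ "$total" -eq 0 ] && return 0
  # Integer comparison of healthy/total against min_percent/100.
  [ $(( healthy * 100 )) -lt $(( min_percent * total )) ]
}

# Example: alert if fewer than half of the backends are healthy.
#   healthy_share_too_low 1 3 50 && echo "ALERT: healthy share below 50%"
```

Alerting on a percentage rather than an absolute count keeps the rule meaningful when auto-scaling changes the size of the backend pool.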

    Architect's Pro Tip

    "The most insidious 502 cause is a race condition during auto-scaling: a new instance passes health check before the app is fully ready. Use a readiness probe that checks application logic, not just TCP."

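One way to act on this tip is to expose readiness through a check that passes only when every dependency of the application responds, not merely when the port is open. The sketch below shows the idea; `is_ready` is a hypothetical helper, and the port and `/health/ready` endpoint in the usage comment are assumptions, not part of any Alibaba Cloud API.

```shell
#!/usr/bin/env bash
# Run every check command passed as an argument; ready only if all succeed.
# Each argument is a shell command string.
is_ready() {
  local check
  for check in "$@"; do
    if ! bash -c "$check" >/dev/null 2>&1; then
      echo "not ready: ${check}" >&2
      return 1
    fi
  done
  return 0
}

# Example (hypothetical port and endpoint): a TCP check alone would pass
# as soon as the socket is bound; the HTTP check exercises application logic.
#   is_ready "nc -z 127.0.0.1 8080" \
#            "curl -fsS --max-time 2 http://127.0.0.1:8080/health/ready"
```

Pointing the SLB health check at an endpoint backed by this kind of check means a freshly scaled instance stays out of rotation until it can actually serve requests.
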
    Frequently Asked Questions

    Can Alibaba Cloud SLB itself cause a 502 error?

    Rarely. The 502 is generated by SLB, but the root cause is almost always the backend server returning an invalid, malformed, or no response (e.g., connection reset, timeout, empty reply). SLB acts as the proxy reporting the backend failure.

    Why do I see intermittent 502 errors during peak traffic?

    This typically points to backend resource exhaustion (CPU, memory, connection limits). Under load the backend stops responding in time, and SLB keeps routing requests to it until enough consecutive health checks have failed to mark it unhealthy. Requests sent during that window return 502.
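
How long a failing backend keeps receiving traffic depends on the health check timing settings. The estimate below uses the failure-window formula commonly given for SLB health checks (response timeout × unhealthy threshold + check interval × (unhealthy threshold − 1)); treat the formula as an assumption and confirm it against the current Alibaba Cloud documentation. `failure_window_seconds` is a hypothetical helper name, and the values in the example are illustrative.

```shell
#!/usr/bin/env bash
# Estimate the seconds needed to mark a non-responding backend unhealthy.
# $1 = response timeout (s), $2 = check interval (s), $3 = unhealthy threshold.
failure_window_seconds() {
  local timeout="$1" interval="$2" unhealthy_threshold="$3"
  echo $(( timeout * unhealthy_threshold + interval * (unhealthy_threshold - 1) ))
}

# Example: 5 s timeout, 2 s interval, 3 consecutive failures
#   failure_window_seconds 5 2 3   # prints 19
```

With those example values a hung backend can keep receiving requests for roughly 19 seconds, which is consistent with short bursts of 502s at peak load; lowering the timeout or threshold shortens the window at the cost of more false positives.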

    How does SLB health check work and how can it be misconfigured?

    SLB sends periodic TCP/HTTP/HTTPS probes. Misconfiguration occurs when the check path (e.g., '/') differs from the app's actual health endpoint, the response timeout is too short for a slow-starting app, or the success codes (e.g., 200) don't match what the app returns.
