Resolving 502 Bad Gateway Errors Blocking Your CI/CD Pipeline Deployments
Quick Fix Summary
TL;DR: Restart the upstream service or proxy (e.g., nginx, HAProxy) and verify backend health.
Diagnosis & Causes
A 502 Bad Gateway error indicates that a reverse proxy or load balancer (the gateway) received an invalid response from an upstream server (your application). During a deployment, this typically surfaces as failed health checks that block the rollout.
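A quick first check is to reproduce the request your pipeline's health check makes, but through the gateway, so you can confirm the 502 is coming from the proxy layer rather than the CI tooling. The host and path below are placeholders for your own endpoint.
# Reproduce the failing health check through the proxy/load balancer
# (-i prints the status line and headers; a 502 here confirms the gateway-side symptom)
curl -i --connect-timeout 5 https://<gateway-or-lb-host>/health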
Recovery Steps
Step 1: Verify Upstream Service Health
Check if your application pods/containers are running and ready. A failed liveness/readiness probe is a common culprit.
# For Kubernetes:
kubectl get pods -n <namespace> --selector=app=<your-app>
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Events"
kubectl logs <pod-name> -n <namespace> --tail=50
# For Docker/Systemd:
docker ps | grep <your-app>
docker logs <container-id> --tail=50
sudo systemctl status <your-app-service>
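On Kubernetes, also confirm the Service actually has ready endpoints: if pods are Running but not Ready, the proxy has nothing to route to and returns 502. A minimal check, assuming a standard Service sits in front of the app:
# An empty ENDPOINTS column means no pod currently passes its readiness probe
kubectl get endpoints <your-app-service> -n <namespace>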
Step 2: Test Direct Connectivity to Upstream
Bypass the proxy/gateway to confirm the application itself is reachable and responding correctly on its internal port.
# Get the application's ClusterIP and Port (Kubernetes)
kubectl get svc -n <namespace> <your-app-service>
# Curl the service from within the cluster or its node
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- curl -v http://<service-cluster-ip>:<port>/health
# For a known host:port (e.g., on a VM)
curl -v --connect-timeout 5 http://<backend-host>:<app-port>/health
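If the Service responds but the gateway still returns 502, it can also help to test a single pod directly, bypassing both the proxy and the Service. This is a sketch; the local port, container port, and health path are placeholders.
# Forward a local port straight to one pod, then curl it from another shell
kubectl port-forward pod/<pod-name> -n <namespace> 8080:<container-port>
curl -v http://127.0.0.1:8080/health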
Step 3: Inspect and Restart the Gateway/Proxy
Examine proxy logs for connection errors (refused, timeout, reset). A restart can clear transient issues.
# For nginx (common ingress):
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=100 | grep -i "502\|upstream"
# For HAProxy or other proxies on host:
sudo journalctl -u haproxy --since "5 minutes ago" -f
sudo tail -f /var/log/nginx/error.log
# Restart proxy (example for nginx on host):
sudo systemctl restart nginx
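Before restarting, it is worth validating the proxy configuration so a bad config file does not take the gateway down entirely; where it is sufficient, a reload is also less disruptive than a full restart. A sketch for host-level nginx:
# Validate the configuration first, then reload without dropping existing connections
sudo nginx -t && sudo systemctl reload nginx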
Step 4: Check Resource Constraints and Timeouts
Insufficient CPU/memory or proxy timeouts set too low can cause 502s during deployment spikes.
# Check for OOMKilled pods or high resource usage
kubectl describe pod <pod-name> -n <namespace> | grep -i "state\|oom\|limit"
kubectl top pods -n <namespace>
# Review proxy timeout configuration (example nginx ingress annotation):
kubectl get ingress <ingress-name> -n <namespace> -o yaml | grep -A 2 -B 2 "timeout"
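If short timeouts turn out to be the cause and you are running the community ingress-nginx controller, one option is to raise the proxy timeouts via annotations. The 120-second value below is illustrative, not a recommendation.
# Raise proxy read/send timeouts on the Ingress (values are seconds, passed as strings)
kubectl annotate ingress <ingress-name> -n <namespace> \
  nginx.ingress.kubernetes.io/proxy-read-timeout="120" \
  nginx.ingress.kubernetes.io/proxy-send-timeout="120" \
  --overwrite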
Step 5: Validate Network Policies and Security Groups
A new deployment might be blocked by network policies (K8s) or cloud security groups denying traffic from the proxy.
# List NetworkPolicies affecting your app namespace
kubectl get networkpolicy -n <namespace>
# Describe a specific policy
kubectl describe networkpolicy <policy-name> -n <namespace>
# For AWS, check Security Group of backend instances/ENIs:
aws ec2 describe-security-groups --group-ids <sg-id> --query 'SecurityGroups[0].IpPermissions'
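To confirm whether a policy is actually dropping traffic, you can run a throwaway curl pod in the proxy's namespace and hit the app's cluster DNS name. The ingress-nginx namespace and the service name below are assumptions; adjust them to your environment.
# Test connectivity from the proxy's namespace to the backend Service
kubectl run -it --rm netpol-test -n ingress-nginx --image=curlimages/curl --restart=Never -- \
  curl -v --connect-timeout 5 http://<your-app-service>.<namespace>.svc.cluster.local:<port>/health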
Step 6: Rollback to Last Known Good Deployment
If the 502 started with the latest deployment, perform an immediate rollback to restore service while you debug.
# Kubernetes rollout undo (Deployment)
kubectl rollout undo deployment/<deployment-name> -n <namespace>
# For Helm releases:
helm rollback <release-name> <previous-revision-number>
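If you are unsure which revision to roll back to, list the previous revisions first:
# Show prior revisions before rolling back
kubectl rollout history deployment/<deployment-name> -n <namespace>
helm history <release-name>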
"This often happens when a new deployment passes liveness probes but fails readiness probes due to slow startup (e.g., waiting for a database). The proxy routes traffic before the app is truly ready. Increase initialDelaySeconds for readiness probes."
Frequently Asked Questions
My app is healthy when I curl it directly, but the proxy returns 502. What now?
This points to a proxy configuration issue. Check: 1) The proxy's upstream definition points to the correct service/port. 2) The proxy's resolver can resolve the upstream service name (in K8s, use the internal DNS name). 3) The proxy's upstream timeouts (e.g., connect/read timeouts) are not shorter than your app's response time.
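Two quick checks along those lines, assuming an nginx-based proxy; the service name below is a placeholder:
# Dump the full rendered nginx config and inspect upstream/proxy_pass targets
sudo nginx -T | grep -B 2 -A 5 "proxy_pass\|upstream"
# Confirm the upstream service name resolves from inside the cluster
kubectl run -it --rm dns-test --image=busybox:1.36 --restart=Never -- \
  nslookup <your-app-service>.<namespace>.svc.cluster.local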
The 502 error is intermittent. How do I troubleshoot?
Intermittent 502s suggest resource exhaustion or network flakiness. Monitor: 1) Backend application connection pool saturation. 2) Node/VM network bandwidth and error counters. 3) DNS resolution failures. Increase logging verbosity on the proxy and tail logs during the next occurrence.
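A lightweight way to catch the next occurrence is to poll the endpoint through the gateway and log status codes with timestamps while you watch the proxy logs; the host and path are placeholders.
# Log one timestamped line per request: HTTP status and total response time
while true; do
  printf '%s %s\n' "$(date -u +%H:%M:%S)" \
    "$(curl -s -o /dev/null -w '%{http_code} %{time_total}s' http://<gateway-host>/health)"
  sleep 2
done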