Debugging Intermittent 504s: RBAC RoleBinding Mismatch Causing ServiceAccount AuthZ Timeouts
Quick Fix Summary
TL;DR: Check and fix RoleBinding namespace references so they match the ServiceAccount's namespace.
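Intermittent 504s occur when a ServiceAccount's token is used for API requests but the RoleBinding meant to authorize it does not line up with the ServiceAccount's namespace: either the binding's `subjects` entry names the wrong namespace, or the binding lives in a different namespace than the resources being requested. The result is that the Kubernetes API server times out during authorization checks.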
Diagnosis & Causes
Recovery Steps
Step 1: Verify the Mismatch
Identify the problematic ServiceAccount and its associated RoleBindings. Look for bindings that reference ClusterRoles without specifying the correct namespace for the subject.
# Get all RoleBindings and examine their subjects and roleRef
kubectl get rolebindings,clusterrolebindings -A -o yaml | grep -A 5 -B 5 "<your-serviceaccount-name>"
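The `grep` above matches the ServiceAccount name anywhere in the YAML, so it can surface unrelated hits. As a rough sketch of a more targeted check, the helpers below print one line per RoleBinding and flag bindings that list the ServiceAccount under a different namespace than expected. The function names `list_binding_subjects` and `find_mismatched_subjects` are hypothetical, and the nested `range` in the JSONPath template is assumed to work with your kubectl version.

```shell
# Hypothetical helper: print one tab-separated line per RoleBinding:
#   <binding-ns> <binding-name> <roleRef kind>/<roleRef name> <kind>:<ns>:<name> ...
# Assumption: your kubectl supports nested {range} in JSONPath templates.
list_binding_subjects() {
  kubectl get rolebindings -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.roleRef.kind}/{.roleRef.name}{"\t"}{range .subjects[*]}{.kind}:{.namespace}:{.name}{" "}{end}{"\n"}{end}'
}

# Hypothetical helper: read those lines on stdin and report every binding
# that names the ServiceAccount ($1) under a namespace other than $2.
find_mismatched_subjects() {
  sa_name="$1"
  sa_ns="$2"
  awk -F '\t' -v name="$sa_name" -v ns="$sa_ns" '
    {
      # Field 4 holds the space-separated subjects of this binding.
      n = split($4, subjects, " ")
      for (i = 1; i <= n; i++) {
        split(subjects[i], parts, ":")
        if (parts[1] == "ServiceAccount" && parts[3] == name && parts[2] != ns)
          printf "%s/%s binds %s but lists %s in namespace \"%s\"\n", $1, $2, $3, name, parts[2]
      }
    }'
}
```

A possible invocation is `list_binding_subjects | find_mismatched_subjects <sa-name> <namespace>`. No output means no RoleBinding lists that ServiceAccount under a different namespace.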
# Check a specific ServiceAccount's details (tokens and secrets)
kubectl describe serviceaccount <sa-name> -n <namespace>
Step 2: Inspect Specific RoleBinding Configuration
Examine the YAML of RoleBindings in the application's namespace. The critical issue is a RoleBinding that binds a ClusterRole to a namespaced ServiceAccount but has an incorrect or missing `namespace` field in its `subjects` entry. Note that `roleRef` has no `namespace` field; it carries only `apiGroup`, `kind`, and `name`.
kubectl get rolebinding <binding-name> -n <app-namespace> -o yaml
Step 3: Correct the RoleBinding
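Update the RoleBinding so that it properly references the ClusterRole. In a namespaced RoleBinding, `roleRef` points to the ClusterRole by name, the binding's own namespace determines where the permissions apply, and each ServiceAccount subject carries its own `namespace`. Note that `roleRef` is immutable: to change it, delete and recreate the binding. The `subjects` list can be edited in place.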
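Because `roleRef` cannot be changed on an existing binding, a wrong `roleRef` means recreating the binding rather than editing it. The sketch below renders a replacement manifest from a single namespace argument, so the binding's namespace and the subject's namespace cannot drift apart. It assumes the binding and the ServiceAccount live in the same namespace, as in the scenario described here; `render_rolebinding` is a hypothetical helper name.

```shell
# Hypothetical helper: render a RoleBinding manifest in which ONE namespace
# argument feeds both metadata.namespace and subjects[].namespace.
# Arguments: <binding-name> <namespace> <cluster-role> <serviceaccount-name>
render_rolebinding() {
  binding_name="$1"
  namespace="$2"
  cluster_role="$3"
  sa_name="$4"
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ${binding_name}
  namespace: ${namespace}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ${cluster_role}
subjects:
- kind: ServiceAccount
  name: ${sa_name}
  namespace: ${namespace}
EOF
}
```

One way to use it is to review the output first with `render_rolebinding <binding-name> <app-namespace> my-cluster-role my-service-account | kubectl apply --dry-run=server -f -`, then delete the old binding and apply the rendered manifest for real.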
# Correct the RoleBinding. Ensure `roleRef` is a ClusterRole and `subjects` include the namespace.
kubectl edit rolebinding <binding-name> -n <app-namespace>
# Example correct snippet within the YAML:
# roleRef:
#   apiGroup: rbac.authorization.k8s.io
#   kind: ClusterRole
#   name: my-cluster-role
# subjects:
# - kind: ServiceAccount
#   name: my-service-account
#   namespace: <app-namespace>
Step 4: Check for Overly Broad ClusterRoleBindings
A ClusterRoleBinding granting permissions cluster-wide can cause unexpected behavior but is not the direct cause of a timeout. Verify if a more restrictive, namespaced RoleBinding is needed instead.
kubectl get clusterrolebinding -o yaml | grep -A 10 -B 5 "<your-serviceaccount-name>"
Step 5: Review kube-apiserver Logs
Search for authorization timeout or denial messages related to the ServiceAccount. This confirms the AuthZ path is the bottleneck.
# On the control plane node(s)
sudo journalctl -u kube-apiserver --since "5 minutes ago" | grep -i "timeout\|forbidden\|<serviceaccount-uuid>"
# Or from the pod logs if using a pod-based API server
kubectl logs -n kube-system kube-apiserver-<node-name> --since=5m | grep -i "authorization"
Step 6: Validate the Fix
Impersonate the ServiceAccount and retry the API call that previously failed, to verify that permissions are now granted correctly and without delay.
kubectl auth can-i get pods --as=system:serviceaccount:<namespace>:<sa-name> -n <namespace>
# Simulate an actual call with impersonation and timeout flags
kubectl get pods -n <namespace> --as=system:serviceaccount:<namespace>:<sa-name> --request-timeout=5s
Architect's Pro Tip
"This often happens during Helm chart deployments where the `.Release.Namespace` variable is misused in the RoleBinding's `metadata` or `subjects` block, or when copying RoleBinding manifests between environments without updating namespace references."
Frequently Asked Questions
Why are the 504s intermittent and not constant?
The cause is the Kubernetes API server's authorization webhook cache: a denied request may be cached briefly. Subsequent requests hit the cache and fail fast, but once the cache entry expires, the next request triggers a full, slow authorization check against the misconfigured RBAC rule and times out.
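One way to observe this fast-then-slow pattern is to time the same impersonated request many times in a row. The sketch below is a minimal timing loop; `time_repeated` is a hypothetical helper, and it measures in whole seconds, which assumes the slow path takes several seconds (as a request that ends in a 504 would).

```shell
# Hypothetical helper: run a command N times and print one line per run:
#   <attempt> <exit-code> <elapsed-seconds>
# Usage: time_repeated <attempts> <command> [args...]
time_repeated() {
  attempts="$1"
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    start=$(date +%s)
    # Capture the exit code without aborting if the command fails.
    if "$@" >/dev/null 2>&1; then rc=0; else rc=$?; fi
    end=$(date +%s)
    echo "$i $rc $((end - start))"
    i=$((i + 1))
  done
}
```

For example, `time_repeated 20 kubectl get pods -n <namespace> --as=system:serviceaccount:<namespace>:<sa-name> --request-timeout=5s` would show whether most attempts return almost immediately while an occasional one runs into the five-second limit.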
What's the difference between a RoleBinding and a ClusterRoleBinding in this context?
A RoleBinding grants permissions within a specific namespace; a ClusterRoleBinding grants them cluster-wide. A RoleBinding may reference a ClusterRole, in which case the ClusterRole's rules apply only inside the binding's own namespace. The bug occurs when that namespace, or the namespace listed for the ServiceAccount subject, does not match the ServiceAccount's intended scope, so the API server finds no matching rule during authorization.
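To see the scoping difference in practice, one option is to run the same impersonated access check against several namespaces. With a RoleBinding, the answer should be `yes` only in the binding's namespace; with a ClusterRoleBinding, it should be `yes` everywhere. The helper names `sa_user` and `compare_access` below are hypothetical; the sketch relies only on `kubectl auth can-i` with `--as`, as used in Step 6.

```shell
# Build the impersonation username for a ServiceAccount.
sa_user() {
  printf 'system:serviceaccount:%s:%s' "$1" "$2"
}

# Hypothetical helper: check one verb/resource pair in each listed namespace.
# Usage: compare_access <sa-namespace> <sa-name> <verb> <resource> <ns>...
compare_access() {
  sa_ns="$1"
  sa_name="$2"
  verb="$3"
  resource="$4"
  shift 4
  for target_ns in "$@"; do
    # `kubectl auth can-i` exits non-zero on "no", so tolerate the failure.
    answer=$(kubectl auth can-i "$verb" "$resource" \
      --as="$(sa_user "$sa_ns" "$sa_name")" -n "$target_ns" 2>/dev/null) || true
    echo "$target_ns: ${answer:-unknown}"
  done
}
```

For instance, `compare_access <namespace> <sa-name> get pods <namespace> default kube-system` prints one `yes` or `no` line per namespace, which makes an unexpectedly narrow or broad grant easy to spot.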