Severity: CRITICAL

Kubernetes RBAC: Fix ServiceAccount Authorization Failures Triggered by Resource Exhaustion (OOM/CPU)

Quick Fix Summary

TL;DR

Scale up the failing pod's resource limits and restart it.

When a pod (especially the kube-apiserver or a critical service mesh sidecar) is OOM-killed or heavily CPU-throttled, it can no longer complete RBAC authorization checks, causing cascading 'Authorization Failed' errors for ServiceAccounts across the cluster.
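
A quick way to confirm this pattern is to list every container whose most recent termination was an OOM kill. A minimal sketch, assuming jq is installed alongside kubectl:

bash
# Containers whose last termination reason was OOMKilled, across all namespaces
kubectl get pods -A -o json | jq -r '
  .items[] | . as $p
  | .status.containerStatuses[]?
  | select(.lastState.terminated.reason == "OOMKilled")
  | "\($p.metadata.namespace)/\($p.metadata.name) container=\(.name) restarts=\(.restartCount)"'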

Diagnosis & Causes

  • Insufficient memory/CPU limits on kube-apiserver or critical system pods.
  • A spike in RBAC evaluation requests (e.g., many LIST operations) overwhelming the API server (see the quick check below this list).
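
To confirm which cause applies, check control-plane node pressure and whether API Priority and Fairness is rejecting requests. A minimal sketch; it assumes APF is enabled (the default since Kubernetes v1.20) and metric names can vary slightly between versions:

bash
# Node-level CPU/memory pressure on the control plane (requires metrics-server)
kubectl top nodes
kubectl describe node <control-plane-node> | grep -A 6 "Conditions:"
# Requests rejected by API Priority and Fairness, a sign of request overload
kubectl get --raw /metrics | grep apiserver_flowcontrol_rejected_requests_total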

Recovery Steps

    Step 1: Verify Resource Exhaustion and Identify the Failing Component

    Check for OOMKilled or CPU-throttled pods, focusing on system components. Use kubectl describe and kubectl top.

    bash
    # Pods that are not currently Running (Pending, Failed, etc.)
    kubectl get pods -n kube-system --field-selector=status.phase!=Running
    # Look for OOMKilled / Terminated in the pod's recent state and events
    kubectl describe pod -n kube-system <pod-name> | grep -A 5 -B 5 "OOMKilled\|Terminated"
    # Current CPU/memory usage (requires metrics-server)
    kubectl top pods -n kube-system
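
    If a container was OOM-killed and has already restarted, it can show as Running again; recent namespace events usually still record the kill or eviction. A small complementary check:

    bash
    # Recent OOM / kill / eviction events in kube-system, newest last
    kubectl get events -n kube-system --sort-by=.lastTimestamp | grep -iE "oom|killed|evict"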

    Step 2: Immediately Scale Up the Affected Pod's Resources

    If the affected component runs as a Deployment (for example, CoreDNS, metrics-server, or a service mesh control plane), patch it with higher limits for immediate relief. On kubeadm clusters, kube-apiserver is a static pod rather than a Deployment, and on managed control planes (EKS, GKE, AKS) its resources cannot be changed directly; see the static-pod sketch after the command.

    bash
    # JSON Patch "add" creates the limits object if missing and replaces it if present
    kubectl patch deployment -n <namespace> <deployment-name> --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/0/resources/limits", "value": {"cpu": "2", "memory": "4Gi"}}]'
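
    For kube-apiserver on a kubeadm-provisioned cluster, edit the static pod manifest on the control-plane node instead; the kubelet restarts the pod automatically when the file changes. A minimal sketch, assuming kubeadm's default manifest path:

    bash
    # On the control-plane node (requires node access, e.g. via SSH)
    sudo vi /etc/kubernetes/manifests/kube-apiserver.yaml
    # Raise the kube-apiserver container's resources, save, then watch the pod come back
    kubectl get pods -n kube-system -l component=kube-apiserver -w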

    Step 3: Check API Server and Kubelet Logs for Authorization Errors

    Examine logs to confirm the link between resource pressure and RBAC failures.

    bash
    # API server logs: look for authorization errors, OOM mentions, and throttling
    kubectl logs -n kube-system <kube-apiserver-pod-name> --tail=100 | grep -iE "authorization|oom|throttl"
    # Kubelet logs (run this on the affected node)
    journalctl -u kubelet --no-pager | tail -100 | grep -i "authorization"
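
    A genuine RBAC denial returns HTTP 403, while resource exhaustion usually shows up as 429s, 5xx responses, or client timeouts. Comparing response codes in the API server's own metrics helps confirm which one you are seeing; a minimal sketch:

    bash
    # Request counts by HTTP response code: look for spikes in 429/5xx rather than 403
    kubectl get --raw /metrics | grep apiserver_request_total | grep -E 'code="(403|429|5..)"'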

    Step 4: Analyze RBAC Request Load with API Server Metrics and API Priority & Fairness

    Check if a specific ServiceAccount or namespace is generating excessive LIST/WATCH requests.

    bash
    # API Priority and Fairness queue, dispatch, and rejection metrics
    kubectl get --raw /metrics | grep "apiserver_flowcontrol"
    # Request volume by verb and resource; LIST calls are typically the most expensive
    kubectl get --raw /metrics | grep "apiserver_request_total" | grep 'verb="LIST"'
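
    On clusters with API Priority and Fairness enabled (the default since v1.20), the API server also exposes debug endpoints that show which flows are currently queued; the exact paths can vary by version. A minimal sketch:

    bash
    # Per-priority-level occupancy and the requests currently waiting in queues
    kubectl get --raw /debug/api_priority_and_fairness/dump_priority_levels
    kubectl get --raw /debug/api_priority_and_fairness/dump_requests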

    Step 5: Restart the Failing Pod to Clear Corrupted State

    Force a restart of the resource-exhausted pod after adjusting limits. This applies to Deployment- or DaemonSet-managed pods; for static pods such as kube-apiserver, the kubelet restarts the pod on its own once the manifest changes, and deleting the mirror pod object does not restart the running container.

    bash
    kubectl delete pod -n kube-system <pod-name>
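
    For Deployment-managed components, a rolling restart is cleaner than deleting pods by hand. A minimal sketch, assuming the affected component is a Deployment:

    bash
    kubectl rollout restart deployment -n <namespace> <deployment-name>
    kubectl rollout status deployment -n <namespace> <deployment-name>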

    Step 6: Apply Correct Resource Limits and Requests Permanently

    Update the manifest for the affected component (Deployment/DaemonSet, or the static pod manifest for control-plane components) with sustainable values.

    bash
    # Deployment-managed components:
    kubectl edit deployment -n <namespace> <deployment-name>
    # kube-apiserver on kubeadm: edit /etc/kubernetes/manifests/kube-apiserver.yaml on the control-plane node
    # In the editor, locate the 'resources' block and adjust. Example:
    resources:
      requests:
        cpu: "1"
        memory: "2Gi"
      limits:
        cpu: "3"
        memory: "6Gi"
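
    After the rollout, verify that the new values are actually in effect on the running pod; a minimal sketch:

    bash
    # Effective requests/limits of the first container in the pod
    kubectl get pod -n kube-system <pod-name> -o jsonpath='{.spec.containers[0].resources}'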

    Step 7: Review and Optimize Client Queries

    Identify clients performing inefficient LIST calls (e.g., missing label selectors) and enforce best practices.

    bash
    # Audit frequent requestors; this requires audit logging to be enabled on the API server.
    # Check for pods with high API call rates via sidecar metrics or client-side logging.
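
    If audit logging is enabled, the audit log shows exactly which users and ServiceAccounts issue the most LIST calls. A minimal sketch, assuming a JSON-lines audit log at /var/log/kubernetes/audit.log (the path is set by --audit-log-path and differs per cluster) and jq on the control-plane node:

    bash
    # Top 20 requestors of LIST calls, by username
    sudo jq -r 'select(.verb == "list") | .user.username' /var/log/kubernetes/audit.log \
      | sort | uniq -c | sort -rn | head -20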

    Architect's Pro Tip

    "This often happens during a cluster-wide deployment that triggers thousands of pods to resync their informers simultaneously, overwhelming the API server's memory. Check for deployments using default, unoptimized Kubernetes client-go configurations."

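    One way to spot such a resync storm from the API server side is to watch the inflight-request and registered-watcher gauges climb during the rollout. A minimal sketch; metric names may differ slightly across Kubernetes versions:

    bash
    # Requests currently executing, split into mutating and read-only
    kubectl get --raw /metrics | grep apiserver_current_inflight_requests
    # Watch connections registered per resource kind
    kubectl get --raw /metrics | grep apiserver_registered_watchers
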
    Frequently Asked Questions

    Why does resource exhaustion cause RBAC failures?

    The kube-apiserver process, which evaluates RBAC rules, becomes unresponsive or is killed. Requests from ServiceAccounts then time out or are rejected, appearing as authorization failures even if the RBAC rules are correct.

    My pod has 'OOMKilled' but my application logs show 'RBAC Authorization Failed'. Which is the real error?

    OOMKilled is the root cause. The RBAC error is a symptom. The pod's process died mid-request, leaving clients with a failed authorization response. Always treat OOMKilled as the primary issue.
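
    A quick way to tell them apart: confirm whether the container was actually OOM-killed, and whether the failed calls returned 403 (a real RBAC denial) or 429/5xx/timeouts (resource pressure). A minimal sketch:

    bash
    # Last termination state of the suspect container; look for reason: OOMKilled
    kubectl get pod -n <namespace> <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'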
