Fix Alibaba Cloud ACK NodePool ErrImagePull After K8s Version Upgrade

Quick Fix Summary

TL;DR

Check and correct the image pull secret referenced by the default service account in each affected namespace.

After a Kubernetes version upgrade, pods scheduled onto nodes in a new or upgraded node pool may fail to pull container images because the credentials for the container registry are missing or incorrect.

Diagnosis & Causes

  • Missing or outdated imagePullSecrets in the default service account of the affected namespace.
  • Workloads on the node pool using an outdated or incorrect Container Registry endpoint or credential.

Recovery Steps

    Step 1: Verify the ErrImagePull Error

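    Identify the specific pod and node experiencing the image pull failure, then read the pod events. Messages such as "unauthorized" or "pull access denied" point to an authentication problem rather than a missing image or a network fault.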

    bash
    kubectl get pods -A -o wide | grep -Ei 'errimagepull|imagepullbackoff'
    kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events
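To make the "is this really authentication?" check repeatable, the event text can be piped through a small filter. The following is a minimal sketch: `classify_pull_error` is a hypothetical helper name, and the patterns are typical container-runtime messages, not an exhaustive or authoritative list.

```shell
# classify_pull_error: read pod event text on stdin and print one coarse
# category: auth, not-found, network, or unknown.
# Assumption: the patterns below are common runtime messages; extend as needed.
classify_pull_error() {
  awk '
    { line = tolower($0) }
    line ~ /unauthorized|authentication required|access denied/ { auth = 1 }
    line ~ /manifest unknown|not found/                         { missing = 1 }
    line ~ /timeout|no such host|connection refused/            { net = 1 }
    END {
      # Authentication problems take precedence over the other categories.
      if (auth)         print "auth"
      else if (missing) print "not-found"
      else if (net)     print "network"
      else              print "unknown"
    }'
}
```

Usage: `kubectl describe pod <pod-name> -n <namespace> | classify_pull_error`. Only an `auth` result suggests that the remaining steps apply.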

    Step 2: Check Default Service Account Secrets

    Inspect the default service account in the affected namespace. After the upgrade it may no longer list the imagePullSecrets that pods relied on before, or it may reference a secret that does not exist in that namespace.

    bash
    kubectl describe serviceaccount default -n <namespace>
    kubectl get secrets -n <namespace> | grep -i acr
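The two checks above can be combined so that every secret the service account references is also confirmed to exist. This is a sketch under stated assumptions: `check_sa_pull_secrets` is a hypothetical helper name, and it relies only on `kubectl get ... -o jsonpath`.

```shell
# check_sa_pull_secrets NAMESPACE [SERVICE_ACCOUNT]
# Prints "ok: <name>" or "missing: <name>" for each referenced pull secret.
# Exit status: 0 = all present, 1 = none listed or some missing, 2 = lookup failed.
# The body runs in a subshell ( ... ) so its variables do not leak to the caller.
check_sa_pull_secrets() (
  ns="$1"
  sa="${2:-default}"
  names=$(kubectl get serviceaccount "$sa" -n "$ns" \
    -o jsonpath='{.imagePullSecrets[*].name}') || exit 2
  if [ -z "$names" ]; then
    echo "no imagePullSecrets on serviceaccount/$sa in $ns"
    exit 1
  fi
  status=0
  # Word splitting is intentional: secret names never contain spaces.
  for name in $names; do
    if kubectl get secret "$name" -n "$ns" >/dev/null 2>&1; then
      echo "ok: $name"
    else
      echo "missing: $name"
      status=1
    fi
  done
  exit "$status"
)
```

Usage: `check_sa_pull_secrets <namespace>`. A `missing:` line means the service account points at a secret that is absent from that namespace, which produces the same symptom as having no secret configured.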

    Step 3: Patch the Default Service Account

    Add the required Alibaba Cloud Container Registry (ACR) image pull secret to the default service account. The secret must already exist in the same namespace as the service account. Replace `<your-acr-secret-name>` with the actual secret name (e.g., `acr-credential`).

    bash
    kubectl patch serviceaccount default -n <namespace> -p '{"imagePullSecrets": [{"name": "<your-acr-secret-name>"}]}'
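A patch that names a single secret may replace any entries already listed on the service account. Where other pull secrets must be kept, one option is to read the current list and send the full list back. The sketch below assumes that approach; `build_pull_secrets_patch` and `add_pull_secret` are hypothetical helper names.

```shell
# build_pull_secrets_patch NEW [EXISTING...]
# Print a JSON merge patch listing every existing secret name plus NEW (once).
build_pull_secrets_patch() (
  new="$1"
  shift
  items=""
  seen=0
  for name in "$@"; do
    if [ "$name" = "$new" ]; then seen=1; fi
    items="${items}${items:+,}{\"name\":\"$name\"}"
  done
  if [ "$seen" -eq 0 ]; then
    items="${items}${items:+,}{\"name\":\"$new\"}"
  fi
  printf '{"imagePullSecrets":[%s]}\n' "$items"
)

# add_pull_secret NAMESPACE SECRET [SERVICE_ACCOUNT]
add_pull_secret() (
  ns="$1"
  secret="$2"
  sa="${3:-default}"
  existing=$(kubectl get serviceaccount "$sa" -n "$ns" \
    -o jsonpath='{.imagePullSecrets[*].name}') || exit 2
  # Word splitting on $existing is intentional: names contain no spaces.
  patch=$(build_pull_secrets_patch "$secret" $existing)
  kubectl patch serviceaccount "$sa" -n "$ns" --type merge -p "$patch"
)
```

Usage: `add_pull_secret <namespace> <your-acr-secret-name>`. Running it twice is harmless, because a name that is already listed is not added again.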

    Step 4: Restart Affected Pods

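    Delete the pods stuck in ErrImagePull or ImagePullBackOff so that their controller (Deployment, StatefulSet, and so on) re-creates them with the corrected service account credentials. Pods that are not managed by a controller are not re-created automatically and must be created again by hand.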

    bash
    kubectl delete pod <pod-name> -n <namespace>
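When many pods are affected, deleting them one by one is tedious. As a sketch, the STATUS column of `kubectl get pods --no-headers` can be filtered to select only the failing pods; `failing_pull_pods` and `restart_failing_pods` are hypothetical helper names.

```shell
# failing_pull_pods: read `kubectl get pods --no-headers` output on stdin and
# print the names of pods whose STATUS (third column) reports a pull failure.
# The match also covers init-container states such as Init:ImagePullBackOff.
failing_pull_pods() {
  awk '$3 ~ /ErrImagePull|ImagePullBackOff/ { print $1 }'
}

# restart_failing_pods NAMESPACE
# Delete every pod in NAMESPACE that is currently failing to pull its image.
restart_failing_pods() (
  ns="$1"
  kubectl get pods -n "$ns" --no-headers | failing_pull_pods |
    while read -r pod; do
      kubectl delete pod "$pod" -n "$ns"
    done
)
```

Usage: `restart_failing_pods <namespace>`. To preview the selection without deleting anything, run `kubectl get pods -n <namespace> --no-headers | failing_pull_pods`.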

    Step 5: Verify and Prevent Recurrence

    Ensure the image pull secret exists in every namespace that pulls from the registry (secrets are namespace-scoped, so there is no cluster-wide secret), and review the node pool configuration so that future upgrades do not break image pulls.

    bash
    # Check whether the secret exists in kube-system
    kubectl get secret <your-acr-secret-name> --namespace=kube-system
    # Review ACK node pool configuration in Alibaba Cloud Console for ImageSecret.

    Architect's Pro Tip

    "This often happens when the node pool upgrade creates new ECS instances. The automated setup may not copy the `imagePullSecret` from the kube-system namespace to the default service account in user namespaces. Always verify the default service account post-upgrade."

    Frequently Asked Questions

    The secret exists in kube-system, but pods still can't pull images. Why?

    Secrets are namespace-scoped. A secret in `kube-system` is not accessible to pods in other namespaces (e.g., `default`). You must either create the secret in each namespace or configure the node pool to inject it automatically.
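Creating the secret in each namespace can be scripted. The sketch below assumes the registry endpoint and credentials are supplied through the environment variables `ACR_SERVER`, `ACR_USERNAME` and `ACR_PASSWORD`; those variable names and the helper `ensure_acr_secret` are hypothetical. It uses the `create --dry-run=client -o yaml | apply` idiom so that re-running it updates an existing secret instead of failing.

```shell
# ensure_acr_secret NAMESPACE SECRET_NAME
# Create or update a docker-registry secret in NAMESPACE.
# Assumption: ACR_SERVER, ACR_USERNAME and ACR_PASSWORD are set by the caller.
ensure_acr_secret() (
  ns="$1"
  name="$2"
  # Abort early with a message if a required variable is unset or empty.
  : "${ACR_SERVER:?set ACR_SERVER to the registry endpoint}"
  : "${ACR_USERNAME:?set ACR_USERNAME}"
  : "${ACR_PASSWORD:?set ACR_PASSWORD}"
  kubectl create secret docker-registry "$name" -n "$ns" \
    --docker-server="$ACR_SERVER" \
    --docker-username="$ACR_USERNAME" \
    --docker-password="$ACR_PASSWORD" \
    --dry-run=client -o yaml | kubectl apply -n "$ns" -f -
)
```

Usage: `for ns in team-a team-b; do ensure_acr_secret "$ns" acr-credential; done`, where the namespace names are placeholders. The password is passed on the command line and is briefly visible in the process list, so run this only from a trusted host.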

    Can I fix this without restarting all my pods?

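    New pods pick up the fix automatically once the service account is patched, because imagePullSecrets are copied from the service account when a pod is created. Existing pods must be re-created to pick up the new credentials: trigger a rolling restart of the workload (for example, `kubectl rollout restart deployment/<deployment-name> -n <namespace>`) or delete the pods individually.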
