
AWS Systems Manager: Fix SSM Agent Heartbeat Failures Resulting in Intermittent Timeouts

Quick Fix Summary

TL;DR

Restart the SSM Agent service on the affected instance.

The SSM Agent fails to send regular heartbeat signals to the Systems Manager service, causing the instance to appear unreachable and leading to command timeouts.
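
One quick way to confirm the symptom from a machine with AWS CLI access is to check how Systems Manager currently sees the instance; the instance ID and region below are placeholders.

bash
# Check the instance's ping status as seen by Systems Manager (placeholder instance ID and region)
aws ssm describe-instance-information \
  --filters "Key=InstanceIds,Values=i-0123456789abcdef0" \
  --query "InstanceInformationList[0].{PingStatus:PingStatus,LastPing:LastPingDateTime,Agent:AgentVersion}" \
  --region us-east-1

A PingStatus of ConnectionLost, or a stale LastPingDateTime, matches the heartbeat failure described above.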

Diagnosis & Causes

  • Network connectivity or proxy issues blocking outbound HTTPS (port 443) traffic to SSM endpoints.
  • Resource constraints (high CPU/memory) on the instance starving the agent process.
  • Corrupted agent state or outdated agent version with known bugs.
Recovery Steps


    Step 1: Verify Agent Status and Connectivity

    Check if the agent is running and can reach the SSM service endpoints from the instance.

    bash
    sudo systemctl status amazon-ssm-agent
    sudo /opt/aws/amazon-ssm-agent/bin/amazon-ssm-agent -version
    curl -s -o /dev/null -w "%{http_code}" https://ssm.us-east-1.amazonaws.com/ || timeout 3 nc -zv ssm.us-east-1.amazonaws.com 443
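
    The agent also needs to reach the ec2messages and ssmmessages endpoints, not only ssm. A quick loop over all three (region assumed to be us-east-1, matching the commands above) can reveal a partially blocked path:

    bash
    # Probe all three Systems Manager endpoints; any HTTP status proves connectivity, "000" means blocked
    for ep in ssm ec2messages ssmmessages; do
      printf '%s: ' "$ep"
      curl -s -o /dev/null -m 5 -w "%{http_code}\n" "https://${ep}.us-east-1.amazonaws.com/"
    done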

    Step 2: Restart the SSM Agent Service

    Gracefully restart the agent to clear any transient state issues.

    bash
    sudo systemctl restart amazon-ssm-agent
    sleep 10 && sudo systemctl status amazon-ssm-agent
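
    If the unit shows active but the instance still drops off shortly afterwards, the systemd journal usually shows why the agent died or failed to reconnect right after startup. A small check, assuming a systemd-based distribution:

    bash
    # Confirm the unit stayed up and review the log lines written since the restart
    systemctl show amazon-ssm-agent --property=ActiveState,SubState,NRestarts
    sudo journalctl -u amazon-ssm-agent --since "-10 minutes" --no-pager | tail -30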

    Step 3: Check Instance Resource Utilization

    Identify if CPU, memory, or disk I/O pressure is causing agent timeouts.

    bash
    top -b -n 1 | head -20
    free -h
    df -h / /var/lib/amazon/ssm
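
    To see whether the agent process itself is being starved or is leaking, look at its own CPU and memory share and at recent OOM killer activity (process names may differ slightly between agent versions):

    bash
    # Per-process view of the agent and its worker, plus any recent OOM killer events
    ps -o pid,ppid,pcpu,pmem,rss,etime,cmd -C amazon-ssm-agent,ssm-agent-worker
    sudo dmesg -T | grep -i "out of memory" | tail -5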

    Step 4: Inspect Agent Logs for Errors

    Examine the agent logs for authentication, network, or heartbeat-specific errors.

    bash
    sudo tail -100 /var/log/amazon/ssm/amazon-ssm-agent.log
    sudo grep -i "heartbeat\|error\|failed to update" /var/log/amazon/ssm/amazon-ssm-agent.log | tail -50
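
    The agent also keeps a separate errors log, which often states credential, proxy, and endpoint failures more directly than the main log:

    bash
    # Review the errors log and get a rough count of connection-related failures in the main log
    sudo tail -50 /var/log/amazon/ssm/errors.log
    sudo grep -ci "connection reset\|no such host\|timed out" /var/log/amazon/ssm/amazon-ssm-agent.log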

    Step 5: Verify IAM Instance Profile and Permissions

    Confirm the instance's IAM role has the necessary SSM permissions.

    bash
    aws sts get-caller-identity --region us-east-1
    aws iam get-instance-profile --instance-profile-name YourInstanceProfileName --query "InstanceProfile.Roles[0].RoleName" --output text
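
    From the instance itself, you can also confirm that a role is attached and that its credentials are retrievable through the instance metadata service (IMDSv2 shown; adjust if IMDSv1 is still enabled). The returned role should carry `AmazonSSMManagedInstanceCore` or an equivalent policy.

    bash
    # Fetch an IMDSv2 token, then list the role whose credentials the agent will use
    TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
      -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
    curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
      http://169.254.169.254/latest/meta-data/iam/security-credentials/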

    Step 6: Update the SSM Agent to the Latest Version

    Install the latest agent version to resolve known bugs.

    bash
    sudo yum update -y amazon-ssm-agent   # For RHEL/Amazon Linux
    sudo snap refresh amazon-ssm-agent --classic   # For Ubuntu Snap
    sudo /opt/aws/amazon-ssm-agent/bin/amazon-ssm-agent -version
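
    If you prefer to drive the update through Systems Manager itself (and the instance is at least intermittently reachable), the AWS-UpdateSSMAgent Run Command document does the same job and scales to a whole fleet. Instance ID and region below are placeholders:

    bash
    # Update the agent via Run Command instead of logging in (placeholder instance ID and region)
    aws ssm send-command \
      --document-name "AWS-UpdateSSMAgent" \
      --targets "Key=InstanceIds,Values=i-0123456789abcdef0" \
      --region us-east-1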

    Step 7: Re-register the Managed Instance (Last Resort)

    As a final step, clear the agent's local registration data so that it re-registers with Systems Manager on the next start. WARNING: This clears all local agent state.

    bash
    sudo systemctl stop amazon-ssm-agent
    sudo rm -rf /var/lib/amazon/ssm/registration
    sudo systemctl start amazon-ssm-agent
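
    After the restart, the agent should rebuild its registration data automatically. A quick local check is to confirm the directory is repopulated and watch the log while it reconnects (exact log wording varies between agent versions):

    bash
    # Confirm the registration data was recreated, then watch the agent reconnect
    sudo ls -l /var/lib/amazon/ssm/
    sudo tail -f /var/log/amazon/ssm/amazon-ssm-agent.log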

    Architect's Pro Tip

    "Intermittent timeouts are often caused by network ACLs or security groups that allow outbound HTTPS but then silently drop packets due to rate-limiting or stateful inspection timeouts. Test with a sustained curl loop (`for i in {1..30}; do curl -s -o /dev/null -w "%{time_total}\n" https://ssm.region.amazonaws.com/ && sleep 1; done`) to catch periodic packet loss."

    Frequently Asked Questions

    The agent restarts but fails again after a few minutes. What next?

    This strongly points to a resource exhaustion issue. Check for memory leaks using `ps aux | grep ssm-agent` over time and monitor `/var/log/messages` for OOM killer events. Also, verify the agent isn't stuck processing a very large State Manager association or Inventory collection.
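
    A lightweight way to capture that trend, assuming you can leave a shell open for a while (process names may vary by agent version, and the OOM log path differs by distribution):

    bash
    # Check for recent OOM killer activity, then sample the agent's footprint once a minute (Ctrl-C to stop)
    sudo grep -i "out of memory\|killed process" /var/log/messages | tail -10
    while true; do
      { date '+%F %T'; ps -o pid,pcpu,pmem,rss,etime,cmd -C amazon-ssm-agent,ssm-agent-worker; } >> ssm-agent-usage.log
      sleep 60
    done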

    How can I prevent this in my Auto Scaling groups?

    Use a recent Amazon-provided AMI (e.g., Amazon Linux 2023), which ships with the agent pre-installed. In your launch template, add a User Data script that updates the agent on first boot, as sketched below. Also, ensure your instance IAM role uses the managed policy `AmazonSSMManagedInstanceCore`.
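
    A minimal User Data sketch along those lines, assuming an Amazon Linux 2023 launch template (package manager and paths differ on other distributions):

    bash
    #!/bin/bash
    # Launch-template User Data: bring the agent up to date and make sure it is enabled on first boot
    dnf update -y amazon-ssm-agent
    systemctl enable --now amazon-ssm-agent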
