Alibaba Cloud RDS: Fix Connection Pool Exhaustion Triggering High-Latency Alerts
Quick Fix Summary
TL;DR: Restart the application to clear stuck connections and temporarily relieve pool pressure.
Diagnosis & Causes
The application's database connection pool has reached its maximum size, so new requests either queue waiting for a free connection or fail outright, which surfaces as high latency and timeouts.
Recovery Steps
Step 1: Verify Active Connections and Pool Exhaustion
Check the current number of active connections on the RDS instance and correlate with application metrics to confirm pool exhaustion.
# Check active connections on the RDS instance via DMS or CLI
aliyun rds DescribeDBInstancePerformance --DBInstanceId <your-instance-id> --Key 'MySQL_ActiveSessions' --StartTime $(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%MZ) --EndTime $(date -u +%Y-%m-%dT%H:%MZ)
# Check CloudMonitor for the 'Database Connections' metric and any 'Failed to Get Connection' alerts.
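If you can reach the instance with a SQL client, you can also compare the live session count against the configured ceiling directly. A minimal check (the endpoint and account below are placeholders for your own values):
# Compare current sessions against the configured limit
mysql -h <rds-endpoint> -u <admin-user> -p -e "SHOW STATUS LIKE 'Threads_connected'; SHOW VARIABLES LIKE 'max_connections';"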
Step 2: Restart Application to Clear Stale Connections
This is the immediate recovery step: restarting the application closes all existing pool connections and re-establishes them fresh. Use your normal orchestration tooling.
# For ECS instances running the app
sudo systemctl restart <your-application-service>
# For Kubernetes deployments
kubectl rollout restart deployment/<your-app-deployment> -n <namespace>
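After the restart, confirm the rollout finished and that the session count on the instance actually dropped; if it did not, the connections are being held by something other than the restarted pods. A quick check (same placeholders as above):
# Wait for the new pods to become ready
kubectl rollout status deployment/<your-app-deployment> -n <namespace>
# Re-check the live session count on the instance
mysql -h <rds-endpoint> -u <admin-user> -p -e "SHOW STATUS LIKE 'Threads_connected';"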
Step 3: Increase RDS Maximum Connections Limit
Temporarily increase the instance's `max_connections` parameter to provide immediate headroom while you fix the root cause.
# Raise max_connections on the instance (replace with your instance ID); Forcerestart false applies the change online, since max_connections is a dynamic parameter
aliyun rds ModifyParameter --DBInstanceId <your-instance-id> --Parameters '{"max_connections":"1000"}' --Forcerestart false
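Confirm the new value is actually in effect before relying on it. Assuming the DescribeParameters action is available in your CLI version, either of the following shows the running value:
# Running parameter values as reported by the RDS API
aliyun rds DescribeParameters --DBInstanceId <your-instance-id>
# Or ask the database directly
mysql -h <rds-endpoint> -u <admin-user> -p -e "SHOW VARIABLES LIKE 'max_connections';"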
Step 4: Kill Long-Running or Idle Connections
Manually terminate problematic connections that are consuming pool slots, using the RDS console (DMS) or SQL commands.
# Connect to the RDS instance via DMS or client, then run:
SHOW PROCESSLIST;
-- Identify idle/long-running connections and kill them
KILL <process_id>;
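To find candidates in bulk rather than scanning SHOW PROCESSLIST by eye, you can query information_schema from the same session; the 300-second cutoff below is an illustrative threshold, not an Alibaba recommendation:
-- List connections that have been idle for more than 300 seconds (illustrative threshold)
SELECT id, user, host, db, command, time
FROM information_schema.processlist
WHERE command = 'Sleep' AND time > 300
ORDER BY time DESC;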
Step 5: Optimize Application Connection Pool Settings
Review and adjust the application's connection pool configuration (e.g., HikariCP, Druid) so that the combined pools of all application instances stay within the RDS `max_connections` limit.
# Example: Check application config for pool settings (Spring Boot HikariCP)
# application.yml snippet:
spring:
  datasource:
    hikari:
      maximum-pool-size: 20       # Size against RDS max_connections and the number of app instances
      connection-timeout: 30000   # ms to wait for a connection from the pool before failing
      idle-timeout: 600000        # ms before an idle connection is retired from the pool
      max-lifetime: 1800000       # ms before a connection is proactively recycled
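The key sizing rule: maximum-pool-size multiplied by the number of application instances (plus any other clients) must stay comfortably below the RDS max_connections, leaving headroom for admin, monitoring, and replication sessions. A rough worked example with illustrative numbers:
# Illustrative sizing: max_connections=1000, ~100 reserved for admin/monitoring, 30 app replicas
# (1000 - 100) / 30 = 30  -> each replica's pool should stay at or below ~30 connections
echo $(( (1000 - 100) / 30 ))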
Step 6: Scale the RDS Instance
If the workload has permanently increased, scale the instance to a higher specification (CPU/memory), which supports a larger connection limit.
# Upgrade the DB instance specification (causes a momentary connection drop)
aliyun rds ModifyDBInstanceSpec --DBInstanceId <your-instance-id> --DBInstanceClass '<new-spec>' --PayType Postpaid
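Once the change completes, confirm the instance is running on the new class; assuming the DescribeDBInstanceAttribute action is available in your CLI version:
# Shows the current DBInstanceClass and the connection ceiling for that spec
aliyun rds DescribeDBInstanceAttribute --DBInstanceId <your-instance-id>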
Step 7: Implement Connection Health Checks & Timeouts
Configure the connection pool to validate connections before use and evict idle connections to prevent leaks from affecting the pool.
# Example HikariCP health check properties
spring.datasource.hikari.connection-test-query=SELECT 1
spring.datasource.hikari.validation-timeout=5000
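To surface leaks rather than merely tolerate them, HikariCP also provides a leak detection threshold that logs a warning whenever a connection is held longer than the configured time; the value below is illustrative:
# Warn in the application log if a connection is borrowed for more than 60 s without being returned (illustrative value)
spring.datasource.hikari.leak-detection-threshold=60000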
"This often happens after a deployment where a code change introduced a connection leak, or when a downstream service slowdown causes transactions to hold connections open for much longer than usual. Always compare the deployment timeline with the connection growth chart in CloudMonitor."
Frequently Asked Questions
Will increasing `max_connections` on RDS impact performance?
Yes, increasing it too high without sufficient instance resources (CPU/RAM) can lead to contention and degrade performance. It's a temporary buffer, not a solution. Always scale the instance spec if you need a permanently higher limit.
How do I find the root cause of connection leaks?
Enable detailed logging for your connection pool (e.g., HikariCP leak detection). Use APM tools to trace slow database queries. Review recent code changes, especially around transaction management and connection handling in try-catch-finally blocks.