Introduction and Problem Definition
If you have worked with containers on AWS, you have surely heard of Amazon Elastic Container Service, ECS for short. It is a workhorse of AWS infrastructure, especially if you run containers and do not use Kubernetes.
Operating this service is, in most cases, an enjoyable experience. However, there is one caveat you need to be aware of.
The problem surfaces when you want to downsize your cluster while preserving certain machines from removal. Such control is useful in two cases: when you need to remove only specific instances, and when you want to prevent AWS from terminating a machine that runs an important task.
Let’s define our problem:
Requirement 1: We want to scale down an Amazon ECS cluster with zero downtime, avoiding cluster destabilization.
Requirement 2: The Amazon ECS cluster uses an Auto Scaling group, and we want to control which machines are removed.
Sketching the Approach
The first thing you may think of is leveraging autoscaling based on Amazon CloudWatch metrics. However, that is a different problem. The problem we want to solve is finding the right size of the cluster, rather than reacting to the actual load. Also, metric-based autoscaling assumes you know your application's CPU/memory needs. That may not be the case when you are on the operations team and have limited knowledge about the applications.
Having ruled out autoscaling based on metrics, we prepared the two following approaches:
A. A boto3 Python script that uses the AWS API to scale down an Auto Scaling group by removing a specified machine and decreasing the desired machine count by 1.
Such an approach is especially useful when you want to automate the process of adjusting the size of the cluster by identifying idle machines. Combining the script with AWS Lambda makes this job a no-brainer and fully automates it.
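Approach A could be sketched with boto3 roughly as follows. This is a minimal illustration, not a production script; the instance ID and region are placeholders, and the call assumes AWS credentials are configured in the environment.

```python
# Sketch of approach A: terminate a specific instance and shrink the
# Auto Scaling group's desired capacity in a single API call.

def terminate_and_decrement(instance_id, region="eu-west-1"):
    """Remove one instance from its Auto Scaling group and decrement the
    desired capacity, so the group does not spawn a replacement machine."""
    import boto3  # imported lazily so the sketch can be read without AWS access

    asg = boto3.client("autoscaling", region_name=region)
    # ShouldDecrementDesiredCapacity=True is the key: it shrinks the group
    # instead of letting it immediately launch a new instance.
    return asg.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        ShouldDecrementDesiredCapacity=True,
    )
```

A Lambda handler automating the cluster downsizing would call something like `terminate_and_decrement("i-0123456789abcdef0")` after identifying an idle machine.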
B. Identify the machines you want to remove, enable scale-in protection for all of the other machines, then decrease the Auto Scaling group to the desired machine count.
This solution is tremendously helpful when you are doing it for the first time, or in an uncontrolled environment, especially if CPU/memory needs are not specified or a task placement policy is not defined. In other words: you want to see how the ECS cluster behaves when performing such a scaling.
I chose approach B and performed the following steps:
- Identify idle machines, e.g., ones that don't run any tasks.
- Drain them so that AWS does not schedule any new tasks on them.
- Go to the Auto Scaling Groups panel, select the one corresponding to your Amazon ECS cluster, and open the Instances tab.
- Change Instance Protection to Set scale in protection for the rest of the machines.
- Decrease the Auto Scaling group's desired machine count.
- The machines you identified should be shut down.
- Remove Instance Protection from the previously protected machines.
- Observe the Amazon ECS instances, as AWS may decide to migrate/reallocate tasks in response to the change.
- Now the Auto Scaling group may decide to shut down other machines and spawn new ones (e.g., to update the agent version).
- Observe how Amazon ECS migrates tasks. The scheduler aims at full utilization of the machines (read more in the AWS docs here). You may expect the less loaded machines to take the new tasks, but by default that is not true. You may see a couple of changes until Amazon ECS converges to a stable state.
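The manual steps above could also be scripted with boto3. The sketch below covers the first half of the procedure (identify idle machines, drain them, protect the rest, shrink the group); the cluster and Auto Scaling group names are placeholders, and the AWS calls assume configured credentials. It also assumes a small cluster, since `update_container_instances_state` accepts at most ten instances per call.

```python
# Sketch of approach B automated with boto3: drain idle machines,
# protect the busy ones, and shrink the Auto Scaling group.

def pick_idle(container_instances):
    """Return EC2 instance ids of container instances running no tasks."""
    return [
        ci["ec2InstanceId"]
        for ci in container_instances
        if ci["runningTasksCount"] == 0 and ci["pendingTasksCount"] == 0
    ]

def scale_down(cluster, asg_name, region="eu-west-1"):
    import boto3  # imported lazily so pick_idle stays usable offline

    ecs = boto3.client("ecs", region_name=region)
    asg = boto3.client("autoscaling", region_name=region)

    # 1. Describe all container instances in the cluster.
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    described = ecs.describe_container_instances(
        cluster=cluster, containerInstances=arns
    )["containerInstances"]

    idle = pick_idle(described)
    busy = [ci["ec2InstanceId"] for ci in described
            if ci["ec2InstanceId"] not in idle]

    # 2. Drain the idle instances so ECS schedules no new tasks on them.
    idle_arns = [ci["containerInstanceArn"] for ci in described
                 if ci["ec2InstanceId"] in idle]
    ecs.update_container_instances_state(
        cluster=cluster, containerInstances=idle_arns, status="DRAINING"
    )

    # 3. Enable scale-in protection for the remaining machines.
    asg.set_instance_protection(
        AutoScalingGroupName=asg_name,
        InstanceIds=busy,
        ProtectedFromScaleIn=True,
    )

    # 4. Shrink the group; only the unprotected (idle) machines can go.
    asg.set_desired_capacity(
        AutoScalingGroupName=asg_name, DesiredCapacity=len(busy)
    )
```

After the idle machines are gone, you would remove the instance protection again (the same `set_instance_protection` call with `ProtectedFromScaleIn=False`) and observe the cluster, as described above.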
If you are not satisfied with the reallocation, or your cluster was not stable during the process, these are things you may consider:
- Define the task placement strategies as advised here.
- Define proper CPU/memory requirements for your tasks to ease scheduling inside the Amazon ECS cluster.
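As an illustration of the first point, a placement strategy can be attached when creating an ECS service. The sketch below is a hypothetical example; the service and task definition names are placeholders, and the call assumes AWS credentials are configured.

```python
# Hypothetical example: create an ECS service with an explicit task
# placement strategy instead of relying on the default behavior.

def create_spread_service(cluster, service_name, task_definition,
                          region="eu-west-1"):
    import boto3  # imported lazily so the sketch can be read without AWS access

    ecs = boto3.client("ecs", region_name=region)
    return ecs.create_service(
        cluster=cluster,
        serviceName=service_name,
        taskDefinition=task_definition,
        desiredCount=2,
        # Spread tasks across Availability Zones first, then binpack on
        # memory so ECS fills machines before touching emptier ones.
        placementStrategy=[
            {"type": "spread", "field": "attribute:ecs.availability-zone"},
            {"type": "binpack", "field": "memory"},
        ],
    )
```

Combining `spread` for resilience with `binpack` for utilization is one reasonable choice; the right mix depends on your workload.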
In the end, I have to stress that metrics are the key element of doing this right. Otherwise, you are flying blind while making such changes. Of course, you have Amazon CloudWatch metrics, but they may be too generic for this whole process.
Learn More About Our Product Oriented Operations Service
We are highly experienced in monitoring and running tight operations for massive-scale distributed enterprise applications. The best reference for our expertise is that most of the time we know about and fix problems before the client even notices.
I want to learn more about it