The applications we work with very often involve some user interaction. They handle requests that mirror the domain-specific business processes in which they operate. Often, however, there is a need to process data without any interaction with a user. Colloquially, such tasks are called batch jobs. In this article, I will show you several ways to run batch jobs in the AWS cloud.
Examples of batch jobs could be:
- Importing or exporting data as part of an integration with an external company.
- Periodic recalculation or modification of data in the system.
- Periodically performed operations such as invoicing, sending documents, or creating reports.
Of course, there can be more types of tasks; it all depends on the requirements we must meet. Another question is how to invoke such jobs. There are two types of triggers:
- Time-based - we run the task periodically, for example, every hour or once a month.
- Event-based - we run the job in response to some event, for example, the appearance of a file in the desired location.
Let’s look at an example of such a process and at the possible technical solutions for implementing it.
An example problem
To show you how to implement batch processing using the AWS cloud, I will use a simple example. Let’s assume that files periodically appear in a particular S3 bucket, and we have to process them. Our example batch process consists of reading a file’s contents, saving the information in a database, publishing a message on an SNS topic, and finally transferring the processed file to another S3 bucket. I will not focus on the implementation details or the example domain; the goal is to analyze the possible solutions.
The first way to run a batch job is using the time-based trigger. It is helpful in the scenario where we want to run the processing periodically.
The second way to run a batch job is with an event-based trigger. It is helpful in a scenario where we want to run processing as soon as some event occurs. In this case, it is the appearance of a new file in the S3 bucket.
AWS Lambda
The first way to implement the process described above is based on the AWS Lambda service. It is the easiest, fastest, and cheapest way to implement this functionality. However, there are limitations to pay attention to. The first is the execution timeout of a Lambda function, which has a hard limit of 15 minutes. If the processing we want to implement takes longer, we should consider another solution. The second limitation is the amount of memory we can allocate: a maximum of 10,240 MB, which may be too little in some scenarios.
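The skeleton of such a Lambda function could look like the sketch below. It only shows where the four processing steps from our example would go; the helper that pulls object locations out of an S3 event payload is the testable part, while the actual S3, database, and SNS calls (which would use boto3) are left as comments:

```python
def extract_s3_objects(event):
    """Pull (bucket, key) pairs out of an S3 event notification payload."""
    return [
        (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in event.get("Records", [])
        if r.get("eventSource") == "aws:s3"
    ]

def handler(event, context):
    """Lambda entry point for the example batch process."""
    for bucket, key in extract_s3_objects(event):
        # 1. Read the file contents from S3 (boto3 s3.get_object).
        # 2. Save the extracted information in the database.
        # 3. Publish a message on the SNS topic (sns.publish).
        # 4. Copy the object to the processed bucket and delete the original.
        print(f"processing s3://{bucket}/{key}")
```

For a time-based trigger, the same handler would simply ignore the (empty) record list of the scheduled event and list the bucket itself.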
If you want to invoke a Lambda function periodically, it is best to use a CloudWatch Events rule. We can configure the rule in two ways. The first is cron expressions, which allow us to trigger an event at an exact time. The second is rate expressions, which allow us to trigger an event at fixed time intervals. The Lambda function should be specified as the target when creating the rule.
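Such a rule boils down to two API calls, `put_rule` and `put_targets`. A minimal sketch of the arguments we would pass to boto3 (the rule name and the Lambda ARN are made-up placeholders):

```python
def schedule_rule(name, schedule_expression, target_arn):
    """Build the arguments for the CloudWatch Events (EventBridge)
    put_rule and put_targets calls.

    schedule_expression is either a rate expression like "rate(1 hour)"
    or a cron expression like "cron(0 8 1 * ? *)" (08:00 UTC on the
    first day of every month).
    """
    return {
        "put_rule": {
            "Name": name,
            "ScheduleExpression": schedule_expression,
            "State": "ENABLED",
        },
        "put_targets": {
            "Rule": name,
            "Targets": [{"Id": f"{name}-target", "Arn": target_arn}],
        },
    }

# An hourly trigger for the processing Lambda function:
hourly = schedule_rule(
    "hourly-batch",
    "rate(1 hour)",
    "arn:aws:lambda:eu-west-1:123456789012:function:process-files",
)
```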
To implement the scenario in which we want to invoke the task immediately after a new object appears in the S3 bucket, we need to configure an S3 event notification. It will call the Lambda function as soon as a new object matching the configured pattern appears in the configured location.
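The notification itself is just a configuration document attached to the bucket. A sketch of what we would pass to boto3's `put_bucket_notification_configuration` (the prefix and suffix values are example assumptions):

```python
def s3_lambda_notification(function_arn, prefix="incoming/", suffix=".csv"):
    """Configuration dict for s3.put_bucket_notification_configuration:
    call the given Lambda function for every newly created object whose
    key matches the prefix/suffix filter."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": prefix},
                            {"Name": "suffix", "Value": suffix},
                        ]
                    }
                },
            }
        ]
    }
```

Note that the bucket also needs a resource-based permission allowing S3 to invoke the function.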
Amazon ECS
In situations where a task needs to run longer than 15 minutes or use more than 10,240 MB of memory, we need a different solution. One option is the Amazon ECS service, which provides container orchestration. A job defined in ECS uses Docker images stored in Amazon ECR, the managed Docker image registry on AWS.
To trigger a task defined in ECS on a schedule, we can create an appropriate CloudWatch rule (as in the Lambda case). The only change is to select the ECS Task as the target instead of the Lambda function.
Running a task after an object appears in the S3 bucket is not as trivial as with a Lambda function. There is no mechanism that meets this requirement directly. In this case, we need to combine the ECS-based solution with the Lambda-based one and reuse the functionality described earlier: the S3 event notification automatically calls a Lambda function when an object appears in the bucket, and the Lambda function then runs the ECS task.
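Inside that auxiliary Lambda function, the essential call is `ecs.run_task`. A sketch of the arguments it could build, passing the object's location to the container as environment variables; the cluster, task definition, container name, and subnet IDs are all made-up placeholders:

```python
def run_task_params(cluster, task_definition, subnets, bucket, key):
    """Arguments for ecs.run_task: start one Fargate task and hand it
    the S3 object location via environment variable overrides."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {"subnets": subnets, "assignPublicIp": "DISABLED"}
        },
        "overrides": {
            "containerOverrides": [
                {
                    # Must match the container name from the task definition.
                    "name": "batch-container",
                    "environment": [
                        {"name": "S3_BUCKET", "value": bucket},
                        {"name": "S3_KEY", "value": key},
                    ],
                }
            ]
        },
    }
```

The Lambda handler would extract the bucket and key from the S3 event record and pass the resulting dict to `boto3.client("ecs").run_task(**params)`.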
AWS Batch
If we need to run a more complex job, plan to scale to multiple instances, or want to run array jobs, we should consider AWS Batch. We need a few building blocks to run a job this way. First, we create a job definition in which (as in ECS) we indicate which Docker image from ECR to use. We also have to define a compute environment, specifying the parameters of the machines that will run our jobs. Given these definitions, AWS Batch runs jobs in a fully managed manner as new messages appear in the job queue.
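Array jobs in particular are a one-parameter affair: AWS Batch starts the requested number of copies of the container, and each copy receives a distinct AWS_BATCH_JOB_ARRAY_INDEX environment variable it can use to pick its share of the work. A sketch of the `batch.submit_job` arguments (queue and definition names are placeholders):

```python
def array_job_params(job_name, job_queue, job_definition, size):
    """Arguments for batch.submit_job that start an array job of
    `size` parallel child jobs."""
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "arrayProperties": {"size": size},
    }

# Process a large export in 10 parallel slices:
params = array_job_params("nightly-export", "batch-job-queue", "process-files", 10)
```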
As in the previous examples, we can successfully integrate CloudWatch rules with AWS Batch. In this case, we should set the Batch job queue as the target.
As with ECS, there is no built-in functionality to trigger an AWS Batch job in response to a new file appearing in the S3 bucket. We have to use the previously described mechanism and integrate an auxiliary Lambda function with the S3 event notification. This time, in response to the event, the Lambda function submits a job to the AWS Batch job queue.
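The auxiliary function is analogous to the ECS one, except it builds arguments for `batch.submit_job`. A sketch, assuming placeholder queue and job definition names and deriving the job name from the object key:

```python
def submit_job_for_object(bucket, key):
    """Arguments for batch.submit_job, built by the auxiliary Lambda
    in response to an S3 event. The object location is passed to the
    container as environment variable overrides."""
    return {
        # Batch job names may not contain "/" or ".", so sanitize the key.
        "jobName": "process-" + key.replace("/", "-").replace(".", "-"),
        "jobQueue": "batch-job-queue",
        "jobDefinition": "process-files",
        "containerOverrides": {
            "environment": [
                {"name": "S3_BUCKET", "value": bucket},
                {"name": "S3_KEY", "value": key},
            ]
        },
    }
```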
AWS Elastic Beanstalk
Another possible implementation of batch processing uses worker environments in the AWS Elastic Beanstalk service. The solution consists of creating a dedicated environment responsible for handling batch jobs. The environment uses a dedicated SQS queue to which requests for batch job execution arrive. It can handle multiple types of jobs; we select which one to start by sending the appropriate message to the queue. This solution is practical primarily when the batch jobs are directly linked to an application deployed with Elastic Beanstalk, because we gain the ability to manage the system's components in one place. However, managing the environment is on our side, so we must ensure it is adequately scaled (including down to zero).
This time, we do not need the additional CloudWatch Events component. Elastic Beanstalk can run tasks on the worker environment periodically thanks to the periodic tasks functionality. In a cron.yaml file deployed with the application, we define which tasks should be triggered and when.
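A minimal cron.yaml sketch: each entry names a task, the URL path on the worker application that Elastic Beanstalk will POST to, and a standard cron schedule (the task names and paths below are example assumptions):

```yaml
version: 1
cron:
  - name: "nightly-invoicing"       # runs every day at 02:00 UTC
    url: "/jobs/invoicing"
    schedule: "0 2 * * *"
  - name: "monthly-report"          # first day of the month, 06:00 UTC
    url: "/jobs/report"
    schedule: "0 6 1 * *"
```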
Again, we cannot provide such functionality without involving an auxiliary Lambda function. Since we already have an SQS queue, can we send the S3 bucket event notification directly to it? We cannot: the message must carry specific attributes describing which task should be run. As in the previous examples, we can construct such a message in an additional Lambda function that sends it to the appropriate queue.
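A sketch of the `sqs.send_message` arguments that Lambda function could build. The sqsd daemon on the worker forwards each SQS message attribute to the application as an X-Aws-Sqsd-Attr-&lt;name&gt; HTTP header, so the application can dispatch on it; the JobType attribute name here is our own convention, not an AWS requirement:

```python
import json

def worker_job_message(queue_url, job_type, bucket, key):
    """Arguments for sqs.send_message targeting the Elastic Beanstalk
    worker environment's queue. The body carries the object location;
    the JobType attribute tells the worker which job to run."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps({"bucket": bucket, "key": key}),
        "MessageAttributes": {
            "JobType": {"DataType": "String", "StringValue": job_type},
        },
    }
```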
I have presented several ways to run batch jobs in the AWS cloud. Each is useful in slightly different cases. In terms of implementation effort and cost, a solution based only on a Lambda function is the best. However, as I mentioned, it is not suitable when processing needs more time or memory; in that situation, it is worth considering ECS. We should choose a solution based on AWS Batch in more complex cases, such as triggering several related tasks simultaneously or running groups of jobs. Finally, we can consider an AWS Elastic Beanstalk worker environment when the batch jobs accompany an application that already runs on AWS Elastic Beanstalk.