Cloud-based IoT and M2M Connectivity Gateway at Scale

Learn how we helped an IoT company deliver, on time and flawlessly, a critical Erlang component responsible for handling RADIUS and DIAMETER traffic.

Executive Summary

Learn how we helped extend an Erlang component responsible for handling incoming billing requests and deploy it to AWS. The company met its deadline with confidence thanks to our expertise in Erlang, AWS and DevOps, and our extensive performance tests. The detailed metrics dashboards introduced during our cooperation give the support team better visibility into the system, so they can spot and react to issues much more quickly. Thanks to the DevOps best practices we introduced, the deployment process is now much quicker and more reliable than the previous approach.

About the Subject

Pod Group provides a complete suite of IoT connectivity technologies, combined with expert consultancy and support to enable IoT companies, OEMs and systems integrators to connect, manage, bill and secure their connected devices.

Challenges and Objectives

The main challenge in this project was to understand the Erlang component inherited by the team and make it production-ready in a couple of weeks. It had never handled significant live traffic before, so we had to start by adding a proper suite of tests to make sure it behaved correctly. This was not a straightforward task, because the main functionality is concentrated around the DIAMETER and RADIUS protocols used within mobile networks. These protocols are not widely used and there is a scarcity of ready-to-use open source libraries and tools, so we had to build them on our own.

The next must-have before going live was extracting part of the existing functionality into an external component. We were responsible for cleaning up the codebase, integrating with the new component, and making sure it was capable of handling big bursts of requests.

With all of the above ready, we had to make sure that the component would be able to sustain the expected load. The challenge here was to build a performance test environment from which we could recreate the expected traffic and validate the target installation in the AWS cloud. The expected traffic pattern was also very challenging - big bursts lasting up to 1h, after which traffic drops by ~80% for a longer period. The biggest question here was how to balance capacity for the bursts against cost.
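
The capacity-versus-cost trade-off can be illustrated with a little arithmetic. The sketch below is not taken from the project: the function names, the peak rate, the per-instance capacity and the assumption of one burst per day are all hypothetical. It only compares the instance-hours needed when provisioning for the peak around the clock with the instance-hours needed when capacity follows the traffic.

```python
import math


def instances_needed(requests_per_second, capacity_per_instance):
    """Smallest number of instances that covers the given request rate."""
    return max(1, math.ceil(requests_per_second / capacity_per_instance))


def daily_instance_hours(peak_rps, burst_hours, capacity_per_instance,
                         drop_percent=80, autoscale=True):
    """Instance-hours per day for one burst followed by a quieter period.

    All inputs are hypothetical; drop_percent=80 mirrors the ~80% drop
    in traffic described above.
    """
    peak_instances = instances_needed(peak_rps, capacity_per_instance)
    if not autoscale:
        # Static sizing: pay for peak capacity for all 24 hours.
        return peak_instances * 24
    off_peak_rps = peak_rps * (100 - drop_percent) / 100
    off_peak_instances = instances_needed(off_peak_rps, capacity_per_instance)
    return peak_instances * burst_hours + off_peak_instances * (24 - burst_hours)


# Hypothetical numbers: 10,000 req/s at peak, 1,000 req/s per instance, 1h burst.
static_hours = daily_instance_hours(10_000, 1, 1_000, autoscale=False)
scaled_hours = daily_instance_hours(10_000, 1, 1_000)
print(static_hours, scaled_hours)
```

A model like this ignores scale-up latency, which matters when bursts arrive suddenly, so it gives a lower bound on cost rather than a sizing recommendation.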

The final step was to go live with the tested and tuned software. Knowing what was going on with the component was critical during the traffic switch. We worked on selecting and adding the most important metrics and on monitoring the system during its first hours of operation. We also needed a reliable way to quickly deploy bug fixes and reload the configuration. Previously this was a manual task that was very prone to human error.

Benefits

Our onboarding process was very quick, which was crucial given the tight schedule the client had for exposing the component to live traffic.

We know from experience how important it is to have a reproducible testing and deployment process when operating a production system. Thanks to the CI/CD pipeline that we set up together with the internal team, we reduced the risk of human error and made deployments fast, easy and safe.

The performance tests we ran allowed us to assess the capacity of the target installation and to discover the issues that were only visible under high load. During this phase, we fine-tuned the whole system, including:

  • Adding circuit breakers to protect our system when some of its dependencies malfunction.
  • Modifying and fixing 3rd party libraries used in the component.
  • Configuring the verbosity and rotation of logs to keep all the information needed while protecting the system from log overflow.
  • Increasing the system-level limits to accommodate such traffic.
  • Selecting the best EC2 instance type and validating the AWS setup.
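
To illustrate the circuit breaker idea from the list above, here is a minimal sketch in Python. It is not the project's Erlang implementation: the class name, the thresholds and the injectable clock are assumptions made for the example. After a configurable number of consecutive failures the breaker rejects calls immediately, and once a cool-down period has passed it lets a single trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    # Hypothetical defaults; real values would come from load testing.
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of waiting on a broken dependency.
                raise CircuitOpenError("dependency temporarily disabled")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self.failures = 0
        return result
```

A caller would wrap each request to a dependency, for example `breaker.call(send_request, payload)` with a hypothetical `send_request` function, and treat `CircuitOpenError` as a signal to fall back or reject the request quickly.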

During the test phase, we also introduced a number of new metrics that helped a lot with tracing issues during day-to-day operation of the system.

Results

The component was successfully deployed in 2 AWS Regions to achieve high availability. After initial fine-tuning, the deployed installation has been very stable - it has not required any action for 3 months. The metrics we introduced help the support team monitor incoming traffic and the health of the related components. Additionally, we prepared the documentation, and after the handover the internal team is able to maintain this component on its own.
