Crash Data Processing with Serverless

In this post we’d like to share our experience of introducing a new service for one of our clients. During the journey to production, we tested a few AWS storage and compute services, finding out what they’re good at and where their limits are.

The service itself was supposed to collect details of software crashes from physical devices and process/transform them asynchronously. This would aid us in troubleshooting potential bugs and provide more detail about the overall health and performance of the devices.

Zipped Lambda

S3 + Lambda + Datadog Diagram

Taking all requirements into account, it seemed to be a perfect fit for Lambda:

  • Processing is triggered by a crash upload event (async)
  • Completely unpredictable traffic (we might get 1K crashes, or none at all)
  • There is no pressure to process crashes as quickly as possible - we can live with delays (Lambda cold starts, etc.)
  • Easy processing and transformation of small JSON files that should finish in a few seconds (a minimal sketch of such a handler follows below)
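
To make that list concrete, here is a minimal sketch of what such a Lambda could look like: an S3-triggered handler that reads a small crash JSON and writes back a summarized report. The bucket layout, key prefixes, and field names are made up for the example.

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    """Triggered by S3 ObjectCreated events for uploaded crash files."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 events arrive URL-encoded
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Fetch and parse the small JSON crash report
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        crash = json.loads(body)

        # Transform it into a summarized report (fields are illustrative)
        summary = {
            "device_id": crash.get("device_id"),
            "signal": crash.get("signal"),
            "source_key": key,
        }

        # Store the summary under a separate, made-up prefix
        s3.put_object(
            Bucket=bucket,
            Key=f"summaries/{key}.json",
            Body=json.dumps(summary).encode("utf-8"),
        )
```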

After implementing basic endpoints for the upload request (generating an S3 link for the file upload), we were informed that the crash payload would include a stack trace as well as a systemd journal log file. No problem. We could attach them to a summarized crash report. However, it later turned out that stack trace preprocessing/summarization could not be handled on the device - to put it simply, the device didn’t have enough resources to do it without sacrificing core functionality.
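
As a quick aside, the upload endpoint itself boils down to handing the device a pre-signed S3 URL; a minimal sketch with boto3, where the bucket name, key layout, and expiry are placeholders:

```python
import boto3

s3 = boto3.client("s3")


def create_upload_url(crash_id: str) -> str:
    """Return a pre-signed PUT URL the device can upload its crash archive to."""
    return s3.generate_presigned_url(
        ClientMethod="put_object",
        Params={
            "Bucket": "example-crash-uploads",     # placeholder bucket
            "Key": f"incoming/{crash_id}.tar.gz",  # placeholder key layout
        },
        ExpiresIn=3600,  # link valid for one hour
    )
```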

We had to offload stack trace preprocessing to its cloud counterpart - our simple Lambda. That meant installing tools like GDB and systemd (for the journal), as well as the related SDKs.

We realized that this should most likely not be handled by Lambda, given that we would now be processing a lot of binary data instead of running a quick function.

Around that time, AWS Lambda added the possibility to use a custom container image instead of one of the pre-defined runtimes. We gave it a try.

Dockerized Lambda

Thanks to AWS allowing a dockerized Lambda, we could include all the required tools inside the image. Even though this let us parse the journal file, we still could not parse the stack trace. It turned out that, besides GDB, parsing the stack trace requires the specific version of the home-brewed SDK that the executable was compiled with.
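
For context, the heavy lifting here is essentially driving GDB against the uploaded crash artifact. A rough sketch of that step, assuming the artifact is a core dump; the helper name and paths are illustrative, and the real invocation depended on the home-brewed SDK layout:

```python
import subprocess


def extract_backtrace(executable: str, core_dump: str, sdk_sysroot: str) -> str:
    """Run GDB in batch mode to pull a full backtrace out of a core dump.

    `sdk_sysroot` points at the unpacked SDK that matches the firmware the
    executable was built with; all paths here are illustrative.
    """
    result = subprocess.run(
        [
            "gdb", "--batch",
            "-ex", f"set sysroot {sdk_sysroot}",  # resolve symbols against the SDK
            "-ex", "bt full",                     # full backtrace of the crashing thread
            executable,
            core_dump,
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```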

An additional twist down the road: each SDK was about 1.5 GB in size. After downloading one, we had to install it, which required even more storage.

So what is the problem?

Lambda offers only 512 MB of storage space - not enough for our needs. However, we have something in our toolbox to address that: EFS.

EFS

S3 + Lambda (with EFS volume) + Datadog Diagram

We came up with the idea of attaching an EFS volume to the Lambda and letting it download/install whatever it wants inside that volume. That would drop the burden of worrying about running out of disk space, and give us quick access to whichever SDK a given crash file required. EFS is also persistent, so we could cache installed SDKs between runs. It sounded very promising and required little work.
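
The idea, sketched below, was to treat the EFS mount as a lazily populated SDK cache. The /mnt/sdks mount point and the install script are invented for the example:

```python
import os
import subprocess

# EFS mount path configured on the Lambda function; the prefix is illustrative
SDK_ROOT = "/mnt/sdks"


def ensure_sdk(version: str) -> str:
    """Install the given SDK onto the EFS volume unless a previous run already did."""
    sdk_dir = os.path.join(SDK_ROOT, version)
    if os.path.isdir(sdk_dir):
        return sdk_dir  # cached by an earlier invocation

    # Hypothetical installer shipped alongside the home-brewed SDK
    subprocess.run(
        ["/opt/tools/install-sdk.sh", version, sdk_dir],
        check=True,
    )
    return sdk_dir
```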

After attaching the storage and trying to process our first crash, we ran into something peculiar. Lambda would run the installation process, but time out after a few seconds. Of course! We needed to increase the timeout! We started with 3 minutes, increased it to 6, and ended up at the maximum of 15 minutes. The memory limit was increased to the maximum as well. The result was still the same: timeout.
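
For completeness, bumping those limits is a single call against the Lambda API; a sketch with boto3, where the function name is a placeholder and 900 seconds / 10,240 MB are the service’s current hard caps:

```python
import boto3

lambda_client = boto3.client("lambda")

# Push the function to its hard limits: 15-minute timeout, 10 GB of memory
lambda_client.update_function_configuration(
    FunctionName="example-crash-processor",  # placeholder name
    Timeout=900,        # seconds; the Lambda maximum
    MemorySize=10240,   # MB; the Lambda maximum
)
```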

What’s interesting is that running the Docker image locally installed the SDK in about 3 minutes (on a single core).

A quick test of moving the Docker image from Lambda to ECS confirmed our suspicion: EFS was way too slow. On a running ECS container, the SDK took about an hour to install. Yes, an hour. To confirm that EFS was the problem, we ran a quick test with the same code on EC2 backed by EBS.

Last chance to stay “without servers” - Fargate

S3 + SQS + Fargate + Datadog Diagram

After confirming that EFS speed would not be sufficient, we abandoned the idea of keeping all SDKs on a single volume. Thankfully, Fargate provides about 20 GB of storage for each container (it’s actually closer to 17 GB usable), which should be more than enough for a single SDK. When we ran our container using its internal storage, we finally got an install that took about 3 minutes - good enough for now.
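
Since the new flow is queue-driven (S3 event → SQS → long-running Fargate task), the worker loop is roughly the sketch below; the queue URL and the processing internals are placeholders:

```python
import json

import boto3

sqs = boto3.client("sqs")

# Placeholder queue receiving S3 upload notifications
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/example-crash-queue"


def poll_forever():
    """Long-poll the crash queue and process each uploaded file inside the container."""
    while True:
        response = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,  # long polling keeps the loop cheap while idle
        )
        for message in response.get("Messages", []):
            event = json.loads(message["Body"])
            process_crash(event)
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=message["ReceiptHandle"],
            )


def process_crash(event):
    # Placeholder for the real work: download the crash from S3, install the
    # matching SDK onto the task's local storage, run GDB, report to Datadog.
    ...
```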

The biggest lesson

Mind your storage requirements! The serverless world is tricky in that regard. With Lambda, we only get 512 MB of storage. In case you need more, an EFS volume might be an option. However, as we saw in our example, it works best for storing data, not for installing or running applications off of it. Finally, there is Fargate, which worked best in our case: it gave us enough storage, and running our code as a container didn’t require a lot of refactoring.

Service    Storage Limit
Lambda     512 MB
EFS        Unlimited
Fargate    ~20 GB

What is next?

The service is good enough for now, but it consumes too many resources. We will need to optimize it soon, and here are our options:

  • Migrate to EC2. Since the service is not mission critical, we don’t have to be serverless here. Migrating to EC2 would definitely help us with storage, since we could add a 100 GB EBS volume and juggle a bunch of SDKs. Unfortunately, EC2 instances are not as easily provisioned as ECS containers.

  • Prebaked SDKs on an EFS volume. Every new SDK would be asynchronously installed on an EFS volume (even if it took an hour) and later used anywhere we wanted.

  • Prebaked SDK containers. A custom container for each SDK. This means we wouldn’t have to waste time installing anything - we’d simply run an ECS processing task with the specific SDK container (see the sketch after this list).

  • S3FS (https://github.com/s3fs-fuse/s3fs-fuse). Mount an S3 bucket as a disk inside the container.
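
To give a feel for the prebaked-SDK-container option mentioned above, here is a rough sketch of launching a per-SDK Fargate task via the ECS API; the cluster, task definition, container name, and network settings are all placeholders:

```python
import boto3

ecs = boto3.client("ecs")


def start_processing_task(sdk_version: str, crash_key: str) -> None:
    """Launch a Fargate task whose image already contains the right SDK.

    Cluster, task definition, container name, and network settings are placeholders.
    """
    ecs.run_task(
        cluster="example-crash-cluster",
        launchType="FARGATE",
        taskDefinition=f"crash-processor-sdk-{sdk_version}",  # one definition per SDK image
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],
                "assignPublicIp": "DISABLED",
            }
        },
        overrides={
            "containerOverrides": [
                {
                    "name": "processor",
                    "environment": [{"name": "CRASH_KEY", "value": crash_key}],
                }
            ]
        },
    )
```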

Eventually we will need to tackle this problem. If you’d like to know how we end up solving it, follow us to get updates on the story!