How would you debug a highly concurrent system?

By Wojciech Gawroński | May 28, 2018


Debugging Concurrency

Let me tell you a story. We had deployed a new version of our software to the whole fleet: hundreds of machines in four different regions around the world. We watched our metrics, inspected our logs - nothing there. Success, another deployment without any issues.

Fast forward two days, and we received a customer support ticket to investigate. One of our integration partners complained that they were observing a significant number of errors when calling our endpoints, in two distinct regions. We jumped in and began an investigation. Metrics for that particular partner were elevated, but mostly okay - everything within our thresholds. After some back and forth in the ticket comments, we discovered that there were some crashes, not frequent enough to trip our monitoring at that particular moment. We faced a situation where we needed to inspect a live node to figure out what exactly was happening there.

What can you do in such a situation? You will probably switch to a trace or debug log level and try to distill the truth from a pile of logs. You can also compile a debug version of your binary (extended with additional logging facilities), deploy it to one host and either investigate metrics, dive into the logs again, or even use gdb on a live process if you are brave and experienced enough to do it.

That is not the only thing we could do. In our case, we could leverage the run-time. We enabled a tracing facility in the virtual machine running that component. And guess what? It handled it without any effort under the typical load of a few hundred transactions per second. It feels like looking under the hood of your car during a fast ride on the highway - except this is totally safe and does not require any stunt skills.

After setting up a few traces and waiting for that particular situation to reappear, we discovered the exact code path that hit our client. For those of you who are interested, here is the explanation. A funny thing is that some HTTP client libraries crash badly when you send a body together with an HTTP status code of 204, No Content. I encourage you to check how your library behaves in such a case - just do not try this on your production system.
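If you would like to check that without touching production, here is a minimal sketch (standard OTP only, hypothetical module name) that spins up a throwaway socket answering every request with a 204 plus a body, and then points httpc at it - swap in whichever HTTP client your system actually uses:

-module(no_content_check).
-export([run/0]).

%% Start a throwaway TCP listener on an ephemeral port, answer the first
%% request with a deliberately broken "204 + body" response, and call it
%% with httpc to observe how the client copes.
run() ->
    {ok, Listen} = gen_tcp:listen(0, [binary, {active, false}, {reuseaddr, true}]),
    {ok, Port} = inet:port(Listen),
    spawn(fun() -> serve(Listen) end),
    {ok, _} = application:ensure_all_started(inets),
    Url = "http://127.0.0.1:" ++ integer_to_list(Port) ++ "/",
    httpc:request(get, {Url, []}, [], []).

serve(Listen) ->
    {ok, Socket} = gen_tcp:accept(Listen),
    {ok, _Request} = gen_tcp:recv(Socket, 0),
    %% A 204 No Content response must not carry a body - this one does.
    Response = <<"HTTP/1.1 204 No Content\r\n",
                 "Content-Length: 2\r\n",
                 "Connection: close\r\n",
                 "\r\n",
                 "ok">>,
    ok = gen_tcp:send(Socket, Response),
    gen_tcp:close(Socket).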

What is tracing?

You may say that I just described a fairy tale. There is no run-time where you can enable a debugging facility like that without a performance penalty. In the case of a highly concurrent and distributed system, there is no way to handle such a thing gracefully. Except there is - our secret sauce, in that case, was the Erlang VM, often called the BEAM. Let’s talk about what sorcery it is.

It is called tracing. It is an additional debugging facility built into the run-time that allows you to capture a stream of events exposed by the VM. Almost any action that your run-time performs, from a function call to routine tasks like garbage collection, can be captured and analyzed with the power of your scripting skills and the libraries/tools shipped alongside your application.
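Here is what a basic session looks like with redbug, one of the libraries that wrap the raw tracing facilities in a safer interface (module and function are placeholders):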

1> redbug:start("module:function->return").
{156,1}

2> module:function(some, parameters).
% 12:13:10 <0.42.0>({erlang,apply,4})
% module:function(some, parameters)

% 12:13:10 <0.42.0>({erlang,apply,4})
% module:function/2 -> ok
ok
3>

Imagine a separate facility observing the running code of your application. That component, which is entirely event-based, sends tons of events into a sink.

It is like observing the execution with various probes and measuring tools. In most cases, you can ignore the performance impact entirely, because there is no such risk unless you do something really wrong as an operator. Then you are at risk of overwhelming the virtual machine, but those are sporadic cases - we cannot avoid every kind of human mistake.
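For the curious, this is roughly the raw mechanism that tools like redbug build on: you nominate a process as the tracer - the sink mentioned above - tell the VM which calls to match, and the events arrive as ordinary messages. The module below is a hypothetical sketch of those primitives, not how redbug itself is implemented:

-module(trace_sink).
-export([demo/0]).

demo() ->
    %% The sink: an ordinary process that receives trace events as messages.
    Sink = spawn(fun loop/0),
    %% Ask the VM to report function calls made by this process to Sink...
    erlang:trace(self(), true, [call, {tracer, Sink}]),
    %% ...but only the calls that match this pattern.
    erlang:trace_pattern({lists, seq, 2}, true, [local]),
    lists:seq(1, 3),
    timer:sleep(100),
    %% Switch tracing off again.
    erlang:trace(self(), false, [call]),
    ok.

loop() ->
    receive
        {trace, Pid, call, {Module, Function, Args}} ->
            io:format("~p called ~p:~p/~p~n", [Pid, Module, Function, length(Args)]),
            loop();
        _Other ->
            loop()
    end.

In practice you rarely touch these primitives directly - wrappers such as redbug, recon_trace or dbg add pattern parsing and rate limiting on top - but the flow is always the same: match a pattern, route the events to a sink, analyze them.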

Why tracing?

You may ask why anyone would design and implement a facility like that in the run-time in the first place. And that would be an excellent question.

Almost 30 years ago, when Ericsson invested their time in designing Erlang, they needed a platform that they could inspect live. They searched for a solution that could be debugged remotely, without introducing a performance impact. Why? Because they had been in the telecommunication business for decades. And you cannot cut people off from the phone network, especially when it comes to such critical elements as emergency calls.

By creating their own language and virtual machine, they achieved that goal, which was more than critical for their systems back in the day. Handling concurrent phone calls on a telecommunication switch is a challenging task, and they needed a dedicated approach to it. A phone call exchange is a perfect fit for the actor model, and the Erlang VM internally leverages that concurrency paradigm. However, that is a topic for another article.

Fast forward to today, and it is not only telecoms that need such functionality. Many systems, even if they are not as critical as emergency calls, need to operate 24/7 with as little downtime as possible, and you still need to investigate what happens when stuff goes haywire. And I believe that I do not have to explain to you how complicated and concurrent our engineering reality is.

Conclusion

At Pattern Match, we firmly believe that Erlang and Elixir offer many unique value propositions for any domain or business. Being able to observe and inspect your live application and the details of the VM in production is one of those things. When stuff is on fire, the insights provided by such tools are priceless.

If I had to choose only one thing to transplant from the BEAM to any other run-time or platform, it would be tracing. It is the capability I miss most in every other environment.
