Observability-first Approach

Tracing as a tool is immensely productive. It cannot be overestimated - especially in production, but also in development when debugging our own mistakes. In the previous article I covered why we should use that facility and why they invented it in the first place.

After first sight and couple of rounds, you fall in love with that tool. Right after you gain some experience, more questions start to arise. Curiosity is the primary driver of evolution in our industry. Moreover, that happened to us too. During one of the workshops that we prepared (it was in 2017 either on ElixirConf EU or ElixirConf US), someone rightfully asked that question: Can I trace everything?.

An immediate answer is yes, but it turned out that it is not so simple.

What we can and what we cannot trace?

From the perspective of experienced Erlanger answer is simple: you can trace almost everything, and we will return to that statement in a while. First, I would like to step back and cover a different point of view.

As that question appeared on ElixirConf I immediately answered yes, but …, because I have prepared for a different variant of that question. One of the reasons why we love Elixir so much (does not matter if you are an Erlang developer or Elixir aficionado) are macros. Hygienic, safe, based on real abstract syntax trees and transformation of those. Also, they bring a lot to the table, but there is no such thing as free lunch in IT.

Elixir macros

Macros are disappearing on the compilation step. It is quite apparent when you think about it, but a side-effect and consequence of that is that we cannot hook on such trace pattern. We describe that term in more details later in the article. For now, you can think about it as a place where we can hook to trace the execution.

That’s sad especially taking into account that many libraries/frameworks are heavily relying on macros in Elixir world. In that case, macros expansion is your best friend.

If you are interested, I encourage you to look into the documentation of Marco module, especially into function expand_once/2:

iex> h Macro.expand/2
iex> h Macro.expand_once/2

Compiler, oh Compiler!

Unfortunately, we did not finish with Elixir, as there is one more catch there: you cannot trace functions that are rewritten by the compiler or inlined. In many cases, it affects functions that are very lightweight wrappers around other elements of Elixir or Erlang standard library. Why did I mention two ways here?

There are two types of inlining mechanisms available: automatic rewrite by the compiler and proposal via annotation. Inlining with use of annotation is simple. You can picture it almost like 1:1 text-based replacements when we are replacing a particular function call with its body. There is no way to reorder, neither to rewrite arguments.

Another example is called an automatic rewrite. The complete implementation is inside a separate module called elixir_rewrite. It is the more advanced tool, than one mentioned above, and it can do much more, e.g., rearrange arguments. A canonical example of inlining is String.to_integer/1:

You cannot trace that particular function because compiler inlines it to the :erlang.binary_to_integer/1.

When it comes to inlining via annotation, you can find the full list of affected functions with this simple Github query.

So, Erlang is perfect, isn’t it?

Are we there yet? As tracing is built-in on some fundamental level in Erlang VM we should be able to trace everything there. Not so fast, it is not so apparent here as well.

Before we discuss what we cannot trace, we have to look into more details of trace patterns and tracing mechanism.

The one side note: we would not be ourselves if we did not mention that some tools based on tracing do not treat Erlang records as first-class citizens. Instead, you have use tuples with multiple elements. So redbug you are a great tool, but we are looking at you right now.

Who watches the watchman?

Let’s try to reverse engineer the process. To answer that question, let’s analyze what exactly happens when we enable a trace session?

First, a new process called tracer process is spawned. Such process receives messages to its mailbox based on the filter that we set up. Those events are processed, and can be collected, aggregated, printed out - any processing you can perform with use of the Erlang shell and programming language.

A filter mentioned above, created and passed down to the tracer process is called a trace pattern. It is a domain specific language that leverage Erlang terms (tuples, lists, and atoms) and describes what kind of events are we are interested in. In other words, based on that language we are informing virtual machine exactly which events send to a tracer process mailbox.

In that description, we already have two clues what may not be traceable. First one is that we are using a BIF for spawn a tracer process, so maybe we cannot trace Erlang built-in functions. The second one touches trace pattern language. Is it possible to trace such low-level elements of the language, which we use for constructing a filter for our incoming events?

The answer is not as obvious, as you may think.

BEAM me up, Scotty!

We can check both hypothesis experimentally, when trying to setup a trace pattern using only one function:

erl> erlang:trace_pattern({erlang, is_binary, 1}, true).
erl> erlang:trace_pattern({erlang, hd, 1}, true).
erl> erlang:trace_pattern({erlang, apply, 3}, true).
erl> erlang:trace_pattern({lists, reverse, 1}, true).

A zero means that we could not find any matching function for that particular trace pattern (but, clearly such function exists). You can see that you can trace some BIFs (last two are confirming that we found a 1 matching function to a given trace pattern), but not all of those. If you try that with any NIFs you get positive feedback too - so our first clue is wrong: we definitely can trace NIFs, and at least we can trace some of BIFs too.

Let’s investigate it even more. Knowing that trick, we can iterate over available BIFs and distill a list of those where we cannot setup a trace pattern. That list looks suspiciously similar to the elements supported in guard clauses and functions supported inside the trace pattern language:

`'*'/2, '+'/1, '+'/2, '-'/1, '-'/2, '/'/2, '/='/2, '<'/2, '=/='/2, '=:='/2, '=<'/2, '=='/2, '>'/2, '>='/2, abs/1, 'and'/2, 'band'/2, binary_part/2, binary_part/3, bit_size/1, 'bnot'/1, 'bor'/2, 'bsl'/2, 'bsr'/2, 'bxor'/2, byte_size/1, 'div'/2, element/2, float/1, hd/1, is_atom/1, is_binary/1, is_bitstring/1, is_boolean/1, is_float/1, is_function/1, is_function/2, is_integer/1, is_list/1, is_number/1, is_pid/1, is_port/1, is_record/2, is_record/3, is_reference/1, is_tuple/1, length/1, node/0, node/1, 'not'/1, 'or'/2, 'rem'/2, round/1, self/0, size/1, tl/1, trunc/1, tuple_size/1, 'xor'/2`

Almost, because trace patterns do not support binary_part/2 and binary_part/3 . They are supported in guard clauses though, as you can match binaries there. I was so close to the final answer that I could not surrender here. I had to dive into the BEAM source code.

It turned out that truth is even more complicated.

After finding those three places inside Erlang VM source code, it turned out that Erlang compiler inlines those functions we discovered above. By inlines, I mean that there is no context defined for them. As a consequence of that, there is no way to trace them. In order to obtain events from a function call/event Tracer BIF has to fetch context information left by the virtual machine. For those functions, there is no such context provided. For any other functions, including other BIFs, virtual machine provides it.

Such context is a specialized construct on the BEAM level, formally called a trace wrapper. That element provides context for the tracing BIF. As stated above, functions supported by a trace patterns language are not generating those, ergo: we cannot hook there with the tracing subsystem. It means that any other BIF is generating a separate trace wrapper, which means that we can trace it in the end.

Why? We can speculate on that, but one of the comments left in the source code gives a viable clue. If you try to trace a function which we used in the trace pattern, you introduce infinite recursive calls and eventually crash the node. If we would like to filter out such calls, it increases the complexity of tracer and possibly, introduce a performance penalty.

Knowing exactly how it works internally, we can extend our question: what else does not create a trace wrapper? It turns out that not so many things are left. The additional element (which again, is evident if you think about it) is an anonymous function. You cannot hook into those, so that is a reason why more experienced Erlangers are convincing you to structure the code differently. Use anonymous functions with caution. Otherwise, trouble appears when you miss a hook in that particular place.


Does any of those elements detract value from tracing? Not at all. In our opinion, those are quirks, something we use very rarely. In most cases, it depends on your code’s structure and organization, which can improve over time. After the small change, it enables limitless possibilities of tracing even in that place.

Knowledge of how it works internally is not required to use tracing either. That is a lot of intrinsic and information to process. I must admit, that I spent more than an afternoon digesting and distilling materials for that article. However, in the end, we are pleased with the outcome. Someone finally did that work, and thanks to that maybe someone else will not have to spend an evening diving into BEAM source code to find out the truth.

Veteran Elixir/Erlang Team Available

Are looking for Elixir or Erlang experts? You are in the right place! We truly love working with that technology, and as a side effect, it turned out that we have mastered it.

Schedule a call with our expert