How to deal with CDATA in xmerl?

By Wojciech Gawroński | January 21, 2019

The pain of fiddling with XML via xmerl

Let’s agree that the official library, xmerl, is far from perfect: it does not ship sane defaults for DTD handling (like XML entities) and has deficiencies when it comes to XSD validation. On the other hand, it contains exciting stuff like the feature documented by Brujo Benavides here.

There are much better alternatives. The best one currently is exml, or its fork used internally by MongooseIM (which does a lot with XML) for its XMPP protocol implementation.

However, there are some justified cases when you have to deal with xmerl, mostly for legacy reasons. Fortunately, the shortcomings mentioned above are less important, and you are able to live with them. There is one feature, however, which xmerl does not have, and it is critical.

You are able to parse a CDATA section, but you cannot write it out. 😱

How old is that problem? The first mention you will find when searching for the phrase points here, to an old thread from the official mailing list.

What is CDATA and why is it important?

Let’s check the official documentation here.

According to the W3C standard, CDATA is the following:

CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string “<![CDATA[” and end with the string “]]>”

Those sections are relevant when you need to pass characters that should not be interpreted as part of the XML tree. The only sequence they cannot contain is the termination token ]]>.

In many cases you can work around that problem by encoding XML entities (e.g. < is encoded as &lt;), but not in all cases.

Problem: Auth-Info Code

What is an AuthInfo? In the domain industry, it is a way of verifying the identity of a domain owner:

An Auth-Code (also called an Authorization Code, Auth-Info Code, or transfer code) is a code created by a registrar to help identify the domain name holder (also known as a registrant or registered name holder) of a domain name in a generic top-level domain (gTLD) operated under contract with ICANN.

In other words, before invoking a possibly destructive operation, such as a domain transfer, a registrar will ask you to provide the authorization code as a confirmation.

Most domain registrars base their APIs on a standard called EPP, which is XML-based.

What does an example AuthInfo code look like? The EPP standard (RFC 5731) defines this element as being of XML type eppcom:pwAuthInfoType, which is itself defined in RFC 5730 as, in summary, an XML normalizedString.

Basically, it can be a string of any length and any characters (except three: newline, carriage return, and tab), so this one also fits:

qwerty<>&+12345.foobar

Ouch. Now you can see the problem: to send such a code to an XML-based API that does not expect us to encode XML entities, we need to send it inside a CDATA section.

What if I need to deal with CDATA?

Are we doomed? Luckily, xmerl offers one mechanism that we can leverage, called callbacks.

How does it work? xmerl allows us to pass our own callback implementations when serializing (using export/2 or similar). To do that, we need to create our own callback module.

One important part of such a module is the '#xml-inheritance#'() function, which allows us to reuse the implementation already defined in xmerl and just add our CDATA support on top.

Now, in order to write out XML with CDATA, we need to invoke the serialization function with our module, like this:
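A minimal sketch of both pieces, assuming a hypothetical module name `xmerl_cdata_naive`, a `cdata` tag used as the marker element, and a hypothetical `to_xml/1` wrapper around `xmerl:export_simple/2` (one of the functions similar to export/2):

```erlang
-module(xmerl_cdata_naive).

%% xmerl asks every callback module for its parents through this function.
-export(['#xml-inheritance#'/0]).
%% Handler invoked for every element whose tag is `cdata`.
-export([cdata/4]).
%% Hypothetical convenience wrapper, not part of xmerl.
-export([to_xml/1]).

%% Fall back to the stock XML callbacks for everything we do not handle.
'#xml-inheritance#'() -> [xmerl_xml].

%% Data is the already exported content of the <cdata> element.
cdata(Data, _Attrs, _Parents, _Element) ->
    ["<![CDATA[", Data, "]]>"].

%% Serializes simple-form content with this module as the callback.
to_xml(Content) ->
    lists:flatten(xmerl:export_simple(Content, ?MODULE)).
```

Calling `xmerl_cdata_naive:to_xml([{pw, [{cdata, ["a<b"]}]}])` does produce a CDATA section, but the text inside it comes out as `<![CDATA[a&lt;b]]>`.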

God Dammit!

Why do I see escaped XML entities here?

One more problem

Unfortunately, by default the whole content passed to export/2 or related functions is escaped before it is passed to the callbacks. So as an argument of cdata/4 we receive a string with escaped XML entities. Luckily, it escapes only &, < and >, so we can reliably unescape it:
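One way to sketch this, assuming a hypothetical helper module `xml_unescape_sketch` and OTP 20 or newer for `string:replace/4`:

```erlang
-module(xml_unescape_sketch).

-export([unescape/1]).

%% Reverses the three entities produced by xmerl's text export.
%% The &amp; pass runs first, so entities that were already encoded
%% by the caller (and escaped once more by xmerl) get decoded too.
unescape(Data) ->
    lists:foldl(
      fun({Entity, Char}, Acc) ->
              %% string:replace/4 returns a list of chunks; flatten it
              %% back into a plain string before the next pass.
              lists:flatten(string:replace(Acc, Entity, Char, all))
      end,
      lists:flatten(Data),
      [{"&amp;", "&"}, {"&lt;", "<"}, {"&gt;", ">"}]).
```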

One additional pass for &amp; at the beginning is necessary in case someone actually passes a string with already encoded XML entities to the serialization.

After applying that fix, we can finally cheer and use that library to solve our problem described above:
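A sketch of the end result, assuming the hypothetical module name `xmerl_cdata_sketch`, a `cdata` marker tag, and a hypothetical `auth_info/1` helper that renders the `domain:authInfo` fragment from RFC 5731 without its namespace declarations:

```erlang
-module(xmerl_cdata_sketch).

-export(['#xml-inheritance#'/0, cdata/4]).
-export([auth_info/1]).

%% Inherit every other callback from the stock XML exporter.
'#xml-inheritance#'() -> [xmerl_xml].

%% Undo xmerl's escaping before wrapping the text in a CDATA section.
%% Note: a code containing "]]>" would still break the section.
cdata(Data, _Attrs, _Parents, _Element) ->
    ["<![CDATA[", unescape(Data), "]]>"].

%% Decodes &amp; first, then &lt; and &gt; (needs string:replace/4, OTP 20+).
unescape(Data) ->
    lists:foldl(
      fun({Entity, Char}, Acc) ->
              lists:flatten(string:replace(Acc, Entity, Char, all))
      end,
      lists:flatten(Data),
      [{"&amp;", "&"}, {"&lt;", "<"}, {"&gt;", ">"}]).

%% Hypothetical helper: renders the Auth-Info fragment as a flat string.
auth_info(Code) ->
    Content = [{'domain:authInfo', [{'domain:pw', [{cdata, [Code]}]}]}],
    lists:flatten(xmerl:export_simple_content(Content, ?MODULE)).
```

With this module, `xmerl_cdata_sketch:auth_info("qwerty<>&+12345.foobar")` yields `<domain:authInfo><domain:pw><![CDATA[qwerty<>&+12345.foobar]]></domain:pw></domain:authInfo>`.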

Summary

Phew. That’s all. Enough struggling with xmerl. If you are forced to deal with this library like us and you want to avoid doing such acrobatics on your own, we have combined them in a helper library which we called (surprisingly!) xmerl_ext, and it is available here:

Enjoy! And remember: friends do not let friends use xmerl for XML manipulation in Erlang.
