Strings and binaries

Posted in Embedded C/C++

Strings are everywhere; they are all around us, even now in this very post. You can see them when you look on your screen, or when you open up a book. You can feel them when you use JSON, when you use XML, when you use YAML.

That is the lot of the modern day software engineer. And from many perspectives strings are great – they are flexible, verbose, easy to read and can encode complex types. Most modern languages have great support for working with strings, from built-in types to a host of useful functions and libraries that can process even the most complex strings. The common string based data formats (such as the aforementioned JSON, XML or YAML) are widely used, have libraries in almost any language and are easy to implement and debug.

But what about the costs? Strings cost us in execution time, in memory storage and in large and complex libraries to parse them. When running on a server or even a mobile device with ample CPU and memory capacities, using a high level language (be it Swift, C#, Java, Python, Ruby or any other) that already gives you all the libraries needed to handle the strings and the structures they encode, it is easy to get a warm fuzzy feeling when using strings as the go-to data structure. I’ve even seen production code that saves numbers as strings.

For the embedded environment, strings pose two major problems:

  1. String types and parsing libraries are typically not a built-in part of the languages commonly used in embedded code (assembly, C and C++). Although there are excellent libraries for those languages, especially C++ (from std::string to Boost), they are typically large in code size and complex, and it is difficult to pick just the features one needs (as opposed to taking the entire library and not using most of it).
  2. Strings take up a lot of memory, especially when encoding data in a more human-readable format (such as JSON). Naturally, when you need to encode an actual string (say, a user’s name) you don’t have much of a choice, but almost any other data type takes more memory when encoded as a string.
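To put a rough number on the second point, here is a minimal sketch that measures how many bytes a 32-bit integer needs as decimal text. The helper name `decimal_text_size` is made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: number of bytes needed to store `value` as decimal
 * text, not counting a terminator or any surrounding syntax such as
 * JSON quotes and key names. */
static size_t decimal_text_size(int32_t value)
{
    char buf[16]; /* enough for "-2147483648" plus the terminator */
    int n = snprintf(buf, sizeof buf, "%ld", (long)value);
    assert(n > 0 && (size_t)n < sizeof buf);
    return (size_t)n;
}
```

For example, `decimal_text_size(-1000000000)` gives 11, while the same value stored as an `int32_t` always takes 4 bytes.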

Which leads us to the ugly cousin of strings – binary data. Binary data is quite the opposite of strings – it is rigid, compact and difficult to read. It typically takes less memory and doesn’t require complex libraries to handle, and any parsing that does need doing is typically very fast.

The typical embedded environment enjoys the benefits of binary data while not suffering too much from its disadvantages (due to the nature of embedded systems).
We will explore two common requirements for handling complex data – configuration and communication, comparing string and binary data.

For configuration, you typically have a key-value store of items. The keys are the configuration items and the values are the possible options for each item, varying from binary (true/false), through numerical, to strings. The configuration for a large system can become quite complex, with conditional items, references to other configuration stores, etc. For the sake of the discussion, we will present a simple configuration, considering that embedded systems are typically smaller and have fewer configuration items (although the same reasoning can be applied to more complex configurations as well).

Let’s assume the following configuration:

gadget_a_enabled – binary true/false
gadget_a_value_1 – numerical, from 0 to 100
gadget_a_value_2 – numerical, from -1,000,000,000 to 1,000,000,000
gadget_a_value_3 – ASCII string, under 200 characters long

If we want to define this configuration in a JSON format, it would look something like this:
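As an illustration only, one possible JSON encoding is sketched here. The key names are taken from the list above, and the values are assumptions chosen to match the binary example further down (33, 123456789 and "my string value"). The exact character count depends on the names and values chosen, so it will not necessarily match the figures quoted in the text.

```json
{
  "gadget_a_enabled": true,
  "gadget_a_value_1": 33,
  "gadget_a_value_2": 123456789,
  "gadget_a_value_3": "my string value"
}
```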

Omitting whitespace, this takes 85 characters. Using real variable names would increase this. There are many other ways to encode this configuration, taking more or less space than the above example. In addition, different configuration values will change the length of the data; even “true” and “false” differ in length. More verbose variable names, while increasing readability, will also increase the data length. Keeping whitespace will typically explode the size of the data (unless you are the sort of person to use tabs instead of spaces, in which case the increase will be more modest (at the cost of condemning your soul to hell)), more than doubling the data size for the above example, to 191 characters. This is typically done when configuration is stored in files. Most sane engineers will quickly parse the JSON (or other encoding) to an internal structure, dropping any whitespace (and comments, if any).

Encoding this configuration in a typical binary format will look like this (in hexadecimal notation):
0121075BCD15106D7920737472696E672076616C756500
This takes 23 bytes. The length of the data depends on the number of characters in the string value of gadget_a_value_3, but on nothing else.
But how do you work with binary data? C (and C++) structures are extremely useful here. This is a behavior that is much more difficult to reproduce in higher-level languages (like C#, Java, Python and the like), often involving serialization and deserialization.

A structure to easily handle the above configuration would look like this:
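A minimal sketch of such a structure follows. The field names mirror the configuration items above; `value_3_len` and the use of `__attribute__((packed))` (a GCC/Clang extension) are assumptions made for this sketch, so that the layout matches the byte stream without padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed-size part of the configuration record. The string bytes
 * (value_3_len of them) follow directly after the length byte.
 * packed is a GCC/Clang extension, assumed here, that removes padding. */
struct __attribute__((packed)) gadget_a_config {
    uint8_t enabled;      /* 0 = false, 1 = true */
    uint8_t value_1;      /* 0 to 100 */
    int32_t value_2;      /* -1,000,000,000 to 1,000,000,000 */
    uint8_t value_3_len;  /* bytes in the string, including terminator */
    char    value_3[];    /* C99 flexible array member */
};

/* Fail the build if the compiler inserted padding after all. */
static_assert(sizeof(struct gadget_a_config) == 7, "unexpected padding");
static_assert(offsetof(struct gadget_a_config, value_2) == 2, "unexpected padding");
```

Without the packed attribute, most compilers would align `value_2` to a 4-byte boundary and the structure would no longer line up with the data.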

Reading the configuration data is as simple as casting the binary data array to the structure’s pointer:
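A hedged sketch of the read is given here. It copies the bytes into the structure with `memcpy` rather than casting the pointer directly, which sidesteps alignment and aliasing concerns while keeping the same idea. It also assumes the 4-byte value is stored most significant byte first, as the hex dump is written. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct __attribute__((packed)) gadget_a_config {  /* GCC/Clang extension */
    uint8_t enabled;
    uint8_t value_1;
    int32_t value_2;
    uint8_t value_3_len;   /* the string bytes follow this header */
};

/* The 23 bytes from the example configuration. */
static const uint8_t example_config[] = {
    0x01, 0x21, 0x07, 0x5B, 0xCD, 0x15, 0x10,
    0x6D, 0x79, 0x20, 0x73, 0x74, 0x72, 0x69, 0x6E,
    0x67, 0x20, 0x76, 0x61, 0x6C, 0x75, 0x65, 0x00
};

/* Convert a value stored most significant byte first (assumed here)
 * into the host's native integer, whatever the host byte order is. */
static int32_t from_big_endian32(int32_t stored)
{
    uint8_t b[4];
    memcpy(b, &stored, sizeof b);
    uint32_t v = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
                 ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    return (int32_t)v;
}

/* Fills `out` from `data` and returns a pointer to the string bytes,
 * or NULL if the buffer is too short for the header or the string. */
static const char *read_gadget_a_config(const uint8_t *data, size_t len,
                                        struct gadget_a_config *out)
{
    if (len < sizeof *out)
        return NULL;
    memcpy(out, data, sizeof *out);
    out->value_2 = from_big_endian32(out->value_2);
    if (len - sizeof *out < out->value_3_len)
        return NULL;
    return (const char *)(data + sizeof *out);
}
```

Calling `read_gadget_a_config(example_config, sizeof example_config, &cfg)` yields `enabled` 1, `value_1` 33, `value_2` 123456789 and a pointer to "my string value".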

The configuration data shown above is broken up as follows:
01 21 075BCD15 10 6D7920737472696E672076616C756500
01 – the binary field takes just 1 byte, with 1 being true and 0 being false.
21 – since gadget_a_value_1 can only be between 0 and 100, a single byte is enough to represent it (0x21 is 33 in decimal).
075BCD15 – since gadget_a_value_2 can have large values, we need 4 bytes to encode it. Since we need to encode negative values as well as positive values, we need to use a signed integer (0x075BCD15 is 123,456,789, written here most significant byte first).

String encoding in binary data is typically handled in one of two ways – either the maximal number of characters is always allocated for the string (in our example, 200 characters); or the length of the string is encoded before the start of the string, with only that number of characters following the length. Knowing the limit on the length of the string allows choosing the smallest size for the length variable (so in our case 1 unsigned byte is enough). Fixed length strings in binary data are easier to parse, but take more space and are more appropriate for short strings that typically take most or all of the length. Dynamically sized strings are more complex to parse, especially if there are many of them, but they can be more space efficient.
In the above binary example, the string is dynamically sized, with the length of 16 characters (0x10, including the C-style NUL terminator) encoded before the string itself.
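The trade-off between the two approaches can be sketched as follows. `MAX_STR` and both function names are assumptions made for this illustration, and the length-prefixed variant counts the terminator in the length, as the binary example does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_STR 200  /* size limit from the example configuration */

/* Fixed-length encoding: always writes MAX_STR bytes, zero padded.
 * `out` must have room for MAX_STR bytes. */
static size_t encode_fixed(uint8_t *out, const char *s)
{
    size_t n = strlen(s);
    assert(n < MAX_STR);            /* leave room for the terminator */
    memset(out, 0, MAX_STR);
    memcpy(out, s, n);
    return MAX_STR;
}

/* Length-prefixed encoding: one length byte, then only the bytes used.
 * `out` must have room for 1 + strlen(s) + 1 bytes. */
static size_t encode_prefixed(uint8_t *out, const char *s)
{
    size_t n = strlen(s) + 1;       /* include the terminator */
    assert(n <= UINT8_MAX);         /* length must fit in one byte */
    out[0] = (uint8_t)n;
    memcpy(out + 1, s, n);
    return 1 + n;
}
```

For "my string value", the fixed encoding takes 200 bytes and the length-prefixed one takes 17.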

Parsing the dynamic string (and more importantly, any additional variables after it) is done in steps:

  1. Break down the structure definition of the data into as many structures as you have dynamic strings. Each such structure will end with the dynamic string.
  2. Cast the data pointer to the first part of the structure, as before.
  3. To handle the string, you have the length of the string as well as the pointer to the start of the string which can be used for any string handling function (for example, strncpy).
  4. To access the next structure, add the size of the previous structure and the number of characters in the string to the current data pointer. Then you can cast the new pointer to the next structure.

Let’s use the following (pseudo-code) data definition:
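The record sketched here is hypothetical and serves only to illustrate the steps. It has two dynamic strings, so it breaks down into two structures.

```
record:
    id          : unsigned 8-bit
    name        : dynamic string (1 length byte, then the characters)
    port        : unsigned 16-bit
    description : dynamic string (1 length byte, then the characters)
```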

The structures will be as follows:
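A minimal sketch, assuming a hypothetical record made of an 8-bit `id`, a dynamic string `name`, a 16-bit `port` and a dynamic string `description`. Each structure ends with its dynamic string, and the packed attribute is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

/* First part: everything up to and including the first dynamic string. */
struct __attribute__((packed)) record_part_1 {
    uint8_t id;
    uint8_t name_len;         /* number of bytes in name[] */
    char    name[];           /* flexible array member */
};

/* Second part: starts right after the last byte of name[]. */
struct __attribute__((packed)) record_part_2 {
    uint16_t port;
    uint8_t  description_len; /* number of bytes in description[] */
    char     description[];
};

/* sizeof covers only the fixed fields, not the string bytes. */
static_assert(sizeof(struct record_part_1) == 2, "unexpected padding");
static_assert(sizeof(struct record_part_2) == 3, "unexpected padding");
```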

And to read the data, simply do this:
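The sketch here walks through the steps for the same hypothetical record (`id`, `name`, `port`, `description`). It uses `memcpy` in place of a raw pointer cast, adds bounds checks, and assumes multi-byte fields are in the host's byte order. `struct record_view` and `read_record` are names invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct __attribute__((packed)) record_part_1 {   /* GCC/Clang extension */
    uint8_t id;
    uint8_t name_len;
    char    name[];
};

struct __attribute__((packed)) record_part_2 {
    uint16_t port;            /* host byte order assumed */
    uint8_t  description_len;
    char     description[];
};

/* Parsed result; the string pointers point into the original buffer. */
struct record_view {
    uint8_t     id;
    uint16_t    port;
    const char *name;
    const char *description;
};

/* Returns 0 on success, -1 if the buffer is too short. */
static int read_record(const uint8_t *data, size_t len, struct record_view *out)
{
    struct record_part_1 p1;
    struct record_part_2 p2;

    /* Interpret the start of the data as the first structure. */
    if (len < sizeof p1) return -1;
    memcpy(&p1, data, sizeof p1);
    data += sizeof p1; len -= sizeof p1;

    /* The string starts right after the fixed fields. */
    if (len < p1.name_len) return -1;
    out->id = p1.id;
    out->name = (const char *)data;

    /* Skip the string to reach the next structure. */
    data += p1.name_len; len -= p1.name_len;
    if (len < sizeof p2) return -1;
    memcpy(&p2, data, sizeof p2);
    data += sizeof p2; len -= sizeof p2;

    if (len < p2.description_len) return -1;
    out->port = p2.port;
    out->description = (const char *)data;
    return 0;
}
```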

Writing the configuration is conceptually similar:
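A matching write sketch for the same hypothetical record: fill each fixed part, copy it out, then append the string bytes. `write_record` is an invented name, the caller supplies the output buffer, and multi-byte fields are written in host byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct __attribute__((packed)) record_part_1 {   /* GCC/Clang extension */
    uint8_t id;
    uint8_t name_len;
    char    name[];
};

struct __attribute__((packed)) record_part_2 {
    uint16_t port;            /* host byte order assumed */
    uint8_t  description_len;
    char     description[];
};

/* Writes the record into `out` (capacity `cap`).
 * Returns the number of bytes written, or 0 if it does not fit. */
static size_t write_record(uint8_t *out, size_t cap, uint8_t id,
                           const char *name, uint16_t port,
                           const char *description)
{
    size_t name_len = strlen(name) + 1;          /* include terminator */
    size_t desc_len = strlen(description) + 1;
    if (name_len > UINT8_MAX || desc_len > UINT8_MAX)
        return 0;                                /* length byte overflow */

    size_t total = sizeof(struct record_part_1) + name_len +
                   sizeof(struct record_part_2) + desc_len;
    if (total > cap)
        return 0;

    struct record_part_1 p1 = { .id = id, .name_len = (uint8_t)name_len };
    memcpy(out, &p1, sizeof p1);   out += sizeof p1;
    memcpy(out, name, name_len);   out += name_len;

    struct record_part_2 p2 = { .port = port,
                                .description_len = (uint8_t)desc_len };
    memcpy(out, &p2, sizeof p2);   out += sizeof p2;
    memcpy(out, description, desc_len);
    return total;
}
```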

The above code examples are pretty naive: they do not handle all of the memory allocation or fully validate the string boundaries. Production ready code to handle such complex binary data will be longer. Using generic helper functions to handle the dynamically sized data (be it strings, arrays (which are handled the same way as strings, since strings are arrays of chars) or anything else) will help keep the configuration code clearer and more robust.

Still, compare these code snippets to a JSON parser written in C, and you will find that you’d rather move some pointers around to get to your binary data.

For all the praise for binary data as configuration, its use is pretty limited. The typical embedded system will have a handful of numerical and boolean variables stored in the EEPROM. Anything more isn’t usually needed, so the burden of actually trying to encode (and humanly read!) complex binary data is avoided.

However, binary data can still be used even in complex configuration files, with a little bit of pre-processing. To enjoy the benefits of binary data, one can store the configuration itself in a binary format and have a reader and writer helper application that allows human-readable viewing and editing of the binary configuration file. This is sometimes done in the embedded world: one device I worked on had a UI (originally written in MFC, and re-written in C# by myself) to show and edit the configuration of the embedded device. Another embedded environment I used had quite complex configuration options for its OS and built-in peripherals. The vendor supplied pretty thorough tools to view and edit that configuration. The internal code that handles the binary data will still need to handle all of its complexities, but at least you will have an easy way to read and change the configuration.

In communication, binary data is more widely used, especially considering that the core protocols used today were developed long before the prevalence of the string data formats mentioned before (TCP header in JSON, anyone?). Although most of the lower level protocols (transport layer and below) use binary formats, application level protocols using string data are much more common (such as FTP, HTTP and RESTful services). Still, when you need to implement communication from scratch (be it over UART, Ethernet or another communication medium), a lot of the code is for the transport and lower protocols, while the application level protocols depend greatly on the needs of the device – the TCP, IP and Ethernet layers would be the same whether you need a simple proprietary binary protocol or a complex string based protocol such as XMPP or HTTP.

For communication, binary data handling shines thanks to the strict binary structure of the protocol headers – even if the payload itself is not binary data, parsing the communication headers is simple, even when using dynamic options (which aren’t actually used all that much). The structs of C and C++ allow direct casting of the incoming binary data to usable structures.

For example, the basic Ethernet headers up to TCP will look like so:
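A sketch of the three headers follows. Field names are chosen for this example. The bit-field ordering shown assumes GCC or Clang on a little-endian CPU (bit-field layout is implementation-defined), the packed attribute is a compiler extension, and multi-byte fields are left in network byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct __attribute__((packed)) eth_header {
    uint8_t  dst_mac[6];
    uint8_t  src_mac[6];
    uint16_t ethertype;        /* 0x0800 for IPv4, network byte order */
};

struct __attribute__((packed)) ipv4_header {
    uint8_t  ihl     : 4;      /* header length in 32-bit words */
    uint8_t  version : 4;      /* 4 for IPv4 */
    uint8_t  tos;
    uint16_t total_length;
    uint16_t identification;
    uint16_t flags_fragment;   /* 3 flag bits + 13-bit fragment offset */
    uint8_t  ttl;
    uint8_t  protocol;         /* 6 for TCP */
    uint16_t checksum;
    uint32_t src_addr;
    uint32_t dst_addr;
};

struct __attribute__((packed)) tcp_header {
    uint16_t src_port;
    uint16_t dst_port;
    uint32_t seq;
    uint32_t ack_seq;
    uint8_t  reserved    : 4;
    uint8_t  data_offset : 4;  /* header length in 32-bit words */
    uint8_t  fin : 1;          /* flag bits, least significant first */
    uint8_t  syn : 1;
    uint8_t  rst : 1;
    uint8_t  psh : 1;
    uint8_t  ack : 1;
    uint8_t  urg : 1;
    uint8_t  ece : 1;
    uint8_t  cwr : 1;
    uint16_t window;
    uint16_t checksum;
    uint16_t urgent_ptr;
};

static_assert(sizeof(struct eth_header) == 14, "unexpected padding");
static_assert(sizeof(struct ipv4_header) == 20, "unexpected padding");
static_assert(sizeof(struct tcp_header) == 20, "unexpected padding");

/* Copy raw bytes (at least 20) into a TCP header structure. */
static struct tcp_header tcp_header_from_bytes(const uint8_t *bytes)
{
    struct tcp_header h;
    memcpy(&h, bytes, sizeof h);
    return h;
}
```

On a big-endian target the bit-fields within each byte would need to be declared in the opposite order.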

Also note the ease of parsing the bit flags using the bit-field notation.

Using these structures, a protocol stack need only cast the data pointer to the appropriate structure, and then advance the data pointer by that structure’s size before passing it to the next layer. For example:
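A hedged sketch of that layering: each step reads one header, checks it, and advances the pointer. It copies with `memcpy` instead of casting, uses the header length fields rather than the plain structure size so that IP and TCP options are skipped, and assumes GCC or Clang on a little-endian CPU for the bit-fields. The structures are trimmed versions defined for this example, and `net16` and `tcp_payload` are invented names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct __attribute__((packed)) eth_header {
    uint8_t  dst_mac[6], src_mac[6];
    uint16_t ethertype;
};

struct __attribute__((packed)) ipv4_header {
    uint8_t  ihl : 4, version : 4;   /* little-endian bit-field order */
    uint8_t  tos;
    uint16_t total_length, identification, flags_fragment;
    uint8_t  ttl, protocol;
    uint16_t checksum;
    uint32_t src_addr, dst_addr;
};

struct __attribute__((packed)) tcp_header {
    uint16_t src_port, dst_port;
    uint32_t seq, ack_seq;
    uint8_t  reserved : 4, data_offset : 4;
    uint8_t  flags;
    uint16_t window, checksum, urgent_ptr;
};

/* Read a 16-bit field stored in network byte order, on any host. */
static uint16_t net16(uint16_t stored)
{
    uint8_t b[2];
    memcpy(b, &stored, sizeof b);
    return (uint16_t)((b[0] << 8) | b[1]);
}

/* Returns a pointer to the TCP payload and stores its length, or
 * returns NULL if the frame is not a well-formed IPv4/TCP frame. */
static const uint8_t *tcp_payload(const uint8_t *data, size_t len,
                                  size_t *payload_len)
{
    struct eth_header eth;
    struct ipv4_header ip;
    struct tcp_header tcp;

    /* Ethernet layer */
    if (len < sizeof eth) return NULL;
    memcpy(&eth, data, sizeof eth);
    if (net16(eth.ethertype) != 0x0800) return NULL;   /* not IPv4 */
    data += sizeof eth; len -= sizeof eth;

    /* IP layer */
    if (len < sizeof ip) return NULL;
    memcpy(&ip, data, sizeof ip);
    size_t ip_len = (size_t)ip.ihl * 4;
    if (ip.version != 4 || ip.protocol != 6) return NULL;  /* not TCP */
    if (ip_len < sizeof ip || len < ip_len) return NULL;
    data += ip_len; len -= ip_len;

    /* TCP layer */
    if (len < sizeof tcp) return NULL;
    memcpy(&tcp, data, sizeof tcp);
    size_t tcp_len = (size_t)tcp.data_offset * 4;
    if (tcp_len < sizeof tcp || len < tcp_len) return NULL;
    data += tcp_len; len -= tcp_len;

    *payload_len = len;
    return data;
}
```

A real stack would also honour the IP total length field (Ethernet frames may carry padding) and verify checksums; both are left out here for brevity.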

Binary data still has a place in the modern world. Sometimes it will be represented in a string format (such as Base64) for easier handling in binary-unfriendly systems, but its advantages of conciseness and performance are often tempting enough to use it anyway. And in the world of C and C++ code (be it in embedded systems or in performance oriented implementations) these advantages can be crucial to the success of the product.
