Importance of error detection


The use of guaranteed delivery communication protocols (TCP being the foremost of these) and the high reliability of the Internet (even when accessed through a cellular network) often lets us feel that our communication medium is perfect and provides bit-accurate data transfer.
Many modern application-level protocols rely on convenient data representation formats (such as JSON, XML or YAML) without doing any data integrity validation (instead performing only business-logic validation on the data). Relying on the lower-level protocols (mostly the transport-layer protocols, such as the aforementioned TCP) is both convenient and efficient, and it is the idea behind the OSI model’s layered structure.
It’s easy to forget that our network communication is protected not by a single error detection and correction mechanism, but typically by three or more (the Ethernet frame check sequence, the IP header checksum, and the TCP header and data checksum).
Sadly, in the embedded world the developer often doesn’t have the luxury of a full stack of protocols to rely on. The go-to communication channel in the hands of the embedded engineer, the UART, provides only a parity bit (more on that below), and most other communication protocols and mediums lack the well-structured and highly abstracted guarantees of the typical PC, server or mobile communication. The other two most common embedded communication buses, I2C and SPI, don’t provide any error detection. CAN bus, on the other hand, does provide a well-defined communication stack and error detection, but its implementation is much more complex and is still limited mainly to the automotive industry, with few general-purpose microcontrollers supporting it.
The typical error detection mechanisms include (ordered from least to most error detection capability):

  1. Parity bit – an additional bit that is added to the data to make the count of 1 bits even (or odd, depending on implementation). So, for the byte 0xA5 (10100101b) the parity bit is 0, transmitted as 101001010. This is used by UART (typically in hardware). A parity bit can detect only an odd number of bit errors (so 111111010 will be recognized as an error, while 111110010 will not). Parity bits are very simple to implement in hardware, but code likes to work in bytes and words, so they are less convenient in software (although still cheaper to calculate than the methods below).
  2. Checksum – an additional field computed as the sum of all the data. The sum is typically done on chunks the size of the checksum field – 8-bit sums for an 8-bit field (such as XMODEM’s checksum) and 16-bit sums for a 16-bit field (such as the TCP header and data checksum). A checksum detects errors much better than a single parity bit, but still worse than a CRC (below). The main type of error that a checksum fails to handle is when all the data (including the checksum field) is received as all zeroes (the sum of which is also zero). To counter that, the checksum field is typically XORed with all 1s (0xFF for an 8-bit checksum and 0xFFFF for a 16-bit one), and the sum may also be “seeded” with a special value. A checksum is cheap to calculate, so it still has some use despite being less effective than a CRC.
  3. Cyclic Redundancy Check (CRC) – a mathematical function that uses polynomial division to detect data corruption (the math behind CRC must be fascinating, but alas, I lack the mathematical inclination to understand it; luckily, code to calculate a CRC is widely available and straightforward to implement). It is much more computationally intensive than a checksum, and the typical way to improve performance is to use a large lookup table (thus increasing code size). A CRC will detect most data corruption errors and is well suited when performance and code size are not absolutely critical. CRC implementations range from 8-bit to 64-bit fields, with widely available implementations in most languages (including C and embedded platforms’ assembly); a minimal C sketch of a parity bit, a checksum and a bitwise CRC appears after this list.
  4. Hash functions, cryptographic (such as SHA) or otherwise (such as Pearson hashing). These are commonly used in contexts other than communication and we will not expand on them here.
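
To make the first three mechanisms concrete, here is a minimal C sketch of each: an even parity bit, an 8-bit additive checksum and a bitwise CRC-16 using the CCITT polynomial 0x1021 with an initial value of 0xFFFF (one common variant among many). The function names are my own, and a table-driven CRC would be faster at the cost of a 512-byte lookup table.

```c
#include <stdint.h>
#include <stddef.h>

/* Even parity bit for one byte: returns 1 if the number of 1 bits is odd,
 * so that the 9-bit word (data + parity) ends up with an even count of 1s. */
static uint8_t even_parity(uint8_t byte)
{
    uint8_t parity = 0;
    while (byte) {
        parity ^= (byte & 1u);
        byte >>= 1;
    }
    return parity;
}

/* Simple 8-bit additive checksum (XMODEM style): sum all bytes and let the
 * result wrap around in 8 bits. */
static uint8_t checksum8(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}

/* Bitwise CRC-16 with the CCITT polynomial 0x1021 and initial value 0xFFFF.
 * Slower than a table-driven version, but needs no lookup table. */
static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)data[i] << 8;
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u)
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            else
                crc <<= 1;
        }
    }
    return crc;
}
```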

In the modern embedded world, CPU and memory aren’t as constrained as in the early days of computing and embedded devices, and even modest PIC microcontrollers can easily run a checksum and often a CRC as well (with clock speeds measured in MHz and memory – both RAM and program memory – in Kbytes). For those with extreme limits, relying on a hardware parity bit might be the only option, although an 8-bit checksum takes just an additional byte in RAM and can be implemented in about 10-15 instructions.

To illustrate the importance of error detection, I will present two cases that I have experienced, debugged and fixed using error detection.

In both cases the problem was with performing a firmware update (i.e. updating the code on the embedded device), as this is the most data-intensive operation that the typical embedded device performs, increasing the potential for errors to crop up.

The first case is the more typical, communication-related need for error detection – the embedded device has two microcontrollers, each running its own separate firmware. The two microcontrollers were connected to each other via UART, while only one of them was connected to the outside world via Ethernet. The firmware update process thus involved the microcontroller connected to Ethernet (namely, MCU_A) receiving the firmware image for itself as well as for the other microcontroller (namely, MCU_B). MCU_A sent MCU_B’s firmware image to MCU_B via UART as it was received from Ethernet. So, for every Ethernet packet received (over TCP), MCU_A forwarded it to MCU_B via UART, waited for an acknowledgment and only then acknowledged the Ethernet packet to the server.

Typically, the firmware update worked correctly, but once in a while it caused MCU_B to fail to start after performing the update. I downloaded MCU_B’s firmware from its internal flash memory using JTAG and compared it to the expected firmware image. As expected, there was a corrupted byte in the firmware taken from MCU_B. Investigating further revealed that this happened quite often, and that the location of the corruption affected the outcome of the update. The protocol that was used on top of the UART did not have any error detection (other than the UART’s parity bit), since it was based on a legacy implementation in MCU_A. Adding a 16-bit checksum on the UART messages and a full CRC32 on the entire firmware image (which, ironically enough, already existed for MCU_A’s firmware update, but not for MCU_B’s) caused corrupted messages to be retransmitted by MCU_A until they were received correctly by MCU_B or the maximal retransmission count was exceeded, in which case the firmware update was aborted with an error. Luckily, this error occurred in the QA lab during development. The second case was not so lucky.
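
The sender side of the fix looked roughly like the sketch below: each UART frame carries a 16-bit checksum and is retransmitted until acknowledged or until the retry limit is reached. uart_send_frame(), uart_wait_ack(), the seed value and the retry count are hypothetical stand-ins for the actual driver API and protocol parameters.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define MAX_RETRANSMISSIONS 3        /* assumed retry limit */

/* Hypothetical UART helpers - the real driver API will differ. */
bool uart_send_frame(const uint8_t *payload, size_t len, uint16_t checksum);
bool uart_wait_ack(uint32_t timeout_ms);

/* 16-bit additive checksum over the payload, seeded with a non-zero value
 * so that an all-zero frame does not checksum to zero. */
static uint16_t checksum16(const uint8_t *data, size_t len)
{
    uint16_t sum = 0x1234u;          /* arbitrary seed, an assumption */
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}

/* Send one chunk of the firmware image to MCU_B, retrying on a missing or
 * negative acknowledgment. Returns false if the update should be aborted. */
static bool send_firmware_chunk(const uint8_t *chunk, size_t len)
{
    for (int attempt = 0; attempt < MAX_RETRANSMISSIONS; attempt++) {
        if (uart_send_frame(chunk, len, checksum16(chunk, len)) &&
            uart_wait_ack(100 /* ms */)) {
            return true;             /* MCU_B accepted the chunk */
        }
    }
    return false;                    /* abort the firmware update */
}
```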

The second case involved only one microcontroller – the firmware update was performed in place on the flash, with a small bootloader code relocated to RAM to do the update. The regular firmware would receive a compressed image file and store it in an external flash memory. The bootloader code would then decompress the image file and overwrite the internal flash with the new image. Once the entire image was decompressed and written to the internal flash, the microcontroller was reset and started executing the new firmware. Except for the times it failed to start after the firmware update. This happened in the field, and units returned to the development lab were analyzed. At first, hardware issues were suspected, but re-writing the firmware to the microcontroller showed that the hardware was fine. Reading the internal flash of the microcontroller and comparing it to the expected image (as was done in the first case) again showed byte corruptions. Since the firmware update was protected with a CRC32 both on the compressed image and on the decompressed image (with the CRC being calculated on the RAM buffer used for decompression), I was initially completely stumped. Then, reading up on data corruption in flash memory showed that, rarely, a flash memory can report a successful write while a byte still comes back corrupted when read. Adding write validation to the flash writes of the decompressed image chunks (by reading back what was written, comparing it with what was expected and re-writing on failure) proved the theory (with the assistance of several dozen units that were left to do back-to-back firmware updates over the weekend).
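
The fix amounted to a write-with-verify helper along these lines; flash_write() and flash_read() are hypothetical stand-ins for the actual flash driver, and the chunk size and retry count are assumptions.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>
#include <string.h>

#define MAX_WRITE_RETRIES 3          /* assumed retry limit */

/* Hypothetical flash driver calls - replace with the real HAL. */
bool flash_write(uint32_t address, const uint8_t *data, size_t len);
bool flash_read(uint32_t address, uint8_t *data, size_t len);

/* Write a chunk of the decompressed image and read it back to make sure the
 * flash actually holds what we asked it to write. A write that "succeeds"
 * but reads back differently is retried (a real retry would typically erase
 * the affected sector first, depending on the flash). */
static bool flash_write_verified(uint32_t address,
                                 const uint8_t *data, size_t len)
{
    uint8_t readback[256];           /* assume chunks of at most 256 bytes */

    if (len > sizeof(readback))
        return false;

    for (int attempt = 0; attempt < MAX_WRITE_RETRIES; attempt++) {
        if (!flash_write(address, data, len))
            continue;
        if (flash_read(address, readback, len) &&
            memcmp(readback, data, len) == 0) {
            return true;             /* contents verified */
        }
    }
    return false;                    /* persistent corruption: abort update */
}
```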

Moral of the two cases? Never trust the hardware, especially when errors might mean a dead unit. In addition, make sure to preserve the firmware image of a defective unit that arrived from the field before making any changes to it (try explaining to your manager why you need to ship another unit from the field to the lab).
