
Residual Error Rates Due to Corruption for Industrial Communication Networks

Most industrial safety functions require communications. Functional safety standards generally agree that 1% of the allowed dangerous undetected failure rate should be allocated to the communications system. For a SIL 3 safety function this would be 1% of 1e-7/h = 1e-9/h. A common problem on communications buses is corruption, whereby some of the bits in a message get flipped while in transit. A common means to detect this is a CRC appended to the end of the data. However, the CRC will not detect all possible errors. Some multibit errors will escape the CRC and these escapes contribute to a residual error rate. It’s residual because it exists despite the added diagnostics. This blog explores the equation commonly used to calculate the residual error rate for such comms.
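The allocation arithmetic can be sketched in a few lines of Python. This is only an illustration: the 1% share and the SIL 3 limit of 1e-7/h come from the text above, the other limits are the upper bounds of the IEC 61508 high demand/continuous mode bands, and the names `SIL_LIMIT_PER_HOUR`, `COMMS_SHARE` and `comms_budget_per_hour` are made up for this example.

```python
# Upper limit of the dangerous failure rate per hour for each SIL
# (IEC 61508 high demand / continuous mode bands).
SIL_LIMIT_PER_HOUR = {1: 1e-5, 2: 1e-6, 3: 1e-7, 4: 1e-8}

# Share of that limit commonly allocated to the communications system.
COMMS_SHARE = 0.01


def comms_budget_per_hour(sil: int) -> float:
    """Return the residual error rate budget (per hour) for the comms."""
    return COMMS_SHARE * SIL_LIMIT_PER_HOUR[sil]


if __name__ == "__main__":
    print(f"{comms_budget_per_hour(3):.0e}")  # 1e-09
```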

Figure 1 - a Safety Protocol Data Unit

A frame, block, code word or SPDU (safety protocol data unit, to use the correct terminology) is made up of a data portion and a CRC. There may be other diagnostic information, but for the purposes of today’s blog let’s leave it at that. There are then a number of possibilities whereby any of the bits in the SPDU can get corrupted:

     1) The data is uncorrupted, and the CRC bits are uncorrupted – excellent, this is exactly what we want: good data and a good CRC received.

     2) The data is uncorrupted, but the CRC is corrupted – this should lead to a spurious trip of the safety system but is otherwise safe

     3) The data is corrupted, and the CRC is uncorrupted – if the number of data bits corrupted is below the Hamming distance of the chosen CRC, then the data corruption will be detected. If the number of bits corrupted is equal to or greater than the Hamming distance, then the corruption may not be detected and contributes to the residual error rate

     4) The data is corrupted, and the CRC is corrupted – if the total number of corrupted bits in the frame is below the Hamming distance of the chosen CRC, then the corrupted data should still be flagged but otherwise may not be detected and contributes to a residual error rate
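The four cases can be tried out with a toy SPDU. The sketch below is an assumption-laden illustration, not any real safety protocol: it uses a plain bitwise CRC-8 with polynomial 0x07 and a zero initial value, 16 data bits, and helper names (`crc8`, `make_spdu`, `crc_ok`, `flip_bits`) invented for this example. Because this CRC is linear, an error pattern that is itself a valid code word (here the bytes 00 01 07, four flipped bits) slips through, which shows how a multibit error can escape.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8, polynomial x^8 + x^2 + x + 1, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def make_spdu(data: bytes) -> bytes:
    """Append the CRC to the data to form the toy SPDU."""
    return data + bytes([crc8(data)])


def crc_ok(spdu: bytes) -> bool:
    """Receiver check: recompute the CRC over the data and compare."""
    return crc8(spdu[:-1]) == spdu[-1]


def flip_bits(frame: bytes, positions) -> bytes:
    """Flip the given bit positions (0 = first bit on the wire)."""
    out = bytearray(frame)
    for p in positions:
        out[p // 8] ^= 0x80 >> (p % 8)
    return bytes(out)


good = make_spdu(b"\x12\x34")                 # bits 0-15 data, 16-23 CRC

case1 = good                                  # nothing corrupted
case2 = flip_bits(good, [20])                 # CRC bit corrupted only
case3 = flip_bits(good, [3])                  # data bit corrupted only
case4 = flip_bits(good, [15, 21, 22, 23])     # data and CRC corrupted

if __name__ == "__main__":
    for name, frame in [("1", case1), ("2", case2), ("3", case3), ("4", case4)]:
        print(f"case {name}: CRC check passes = {crc_ok(frame)}")
```

Cases 2 and 3 fail the check as expected. The chosen case 4 pattern passes the check even though the data has changed, so for this toy CRC and frame length the Hamming distance can be no more than 4.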

An explanation of Hamming distance is required. The Hamming distance of a code is the minimum number of bit positions in which any two valid code words differ. If the Hamming distance is 6 then all 1 bit, 2 bit, 3 bit, 4 bit and 5 bit errors will be detected. As an example, many (perhaps all) of Analog Devices’ ADBMS68 series of lithium ion battery monitor chips have a CRC with a Hamming distance of 6, e.g. ADBMS6815.

Note – you can have a Hamming distance without a CRC. Suppose you only have two code words, 000 and 111. These have a Hamming distance of 3, so all 1 bit and 2 bit errors get detected because neither can turn one valid code word into the other.
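A minimal sketch of how the Hamming distance of a small code could be checked by brute force. The function names are made up for this example, and the even-parity code in the usage lines is an extra illustration of mine rather than something from the text above.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    """Number of bit positions in which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("code words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def code_min_distance(code_words) -> int:
    """Minimum Hamming distance over all pairs of valid code words."""
    return min(hamming_distance(a, b) for a, b in combinations(code_words, 2))


if __name__ == "__main__":
    print(code_min_distance(["000", "111"]))                # 3
    print(code_min_distance(["000", "011", "101", "110"]))  # 2 (even parity)
```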

To quantify 3) and 4) above the following equation is used

Figure 2 - Basic equation for calculation of residual error rates due to corruption
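Written out from the term-by-term description that follows, the equation takes the form below. The label R(P_e) for the residual error probability is an assumption of this write-up; n is the total number of bits in the SPDU, d is the Hamming distance, k is the number of corrupted bits and P_e is the bit error probability.

```latex
R(P_e) = \sum_{k=d}^{n} \binom{n}{k} \, P_e^{\,k} \, (1 - P_e)^{\,n-k}
```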

Looking at this equation in detail

  • The summation is from the Hamming distance to the total number of bits (number of data bits + number of CRC bits). So suppose the Hamming distance is 6: then all 1 bit, 2 bit, 3 bit, 4 bit and 5 bit errors are detected and so don’t contribute to the residual error rate, but errors of 6 or more bits are assumed to go undetected.
  • The next thing in the equation is a combinatorial: how many ways can you choose k bits from n, for example in how many ways can 6 bits be corrupted out of 32? The assumption is that each of those combinations is equally likely. It is often written c(n,k) and can be calculated as n!/(k!*(n-k)!) where “!” is the factorial operator. If d=3 and the total number of bits is 6, there are c(6,3) ways for 3 bits to be corrupted, which is 6!/(3!*3!) or 20 ways. Some of these ways are bits 1,2,3 corrupted, bits 1,2,4 corrupted, bits 1,2,5 corrupted, bits 1,2,6 corrupted, and so on.
  • If Pe is the probability of a single bit getting corrupted, then (1-Pe) is the probability of a single bit not being corrupted. Common values of Pe for safety are 0.5 (random data) or 0.01 (1 in a hundred bits getting corrupted). Then Pe to the power of k is the probability of k specific bits getting corrupted. But if there are k bits corrupted then there are n-k bits not corrupted, and so we need to multiply Pe^k by (1-Pe)^(n-k) to get the probability of k bits corrupted and n-k bits not corrupted.
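The three ingredients above (the summation limits, the combinatorial, and the two probability terms) translate almost directly into code. This is a sketch under the assumptions stated in the bullets; the function name and argument names are mine.

```python
from math import comb


def residual_error_probability(n: int, d: int, pe: float) -> float:
    """Conservative residual error probability per SPDU.

    n  : total number of bits (data bits + CRC bits)
    d  : Hamming distance of the CRC for this frame length
    pe : probability of a single bit being corrupted

    Every pattern of d or more corrupted bits is counted as undetected.
    """
    return sum(
        comb(n, k) * pe**k * (1 - pe) ** (n - k)
        for k in range(d, n + 1)
    )


if __name__ == "__main__":
    # 6 bits in total, Hamming distance 3, 1 in 100 bits corrupted
    print(f"{residual_error_probability(6, 3, 0.01):.3e}")  # 1.955e-05
```

`math.comb` needs Python 3.8 or later; on older versions the factorial formula n!/(k!*(n-k)!) can be used instead.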

For Pe=0.01, 3 bits corrupted and 6 total bits, there are 20 ways this can happen, so we start our summation with 20*0.01^3*(0.99^3). We then move on to 4 bits corrupted, and so on up to all 6 bits.
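Carrying this example through all four terms (k = 3, 4, 5, 6) gives the following. The label R is my own shorthand for the result, and the values are rounded.

```latex
\begin{aligned}
R &= 20\,(0.01)^3(0.99)^3 + 15\,(0.01)^4(0.99)^2 + 6\,(0.01)^5(0.99) + (0.01)^6 \\
  &\approx 1.94 \times 10^{-5} + 1.47 \times 10^{-7} + 5.94 \times 10^{-10} + 1.0 \times 10^{-12} \\
  &\approx 1.96 \times 10^{-5}
\end{aligned}
```

The first term dominates, which is typical when Pe is small: each extra corrupted bit costs roughly another factor of Pe.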

However, this equation is conservative and pessimistic. It assumes that no errors are detected once d or more bits are corrupted. In general, though, the CRC is still likely to detect most of those errors. For instance, an 8-bit CRC has 256 possible values, so there would appear to be only a 1 in 256 chance that the CRC of corrupted data would still match the received CRC. In fact this is true if the CRC is proper (see links below to find out what this means).

If you can’t prove your CRC is proper, then you should use the equation above. If you can prove it is proper, then you multiply everything by 1/2^r where r is the number of CRC bits.
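For a proper CRC, the same summation with the extra factor can be written as below. The label R_proper is an assumption of this write-up; r is the number of CRC bits and the other symbols are as described earlier (n total bits, d Hamming distance, k corrupted bits, P_e bit error probability).

```latex
R_{\text{proper}}(P_e) = \frac{1}{2^{r}} \sum_{k=d}^{n} \binom{n}{k} \, P_e^{\,k} \, (1 - P_e)^{\,n-k}
```

With an 8-bit CRC, for example, the factor is 1/2^8 = 1/256.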

I hope you have enjoyed this blog. The reason I am currently interested in functional safety of communications is that Ethernet APL is looking like it will replace 4/20mA and bring with it concerns such as the above. ADI also do IO-Link ICs, and there is a new IO-Link safety standard and a new draft OPC UA safety standard that need to be read and understood.

For anybody reading this blog who thinks “wow – ADI really knows their stuff” and feels a sudden urge to buy ADI products, you can find information on our industrial Ethernet PHY here.

To learn more on safety rated communications – see.

      My blog here explores the topic of black channel safety communications.

      In this blog, I give some thoughts on white channel safety communications.

      In this blog, I explore good and proper codes and how the undetected failure rate can actually get worse as the bit error probability improves.

Check back next month on the second Tuesday of the month for the next blog in this series. Until then I hope to post “mini blogs” on the other Tuesdays in the month directly from my LinkedIn account. Please follow me on LinkedIn if interested.

For previous blogs in this series see here.

For the full suite of ADI blogs on the EngineerZone platform see here.

For the full range of ADI products see here.