Internally within Analog Devices, I deliver training courses on functional safety for industrial applications. One of the modules in this training is called “Functional safety for networking”. I like to play some YouTube videos during the training, and in the networking module I use an excerpt from the film “Cool Hand Luke” where the warden talks about a failure to communicate (it’s the film where he has to eat 50 eggs to win a bet). Its relevance to functional safety is tangential at best, but sometimes you have to use what you have rather than what you want.

Figure 1 - A scene from the film Cool Hand Luke

In this blog I am going to show how to do the very basics of calculating a Pue (probability of an undetected error) for a communications interface. I hope to follow it up with future blogs covering more of this topic.

The main standard covering functional safety for networking is IEC 61784-3. There are other good standards such as IEC 62280/EN 50159 which contain additional data, but for calculating Pue, IEC 61784-3 has the details. If you are coming to IEC 61784-3 for the first time, it is complicated; some of the theory is there, but unless you are already familiar with the topic it can be hard to get started.

Networking is a topic of interest to me because we make a lot of networking ICs, such as RS-485 transceivers, Ethernet-APL (10BASE-T1L) chips to facilitate a 1 km replacement for 4-20 mA, and other chips such as those used in lithium-ion battery monitoring systems. Most safety systems include some sort of network, so you will need to know about the functional safety requirements for networks.

Let’s take the simple example of a code with four words {00,01,10,11}. Let’s further assume all four codes are equally likely to be transmitted.

Transmit {00} and, depending on the level of EMI or hardware failures in the system, you might receive any of the four code words. Let’s call the BER (bit error rate)/BEP (bit error probability) “p”, where p=0.01 means that one in one hundred bits will be corrupted.

Therefore, the probability of a given bit being corrupted is “p” and the probability of it not being corrupted is (1-p).

Then the probability of a specific single-bit error is p*(1-p), because for a two-bit code word a 1-bit error means one bit corrupted and one bit correct. There are two bits, and so there are two ways this error could occur, so the probability of a 1-bit error is 2*p*(1-p).

There is only one way a 2-bit error can occur (both bits corrupted), and so the probability of a 2-bit error is p².

So, the total error probability when transmitting code {00} is p² + 2*p*(1-p), i.e., the sum of the probabilities of a 1-bit and a 2-bit error. Since every received word is a valid code word, none of these errors can be detected.

The other three code words give the same answer, and since all four of them are equally likely to be transmitted, the probability for the code itself is Pue(p) = p² + 2*p*(1-p).

Note – you could multiply by four to allow for the four code words, but you then divide by four again to average over the transmitted words, so the result is unchanged.
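As a quick sanity check, the formula can be coded up in a few lines of Python (the function name is my own):

```python
# Pue for the 2-bit code {00, 01, 10, 11}.
# Every corrupted word is still a valid code word, so no error is
# detectable and Pue is simply the probability of any error at all.
def pue_2bit(p: float) -> float:
    # 2 ways to get a 1-bit error, 1 way to get a 2-bit error
    return 2 * p * (1 - p) + p ** 2

print(pue_2bit(0.5))   # 0.75, the maximum
print(pue_2bit(0.01))  # ~0.0199
```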

The equation above is plotted as p is varied from 0 to 0.5 in the orange trace of the graph below. You will notice that the orange trace has its maximum value of Pue at p=0.5 and falls as p falls. This is referred to as “good and proper”: good meaning that the Pue is a maximum at p=0.5, and proper meaning the curve is monotonic. A code can be good but not proper if the Pue initially falls, only to rise again before continuing to fall, but all proper codes are good.

Figure 2 - Analysis of two codes as BER/BEP is varied from 0 to 0.5

Why stop at p=0.5? At a p of 0.5 it doesn’t matter what code word you wanted to transmit: you will receive all possible words with equal probability. The first bit you transmit is equally likely to stay at its original value or flip to the other value, regardless of whether it is a 0 or a 1, and the same applies to all the other bits. In effect you receive random rubbish regardless of what you try to transmit. At very low values of p the bits are less likely to flip and what you receive is biased by the code word you are transmitting, i.e., if you try to transmit all zeros and p=1e-6 you will receive all zeros most of the time, and if you transmit all ones you will receive all ones most of the time. A code with r check bits will typically give a Pue of 1/2^r at p=0.5. If you use a CRC to detect the bit errors and it is good and proper, then that is the maximum Pue for any value of p. Proving the properness of a CRC polynomial is definitely one for an advanced future blog.

Let’s analyse another code, also plotted above (blue curve). As you will see, while its Pue is better than that of the first code, it is not proper: it has a higher value of Pue at p=0.4 than at p=0.5, so the curve is not monotonic.

The new code is simply the old code with a 0 appended and so there are four valid code words {000,010,100,110}.

If we want to transmit 000 there are three ways to get a single-bit error (bit 1 wrong, bit 2 wrong or bit 3 wrong), which would contribute 3*p*(1-p)² to the Pue formula. But wait a minute: the word 001 is not valid and so can be detected, so the contribution is actually 2*p*(1-p)². There are three ways to get a 2-bit error (bits 1,2 corrupted; bits 1,3 corrupted; or bits 2,3 corrupted), but two of these would have the last bit as a “1” and so are detectable, leaving a contribution of p²*(1-p). Finally, there is only one way to get a 3-bit error (bits 1, 2 and 3 all corrupted), and that is detectable as the last bit is a “1”, so it makes no contribution. Therefore, Pue = 2*p*(1-p)² + p²*(1-p).
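Again, a few lines of Python (function name mine) confirm the non-monotonic behaviour:

```python
# Pue for the 3-bit code {000, 010, 100, 110} (the old code plus a 0 check bit)
def pue_3bit(p: float) -> float:
    # 2 undetected 1-bit errors (e.g. 000 -> 100 or 010), each p*(1-p)^2
    # 1 undetected 2-bit error (e.g. 000 -> 110), p^2*(1-p)
    return 2 * p * (1 - p) ** 2 + p ** 2 * (1 - p)

print(pue_3bit(0.4))  # ~0.384 -- higher than at p = 0.5, so not proper
print(pue_3bit(0.5))  # 0.375
```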

Finally, let’s look at an example from the excellent paper “On the probability of undetected error for linear block codes”. The paper gives an example of a code that violates the 2^-r rule but doesn’t give the math behind it, so as an exercise I figured it out, and now you too can benefit from my unrelenting efforts on your, the reader’s, behalf.

Suppose we want to transmit one of eight data words {000, 001, 010, 011, 100, 101, 110, 111}, and the encoding is such that we send each data word twenty times in a row, so that all zeros is transmitted as sixty zeros and 001 is transmitted as 001001……001 with forty zeros and twenty ones. Anyway, let’s analyze this step by step.

  • Three data bits and fifty-seven check bits
  • 2^60 (1.152e18) possible received words, only eight of which are valid
  • What is transmitted
    • 000 -> 000 repeated twenty times
      • sixty possible 1-bit errors, all of which are detected (say bit 1 is corrupted: the first three-bit group becomes 100 while the other nineteen groups are still 000, so the mismatch is detectable)
      • 1,770 possible 2-bit errors (60 choose 2), all of which are detected
      • 3- to 19-bit errors are all detectable, since fewer than twenty flips cannot change the same bit position in all twenty three-bit groups – at least one group still holds the good code value
      • three ways a 20-bit error can be undetected (the same one of the three bit positions corrupted in all twenty groups)
      • three ways a 40-bit error can be undetected (bits 1,2 or bits 1,3 or bits 2,3 corrupted in all twenty groups)
      • one way a 60-bit error can be undetected (every bit corrupted)
    • => eight possible code words, all equally likely to be transmitted and all giving the same result =>
    • Pue = 3*(1-p)^40*p^20 + 3*(1-p)^20*p^40 + p^60
    • Pue(1/3) = 7.78e-17
    • 1/2^57 = 6.94e-18
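The bullet-point analysis above can be checked numerically in Python (function name mine):

```python
# Pue for the 3-bit data word repeated twenty times (60 bits total).
def pue_rep20(p: float) -> float:
    # An error is undetected only if the same bit position(s) flip in
    # all twenty groups: 3 ways for 20 flips, 3 for 40, 1 for 60.
    return (3 * (1 - p) ** 40 * p ** 20
            + 3 * (1 - p) ** 20 * p ** 40
            + p ** 60)

print(f"{pue_rep20(1/3):.3g}")  # 7.78e-17
print(f"{2 ** -57:.3g}")        # 6.94e-18 -- the 1/2^r value it exceeds
```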

Given the above, you can’t use the 1/2^r approximation, as Pue(1/3) exceeds that value. It shouldn’t take much imagination to convert the above analysis into something more sensible, such as reading 16-bit data from an ADC three times and comparing the results.

For the cases above it has been easy to say whether a failure is detected or not. If using a CRC, it can be harder to establish how many errors are detected, and you need to get down and dirty with detailed calculations.
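For a short code you can still do it by brute force. As a sketch, here is a toy example of my own (the generator, sizes and function names are illustrative choices, not from any standard): a 3-bit CRC with generator x³+x+1 protecting 4 data bits. For a linear code, an error pattern goes undetected exactly when it is itself a nonzero code word, so we can enumerate the code words and sum their weight contributions:

```python
# Toy example: 3-bit CRC, generator x^3 + x + 1, over 4 data bits,
# giving 7-bit code words (this happens to be the (7,4) Hamming code).
GEN, R, K = 0b1011, 3, 4

def crc(data: int) -> int:
    # long division of data * x^R by GEN over GF(2)
    rem = data << R
    for i in range(K - 1, -1, -1):
        if rem & (1 << (i + R)):
            rem ^= GEN << i
    return rem  # the R remainder (check) bits

# All 2^K code words: data bits followed by their CRC check bits.
codewords = [(d << R) | crc(d) for d in range(1 << K)]

def pue_crc(p: float) -> float:
    # Linearity: an error pattern is undetected iff it equals a
    # nonzero code word, so sum p^weight * (1-p)^(n-weight).
    n = K + R
    total = 0.0
    for c in codewords:
        if c:
            w = bin(c).count("1")
            total += p ** w * (1 - p) ** (n - w)
    return total

print(pue_crc(0.5))  # 15/128 = 0.1171875, just under the 1/2^3 = 0.125 bound
```

Scaling the same enumeration up to a real 16- or 32-bit CRC over long frames is where the “down and dirty” work comes in.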

In this blog I have shown how to convert a bit error probability into an undetected error probability for a message. For those who stuck with it to the end, I hope you got something out of it. For those who didn’t get to the end, it means you aren’t reading this note, but I nonetheless hope you will return to read my future blogs, where I may expand on other aspects of the calculation of the probability of an undetected error. This could include Hamming distances, Hamming weights, how to calculate a CRC, etc., etc., etc.

As an aside, IEC 61784-3 suggests you should assume a BER/BEP of 0.01. IEC 62280/EN 50159 goes further and says you should assume you will just receive random data (BER/BEP=0.5). Both standards cover the black channel (no assumptions made regarding the equipment used; instead they rely on a safety communications layer at both ends of the channel to implement safety). For a white channel (everything designed to IEC 61508) there is no guidance at all on the BER/BEP to use. You could argue that if you design a system to IEC 61508 (or indeed ISO 26262), susceptibility to EMI is a systematic failure mode, and EMI failures should be eliminated through design measures so that the BER/BEP is at a very low level.

As a further aside, there have been instances of people encoding a hidden data channel by inserting false bit flips – see here for a hack which dates back to France in 1834.

For more reading:

  • A previous blog on functional safety for networking – see here