It's been a while since I last wrote a more technical piece. Sadly, most folks I've dealt with care more about the piece of paper called certification than about doing things correctly. But the engineering spirit within can only take so much of the dark side, so I guess I'll have to vent some nerdy stuff.
Before you proceed, I assume you have some basic understanding of redundancy (be it DMR, TMR, whatever) as well as terms such as ASIL decomposition (which is just ISO 26262's fancy way of claiming redundancy).
OK, let's dig right into this. The major EDA providers would love to sell you a solution that will have you sign over the rights to your firstborn. Fancy schemes are claimed, but here are the main things to think about.
- How is the protection scheme itself (such as Hamming encoding on an FSM, or a voter) protected? If the protection scheme is more vulnerable than the logic it protects, you end up with a net loss. How can that be possible, you ask? Let me give you an example. On an FPGA, if you TMR just the FF, which is 1 bit in configuration plus the various flop settings (CE, RST, etc.), you have simply shifted the problem to the routing and the voter downstream. Remember, voters are implemented as LUTs in an FPGA. So chances are you'll come out with a net loss with that scheme. How do I know? Let's just say that to fight off some genius doing this in the past, I had to prove it with fault injection to show higher management.
- If sufficient logic, including async logic, is protected, everything must be fine, right? Well, this assumes you know the magic number, which is design dependent. The best way to validate is still a combination of verification methods such as simulation, fault injection, etc. Note that I said combination, because simulation alone is not sufficient, not to mention the low coverage you get if you've actually tried it, even with the major EDA tools' solutions. Simply not practical.
- What happens if the logic in one redundant domain is stuck or gets out of sync? Can it recover? Logic such as FSMs and counters is notorious for killing the redundancy scheme. With TMR, the problem can turn into a silent error, because the voted output masks the one bad domain. Now the system is less reliable, running on just two legs.
- Verification coverage and scheme. Pounding the logic with random injection is not practical, even if you narrow the logic down to only the functional-safety-related portion.
- What's been done on the implementation side of things, that is, place and route?
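To make the "running on just two legs" point concrete, here is a minimal behavioral sketch in Python (not HDL, and not taken from any vendor tool). It models three copies of a free-running counter behind a bitwise majority voter. The names `majority`, `run`, `upsets` and `resync` are made up for illustration, and the `resync` option (feeding the voted value back into every domain each cycle) is one common recovery approach that I'm assuming here, not the only one.

```python
def majority(a, b, c):
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def run(cycles, upsets, resync=False, width=4):
    """Simulate three redundant counter domains with injected bit flips.

    upsets maps a cycle number to (domain_index, bitmask_to_flip).
    Returns one (voted, golden, mismatch_flags) tuple per cycle.
    """
    mask = (1 << width) - 1
    domains = [0, 0, 0]
    golden = 0  # fault-free reference counter
    trace = []
    for cycle in range(cycles):
        if cycle in upsets:
            domain, bits = upsets[cycle]
            domains[domain] ^= bits  # inject the upset
        voted = majority(*domains)
        mismatch = [d != voted for d in domains]
        trace.append((voted, golden, mismatch))
        if resync:
            # Assumed recovery scheme: reload every domain from the voted state.
            domains = [voted, voted, voted]
        domains = [(d + 1) & mask for d in domains]
        golden = (golden + 1) & mask
    return trace


if __name__ == "__main__":
    # First upset hits domain 0, a later one hits domain 1.
    upsets = {2: (0, 0x1), 5: (1, 0x2)}
    for label, resync in (("no resync", False), ("resync", True)):
        print(label)
        for cycle, (voted, golden, mismatch) in enumerate(run(8, upsets, resync)):
            status = "ok" if voted == golden else "WRONG"
            print(f"  cycle {cycle}: voted={voted} golden={golden} "
                  f"mismatch={mismatch} {status}")
```

Without resync, the first upset is masked and nobody notices, but domain 0 stays out of step forever, so the second upset produces a wrong voted output. With resync, the same two upsets are both absorbed. The `mismatch` flags are the part worth monitoring: a flag that stays raised is your persistent problem.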
Well, so what should one do then?
- If you're a US-based company, there are some tools widely deployed by the A&D folks that have been validated through not just beam tests but also actual missions, in environments much worse than commercial applications. These tools are typically restricted, so if your company happens to have some A&D applications as well, ask around.
- Hand-coding TMR at the logic level is not recommended, but one can start at a higher block level, which may be just sufficient for commercial applications. Protect the outputs with both a majority voter and a minority voter to catch persistent problems. In addition, consider board-level trace voting techniques to further lower component failure.
- For memory, scrub!!!
- Watch your layout. Problems such as clock domain crossing and SET can be well mitigated there. Common layout techniques such as fencing and interleaving have proven to be very effective.
- TMR is costly, so identify the most critical components. DMR is typically sufficient for most cases; just ensure you can detect persistent problems. If the performance requirement allows it, SW is typically the most cost-effective solution.
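The majority-plus-minority voter idea and the DMR "detect persistent problems" idea can be sketched the same way. Again, this is a hedged Python behavioral model, not production HDL: the class names `VotedOutput` and `DmrCompare`, and the `threshold` of consecutive mismatches that counts as "persistent", are assumptions I picked for illustration. The right threshold is design dependent.

```python
class VotedOutput:
    """TMR output stage: majority vote plus a minority (disagreement) monitor."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # consecutive mismatches treated as persistent
        self.streak = [0, 0, 0]     # per-domain run of consecutive disagreements

    def step(self, a, b, c):
        """Return (voted_value, list_of_persistently_failing_domains)."""
        voted = (a & b) | (a & c) | (b & c)
        persistent = []
        for i, value in enumerate((a, b, c)):
            # Minority voter: track which domain disagrees with the vote.
            self.streak[i] = self.streak[i] + 1 if value != voted else 0
            if self.streak[i] >= self.threshold:
                persistent.append(i)
        return voted, persistent


class DmrCompare:
    """DMR comparator: cannot correct, but can detect and flag persistence."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.streak = 0

    def step(self, a, b):
        """Return (outputs_match, persistent_fault_flag)."""
        match = a == b
        self.streak = 0 if match else self.streak + 1
        return match, self.streak >= self.threshold


if __name__ == "__main__":
    tmr = VotedOutput(threshold=3)
    # Domain 1 is stuck at a wrong value for three cycles, then recovers.
    for sample in [(5, 5, 5), (5, 4, 5), (5, 4, 5), (5, 4, 5), (5, 5, 5)]:
        print("TMR", sample, tmr.step(*sample))

    dmr = DmrCompare(threshold=2)
    for sample in [(1, 1), (1, 0), (1, 0), (1, 1)]:
        print("DMR", sample, dmr.step(*sample))
```

In the TMR run the voted output stays correct the whole time, which is exactly why you want the minority flag: it is the only thing telling you one leg is gone. What you do with the flag (reset the domain, scrub, hand it to SW) is a system-level decision.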
Well, I managed to keep this piece short and sweet. And yes, the proposed solutions are rather high level. Feel free to DM me to follow up with a more detailed discussion.