Do Redundancy Correctly and Practically

Chen Wei Tseng

Published Jan 30, 2026

It's been a while since I last wrote a more technical piece. Sadly, most folks I deal with care more about the certificate than about doing things correctly. But the engineering spirit within can only take so much of the dark side, so I guess I'll have to vent some nerdy stuff.

Before you proceed, I assume you have a basic understanding of redundancy (be it DMR, TMR, whatever) as well as terms such as ASIL decomposition (which is just ISO 26262's fancy way of claiming redundancy).

OK, let's dig right in. The major EDA providers would love to sell you a solution that will have you sign over the rights to your firstborn. Fancy schemes are claimed, but here are the main things to think about.

How is the protection scheme itself (such as Hamming encoding on an FSM, or a voter) protected? If the protection scheme is more vulnerable than the logic it protects, you end up with a net loss. How can that be possible, you ask? Here's an example: on an FPGA, if you TMR just the FFs, each of which is 1 bit in configuration plus the various flop settings (CE, RST, etc.), you have simply shifted the problem to the routing and the downstream voter. Remember that voters are implemented as LUTs in an FPGA. So chances are you'll come out with a net loss with that scheme. How do I know? Let's just say that to fight off some genius who wanted to do this in the past, I had to prove it with fault injection to show upper management.
If enough logic, including async logic, is protected, everything must be fine, right? Well, this assumes you know the magic number, which is design dependent. The best way to validate is still a combination of verification methods such as simulation, fault injection, etc. Note that I said combination, because simulation alone is not sufficient, not to mention the low coverage you get if you've actually done it, even with the major EDA tools' solutions. It's simply not practical.
What happens if the logic in one redundant domain gets stuck or out of sync? Can it recover? Logic such as FSMs and counters is notorious for killing redundancy schemes. With TMR, the problem can become a silent error, because one domain's error is masked by the voted output. The system is now less reliable, running on just two legs.
Verification coverage and scheme. Pounding the logic with random injection is not practical, even if you narrow the logic down to only the functional-safety-related portion.
What has been done on the implementation side of things, that is, place and route?
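To make the stuck-domain point above concrete, here is a toy Python model of a triplicated counter. It is only a sketch under my own assumptions: the function names, the `feedback` switch and the upset format are all made up for illustration, and real hardware would of course be RTL rather than Python. It compares a scheme where each domain keeps its own state against one where every domain reloads the voted value each cycle.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def run_tmr_counter(cycles, upsets, feedback, width=8):
    """Simulate three copies of a free-running counter (hypothetical model).

    upsets maps a cycle number to (domain index, bit mask to flip).
    feedback=True:  every domain reloads the voted value each cycle,
                    so a corrupted domain resynchronizes.
    feedback=False: every domain keeps counting from its own state,
                    so a corrupted domain stays corrupted.
    Returns (list of voted outputs, domains disagreeing with the vote at the end).
    """
    mask = (1 << width) - 1
    state = [0, 0, 0]
    voted_trace = []
    for cycle in range(cycles):
        if cycle in upsets:
            domain, flip = upsets[cycle]
            state[domain] ^= flip  # inject a bit flip into one domain
        voted = majority(*state)
        voted_trace.append(voted)
        base = [voted] * 3 if feedback else state
        state = [(s + 1) & mask for s in base]
    final_vote = majority(*state)
    stale = sum(1 for s in state if s != final_vote)
    return voted_trace, stale


golden = list(range(20))
one_upset = {5: (0, 0x10)}
two_upsets = {5: (0, 0x10), 12: (1, 0x10)}

# One upset, no feedback: output still looks fine, but one domain is lost.
trace, stale = run_tmr_counter(20, one_upset, feedback=False)
print("silent error:", trace == golden, "stale domains:", stale)

# A second upset in another domain now outvotes the only healthy copy.
trace, _ = run_tmr_counter(20, two_upsets, feedback=False)
print("output still correct without feedback:", trace == golden)

# With voted feedback, both upsets are flushed on the next cycle.
trace, stale = run_tmr_counter(20, two_upsets, feedback=True)
print("output still correct with feedback:", trace == golden, "stale domains:", stale)
```

The first case is the "two legs" situation: nothing looks wrong at the output, yet the scheme has already degraded. Whether voted feedback is affordable depends on the design, and the feedback path and voter then need protection of their own, which brings you back to the first question above.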

Well, so what should one do then?

If you're a US-based company, there are tools widely deployed by the A&D folks that have been validated not just through beam tests but also on actual missions in environments much worse than commercial applications. These tools are typically restricted, so if your company happens to have some A&D applications as well, ask around.
Hand-coding TMR at the logic level is not recommended, but one can start at a higher block level, which may be just sufficient for commercial applications. Protect the outputs with both a majority voter and a minority voter to catch persistent problems. In addition, consider board-level trace voting techniques to further reduce exposure to component failures.
For memory, scrub!!!
Watch your layout. Problems such as clock domain crossing and SET can be well mitigated there. Common layout techniques such as fencing and interleaving are proven to be very effective.
TMR is costly, so identify the most critical components. DMR is typically sufficient for most cases; just ensure you can detect persistent problems. If the performance requirement allows it, SW is typically the most cost-effective solution.
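As a rough illustration of pairing a majority voter with a minority voter on block-level outputs, here is a small Python sketch. Everything in it is an assumption on my part: the class name, the `persist_limit` threshold and the idea of counting consecutive disagreements are just one simple way to tell a transient upset from a persistent problem, not a reference to any specific tool.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


class PersistentFaultMonitor:
    """Majority voter plus a minority voter with a persistence filter.

    The minority side tracks which domain disagrees with the voted output.
    A domain is flagged only after it has disagreed for persist_limit
    consecutive cycles (persist_limit is a hypothetical tuning knob).
    """

    def __init__(self, persist_limit: int = 3):
        self.persist_limit = persist_limit
        self.streak = [0, 0, 0]  # consecutive disagreements per domain

    def step(self, a: int, b: int, c: int):
        """Vote one cycle; return (voted value, list of flagged domains)."""
        voted = majority(a, b, c)
        flagged = []
        for i, value in enumerate((a, b, c)):
            if value != voted:
                self.streak[i] += 1
            else:
                self.streak[i] = 0  # agreement clears the history
            if self.streak[i] >= self.persist_limit:
                flagged.append(i)
        return voted, flagged


monitor = PersistentFaultMonitor(persist_limit=3)
# Domain 1 glitches once (transient), then domain 2 stays wrong (persistent).
samples = [(7, 7, 7), (7, 5, 7), (8, 8, 8), (9, 9, 1), (9, 9, 1), (9, 9, 1)]
for a, b, c in samples:
    voted, flagged = monitor.step(a, b, c)
    print(f"inputs=({a},{b},{c}) voted={voted} flagged={flagged}")
```

The same persistence idea carries over to DMR, where a comparator replaces the voter: a mismatch can be detected but not attributed to one domain, so what to do once it persists (reset, resync, hand over to SW) is a system-level decision.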

Well, I managed to keep this piece short and sweet. And yes, the proposed solutions are rather high level. Feel free to DM me to follow up with a more detailed discussion.
