BCJR (Max-Log Map) Implementation

Mathematical theories are fascinating in books and papers — but they become truly exciting when we bring them to life on real hardware.

In this article, I’d like to share my latest FPGA implementation of the BCJR algorithm, an important technique in forward error correction and decoding. This journey was not just about coding or designing logic — it was about translating abstract math into working circuits that can handle real-world high-speed communication.

Over this journey, we will walk through the following topics together:

1. Introduction

  • Why BCJR matters in modern communication systems
  • Challenges of mapping algorithms to hardware

2. Implementation of Max-Log Map

  • A review of the Max-Log Map algorithm and its role as one implementation of BCJR
  • Key features of the Max-Log Map decoder
  • Architecture and modular design choices

3. Core Design Units

  • Input Unit (I-Unit)
  • Control Unit (F-Unit)
  • Computational Units (A-Unit, B-Unit, D-Unit)
  • Metric Storage Unit
  • Computational LLR Unit

4. Hardware Testing & Integration

  • Creating the Vivado project
  • Designing with MicroBlaze
  • Interface and communication between SDK and Vivado
  • Debugging and project management in SDK

5. Performance Reporting

  • Resource utilization results
  • Power consumption report for XA7A100T FPGA

6. Summary & Conclusion

  • Key takeaways from mapping BCJR to hardware
  • Lessons learned during design and testing



1. Introduction


Why BCJR matters in modern communication systems

In wideband HF communication systems, convolutional coding is a key requirement for handling different data rates and waveforms. While the Viterbi algorithm is widely known, modern HF systems demand iterative processing and soft-input/soft-output decoding, which makes the MAP algorithm (BCJR) the better choice.

I recently worked on implementing a MAP decoder with constraint length 7, focusing on efficient controller design for optimal resource sharing. In the end, I also looked into how this decoder can be smoothly integrated into a complete system.

Using convolutional codes is one of the common methods of Forward Error Correction (FEC). A convolutional encoder adds a sequence of redundant bits to the original data at the transmitter, which allows the receiver to detect and correct errors. The use of convolutional codes is especially important in noisy channels.

A convolutional encoder operates based on a state table or state diagram, which contains the following information:

  • The input bit(s)
  • The current state of the encoder (i.e., the bits stored in its registers)
  • The next state of the encoder (after a new input bit enters)
  • The output bits, or the coded bits, corresponding to each input

Although state machines and state diagrams can fully represent the encoding and decoding process, a more common graphical representation is the Trellis diagram. This diagram (shown in Figure 1-1) may look a bit crowded at first, but it is the best way to describe the operation of both the encoder and, consequently, the decoder.

In a Trellis diagram, the x-axis represents time, and the y-axis represents all possible states of the encoder. Each transition on the diagram corresponds to a new input bit entering the encoder, producing an output of two coded bits.
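As a concrete illustration of such a state table, here is a small Python sketch that enumerates (current state, input bit) → (next state, output bits). The rate-1/2, constraint-length-3 code with generator polynomials (7, 5) in octal is a toy assumption for brevity, not the K = 7 code used later in the article.

```python
def build_state_table(k=3, g=(0b111, 0b101)):
    """Enumerate (state, input bit) -> (next state, output bits)
    for a rate-1/2 convolutional encoder with constraint length k."""
    n_states = 1 << (k - 1)
    table = {}
    for state in range(n_states):
        for bit in (0, 1):
            reg = (bit << (k - 1)) | state            # new bit shifts in
            out = tuple(bin(reg & g_i).count("1") & 1  # parity per polynomial
                        for g_i in g)
            next_state = reg >> 1
            table[(state, bit)] = (next_state, out)
    return table

table = build_state_table()
# Each of the 4 states has exactly two outgoing branches, one per
# input bit -- these are the transitions drawn in the Trellis diagram.
print(len(table))  # 8
```

Walking this table forward in time, one column of states per input bit, reproduces exactly the Trellis picture described above.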


Figure 1-1

Two main algorithms are commonly used for decoding convolutionally encoded information: Viterbi and BCJR. Compared to Viterbi, the BCJR algorithm offers higher performance but also comes with greater computational complexity.

In the following sections, we first present a simplified version of the BCJR algorithm, known as Max-Log MAP, along with the conventional architectures used for implementing this type of decoder. After that, we describe the final architecture adopted for our implementation.

In the last chapter, we provide the hardware test reports of the project, including details on resource utilization and the testing methodology that was applied.



Challenges of mapping algorithms to hardware


As FPGA engineers, we often start with a clean mathematical algorithm in MATLAB or Python. But once we move to hardware, reality hits: things aren’t that simple. Here are some of the main challenges we face:

  • Parallelism vs. Sequential Nature – algorithms are usually written step by step, but hardware needs parallel execution. Finding the right balance is tricky.

  • Precision & Quantization – theory assumes floating point, but hardware prefers fixed point. Choosing the right word length without losing accuracy is always a battle.

  • Resource Constraints – limited LUTs, DSPs, and BRAMs mean trade-offs between area and speed are constant.

  • Latency vs. Throughput – some systems demand low latency, others high throughput. Pipelining and unrolling decisions make or break the design.

  • Memory Bottlenecks – algorithms like MAP (BCJR) rely heavily on storing and accessing intermediate results. Memory design becomes a real bottleneck.

  • Control Complexity – iterative algorithms need smart controllers to manage resource sharing and operations efficiently.

  • Scalability – hardware should adapt to evolving standards (LTE, 5G, and beyond). Making architectures reusable is not easy.

  • Verification & Debugging – proving your hardware matches the math is much harder than simulating it in software.




2. Implementation of Max-Log Map


A review of the Max-Log Map algorithm and its role as one implementation of BCJR


The BCJR decoder processes Log-Likelihood Ratios (LLRs) rather than raw bits. An LLR expresses the decoder’s confidence about whether a coded data bit is a 0 or a 1. In other words, instead of making a hard decision on the output bit, the decoder provides the probability of the bit being 0 or 1.

In general, the architecture and implementation of a decoder can be fully serial or fully parallel. The choice of architecture typically depends on the processor and project requirements, such as data rate and the maximum achievable operating frequency.

BCJR is one of the common architectures used for decoder implementation. Figure 2.1 shows an example of a simplified version of this architecture. Due to the high computational load required for determining exact LLR values, the Max-Log-MAP algorithm, a simplified version of the BCJR algorithm, is used.

In this architecture, a sliding-window technique is applied, as described extensively in the literature. In this method, the relatively large LLR input block of the decoder is divided into multiple sequences of equal length that are significantly smaller than the original block size. The required computations for generating the output are then performed over these smaller sequences instead of the full input block.


Fig 2.1 Common architecture for a parallel implementation of Max-Log BCJR


The input data is first stored in an initial buffer whose length equals three sliding windows (as shown in the architecture of Figure 2.1). This buffered data is then used to perform three sets of parallel recursive computations, known as Forward, Backward, and Pre-Backward (Dummy) calculations.

Using array-based input buffers is highly effective for increasing the decoder’s operational throughput and reducing the memory required to store intermediate metrics. This is particularly important because one of the major drawbacks of the BCJR algorithm is its high memory demand for storing state metrics.

According to the algorithm, the values that must be computed during execution in order to determine the LLR output are:

  • Branch (Transition) Metric Matrix (γ) — transition metric between states
  • Forward State Metric Matrix (α)
  • Backward State Metric Matrix (β)

The resulting LLR output, based on Equation 2.1, is calculated as follows. In general, the LLR of the input bit at time k represents the ratio of the posterior probabilities of the bit being 0 or 1. In this relationship, the LLR is computed using the natural logarithm of the conditional probabilities of the output bit being +1 or –1 (given the received data from the channel):



Equation 2.1
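Since the equation image does not survive in this format, here is a sketch of the standard a-posteriori LLR definition that Equation 2.1 describes, with y denoting the received sequence:

```latex
L(u_k) = \ln \frac{P(u_k = +1 \mid \mathbf{y})}{P(u_k = -1 \mid \mathbf{y})}
```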


Step 1: First, the state-transition metric matrix for the active sliding window must be computed. In this step, the transition probabilities between states (γ metrics) may either correspond to the a priori probabilities or be set to zero, depending on the pair of states (s, s′).

The block-based algorithm receives an input sequence of length N symbols, and the possible transitions in the Trellis are determined accordingly. To compute the transition metrics γ over these N symbols, the Trellis is traced forward along the diagram in accordance with Equation 2.2.

In this relationship, C represents the Trellis output generated based on the generator polynomials used in the encoder. As noted earlier, LLRₖ, the data received from the channel, is also used in computing these values.


Equation 2.2
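As the equation image is unavailable here, one common Max-Log form of the branch metric for a rate-1/2 code, assuming antipodal Trellis outputs c_{k,i} ∈ {±1} (the C of the text) and the pair of received channel LLRs L_{k,1}, L_{k,2}, is:

```latex
\gamma_k(s', s) = \frac{1}{2} \sum_{i=1}^{2} c_{k,i}(s', s)\, L_{k,i}
```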


Step 2: In this step, the forward state metrics (α metrics) must be fully computed for each window. According to the sliding-window technique, the received block of N symbols is divided into several sufficiently large windows, and the computations are performed on these smaller windows.

The forward recursion metrics α are then calculated for each window based on Equation 2.3, and each resulting metric is referred to as the Forward Recursion metric.

Equation 2.3

The expression (s′→s) represents the set of all states s′ that can transition into state s, based on the encoder’s generator polynomials. Note that the computation of the α forward recursion metrics for the first window begins independently, but for all subsequent windows, the recursive computation uses the α metrics obtained from the previous window. This is why the computation of these metrics proceeds recursively and in the forward direction throughout the Trellis.
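One step of this forward recursion can be sketched in a few lines of Python; the two-state trellis, branch metrics, and initialization values below are purely illustrative, not taken from the article's code.

```python
NEG_INF = float("-inf")

def forward_step(alpha_prev, gamma, predecessors):
    """One Max-Log forward step:
    alpha_k(s) = max over s'->s of (alpha_{k-1}(s') + gamma(s', s))."""
    alpha = []
    for s, preds in enumerate(predecessors):
        alpha.append(max(
            (alpha_prev[sp] + gamma[(sp, s)] for sp in preds),
            default=NEG_INF,
        ))
    return alpha

# Toy 2-state trellis: each state is reachable from both states.
predecessors = [(0, 1), (0, 1)]
gamma = {(0, 0): 0.5, (1, 0): -0.2, (0, 1): -1.0, (1, 1): 0.1}
alpha0 = [0.0, NEG_INF]   # start in state 0, as in the usual init
print(forward_step(alpha0, gamma, predecessors))  # [0.5, -1.0]
```

Iterating this step over a window, feeding each output back in as `alpha_prev`, is exactly the forward sweep through the Trellis described above.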

The Max-Log-MAP algorithm, which is a simplified version of the MAP and Log-MAP algorithms, is used here. In the Log-MAP algorithm, the correction term ln(1 + e^(−|x − y|)) of the max* operation is included, whereas in Max-Log-MAP this correction is omitted. In practice, this second term of the Log-MAP expression is often implemented using a lookup table.
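A small Python sketch of the difference between the two variants (the max* operation itself is standard; the values below are only illustrative):

```python
import math

def max_star(x, y):
    """Exact Jacobian logarithm used by Log-MAP:
    max(x, y) plus the correction term ln(1 + e^{-|x - y|})."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))

def max_log(x, y):
    """Max-Log-MAP approximation: the correction term is dropped."""
    return max(x, y)

# The approximation error is largest when the two operands are
# equal, where it reaches ln(2) ~ 0.693.
print(max_star(1.0, 1.0) - max_log(1.0, 1.0))
```

When the operands are far apart, the correction term vanishes, which is why the Max-Log approximation costs little accuracy while removing the lookup table from the hardware.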

During the computation of α metrics, a set of metrics of size L × S must be stored, where S is the number of decoder states. The metrics are stored using direct addressing.

In this architecture, the block size is considered to be N=M×L, where M is the window length. Thus, each window has a length of M, and the block consists of L windows. Consequently, the memory required for storing the α metrics is reduced proportionally to M.


Equation 2.4


Step 3: Next, the backward state metrics (β metrics) are computed. To calculate β, a pair of stored LLR values is read from the input buffer. After computing the branch metrics in the reverse direction, the β values are updated once per clock cycle using Equation 2.5. This operation is known as the Backward Recursion.

In parallel with this process, the corresponding α metrics stored in the α-Memory (α-Mem) are also retrieved, since they are needed for the β computations. It is clear that in this architecture, computing β over any window of input data is only possible after the α metrics for that window have been fully computed.

Furthermore, the Pre-Backward Recursion must be executed before the main Backward Recursion to properly initialize the β metrics.


Equation 2.5
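A sketch of the standard Max-Log backward recursion that Equation 2.5 describes, mirroring the forward recursion of Equation 2.3:

```latex
\beta_{k-1}(s') = \max_{s' \rightarrow s} \bigl( \beta_k(s) + \gamma_k(s', s) \bigr)
```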


Fig 2.2 Calculation of the forward (α) and backward (β) recursions


Fig 2.3 Alpha metrics recursion


Figure 2.2 presents a graphical illustration of how the recursive relations are computed along the forward path (α metrics, right side) and the backward path (β metrics, left side). Figure 2.3 shows two consecutive steps of the recursive computations required to determine the forward state metrics α at time steps k−1 and k.

For initializing the state metrics αk(s) and βk(s), the initialization must be performed according to Equation 2.6.

Equation 2.6
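Assuming the encoder starts and terminates in the all-zero state, the usual initialization behind Equation 2.6 takes the form:

```latex
\alpha_0(s) = \begin{cases} 0, & s = 0 \\ -\infty, & s \neq 0 \end{cases}
\qquad
\beta_N(s) = \begin{cases} 0, & s = 0 \\ -\infty, & s \neq 0 \end{cases}
```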

Step 4: The process of computing the output LLR using the architecture shown in Figure 2.1 is performed through four pipeline stages. According to this flow, all α metrics for every state are first computed and stored. After both LLR blocks are fully received, the β metrics are then computed in the backward direction over two windows. Finally, the output LLR, representing the posterior probability of the decoded bit, is determined using Equations 2.7 and 2.8.

By pipelining the recursive computations—Pre-Backward, Forward, and Backward—and assigning independent computational units to each process, the architecture produces one LLR per clock cycle after an initial latency.


Equation 2.7


Equation 2.8
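In Max-Log form, the output LLR combining the three stored metrics, a sketch of what Equations 2.7 and 2.8 express, is typically written as:

```latex
L(u_k) = \max_{(s', s)\,:\,u_k = +1} \bigl( \alpha_{k-1}(s') + \gamma_k(s', s) + \beta_k(s) \bigr)
       - \max_{(s', s)\,:\,u_k = -1} \bigl( \alpha_{k-1}(s') + \gamma_k(s', s) + \beta_k(s) \bigr)
```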


The pipeline stages of the algorithm implementation are shown in Figure 2.4. In the figure—on the right side—the horizontal axis represents time, and the vertical axis represents the decision LLRs applied at the input of the model.

It can be observed that the Forward and Pre-Backward recursive computations start simultaneously, but they operate on two consecutive sliding windows. Therefore, it is clear that before the decoder begins processing, a buffer of at least two windows must be allocated to store the incoming data.

Fig 2.4 Managing data retrieval from the input buffer and pipelining the computations (i.e., the implementation of the sliding-window technique on the Trellis diagram; on the right, the timing schedule for computing the values of β′, β, α, and δ)


Decoder Specifications for the Max-Log-MAP (BCJR) Algorithm


As mentioned earlier, the Max-Log-MAP decoder can be designed and implemented either fully serial or fully parallel. The choice of architecture depends on the processing requirements, target data rate, maximum achievable clock frequency, and other project constraints.

For FPGA-based implementations, parallel architectures are generally preferred. However, a fully parallel structure consumes a large amount of hardware resources. Therefore, to achieve the best practical design, a careful trade-off between resource usage and processing speed must be made.

Design Requirements of the Max-Log-MAP (BCJR) Decoder

The expected requirements for the decoder are:

  • Maximum 500 nanoseconds processing time for each input LLR
  • Sliding window length of 128
  • The decoder must process a data block whose size may be variable in each run
  • Full compatibility with constraint lengths 7 and 9
  • The decoder should be designed with low footprint and low power in mind

Chosen Architecture: Semi-Parallel Implementation

Based on the trade-off between FPGA resource usage and achievable operating frequency, a semi-parallel architecture is selected.

In this structure:

  • All computations are performed through four parallel processing paths.
  • Therefore, in each clock cycle, the calculations required for four metrics are produced simultaneously.
  • Compared to a fully serial design—where only one metric is computed per cycle—this approach provides significantly higher speed and throughput.

Table 2.1 Decoder specifications for the Max-Log-MAP


Decoder Architecture and Implementation


The input LLRs are processed in pairs, as shown in Figure 2.5, which illustrates the high-level architecture of the Max-Log-MAP (BCJR) decoder. For decoding, the input data first enters the input buffer. After performing the required computations and decision-making on the input LLR sequence, the output values are generated.


Fig 2.5 High-level architecture of the Max-Log-MAP (BCJR) decoder


The building units and submodules of this decoder are introduced with a hierarchical structure in Tables 2.1.A–2.1.C, along with brief explanations of the main functional blocks. In this block diagram, the details related to the control sections are omitted for simplicity. Additionally, the names and labels assigned to each implemented HDL submodule exactly match those used in the source code.



Table 2.1.A


Table 2.1.B


Table 2.1.C


Input (Shift) Memory / I-Unit: A temporary buffer for storing incoming LLRs using a dual-port memory. Since the decoder works in a semi-parallel fashion, LLRs are stored and assigned to the B-, A-, and D-Units in the proper processing order.


B-, A-, and D-Units: These units are essentially identical and act as three instances of the same computation module with consistent functionality. They are responsible for calculating branch and state metrics for each incoming LLR.

  • The A-Unit computes forward (Alpha, α) metrics.
  • The B-Unit and D-Unit compute backward (Beta, β) metrics.

To calculate the metrics, the D-Unit first performs a temporary (pre-backward) computation of the β metrics and generates the initial values that the B-Unit uses to finalize the β metrics. Since branch metrics (γ) are required for the state metric calculations, they are also computed within these units. After computation, the metrics are stored in temporary memory for further use.


M-Unit (Beta Metric Storage Unit): The M-Unit is a relatively large memory unit that stores the fully computed metrics generated by the B-Unit for each input LLR window. During LLR output estimation, these stored metrics are retrieved for further calculations.

The LLR output computation unit provides a preliminary estimate of the output bit and LLR value based on the metrics stored in the M-Unit and the metrics computed by the A-Unit. To reduce critical path delays, this unit has been implemented in a fully pipelined architecture.


F-Unit (Data Flow Control Unit): The F-Unit is responsible for managing the sequential reading of LLRs from the input buffer, addressing temporary memory and beta metric storage units, and generating control signals for the parallel processing blocks. It also handles the activation of calculation units, selection of multiplexers, and other control tasks. Additionally, the F-Unit schedules the appropriate timing for providing the initial values of metrics.



I-Unit (Input Buffer): The I-Unit serves as the system’s input buffer and is considered the first stage in the decoder path. This buffer is essential for sequentially managing the incoming data, enabling pipelining, and supporting the implementation of window-based processing techniques.

For the simplest implementation of this buffer, four single-port memories and three four-to-one multiplexers are used, as it is necessary to provide simultaneous access to the buffer. One memory is dedicated to writing, while the remaining three store LLRs for reading, as illustrated in Figure 2.6.


Fig 2.6 Architecture of the input buffer before optimizations


Memory Design Optimization for Sequential Data Access

Due to the inefficiency of the previous high-level design, using two dual-port memories is considered a more suitable option for implementation. This is because the resource utilization of two dual-port memories on an FPGA is lower than using four single-port memories. However, it is still necessary to use multiplexers on the output ports of the memories.

Based on this, another implementation idea was proposed in which only a single dual-port memory is used. This approach reduces the memory processing overhead by approximately half. The principle behind this method is to divide the memory into four equal sections (each matching the LLR window size, assumed to be 128) and manage read/write operations using the two MSBs of the port addresses (a and b). This ensures proper sequential access to each of the four sections.

The internal architecture of this buffer is shown in the figure below. Address generation and multiplexer control for memory access are handled in the F-Unit address path. Using this new approach, there is no longer a need for multiplexers on the output ports of the buffer.
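The addressing scheme can be sketched in Python as follows; `make_addr`, `WINDOW`, and `SECTIONS` are illustrative names, not identifiers from the project's HDL.

```python
WINDOW = 128      # sliding-window length from the article
SECTIONS = 4      # the single memory is split into four quarters

def make_addr(section, offset):
    """Build a port address whose two MSBs select the quarter
    (window section) and whose low bits index within it."""
    assert 0 <= section < SECTIONS and 0 <= offset < WINDOW
    return (section << 7) | offset    # 7 = log2(WINDOW)

# One quarter is being written while the others are read:
write_addr = make_addr(3, 0)
read_addr = make_addr(0, 5)
print(write_addr, read_addr)  # 384 5
```

Because the section index lives in the MSBs, stepping the offset sequentially never crosses into a neighboring quarter, which is what removes the need for output multiplexers.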


Table 2.2

As also indicated in Tables 2.2, upon issuing the first read command, the input LLR data stored in the input buffer is made available on port a for the A-Unit. In the next clock cycle, after issuing the second read command, the LLR values read from ports a and b are sequentially assigned to the B-Unit and D-Unit computational units.

To ensure that all three read LLR values are simultaneously available to the corresponding units, a dedicated register has been placed in the LLR data path to the A-Unit, as illustrated in Figure 2.7.


Fig 2.7 Block diagram of input buffer


The size of the memory used for this buffer should be selected according to the size of the windows and must accommodate 4 windows. This means it should have the capacity to store 512 pairs of LLR values; therefore, the chosen memory size is 8 Kbits.

Each time a new pair of LLRs is written to the input port, the buffer address is properly controlled to ensure that the data is not overwritten.

Calculation: 4 × 128 × 2 × 8 = 8192 bits


Figures 2.8 and 2.9 show the internal architecture of the input buffer and the dual-port memory used in it, respectively.



Fig 2.8 Internal architecture of the input buffer



Fig 2.9 Architecture of the dual-port RAM


This project was a perfect reminder of why I love working with FPGAs: the ability to take a mathematical concept and make it run efficiently in silicon, optimized for speed, resources, and power.

I’ll be sharing more detailed insights (including code snippets, block diagrams, and measurement results) in upcoming posts. Stay tuned!


#BCJR

#viterbi

#Implement_math_algorithm

Written by Ehsan Faraji
