FPGA RTL Coding Style

FPGAs are very versatile and give you a lot of flexibility: IO, logic, embedded memory, DSP, and standard interfaces like PCIe, Ethernet, DDRx and so on. Advanced features like Dynamic Reconfiguration and Partial Reconfiguration allow even more flexibility. The biggest challenges are compile time and timing closure.

Today's FPGAs, like the Agilex 9 or even the Agilex 7 series, are large devices with a few million ALMs, thousands of M20K embedded memory blocks, and DSP blocks, which translates to billions of transistors! Now imagine having to synthesize the design, plan the clocking and IO, place, route, and close timing across 5-7 corners. I'd confidently say 6-8 hours of compile time is excellent. The Altera Quartus Engineering team has made significant progress on reducing compile times and improving out-of-the-box timing closure over the last several years, but we will talk about that in a later post. Today the focus is on RTL coding style that takes advantage of the FPGA architecture to help you achieve timing closure.

Over the years, I have had the opportunity to interact with brilliant engineers. You and I probably see Verilog/VHDL code as it is, RTL, but there are engineers who can "see" the synthesized netlist just by looking at the RTL. These are the engineers who spot RTL coding style issues that hamper timing closure, and who will even rewrite RTL to squeeze out the few extra MHz that determine whether you ship the product on time or not! Cool stuff, eh? Let's see a few examples.

  1. At the top level of your FPGA or at hierarchical boundaries, register all your inputs and outputs, and synchronize all asynchronous resets into their respective clock domains. Unregistered IOs are one of the leading causes of timing closure problems in most designs. Registering the IOs ensures that the IO paths are reg-reg paths, which makes both timing constraints and timing closure easier.
  2. In the FPGA core, ensure that the control signals (address, read/write enable) for embedded memory blocks (M20Ks) are driven by registers. The registers allow the fitter to optimize the design for timing closure by either pulling those registers into the RAM blocks or duplicating them if the RAM blocks are spread across columns.
  3. Altera's embedded RAM blocks have output registers. When writing RTL, ensure that your inferred memories use those output registers. This will ease timing closure.
  4. Altera's Agilex series devices have embedded DSP blocks. These DSPs offer a lot of flexibility, with variable precision or native fixed point options. When using DSPs, ensure that your RTL takes advantage of the available pipeline registers. Another option is to use Altera FPGA IP core reference designs for Multiply Adder, ALTMULT_COMPLEX, LPM_MULT and LPM_DIVIDE. These IP cores make inference and instantiation easy and also help you meet your performance requirements.
  5. Often, we need to compare two signals and then decide the next steps. For example: if a < b, do x, else do y. This can be easily accomplished by subtracting and looking at the sign:
       wire [3:0] a, b;
       wire [4:0] a_sub_b = a - b; // if a < b, then a - b is negative, so the MSB is 1
     The comparison then reduces to a single bit:
       wire a_comp_b = a_sub_b[4]; // now only one bit needs to be checked for the if/else
     The same trick can be used effectively for address decoding too, where decisions have to be made depending on the address range.
  6. Sometimes, we have to count to a certain number and then do something. Say we need to count 16 cycles, from 0 to 15 (4'b0000 to 4'b1111), and then act: if (count == 15) then {do something}. It can be more efficient to code the 4-bit counter as a 16-bit one-hot counter and use the 16th bit alone as the "compare" point instead of comparing 4 bits, i.e. the RTL changes from if (count == 4'b1111) to if (count[15]). This technique certainly needs a lot more flops, but those flops are already available in the FPGA, so why not use them to maximum benefit? The "one-hot" counter is just a shift register.
  7. Another common RTL coding style optimization is with case statements, as shown below:
       casex (sel) // synopsys parallel_case full_case
         8'bxxxxxxx1 : data_out = b0;
         8'bxxxxxx1x : data_out = b1;
         . . . // and so on
     The above casex could be easily rewritten as:
       case (1'b1) // synopsys parallel_case full_case
         sel_reg[0] : data_out = b0;
         sel_reg[1] : data_out = b1;
         // and so on
     The latter coding style results in fewer LUT levels, which means lower data delay and a faster clock speed.
  8. In VHDL, converting a STD_LOGIC_VECTOR to an INTEGER with CONV_INTEGER requires the ieee.std_logic_unsigned.ALL package; without it, users will see a syntax error. Not realizing the true cause of the error, the user in the example below worked around it with a FOR loop when a simple IF would have been enough.

     -- NUM_FRAMES is 850
     FOR ind IN 0 TO (NUM_FRAMES-1) LOOP
         IF (addr(17 DOWNTO 1) = CONV_STD_LOGIC_VECTOR(ind, 17)) THEN
             rdata <= data_in(ind);
         END IF;
     END LOOP;
     END IF; -- closes an enclosing IF that is not shown

     -- after adding the ieee.std_logic_unsigned.ALL package
     VARIABLE index : INTEGER;
     index := CONV_INTEGER(addr(17 DOWNTO 1));
     IF (index < NUM_FRAMES) THEN
         rdata <= data_in(index);
     END IF;

After adding the ieee.std_logic_unsigned package, the modified RTL has 6 fewer LUT levels, which lowers data delay and raises Fmax.
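Going back to items 2, 3, 4 and 6 in the list, here is a minimal Verilog sketch of what those recommendations can look like in RTL. All module, parameter and signal names are hypothetical, and the widths are arbitrary. Whether the tool actually packs a given register into the M20K or DSP block depends on the device and compiler settings, so treat this as an illustration to check against your own fitter reports rather than a reference template.

```verilog
// Hypothetical names throughout; a sketch, not vendor reference code.

// Items 2 and 3: inferred simple dual-port RAM whose address, data and
// write enable are driven by registers, with an extra register on the
// read data. Read latency is 3 clocks from raddr to rdata.
module ram_sdp_reg #(
    parameter ADDR_W = 9,
    parameter DATA_W = 32
) (
    input  wire              clk,
    input  wire              we,
    input  wire [ADDR_W-1:0] waddr,
    input  wire [DATA_W-1:0] wdata,
    input  wire [ADDR_W-1:0] raddr,
    output reg  [DATA_W-1:0] rdata
);
    reg [DATA_W-1:0] mem [0:(1<<ADDR_W)-1];

    reg              we_r;
    reg [ADDR_W-1:0] waddr_r, raddr_r;
    reg [DATA_W-1:0] wdata_r;
    reg [DATA_W-1:0] mem_q;

    always @(posedge clk) begin
        // Registers driving the RAM control and data inputs
        we_r    <= we;
        waddr_r <= waddr;
        wdata_r <= wdata;
        raddr_r <= raddr;

        if (we_r)
            mem[waddr_r] <= wdata_r;

        mem_q <= mem[raddr_r]; // synchronous read
        rdata <= mem_q;        // candidate for the RAM output register
    end
endmodule

// Item 4: multiplier with registers on the inputs, the product and the
// output, giving the tool pipeline stages to map onto the DSP block.
module mult_pipe #(
    parameter W = 18
) (
    input  wire                  clk,
    input  wire signed [W-1:0]   a,
    input  wire signed [W-1:0]   b,
    output reg  signed [2*W-1:0] p
);
    reg signed [W-1:0]   a_r, b_r;
    reg signed [2*W-1:0] m_r;

    always @(posedge clk) begin
        a_r <= a;          // input registers
        b_r <= b;
        m_r <= a_r * b_r;  // pipeline register after the multiply
        p   <= m_r;        // output register
    end
endmodule

// Item 6: 16-state one-hot (ring) counter. onehot[15] is high in the
// same cycle in which a 4-bit binary counter would equal 4'b1111.
module onehot_count16 (
    input  wire clk,
    input  wire rst,   // synchronous, active high
    output wire done
);
    reg [15:0] onehot;

    always @(posedge clk) begin
        if (rst)
            onehot <= 16'h0001;                    // state 0
        else
            onehot <= {onehot[14:0], onehot[15]};  // rotate left by one
    end

    assign done = onehot[15];  // single-bit "compare" point
endmodule
```

The extra registers add latency (three clocks through each of the RAM and multiplier sketches), so the surrounding logic has to account for that.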

Conclusion:

The Quartus Prime Pro design software provides all the tools you need to identify RTL coding style issues. In the synthesis stage, review and address all critical warnings and warnings, and aim to have none left wherever possible. The synthesis engine also has a robust DRC report; review and address the DRC violations to minimize timing closure challenges. Use the RTL Viewer and the Technology Map Viewer to review the logic levels in your post-synthesis netlist.

Irrespective of the technology node, assume 500 ps of delay per logic level. So if you want your design to run at 500 MHz, the majority of your timing paths need to have fewer than 4 levels of logic.
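As a quick sanity check of that rule of thumb, the logic-level budget can be worked out as follows. The symbols are assumptions introduced here for illustration: t_level is the 500 ps per-level delay from the text, and t_ovh is a hypothetical lump for register clock-to-output delay, setup time and clock skew, which the rule of thumb does not quantify.

```latex
\begin{align*}
T_{clk} &= \frac{1}{f_{clk}} = \frac{1}{500\ \text{MHz}} = 2\ \text{ns} \\
N_{max} &= \left\lfloor \frac{T_{clk} - t_{ovh}}{t_{level}} \right\rfloor
         = \left\lfloor \frac{2000\ \text{ps} - t_{ovh}}{500\ \text{ps}} \right\rfloor \\
N_{max} &= 4 \quad \text{if } t_{ovh} = 0, \qquad
N_{max} = 3 \quad \text{if } 0 < t_{ovh} \le 500\ \text{ps}
\end{align*}
```

With any nonzero register overhead the budget drops below 4 levels, which is why the target is fewer than 4 rather than exactly 4.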

In the High-Effort compiler setting, the Quartus Prime Pro design software will aggressively retime, both forward and backward. But for retiming to happen, there needs to be sufficient slack before or after the critical path, that is, the path that is actually failing timing.

Reach out to your FAE for help. In most cases, the FAEs will work with you to rewrite your RTL and constraints to achieve timing closure.

I also encourage you to submit your design to Altera. Submitting your design with the RTL lets us add it to our daily and weekly regression suites and fine-tune the algorithms so that you can achieve timing closure out-of-the-box. These regressions help us monitor the QoR (Fmax, compile time, memory) and debug outliers.

To submit your design, reach out to your Altera FAE.


