Xilinx DMA PCIe tutorial - Part 2
In Part 1 of this tutorial I went over the basic issues related to DMA and covered the applicable solutions (at least the ones I found; I'm sure there are more).
In this second part I will dive deeper into the implementation, starting with the block diagram of my design. The image below gives a high-level view of the design, including all main blocks and how they connect to the main XDMA IP core.
The block diagram in the figure above shows the full design of a basic PCIe DMA with Scatter/Gather mode and the Descriptor Bypass feature enabled (one channel).
I will now go over the main blocks and give tips, tricks and insights on how to do it right and achieve remarkably high throughput.
XDMA - block 1
After importing the “DMA/Bridge Subsystem for PCIe” block into the Vivado block design, double-click it to open the IP customization window (shown in the first screenshot). I will now go over all the tabs and explain most of the options and which values are optimal.
Basic tab:
After configuring the PCIe link speed (8 GT/s) and the lane width (x8), the AXI Data Width changes automatically to 256 bits and the AXI Clock Frequency to 250 MHz.
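Just to get a feel for why the core picks those numbers, here is a quick back-of-the-envelope check I find useful; it assumes Gen3 8 GT/s lanes with 128b/130b encoding and ignores protocol overhead, so it is only a sanity check, not a performance claim:

#include <stdio.h>

/* Back-of-the-envelope PCIe Gen3 x8 vs. AXI bandwidth check.
 * Assumptions: 8 GT/s per lane, 128b/130b line encoding, 8 lanes,
 * AXI data width 256 bits at 250 MHz (protocol overhead ignored). */
int main(void)
{
    const double gts_per_lane = 8e9;           /* 8 GT/s raw line rate  */
    const double encoding     = 128.0 / 130.0; /* Gen3 128b/130b coding */
    const int    lanes        = 8;

    double pcie_GBps = gts_per_lane * encoding * lanes / 8.0 / 1e9;
    double axi_GBps  = 256.0 / 8.0 * 250e6 / 1e9;

    printf("PCIe Gen3 x8 raw payload capacity: ~%.2f GB/s\n", pcie_GBps);
    printf("AXI 256-bit @ 250 MHz capacity:    ~%.2f GB/s\n", axi_GBps);
    return 0;
}

So a 256-bit AXI bus at 250 MHz (8 GB/s) is just wide enough to keep up with the roughly 7.9 GB/s the Gen3 x8 link can deliver.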
Next, I selected:
a. AXI Stream checkbox, as I wanted continuous DMA data flow.
b. AXI-Lite Slave interface checkbox, which enables me to access the DMA core's internal registers, whether it's the performance counters, the number of processed descriptors, etc.
Regarding the 4 checkboxes at the bottom:
The PIPE checkbox generates a core that can be simulated with the PIPE interfaces connected (I'll explain this later on).
The DRP (Dynamic Reconfiguration Port) checkbox allows dynamically changing the parameters of the transceivers and common primitives. It has a processor-friendly interface with an address bus, data bus and control signals. I've left it unchecked.
The Additional Transceiver Control and Status Ports checkbox adds transceiver debug ports. The IP documentation advises changing them in accordance with the GT user guide.
I've left all of these unchecked (the default state).
PCIe ID Tab
This tab holds information on the PCIe endpoint (the Xilinx FPGA). The user can change all the fields. Obviously, since the driver communicates with the PCIe endpoint, the device ID (at least) must match the device ID used in the driver code.
The Vendor ID identifies the device manufacturer. The PCI-SIG website has a list of the various vendor IDs (here), and the driver interprets the vendor ID accordingly. The default value of this field is 0x10EE, and looking at the website we can see there's a match:
The user can insert any value, but as good engineering practice it is better to use a known one. It only affects how the driver maps this number to a specific vendor. For example, inserting the value 0x1172 makes the driver identify the PCIe endpoint as Intel (Altera), whereas the value 0x10EE will be identified as Xilinx. Other than that, there is no further implication for the ongoing work.
A great tool that helped me a lot during the debug phase on Linux was devmem2. It lets the user read/write specific physical memory locations without any dedicated program (running it from an SSH terminal). This is a great advantage of Linux over Windows, since you get access to memory locations that Windows users are not allowed to touch.
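For reference, here is a minimal sketch of what devmem2 essentially does under the hood: it mmap()s /dev/mem and dereferences the requested physical address. It needs root, and the address below is just a placeholder; replace it with whatever BAR address you want to peek at (for example as reported by lspci -v):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder physical address - replace with the address you want to read. */
    unsigned long phys_addr = 0xA0000000UL;

    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    unsigned long page = (unsigned long)sysconf(_SC_PAGESIZE);
    unsigned long base = phys_addr & ~(page - 1);      /* page-align the mapping */
    void *map = mmap(NULL, page, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, (off_t)base);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    volatile uint32_t *reg =
        (volatile uint32_t *)((char *)map + (phys_addr - base));
    printf("0x%lx = 0x%08x\n", phys_addr, *reg);          /* read one 32-bit word */

    munmap(map, page);
    close(fd);
    return 0;
}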
PCIe: BARs tab
The next tab relates to the various PCIe BARs. The default values and checkboxes are as follows:
BARs (Base Address Registers) define the memory addresses and I/O port addresses (memory-mapped I/O) used by a PCI device: the size and start address of a contiguous region mapped into system memory or I/O space.
The endpoint (our FPGA in this case) requests a contiguous region of a given size, which is then mapped by the host memory manager, and the corresponding BAR0/1/2 field in the endpoint's PCI configuration space is programmed with the base address of that region.
Xilinx has a great explanation about BARs in AR65062.
This whole process is carried out at the lower levels (PCIe enumeration, BIOS, driver, etc.), so the common user need not intervene in it. Nonetheless, since these BARs have implications for our design (see the next paragraph), the user should decide what to define in these fields.
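If you want to see how the host actually mapped the endpoint's BARs after enumeration, one easy way on Linux is to read the device's resource file in sysfs, where each line holds the start address, end address and flags of one BAR. The device address 0000:01:00.0 below is just an example; adjust it to your own board:

#include <stdio.h>

/* Print the BAR layout the host assigned to a PCIe endpoint.
 * The BDF below (0000:01:00.0) is an example - adjust to your board. */
int main(void)
{
    const char *path = "/sys/bus/pci/devices/0000:01:00.0/resource";
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return 1; }

    unsigned long long start, end, flags;
    int bar = 0;
    while (fscanf(f, "%llx %llx %llx", &start, &end, &flags) == 3) {
        if (start || end)
            printf("BAR%d: base 0x%llx, size %llu bytes\n",
                   bar, start, end - start + 1);
        bar++;
    }
    fclose(f);
    return 0;
}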
There are various checkboxes available. The user manual defines them well.
Master AXI Lite:
This interface relates to the memory-mapped register interface. According to the PG195 manual:
I've checked this checkbox, as I wanted a memory-mapped register interface available; in other words, for a register interface to the outside world, this checkbox must be checked.
Master AXI Bypass:
Similar to the Master AXI Lite interface, but full AXI. I did not check it, as AXI-Lite was enough for my implementation.
Slave AXI Lite:
DMA/Bridge Subsystem for PCIe registers (i.e., the internal registers of the core) can be accessed from the host or from the AXI-Lite Slave interface. These registers should be used for programming the DMA and checking status. I’ve checked this checkbox.
Moving on, the next fields are the ‘Size’, ‘Value’ and ‘PCIe to AXI Translation’ fields; the ‘Size’ and ‘Value’ fields set the space allocated for this BAR. For example, looking at my configuration below, you can see I've allocated 64MB of memory for the ‘PCIe to AXI Lite Master Interface’, which means the mapped register address space can be up to 64MB (so the address space is 0x0000000 to 0x3FFFFFF).
The ‘PCIe to AXI Translation’ field translates the PCIe address into AXI territory. No matter what address the host uses to place the PCIe BAR within the host address space, any host access to that BAR will be translated, in our case (the default value of 0), to a base address of 0 in AXI space. This is fine if the BAR size is large enough, but when multiple AXI peripherals need to be reached through the same BAR it could limit them and cause issues.
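To make the translation arithmetic concrete, here is a small sketch. The BAR base 0xF7000000 assigned by the host is purely an example, and the BAR size and translation value match the configuration described above:

#include <stdint.h>
#include <stdio.h>

/* Illustration of the 'PCIe to AXI Translation' arithmetic.
 * Assumptions: 64 MB BAR, translation value 0 (the default);
 * the host-assigned BAR base 0xF7000000 is purely an example. */
int main(void)
{
    const uint64_t bar_base    = 0xF7000000ULL; /* assigned by the host/BIOS        */
    const uint64_t bar_size    = 64ULL << 20;   /* 64 MB -> offsets 0x0..0x3FFFFFF  */
    const uint64_t translation = 0x0ULL;        /* 'PCIe to AXI Translation' value  */

    uint64_t host_access = 0xF7001000ULL;       /* some host access into the BAR    */
    uint64_t offset      = host_access - bar_base;

    if (offset < bar_size)
        printf("host 0x%llx -> AXI 0x%llx\n",
               (unsigned long long)host_access,
               (unsigned long long)(translation + offset));
    return 0;
}

In other words, the host address only selects an offset within the BAR; the AXI side always sees translation value + offset, which is why a non-zero translation value is handy when several AXI peripherals sit behind the same BAR.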
Now we're left with the last two checkboxes: ‘Prefetchable’ and ‘64bit Enable’. The Prefetchable option enables faster transactions between the CPU and the memory: the region is marked as prefetchable, so the CPU can read from it ahead of time as an optimization.
Taken from PCI Express Base Specification:
So, you guessed right, I checked both…
Putting it all together, in my project this tab looks like this:
XDMA downsides
The Xilinx XDMA, even though very easy to implement and very straightforward, does have a few drawbacks. They are not deal-breakers from my point of view, but the average user should know them before starting to work with this core:
1. There are only 4 RX DMA channels and only 4 TX DMA channels. This means if you need more than 4 for your design, you cannot use this solution.
2. The XDMA is a Xilinx wrapper around the PCIe bridge, as simple as that. This means that if you do want to implement further enhancements (like adding more channels), it cannot be done, since everything under the hood is hidden from the user. This may be sufficient for the average user, but when thinking ahead to a more sophisticated DMA implementation, it is a show-stopper.
Other than that, Xilinx did a great job with this core. It is simple to use and easy to implement in your designs. The addition of the dma_bridge_resetn feature is a life-saver for the cases where you did something wrong and want to reset everything except the link you're using. This is an example of how Xilinx made an effort to ease the implementation phase with the XDMA for the average user.
MISC Tab
In this tab I did not alter anything. I decided not to use interrupts in my design, as polling is much preferred in terms of bandwidth. You can watch Jason's PCIe tutorial at the specific timestamp showing how much better performance is with polling compared to interrupts.
DMA Tab
In this tab I've defined the Read and Write channels, 4 channels each (the maximum). I did not change the “Number of Request IDs for Read channel” and “Number of Request IDs for Write channel” fields (the defaults are as shown), even though I wanted a very simple example design with only one master; the minimum number of IDs supported is 2, so I could have lowered them, but I left the defaults.
Furthermore, Xilinx has a nice feature called Descriptor Bypass, which enables high performance and bandwidth. I've checked it, as I wanted the highest performance from my setup. Descriptor Bypass means the descriptors are handled by the hardware (user logic) rather than by software or the driver. The implication is that the user must write their own logic for this mechanism, and I warn you it is not straightforward. Enabling this feature adds input ports named dsc_bypass_h2c/c2h.
DMA Status Ports – I’ve checked it as it could help in my logic implementation.
So, before updating it, this tab looked like this:
And after changing all the checkboxes as described above, the core looks much more interesting, not to mention complicated:
A few tips regarding controller design with PG195
To save you from going over all 150 pages, here are some references to the most important and interesting parts of the manual.
Descriptor Bypass Ports
Tables 33 and 34 in PG195 show the ports in charge of the Descriptor Bypass feature. These tables, together with Table 8 (which I'll cover in Part 3), will ease the design phase towards a working PCIe DMA with the Descriptor Bypass feature.
Pay attention to the fact that if you're using Descriptor Bypass, like I was, any reference to the SGDMA is irrelevant. Also, I want to emphasize that in my design I did not use MSI and IRQ for the DMA, since I wanted better performance. I explained this earlier and again refer you to Jason's PCIe tutorial at the specific timestamp showing how much better polling performs compared to interrupts.
DMA Channel Control
Moving on to Table 40 (for H2C) and Table 59 (for C2H) - here is part of the H2C table:
The Run bit is obviously the one you would want to control in your design. Most of the others are used for logging.
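As an illustration, here is a hedged sketch of toggling that Run bit from the host through the mmap'd DMA register BAR. The offsets (H2C channel target at 0x0000, Control register at byte offset 0x04, Run as bit 0) follow my reading of the PG195 register space, so double-check them against your manual version, and the sysfs path assumes the DMA config BAR is resource1 of device 0000:01:00.0, which is board-specific:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumed offsets - verify against PG195 for your core version. */
#define H2C_CH0_CTRL 0x0004u          /* H2C channel 0 Control register */
#define CTRL_RUN     (1u << 0)        /* Run bit                        */

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource1",
                  O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *regs = mmap(NULL, 0x10000, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    regs[H2C_CH0_CTRL / 4] |= CTRL_RUN;   /* start the H2C engine */
    /* ... transfer runs, poll status / performance counters ...  */
    regs[H2C_CH0_CTRL / 4] &= ~CTRL_RUN;  /* stop the engine      */

    munmap((void *)regs, 0x10000);
    close(fd);
    return 0;
}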
DMA test
The DMA test is a nice feature you may want to implement at the end of your design phase. Obviously, you would want to test your design mainly in terms of throughput.
For this purpose, Tables 52-56 (for H2C) and Tables 71-75 (for C2H) come in handy.
The idea here is very simple, and I'll explain it for H2C, as it is exactly the same for C2H. Just set the Run bit mentioned in Table 52, then pass a predefined data file (a counter, for example) from your host towards the board (when testing the H2C direction), read the cycle count and the data count mentioned in Tables 53 and 55, respectively, and divide them to get the full throughput. Just pay attention to the fact that the data count mentioned in Table 55 is in beats (that is, 4 bytes) and not in bytes, so take that into account in your calculations.
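Putting the numbers together, the calculation boils down to the sketch below. The 4-bytes-per-count unit is taken from the paragraph above and the 250 MHz clock from the Basic tab; the two counter values are made-up examples, not real measurements:

#include <stdio.h>

/* Convert the XDMA performance counters into throughput.
 * Assumptions: data count unit = 4 bytes (as noted above), cycle
 * counter clocked at the 250 MHz AXI clock. The two sample values
 * are made-up examples, not real measurements. */
int main(void)
{
    const double clk_hz          = 250e6;
    const double bytes_per_count = 4.0;

    unsigned long long cycle_count = 1000000ULL; /* from the Table 53 register */
    unsigned long long data_count  = 1900000ULL; /* from the Table 55 register */

    double seconds = cycle_count / clk_hz;
    double bytes   = data_count * bytes_per_count;

    printf("Throughput: %.2f GB/s\n", bytes / seconds / 1e9);
    return 0;
}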
dma_bridge_resetn port
Now, after going over all the tabs, there's one small item I wanted to mention, which helped me a lot during the debug phase. This is the dma_bridge_resetn port, mentioned in Table 13:
This optional pin allows the user to reset all internal registers of the XDMA core. During the debug phase there were many times I had to reboot the board because the core/board got stuck (as a result of bad logic design), and then, while going over the manual, I stumbled upon this pin.
Chapter 4 of the PG195 manual explains how to enable it (by default it is disabled). You must read that chapter carefully, as not following the rules mentioned there will cause the whole design to malfunction (data corruption, etc.). I've highlighted the most important point here, since it caused me the biggest headache: I asserted this pin before the PCIe transaction had actually finished. It took me a long time just to realize what I'd done, so be careful!
PIPE
Even though PIPE is just a checkbox, it deserves a thorough explanation. To accelerate verification and device development time for PCIe-based sub-systems, the PIPE (PHY Interface for PCI Express) architecture was defined by Intel. The core supports it to enable faster simulation. What it does under the hood is remove the PCIe transceivers from the simulation, which can significantly speed up simulation time.
These two figures from the Xilinx tutorial on Mentor QVIP explain it:
When not using PIPE:
When using PIPE:
Since I received my board quite fast, I did not have to use this feature, but just to give an idea, a full PCIe simulation takes about 20 minutes until the user receives a ‘Link up’ from the simulator and can start a data transmission. Think how annoying it would be to wait that long every time you change something in the code.
The Xilinx AXI BFM was discontinued as of December 1, 2016 (read it here) and is not supported after Vivado 2016.4, so if you're using a later version you cannot use the AXI BFM.
Going back to verification IP, I'm sure there are many VIP solutions out there, so I'll name a few I've found:
XAPP1184 is a nice application note with a link to download a free BFM for PCIe simulations from Avery-designs.com. It uses Cadence IES (which is not free). Xilinx, by the way, has a tutorial here on how to work with Cadence IES and Vivado.
Mentor Graphics Questa Verification IP (QVIP) has another PIPE solution and there’s a Xilinx tutorial on how to use it. It uses the Questa Verification tool.
Follow Part 3 of my tutorial to dive deeper into the various blocks in the design.