I need to communicate with an FPGA device based on an AXI burst interface. What are the ways to access such a device from Linux without involving a DMA? Burst is an intrinsic property of the AXI standard and should typically be triggered automatically when large amounts of data are transferred. The bigger problem is that the FPGA is designed to respond only to burst-type requests over the AXI bus, which causes serious issues on Linux when the application attempts a sequential copy. I have already tried memcpy and it doesn't work.
The AXI protocol is burst-based. The master begins each burst by driving control information and the address of the first byte in the transaction to the slave. As the burst progresses, the slave must calculate the addresses of subsequent transfers in the burst. A burst must not cross a 4KB address boundary.
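The 4KB rule can be illustrated with a small sketch that splits a transfer into pieces that each stay inside one 4KB page. The helper names (`axi_chunk_len`, `axi_count_chunks`) are hypothetical, and the sketch deliberately ignores the separate limit on the number of beats per burst.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define AXI_BOUNDARY 4096u /* a burst must not cross a 4KB boundary */

/* Hypothetical helper: number of bytes, starting at addr, that fit in one
 * burst without crossing the next 4KB boundary (capped at remaining). */
static size_t axi_chunk_len(uint64_t addr, size_t remaining)
{
    assert(remaining > 0);
    size_t to_boundary = (size_t)(AXI_BOUNDARY - (addr % AXI_BOUNDARY));
    return remaining < to_boundary ? remaining : to_boundary;
}

/* Hypothetical helper: how many boundary-respecting pieces are needed
 * to cover the range [addr, addr + len). */
static unsigned axi_count_chunks(uint64_t addr, size_t len)
{
    unsigned n = 0;
    while (len > 0) {
        size_t c = axi_chunk_len(addr, len);
        addr += c;
        len -= c;
        n++;
    }
    return n;
}
```

For example, a 64-byte transfer starting at 0x0FF0 has to be split into a 16-byte piece and a 48-byte piece, because 0x1000 is a 4KB boundary.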
The AXI protocol defines three burst types: fixed burst, incrementing burst, and wrapping burst.
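As a rough illustration of how the three burst types differ, the sketch below computes the address of each beat, which is the calculation the slave has to perform. It assumes a start address aligned to the beat size and, for wrapping bursts, a length of 2, 4, 8, or 16 beats. The names `axi_burst_t` and `axi_beat_addr` are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { AXI_FIXED, AXI_INCR, AXI_WRAP } axi_burst_t;

/* Address of beat n (0-based) in a burst.
 * size = bytes per beat, len = number of beats in the burst.
 * Assumes start is aligned to size. */
static uint64_t axi_beat_addr(axi_burst_t type, uint64_t start,
                              unsigned size, unsigned len, unsigned n)
{
    assert(size > 0 && len > 0 && n < len);
    switch (type) {
    case AXI_FIXED:
        /* Every beat targets the same address (e.g. a FIFO port). */
        return start;
    case AXI_INCR:
        /* Each beat advances by the beat size. */
        return start + (uint64_t)n * size;
    case AXI_WRAP: {
        /* Addresses increment, then wrap back to the lower edge of a
         * window of size * len bytes that contains the start address. */
        uint64_t span = (uint64_t)size * len;
        uint64_t base = (start / span) * span;
        return base + (start - base + (uint64_t)n * size) % span;
    }
    }
    return start;
}
```

With 4-byte beats and a 4-beat wrapping burst starting at 0x38, the beats land at 0x38, 0x3C, 0x30, and 0x34.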
I assume your “FPGA device” is a custom block, memory-mapped over AXI interface to Cortex-A9. I think there are 2 or 3 ways you could make this work.
1) Cacheable mapping. The cache hardware transfers an entire cache line at a time as a burst. You would need to manually clean the cache (after writes) and invalidate it (before reads).
2) Non-cacheable mapping, and have an ARM assembly language routine handle the low-level transfer. I think the “Load and Store Multiple registers” instructions can provide what you are looking for.
I had a similar problem where an AXI peripheral (custom memory controller) needed to be accessed with 8-byte transfers from Cortex-A9 processor. The usual ARM instructions, of course, transfer 1, 2, or 4 bytes (byte, halfword, word). Those worked through cacheable mapping, but not through non-cacheable mapping. LDM/STM, 2 words at a time, worked with both mappings.
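A minimal sketch of the LDM/STM approach, two words at a time, might look like the following. It uses GCC-style inline assembly on 32-bit ARM and falls back to plain word accesses on other targets so that it still compiles there. The function names are hypothetical, and whether the access actually appears as a burst on the bus depends on the mapping and the interconnect, so treat this as a starting point rather than a guaranteed fix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store two 32-bit words with one STM instruction (32-bit ARM only). */
static inline void store_pair(volatile uint32_t *dst, uint32_t lo, uint32_t hi)
{
#if defined(__arm__)
    /* STM needs an ascending register list, so pin the values to r2, r3. */
    register uint32_t r_lo __asm__("r2") = lo;
    register uint32_t r_hi __asm__("r3") = hi;
    __asm__ volatile("stmia %0, {%1, %2}"
                     : /* no outputs */
                     : "r"(dst), "r"(r_lo), "r"(r_hi)
                     : "memory");
#else
    /* Fallback for non-ARM builds: two separate word stores. */
    dst[0] = lo;
    dst[1] = hi;
#endif
}

/* Load two 32-bit words with one LDM instruction (32-bit ARM only). */
static inline void load_pair(const volatile uint32_t *src,
                             uint32_t *lo, uint32_t *hi)
{
#if defined(__arm__)
    register uint32_t r_lo __asm__("r2");
    register uint32_t r_hi __asm__("r3");
    /* Early-clobber outputs keep the base pointer out of r2 and r3. */
    __asm__ volatile("ldmia %2, {%0, %1}"
                     : "=&r"(r_lo), "=&r"(r_hi)
                     : "r"(src)
                     : "memory");
    *lo = r_lo;
    *hi = r_hi;
#else
    *lo = src[0];
    *hi = src[1];
#endif
}

/* Copy `pairs` 8-byte units using the pair helpers above. */
static void copy_pairs(volatile uint32_t *dst, const volatile uint32_t *src,
                       size_t pairs)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < pairs; i++) {
        uint32_t lo, hi;
        load_pair(src + 2 * i, &lo, &hi);
        store_pair(dst + 2 * i, lo, hi);
    }
}
```

In a real driver, `dst` or `src` would point into the mapped device region rather than ordinary memory.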
AHB/AXI transfer modes are implementation dependent, of course. Per your description, you need INCR or WRAP modes rather than SINGLE. But it should not have to be that way. That brings up the third way you could make this work:
3) Talk with your digital hardware designer, make him aware of the software impact of his implementation.
In my opinion, you shouldn’t have to do unusual or custom low-level MMU operations. Linux has high-level methods for this: you would put standard hooks in your device driver and/or board.c. The main choice is whether to map the region uncached (i.e. COHERENT). Refer to LDD3.
You need to set up the MMU to inform the hardware of the capabilities of your component. You also need to ensure that the entire interconnect supports bursts and isn't doing any conversions (which can happen if there's any ambiguity about your component's capabilities when the interconnect is generated).
To set up the MMU you perform a call like this:
/* shareable device: S=b0 TEX=b000 AP=b11, C=b0, B=b1 = 0xC06*/
Xil_SetTlbAttributes(COMPONENT_BASE_ADDRESS, 0xC06);
The C and B cache attribute bits are defined as follows (from the Zynq Technical Reference Manual):
Encoding Bits   Cache Attribute
  C   B
  0   0         Non-cacheable
  0   1         Write-back, write-allocate
  1   0         Write-through, no write-allocate
  1   1         Write-back, no write-allocate
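Note that the comment on the call is the accurate description of 0xC06: with TEX=b000, the combination C=0, B=1 selects shareable Device memory, not write-back, write-allocate. The table above describes the cache policy encoding that applies when TEX[2] is set, so a write-back, write-allocate mapping (which may give you burst access on write) needs a different attribute value.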
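To check what a given attribute value encodes, the individual fields can be unpacked. The sketch below assumes the ARMv7-A short-descriptor section entry layout (B at bit 2, C at bit 3, AP[1:0] at bits 11:10, TEX at bits 14:12, S at bit 16). The struct and function names are hypothetical and are not part of the Xilinx BSP.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a section entry that matter for the memory type. */
typedef struct {
    unsigned s;   /* shareable bit */
    unsigned tex; /* TEX[2:0] */
    unsigned ap;  /* AP[1:0] access permissions */
    unsigned c;   /* cacheable bit */
    unsigned b;   /* bufferable bit */
} tlb_attr_t;

/* Hypothetical helper: split an attribute word, such as the second
 * argument of Xil_SetTlbAttributes, into its bit fields. */
static tlb_attr_t decode_section_attr(uint32_t attr)
{
    /* Bit 1 set marks a section (or supersection) entry. */
    assert((attr & 0x2u) != 0);
    tlb_attr_t f;
    f.b   = (attr >> 2) & 0x1u;
    f.c   = (attr >> 3) & 0x1u;
    f.ap  = (attr >> 10) & 0x3u;
    f.tex = (attr >> 12) & 0x7u;
    f.s   = (attr >> 16) & 0x1u;
    return f;
}
```

Decoding 0xC06 this way gives S=0, TEX=b000, AP=b11, C=0, B=1, which matches the comment on the call.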
See also Xilinx AR#47406 and this forum post.