The DMA Engine is a Linux kernel subsystem that provides a standardized framework for DMA (Direct Memory Access) controllers. It allows device drivers to offload memory copy, memset, and scatter-gather operations to dedicated DMA hardware without CPU intervention.
// Step 6: Wait for completion (polling or callback)
enum dma_status status = dma_async_is_tx_complete(chan, cookie, ...);

// Step 7: Release channel when done
dma_release_channel(chan);
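For reference, the completion check takes two out-parameters for the last-completed and last-used cookies; a hedged polling sketch (busy-waiting shown only for brevity):

dma_cookie_t last, used;
enum dma_status status;

do {
        status = dma_async_is_tx_complete(chan, cookie, &last, &used);
} while (status == DMA_IN_PROGRESS);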
DMA Capabilities
The DMA Engine supports multiple capabilities; a consumer requests a channel by capability mask, as sketched after this list:
DMA_MEMCPY - Memory to memory copy
DMA_MEMSET - Memory fill/set
DMA_SLAVE - Slave/device DMA (peripheral to memory)
DMA_CYCLIC - Cyclic DMA for streaming (audio, etc.)
DMA_XOR - XOR operations (RAID)
DMA_PQ - P+Q calculations (RAID6)
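A minimal channel-request sketch (error handling trimmed; the NULL filter arguments mean "any matching channel"):

dma_cap_mask_t mask;
struct dma_chan *chan;

dma_cap_zero(mask);
dma_cap_set(DMA_CYCLIC, mask);                /* ask for a cyclic-capable channel */

chan = dma_request_channel(mask, NULL, NULL); /* no custom filter function */
if (!chan)
        return -ENODEV;                       /* no suitable channel available */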
Transfer Directions (for Slave DMA)
DMA_MEM_TO_MEM - Memory to memory
DMA_MEM_TO_DEV - Memory to device
DMA_DEV_TO_MEM - Device to memory
DMA_DEV_TO_DEV - Device to device
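For slave DMA, the direction is paired with a dma_slave_config describing the peripheral side; a minimal sketch (fifo_phys_addr is a hypothetical peripheral FIFO address):

struct dma_slave_config cfg = { };

cfg.direction      = DMA_MEM_TO_DEV;              /* memory -> peripheral */
cfg.dst_addr       = fifo_phys_addr;              /* hypothetical FIFO address */
cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;  /* peripheral register width */
cfg.dst_maxburst   = 4;                           /* words per burst */

dmaengine_slave_config(chan, &cfg);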
DMA_CYCLIC
● DMA_CYCLIC - Cyclic DMA for Streaming
What is Cyclic DMA?
Cyclic DMA is a special DMA mode where the DMA controller automatically loops through a series of buffer descriptors in a circular fashion, continuously transferring data without CPU intervention after the initial setup.

Key Characteristics
┌──────────────────────┬─────────────────────────────────────────┐
│ Feature              │ Description                             │
├──────────────────────┼─────────────────────────────────────────┤
│ Circular Buffer      │ Descriptors form a ring/loop            │
├──────────────────────┼─────────────────────────────────────────┤
│ Continuous Operation │ Runs indefinitely until stopped         │
├──────────────────────┼─────────────────────────────────────────┤
│ Periodic Callbacks   │ Can interrupt at segment boundaries     │
├──────────────────────┼─────────────────────────────────────────┤
│ No Resubmission      │ No need to constantly requeue transfers │
└──────────────────────┴─────────────────────────────────────────┘
// Illustrative period-elapsed callback for an ALSA PCM stream
static void audio_dma_callback(void *data)
{
        struct snd_pcm_substream *substream = data;

        // Called at each period boundary
        // ALSA updates its hardware pointer
        snd_pcm_period_elapsed(substream);

        // No need to resubmit - DMA continues to the next period automatically
}
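That callback is attached once, when the cyclic descriptor is prepared; a hedged sketch (audio_dma_callback matches the illustrative function above):

desc = dmaengine_prep_dma_cyclic(chan, buf_addr, buf_size, period_size,
                                 DMA_MEM_TO_DEV, DMA_PREP_INTERRUPT);
desc->callback       = audio_dma_callback;  /* runs at each period boundary */
desc->callback_param = substream;           /* handed back to the callback */
dmaengine_submit(desc);
dma_async_issue_pending(chan);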
Buffer Layout Example
For a 4-period audio buffer:
Time ─────────────────────────────────────▶
Buffer:   [ Period 0 ][ Period 1 ][ Period 2 ][ Period 3 ]
               │           │           │           │
               ▼           ▼           ▼           ▼
IRQ:      ●──────────●──────────●──────────●──────────▶
DMA Head: ───────────────────────► (wraps to Period 0)
Key Differences: Cyclic vs. Scatter-Gather
┌──────────────┬──────────────────────────┬────────────────────────────┐
│ Aspect       │ Cyclic DMA               │ Scatter-Gather DMA         │
├──────────────┼──────────────────────────┼────────────────────────────┤
│ Pattern      │ Circular, repeating      │ Linear, one-shot           │
├──────────────┼──────────────────────────┼────────────────────────────┤
│ Duration     │ Continuous until stopped │ Single transfer            │
├──────────────┼──────────────────────────┼────────────────────────────┤
│ CPU Overhead │ Very low (no resubmit)   │ Higher (resubmit needed)   │
├──────────────┼──────────────────────────┼────────────────────────────┤
│ Use Case     │ Streaming (audio, UART)  │ Block transfers (file I/O) │
├──────────────┼──────────────────────────┼────────────────────────────┤
│ Interrupts   │ Periodic (optional)      │ End-of-transfer            │
└──────────────┴──────────────────────────┴────────────────────────────┘
Important Considerations
Buffer Must Be DMA-Coherent
// Use coherent allocation for cyclic buffers
void *buf = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
Period Size Matters
- Smaller periods = more frequent interrupts = lower latency but higher CPU overhead
- Larger periods = fewer interrupts = higher latency but less CPU overhead
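As a worked example: 48 kHz stereo 16-bit audio streams at 48000 × 4 = 192000 bytes/s, so a 1024-byte period raises an interrupt roughly every 5.3 ms, while a 16384-byte period raises one only every ~85 ms.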
No Rewind
- Once started, cyclic DMA flows forward continuously
- To "rewind", you must terminate and restart

Memory Must Be Contiguous
- Cyclic DMA typically requires a physically contiguous buffer
- Cannot use standard dma_map_sg() with scattered pages
static int snd_soc_pcm_lib_trigger(struct snd_pcm_substream *substream, int cmd)
{
        struct dma_chan *chan = snd_dmaengine_pcm_get_chan(substream);

        switch (cmd) {
        case SNDRV_PCM_TRIGGER_START:
                dma_async_issue_pending(chan);
                break;
        case SNDRV_PCM_TRIGGER_STOP:
                dmaengine_terminate_all(chan);
                break;
        }
        return 0;
}
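As a side note, recent kernels split termination into a non-blocking half and a synchronizing half; a hedged sketch of the stop path:

case SNDRV_PCM_TRIGGER_STOP:
        dmaengine_terminate_async(chan);  /* request stop; safe in atomic context */
        break;

/* ... later, in sleepable context, wait for callbacks to quiesce: */
dmaengine_terminate_sync(chan);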
Common DMA Controller Drivers Supporting Cyclic
dw_dmac - DesignWare DMA
pl330 - ARM PrimeCell PL330
imx-sdma - i.MX SDMA
omap-dma - OMAP DMA
stm32-dma - STM32 DMA
xilinx-dma - Xilinx AXI DMA
Want me to explain any specific aspect in more detail, such as the hardware implementation or how to debug cyclic DMA issues?
❯ the code you showed to me:

dma_addr_t buf_addr = dma_map_single(dev, buf, buf_size, DMA_TO_DEVICE);

struct dma_async_tx_descriptor *desc;
desc = dmaengine_prep_dma_cyclic(
        chan,            // DMA channel
        buf_addr,        // Buffer DMA address
        buf_size,        // Total buffer size
        period_size,     // Size of each period (segment)
        DMA_MEM_TO_DEV,  // Direction
        DMA_CTRL_ACK     // Flags
);
dmaengine_prep_dma_cyclic
how should I understand dmaengine_prep_dma_cyclic? Does this API split the contiguous memory starting at buf_addr into segments, each of length period_size?
● Yes, exactly! Let me clarify how dmaengine_prep_dma_cyclic() works:
Understanding dmaengine_prep_dma_cyclic()
What It Does
Continuous Buffer (buf_size bytes)

buf_addr
   │
   ▼
┌─────────────┬─────────────┬─────────────┬─────────────┐
│  Period 0   │  Period 1   │  Period 2   │  Period 3   │
│ (period_sz) │ (period_sz) │ (period_sz) │ (period_sz) │
└─────────────┴─────────────┴─────────────┴─────────────┘
   ▲                                            │
   └─────────────── wraps around ───────────────┘
How the Memory is Split
// Example values
buf_size    = 4096 bytes   // Total buffer
period_size = 1024 bytes   // Each segment

// Result: 4 periods (4096 / 1024 = 4)

// Internally, the DMA engine driver creates descriptors like:
struct dma_cyclic_desc {
        // Period 0: buf_addr + 0
        // Period 1: buf_addr + period_size
        // Period 2: buf_addr + 2 * period_size
        // Period 3: buf_addr + 3 * period_size
        // Then wraps back to Period 0
};
┌──────────────────────────────────────┐
│  [P0]──▶[P1]──▶[P2]──▶[P3]──┐        │
│    ▲                        │        │
│    └────────────────────────┘ (loop) │
│                                      │
│  Each period = period_size bytes     │
│  Callback at each period boundary    │
└──────────────────────────────────────┘
Your Understanding is Correct
Does this API split the contiguous memory starting at buf_addr into segments, each of length period_size?
Yes! The API:
Takes one contiguous buffer starting at buf_addr
Logically divides it into buf_size / period_size segments
Creates a circular descriptor chain where each segment transfers period_size bytes
The hardware automatically loops through all segments forever (until terminated)
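To find out where the hardware currently is inside the ring, a consumer can query the transfer status; a hedged sketch (for cyclic transfers, the residue reflects the hardware position within the buffer):

struct dma_tx_state state;
enum dma_status status;

status = dmaengine_tx_status(chan, cookie, &state);
/* state.residue = bytes remaining in the current pass of the buffer */
pr_info("status=%d residue=%u\n", status, state.residue);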
DMA_PREP_FENCE
tell me about the flag "DMA_PREP_FENCE"
● DMA_PREP_FENCE Flag
What It Does
DMA_PREP_FENCE is a flag passed to the dmaengine_prep_*() functions to enforce ordering between transactions. Per the kernel documentation, it tells the driver that subsequent operations depend on the result of this transaction, so this transfer's results must be complete and visible before any dependent operation begins.
The Flag Definition
// In include/linux/dmaengine.h, part of enum dma_ctrl_flags:
DMA_PREP_FENCE = (1 << 5),

// Used in the flags parameter:
desc = dmaengine_prep_dma_memcpy(chan, dst, src, len,
                                 DMA_CTRL_ACK | DMA_PREP_FENCE);
Why It’s Needed - The Problem
Without DMA_PREP_FENCE, DMA transactions may be reordered for performance:
CPU writes to memory:
┌─────────────┐
│  Write A    │
│  Write B    │
│  Write C    │
└─────────────┘
      │
      ▼
DMA Engine Queue (without fence):
┌─────────────┐
│ Transfer C  │ ← May execute first!
│ Transfer A  │
│ Transfer B  │
└─────────────┘
With DMA_PREP_FENCE:
DMA Engine Queue (with fence):
┌─────────────┐
│ Transfer A  │ ← Must complete first
├─────────────┤
│   FENCE     │ ← Memory barrier
├─────────────┤
│ Transfer B  │ ← Waits for all prior ops
│ Transfer C  │
└─────────────┘
In the core, dependent transactions are chained and submitted by this helper from drivers/dma/dmaengine.c:

/**
 * dma_run_dependencies - process dependent operations on the target channel
 * @tx: transaction with dependencies
 *
 * Helper routine for DMA drivers to process (start) dependent operations
 * on their target channel.
 */
void dma_run_dependencies(struct dma_async_tx_descriptor *tx)
{
        struct dma_async_tx_descriptor *dep = txd_next(tx);
        struct dma_async_tx_descriptor *dep_next;
        struct dma_chan *chan;

        if (!dep)
                return;

        /* we'll submit tx->next now, so clear the link */
        txd_clear_next(tx);
        chan = dep->chan;

        /* keep submitting up until a channel switch is detected
         * in that case we will be called again as a result of
         * processing the interrupt from async_tx_channel_switch
         */
        for (; dep; dep = dep_next) {
                txd_lock(dep);
                txd_clear_parent(dep);
                dep_next = txd_next(dep);
                if (dep_next && dep_next->chan == chan)
                        txd_clear_next(dep); /* ->next will be submitted */
                else
                        dep_next = NULL; /* submit current dep and terminate */
                txd_unlock(dep);

                dep->tx_submit(dep);
        }

        chan->device->device_issue_pending(chan);
}
Summary: DMA_PREP_FENCE enforces ordering - it marks a transaction whose results subsequent operations depend on, so dependent transfers will not start until it has fully completed. Essential for data pipelines where transfer order matters.
the provider for cyclic api
❯ you're really an expert in the DMA field, so I have some more profound questions:

an API like dmaengine_prep_dma_cyclic() is a consumer API; the corresponding provider API is platform-related
● You’re asking excellent architectural questions! Let me clarify the consumer vs provider API split in the DMA Engine framework.
Your understanding is correct: dmaengine_prep_dma_cyclic() is the consumer-facing API. The provider implements device_prep_dma_cyclic callback which is hardware/platform-specific. The DMA Engine Core bridges these two layers.
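A hedged sketch of the provider side (my_prep_dma_cyclic is an illustrative name; the callback signature matches struct dma_device in include/linux/dmaengine.h):

static struct dma_async_tx_descriptor *
my_prep_dma_cyclic(struct dma_chan *chan, dma_addr_t buf_addr,
                   size_t buf_len, size_t period_len,
                   enum dma_transfer_direction dir, unsigned long flags)
{
        /* Build a ring of hardware descriptors covering
         * buf_len / period_len segments, then return the
         * embedded dma_async_tx_descriptor. */
        return NULL;  /* sketch only */
}

/* At probe time, advertise the capability and hook the callback: */
dma_cap_set(DMA_CYCLIC, dd->cap_mask);
dd->device_prep_dma_cyclic = my_prep_dma_cyclic;
ret = dma_async_device_register(dd);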
What’s your next question about the DMA architecture?
talk about dma provider common infrastructure
so maybe there is some common logic for "Allocate HW descriptors" shared between the different hardware/platform implementations?
● Yes! Excellent observation! There are common patterns and even shared helper code for descriptor allocation across different DMA controllers.
Common Descriptor Allocation Patterns
Generic Descriptor Management (DMA Engine Core)
The DMA Engine Core provides the base infrastructure:
struct my_dma_desc {
        /* Embedded generic descriptor - MUST be first */
        struct dma_async_tx_descriptor async_tx;

        /* Driver-specific fields */
        struct list_head node;        /* Link to free/active list */
        struct my_dma_hw_desc *hw;    /* Hardware descriptor in DMA memory */
        dma_addr_t hw_addr;           /* DMA address of hw descriptor */
        struct my_dma_chan *chan;     /* Back pointer to channel */
        int cycles;                   /* For cyclic transfers */
        /* ... */
};
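The embedded generic part is initialized with a core helper; a hedged sketch of a typical allocation path (my_tx_submit is an illustrative driver hook, and my_chan->common is assumed to be the embedded struct dma_chan):

/* Tie the generic descriptor to its channel and set the submit hook */
dma_async_tx_descriptor_init(&desc->async_tx, &my_chan->common);
desc->async_tx.tx_submit = my_tx_submit;  /* driver-specific submit */
desc->async_tx.flags = DMA_CTRL_ACK;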
/* Try to get from free list first (NO kmalloc!) */
if (!list_empty(&dwc->free_desc_list)) {
        desc = list_first_entry(&dwc->free_desc_list,
                                struct dw_desc, desc_node);
        list_del(&desc->desc_node);
        memset(desc->lli, 0, sizeof(struct dw_lli)); /* Clear HW part */
        return desc;
}

/* Allocate new descriptor + hardware LLI */
desc = kzalloc(sizeof(struct dw_desc), GFP_NOWAIT);
if (!desc)
        return NULL;
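Much of this free-list and submit boilerplate is in fact factored out: many modern drivers build on the shared virt-dma helpers in drivers/dma/virt-dma.h instead of rolling their own; a hedged sketch:

#include "virt-dma.h"

struct my_desc {
        struct virt_dma_desc vd;   /* generic part, managed by virt-dma */
        /* hardware-specific descriptor fields ... */
};

/* In the prep callback: wrap the descriptor and hand it to the core,
 * which handles cookie assignment and the submitted/issued lists. */
return vchan_tx_prep(&my_chan->vc, &desc->vd, flags);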
/* Hardware descriptor (STM32 format) */
struct stm32_dma_desc {
        struct dma_async_tx_descriptor desc;
        struct list_head node;
        u32 chan_config;                  /* Channel config register value */
        struct stm32_dma_sg_req *sg_req;  /* Scatter-gather list */
        int num_sgs;                      /* Number of SG entries */
};
/* Common cyclic descriptor allocation */
static struct stm32_dma_desc *
stm32_dma_alloc_cyclic_desc(struct stm32_dma_chan *chan,
                            dma_addr_t buf_addr, size_t buf_len,
                            size_t period_len)
{
        struct stm32_dma_desc *desc;
        int num_periods = buf_len / period_len;
        int i;

        desc = kzalloc(sizeof(*desc), GFP_NOWAIT);
        if (!desc)
                return NULL;

        /* Allocate SG entries for each period */
        desc->sg_req = kcalloc(num_periods, sizeof(*desc->sg_req),
                               GFP_NOWAIT);
        if (!desc->sg_req) {
                kfree(desc);
                return NULL;
        }

        desc->num_sgs = num_periods;

        /* Build descriptor chain - COMMON PATTERN */
        for (i = 0; i < num_periods; i++) {
                struct stm32_dma_sg_req *sg = &desc->sg_req[i];

                /* Calculate period address */
                sg->dma_config.dma_periph_data = chan->config.dst_addr;
                sg->dma_config.dma_memory_data = buf_addr + (i * period_len);
                sg->dma_config.dma_size = period_len;
        }

        return desc;
}