embd.cc

Embedded

Bugs in NXP Kinetis Ethernet Driver

Published 10 Sep 2025. Written by Jakob Kastelic.

The SDK^[1] drivers provided by NXP for use on the Kinetis K64 platform are extensive, well-tested and … not perfect. This article shows three bugs found in the ethernet driver. Note that none of this is original content; I merely put it together here for my future reference.

Forgetting to check for zero-length buffers

I have only seen this bug happen once in two years and have not found a way to reproduce it at will. So the analysis below may or may not be correct.

The symptom was that the firmware froze upon triggering the assertion in lwip/port/enet_ethernetif_kinetis.c:

“Buffer returned by ENET_GetRxFrame() doesn’t match any RX buffer descriptor”

After some Googling I found this forum thread, which suggests, in a roundabout way, that there is a missing check in fsl_enet.c. We have to add following to ENET_GetRxFrame():

if (curBuffDescrip->length == 0U)
{
    /* Set LAST bit manually to let following drop error frame
       operation drop this abnormal BD.
    */
    curBuffDescrip->control |= ENET_BUFFDESCRIPTOR_RX_LAST_MASK;
    result = kStatus_ENET_RxFrameError;
    break;
}

The NXP engineer on the forum explains: “I didn’t use this logic because I never meet this corner case and consider it a redundant operation.” I was curious if this “corner case” every happens, so I added a breakpoint, which got triggered after about two days of constant testing.

ChatGPT seems to think this check is necessary (but then again, I seem to be able to convince it of just about anything I do or do not believe in):

If you omit the check and DMA ever delivers a BD with length == 0: Your code will think it’s still in the middle of assembling a frame. It will not see the LAST bit yet, so it will happily advance to the next BD. That means the logic walks into an inconsistent state: rxBuffer may point to nothing, your rxFrame bookkeeping goes out of sync, and later you’ll crash on a buffer underrun, invalid pointer, or corrupted frame queue.

It remains to be seen if this check was behind my original crash, and if the body of the if statement is appropriate to handle the condition of unexpected zero-length buffer descriptor.

Credit: User pjanco first reported the error, while AbnerWang posted the solution. [source]

Incorrect memory deallocation

In fsl_enet.c, the function ENET_GetRxFrame() tries to deallocate the pointer of the receive buffer:

while (index-- != 0U)
{
    handle->rxBuffFree(base, &rxFrame->rxBuffArray[index].buffer,
        handle->userData, ringId);
}

First need to unpack some definitions to understand what the above means.

If we dig into the rxBuffFree() function, we discover it in the file lwip/port/enet_ethernetif_kinetis.c. The buffer to be deallocated is passed as a pointer void * buffer, and freed

int idx = ((rx_buffer_t *)buffer) - ethernetif->RxDataBuff;
ethernetif->RxPbufs[idx].buffer_used = false;

Next, what are rxFrame and rxBuffArray? The first one is of type enet_rx_frame_struct_t, which is defined in fsl_enet.h:

typedef struct _enet_rx_frame_struct
{
    enet_buffer_struct_t *rxBuffArray;
    ...
} enet_rx_frame_struct_t;

This allows us to see what is the type of rxBuffArray:

typedef struct _enet_buffer_struct
{
    void *buffer;
    uint16_t length;
} enet_buffer_struct_t;

Finally, what is ethernetif->RxDataBuff? We find it declared in lwip/port/enet_ethernetif_kinetis.c as the static array in the function ethernetif0_init():

SDK_ALIGN(static rx_buffer_t rxDataBuff_0[ENET_RXBUFF_NUM],
    FSL_ENET_BUFF_ALIGNMENT);
ethernetif_0.RxDataBuff = &(rxDataBuff_0[0]);

More precisely, RxDataBuff is a pointer to the first element of this array. This pointer therefore has the type rx_buffer_t*.

That type itself is declared at the top of the same file as an aligned version of a uint8_t buffer:

typedef uint8_t rx_buffer_t[SDK_SIZEALIGN(ENET_RXBUFF_SIZE,
    FSL_ENET_BUFF_ALIGNMENT)];

Now we can take a step back and think whether the idx calculation would be best done with the buffer itself, or a pointer to it. The calculation subtracts the following:

rxFrame->rxBuffArray[index].buffer, of type void*, is a pointer to the memory location that stores the ethernet frame.
ethernetif->RxDataBuff, of type rx_buffer_t*

The corrected code should pass the buffer pointer stored in .buffer, not the address of the .buffer field (omit the &):

handle->rxBuffFree(base, rxFrame->rxBuffArray[index].buffer,
    handle->userData, ringId);

Credit: This bug was found by KC on 7/31/2024.

Buffers not zero-initialized

Another bug in ethernetif0_init() in enet_ethernetif_kinetis.c: the ethernet buffer descriptor structs are declared static:

AT_NONCACHEABLE_SECTION_ALIGN(
    static enet_rx_bd_struct_t rxBuffDescrip_0[ENET_RXBD_NUM],
    FSL_ENET_BUFF_ALIGNMENT);
AT_NONCACHEABLE_SECTION_ALIGN(
    static enet_tx_bd_struct_t txBuffDescrip_0[ENET_TXBD_NUM],
    FSL_ENET_BUFF_ALIGNMENT);

The assumption is that since they are declared static, the descriptors will be zero-initialized at system startup. However, the macro AT_NONCACHEABLE_SECTION_ALIGN potentially places these descriptor in a special section that can bypass the zero-initialization, depending on the startup code and linker script.

In that case, we need to manually zero out these buffers. I put the following at the top of ethernetif_enet_init() in enet_ethernetif_kinetis.c:

// Buffer descriptors must be initialized to zero
memset(&ethernetif->RxBuffDescrip[0], 0x00, ENET_RXBD_NUM*sizeof(ethernetif->RxBuffDescrip[0]));
memset(&ethernetif->TxBuffDescrip[0], 0x00, ENET_TXBD_NUM*sizeof(ethernetif->TxBuffDescrip[0]));

Credit: This bug was also found by KC.

I am using SDK version 2.11.0 for the MK64FN1M0xxx12. ↩︎

Linux

STM32MP135 Flashing via USB with STM32CubeProg

Published 7 Sep 2025. Written by Jakob Kastelic.

This is Part 2 in the series: Linux on STM32MP135. See other articles.

In the previous article, we built a Linux kernel and manually copied it to an SD card. This works for a first test, but quickly becomes annoying. Here, we show how to use the STM32CubeProg to flash the SD card without removing it from the evaluation board.

Tutorial

Note: You may find the extensive explanations in the Bootlin article about flashing a similar chip helpful.

Finish the build process as per the previous article, so as to have at least the following files under buildroot/output/images/:
- tf-a-stm32mp135f-dk.stm32
- fip.bin
- u-boot-nodtb.bin
- sdcard.img

Go to the ST website to download the STM32CubeProg. This unfortunately requires a registration and sign-up.

Get the Linux version, unpack in a new directory, and run the installer (just follow its verbose prompts):

$ cd cubeprog
$ unzip ../stm32cubeprg-lin-v2-20-0.zip
$ ./SetupSTM32CubeProgrammer-2.20.0.linux

Now plug in all three USB cables for the board. Set the DIP boot switches for serial boot (press in all the upper parts of the white rocker switches). Press the black reset button. If everything worked, you should be able to see the board under your USB devices:

jk@Lutien:/var/www/articles$ lsusb
...
Bus 001 Device 114: ID 0483:3753 STMicroelectronics STLINK-V3
Bus 001 Device 012: ID 0483:df11 STMicroelectronics STM Device in DFU Mode
...

The STLINK-V3 is what you can use to monitor the flashing progress via UART. Simply open a serial monitor:

sudo picocom -b 115200 /dev/ttyACM0

Run the STM32CubeProg from the location that you installed it in to check that it is able to detect the board:

$ sudo ~/cube/bin/STM32_Programmer_CLI -l usb
      -------------------------------------------------------------------
                        STM32CubeProgrammer v2.20.0
      -------------------------------------------------------------------

=====  DFU Interface   =====

Total number of available STM32 device in DFU mode: 1

  Device Index           : USB1
  USB Bus Number         : 001
  USB Address Number     : 002
  Product ID             : USB download gadget@Device ID /0x501, @Revision ID /0x1003, @Name /STM32MP135F Rev.Y,
  Serial number          : 002800423232511538303631
  Firmware version       : 0x0110
  Device ID              : 0x0501

If that worked, it’s time to prepare the images for flashing. Go to buildroot/output/images and create a file flash.tsv with the following contents:

#Opt	Id	Name	Type	IP	Offset	Binary
-	0x01	fsbl1-boot	Binary	none	0x0	tf-a-stm32mp135f-dk.stm32
-	0x03	fip_boot	Binary		none	0x0		fip.bin
-	0x03	ssbl-boot	Binary	none	0x0	u-boot-nodtb.bin
P	0x10	sdcard	RawImage	mmc0		0x0	sdcard.img

Finally, run the flashing command itself:

sudo ~/cube/bin/STM32_Programmer_CLI -c port=usb1 -w flash.tsv

The STM32CubeProg will go through the sequence of files you wrote into flash.tsv. First, the Arm Trusted Firmware (TF-A) gets written to the memory and executed. It then does some secure magic behind the scenes and accepts the next payload via the DFU protocol, the U-Boot. At last, U-Boot itself is executed and it in turn accepts the last payload: the SD card itself. Which was, after all, the only thing you wanted to transfer anyway …

Discussion

The tutorial above again presents the simplest method I have found so far, with a minimum of steps and prerequisites, to flash the SD card of the eval board without taking the card in and out. What’s the issue?

The STM32CubeProg comes in a 291M zip file, which gets installed as a 1.5G program. We use it to copy a disk image to the SD card. See the problem yet? Or let’s consider the on-board procedure: TF-A (4,212 files and 506,952 lines of code according to cloc) is used to run U-Boot (21,632 files and 3,419,116 lines of code), just so that a semi-standard USB DFU protocol can expose the SD card to write the image.

But why??? ChatGPT explains:

U-Boot became the standard since vendors upstreamed support there, and it offers cross-platform flashing via DFU/fastboot for factories and Windows users who can’t dd raw disks. It also doubles as the hook for A/B updates, rollback, and secure boot. In practice, this forces developers into a complex boot stack, even though most boards could just boot Linux directly from SD/eMMC and use a tiny DFU mass-storage tool for recovery.

A more likely explanation is that the boot process has acquired an unnecessary reputation for being difficult, so that few want to mess with it. If there is a working solution, it will get incorporated into the software stack, no matter how baroque. The warning has been around for a long time:

Big building-blocks […] can lead to more compact code and shorter development time. […] Less clear, however, is how to assess the loss of control and insight when the pile of system-supplied code gets so big that one no longer knows what’s going on underneath.

[… As] libraries, interfaces, and tools become more complicated, they become less understood and less controllable. When everything works, rich programming environments can be very productive, but when they fail, there is little recourse.^[1]

All these tool are intended to make our work easier, but as they are piled on without any reasonable limit, the resulting mess is ironically far more complicated than the problem they are solving. If the task at hand is to flash an SD card image, why doesn’t the firmware expose the medium as a USB mass storage device, so that standard tools like dd could be used to work with it? The cynical answer suggests itself … They didn’t know better.

Those who do not understand Unix are condemned to reinvent it, poorly.^[2]

Surely it cannot be too difficult to write a simple “bare-metal” program, which we could load to the board using the simple and well-documented UART protocol implemented in the ROM of the STM32MP1. The program would be very small and quick to load. The program would expose the available media as mass storage devices, and that’s it.

But … You may object, we need U-Boot anyways, otherwise how are we to load Linux? As we will explain in a future article, that is not so. U-Boot is entirely unnecessary for a large class of embedded Unix applications.

All Articles in This Series

B. Kernighan and R. Pike Overview: The Practice of Programming. Addison-Wesley, 1999. ↩︎
Attributed to Henry Spencer as his November 1987 Usenet signature in E. S. Raymond: The Art of Unix Programming. Addison-Wesley, 2004. ↩︎

Philosophy

What Unix Contributed

Published 6 Sep 2025. Written by GPT-5 from notes by Jakob Kastelic.

Unix was built on a handful of ideas that turned out to be both powerful and practical. The following discussion blends established Unix facts with interpretive commentary; it does not claim to describe any single historical Unix precisely.

Programs and the Shell

The shell runs commands as programs. There’s no special class of built-ins; if you want a new command, you write a program. By default, programs read from standard input and write to standard output, unless redirected.

Most commands are small filters for text streams. They do one job, and they work together naturally. Connecting them with pipes lets you build bigger tools out of simpler ones.

The File System Abstraction

Everything is a file: user data, programs, directories, and even devices. Directories form a tree; each entry points to an inode, which knows where the data blocks live. Devices show up as files too.

This means that I/O and storage use the same calls: open, close, read, write. That’s the interface for everything. Executables and data files are stored in the same way, reinforcing the idea that a single abstraction suffices.

Processes and the Kernel

The kernel is deliberately small. It multiplexes I/O and leaves the rest to user programs. Even init, the first process, is just a program: it opens terminals, prints the login message, and starts shells in a loop.

Processes come from the fork/exec pair. One process copies itself, then overlays the copy with another program. The idea is simple, and it works.

System calls are invoked by a trap instruction, wrapped in library functions so programs don’t depend directly on kernel details. Programs stay independent, and the operating system can change underneath.

Small, Understandable, Portable

Unix was small enough that one person could understand the whole thing. That made it easier to modify, port, and teach. The manuals were short, consistent, and focused on usage, not internals. A second volume provided tutorials and background for those who wanted more.

The guiding principle was: be general, but not too general; portable, but not too portable. If you try to solve every problem in advance, you get bloat. By keeping it modest, Unix was more useful—and paradoxically more general and portable—than larger systems.

The 80/20 Rule

Some parts were machine-specific, usually device drivers or bits of assembly. But not many. Most code was reusable, and the exceptions were small. An array of function pointers mapped device numbers to driver routines; that was about as complex as it got. For example, a character device^[1] driver needs to expose the following functions:

extern struct cdevsw
{
	int	(*d_open)();
	int	(*d_close)();
	int	(*d_read)();
	int	(*d_write)();
	int	(*d_ioctl)();
	int	(*d_stop)();
	struct tty *d_ttys;
} cdevsw[];

The 80/20 rule applied everywhere: make most of the system simple and portable, accept a little complexity when it really pays off. Code was meant to be 80% reusable, not 100%, which avoided the kind of rigidity seen in later systems.

Self-Hosting and Accessible

Unix came with all its own sources and tools. It was self-hosting, and people could read, study, and change the code. The system included what you needed, and nothing more. No useless programs, no dead code, and very little irrelevant platform-specific clutter.

The philosophy was to write programs you would actually use, not ones meant to satisfy a standard or some hypothetical future need.

Simplicity Above All

The enduring lesson of Unix is that simplicity beats complexity. Interfaces were orthogonal, text was the universal medium, and programs were small and self-contained. Each one did one thing, and did it well.

That philosophy proved more important than any single feature. It made Unix portable, teachable, and durable. It showed that you don’t need a committee or a grand design to build something powerful. You need clarity, restraint, and the discipline to write only what you need.

Reflections and Extensions

Unix also suggests how to go further. Small, portable, self-contained programs can approach the kind of stability that TeX achieved—systems so refined that they don’t need to change.

Portability itself can be modular. The Wollongong group^[2] showed this by first porting Unix piece by piece to an Interdata 7/32, running it alongside the host system, and then replacing the host functions with assembly routines. That approach points toward kernels that are more modular, where pieces like fork and exec could be reused without bringing along a whole scheduler.

Device drivers can also be simplified. One idea is to treat them as user processes whose IDs match their device numbers. They would implement the usual open, read, and write interfaces, but otherwise behave like ordinary programs: start and stop freely, hold their own memory, receive signals. The kernel would not “manage” them, yet the familiar Unix file interface would still apply.

The same lesson holds today. Artificial intelligence can sometimes repair or adapt programs automatically, but only if the systems are small and self-contained. Large, tangled software offers no foothold. Unix worked because it avoided dead code, avoided over-abstraction, and made each interface simple enough to understand and replace.

Finally, Unix showed that the way forward can’t be too innovative. If “the way” is too radical, no one will follow it.^[3] The genius of Unix was that it was just radical enough.

From version 7 Unix, found in /usr/sys/h/conf.h. ↩︎
Juris Reinfelds: The First Port of Unix. Department of Computing Science, The University of Wollongong. See also Richard Miller: The First Unix Port. Miller Research Ltd. (Both documents undated. Why don’t people date all their documents!?) ↩︎
Still looking for the source of this quote … ↩︎

Linux

STM32MP135 Default Buildroot Configuration

Published 3 Sep 2025, modified 10 Sep 2025. Written by Jakob Kastelic.

This is Part 1 in the series: Linux on STM32MP135. See other articles.

Wouldn’t it be great to have a single board computer that runs just Linux? That is, no external libraries, tools, driver modules—just the kernel binary, and whichever programs you choose to run under it. After all, Linux is just a C program, so how hard can it be to compile and run it? Read on to get started exploring the wild world of embedded Linux.

Tutorial

In this tutorial, we show how to get a basic “minimal” Linux installed on the STM32MP135 evaluation board with a minimum of steps or obscure scripts. For detailed explanations, refer to the excellent writeup from Bootlin.

Get a copy of Buildroot:

$ git clone https://gitlab.com/buildroot.org/buildroot.git
$ cd buildroot

As of this writing, the latest commit in this repository is

$ git rev-parse HEAD
bbb0164de08f761a3399c961700db44befff5c70

Find the default configuration appropriate for this board:
```
$ make list-defconfigs | grep stm32mp135
```
This shows that stm32mp135f_dk_defconfig is available. Install it by calling make on it:
```
$ make stm32mp135f_dk_defconfig
```
Let’s enable the USB mode for Arm Trusted Firmware (TF-A) so that we will be able to use USB flashing in the future. Open
```
$ make menuconfig
```
Navigate under Bootloaders ---> ARM Trusted Firmware (ATF) and add the following at the end of the “Additional ATF build variables” string:
```
STM32MP_USB_PROGRAMMER=1
```
Then select “OK”, and “Esc” your way out of the menuconfig. Make sure to say “Yes” when asked whether to save the new configuration.
Run the build, and collect logs into a file:
```
$ time make >log.txt 2>&1
```
On my dual-core i5-7300U laptop, this took about an hour and a half.

Watch the build from another terminal:
```
$ tail -f log.txt
```

Copy the generated image to an SD card (assumed to be at /dev/sdb):

$ sudo dd if=output/images/sdcard.img of=dev/sdb bs=1M

Time to run it on the evaluation board! Set it up as follows:
- Insert the SD card into the slot
- Connect the USB-C port to the right of the screen (CN12, labelled PWR_IN) to a powered USB hub
- Connect the Micro USB (CN10, left of the screen) to a desktop computer, which will enumerate as a serial port (/dev/ttyACM0 on my computer).
- Open a serial console (115200 baud, no parity) to listen
Set the DIP switches to boot from the SD card as shown in the image below. In this orientation, press in on the upper side of the rockers of BOOT0 and BOOT2, and on the lower side for BOOT1.

Press the black reset button and if everything went right, you should see the kernel boot messages displayed on the serial monitor, until the login prompt gets displayed. Done!
```
Welcome to Buildroot
buildroot login: root
# uname -a
Linux buildroot 6.12.22 #1 SMP PREEMPT Wed Sep  3 20:23:46 PDT 2025 armv7l GNU/Linux
```

Discussion

This is, to my knowledge, the fastest way to get started with embedded Linux on “real” hardware in terms of number of steps. However, it does not result in the most minimal distribution. Besides the Linux kernel, the default configuration pulls in a number of large programs, such as U-Boot, ATF, OP-TEE, and more. (Examine buildroot/dl) after building to see all the stuff that got downloaded in the build process.)

For someone used to bare-metal embedded work, the size and complexity of this auxiliary software is utterly baffling. A bootloader with more lines of code than my whole project? Several different compilers and interpreters (C, Rust, Perl, Python) needed for a “Hello, world!?”

In my mind I hold an ideal of the “pure” essence of Unix: there is one kernel, which at the end of its boot process invokes one user-space program (init), which then does anything it wants to. I believe in the simplicity of the Linux boot process, as outlined by Russell King in the 2002 masterpiece, “Booting ARM Linux”:

Initialize RAM and copy the kernel into it
Initialise one serial port
Load the device tree blob (DTB) into RAM, and place its address in r2
Pass control to Linux and watch it boot like magic!

It is time we regain the pristine uncomplicated state that used to exist before everything became bundled and containerized and wrapped inside countless layers of abstraction that are supposed to make things simpler, but in the end serve only to confuse everyone.

In the next articles, we will take this “primordial mess” and cut it down to size. Stay tuned!

All Articles in This Series

Embedded

UART DMA with a Circular Buffer on K64

Published 2 Sep 2025. Written by Jakob Kastelic.

Working on a FreeRTOS-based project running on the NXP Kinetis K64, I wanted to read data from several UART interfaces. The default “easy” method using interrupts is easy to set up via the NXP MCUXpresso Config Tools; the problem is that the UART peripheral only has a few bytes of built-in buffer. If the data is arriving at the interfaces faster than the firmware can read it, the tiny buffer immediately overflows and part of the data is lost.

This is obviously a task for Direct Memory Access (DMA): let the data arrive into a much larger buffer, from where the software can take characters at its leisure. In this article, we will briefly contrast three ways of receiving data from UART: interrupt-based, simple DMA, and DMA with a circular buffer (aka ring buffer).

UART with Interrupts

The “point-and-click” method works out of the box with the Config tools. Leave things at their default values, or change whatever is necessary. In my setup, I had the values shown in the image below.

Config Tools: UART Interrupt Configuration

Even the interrupt handler is provided by the tool. Click on “Copy to clipboard”, and adapt as necessary. All that is needed is to read the status flags and read a byte of data, and push it to a FreeRTOS queue that we set up previously:

void UART0_SERIAL_RX_TX_IRQHANDLER(void)
{
   uint32_t intStatus = UART_GetStatusFlags(UART0);
   if ((kUART_RxDataRegFullFlag | kUART_RxOverrunFlag) & intStatus) {
      const uint8_t data = UART_ReadByte(UART0);
      xQueueSendToBackFromISR(rx_queue, &data, NULL);
   }
}

This seems like it couldn’t be simpler, and it just should work. Indeed, for baud rates at or below about 9600, it does work almost reliably. And yet sometimes, for no good reason, a byte gets dropped. Is that possible?

Let’s see. At 115,200 baud, we get a character every 8.7 μs. It’s entirely possible that some other interrupt or critical section takes priority over this one for much longer than that, especially looking at some of the rather long Ethernet ISRs supplied with the NXP SDK. If we have less than 10 microseconds to process a character before it goes away, we’ll probably miss many of them! Onwards to DMA.

UART with eDMA

With direct memory access, the UART peripheral should be able to place the bytes directly in a software buffer, without having to go through an RTOS queue. Besides being much faster, we can also make the queue almost arbitrarily large, ensuring that we’ll be able to get to the data eventually and process it.

Thanks to the Config Tools, setting this up is again very straightforward. Just enable some checkboxes and set up DMA channels, as I did in the image below.

Config Tools: UART eDMA Configuration

This time, we do not have a handy “Click to get code” button. Instead, ChatGPT or Gemini or Claude are happy to write something that nearly works. In a dedicated FreeRTOS task, let’s poll for characters as they arrive into the buffer:

for (;;) {
   uint32_t bytesWritten;
   if (UART_TransferGetReceiveCountEDMA(UART0_PERIPHERAL,
            &UART0_UART_eDMA_Handle,
            &bytesWritten) != kStatus_Success) {
      vTaskDelay(pdMS_TO_TICKS(1));
      continue;
   }

   while (readIndex != (bytesWritten % DMA_BUF_SIZE)) {
      char c = uartRxBuf[readIndex];
      readIndex = (readIndex + 1) % DMA_BUF_SIZE;
      // now do what you want with the received character
   }

   vTaskDelay(1);
}

Note the vTaskDelay(1) calls which ensure that this task does not starve lower-priority tasks of processor time.

In the above, the chatbot made use of the function UART_TransferGetReceiveCountEDMA() from the NXP SDK and implemented the ring buffer wraparound with some modulo operation that’s probably about correct. But what happens when the buffer gets filled up? In that case, an interrupt triggers the DMA callback, which simply restarts the DMA to receive data back into the beginning of the buffer:

static void uart0_callback(UART_Type *base, uart_edma_handle_t *handle,
                           status_t status, void *userData)
{
    uart_transfer_t rxXfer = {
        .data = uartRxBuf,
        .dataSize = DMA_BUF_SIZE
    };
    UART_ReceiveEDMA(UART0_PERIPHERAL, handle, &rxXfer);
}

This works great, but has an unfortunate limitation. When the buffer is at the end of its capacity, characters must arrive slowly enough so we have time to re-start the DMA. In other words, we have not solved anything! Data will occasionally get lost when the callback is executing.

Naturally, the DMA should be writing continuously without us having to restart it. Much back and forth with AI achieved nothing. The version of the NXP SDK that I’m using, while full of intricate code covering what appears to be almost every possible use case, somehow does not cover the most elementary DMA operation: continuous data reception.

True continuous DMA ring buffer

This brings us to the final section: how to make the ring buffer work without stopping? By my count, the SDK takes 4517 lines of C to implement the “simplified” DMA configuration code, and yet there’s no easy UART_MakeMeARingBuffer(). There are 80 pages in the K64 Sub-Family Reference Manual about the DMA, and 85 about UART, with lots of tables of registers, flow charts, diagrams. Enough to leave one entirely mystified.

The key revelation here is that the DMA peripheral is not nearly as complicated as NXP would have us believe. As it turns out, there are only a couple fields one has to fill out in the “transfer control descriptor” (TCD): source and address of the data, the size of each logical word (for UART: one byte), how many bytes to write altogether, and what to do after writing them.

To integrate it with the code generated by the Config Tools above, we need the handles to the UART and eDMA controls, and a buffer to place the data in:

UART_Type *uart = UART0_PERIPHERAL;
edma_handle_t *dma = &UART1_RX_Handle;

#define BUF_SIZE 256
static uint8_t buf[BUF_SIZE];

Next, we define the source (UART buffer) and destination for the data:

edma_tcd_t *tcd = (edma_tcd_t *)&dma->base->TCD[dma->channel];
tcd->SADDR = UART_GetDataRegisterAddress(uart);
tcd->DADDR = (uint32_t)buf;

Fill in the source and destination offset, number of bytes to transfer at once (just one byte), and number of bytes to transfer in a so-called “major loop” (equal to the size of the ring buffer):

tcd->SOFF = 0;
tcd->DOFF = 1;
tcd->NBYTES = 1;
tcd->ATTR = DMA_ATTR_SSIZE(kEDMA_TransferSize1Bytes) | DMA_ATTR_DSIZE(kEDMA_TransferSize1Bytes);
tcd->CITER = BUF_SIZE;
tcd->BITER = BUF_SIZE;
tcd->DLAST_SGA = -BUF_SIZE;

Note the DLAST_SGA field: it causes the write address to be rewound back to the start of the data buffer upon completing the major loop. This is really all that’s needed to create a true circular ring buffer that runs without stopping.

Finally, to make it run, we need to enable the DMA service request from both the eDMA and UART sides:

dma->base->SERQ = DMA_SERQ_SERQ(dma->channel);
uart->C5 |= (uint8_t)UART_C5_RDMAS_MASK;
uart->C2 |= (uint8_t)UART_C2_RIE_MASK;

Beware: If making use of any NXP-provided functions, know that several of them implicitly set the DREQ bit (“Disable Request”) which will stop DMA once a single major loop is complete. As the Reference Manual puts it,

DREQ (Disable Request) If this flag is set, the eDMA hardware automatically clears the corresponding ERQ bit when the current major iteration count reaches zero.

Readout from the buffer

Once the DMA is happily streaming UART bytes into the buffer, we would like to get it out. The simplest way to do so would be a character at a time, getc-style.

Step One: consult the DMA registers about the current iteration inside a major loop:

uint32_t ch   = dma->channel;
const uint16_t biter = dma->base->TCD[ch].BITER_ELINKNO;
const uint16_t citer = dma->base->TCD[ch].CITER_ELINKNO;

Step Two: compute the number of bytes available in the buffer.

size_t write_index = (biter - citer) % UART_DMA_BUF_SIZE;
int avail = 0;
if (write_index >= *head) {
   avail = write_index - *head;
} else {
   avail = (UART_DMA_BUF_SIZE - *head) + write_index;
}

Step Three: If any bytes are available for readout, let’s read it out (into the variable c), then advance the queue pointer:

if (avail) {
   uint8_t c = buf[*head];
   *head = (*head + 1) % UART_DMA_BUF_SIZE;
}

That’s it!

Conclusion

My intention with this writeup was to show that DMA configuration on the Kinetis K64 is very simple once one learns to let go of the tools that are supposed to make it easier: the NXP SDK, the Config Tools GUI, the Reference Manual. 17 lines of code is all it takes to configure a UART ring buffer with DMA!

Direct manipulation of system registers may appear confusing. After all, where is one supposed to learn which are all the registers that must be set? The SDK is often the only entry point, but it’s discouraging: confusing, over-engineered, and ironically missing the simplest features.

It’s an amusing experience to take the “official” code examples and manually inline all the functions, unfolding the complex call chain into hundreds of lines of code. Once all the unnecessary conditional compilation and inapplicable conditions are removed, you can often be left with just a handful of lines of simple code which, with the help of the reference manual, is pretty readable. Certainly more than the mostly undocumented, partially implemented SDK morass of functions, drivers, modules, what have you!