Embedded

Ethernet on Bare-Metal STM32MP135

Published 21 Jan 2026. Written by Jakob Kastelic.

In this writeup we’ll go through the steps needed to bring up the Ethernet peripheral (ETH1) on the STM32MP135 eval board as well as a custom board.

Eval board connections to PHY

The evaluation board uses the LAN8742A-CZ-TR Ethernet PHY chip, connected to the SoC as follows:

PHY pin PHY signal SoC signal SoC pin Alt. Fn. Notes
16 TXEN PB11/ETH1_TX_EN AA2 AF11
17 TXD0 PG13/ETH1_TXD0 AA9 AF11
18 TXD1 PG14/ETH1_TXD1 Y10 AF11
8 RXD0/MODE0 PC4/ETH1_RXD0 Y7 AF11 10k PU
7 RXD1/MODE1 PC5/ETH1_RXD1 AA7 AF11 10k PU
11 CRS_DV/MODE2 PC1/ETH1_CRS_DV Y9 AF10 10k PU
13 MDC PG2/ETH1_MDC V3 AF11
12 MDIO PA2/ETH1_MDIO Y4 AF11 1k5 PU
15 nRST ETH1_NRST IO9 MCP IO
14 nINT/RECLKO PA1/ETH1_RX_CLK AA3 AF11

Reset pin

In this design, the Ethernet PHY connected to ETH1 has its own 25MHz crystal. Note the ETH1_RX_CLK connection, which uses the MCP23017T-E/ML I2C I/O expander.

One wonders if it was really necessary to complicate Ethernet bringup by requiring this extra step (I2C + IO config) on an SoC that has 320 pins. True to form, the simple IO expander needs more than 1,300 lines of ST driver code plus lots more in the pointless BSP abstraction layer wrapper.

With a driver that complicated, it’s easier to start from scratch. As it happens, setting these GPIO pins involves just two I2C transactions. The I2C code is trivial; find it here.
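For reference, the two transactions can be sketched as plain register writes. The BANK=0 register addresses below are the MCP23017 power-on defaults; the expander I2C address and the assumption that the nRST line sits on port B bit 1 are illustrative guesses, not values from the schematic:

```c
#include <stdint.h>

/* Sketch of the two MCP23017 register writes needed to drive the PHY
 * reset line high. Register addresses assume BANK=0 (power-on default):
 * IODIRB = 0x01, OLATB = 0x15. The expander address (0x20) and the
 * mapping of IO9 to port B bit 1 are assumptions for illustration. */
#define MCP23017_ADDR   0x20u     /* assumed A2..A0 = 000 */
#define MCP23017_IODIRB 0x01u     /* port B direction, 0 = output */
#define MCP23017_OLATB  0x15u     /* port B output latch */
#define ETH_NRST_BIT    (1u << 1) /* assumed: IO9 = GPB1 */

/* Each transaction is two bytes on the wire: register, then value. */
typedef struct { uint8_t reg; uint8_t val; } mcp_xfer;

/* Fill in the two transactions: make the pin an output, then latch it high. */
static void mcp_phy_reset_high(mcp_xfer out[2])
{
    out[0].reg = MCP23017_IODIRB;
    out[0].val = (uint8_t)~ETH_NRST_BIT; /* our bit becomes an output */
    out[1].reg = MCP23017_OLATB;
    out[1].val = ETH_NRST_BIT;           /* drive nRST high */
}
```

Each `mcp_xfer` then goes out as one two-byte I2C write to `MCP23017_ADDR`.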

Sending an Ethernet frame from eval board

Again, ST’s code examples are very complex, but it takes just over 300 lines of code to send an Ethernet frame and thereby verify that data can be transmitted over this interface. I asked ChatGPT to summarize what happens in the code:

  1. Configure the pins for Ethernet. First, all the GPIO pins required by the RMII interface are set up. Each pin is switched to its Ethernet alternate function, configured for push-pull output, and set to a high speed. This ensures the STM32’s MAC can physically drive the Ethernet lines correctly. If you’re using an external GPIO expander like the MCP23x17, it is also initialized here, and relevant pins are set high to enable the PHY or other control signals.

  2. Enable the Ethernet clocks. Before the MAC can operate, the clocks for the Ethernet peripheral (MAC, TX, RX, and the reference clock) are enabled in the RCC. This powers the Ethernet block inside the STM32 and allows it to communicate with the PHY.

  3. Initialize descriptors and buffers. DMA descriptors for transmit (TX) and receive (RX) are allocated and zeroed. The transmit buffer is allocated and aligned to 32 bytes, as required by the DMA. A TX buffer descriptor is created, pointing to the transmit buffer. This descriptor tells the HAL exactly where the frame data is and how long it is.

  4. Configure the Ethernet peripheral structure. The ETH_HandleTypeDef is populated with the MAC address, RMII mode, pointers to the TX and RX descriptors, and the RX buffer size. The clock source for the peripheral is selected. At this stage, the HAL has all the information needed to manage the hardware.

  5. Initialize the MAC and PHY. Calling HAL_ETH_Init() programs the MAC with the descriptor addresses, frame length settings, and other features like checksum offload. The PHY is reset and auto-negotiation is enabled via MDIO. Reading the PHY ID verifies that the PHY is responding correctly.

  6. Start the MAC. With HAL_ETH_Start(), the MAC begins normal operation, monitoring the RMII interface for frames to transmit or receive.

  7. Build the Ethernet frame. A frame is constructed in memory. The first 6 bytes are the destination MAC (broadcast in this case), the next 6 bytes are the source MAC (the STM32’s MAC), followed by a 2-byte EtherType. The payload is copied into the frame (e.g., a short test string), and the frame is padded to at least 60 bytes to satisfy Ethernet minimum length requirements.

  8. Transmit the frame. The TX buffer descriptor is updated with the frame length and pointer to the buffer. HAL_ETH_Transmit() is called, which programs the DMA to fetch the frame from memory and put it onto the Ethernet wire. After this call completes successfully, the frame is sent, and you can see it in Wireshark on the network.
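Steps 7 and 8 above boil down to simple buffer arithmetic. Here is a minimal, hardware-free sketch of the frame layout; the EtherType 0x88B5 (reserved for local experiments) is an arbitrary choice for illustration, not taken from the original code:

```c
#include <stdint.h>
#include <string.h>

#define ETH_MIN_FRAME 60u /* minimum frame length, excluding the 4-byte FCS */

/* Build a broadcast Ethernet frame in buf (assumed >= ETH_MIN_FRAME bytes)
 * and return its padded length. src is the 6-byte source MAC; ethertype is
 * given in host order and serialized big-endian as the wire format requires. */
static size_t eth_build_frame(uint8_t *buf, const uint8_t src[6],
                              uint16_t ethertype,
                              const uint8_t *payload, size_t payload_len)
{
    memset(buf, 0xFF, 6);                /* destination: broadcast */
    memcpy(buf + 6, src, 6);             /* source MAC */
    buf[12] = (uint8_t)(ethertype >> 8); /* EtherType, network byte order */
    buf[13] = (uint8_t)ethertype;
    memcpy(buf + 14, payload, payload_len);
    size_t len = 14 + payload_len;
    if (len < ETH_MIN_FRAME) {           /* zero-pad to the minimum size */
        memset(buf + len, 0, ETH_MIN_FRAME - len);
        len = ETH_MIN_FRAME;
    }
    return len;
}
```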

For the record, when a cable is connected, the PHY sees the link is up:

> eth_status
Ethernet link is up
  Speed: 100 Mbps
  Duplex: full
  BSR = 0x782D, PHYSCSR = 0x1058

Custom board connections to PHY

The custom board (Rev A) also uses the LAN8742A-CZ-TR Ethernet PHY chip, connected to the SoC as follows:

PHY pin PHY signal SoC signal SoC pin Alt. Fn. Notes
16 TXEN PB11/ETH1_TX_EN N5 AF11
17 TXD0 PG13/ETH1_TXD0 P8 AF11
18 TXD1 PG14/ETH1_TXD1 P9 AF11
8 RXD0/MODE0 PC4/ETH1_RXD0 U6 AF11 10k PU
7 RXD1/MODE1 PC5/ETH1_RXD1 R7 AF11 10k PU
11 CRS_DV/MODE2 PA7/ETH1_CRS_DV U2 AF11 10k PU
13 MDC PG2/ETH1_MDC R1 AF11
12 MDIO PG3/ETH1_MDIO L5 AF11 1k5 PU
15 nRST PG11 M3 10k PD
14 nINT/RECLKO PG12/ETH1_PHY_INTN T1 AF11 10k PU
5 XTAL1/CLKIN PA11/ETH1_CLK T2 AF11

The differences with respect to the eval board are:

Signal Eval board Custom board
ETH1_CRS_DV PC1/ETH1_CRS_DV PA7/ETH1_CRS_DV
ETH1_MDIO PA2/ETH1_MDIO PG3/ETH1_MDIO
nRST GPIO expander PG11, 10k pulldown
nINT/REFCLKO PA1/ETH1_RX_CLK PG12/ETH1_PHY_INTN
XTAL1/CLKIN 25 MHz XTAL PA11/ETH1_CLK

That is: two different port assignments, direct GPIO for reset instead of the expander, a clock output from the SoC to the PHY, and the INTN signal used instead of RX_CLK. All alternate functions are AF11, while on the eval board one of them (CRS_DV) was AF10.

Transmit Ethernet frame from custom board

First, we need to set the clock correctly. Since the Ethernet PHY does not have a dedicated crystal on the custom board, we need to source its clock from a PLL. In particular, we can set PLL3Q to output 24/2*50/24 = 25 MHz and select the ETH1 clock source:

RCC_PeriphCLKInitTypeDef pclk = {0};
pclk.PeriphClockSelection = RCC_PERIPHCLK_ETH1;
pclk.Eth1ClockSelection   = RCC_ETH1CLKSOURCE_PLL3;
if (HAL_RCCEx_PeriphCLKConfig(&pclk) != HAL_OK)
   ERROR("ETH1");
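The divider arithmetic is worth double-checking, since it comes back to bite us later. A one-line helper makes the 24/2*50/24 derivation explicit (integer mode assumed):

```c
#include <stdint.h>

/* PLL output frequency in Hz: f = (src / M) * N / DIV. The values mirror
 * the 24/2*50/24 = 25 MHz derivation in the text (HSE = 24 MHz). */
static uint32_t pll_out_hz(uint32_t src_hz, uint32_t m, uint32_t n, uint32_t div)
{
    return src_hz / m * n / div; /* left-to-right: VCO first, then the tap */
}
```

With M = 2 and N = 50 the VCO runs at 600 MHz, so Q = 24 gives 25 MHz and Q = 12 gives 50 MHz.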

On the scope, I can see a 25 MHz clock on the ETH_CLK trace, and the nRST pin is driven high (3.3V). Nonetheless, HAL_ETH_Init() returns an error.

Of course, we forgot to tell the HAL what the Ethernet clock source is. On the eval board, we had

eth_handle.Init.ClockSelection = HAL_ETH1_REF_CLK_RX_CLK_PIN;

But on the custom board, the SoC provides the clock to the PHY:

eth_handle.Init.ClockSelection = HAL_ETH1_REF_CLK_RCC;

Mistake in HAL driver?

With the RCC clock selected for Ethernet, HAL_ETH_Init() fails yet again. This time, it is trying to select the RCC clock source:

if (heth->Init.ClockSelection == HAL_ETH1_REF_CLK_RCC)
{
  syscfg_config |= SYSCFG_PMCSETR_ETH1_REF_CLK_SEL;
}
HAL_SYSCFG_ETHInterfaceSelect(syscfg_config);

The Ethernet interface and clock selection are configured in the PMCSETR register, together with some other settings:

void HAL_SYSCFG_ETHInterfaceSelect(uint32_t SYSCFG_ETHInterface)
{
   assert_param(IS_SYSCFG_ETHERNET_CONFIG(SYSCFG_ETHInterface));
   SYSCFG->PMCSETR = (uint32_t)(SYSCFG_ETHInterface);
}

Now the driver trips over the assertion. The assertion macro expects the config word to be a pure interface selection, forgetting that the same register also carries the ETH1_REF_CLK_SEL field (amongst others!):

#define IS_SYSCFG_ETHERNET_CONFIG(CONFIG)                                      \
   (((CONFIG) == SYSCFG_ETH1_MII) || ((CONFIG) == SYSCFG_ETH1_RMII) ||         \
    ((CONFIG) == SYSCFG_ETH1_RGMII) || ((CONFIG) == SYSCFG_ETH2_MII) ||        \
    ((CONFIG) == SYSCFG_ETH2_RMII) || ((CONFIG) == SYSCFG_ETH2_RGMII))
#endif /* SYSCFG_DUAL_ETH_SUPPORT */

If we comment out this assertion, the initialization proceeds without further errors. However, the link is still down.

Biasing transformer center taps

Even with an Ethernet cable plugged in, the link stays down:

// Read basic status register
if (HAL_ETH_ReadPHYRegister(&eth_handle, LAN8742_ADDR,
      LAN8742_BSR, &v) != HAL_OK) {
   my_printf("PHY BSR read failed\r\n");
   return;
}

if ((v & LAN8742_BSR_LINK_STATUS) == 0u) {
   my_printf("Link is down (no cable or remote inactive)\r\n");
   return;
}

On the schematic diagram of the custom board, we notice that the RJ-45 transformer center taps (TXCT, RXCT on the J1011F21PNL connector) are decoupled to ground but, unlike on the eval board, are not connected to 3.3V. The LAN8742A datasheet does not discuss this explicitly, but it shows a schematic diagram (Figure 3-23) where the two center taps are tied together and pulled up to 3.3V via a ferrite bead.

Tying the center taps to 3.3V, we still get no link. Printing the PHY Basic Status Register, we see:

Link is down (no cable or remote inactive)
BSR = 0x7809

This means: link down, auto-negotiation not complete.
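The two BSR values seen here can be decoded with the standard clause-22 bit positions (link status in bit 2, auto-negotiation complete in bit 5):

```c
#include <stdint.h>
#include <stdbool.h>

/* IEEE 802.3 Basic Status Register bits (clause 22, register 1). */
#define BSR_LINK_STATUS   (1u << 2)
#define BSR_ANEG_COMPLETE (1u << 5)

static bool bsr_link_up(uint16_t bsr)   { return (bsr & BSR_LINK_STATUS) != 0; }
static bool bsr_aneg_done(uint16_t bsr) { return (bsr & BSR_ANEG_COMPLETE) != 0; }
```

Applied to the values in the text: 0x7809 decodes to link down, auto-negotiation incomplete; 0x782D (seen later, when things work) decodes to link up, auto-negotiation complete.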

Moreover, the REF_CLK pin is not outputting a 50 MHz clock; it sits at about 3.3V instead.

LEDs and straps

The PHY chip shares its LED pins with configuration straps.

LED1 is shared with REGOFF and is tied to the anode of the LED, which pulls the pin down so that REGOFF=0 and the internal regulator is enabled. We measure VDDCR at 1.25V, which indicates that the internal regulator started successfully. During board operation, this pin is low (close to 0V).

LED2 is shared with the nINTSEL pin and is connected to the LED cathode. During board operation, this pin is high (close to 3.3V). Strapping nINTSEL=1 selects REF_CLK In Mode, as explained in Table 3-6: “nINT/REFCLKO is an active low interrupt output. The REF_CLK is sourced externally and must be driven on the XTAL1/CLKIN pin.”

Section 3.7.4 explains further regarding the “Clock In” mode:

In REF_CLK In Mode, the 50 MHz REF_CLK is driven on the XTAL1/CLKIN pin. This is the traditional system configuration when using RMII […]

In REF_CLK In Mode, the 50 MHz REF_CLK is driven on the XTAL1/CLKIN pin. A 50 MHz source for REF_CLK must be available external to the device when using this mode. The clock is driven to both the MAC and PHY as shown in Figure 3-7.

Furthermore, according to Section 3.8.1.6 of the PHY datasheet, the absence of a pulldown resistor on the LED2/nINTSEL pin means that the LED2 output is active low. That means the anode of LED2 should have been tied to VDD2A per Figure 3-15, rather than to ground as is currently the case.

This means we have two alternatives: rework the strapping so that nINTSEL reads low at reset (REF_CLK Out Mode, with the PHY generating the 50 MHz REF_CLK), or keep the strap as-is and have the SoC drive a 50 MHz clock into the XTAL1/CLKIN pin.

In this instance I chose the latter option and set PLL3Q to output 24/2*50/12 = 50 MHz. The link comes up briefly and the green LED2 blinks:

> eth_status
Ethernet link is up
  Speed: 100 Mbps
  Duplex: full
  BSR = 0x782D, PHYSCSR = 0x1058

But strangely enough, when I check the status just a moment later, the link is down again:

> eth_status
Link is down (no cable or remote inactive)
BSR = 0x7809

Checking repeatedly, sometimes it’s up, and sometimes it’s down.

I see that the current drawn from the 3.3V supply switches between 0.08A and 0.13A continuously, every second or two.

Digging into registers

Printing out some more info in both situations:

Link is down (no cable or remote inactive)
  BSR = 0x7809, PHYSCSR = 0x0040, ISFR = 0x0098, SMR = 0x60E0, SCSIR = 0x0040
SYSCFG_PMCSETR = 0x820000
> e
Ethernet link is up
  Speed: 100 Mbps
  Duplex: full
  BSR = 0x782D, PHYSCSR = 0x1058, ISFR = 0x00CA, SMR = 0x60E0, SCSIR = 0x1058
SYSCFG_PMCSETR = 0x820000

The PHY Basic Status Register (BSR), when link is down, shows that the link is down and that auto-negotiation has not completed.

When link is up, BSR shows (of course) that link is up, and also that the auto-negotiate process completed.

The PHY Special Control/Status Register (PHYSCSR), when link is down, does not show a meaningful speed indication (000) or anything else. When link is up, it shows the speed as 100BASE-TX full-duplex (110) and that auto-negotiation is done.
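A small helper makes the PHYSCSR decoding concrete. The field positions below assume the LAN8742’s usual layout (speed indication in bits [4:2], auto-negotiation done in bit 12); check them against the datasheet before relying on them:

```c
#include <stdint.h>

/* LAN8742 PHY Special Control/Status Register (register 31), assumed
 * field positions: bits [4:2] = HCD speed indication, bit 12 = auto-
 * negotiation done. */
#define PHYSCSR_SPEED(v)     (((v) >> 2) & 0x7u)
#define PHYSCSR_ANEG_DONE(v) (((v) >> 12) & 0x1u)

#define SPEED_100FD 0x6u /* 110 = 100BASE-TX full-duplex */
```

With the values from the text, 0x1058 decodes to 100BASE-TX full-duplex with auto-negotiation done, while 0x0040 carries no speed indication at all.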

The PHY Interrupt Source Flag Register (PHYISFR), when link is down, shows Auto-Negotiation LP Acknowledge, Link Down (link status negated), and ENERGYON generated. When link is up, we get Auto-Negotiation Page Received, Auto-Negotiation LP Acknowledge, ENERGYON generated, and Wake on LAN (WoL) event detected.

The PHY Special Modes Register (PHYSMR), whether link is up or down, shows the same value: 0x60E0. This means PHYAD=00000 (PHY address) and MODE=111 (transceiver mode of operation set to “All capable. Auto-negotiation enabled.”).

The PHY Special Control/Status Indications Register (PHYSCSIR), when link is up, shows Reversed polarity of 10BASE-T, even though link is 100 Mbps.

The SoC’s PMCSETR register has two fields set: ETH1_SEL is 100, meaning RMII, and ETH1_REF_CLK_SEL is 1, meaning that the reference clock (RMII mode) comes from the RCC.

Solution: PLL config (again!)

Painfully obvious in retrospect, but the problem was that PLL3, from which we derived the Ethernet clock, was set to fractional mode:

rcc_oscinitstructure.PLL3.PLLFRACV  = 0x1a04;
rcc_oscinitstructure.PLL3.PLLMODE   = RCC_PLL_FRACTIONAL;

If instead we derive the clock from PLL4, which is already set to integer mode, then sending the Ethernet frame just works, and the link comes up and stays up:

rcc_oscinitstructure.PLL4.PLLFRACV  = 0;
rcc_oscinitstructure.PLL4.PLLMODE   = RCC_PLL_INTEGER;
// ...
pclk.PeriphClockSelection = RCC_PERIPHCLK_ETH1;
pclk.Eth1ClockSelection   = RCC_ETH1CLKSOURCE_PLL4;

Of course! Ethernet requires a precise 50 MHz clock, accurate to about 50 ppm. On the eval board that was not a problem: the PHY had its own crystal, and it returned a clean 50 MHz clock directly back to the SoC’s MAC.
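As a sanity check on that tolerance: 50 ppm of 50 MHz leaves only 2.5 kHz of headroom, so even a small fractional-mode frequency offset is fatal. A sketch of the check, assuming the ±50 ppm figure from the text:

```c
#include <stdint.h>
#include <stdbool.h>

/* RMII wants REF_CLK within roughly ±50 ppm of 50 MHz: only 2.5 kHz of
 * headroom. An integer-mode PLL hitting 50 MHz exactly passes trivially. */
static bool refclk_in_tolerance(uint32_t f_hz)
{
    const uint32_t nominal = 50000000u;
    const uint32_t ppm     = 50u;
    uint32_t limit = (uint32_t)((uint64_t)nominal * ppm / 1000000u); /* 2500 Hz */
    uint32_t delta = (f_hz > nominal) ? f_hz - nominal : nominal - f_hz;
    return delta <= limit;
}
```

The PLL4 integer-mode derivation 24/2*50/12 lands on exactly 50 000 000 Hz, well inside the window.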

Incoherent Thoughts

Scary Things First

Published 21 Jan 2026. Written by Jakob Kastelic.

This morning it occurred to me that I’m really not looking forward to going to the office, for I’ll have to continue doing something that I spent two days on already, and it’s still not working. I can easily think of many other such things that I’d rather not do, and as it happens each of them comes with a “positive”, or attractive aspect (written in brackets):

These are generalized examples; my real list is longer and more specific, but I won’t bore you with the details since anyone can easily write down their own, personally relevant version.

The point of these contrasts is not so much that the “bad” part of the stick is to be borne because the “good” part is worth so much more. The point is not even to try and forget about the bad part by various means (distraction, expression, repression, suppression), even though that’s what I end up doing most of the time. The point is to try and see them as a single “yin-yang” unit: black in white, white in black.

These contrasts are inevitable, so why waste time fighting them, denying their existence? Relax into the reality, let go of the fear and dread by feeling it directly until your brain gets tired of it. I’m not saying, “stop fearing the inevitable”, as the fear itself is in fact part of the inevitable. The lake would not try to hide its waves when a stone is thrown into it; its waves radiate outwards until they stop. In fact they never really stop, so the lake does not reject them.

Somewhere in the Tao Te Ching it is said that the great power of water (wearing down mountains, etc.) is because it’s not loath to take the lowest, humblest part, where no one wants to be. Elsewhere there’s the image of the malformed tree surviving, while the straight, useful ones are cut down for the carpenter. I wonder if peace can be had in the face of the above mentioned “dreadful” future situations by sinking, in each of them, to the most dreadful point. Assume the most broken, useless mental state: be angry and sad, afraid and trembling, and watch things come and go. Strength in weakness?

On a practical note: each day, do the “dreadful” thing first to avoid wasting too much time and effort doing pointless other things. Looking back, avoidance behaviors are often much more exhausting than what they supposedly protect me from. Or, in someone’s wise words: “Procrastination is not worth the time it takes.”

Embedded

LCD/CTP on Bare-Metal STM32MP135

Published 19 Jan 2026. Written by Jakob Kastelic.

In this writeup we’ll go through the steps needed to bring up the LCD/CTP peripheral on the custom STM32MP135 board.

Connections

I am using the Rocktech RK050HR01-CT LCD display, connecting to the STM32MP135FAE SoC, as follows:

LCD pin LCD signal SoC signal SoC pin Alt. Fn.
1, 2 VLED+/- PB15/TIM1_CH3N B12 AF1
8 R3 PB12/LCD_R3 D9 AF13
9 R4 PE3/LCD_R4 D13 AF13
10 R5 PF5/LCD_R5 B2 AF14
11 R6 PF0/LCD_R6 C13 AF13
12 R7 PF6/LCD_R7 G2 AF13
15 G2 PF7/LCD_G2 M1 AF14
16 G3 PE6/LCD_G3 N1 AF14
17 G4 PG5/LCD_G4 F2 AF11
18 G5 PG0/LCD_G5 D7 AF14
19 G6 PA12/LCD_G6 E3 AF14
20 G7 PA15/LCD_G7 E6 AF11
24 B3 PG15/LCD_B3 G4 AF14
25 B4 PB2/LCD_B4 H4 AF14
26 B5 PH9/LCD_B5 A9 AF9
27 B6 PF4/LCD_B6 L2 AF13
28 B7 PB6/LCD_B7 C1 AF14
30 DCLK PD9/LCD_CLK E8 AF13
31 DISP PG7 C9
32 HSYNC PE1/LCD_HSYNC B5 AF9
33 VSYNC PE12/LCD_VSYNC B4 AF9
34 DE PG6/LCD_DE A14 AF13

Backlight

The easiest thing to check is the display backlight, since it’s just a single GPIO pin to turn on/off, or a simple PWM to control the brightness via the duty cycle.

In our case, the backlight pin is connected to TIM1_CH3N, which is alternate function 1:

GPIO_InitTypeDef gpio;
gpio.Pin       = GPIO_PIN_15;
gpio.Mode      = GPIO_MODE_AF_PP;
gpio.Pull      = GPIO_NOPULL;
gpio.Speed     = GPIO_SPEED_FREQ_LOW;
gpio.Alternate = GPIO_AF1_TIM1;
HAL_GPIO_Init(GPIOB, &gpio);

ChatGPT can write the PWM configuration:

__HAL_RCC_TIM1_CLK_ENABLE();

htim1.Instance = TIM1;
htim1.Init.Prescaler         = 99U;
htim1.Init.CounterMode       = TIM_COUNTERMODE_UP;
htim1.Init.Period            = 999U;
htim1.Init.ClockDivision     = TIM_CLOCKDIVISION_DIV1;
htim1.Init.RepetitionCounter = 0;
htim1.Init.AutoReloadPreload = TIM_AUTORELOAD_PRELOAD_DISABLE;
HAL_TIM_PWM_Init(&htim1);

TIM_OC_InitTypeDef oc;
oc.OCMode       = TIM_OCMODE_PWM1;
oc.Pulse        = 500U;
oc.OCPolarity   = TIM_OCPOLARITY_HIGH;
oc.OCNPolarity  = TIM_OCNPOLARITY_HIGH;
oc.OCIdleState  = TIM_OCIDLESTATE_RESET;
oc.OCNIdleState = TIM_OCNIDLESTATE_RESET;
oc.OCFastMode   = TIM_OCFAST_DISABLE;

HAL_TIM_PWM_ConfigChannel(&htim1, &oc, TIM_CHANNEL_3);
HAL_TIMEx_PWMN_Start(&htim1, TIM_CHANNEL_3);
htim1.Instance->BDTR |= TIM_BDTR_MOE;

The only “tricky” part, or the part that AI got wrong, was that we have to use HAL_TIMEx_PWMN_Start() instead of HAL_TIM_PWM_Start(), since we’re dealing with the complementary output. With that fixed, the brightness pin showed a clean square wave output, with duty cycle adjustable in units of percent:

__HAL_TIM_SET_COMPARE(&htim1, TIM_CHANNEL_3, 
      (htim1.Init.Period + 1U) * percent / 100U);

Unfortunately, the PCB reversed all the pins and the connector is single-sided, so we cannot directly check whether the above works on the actual display. Nonetheless, we can see a nice 2.088893 kHz square wave at 50% duty cycle, and we can tune it from 0% to 100%.
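The duty-cycle update above is just integer arithmetic on the timer registers. A hardware-free sketch of the two relations involved (the 100 MHz timer clock in the comment is an example value, not a measurement from this board):

```c
#include <stdint.h>

/* Compare value for a given duty percentage: with ARR = 999 the counter
 * rolls over every 1000 ticks, so one percent of duty is 10 ticks. */
static uint32_t pwm_compare(uint32_t arr, uint32_t percent)
{
    return (arr + 1u) * percent / 100u;
}

/* PWM output frequency: timer clock divided by (PSC+1)*(ARR+1).
 * Example: a 100 MHz timer clock with PSC=99, ARR=999 gives 1 kHz. */
static uint32_t pwm_freq_hz(uint32_t timer_clk_hz, uint32_t psc, uint32_t arr)
{
    return timer_clk_hz / ((psc + 1u) * (arr + 1u));
}
```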

CTP connections

The Rocktech RK050HR01-CT LCD display includes a capacitive touchpad (CTP), connecting to the STM32MP135FAE SoC, as follows:

CTP pin CTP signal SoC signal SoC pin Alt. Fn.
1 SCL PH13/I2C5_SCL A10 AF4
8 SDA PF3/I2C5_SDA B10 AF4
4 RST PB7 A4
5 INT PH12 C2

Luckily the 6-pin CTP connector, albeit wired in reverse, has contacts on both top and bottom sides, so we can simply flip the ribbon cable. With an entirely ordinary I2C configuration it simply works. Check out the final result here.

My GT911 driver is just under 300 lines of code; it’s very interesting that it takes ST almost 3,000 (yes, theirs has more features … whatever, I don’t need them!)

stm32cubemp13-v1-2-0/STM32Cube_FW_MP13_V1.2.0/Drivers/BSP/Components/gt911$ cloc .
      12 text files.
      12 unique files.
       1 file ignored.

github.com/AlDanial/cloc v 1.90  T=0.10 s (109.2 files/s, 48189.3 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
CSS                              1            209             56           1446
C                                2            223            636            940
C/C++ Header                     3            159            614            421
Markdown                         2             24              0             62
HTML                             1              0              3             56
SVG                              2              0              0              4
-------------------------------------------------------------------------------
SUM:                            11            615           1309           2929
-------------------------------------------------------------------------------

My example code prints out the touch coordinates whenever the touch interrupt fires. Not much more to do, since the CTP will be used within some application which will implement more advanced features. The only reason to include this in the bootloader code is to verify that the I2C connection works.

LCD

The custom board is wired backwards, but we can verify that the code is correct on the eval board. Apart from initially forgetting to turn on the LCD_DISP signal, it all worked. You set up a framebuffer somewhere (I just used the beginning of the DDR memory), write bits there, and magically the picture appears on the display. For example, to display solid colors:

volatile uint8_t *lcd_fb = (volatile uint8_t *)DRAM_MEM_BASE;

for (uint32_t y = 0; y < RK043FN48H_HEIGHT; y++) {
   for (uint32_t x = 0; x < RK043FN48H_WIDTH; x++) {
      uint32_t p    = (y * RK043FN48H_WIDTH + x) * 3U;
      lcd_fb[p + 0] = b; // blue
      lcd_fb[p + 1] = g; // green
      lcd_fb[p + 2] = r; // red
   }
}

/* make sure CPU writes reach DDR before LTDC reads */
L1C_CleanDCacheAll();

40-pin adapter

Making use of an adapter from the 40-pin FFC ribbon cable to jumper wires, we can verify the signals also on the custom board. We see:

R[3:7] signal when screen set to red, otherwise low
G[3:7] signal when screen set to green, otherwise low
B[3:7] signal when screen set to blue, otherwise low
DCLK:  10 MHz
DISP:  3.3V
HSYNC: 17.6688 kHz, 92.76% duty cycle
VSYNC: 61.779 Hz, 96.5% duty cycle
DE:    16.7--16.9 kHz, ~84% duty cycle

We can see the brightness change when adjusting the duty cycle of the backlight.

The left ~2/3 of the screen shows white vertical stripes, with the exact pattern depending on what “color” the screen is set to. The right ~1/3 of the screen is black. This is to be expected, since we’re using the same settings for both displays. Here are the settings that work fine on the eval board:

#define LCD_WIDTH  480U // LCD PIXEL WIDTH
#define LCD_HEIGHT 272U // LCD PIXEL HEIGHT
#define LCD_HSYNC  41U  // Horizontal synchronization
#define LCD_HBP    13U  // Horizontal back porch
#define LCD_HFP    32U  // Horizontal front porch
#define LCD_VSYNC  10U  // Vertical synchronization
#define LCD_VBP    2U   // Vertical back porch
#define LCD_VFP    2U   // Vertical front porch

The custom board uses a different display, so let’s try different settings:

#define LCD_WIDTH   800U
#define LCD_HEIGHT  480U
#define LCD_HSYNC   1U
#define LCD_HBP     8U
#define LCD_HFP     8U
#define LCD_VSYNC   1U
#define LCD_VBP     16U
#define LCD_VFP     16U
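These timings imply 817 pixel clocks per line and 513 lines per frame, which ties the refresh rate directly to DCLK. A quick sketch of the arithmetic (integer division, so the result is truncated):

```c
#include <stdint.h>

/* Total frame size implied by LTDC-style timings: each line is
 * WIDTH + HSYNC + HBP + HFP pixel clocks, each frame is
 * HEIGHT + VSYNC + VBP + VFP lines, so refresh = DCLK / (htotal * vtotal). */
static uint32_t lcd_refresh_hz(uint32_t dclk_hz,
                               uint32_t w, uint32_t hs, uint32_t hbp, uint32_t hfp,
                               uint32_t h, uint32_t vs, uint32_t vbp, uint32_t vfp)
{
    uint32_t htotal = w + hs + hbp + hfp; /* 800+1+8+8  = 817 */
    uint32_t vtotal = h + vs + vbp + vfp; /* 480+1+16+16 = 513 */
    return dclk_hz / (htotal * vtotal);
}
```

By this arithmetic, a 24 MHz DCLK gives roughly 57 Hz refresh, while 10 MHz would give only about 24 Hz with these timings.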

Now the screen is totally white, regardless of which color we send it. We notice that the LCD datasheet specifies a minimum clock frequency of 10 MHz. Note that on the STM32MP135, the LCD clock comes from PLL4Q. Raising the DCLK to 24 MHz, the screen works! We get to see all the colors. The PLL4 configuration that works for me is

rcc_oscinitstructure.PLL4.PLLState  = RCC_PLL_ON;
rcc_oscinitstructure.PLL4.PLLSource = RCC_PLL4SOURCE_HSE;
rcc_oscinitstructure.PLL4.PLLM      = 2;
rcc_oscinitstructure.PLL4.PLLN      = 50;
rcc_oscinitstructure.PLL4.PLLP      = 12;
rcc_oscinitstructure.PLL4.PLLQ      = 25;
rcc_oscinitstructure.PLL4.PLLR      = 6;
rcc_oscinitstructure.PLL4.PLLRGE    = RCC_PLL4IFRANGE_1;
rcc_oscinitstructure.PLL4.PLLFRACV  = 0;
rcc_oscinitstructure.PLL4.PLLMODE   = RCC_PLL_INTEGER;
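From this configuration, the three PLL4 output taps follow directly (HSE = 24 MHz, M = 2, N = 50, so the VCO runs at 600 MHz). Which tap feeds which peripheral is configured elsewhere; the values below are just the divider arithmetic:

```c
#include <stdint.h>

/* PLL4 output taps for the configuration above: the 600 MHz VCO divided
 * by P = 12, Q = 25, or R = 6 yields 50 MHz, 24 MHz (the LCD clock from
 * PLL4Q mentioned in the text), and 100 MHz respectively. */
static uint32_t pll4_tap_hz(uint32_t div)
{
    const uint32_t vco_hz = 24000000u / 2u * 50u; /* 600 MHz */
    return vco_hz / div;
}
```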

USB stops working

Unfortunately, just as the LCD became configured correctly and able to display solid red, green, or blue, I noticed that the USB MSC interface disappeared. If I comment out the LCD init code so it does not run, USB comes back. How could they possibly interact?

Even more interestingly, the USB stops working only if both of the following functions are called: lcd_backlight_init(), which configures the backlight brightness PWM, and lcd_panel_init(), which does panel timing and pin configuration.

As it turns out, my 3.3V supply was set with a 0.1A current limit. Having enabled so many peripherals, the current draw can be a bit higher now. Increasing the current limit to 0.2A, everything works fine. In the steady state, after init is complete, the board draws just under 0.1A from the 3.3V supply. (For the record, I’m drawing about 0.26A from the combined 1.25V / 1.35V supply.)

Conclusion

Bringing up the LCD on the custom board ultimately came down to matching the panel’s exact timing and, critically, running the pixel clock within the range specified by the datasheet. Once the LTDC geometry and PLL4Q frequency were correct, the display worked immediately, confirming that the signal wiring and framebuffer logic were sound.

Incoherent Thoughts

Masks Are Replaceable

Published 17 Jan 2026. Written by Jakob Kastelic.

There are some animals, as well as most plants, that can grow back a lost limb. Humans are like that in relation to the mask we wear. As soon as we take off a mask, we begin to grow another one.

When refusing to play a role, one is merely playing a different role. Nevertheless, this means that one is not stuck with the same role forever; it’s only a matter of reading a new job description, learning the new skills required, and assuming the new behaviors.

Identity is a tool, not a prison.

Linux

Debugging STM32MP135 Kernel Decompression

Published 9 Jan 2026. Written by Jakob Kastelic.

This is Part 8 in the series: Linux on STM32MP135. See other articles.

My STM32MP135 board includes DDR3L RAM, and initial tests show that I can fill it up with pseudo-random data and read it back correctly. ST provides a DDR test utility with a suite of memory tests, all of which pass. I decided to take it a step further and test the memory on a more intensive real-world task: “unzipping” a compressed file.

Summary

The result of the decompression test was very bad: most of the file decompressed correctly, but a few bits were always wrong, and a few more were wrong only sometimes. I spent two or three days tracing my way through the “unzip” code, instruction by instruction, to try to catch where exactly it goes wrong.

In the end, I made an embarrassing discovery: I had partially swapped the byte lanes. DDR3L on this SoC has two byte lanes, each consisting of {data, mask, strobe}. I had connected the data bits correctly, but swapped the mask & strobe between the two bytes. (Six high-speed traces, some on inner layers; there’s no fixing that by hand.) Had I also swapped the data bits, everything would have been fine; indeed, the eval board swaps all the wires, which led me astray. (Partially.)
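A toy model shows why swapped data-mask lines produce exactly this “mostly correct” corruption: a masked write protects the wrong byte, so only writes that use per-byte masking go wrong. This is a simulation for illustration only, not the SoC’s actual DDR controller behavior:

```c
#include <stdint.h>

/* Toy model of a 16-bit DDR word with two byte lanes. The data lines per
 * lane are correct, but when dm_swapped is set the two data-mask signals
 * are exchanged, so a masked write preserves the wrong byte. Convention
 * here: mask bit i set = byte i is NOT written. */
static void ddr_masked_write(uint16_t *mem, uint16_t data,
                             uint8_t mask, int dm_swapped)
{
    uint8_t m = mask;
    if (dm_swapped) /* the wiring fault: DM0 and DM1 exchanged */
        m = (uint8_t)(((m & 1u) << 1) | ((m >> 1) & 1u));

    uint16_t result = *mem;
    if (!(m & 1u)) /* lane 0 (low byte) written */
        result = (uint16_t)((result & 0xFF00u) | (data & 0x00FFu));
    if (!(m & 2u)) /* lane 1 (high byte) written */
        result = (uint16_t)((result & 0x00FFu) | (data & 0xFF00u));
    *mem = result;
}
```

Full-word writes (mask 0) are unaffected either way, which is why bulk memory tests pass while byte-granular code like a decompressor trips over it.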

Sadly, AI was of no help in this instance. Given my DDR3L wiring, I can convince it either way: the connections are good; the connections are not good. In the end, only Rev B will tell for sure.

Problem statement

In this article we will proceed with debugging boot of the compressed Linux kernel image (zImage) on a custom board populated with the STM32MP135 SoC. The starting point will be the build that runs on the evaluation board as described in the previous article.

Despite booting just fine on the eval board, the zImage gets stuck on boot on the custom board, without any messages printed to the UART console. Following along with the debugger shows that the decompressor code does run, but it’s not clear where exactly it gets stuck.

Power supply

It is possible that the burst of DDR activity during the high-speed decompression draws more current than the 1.35V supply is able to provide, despite the decoupling capacitance.

Indeed, on the scope I see a 30mV drop in the 1.35V supply voltage for about 500ms. However, if I raise the supply voltage by 30mV, the boot still gets stuck. This was with the kernel written to 0xC2008000 and the DTB to 0xC4008000, which means that relocation isn’t necessary. My interpretation is that the scope trace shows decompression taking about half a second.

Interestingly, if the kernel is written to 0xC0008000 and DTB to 0xC2008000, in which case relocation is necessary, the 20mV supply drop is shorter, about 150ms, and is followed by 10ms of a bigger drop, 120mV. That drop is indeed enough to disturb the decompression, since raising the supply voltage setpoint to 1.38V makes the bigger voltage drop be followed by 500ms of the usual 30mV drop. My interpretation: relocation takes 150ms, followed by 500ms of decompression, but the power supply is not stiff enough for relocation/decompression.

Soldering 1000uF electrolytic capacitors onto the 1.25V and 1.35V rails, both relocation and decompression complete (according to the scope trace, i.e., the 150ms and 500ms voltage drops are visible) with the two rails set at 1.35V, 1.30V, 1.25V, 1.20V, or 1.15V, but not below that. Restoring the supply setpoint to 1.35V, relocation and decompression complete as expected.

In order to avoid wasting time with relocation, we will from now on load the kernel to 0xC2000000 and the device tree to 0xC4000000. The scope trace of the 1.35V rail shows a small voltage drop for 500ms (decompression).

UART print during decompression

It’s not reassuring that we get zero console output during decompression. Trying to get at least some output, I added CONFIG_DEBUG_LL=y to the .config file and accepted most of the default options suggested by make:

Kernel low-level debugging functions (read help!) (DEBUG_LL) [Y/n/?] y
  Kernel low-level debugging port
  > 1. Use STM32MP1 UART for low-level debug (STM32MP1_DEBUG_UART) (NEW)
    2. Kernel low-level debugging via EmbeddedICE DCC channel (DEBUG_ICEDCC) (NEW)
    3. Kernel low-level debug output via semihosting I/O (DEBUG_SEMIHOSTING) (NEW)
    4. Kernel low-level debugging via 8250 UART (DEBUG_LL_UART_8250) (NEW)
    5. Kernel low-level debugging via ARM Ltd PL01x Primecell UART (DEBUG_LL_UART_PL01X) (NEW)
  choice[1-5?]:
Enable flow control (CTS) for the debug UART (DEBUG_UART_FLOW_CONTROL) [N/y/?] (NEW)
Physical base address of debug UART (DEBUG_UART_PHYS) [0x40010000] (NEW)
Virtual base address of debug UART (DEBUG_UART_VIRT) [0xfe010000] (NEW)
Early printk (EARLY_PRINTK) [N/y/?] (NEW) y
Write the current PID to the CONTEXTIDR register (PID_IN_CONTEXTIDR) [N/y/?] n

However, no output appeared on the UART. Loading Image (rather than zImage) produces the early prints, but the decompression hang mystery persists.

JTAG

Note: follow along this section with the help of linusw’s article, “How the ARM32 Linux kernel decompresses”.

Let’s try to follow along the decompression using a J-Link debug probe. First, open the GDB server and connect to it:

JLinkGDBServer.exe -device STM32MP135F -if swd -port 2330
arm-none-eabi-gdb.exe -q -x load.gdb

Where the load.gdb script contains:

file build/main.elf
add-symbol-file build/compressed 0xc2000000
target remote localhost:2330
monitor reset
monitor flash device=STM32MP135F
load build/main.elf
monitor go
break handoff.S:93

Single-step (si) a few times until reaching just after the handoff code:

(gdb) bt
#0  0xc2000004 in _text () at arch/arm/boot/compressed/head.S:202

This shows that execution has begun at the beginning of the decompressor, in file arch/arm/boot/compressed/head.S, in the start: label. We can step through the code lines (n command in gdb) until reaching the line bne not_angel, which we have to step into (si):

(gdb) si
not_angel () at arch/arm/boot/compressed/head.S:245
245                     safe_svcmode_maskall r0

Go forward (n) a few steps till reaching the C function fdt_check_mem_start() (arch/arm/boot/compressed/fdt_check_mem_start.c), then call finish to get out of it and continue stepping through the not_angel section:

(gdb) finish
Run till exit from #0  fdt_check_mem_start (mem_start=1, fdt=0xc4000000) at
arch/arm/boot/compressed/fdt_check_mem_start.c:106
not_angel () at arch/arm/boot/compressed/head.S:312
312                     add     r4, r0, #TEXT_OFFSET
Value returned is $3 = 3221225472
(gdb) n
323                     mov     r0, pc
324                     cmp     r0, r4
325                     ldrcc   r0, .Lheadroom
326                     addcc   r0, r0, pc
327                     cmpcc   r4, r0
328                     orrcc   r4, r4, #1              @ remember we skipped cache_on
329                     blcs    cache_on

Step into cache_on and later call_cache_fn, and work through the many lines until the return from __armv7_mmu_cache_on:, which brings us to the restart: section:

(gdb) b 902
Breakpoint 3 at 0xc200055c: file arch/arm/boot/compressed/head.S, line 902.
(gdb) c
Continuing.

Breakpoint 3, __armv7_mmu_cache_on () at arch/arm/boot/compressed/head.S:902
902                     mcr     p15, 0, r0, c7, c5, 4   @ ISB
(gdb) n
903                     mov     pc, r12
(gdb) si
restart () at arch/arm/boot/compressed/head.S:331
331     restart:        adr     r0, LC1

Continue stepping through until reaching the wont_overwrite: section, and then not_relocated:, where we clear BSS. Step through that, and we reach the beginning of the decompression proper: the decompress_kernel() function in arch/arm/boot/compressed/misc.c. Interestingly, we step right past the putstr("Uncompressing Linux..."); line without seeing anything printed on the UART console.

The function decompress_kernel() calls do_decompress(), which calls __decompress, which calls __gunzip. Calling finish on the latter correlates exactly with the 500ms voltage drop observed on the 1.35V supply, as mentioned above. Now we’re back in the decompress_kernel() function, which should print " done, booting the kernel.\n" (but doesn’t, since there’s something wrong with my putstr function).

We return to the not_relocated: section of the compressed head.S and call get_inflated_image_size to find out how large the decompressed kernel is:

not_relocated () at arch/arm/boot/compressed/head.S:636
636                     get_inflated_image_size r1, r2, r3
638                     mov     r0, r4                  @ start of inflated image
639                     add     r1, r1, r0              @ end of inflated image
(gdb) p/x $r0
$3 = 0xc0008000
(gdb) p/x $r1
$4 = 0xc1241f48
(gdb)

Subtracting the r1 and r0 values, we see that the uncompressed kernel is exactly 19111752 bytes in size, which is identical to the size of the arch/arm/boot/Image file. So far so good!
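The subtraction is easy to double-check on the host. A minimal sketch (mine, not part of the target code) using the r0/r1 values read back in GDB:

```c
#include <stdint.h>

/* r0/r1 as read back in the GDB session above */
#define INFLATED_START 0xc0008000u  /* start of inflated image */
#define INFLATED_END   0xc1241f48u  /* end of inflated image */

/* Size of the decompressed kernel; expected to equal the size of
 * arch/arm/boot/Image (19111752 bytes). */
static uint32_t inflated_size(void)
{
    return INFLATED_END - INFLATED_START;
}
```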

Next, the startup code cleans and disables the caches again, then jumps to __enter_kernel, just as we might do directly had we loaded the uncompressed image into memory with the bootloader. This places the pointer to the DTB into r2 and passes control to the kernel:

__enter_kernel () at arch/arm/boot/compressed/head.S:1435
1435                    mov     r0, #0                  @ must be 0
1436                    mov     r1, r7                  @ restore architecture number
1437                    mov     r2, r8                  @ restore atags pointer
1438     ARM(           mov     pc, r4          )       @ call kernel

Just before the jump to the kernel, we can check that the register values make sense: r0 and r1 are zero, r2 has the DTB address, and the decompressed kernel will run from location 0xC0008000 (= TEXT_OFFSET):

(gdb) p $r0
$5 = 0
(gdb) p $r1
$6 = 0
(gdb) p/x $r2
$8 = 0xc4000000
(gdb) p/x $r4
$9 = 0xc0008000
(gdb)

One fateful step and we’re running in the uncompressed kernel proper. Let’s load the symbols from the main kernel ELF file to see what’s going on:

(gdb) si
0xc0008000 in ?? ()
(gdb) add-symbol-file build/vmlinux 0xc0008000
add symbol table from file "build/vmlinux" at
        .text_addr = 0xc0008000
Reading symbols from build/vmlinux...
(gdb)

Interestingly, just one more step and the debugger stops at some much later point:

(gdb) si
0xc0114620 in perf_swevent_init_hrtimer (event=0xc0008000 <stext>) at kernel/events/core.c:10836
10836                   hwc->sample_period = event->attr.sample_period;
(gdb) bt
#0  0xc0114620 in perf_swevent_init_hrtimer (event=0xc0008000 <stext>) at kernel/events/core.c:10836
#1  perf_swevent_init_hrtimer (event=0xc0008000 <stext>) at kernel/events/core.c:10818
#2  cpu_clock_event_init (event=0xc0008000 <stext>) at kernel/events/core.c:10902
#3  0xc271e9f0 in ?? ()

But if we finish running the perf_swevent_init_hrtimer function, then somehow we end up back in arch/arm/kernel/head.S. Debugging from that point onwards appears to have gone totally insane!

Decompressor handoff to regular kernel code

Let’s start again from scratch. Set a breakpoint at the point where the uncompressed kernel is supposed to begin executing:

(gdb) b *0xc0008000
Breakpoint 6 at 0xc0008000: file arch/arm/kernel/head.S, line 501.
(gdb) c
Continuing.

Breakpoint 6, stext () at arch/arm/kernel/head.S:501
501             mov     r0, r0
(gdb) p $pc
$11 = (void (*)()) 0xc0008000 <stext>

This is strange: the program counter is in the expected location, but we’re at line 501 of head.S, rather than closer to the beginning of the file. The reason is that we have incorrectly told GDB that the entire vmlinux starts at 0xC0008000, instead of just the first section. We can fix this by clearing the symbol file, re-loading the symbols at their natural link address, and verifying everything makes sense:

(gdb) symbol-file
Error in re-setting breakpoint 1: No source file named handoff.S.
No symbol file now.
(gdb) file build/vmlinux
Reading symbols from build/vmlinux...
(gdb) p/x &stext
$15 = 0xc0008000
(gdb) si
__hyp_stub_install () at arch/arm/kernel/hyp-stub.S:73
73              store_primary_cpu_mode  r4, r5
(gdb) finish
Run till exit from #0  __hyp_stub_install () at arch/arm/kernel/hyp-stub.S:73
stext () at arch/arm/kernel/head.S:105
105             safe_svcmode_maskall r9

Now we’re simply running through the beginning of the normal kernel start in section ENTRY(stext) in file arch/arm/kernel/head.S. By single stepping through the code, we can find the exact section where things go badly wrong:

stext () at arch/arm/kernel/head.S:162
162             badr    lr, 1f                          @ return (PIC) address
167             mov     r8, r4                          @ set TTBR1 to swapper_pg_dir
169             ldr     r12, [r10, #PROCINFO_INITFUNC]
170             add     r12, r12, r10
171             ret     r12

__v7_ca7mp_setup () at arch/arm/mm/proc-v7.S:302
302             do_invalidate_l1
0xc01197fc      302             do_invalidate_l1
0xc0119800      302             do_invalidate_l1
0xc0119804      302             do_invalidate_l1

v7_invalidate_l1 () at arch/arm/mm/cache-v7.S:40
40              mov     r0, #0
41              mcr     p15, 2, r0, c0, c0, 0   @ select L1 data cache in CSSELR
(gdb)
0x2fff2f08 in ?? ()

We see that after the last mcr instruction, the code ends up in SYSRAM instead of the DDR we’ve been executing from so far. That address corresponds to the vectors installed by the bootloader; in particular, we have landed in the dummy SVC handler.

Let’s examine the program instructions at the point just before where the failure occurs:

Breakpoint 7, v7_invalidate_l1 () at arch/arm/mm/cache-v7.S:40
40              mov     r0, #0
(gdb) x/4x $pc
0xc0118b2c <v7_invalidate_l1>:  0xe3a00000      0x2f400f10      0xffffffff      0xee300f10

Very interesting! The expected instruction, 0xe3a00000, is followed by 0x2f400f10 and 0xffffffff. The first one is the “mystery” SVC call, and the second one is simply undefined:

(gdb) set {int}0xc0000000 = 0x2f400f10
(gdb) x/i 0xc0000000
   0xc0000000:  svccs   0x00400f10
(gdb) set {int}0xc0000000 = 0xffffffff
(gdb) x/i 0xc0000000
   0xc0000000:                  @ <UNDEFINED> instruction: 0xffffffff

For comparison, here are the instructions we expect to find, from the disassembly of the ELF file:

$ arm-linux-gnueabi-objdump -d linux/vmlinux | grep -A 4 "v7_invalidate_l1"
c0118b2c <v7_invalidate_l1>:
c0118b2c:       e3a00000        mov     r0, #0
c0118b30:       ee400f10        mcr     15, 2, r0, cr0, cr0, {0}
c0118b34:       f57ff06f        isb     sy
c0118b38:       ee300f10        mrc     15, 1, r0, cr0, cr0, {0}

DDR corruption pattern

Let’s compare the binary pattern between the expected and actual instructions:

Expected: 0xee400f10 = 0b11101110010000000000111100010000
Actual:   0x2f400f10 = 0b00101111010000000000111100010000
---------------------------------------------------------
Diff:                    ^^     ^

Three bits have been flipped in this instruction, changing it from mcr to svc. This could be explained if DDR is miswired or misconfigured. However, the pattern of data corruption is repeatable: reboot after reboot, the same instruction gets corrupted in exactly the same way!
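The bit-diff can be sanity-checked on the host with a few lines of C (a sketch of mine, not part of the target code):

```c
#include <stdint.h>

/* Count the bits that differ between two instruction words, using
 * Kernighan's trick of clearing the lowest set bit per iteration. */
static int flipped_bits(uint32_t expected, uint32_t actual)
{
    uint32_t diff = expected ^ actual;
    int n = 0;
    while (diff) {
        diff &= diff - 1;  /* clear lowest set bit */
        n++;
    }
    return n;
}
```

For the pair above, the XOR is 0xc1000000 and flipped_bits() returns 3.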

To prove that the DDR is capable of holding data at this address, we can write it manually and step through the instructions without any weird jumps to vectors:

(gdb) x/4x $pc
0xc0118b2c <v7_invalidate_l1>:  0xe3a00000      0x2f400f10      0xffffffff      0xee300f10
(gdb) set {int}0xc0118b30 = 0xee400f10
(gdb) set {int}0xc0118b34 = 0xf57ff06f
(gdb) x/4x $pc
0xc0118b2c <v7_invalidate_l1>:  0xe3a00000      0xee400f10      0xf57ff06f      0xee300f10
(gdb) si
41              mcr     p15, 2, r0, c0, c0, 0   @ select L1 data cache in CSSELR
42              isb
43              mrc     p15, 1, r0, c0, c0, 0   @ read cache geometry from CCSIDR
45              movw    r3, #0x3ff

We can also load and run the decompressor as usual and set a breakpoint at 0xC0008000, where the uncompressed kernel is supposed to take over. Then, from gdb, we simply overwrite whatever the decompressor has written:

(gdb) restore build/Image binary 0xc0008000
Restoring binary file build/Image into memory (0xc0008000 to 0xc1241f48)
(gdb) c

Nothing is printed to the console, since the decompressor apparently disabled it, but if we stop the debugger (Ctrl-C), we see that the kernel proceeded with the boot and finally came to a stop while mounting the root filesystem (understandable, since we haven’t given it a rootfs yet):

(gdb) bt
#0  0xc0b87034 in __timer_delay (cycles=63999) at arch/arm/lib/delay.c:50
#1  0xc0bb2238 in panic (fmt=0xc0defa0c "VFS: Unable to mount root fs on %s") at kernel/panic.c:451
#2  0xc1001878 in mount_block_root (name=0x51 <error: Cannot access memory at address 0x51>, name@entry=0xc0defaa0 "/dev/root", flags=3900) at init/do_mounts.c:432
#3  0xc1001b50 in mount_root () at init/do_mounts.c:592
#4  0xc1001cc8 in prepare_namespace () at init/do_mounts.c:644
#5  0xc1001448 in kernel_init_freeable () at init/main.c:1644
#6  0xc0bc5f18 in kernel_init (unused=<optimized out>) at init/main.c:1519
#7  0xc0100148 in ret_from_fork () at arch/arm/kernel/entry-common.S:148

Deterministic DDR corruption

Let’s assume that the data corruption is deterministic (repeatable) because it is caused by a voltage drop. Since the voltage drop corresponds to the CPU/DDR activity, the same activity causes the same voltage drop, which causes the same corruption.

Let’s check the same instruction at different supply voltages. At 1.35V, 1.30V, and 1.25V, the corruption is the same:

0xc0118b2c <v7_invalidate_l1>:  0xe3a00000 0x2f400f10 0x00000000 0xee300f10

At 1.20V, the pattern is more interesting: the third instruction gets corrupted each time, but differently each reset:

0xc0118b2c <v7_invalidate_l1>:  0xe3a00000 0x2f400f10 0xe464f8f6 0xee300f10
# or this one:
0xc0118b2c <v7_invalidate_l1>:  0xe3a00000 0x2f400f10 0xcbfd2cb6 0xee300f10
# or this one:
0xc0118b2c <v7_invalidate_l1>:  0xe3a00000 0x2f400f10 0xaefc67e9 0xee300f10

Stranger still: after restoring the voltage back up to 1.35V, the third instruction now gets corrupted differently every time, while the first and last are always correct and the second one is always corrupted the same way.

Check SD card and bootloader copy integrity

One obvious way that data corruption could happen is if the compressed zImage was written incorrectly to the SD card, or if the bootloader copies it to DDR incorrectly. First, we check how big the zImage is, and then ask the debugger to dump the data from the DDR to a file, just before the handoff from the bootloader to the decompressor:

$ ls -l linux/arch/arm/boot/zImage
-rwxr-xr-x 1 jk jk 7461288 Jan  7 11:09 linux/arch/arm/boot/zImage

Breakpoint 1, handoff_jump () at src/handoff.S:93
93         smc #0
(gdb) dump binary memory dump.bin 0xC2000000 0xC271d9a8

The SHA-256 hashes match, so neither the SD card contents nor the bootloader’s copy to DDR are corrupted:

9040ec8b8da5e613aa6e56060cc0cacf6779eec670c3a4123177cd07aff63300  zImage
9040ec8b8da5e613aa6e56060cc0cacf6779eec670c3a4123177cd07aff63300  dump.bin
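As an extra consistency check, the address range requested from GDB spans exactly the zImage size (a host-side sketch of mine):

```c
#include <stdint.h>

/* Address range dumped from DDR in the GDB session above. */
#define DUMP_START 0xc2000000u
#define DUMP_END   0xc271d9a8u

/* Should equal the zImage size reported by ls -l (7461288 bytes). */
static uint32_t dump_size(void)
{
    return DUMP_END - DUMP_START;
}
```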

Test DDR using STM32DDRFW-UTIL

ST provides a utility which they recommend running as part of any new PCB bring-up. I had done that already and did not think much of it, since all tests passed. Let’s take a closer look.

My “version” of the utility can be found in this repository. I made two small changes. First, instead of requiring the complicated “Cube” software suite, there is a simple Makefile so that the whole utility can be compiled with a single make invocation. Second, I commented out the three or so lines that initialize the STPMIC1, since my board does not use that power controller.

Let’s load the utility through the debugger, since it is already attached:

(gdb) file build/fwutil.elf
Reading symbols from build/fwutil.elf...
(gdb) load
Loading section .RESET, size 0xe000 lma 0x2ffe0000
Loading section .ARM, size 0x8 lma 0x2ffee000
Loading section .init_array, size 0x4 lma 0x2ffee008
Loading section .fini_array, size 0x4 lma 0x2ffee00c
Loading section .data, size 0x7fa lma 0x2ffee010
Start address 0x2ffe0000, load size 59402
Transfer rate: 260 KB/sec, 7425 bytes/write.
(gdb) c
Continuing.

On the serial console, we are greeted with the expected prompt:

=============== UTILITIES-DDR Tool ===============
Model: STM32MP13XX_DK
RAM: DDR3-1066 bin F 1x4Gb 533MHz v1.53
0:DDR_RESET
DDR>

As the utility readme instructs us, let us enter the DDR_READY step and then execute all the tests:

DDR>step 3
step to 3:DDR_READY
1:DDR_CTRL_INIT_DONE
2:DDR_PHY_INIT_DONE
3:DDR_READY
DDR>test 0
result 1:Test Simple DataBus = Passed
result 2:Test DataBusWalking0 = Passed
result 3:Test DataBusWalking1 = Passed
result 4:Test AddressBus = Passed
result 5:Test MemDevice = Passed
result 6:Test SimultaneousSwitchingOutput = Passed
result 7:Test Noise = Passed
result 8:Test NoiseBurst = Passed
result 9:Test Random = Passed
result 10:Test FrequencySelectivePattern = Passed
result 11:Test BlockSequential = Passed
result 12:Test Checkerboard = Passed
result 13:Test BitSpread = Passed
result 14:Test BitFlip = Passed
result 15:Test WalkingZeroes = Passed
result 16:Test WalkingOnes = Passed
Result: Pass [Test All]

This takes about a second to complete, and on the scope trace monitoring the 1.35V supply we see a tiny (maybe 2-5mV) dip during this time.

After all the tests are done, we can use the save command to get the DDR parameters from the utility. Here are the dynamic ones, reporting on the status:

/* ctl.dyn */
#define DDR_STAT 0x00000001
#define DDR_INIT0 0x4002004e
#define DDR_DFIMISC 0x00000001
#define DDR_DFISTAT 0x00000001
#define DDR_SWCTL 0x00000001
#define DDR_SWSTAT 0x00000001
#define DDR_PCTRL_0 0x00000001

/* phy.dyn */
#define DDR_PIR 0x00000000
#define DDR_PGSR 0x0000001f
#define DDR_ZQ0SR0 0x80021dee
#define DDR_ZQ0SR1 0x00000000
#define DDR_DX0GSR0 0x00008001
#define DDR_DX0GSR1 0x00000000
#define DDR_DX0DLLCR 0x40000000
#define DDR_DX0DQTR 0xffffffff
#define DDR_DX0DQSTR 0x3db02001
#define DDR_DX1GSR0 0x00008001
#define DDR_DX1GSR1 0x00000000
#define DDR_DX1DLLCR 0x40000000
#define DDR_DX1DQTR 0xffffffff
#define DDR_DX1DQSTR 0x3db02001

All the other parameters returned from the utility are identical to the values already used in the bootloader. Thus, I can assume that the DDR configuration in the bootloader matches the one used by the utility.

When does data get corrupted

Above we found that while decompression appears to finish successfully, it in fact leaves behind lots of partially corrupted data. The uncompressed kernel starts executing, only to trip into the SVC handler because of a corrupted instruction. Now, let’s try to track down exactly when the data first gets corrupted.

As seen above, in the current configuration, decompression takes place in the __gunzip routine (decompress_inflate.c). The decompression is done by zlib_inflate() (lib/zlib_inflate/inflate.c). First, clear the memory location that we’re interested in observing:

set {unsigned int}0xc0118b2c = 0x0
set {unsigned int}0xc0118b30 = 0x0
set {unsigned int}0xc0118b34 = 0x0
set {unsigned int}0xc0118b38 = 0x0

Verify it has been cleared:

(gdb) x/4x 0xc0118b2c
0xc0118b2c:     0x00000000      0x00000000      0x00000000      0x00000000

Some interesting breakpoints:

(gdb) b *0xc2001878
Breakpoint 20 at 0xc2001878: file arch/arm/boot/compressed/../../../../lib/zlib_inflate/inflate.c, line 63.
(gdb) b *0xc2001fa4
Breakpoint 34 at 0xc2001fa4: file arch/arm/boot/compressed/../../../../lib/zlib_inflate/inflate.c, line 582.

As it turns out, the corruption appears after the second call to inflate_fast:

(gdb) c
Continuing.

Breakpoint 36, zlib_inflate (strm=0xc271ea44, strm@entry=0xc271e9c0, flush=1072676126, flush@entry=0) at arch/arm/boot/compressed/../../../../lib/zlib_inflate/inflate.c:582
582                     inflate_fast(strm, out);
(gdb) x/4x 0xc0118b2c
0xc0118b2c:     0x00000000      0x00000000      0x00000000      0x00000000
(gdb) c
Continuing.

Breakpoint 36, zlib_inflate (strm=0xc271ea44, strm@entry=0xc271e9c0, flush=1072590367, flush@entry=0) at arch/arm/boot/compressed/../../../../lib/zlib_inflate/inflate.c:582
582                     inflate_fast(strm, out);
(gdb) x/4x 0xc0118b2c
0xc0118b2c:     0xe3a00000      0x2f400f10      0xffedecfd      0xee300f10

When we press c (or continue) in GDB, inflate_fast() runs and, very briefly (about 3.5ms), a voltage drop of about 30–40mV is observed on the 1.35V supply. During the same period, the droops on VREF_DDR0, VREF_DDR1, and VREF_DDR2 are barely perceptible.

We can go a step further and set a watchpoint, so the debugger triggers on the first access of the given memory location:

(gdb) watch *(uint32_t *)0xc0118b2c
Hardware watchpoint 38: *(uint32_t *)0xc0118b2c

Set the memory locations to zero as before; after the watchpoint triggers, single-step through the execution, checking the memory each time. Skipping ahead many such steps, we see how the value gets progressively filled in:

0xc0118b2c:     0xe3a00000      0x00000000      0x00000000      0x00000000
0xc0118b2c:     0xe3a00000      0x00000010      0x00000000      0x00000000
0xc0118b2c:     0xe3a00000      0x00000f10      0x00000000      0x00000000
0xc0118b2c:     0xe3a00000      0x00400f10      0x00000000      0x00000000
0xc0118b2c:     0xe3a00000      0x2f400f10      0x00000000      0x00000000

We see how the word fills up one byte at a time: zero, then 10, 0f, 40, 2f. That final 2f is erroneous; it should be ee, as we have seen previously in the disassembly of vmlinux.

The code loop that populates this word can be found in lib/zlib_inflate/inffast.c, lines 119 through 308; in particular, the line that wrote the incorrect 2f is number 247, in the middle of this section:

/* Align out addr */
if (!((long)(out - 1) & 1)) {
   *out++ = *from++;
   len--;
}

Key insight: 8-bit corruption

Let’s recap the situation so far. DDR appears to work as far as my own tests are concerned: I can fill the memory with pseudo-random data and read it all back correctly. The STM32DDRFW-UTIL tests all pass. The kernel runs if it’s loaded into memory uncompressed, but the decompression fails. Remembering further back, when writing the bootloader I had to force all DDR writes to be 32-bit aligned. All of this brings to mind the quote from Jay Carlson:

if your design doesn’t work, length-tuning is probably the last thing you should be looking at. For starters, make sure you have all the pins connected properly — even if the failures appear intermittent. For example, accidentally swapping byte lane strobes / masks (like I’ve done) will cause 8-bit operations to fail without affecting 32-bit operations. Since the bulk of RAM accesses are 32-bit, things will appear to kinda-sorta work.

Let’s take a good hard look at the connections on my custom board (Rev A) between the memory chip (MT41K256M16TW-107:P TR) and the SoC (STM32MP135FAE):

DDR pin DDR signal SoC signal SoC pin Notes
M2 BA0 BA0 G17
N8 BA1 BA1 L16
M3 BA2 BA2 G13
N3 A0 A0 G16
P7 A1 A1 K15
P3 A2 A2 F17
N2 A3 A3 G15
P8 A4 A4 M14
P2 A5 A5 E16
R8 A6 A6 M17
R2 A7 A7 G14
T8 A8 A8 L15
R3 A9 A9 F16
L7 A10/AP A10 J14
R7 A11 A11 K13
N7 A12/BC# A12 K17
T3 A13 A13 F14
T7 A14 A14 L17
D3 UDM DQM0 D15
E7 LDM DQM1 N14
B7 UDQS# DQS0N C16
C7 UDQS DQS0P C17
G3 LDQS# DQS1N R16
F3 LDQS DQS1P R17
E3 DQ0 DQ4 B16
F7 DQ1 DQ2 C13
F2 DQ2 DQ0 B17
F8 DQ3 DQ5 D16
H3 DQ4 DQ3 D17
H8 DQ5 DQ7 E15
G2 DQ6 DQ1 C15
H7 DQ7 DQ6 E14
D7 DQ8 DQ8 N16
C3 DQ9 DQ9 P17
C8 DQ10 DQ10 N15
C2 DQ11 DQ15 T16
A7 DQ12 DQ11 P15
A2 DQ13 DQ12 R15
B8 DQ14 DQ13 P16
A3 DQ15 DQ14 T17
K3 CASN CASN J15
K9 CKE CKE K14 10k pulldown
K7 CK# CLKN J17 100R to CK at DDR
J7 CK CLKP J16
L2 CS# CSN H16
K1 ODT ODT H15
J3 RAS# RASN H17
T2 RESET# RESETN E17 10k pulldown
L3 WE# WEN H13

Let’s check carefully what the DDR datasheet considers “upper” vs “lower”:

DQ[7:0] Lower byte of bidirectional data bus for the x16 configuration.

DQ[15:8] Upper byte of bidirectional data bus for the x16 configuration.

In other words, we should have mapped DQ[7:0] together with the DDR signals LDM and LDQS, while the upper byte DQ[15:8] should have been placed together with UDM and UDQS. Looking at the table above, we see that the mask/strobe signals are swapped:

DDR:UDM → SoC:DQM0
DDR:LDM → SoC:DQM1

But the data bits are not swapped, so this is incorrect:

DDR:DQ[7:0]  → SoC[7:0]  (scrambled)
DDR:DQ[15:8] → SoC[15:8] (scrambled)

My confusion can be traced back to the eval board design, which similarly swaps the mask/strobe wires, except it also (correctly) swaps the two DQ lanes. AI chatbots are of little use here: I can easily convince them either way regarding the correctness of my “semi-byte swap”.
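The failure mode can be illustrated with a toy model (entirely mine, and not hardware-accurate): assume the two data-mask (DM) wires are swapped while the DQ lanes are wired straight. A full-width write, with both masks asserted, still lands correctly, but a single-byte write enables the wrong lane, so the intended byte keeps its stale contents.

```c
#include <stdint.h>

/* One 16-bit DDR word with two byte lanes. dm_lo/dm_hi say which
 * lanes the controller INTENDS to write; with dm_swapped, each mask
 * wire reaches the other lane's DM input at the DRAM. */
typedef struct { uint8_t lo, hi; } ddr_word;

static void write_word(ddr_word *w, uint16_t data,
                       int dm_lo, int dm_hi, int dm_swapped)
{
    int en_lo = dm_swapped ? dm_hi : dm_lo;  /* mask seen by lane 0 */
    int en_hi = dm_swapped ? dm_lo : dm_hi;  /* mask seen by lane 1 */
    if (en_lo) w->lo = (uint8_t)(data & 0xff);
    if (en_hi) w->hi = (uint8_t)(data >> 8);
}
```

With both masks asserted the swap is invisible; masked (byte) writes hit the wrong lane. This is why mostly-32-bit traffic “kinda-sorta works” while 8-bit stores fail.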

Simple software test for DDR correctness

We saw above that the official ST DDR utility did not detect any problems with my incorrectly-wired DDR. After some prompting, Gemini 3 gave me the following test:

void ddr_align_test(int argc, uint32_t arg1, uint32_t arg2, uint32_t arg3)
{
    (void)argc; (void)arg1; (void)arg2; (void)arg3;
    uint32_t sctlr;

    // 1. READ SCTLR
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r" (sctlr));
    
    // 2. DISABLE CACHE (Bit 2) AND MMU (Bit 0)
    uint32_t sctlr_disabled = sctlr & ~((1 << 2) | (1 << 0));
    __asm__ volatile("mcr p15, 0, %0, c1, c0, 0" : : "r" (sctlr_disabled));
    __asm__ volatile("isb sy"); // Instruction sync barrier

    my_printf("!!! CACHE DISABLED !!! Testing raw hardware wires...\r\n");

    volatile uint8_t *p8 = (volatile uint8_t *)0xc0001000;
    
    // Perform a partial write
    p8[0] = 0xAA;
    __asm__ volatile("dsb sy"); // Force pin toggle
    
    if (p8[0] != 0xAA) {
        my_printf("FAILURE DETECTED: Byte 0 is 0x%02x (expected 0xAA)\r\n", p8[0]);
    } else {
        my_printf("SUCCESS: Byte 0 worked without cache.\r\n");
    }

    // 3. RE-ENABLE CACHE
    __asm__ volatile("mcr p15, 0, %0, c1, c0, 0" : : "r" (sctlr));
    __asm__ volatile("isb sy");
}

On the evaluation board, the printout is:

!!! CACHE DISABLED !!! Testing raw hardware wires...
SUCCESS: Byte 0 worked without cache.

On my board:

!!! CACHE DISABLED !!! Testing raw hardware wires...
FAILURE DETECTED: Byte 0 is 0x55 (expected 0xAA)

Next steps

While the explanation in the previous section (swapped byte lanes) seems plausible enough to stop debugging at this point and wait for “Rev B”, in the process I noted other possible avenues to explore:

LSB swizzling

Just because we found one issue with my connections, it does not mean we have found all of them. From the same article by Jay Carlson:

Because DDR memory doesn’t care about the order of the bits getting stored, you can swap individual bits — except the least-significant one if you’re using write-leveling — in each byte lane with no issues.

I have not been able to find any evidence of the LSB swapping restriction in ST literature (datasheet, reference manual, app notes). Indeed, one app note[1] just says that the DDR3L connection features “two swappable bytes, and swappable bits in the same byte”.

However, the MT41K DDR3L datasheet includes a section on Write Leveling which explains what’s up:

For better signal integrity, DDR3 SDRAM memory modules have adopted fly-by topology for the commands, addresses, control signals, and clocks. Write leveling is a scheme for the memory controller to adjust or de-skew the DQS strobe (DQS, DQS#) to CK relationship at the DRAM with a simple feedback feature provided by the DRAM. Write leveling is generally used as part of the initialization process, if required. For normal DRAM operation, this feature must be disabled. […]

When write leveling is enabled, the rising edge of DQS samples CK, and the prime DQ outputs the sampled CK’s status. The prime DQ for a x4 or x8 configuration is DQ0 with all other DQ (DQ[7:1]) driving LOW. The prime DQ for a x16 configuration is DQ0 for the lower byte and DQ8 for the upper byte.

So, just in case, we should make sure not to “swizzle” away the LSB of either byte lane (DQ0 and DQ8).
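If bit swizzling is ever considered for a future revision, that constraint can be captured in a small host-side check (a sketch; the table layout and function name are mine, not from any ST tool):

```c
/* swizzle[i] = index of the DRAM DQ wired to SoC DQ i, for a x16 part.
 * Rules: bits may be permuted only within their own byte lane, and the
 * write-leveling prime DQs (DQ0 and DQ8) must stay in place. */
static int swizzle_ok(const int swizzle[16])
{
    if (swizzle[0] != 0 || swizzle[8] != 8)
        return 0;                  /* prime DQ moved */
    for (int i = 0; i < 16; i++)
        if (swizzle[i] / 8 != i / 8)
            return 0;              /* bit crosses a byte lane */
    return 1;
}
```

The identity mapping passes, as does any permutation confined to one lane that leaves DQ0/DQ8 alone; moving a prime DQ fails the check.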

  1. Application note AN5692: DDR memory routing guidelines for STM32MP13x product lines. January 2023.