2019-01-28

Blinking the LED Part 2 - Clearing the .bss and the Peripherals Base Address

by Andre Leiradella

The `start` Function

When we write a regular C program, the C runtime takes care of a number of things for us. One of those things is clearing the .bss section, which is where the uninitialized, statically allocated variables reside, as well as statically allocated variables that are initialized to values represented by all bits being 0.

The operating system or the C runtime make sure that this section is filled with zeroes before execution reaches main. Since we don’t have that kind of help, we have to zero the .bss section ourselves.

Our program only worked until now because stack0, which is where the stack is created, doesn’t require any kind of initialization.

To zero .bss, we need to know where in memory it starts and ends. Fortunately, the linker script can create symbols with that information for us:

  .bss : {
    . = ALIGN(4);
    bss_start = .;
    *(.bss .bss.*)
    *(COMMON)
    . = ALIGN(4);
    bss_end = .;
  }

We can now use bss_start and bss_end in C code to zero the .bss section, which is the first thing we will do in start to make sure everything relying on it works:

static void __attribute__((noinline)) start(
  const uint32_t machine_type,
  const uint32_t atags_addr) {

  // Zero the .bss section.
  extern uint32_t bss_start, bss_end;

  for (uint32_t* i = &bss_start; i < &bss_end; i++) {
    *i = 0;
  }
}

The linker creates symbols as if they were variables at the locations specified in the linker script, but without any associated storage. This can be checked by looking at the output of objdump (I’ve removed most of the output for brevity):

$ arm-none-eabi-objdump -x kernel.elf

SYMBOL TABLE:
0000c040 g       .bss	00000000 bss_end
00008040 g       .bss	00000000 bss_start

The zeros right before the symbols’ names are their sizes, which are zero.

The “correct” way to declared them in C would be extern void bss_start, bss_end;, and i would be void*, but then the compiler warns us about taking the address of expressions of type void. To avoid these warnings, and also the casts needed to write to *i and update it, we declare them as uint32_t. This does not allocate space for them, if we use their values we’ll see values from whatever is in there.

I couldn’t find any reference for this, the explanation is based purely on tests and common sense.

The choice of uint32_t is not random, it has the benefit of letting us declare i as uint32_t* and write four aligned bytes in one go. That’s why we align the location counter to a four-byte boundary before declaring bss_start and bss_end in the linker script.

If we compile now, we’ll be able to check in kernel.lst that .bss is indeed being cleared:

00008000 <kmain>:
e59f3008 	ldr	r3, [pc, #8]	; 8010 <kmain+0x10>
e1a0d003 	mov	sp, r3
eb000001 	bl	8014 <start.isra.0>
    800c:	eafffffe 	b	800c <kmain+0xc>
0000c040 	.word	0x0000c040

00008014 <start.isra.0>:
e3a01000 	mov	r1, #0
e59f3014 	ldr	r3, [pc, #20]	; 8034 <start.isra.0+0x20>
    801c:	e59f2014 	ldr	r2, [pc, #20]	; 8038 <start.isra.0+0x24>
e1530002 	cmp	r3, r2
3a000000 	bcc	802c <start.isra.0+0x18>
e12fff1e 	bx	lr
    802c:	e4831004 	str	r1, [r3], #4
eafffffa 	b	8020 <start.isra.0+0xc>
00008040 	.word	0x00008040
0000c040 	.word	0x0000c040

The funny function name start.isra.0 is related to compiler optimizations that take place with -Os, as explained in this Stack Overflow answer.

Small Optimization

Now that start has something in it and the compiler generates code to call it, lets see something interesting from the ABI.

Lets change the loop that clears the .bss section to use the two parameters passed from kmain:

  for (uint32_t i = machine_type; i < atags_addr; i += 4) {
    *(uint32_t*)i = 0;
  }

This code doesn’t do anything useful nor correct, it’s there just to make sure the compiler uses the parameters passed from kmain.

If we look to kmain in the listing, we’ll see it moving values between registers before calling start:

00008000 <kmain>:
e1a00001 	mov	r0, r1
e59f300c 	ldr	r3, [pc, #12]	; 8018 <kmain+0x18>
e1a0d003 	mov	sp, r3
    800c:	e1a01002 	mov	r1, r2
eb000001 	bl	801c <start>
eafffffe 	b	8014 <kmain+0x14>
0000c038 	.word	0x0000c038

kmain receives zero, machine_type, and atags_addr in r0, r1, and r2 from the stub. However, start only wants machine_type and atags_addr, which then need to go into r0 and r1. The compiler must move r1 to r0 and r2 to r1 to put the values in the correct registers before calling main. Maybe we can avoid this by declaring start to also receive zero?

static void __attribute__((noinline)) start(
  const uint32_t zero,
  const uint32_t machine_type,
  const uint32_t atags_addr) {

  ...
}

Unfortunately (or fortunately!), the compiler sees that zero is not being used in start, and that start is only being called from kmain, and optimizes away this parameter as if it doesn’t exist. Lets use it in our loop:

  for (uint32_t i = machine_type; i < atags_addr; i += 4) {
    *(uint32_t*)i = zero;
  }

We can see in the listing that the compiler was able to greatly reduce the generated code:

00008000 <kmain>:
    8000:	e59f3008 	ldr	r3, [pc, #8]	; 8010 <kmain+0x10>
    8004:	e1a0d003 	mov	sp, r3
    8008:	eb000001 	bl	8014 <start>
    800c:	eafffffe 	b	800c <kmain+0xc>
    8010:	0000c028 	.word	0x0000c028

00008014 <start>:
    8014:	e1510002 	cmp	r1, r2
    8018:	3a000000 	bcc	8020 <start+0xc>
    801c:	e12fff1e 	bx	lr
    8020:	e4810004 	str	r0, [r1], #4
    8024:	eafffffa 	b	8014 <start>

Not only values in registers don’t need to be moved around in kmain, but the compiler didn’t need to initialize a register to 0 since r0 is being used to clear the .bss. Lets fix the code to use bss_start and bss_end:

  for (uint32_t* i = &bss_start; i < &bss_end; i++) {
    *i = zero;
  }

This compiles to:

00008000 <kmain>:
    8000:	e59f3008 	ldr	r3, [pc, #8]	; 8010 <kmain+0x10>
    8004:	e1a0d003 	mov	sp, r3
    8008:	eb000001 	bl	8014 <start.isra.0>
    800c:	eafffffe 	b	800c <kmain+0xc>
    8010:	0000c038 	.word	0x0000c038

00008014 <start.isra.0>:
    8014:	e59f3014 	ldr	r3, [pc, #20]	; 8030 <start.isra.0+0x1c>
    8018:	e59f2014 	ldr	r2, [pc, #20]	; 8034 <start.isra.0+0x20>
    801c:	e1530002 	cmp	r3, r2
    8020:	3a000000 	bcc	8028 <start.isra.0+0x14>
    8024:	e12fff1e 	bx	lr
    8028:	e4830004 	str	r0, [r3], #4
    802c:	eafffffa 	b	801c <start.isra.0+0x8>
    8030:	00008038 	.word	0x00008038
    8034:	0000c038 	.word	0x0000c038

I know this is an example of early and micro-optimization, but I can’t help myself.

Blinking the LED

The activity LED on the Raspberry Pi boards is connected to a general-purpose input/output (GPIO) pin. All it’s needed to turn it on and off is to program its pin as an output pin, and set the pin level to high and low.

The GPIO is a memory-mapped peripheral, so we need to know both its memory location and its memory layout to be able to program the pins and set their level. The GPIO memory layout is documented in BCM2835 ARM Peripherals, chapter 6 - General Purpose I/O (GPIO).

The memory location is physical address 0x20000000, but this document is for boards based on the BCM2835 SoC and the address has changed to physical address 0x3e000000 on boards based on later SoCs, as documented in QA7 - ARM Quad A7 core. We have to find a way to discover the correct address at runtime, but lets tackle the memory layout first.

GPIO Memory Layout

The most common way use a memory-mapped peripheral is to create a structure with the exact same layout as the peripheral, and then declare a variable pointing to a structure of that type at the memory base of the peripheral. The Raspberry Pi GPIO could be declared as:

typedef struct __attribute__((packed)) {
  uint32_t GPFSEL0;    // 0x00
  uint32_t GPFSEL1;    // 0x04
  uint32_t GPFSEL2;    // 0x08
  uint32_t GPFSEL3;    // 0x0c
  uint32_t GPFSEL4;    // 0x10
  uint32_t GPFSEL5;    // 0x14
  uint32_t reserved1;  // 0x18
  uint32_t GPSET0;     // 0x1c
  uint32_t GPSET1;     // 0x20
  uint32_t reserved2;  // 0x24
  uint32_t GPCLR0;     // 0x28
  uint32_t GPCLR1;     // 0x2c
  uint32_t reserved3;  // 0x30
  uint32_t GPLEV0;     // 0x34
  uint32_t GPLEV1;     // 0x38
  uint32_t reserved4;  // 0x3c
  uint32_t GPEDS0;     // 0x40
  uint32_t GPEDS1;     // 0x44
  uint32_t reserved5;  // 0x48
  uint32_t GPREN0;     // 0x4c
  uint32_t GPREN1;     // 0x50
  uint32_t reserved6;  // 0x54
  uint32_t GPFEN0;     // 0x58
  uint32_t GPFEN1;     // 0x5c
  uint32_t reserved7;  // 0x60
  uint32_t GPHEN0;     // 0x64
  uint32_t GPHEN1;     // 0x68
  uint32_t reserved8;  // 0x6c
  uint32_t GPLEN0;     // 0x70
  uint32_t GPLEN1;     // 0x74
  uint32_t reserved9;  // 0x78
  uint32_t GPAREN0;    // 0x7c
  uint32_t GPAREN1;    // 0x80
  uint32_t reserved10; // 0x84
  uint32_t GPAFEN0;    // 0x88
  uint32_t GPAFEN1;    // 0x8c
  uint32_t reserved11; // 0x90
  uint32_t GPPUD;      // 0x94
  uint32_t GPPUDCLK0;  // 0x98
  uint32_t GPPUDCLK1;  // 0x9c
}
gpio_t;

volatile gpio_t* get_gpio(void) {
  // Return a pointer to the GPIO base.
}

This is very reasonable and would be my choice to use memory-mapped peripherals, but lets take a look at this piece of code:

void wait(void) {
  volatile gpio_t* gpio = get_gpio();
  uint32_t gppud;

  do {
    gppud = gpio->GPPUD;
  }
  while ((gppud & UINT32_C(0x40000000)) != 0);
}

GPPUD is not intended to be used this way, this code is just to illustrate the issue.

Using the compilation flags in the Makefile, we get this code:

00000000 <test>:
e92d4010 	push	{r4, lr}
ebfffffe 	bl	0 <get_gpio>
e5d03094 	ldrb	r3, [r0, #148]	; 0x94
   c:	e5d03095 	ldrb	r3, [r0, #149]	; 0x95
e5d03096 	ldrb	r3, [r0, #150]	; 0x96
e5d03097 	ldrb	r3, [r0, #151]	; 0x97
e3130040 	tst	r3, #64	; 0x40
  1c:	1afffff9 	bne	8 <test+0x8>
e8bd8010 	pop	{r4, pc}

I won’t even try to understand why the compiler has done this, but what matters is that, instead of reading four bytes in one go, the code reads the least-significant byte first, then the second LBS, the third, and then the MSB, which is then tested against 0x40.

While this works for addresses that have memory semantics, here we’re dealing with a peripheral. Many peripherals have collateral effects when read and/or written, i.e. reading from a peripheral’s register may clear it’s contents. If that happens with GPPUD, the loop condition could miss the test value and spin forever.

To force atomically reading and writing to peripheral registers that are four bytes wide, we’ll write two more little assembly functions:

#ifndef MEMIO_H__
#define MEMIO_H__

#include <stdint.h>

inline uint32_t mem_read32(const uint32_t address) {
  uint32_t result;

  __asm volatile(
    "ldr %[result], [%[address]]\n"
    : [result] "=r" (result)
    : [address] "r" (address)
  );

  return result;
}

inline void mem_write32(const uint32_t address, const uint32_t value) {
  __asm volatile(
    "str %[value], [%[address]]\n"
    :
    : [address] "r" (address), [value] "r" (value)
  );
}

#endif /* MEMIO_H__ */

If we then use mem_read32 in our example, the resulting code does what we need:

00000000 <test>:
e92d4010 	push	{r4, lr}
ebfffffe 	bl	0 <get_gpio>
e2800094 	add	r0, r0, #148	; 0x94
   c:	e5903000 	ldr	r3, [r0]
e3130101 	tst	r3, #1073741824	; 0x40000000
1afffffc 	bne	c <test+0xc>
e8bd8010 	pop	{r4, pc}

Since we won’t be able to directly access the registers using the structure, we’ll just define constants for each register with the offset of the register to the base address of the peripheral. An added benefit is that we don’t need to remember to use __attribute__((packed)) to define the structure, and we don’t need to add reserved fields where there’s a hole in the peripheral’s register file.

Peripheral Base Address

The Raspberry Pi has a peripheral named Auxiliary, which has one “mini” UART (it lacks some functionalities), and two SPI masters. It’s fully documented in BCM2835 ARM Peripherals, but it has a hidden register that produces 0x61757830 (the ASCII characters "aux0") when read. By probing this register we can find the base address of all memory-mapped peripherals, since they’re all at fixed distance from one another.

Lets add a function to memio to initialize and retrieve the peripherals’ base address:

#ifndef MEMIO_H__
#define MEMIO_H__

#include <stdint.h>

extern const uint32_t g_baseio;

int memio_init(void);

// mem_read32 and mem_write32 omitted.

#endif /* MEMIO_H__ */

#include <stdint.h>

uint32_t g_baseio;

static inline uint32_t read32(const uint32_t address) {
  uint32_t result;

  __asm volatile(
    "ldr %[result], [%[address]]\n"
    : [result] "=r" (result)
    : [address] "r" (address)
  );

  return result;
}

int memio_init(void) {
  // Test for peripherals at 0x20000000.
  if (read32(UINT32_C(0x20215010)) == UINT32_C(0x61757830)) {
    g_baseio = UINT32_C(0x20000000);
    return 0;
  }

  // Test for peripherals at 0x3f000000.
  if (read32(UINT32_C(0x3f215010)) == UINT32_C(0x61757830)) {
    g_baseio = UINT32_C(0x3f000000);
    return 0;
  }

  return -1;
}

Lets not forget to add memio.c to the Makefile:

OBJS = kmain.o memio.o

Notice that in the header, g_baseio is declared as const, but not in the C file. As long as we don’t include the header in the .c, the compiler doesn’t know any better than to allocate a symbol named g_baseio with four bytes in the .bss section in memio.o. Other compilation units that include memio.h and use the symbol will only have an entry for it as undefined, and relocation records to patch addresses where needed.

The linker in its turn will happily resolve every compilation unit looking for this symbol, because it doesn’t have any information about it being const or not, for all it knows it’s only a symbol.

$ arm-none-eabi-objdump -x memio.o

SYMBOL TABLE:
00000004       O *COM*	00000004 g_baseio

$ arm-none-eabi-objdump -x kmain.o

SYMBOL TABLE:
00000000         *UND*	00000000 g_baseio


RELOCATION RECORDS FOR [.text]:
OFFSET   TYPE              VALUE 
0000003c R_ARM_ABS32       g_baseio

Now we can use g_baseio whenever we need the peripherals base address, but we can’t inadvertently change it.

Now that the environment for C code fully setup, and we know the base address of the peripherals, we can use the GPIO to blink the LED. Or can we?

Source Code

The source code listed here can be found here.

Back