Issue
I am working on a bare metal OS for the Raspberry Pi model B, which features an ARM1176-JZF-S processor. While implementing the sine function for a math library, I encountered something very strange, which I've whittled down to a smallish minimal reproducible example.
The following code counts up from zero to four, and prints out each number with spaces in between:
mov r4, #0 // Initialize counter to 0
c_loop$:
ldr r0, =IntString // Convert counter to a string
mov r1, r4
bl int_to_str
ldr r0, =IntString // Print the string
ldr r1, =0x00000FF0 // (Green text on black background)
bl print
ldr r0, =Space // Print a space
ldr r1, =0x00000FF0 // (Green text on black background)
bl print
mov r5, #0x1000000 // Pause for a beat
c_pause$:
subs r5, #1
bne c_pause$
add r4, #1 // Increment counter
cmp r4, #5 // Repeat until counter = 5
blt c_loop$
halt: // Wait forever
b halt
The functions int_to_str and print were both written by me, and work fine. To be clear, they are not printing to any kind of output stream; they just write pixels in the shape of numbers directly to a frame buffer, which I got from the GPU through the mailbox system. The label IntString is a space for me to store the conversion of the counter to a string so I can print it out, and the label Space points to a string that's just a single space. This code works as intended and I see the numbers displayed on the screen.
Here's what's odd. Have a look at this floating-point operation:
vadd.f32 s2, s0, s1 // What the heck is happening here?
When I add this to the loop right before the line that increments the counter, I get entirely different behavior. Rather than printing "0, 1, 2, 3, 4", I now see "0, 1, 0, 1, 0, 1, ..." repeating forever. Why is this happening? Why does the floating point instruction have any effect on this code at all?
Important additional info: A while ago, I was working on some code to draw a Mandelbrot fractal to the screen, using floating point arithmetic to do the calculations. Back then I believed that my Raspberry Pi had a Cortex A7 processor (which is what the newer models have) and I turned to the Cortex A7 Floating-Point Unit Technical Reference Manual which says that:
To use the Cortex-A7 FPU in Secure state and Non-secure state, first define the NSACR and then define the CPACR and FPEXC registers to enable the Cortex-A7 FPU.
It gave the following code snippet to accomplish this task:
MRC p15, 0, r0, c1, c1, 2
ORR r0, r0, #3<<10 // enable fpu
MCR p15, 0, r0, c1, c1, 2
LDR r0, =(0xF << 20)
MCR p15, 0, r0, c1, c0, 2
MOV r3, #0x40000000
VMSR FPEXC, r3
For some reason, this worked, and my Mandelbrot fractal appeared. Anyway, this snippet is present in the program I'm working on today, directly above the code shown. When I remove it, I get different unexpected behavior: the program prints "0, 0, 0, ..." (an infinite series of just 0's instead of 0's and 1's).
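For reference, here is a minimal sketch of the same enable sequence written for the ARM1176 rather than the Cortex-A7. It assumes the code is already running in the Secure state, so the NSACR write is skipped; that assumption and the prefetch-buffer flush are additions for illustration and are not part of the manual's snippet. It grants full access to coprocessors 10 and 11 in the CPACR and then sets the EN bit in FPEXC.

```asm
@ Sketch: enable the VFP on an ARM1176, assuming Secure state.
mrc p15, 0, r0, c1, c0, 2    @ Read the Coprocessor Access Control Register
orr r0, r0, #0x00F00000      @ Full access to cp10 and cp11 (bits 20-23)
mcr p15, 0, r0, c1, c0, 2    @ Write it back
mov r0, #0
mcr p15, 0, r0, c7, c5, 4    @ Flush the prefetch buffer so the change takes effect
mov r0, #0x40000000          @ FPEXC.EN is bit 30
vmsr fpexc, r0               @ Turn the VFP on
```

Unlike the manual's version, this one reads the CPACR before modifying it, so any other access bits already set are preserved.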
More details: My best guess about what's going on here is that the s0 and s1 floating point registers initially contain garbage, and that adding them together can raise an exception. This would explain a detail I haven't mentioned yet, which is that the code occasionally works even with the floating point instruction included -- maybe one time in five.
In order to test this theory, I tried setting all registers involved to zero right before the counting loop begins:
mov r0, #0
vmov s0, r0
vmov s1, r0
vmov s2, r0
And lo and behold, the loop worked again. However, as a further test, I decided to set both s0 and s1 to the maximum value a float can hold, reasoning that this should yield an overflow error and cause the unexpected behavior to return:
ldr r0, =0b01111111011111111111111111111111
vmov s0, r0
vmov s1, r0
vmov s2, r0
But this too leads to the correct counting behavior!
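One more experiment that may narrow this down, assuming the failure is triggered by operands the hardware cannot handle on its own rather than by overflow: load a subnormal (denormal) value instead of a large one. The bit pattern 0x00000001 is the smallest positive single-precision subnormal. Whether the VFP traps on it in its default mode is a hypothesis to test here, not something the results above establish.

```asm
mov r0, #1           @ 0x00000001: smallest positive subnormal float
vmov s0, r0
vmov s1, r0
mov r0, #0
vmov s2, r0          @ Clear the destination, as in the earlier tests
```

If the counter misbehaves on every run with these values, that would fit the intermittent failures: the outcome would depend on whether the power-up garbage in s0 and s1 happens to look like a subnormal.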
I'm at a loss for what's going on here. What's causing this?
Update: I've just noticed an issue. The code I'm using to assemble .s files into .o files is this:
arm-none-eabi-as -o $@ $< -mfpu=vfpv4 -mcpu=cortex-a72 -mfloat-abi=hard
But this has two issues. One, the vfpv4 is incorrect as the model B features VFPv2, and two, cortex-a72 is incorrect as the model B features an ARM1176-JZF-S.
Fixing the first of these two issues doesn't change any of the behavior mentioned above (I re-tried each example and got the same results). The second issue seems more serious, however, since the man page for arm-none-eabi-as doesn't list the model B's processor type as one of the options. I will investigate further and post an update once I know more.
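GNU as can also take the target from directives inside the source file, which keeps the setting next to the code it affects. A minimal sketch, assuming the installed binutils build recognises the CPU name arm1176jzf-s (this name is not taken from the man page mentioned above, so check it against your own toolchain):

```asm
.cpu arm1176jzf-s    @ ARM1176JZF-S core (assumed to be known to this assembler build)
.fpu vfpv2           @ VFPv2 floating point unit, as on the model B
```

With the target pinned this way, an instruction that VFPv2 or the ARM1176 does not support should be rejected at assembly time instead of failing on the hardware.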
Solution
I have fixed this now. This web page explains what needs to be done to set up floating point numbers, and I was missing this part of the process:
@; load the status register
fmrx r0, fpscr
@; enable flush-to-zero (bit 24)
orr r0, #0x01000000
@; disable traps (bits 8-12 and bit 15)
bic r0, #0x9f00
@; save the status register
fmxr fpscr, r0
The page explains:
The default floating point mode on the ARM11 is to implement the most common floating point operations in hardware, and delegate to software for special cases. This is done by raising an unsupported operation exception, called a trap, in which you the programmer are supposed to figure out what went wrong (e.g., an underflow), calculate the correct result, and resume the program.
If, like me, you don't feel like implementing a bunch of floating point operations, there is an alternative: RunFast mode, or Flush-to-zero mode (which nearly means the same thing). This is a pure hardware floating point implementation which is not-quite IEEE 754-compliant. [...]
I haven't implemented any such handlers, so it looks like this configuration is what I need. I don't have a full mental picture of why this was causing the exact problem I was having, but I'm no longer surprised that there was a problem.
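To build a fuller picture of what the hardware was doing, one possible diagnostic is to look at the FPEXC register right after the addition, with the FPSCR fix temporarily removed. This is only a sketch, and it rests on two assumptions about the ARM1176's VFP that are worth checking against its Technical Reference Manual: that bit 31 of FPEXC (EX) is set when an operation has been handed off to support code, and that reading FPEXC is safe in that state. It reuses int_to_str, print and IntString from the question.

```asm
vadd.f32 s2, s0, s1      @ The instruction under suspicion
vmrs r1, fpexc           @ Copy FPEXC into r1
mov r1, r1, lsr #31      @ Keep only bit 31 (assumed to be the EX flag)
ldr r0, =IntString       @ Convert the flag (0 or 1) to a string
bl int_to_str
ldr r0, =IntString       @ Print it in green, like the counter
ldr r1, =0x00000FF0
bl print
```

If this prints 1, the exception may be pending and only taken on a later floating point instruction, which would be consistent with the counter reaching 1 before the program starts over.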
Answered By - MegaWidget Answer Checked By - David Goodson (WPSolving Volunteer)