Tuesday, September 11, 2012

...more tricks, more speed!

Very recently, I've further optimized the VGA smooth scaling that DSx86 uses, and Pate already announced on his blog the next release will feature the faster code. If you haven't read about the first wave of optimizations, please read about it in my August posts.

This time I exploited two more tricks, and the result is that the speed increased from 79% faster to 114% faster compared to the original code. Of course, this isn't bad at all.

The first optimization comes from the observation that ARM offers no base register plus shifted register offset addressing mode for halfword accesses, whereas that mode is very common for word accesses. This means we need a separate shift instruction to calculate the offset whenever we read a halfword from a lookup table, while a word can be read from a lookup table with a single instruction. Thus, if we reserve a 256-word temporary space on the stack (it takes 1 KB) for the lookup table and copy each palette RGB value there as a whole word, we later save one instruction per input pixel when we access them. So this fragment of code:

ldrb r3, [r1], #1  @ read first pixel value
ldrb r4, [r1], #1  @ read second pixel value
ldrb r5, [r1], #1  @ read third pixel value
lsl r3, #1         @ calculate offset (1st pixel)
ldrh r3, [r11, r3] @ read first pixel RGB color
lsl r4, #1         @ calculate offset (2nd pixel)
ldrh r4, [r11, r4] @ read second pixel RGB color
lsl r5, #1         @ calculate offset (3rd pixel)
ldrh r5, [r11, r5] @ read third pixel RGB color

turns into this shorter one:

ldrb r3, [r1], #1  @ read first pixel value
ldrb r4, [r1], #1  @ read second pixel value
ldrb r5, [r1], #1  @ read third pixel value
ldr r3, [r11, r3, lsl #2] @ read first pixel RGB color
ldr r4, [r11, r4, lsl #2] @ read second pixel RGB color
ldr r5, [r11, r5, lsl #2] @ read third pixel RGB color

Since the code performs 5 lookups per loop in total, this optimization saves 5 instructions, shortening the whole loop to just 27 instructions and raising the speed gain to 98% over the original code.
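For readers who prefer C to assembly, the table widening can be sketched roughly as follows. This is only an illustration, not the DSx86 code: the names `widen_palette` and `lookup_pixel` are made up, and it assumes each 16-bit color ends up in the low half of its 32-bit word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PALETTE_ENTRIES 256

/* Copy each 16-bit palette color into a 32-bit slot (hypothetical helper).
 * The destination takes 256 * 4 = 1024 bytes, the 1 KB mentioned above. */
static void widen_palette(const uint16_t *palette16, uint32_t *palette32)
{
    assert(palette16 != NULL && palette32 != NULL);
    for (size_t i = 0; i < PALETTE_ENTRIES; i++) {
        palette32[i] = palette16[i]; /* color in the low halfword, top half zero */
    }
}

/* Word-sized lookup: on ARM this can map to a single
 * "ldr rX, [base, index, lsl #2]", with no separate shift. */
static uint32_t lookup_pixel(const uint32_t *palette32, uint8_t index)
{
    return palette32[index];
}
```

The price is the one-off copy of 256 entries and 1 KB of stack, which is paid back by saving one instruction on every pixel lookup.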

The second optimization uses the well-known trick of loop unrolling. Since the results were so good, I thought it was worth spending some code space. The loop has been unrolled 8 times, so it now processes 40 input pixels per iteration before reaching the costly (3 cycles) branch instruction. Even this simple optimization proved very effective in terms of performance.
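The unrolling idea can be illustrated in C on a much simpler loop than the real scaler. The sketch below is an assumption-laden example, not the actual routine: `convert_unrolled` is a hypothetical function that only does a palette lookup per pixel, with the body repeated 8 times so the loop branch is taken once per 8 pixels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Palette lookup for 'count' pixels, loop body unrolled 8 times.
 * Hypothetical example; the real scaler does more work per pixel. */
static void convert_unrolled(const uint8_t *src, uint32_t *dst,
                             size_t count, const uint32_t *palette32)
{
    assert(src != NULL && dst != NULL && palette32 != NULL);
    size_t i = 0;

    /* Main loop: one branch per 8 pixels instead of one per pixel. */
    for (; i + 8 <= count; i += 8) {
        dst[i + 0] = palette32[src[i + 0]];
        dst[i + 1] = palette32[src[i + 1]];
        dst[i + 2] = palette32[src[i + 2]];
        dst[i + 3] = palette32[src[i + 3]];
        dst[i + 4] = palette32[src[i + 4]];
        dst[i + 5] = palette32[src[i + 5]];
        dst[i + 6] = palette32[src[i + 6]];
        dst[i + 7] = palette32[src[i + 7]];
    }

    /* Tail loop: handle the pixels left over when count is not a multiple of 8. */
    for (; i < count; i++) {
        dst[i] = palette32[src[i]];
    }
}
```

In hand-written assembly the tail loop can be dropped when the line width is known to be a multiple of the unrolled block, which is the usual way to keep the extra code space down.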


  1. I'm trying to figure out what you measure to claim that you got "114% faster" ... that clearly can't be the amount of time needed to complete a task. The number of frames/pixels you can render within a given amount of time, maybe?

    1. I count the cycles taken to do the work with both the original and the optimized code, and I calculate the speed increase using the following formula: (float)((cycles_original/cycles_optimized)-1)*100
      So, for example, if cycles_optimized is half of cycles_original, the formula gives 100. "114% faster" means that cycles_optimized is even less than half of cycles_original.
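The formula from the reply above can be written as a small C helper. This is a sketch under assumptions: the name `speedup_percent` and the parameter types are invented here. Note that in C the division has to be carried out in floating point; if both cycle counts were integers and the cast were applied only to the result, the quotient would already have been truncated.

```c
#include <assert.h>

/* Speed increase in percent: (original / optimized - 1) * 100.
 * Hypothetical helper; both counts are converted to float before dividing
 * so that the ratio is not truncated by integer division. */
static float speedup_percent(unsigned long cycles_original,
                             unsigned long cycles_optimized)
{
    assert(cycles_optimized > 0);
    return ((float)cycles_original / (float)cycles_optimized - 1.0f) * 100.0f;
}
```

With made-up numbers, 200 cycles against 100 cycles gives 100, and 214 cycles against 100 cycles gives roughly 114, so "114% faster" corresponds to the optimized code taking a little under half the cycles of the original.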