Saturday, August 25, 2012

a couple speed improving tricks

(This post is a follow-up on my Quick color averaging post. Please read that post first)

Developing a weighted average that uses only 7 ARM assembler instructions instead of using 8 instructions to get the same result was just the tip of the iceberg. To achieve the highest speed possible when resizing a 320 pixel wide image into a 256 pixel wide one, which effectively means converting a PC VGA "Mode 13h" (256 colors) image into a 256 pixel wide 15bpp (32K colors) image on the DS, we should try to speed up every step of the whole conversion process.

For instance, retrieving each pixel's RGB values means reading a byte from the source image that is a pixel in the VGA screen, and accessing the corresponding offset within the palette by performing a lookup table read. So the ARM assembler code for reading the first two RGB values might look like this:

ldrb r3, [r1], #1  @ read first pixel value
lsl r3, #1         @ calculate offset
ldrh r3, [r11, r3] @ read first pixel RGB color
ldrb r4, [r1], #1  @ read second pixel value
lsl r4, #1         @ calculate offset
ldrh r4, [r11, r4] @ read second pixel RGB color

The code above is correct, but it doesn't take into account register interlocking. The ARM946E processor has a 5 stages pipeline, and its loading instructions require that the Memory stage be completed before you can use the target register. This means that there would be a so-called single-cycle load-use interlock if you load a word from memory to a register and you use that register right in the next instruction. In other words, the processor needs to insert a 1-cycle 'pause' before the Execute stage of each of the lsl instructions. Unfortunately, in our code we're reading a single byte from memory instead of a whole word, and things get even worse. Loading a byte (or a halfword) from memory into a register additionally requires the Write stage, thus triggering a two-cycle load-use interlock if the following instruction needs to use the register just loaded, as it happens in our code. (see section 7.12.1 of the ARM9E-S Core Technical Reference, PDF)

Simply reordering the instructions will save us lots of wasted cycles:

ldrb r3, [r1], #1  @ read first pixel value
ldrb r4, [r1], #1  @ read second pixel value
ldrb r5, [r1], #1  @ read third pixel value - we need it later
lsl r3, #1         @ calculate offset (1st pixel)
ldrh r3, [r11, r3] @ read first pixel RGB color
lsl r4, #1         @ calculate offset (2nd pixel)
ldrh r4, [r11, r4] @ read second pixel RGB color

Another thing we have to take into account is that in the DS the color palette is stored in a rather slow memory, and non-sequential accesses to this memory are even slower. According to GBATek, a single 16-bits non-sequential access to palette RAM takes four 33.5 MHz cycles, which translates into eight CPU cycles, because the ARM9 runs at 67 MHz. Palette RAM isn't even cacheable (it's the default setting with DevKitARM; however, I don't suggest that you change this setting even if you actually can) and a lookup is needed for each pixel of the PC VGA "Mode 13h" screen. With a resolution of 320x200 pixel, this happens 64000 times per frame.

To speed up all those lookups, we can copy the palette into a faster memory right before starting our conversion routine. DTCM (Data Tightly-Coupled Memory) is just the right choice. It's a very fast memory: it has single-cycle access time even with non-sequential accesses, but it isn't very large being in fact only 16 KB total. The program's stack resides on it (again, it's a DevKitARM default setting, and once more I don't recommend changing it) but we actually need only 512 bytes to copy the 256 halfwords. So we temporarely allocate that half kilobyte on top of the stack and copy the palette there. Then we will perform all our lookups being sure there will be no slowdown. Actually, this has surely been the most effective change applied to the code in terms of performance improvement.

The last code optimization uses a peculiar kind of SIMD. ARM9 isn't a SIMD CPU, so it can't process multiple data with a single instruction unlike most processors in use nowadays. However, since we have 32 bits registers in there and we need to process 16 bits operands, we could stuff two operands per register and process double-operands as if they were normal operands. Of course, we have to be sure that we don't mix them up. This 'trick' is called SWAR - SIMD Within A Register.

Since in our code we have to perform two weighted averages for each stripe of 5 pixels that we want to convert into 4, we can actually perform the two weighted averages at the same time. Obviously, there's a little overhead: we need to move the operands together before performing the operations and separate them afterwards. This requires 4 ARM assembler instructions. So we can perform two weighted averages in just 11 instructions.

The resulting code, after all these changes, turned out to be 79% faster. Now it processes 179 pixels in the same time that it took the previous code to process only 100 pixels.

In the next post I'll tell you how to obtain the same graphical output without virtually using any CPU resources.

Wednesday, August 08, 2012

Quick color averaging

During my vacation back in May 2011 I was stuck for 4 days between an unexpected incredible snowstorm on one side and the eruption of the Grímsvötn volcano on the other side, of course in Iceland. Well... I had a lot of time and very little things to do, so I spent some time trying to figure out the fastest method of calculating a weighted average between two RGB colors, a and b, so that the result would be (3a + b)/4.

What for? Because I had already started being interested in DSx86, a PC emulator for Nintendo DS. If you've never tried this amazing homebrew, I suggest that you do so as soon as possible. DSx86 author 'Pate', in his May 15 blog post was seeking for suggestions on how to perform a faster weighted average between two colors. His then method was to run a normal average twice, to achieve a weighted one: tmp = (a+b)/2 then avg = (a+tmp)/2.

So what's the reason why I'm writing this post now? Well... time passes and memories start to fade, so I wanted to write down my thoughts and share them before they are gone completely. You know, I'm growing older ;)

If we can define a + b = (a ^ b) + ((a & b) << 1) as it appears in the following truth table:

a  b   a+b
0  0   00
0  1   01
1  0   01
1  1   10

then the average formula will be

(a + b)/2 = ((a ^ b)>>1) + (a & b).

Since our colors are halfwords (16 bits) where 5 bits are reserved for each RGB component, such as xBBBBBGGGGGRRRRR, the right shifting would make the least significant bit of the blue and green components fall into the bits reserved for the green and red components respectively, we should actually mask each lsb of (a ^ b) result before shifting. Thus we will obtain

(a + b)/2 = (((a ^ b) & ~0x421) >>1) + (a & b)

which is an accurate average of two RGB colors obtained without having to calculate each component average separately (please, read the very interesting Quick colour averaging article on CompuPhase web site).

Similarly, we can define 3a + b as

a  b   3a+b
0  0   000
0  1   001
1  0   011
1  1   100

which can be expressed as (a ^ b) + ((a & ~b)<<1) + ((a & b)<<2). To obtain the weighted average, we still have to divide it by 4, which results in

(3a + b)/4 = (a ^ b)>>2 + ((a & ~b)>>1) + (a & b)

Again, the shifts here would make the least significant bits fall into the other components, so we have to clear the least significant bit for the 1-bit right shift and clear two least significant bits for the 2-bit right shift. Finally, we get

(3a + b)/4 = (((a ^ b) & ~0xC63) >>2) + (((a & ~b) & ~0x421) >>1) + (a & b)

The normal average was implemented using 4 ARM assembler instructions, and had to be done twice. On the contrary, the weighted average calculated as per my expression can be coded using 7 ARM instructions only, which allows to save 1 cycle per weighted average. Not bad if you consider that all 200 320-pixel-wide lines of the VGA screen have to be converted into 256-pixel-wide lines to fit the DS screen up to 60 times per second. To do this, you need to perform two weighted averages every 5 pixels.

There are some other nice tricks I used to speed up things even more... but I'll detail those in the next post because I'd prefer to contain this one to quick color averaging subject only.