Wednesday, March 20, 2013

One hundred twenty-seven shades of grey

Nothing to deal with the bestseller, assured.
As you may already know, the DS 2D core supports paletted and direct colors, both of which are expressed using 5 bits per primary color (red, green, blue). This is 15bpp, also known as HiColor mode.
Having 'only' five bits per primary color means that there are just 32 different shades of gray that can be defined including the darkest - black, and the brightest - white.


Back in 2008, while reading GBATek specifications, I had found out that the DS screens were 18bpp LCD panels, so I started a topic on gbadev forum suggesting that there might be a way to exploit this. It turned out that using the hardware alpha blending capabilities of the 2D core you can indeed force the hardware to show real 18bpp images. Fellow forum member Cydrak has even posted a very good demo then, and here's his original forum post with the link to his 18bpp demo.
So, having now 6 bits per primary color, it's possible to display 64 shades of grey. The improvement is significant, even if banding is still noticeable.


Pushing this thing further has been tickling me since then, so very recently I decided to give it a try. Of course, the hardware limit of 18bpp can't be overcome, and the display can't show more colors than what it's capable of. However, having a 2D core that can generate 60 frames per second, we can exploit the human eye persistence of vision. The idea here is that if we display two slightly different images alternatively at a sufficient high rate (60 times per second is surely high enough), our retinas will just perceive a sort of an average of the two images. And that's what we get: 127 shades of grey. Even if you know that the additional 63 shades aren't really there, they're simply the result of our perception.


Note: the second and third image in this post are fake: there are no emulators that can show 18bpp and, of course, there's no way of making any emulator show the image that your eyes perceive. So, I suggest that you test the demo yourself on your DS. You can download it here.

Tuesday, September 11, 2012

...more tricks, more speed!

Very recently, I've further optimized the VGA smooth scaling that DSx86 uses, and Pate already announced on his blog the next release will feature the faster code. If you haven't read about the first wave of optimizations, please read about it in my August posts.

This time I exploited two more tricks, and the result is that the speed increased from 79% faster to 114% faster compared to the original code. Of course, this isn't bad at all.

The former optimization comes from the observation that there's no way to use a base register plus a shifted register offset addressing scheme when accessing halfwords, whereas this is very common when accessing words instead. This means we need a separate shift instruction to calculate the offset if we want to access a halfword in a lookup table while, on the contrary, we can access a word in a lookup table using a single instruction. Thus, if we define a 256-words temporary space in the stack (it takes 1 KB) for the lookup table and copy there each palette RGB values as whole words, we will later save one instruction per input pixel when we access them. So this fragment of code:

ldrb r3, [r1], #1  @ read first pixel value
ldrb r4, [r1], #1  @ read second pixel value
ldrb r5, [r1], #1  @ read third pixel value
lsl r3, #1         @ calculate offset (1st pixel)
ldrh r3, [r11, r3] @ read first pixel RGB color
lsl r4, #1         @ calculate offset (2nd pixel)
ldrh r4, [r11, r4] @ read second pixel RGB color
lsl r5, #1         @ calculate offset (3rd pixel)
ldrh r5, [r11, r5] @ read third pixel RGB color

turns into this shorter one:

ldrb r3, [r1], #1  @ read first pixel value
ldrb r4, [r1], #1  @ read second pixel value
ldrb r5, [r1], #1  @ read third pixel value
ldr r3, [r11, r3, lsl #2] @ read first pixel RGB color
ldr r4, [r11, r4, lsl #2] @ read second pixel RGB color
ldr r5, [r11, r5, lsl #2] @ read second pixel RGB color

Since the code performs 5 lookups per loop in total, this optimization saved 5 instructions, shortening the whole loop to 27 instructions only and increasing the speed to 98%.

The latter optimization done uses the very well known trick of the loop unrolling. Since the results are so good, I thought it was worth spending some code space. The loop has been unrolled 8 times, thus now it processes 40 input pixel each iteration before encountering the costly (3 cycles) branch instruction. Even this simple optimization proved to be very effective in terms of performance improvement.

Monday, September 03, 2012

Hardware generated smooth scaling

In my August posts I focused on the improvements done with DSx86's ARM ASM smooth scaling routine where I did my best to make it as fast as possible, knowing that every CPU cycle saved there would turn useful in the emulation main loop. Then it took me a few more months to realize that actually the same result can be achieved by properly programming the NDS 2D graphical core. So here's how I did it.

The smooth scaling routine takes groups of five 256-color pixels on the same line and turns them into four 32K-color pixels on the DS screen by performing many palette lookups and regular/weighted averages, as we've seen already. The DS 2D core, on the other hand, can perform alpha blending between two backgrounds, without requiring any effort from the CPU. This alpha blending feature can achieve nothing less than an average between each pixel of the first background and the corresponding pixel on the second background, returning a 32K-color image.(1) Additionally, the 2D core can also perform background scaling. We need to exploit both these features.

Let's define the 5 original pixels as p0-p4, and the resulting 4 output pixels as r0-r3. What we need to get is:

r0 as the sum of 3/4 p0 and 1/4 p1
r1 as the sum of 1/2 p1 and 1/2 p2
r2 as the sum of 1/4 p2 and 3/4 p3
r3 as p4

If we could blend 4 backgrounds together we could simply copy specific pixels in the 4 backgrounds to obtain this (please check that each output pixel is exactly as expected):

BG0: p0 p1 p2 p4
BG1: p0 p1 p3 p4
BG2: p0 p2 p3 p4
BG3: p1 p2 p3 p4

Since the 2D core can do background scaling, we don't even need to copy specific pixels. Each background can be generated the way we need it starting from the unmodified original image stored in Video RAM using the scaling features. Thus, we program the 2D core to skip one source pixel each group of five, and choose which pixel has to be skipped.
For example, to generate each of the backgrounds (the code does that for BG2), we have to program the background affine matrix to scale a 320-pixel wide image in a 256-pixel wide background:

REG_BG2PA = (320 << 8) / 256;
REG_BG2PB = 0;
REG_BG2PC = 0;
REG_BG2PD = (1 << 8);


Then we should tell the 2D core to skip pixel p1. This is accomplished by using the reference point X coordinate register:

REG_BG2X = (3 << 8) / 4;

You can think of this register as if it was a sort of a counter of the fractional part. We initialize it to a precise value (3/4, in this case) and after each output pixel has been generated, 1/4 gets added to this counter. (It's because 320 divided by 256 gives 1 plus a fractional part of 1/4). When the counter reaches the unit, the scaling process skips one pixel of the original image, and in this case this will happen after processing one pixel. We can also tell the 2D core to use the same 320x200 bitmap for all the backgrounds, then program different reference point X coordinate values for each background.

Unfortunately, what we can't ask the 2D core is to blend all 4 backgrounds at the same time. However, we can make it blend 2 of these backgrounds each frame, and blend the other 2 backgrounds the next frame, at 60 frames per second.(2) The LCD screen and our retinas will average the 2 generated images, providing in fact the expected result.

DSx86 actually uses a slightly different implementation. It performs vertical scaling at the same time (200 lines down to 192 in VGA "Mode 13h" and 240 lines down to 192 in VGA "Mode X", using different affine matrices) in the so-called 'Jitter' mode.

(1) The DS screen output supports 18bpp color, and alpha blending is probably performed with even more precision.
(2) Since only BG2 and BG3 support bitmap backgrounds, the code will blend these two, redefining them as needed on each frame.

Saturday, August 25, 2012

a couple speed improving tricks

(This post is a follow-up on my Quick color averaging post. Please read that post first)

Developing a weighted average that uses only 7 ARM assembler instructions instead of using 8 instructions to get the same result was just the tip of the iceberg. To achieve the highest speed possible when resizing a 320 pixel wide image into a 256 pixel wide one, which effectively means converting a PC VGA "Mode 13h" (256 colors) image into a 256 pixel wide 15bpp (32K colors) image on the DS, we should try to speed up every step of the whole conversion process.

For instance, retrieving each pixel's RGB values means reading a byte from the source image that is a pixel in the VGA screen, and accessing the corresponding offset within the palette by performing a lookup table read. So the ARM assembler code for reading the first two RGB values might look like this:

ldrb r3, [r1], #1  @ read first pixel value
lsl r3, #1         @ calculate offset
ldrh r3, [r11, r3] @ read first pixel RGB color
ldrb r4, [r1], #1  @ read second pixel value
lsl r4, #1         @ calculate offset
ldrh r4, [r11, r4] @ read second pixel RGB color

The code above is correct, but it doesn't take into account register interlocking. The ARM946E processor has a 5 stages pipeline, and its loading instructions require that the Memory stage be completed before you can use the target register. This means that there would be a so-called single-cycle load-use interlock if you load a word from memory to a register and you use that register right in the next instruction. In other words, the processor needs to insert a 1-cycle 'pause' before the Execute stage of each of the lsl instructions. Unfortunately, in our code we're reading a single byte from memory instead of a whole word, and things get even worse. Loading a byte (or a halfword) from memory into a register additionally requires the Write stage, thus triggering a two-cycle load-use interlock if the following instruction needs to use the register just loaded, as it happens in our code. (see section 7.12.1 of the ARM9E-S Core Technical Reference, PDF)

Simply reordering the instructions will save us lots of wasted cycles:

ldrb r3, [r1], #1  @ read first pixel value
ldrb r4, [r1], #1  @ read second pixel value
ldrb r5, [r1], #1  @ read third pixel value - we need it later
lsl r3, #1         @ calculate offset (1st pixel)
ldrh r3, [r11, r3] @ read first pixel RGB color
lsl r4, #1         @ calculate offset (2nd pixel)
ldrh r4, [r11, r4] @ read second pixel RGB color

Another thing we have to take into account is that in the DS the color palette is stored in a rather slow memory, and non-sequential accesses to this memory are even slower. According to GBATek, a single 16-bits non-sequential access to palette RAM takes four 33.5 MHz cycles, which translates into eight CPU cycles, because the ARM9 runs at 67 MHz. Palette RAM isn't even cacheable (it's the default setting with DevKitARM; however, I don't suggest that you change this setting even if you actually can) and a lookup is needed for each pixel of the PC VGA "Mode 13h" screen. With a resolution of 320x200 pixel, this happens 64000 times per frame.

To speed up all those lookups, we can copy the palette into a faster memory right before starting our conversion routine. DTCM (Data Tightly-Coupled Memory) is just the right choice. It's a very fast memory: it has single-cycle access time even with non-sequential accesses, but it isn't very large being in fact only 16 KB total. The program's stack resides on it (again, it's a DevKitARM default setting, and once more I don't recommend changing it) but we actually need only 512 bytes to copy the 256 halfwords. So we temporarely allocate that half kilobyte on top of the stack and copy the palette there. Then we will perform all our lookups being sure there will be no slowdown. Actually, this has surely been the most effective change applied to the code in terms of performance improvement.

The last code optimization uses a peculiar kind of SIMD. ARM9 isn't a SIMD CPU, so it can't process multiple data with a single instruction unlike most processors in use nowadays. However, since we have 32 bits registers in there and we need to process 16 bits operands, we could stuff two operands per register and process double-operands as if they were normal operands. Of course, we have to be sure that we don't mix them up. This 'trick' is called SWAR - SIMD Within A Register.

Since in our code we have to perform two weighted averages for each stripe of 5 pixels that we want to convert into 4, we can actually perform the two weighted averages at the same time. Obviously, there's a little overhead: we need to move the operands together before performing the operations and separate them afterwards. This requires 4 ARM assembler instructions. So we can perform two weighted averages in just 11 instructions.

The resulting code, after all these changes, turned out to be 79% faster. Now it processes 179 pixels in the same time that it took the previous code to process only 100 pixels.

In the next post I'll tell you how to obtain the same graphical output without virtually using any CPU resources.

Wednesday, August 08, 2012

Quick color averaging

During my vacation back in May 2011 I was stuck for 4 days between an unexpected incredible snowstorm on one side and the eruption of the Grímsvötn volcano on the other side, of course in Iceland. Well... I had a lot of time and very little things to do, so I spent some time trying to figure out the fastest method of calculating a weighted average between two RGB colors, a and b, so that the result would be (3a + b)/4.

What for? Because I had already started being interested in DSx86, a PC emulator for Nintendo DS. If you've never tried this amazing homebrew, I suggest that you do so as soon as possible. DSx86 author 'Pate', in his May 15 blog post was seeking for suggestions on how to perform a faster weighted average between two colors. His then method was to run a normal average twice, to achieve a weighted one: tmp = (a+b)/2 then avg = (a+tmp)/2.

So what's the reason why I'm writing this post now? Well... time passes and memories start to fade, so I wanted to write down my thoughts and share them before they are gone completely. You know, I'm growing older ;)

If we can define a + b = (a ^ b) + ((a & b) << 1) as it appears in the following truth table:

a  b   a+b
0  0   00
0  1   01
1  0   01
1  1   10

then the average formula will be

(a + b)/2 = ((a ^ b)>>1) + (a & b).

Since our colors are halfwords (16 bits) where 5 bits are reserved for each RGB component, such as xBBBBBGGGGGRRRRR, the right shifting would make the least significant bit of the blue and green components fall into the bits reserved for the green and red components respectively, we should actually mask each lsb of (a ^ b) result before shifting. Thus we will obtain

(a + b)/2 = (((a ^ b) & ~0x421) >>1) + (a & b)

which is an accurate average of two RGB colors obtained without having to calculate each component average separately (please, read the very interesting Quick colour averaging article on CompuPhase web site).

Similarly, we can define 3a + b as

a  b   3a+b
0  0   000
0  1   001
1  0   011
1  1   100

which can be expressed as (a ^ b) + ((a & ~b)<<1) + ((a & b)<<2). To obtain the weighted average, we still have to divide it by 4, which results in

(3a + b)/4 = (a ^ b)>>2 + ((a & ~b)>>1) + (a & b)

Again, the shifts here would make the least significant bits fall into the other components, so we have to clear the least significant bit for the 1-bit right shift and clear two least significant bits for the 2-bit right shift. Finally, we get

(3a + b)/4 = (((a ^ b) & ~0xC63) >>2) + (((a & ~b) & ~0x421) >>1) + (a & b)

The normal average was implemented using 4 ARM assembler instructions, and had to be done twice. On the contrary, the weighted average calculated as per my expression can be coded using 7 ARM instructions only, which allows to save 1 cycle per weighted average. Not bad if you consider that all 200 320-pixel-wide lines of the VGA screen have to be converted into 256-pixel-wide lines to fit the DS screen up to 60 times per second. To do this, you need to perform two weighted averages every 5 pixels.

There are some other nice tricks I used to speed up things even more... but I'll detail those in the next post because I'd prefer to contain this one to quick color averaging subject only.

Saturday, June 30, 2012

MP3 streaming on ARM7

Even if it's not something I would personally use in my homebrew... it's possible to program the DS 'secondary' processor (called 'ARM7' for short) to stream (I mean play) MP3s directly from the storage. The good thing is that your 'main' processor (the ARM9) will be free to work on more interesting tasks such as elaborating your game logic, while the music goes in the background. The bad thing is that your ARM7 33MHz isn't exactly that super-powered beast, and has no hardware decoding capabilities... so the whole software decoding work of an MP3 can be quite demanding for its somehow limited power.

It all started with this post on gbadev forum. Extrapolating parts of elhobbs' great work on cquake, forum user 'hacker013' created an example (that can be easily turned into a library, if this matters to you...) that makes it very easy to stream a (stereo only, be aware) MP3 audio file on a DS. The provided code, unfortunately, was playing only the left channel of the MP3 file it was streaming, so I made some changes that actually made the code play both left and right channels using two separate DS hardware channels. I've put the whole modified code here, in case you need it.

To say the truth, I hardly see any reason why you should ever use such a thing in your homebrew. MP3 files are quite big (say 1 MB for each minute of a 44100 samples/sec, stereo, 128Kbps encoded file) and the ARM7 processor will surely choke if you try to make it decode a 44.1KHz stereo (CD sample rate) MP3 encoded at more than 128Kbps. Even a common 44.1KHz 128Kbps stereo MP3 could be enough to choke the CPU, sooner or later... at least that's what I found during my tests. Things gets better with stereo MP3s having 32KHz sample rate that is also close to the DS audio output, which is 32768 samples per second. In my tests I could play some 32KHz stereo MP3s encoded up to 256Kbps with no problems.

Anyway, in my opinion, 'tracked music' is the way to go on a DS. MaxMod library, or libXM7, which I wrote myself some time ago, can produce very good quality music even with very little ARM7 CPU load. MaxMod comes with the devkitARM & libnds, and it supports MOD/S3M/XM/IT formats with hardware and software audio mixing. On the other hand, libXM7 only supports MOD and XM formats with hardware mixing only, but the MOD and XM compatibility is very accurate, and it supports the whole range of effects that MOD/XM tunes can use.

Sunday, May 06, 2012

"Peaches the Wale"

A few days ago I received an e-mail from a tiny whale. Well, I have to admit it isn't something I've seen really often. They call her "Peaches the Wale" [sic], and she's a musician who recently composed some MOD tunes on her Commodore Amiga. You can see her in this video.


Now she's been invited to have some concerts around, and she realized it wouldn't be very feasible to drag the Amiga with her... so she found on the Internet the XM/MOD player I wrote for the Nintendo DS using libXM7 library, but she needed some additional features:
  • the module should load and be ready for replay, instead of starting immediately
  • it should be possible to stop and restart the module from the current pattern
  • it should be possible to skip to next or to previous pattern while stopped or even while playing
  • the program should visually show the number of the current pattern and the total number of patterns in the module
  • the music output should be mono, for her DJ mixer.
I really couldn't refuse giving her my help, so I compiled my player again in a different version with the requested features. I called it XM7dj and of course it's available for everybody who might need it. Download it here. I hope you enjoy!