A DS Homebrewer's Diary

Wednesday, June 07, 2017

Some more news from 2016-2017

Another year has passed and here's my timely post. I bet some of you thought I would forget, right? ;)

Well, during the past months I've mostly been busy nurturing my SEGA Master System / Game Gear development kit and libraries, devkitSMS/SMSlib, and the time spent on that is bearing fruit. For example, the winning entry of the 2017 SMS Power! Coding Competition is a game developed with it by fellow forum user eruiz00. It's called Astro Force, it's a vertical shoot-em-up, and it's a great game! It has many levels, lots of enemies, bosses, music and SFX. The author is also distributing the game's C source code, so if you feel inclined to see how he did it, you can check for yourself.

Here are some screenshots, and the ROM (with the sources) can be downloaded from this page.

Astro Force - SEGA Master System homebrew game by eruiz00

The game also uses my SN76489 audio library, PSGlib, as does another great homebrew that won second place in the very same competition: a team of seriously talented guys (Psidum, Calindro, RushJet1 and Sim1) made a 25 fps FMV (full motion video) of the famous B&W video "Bad Apple" at full SMS resolution, 256x192 pixels. You can check out the video and audio at this YouTube recording. It's worth every second, it's just amazing what they achieved.

Besides the 8-bit world, there is some other news. The most important one for me is that we finally released Waimanu Grinding Blocks Adventure (for the Game Boy Advance) on a physical cartridge (!!!). This became possible thanks to our publisher, Piko Interactive. You can read all the details (and find out how to get a copy of it!) on Disjointed Studio's blog post. We're so excited and we can't wait to hold a copy of that in our hands!

Waimanu GBA boxes stacked up - picture by @thebitstation

Speaking about Disjointed Studio, we're finally approaching the beta phase of our new game Weka Invaders... well, unfortunately I can't tell yet how soon it will be released.

Finally, it's true I'm posting on this blog very infrequently. Nowadays I'm (slightly) more active on Twitter, so you can follow me @i_am_sverx for news between my blog posts.

See you!

Wednesday, June 08, 2016

Past Year's News

Ahoy!
Well, what should I say? I haven't posted anything in a while - so let's recap what has happened since May 2015, which is a whopping 13 months ago. Sit down and relax... well, I promise it won't take too long anyway.

First and foremost, last November we Disjointed Studio guys released our first SEGA Master System game. It's called Waimanu: Scary Monsters Saga (as our habit goes, the subtitle initials match with those of the console the game will be running on). You can download it here for free, and play it on your console using a flashcart/adapter such as the Master EverDrive, or on an emulator (MEKA and Emulicious are the ones I suggest).
Here are two screenshots for your viewing pleasure:

WaimanuSMS - title and menu screen

WaimanuSMS - in game (area 1)

This is the first (and so far the only) game I've ever written completely in assembly... Zilog Z80 assembly specifically. I had fun doing it... well, sort of.

Following the release of WaimanuSMS, the well-known British magazine Retro*GAMER featured an interview with yet another Homebrew Hero, as they dub it, and he was none other than... yours truly :-)
You can read my ramblings in the 151st issue, if you can still find it. And I have to admit I'd been waiting for this to happen for quite a long time... I had purposely taken the picture that appears in the article when I was in Portland, OR (known for its breweries too)... and that was in July 2013. Oh, well... I can't really complain.

All this apart, I have also spent much of my time during the last 13 months enhancing my SEGA Master System development kit and library, devkitSMS/SMSlib, which has recently reached what I would call a mature stage.
Speaking of this, another thing that made me very proud happened a while ago, at the end of October 2015: the homebrew rockstars known as The Mojon Twins released their first SEGA Master System game, and they made it using my devkit. The game is called Moggy Master and it's a simple 1-or-2-player game they did to test the kit and the library, according to their blog post. They hope to code and release more games for the SMS in the future, and I hope so too. Also, we worked together to create a library for the SEGA SG-1000, the Master System forerunner, and we called that SGlib. It's now a part of the devkit.

Finally, last March, during the SMSPower! 2016 Coding Competition, two more games based on my devkit and libraries were finally released along with two projects I've been working on myself. These games are haroldoop's DataStorm, a port of an Atari 2600 shoot 'em up called Turmoil, and Pedro76 and Nivarel's Master of the Labyrinth, a dungeon crawler. The first of my own projects is MARKanoIIId, an Arkanoid clone, which is still simply a hypnotic interactive demo and not yet a full game. It features two great tunes by Tomy, a Finnish musician and PSG essayer, and sleek graphics by Kagesan, a German homebrewer who also won this year's competition with his marvelous Bara Burū.

The other project is Disjointed Studio's new effort (and a very early beta back then), which goes under the working title of Weka Invaders. Waimanu is once again defending the Earth from the next wave of alien invasion, but this time he happens to be carrying a huge weapon on his shoulders. We're actively working on this project in these very days... unfortunately, I still can't reveal any planned release date.

Well, I think that's it. I should promise it won't take me another year for the next post but... well, you already know how lazy I am, right?

Wednesday, May 13, 2015

Putting some more fuel into SMS homebrew

Here I am again... well, almost another year has passed since my last words here. "Doesn't time fly when you're enjoying yourself?"[*]

I spent most of my time these last months writing code and building tools aimed at homebrew for the SEGA Master System (SMS for short) and SEGA Game Gear.
Writing homebrew games/programs on SMS still means writing ASM code almost from scratch every time.
Yes, of course you can reuse some of the code you've already written, but there still isn't a great deal of shared ready-to-use code, snippets apart. Even though it features background music and sound effects support, the only previously existing audio library (the very good Mod2PSG2, which plays music written on the tracker with the same name) unfortunately never provided a way to export sound effects from its tracker, thus forcing developers to build their sound effects with hex editors.

So I wrote PSGlib, and the tools to convert VGM files into the tunes and sound effects that homebrewers want to use in their programs. VGM files can be produced by some well-known trackers such as DefleMask and the very same Mod2PSG2 tracker too. After you convert and compress them into the PSG format, all you have to do is start them (music and SFX) at the right time. Now the majority of SMS homebrew arising these days uses it; the complete list is here.

This happened mostly in 2014, still. And even if writing Z80 ASM code can be very entertaining (seriously!), I decided to try to build a development kit to write SMS homebrew in C.
Of course, first I needed a compiler. SDCC turned out to be the choice. It's a free open source optimizing C compiler that also targets the Zilog Z80 processor, among many others. So the core component I needed was already available.
Since the processor is only one of many components of the SMS, I needed to write code and tools to make it possible to use the SMS as the target, and to write a library on top of that development kit to enable programmers to use the underlying hardware straight from their C programs.

So last January devkitSMS (the kit) and SMSlib (the library) were born.
The library includes functions to handle the display hardware features, such as hardware scrolling for example, background and sprites, and supports software sprite clipping based on a user defined window. It has functions to handle colors and palettes, tiles and tilemaps, both normal keypad and Genesis/MegaDrive 3/6 buttons pad, the pause key and it also has ROM mappers support.
PSGlib then also incarnated as an additional C library, so that it can be used with the devkit too.

And that's pretty much it.

Sunday, August 31, 2014

One year with no (posted) news

Hello!

Yes, I'm alive and kicking :)

Sorry if I appeared to be MIA... I just had nothing interesting to post; for about a year now I've had virtually nothing going on involving my lovely DS or the little GBA.

Anyway, a few months ago I recovered my younger brother's SEGA Master System II and also his SEGA Game Gear, both of which were forgotten long ago... fortunately, both still work almost perfectly.

In case you've never heard of them, they're basically the same hardware, even if the latter is a portable system featuring a color LCD display, more colors and stereo audio. What we're talking about here are 3.5 MHz Zilog Z80 powered 8-bit systems, not exactly something you can compare to the horsepower of the GBA's 16.78 MHz and the DS's 67 MHz 32-bit ARM processors.

Still it's already proving to be very intriguing to write code for these consoles.

First, you have to deal with the processor. The Z80 is an 8-bit CPU, as I said, so it can actually handle only rather small numbers. It doesn't even have an instruction to perform multiplication, not to mention division. Shifting a register takes twice the time of an addition, and you can shift bits left or right by one position only. The fastest operations, such as addition, require 4 clock cycles (here's the complete instruction set). Many basic operations are available on selected registers only.
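To give an idea of what the missing multiply instruction means in practice, here's a minimal sketch in C of the classic shift-and-add approach, which uses only additions and shifts by one position. The function name is made up for illustration, and a real Z80 routine would of course be written in assembly.

```c
#include <stdint.h>

/* Multiply two 8-bit values using only additions and shifts by one
   position, i.e. the kind of operations the Z80 does provide.
   (Hypothetical helper, for illustration only.) */
uint16_t mul8_shift_add(uint8_t a, uint8_t b)
{
    uint16_t result = 0;
    uint16_t addend = a;          /* 'a', shifted left once per round */

    while (b != 0) {
        if (b & 1)                /* is the lowest bit of 'b' set? */
            result += addend;
        addend <<= 1;             /* shift left by one position */
        b >>= 1;                  /* shift right by one position */
    }
    return result;
}
```

The loop runs at most 8 times, once per bit of the second operand, so even a simple multiplication costs a few dozen instructions on this CPU.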

The memory. The system features 8 KB of RAM, built into the console. However, the code runs from ROM: there is a chip inside each and every cartridge, so there are also no loading times. ROM size can be up to 1 MB, but everything bigger than 48 KB requires bank switching to access the upper part.

There's virtually no other option but to code in assembler. WLA DX is currently the assembler of choice.

Then there's the hardware responsible for the graphics, the Video Display Processor (VDP), which has very limited capabilities, again compared to the DS/GBA. Basically here you've got a single background made up of a grid of 32x24 tiles, each with 16 colors taken either from the first or the second palette, the latter being the one also used by the sprites. The hardware also isn't capable of displaying more than 64 different colors, and I don't mean at the same moment. Finally, up to 64 16-color sprites (each 8x8 or 8x16 pixels) are available, but only up to 8 will be drawn on the same scanline. Sprites unfortunately cannot be flipped either horizontally or vertically. The VDP still has the quite powerful feature of supporting hardware scrolling of the background in both X and Y directions.

The background tiles, the map, the sprite graphics and the Sprite Attribute Table (SAT) all share the same VRAM space, which is only 16 KB total. The most troublesome issue here is that you can write to VRAM at specific moments only (when video is disabled or during VBlank). If you write to VRAM at the wrong moment, your data will simply be discarded, so it's very easy to end up with corrupted graphics.
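The 8-sprites-per-scanline limit is the kind of thing you may want to check for on the PC side while laying out a level. Here's a rough sketch in C, under simplifying assumptions: sprite positions are given as plain top rows, and hardware details such as how the SAT actually encodes the Y coordinate are ignored. All names are hypothetical.

```c
#include <stdint.h>

#define MAX_SPRITES       64   /* sprites available, as described above */
#define MAX_PER_SCANLINE   8   /* sprites drawn on a single scanline */

/* Count how many sprites cover scanline 'line'.
   y[i] is the top row of sprite i; 'height' is 8 or 16. */
int sprites_on_line(const uint8_t *y, int count, int height, int line)
{
    int n = 0;
    for (int i = 0; i < count && i < MAX_SPRITES; i++) {
        if (line >= y[i] && line < y[i] + height)
            n++;
    }
    return n;
}

/* Nonzero if some sprites would not be drawn on this scanline. */
int line_overflows(const uint8_t *y, int count, int height, int line)
{
    return sprites_on_line(y, count, height, line) > MAX_PER_SCANLINE;
}
```

Running a check like this over all 192 lines tells you where sprites would start to disappear.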

Finally, Japanese version of the SEGA Master System apart, the system generates music and sound effects from its PSG chip (the Texas Instruments SN76489 Programmable Sound Generator), which has 4 mono audio channels. Each of the first three channels can output a true square wave (50% duty cycle) of a given frequency, while the fourth channel can output noise only. Volume of each channel can be set to one of 16 attenuation levels on a logarithmic scale.
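As a small worked example of "a square wave of a given frequency": each tone channel is programmed with a 10-bit period value, and the output frequency is the chip clock divided by 32 times that value. The sketch below assumes the NTSC clock of 3579545 Hz; the function name is made up for illustration.

```c
#include <stdint.h>

#define PSG_CLOCK_HZ 3579545UL   /* assumption: NTSC console clock */

/* Closest 10-bit tone period for a target frequency, using
   frequency = clock / (32 * period). */
uint16_t psg_tone_period(uint32_t freq_hz)
{
    if (freq_hz == 0)
        return 1023;                        /* lowest pitch available */

    uint32_t divisor = 32UL * freq_hz;
    uint32_t period = (PSG_CLOCK_HZ + divisor / 2) / divisor;  /* rounded */

    if (period < 1)    period = 1;          /* keep within 10 bits */
    if (period > 1023) period = 1023;
    return (uint16_t)period;
}
```

For instance, the A above middle C (440 Hz) gives a period of 254, while anything below roughly 110 Hz can't be reached and gets clamped to 1023.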

There are some ready-to-use tools and libraries to compose and replay modules using the PSG chip, but none of these libraries is currently supporting both music and sound effects. So I decided to try implementing a solution to be able to have sound effects over background music, even with the very limited number of available channels.

The result is PSGlib. It plays VGM tunes written for the SN76489 (the tunes need to be converted to a specific format), and it supports sound effects on the third square wave channel and/or on the noise channel, avoiding any collision with the music that would probably be trying to use the same channels. More details about the library may follow in a separate post.

And... that's pretty much everything so far.

Sunday, August 25, 2013

sverx in GBA homebrew land...

(I wrote this post quite a long time ago, during the port of Waimanu Daring Slides to GBA. It has been sitting here for a while, sorry for the delay. You can find the result of my pretty hard work here, on Disjointed Studio blog)

When Nintendo were planning the DS, they decided that the new console would be compatible with the previous one, the Game Boy Advance (GBA for short). So they put in the same processor - an ARM7TDMI, which is the only processor in the GBA and the 'secondary' processor in the DS when running DS native code. They also put in an evolution of the same 2D core, giving it many new powerful features. So the GBA, from the point of view of a DS homebrewer, is not completely different, but there are lots of differences that you should keep in mind if you decide to venture into GBA homebrew land.
So here's an overview of the GBA 2D core features comparing them to the DS... of course, without mentioning things you may easily notice such as that the GBA has got one screen only with a resolution of 240x160 pixels, whereas DS has two 256x192 pixels screens and 2 separate 2D cores. Of course, the GBA has no 3D core at all.
Please note that the list isn't comprehensive, and I'm describing differences pertaining to the graphical 2D core only.

- The DS supports up to 4 backgrounds at the same time. Two of them can only be 'normal' backgrounds (no rotation and scaling is supported on these backgrounds), but you can choose how you want the other two backgrounds to be. So your options are to have all 4 normal backgrounds or you can have 3 normal backgrounds and 1 that supports rotation and scaling (known as 'rotscale' or 'affine' background) or even have 2 normal and 2 rotscale backgrounds. With GBA, you can have 4 normal backgrounds too, but if you need rotscale backgrounds, you have to give up two normal backgrounds for each rotscale background you want to use. So you'll eventually have 2 normal backgrounds and just one rotscale background or 2 rotscale backgrounds with no other backgrounds at all.
- The DS also features 'extended' rotscale backgrounds, which are rotscale backgrounds supporting up to 1024 different tiles, and each tile can optionally be flipped horizontally/vertically and/or use one of 16 separate 16-color or 256-color palettes. On the GBA there's no such 'extended' rotscale thing, and 'regular' rotscale backgrounds are 256-color backgrounds that support up to 256 different tiles only, with no flipping and of course no palette selection.
- Extended rotscale backgrounds on DS can also become bitmap backgrounds, making it possible to have bitmaps over (or under!) text/rotscale backgrounds, or even a bitmap over another. On the other hand, the GBA has very few bitmap oriented features. You can have only a single bitmap background, even if you have 3 choices of what to show in it. You can show a 240x160 15bpp (32 thousand colors) bitmap with a single framebuffer since there's not enough VRAM to have two of them - such a bitmap requires 75 KB. The second choice is a 240x160 256 colors bitmap with double framebuffer, and the last choice is quite a bizarre 160x128 15bpp double framebuffer bitmap background.
- The DS main 2D core (but not the sub core) has a 'large bitmap mode', featuring a single 1024x512 bitmap background. Of course, on the GBA that doesn't exist.
- Palettes: the DS has 16 256-color additional palettes (known as 'extended' palettes) for backgrounds plus another 16 for sprites, besides the regular 256-color palette for the backgrounds and the regular one for the sprites. Both of these regular palettes can also be used as if they were 16 separate 16-color palettes, for 16-color tiles and sprites. The GBA features the very same regular palettes, but there are no 'extended' palettes.
- The GBA has only a total of 96 KB of video RAM, of which 64 KB are dedicated to background maps and tiles, and 32 KB dedicated to sprites. This means, for example, that only 512 different 256-color tiles for sprites can be stored here, even if the GBA 2D core could use up to 1024 different tiles. Also, when choosing a bitmap mode, only 16 KB are left for sprites as the first 80 KB of VRAM are bound to the background bitmap framebuffer(s).
- On the GBA the sprites will always overlap with each other according to their order in the OAM (Object Attribute Memory). This means that sprite number 0 will be always 'on top' of sprite number 1, even if the latter is bearing higher priority than the former. On the contrary, on the DS the priority also works sprite-on-sprite, not simply sprite-on-background.
- Bitmap Objects (also known as 15bpp sprites) don't exist on GBA.
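The VRAM figures quoted in the list above are easy to double-check with a bit of arithmetic. Here's a small sketch in C that does just that; the helper names are made up for illustration.

```c
#include <stdint.h>

/* Bytes taken by a bitmap background with the given size, pixel
   depth (in bytes) and number of framebuffers. */
uint32_t bitmap_bytes(uint32_t width, uint32_t height,
                      uint32_t bytes_per_pixel, uint32_t buffers)
{
    return width * height * bytes_per_pixel * buffers;
}

/* How many 8x8 tiles of the given depth fit in 'bytes' of VRAM. */
uint32_t tiles_that_fit(uint32_t bytes, uint32_t bits_per_pixel)
{
    uint32_t bytes_per_tile = 8 * 8 * bits_per_pixel / 8;
    return bytes / bytes_per_tile;
}
```

So bitmap_bytes(240, 160, 2, 1) is 76800 bytes, that is 75 KB, and a second copy would not fit in the 80 KB reserved for bitmaps. The 256-color double-buffered mode also takes 76800 bytes, the 160x128 15bpp double-buffered mode takes exactly 80 KB, and tiles_that_fit(32 * 1024, 8) gives the 512 sprite tiles mentioned above.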

All that said, please don't let this scare you. It's really a lot of fun to code on that little neat machine, and it will surely give you lots of satisfaction.

Wednesday, March 20, 2013

One hundred twenty-seven shades of grey

Nothing to do with the bestseller, rest assured.
As you may already know, the DS 2D core supports paletted and direct colors, both of which are expressed using 5 bits per primary color (red, green, blue). This is 15bpp, also known as HiColor mode.
Having 'only' five bits per primary color means that there are just 32 different shades of gray that can be defined including the darkest - black, and the brightest - white.

Back in 2008, while reading GBATek specifications, I had found out that the DS screens were 18bpp LCD panels, so I started a topic on gbadev forum suggesting that there might be a way to exploit this. It turned out that using the hardware alpha blending capabilities of the 2D core you can indeed force the hardware to show real 18bpp images. Fellow forum member Cydrak has even posted a very good demo then, and here's his original forum post with the link to his 18bpp demo.
So, having now 6 bits per primary color, it's possible to display 64 shades of grey. The improvement is significant, even if banding is still noticeable.

Pushing this thing further has been tickling me since then, so very recently I decided to give it a try. Of course, the hardware limit of 18bpp can't be overcome, and the display can't show more colors than what it's capable of. However, having a 2D core that can generate 60 frames per second, we can exploit the human eye's persistence of vision. The idea here is that if we display two slightly different images alternately at a sufficiently high rate (60 times per second is surely high enough), our retinas will just perceive a sort of average of the two images. And that's what we get: 127 shades of grey, even though the additional 63 shades aren't really there: they're simply the result of our perception.
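Where does 127 come from? Here's a tiny sketch in C that counts the distinct perceived levels, assuming each pixel alternates between one grey level and either the same level or the next one up. The function name is hypothetical, and the 'perceived' value is modelled simply as the sum of the two frames' levels.

```c
#include <stdint.h>

/* Count the distinct perceived grey levels when two frames are
   alternated, the second frame showing either the same level as
   the first or the next brighter one. Supports 1 to 8 bits. */
int count_perceived_shades(int bits_per_channel)
{
    enum { MAX_LEVELS = 256 };
    unsigned char seen[2 * MAX_LEVELS] = {0};
    int count = 0;

    if (bits_per_channel < 1 || bits_per_channel > 8)
        return 0;

    int levels = 1 << bits_per_channel;
    for (int a = 0; a < levels; a++) {
        for (int b = a; b <= a + 1 && b < levels; b++) {
            int sum = a + b;          /* twice the perceived level */
            if (!seen[sum]) {
                seen[sum] = 1;
                count++;
            }
        }
    }
    return count;
}
```

With 6 bits per channel this gives 127: the 64 real shades plus the 63 in-between ones.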

Note: the second and third image in this post are fake: there are no emulators that can show 18bpp and, of course, there's no way of making any emulator show the image that your eyes perceive. So, I suggest that you test the demo yourself on your DS. You can download it here.

Tuesday, September 11, 2012

...more tricks, more speed!

Very recently, I've further optimized the VGA smooth scaling that DSx86 uses, and Pate already announced on his blog the next release will feature the faster code. If you haven't read about the first wave of optimizations, please read about it in my August posts.

This time I exploited two more tricks, and the result is that the speed increased from 79% faster to 114% faster compared to the original code. Of course, this isn't bad at all.

The former optimization comes from the observation that there's no way to use a base register plus a shifted register offset addressing scheme when accessing halfwords, whereas this is very common when accessing words instead. This means we need a separate shift instruction to calculate the offset if we want to access a halfword in a lookup table while, on the contrary, we can access a word in a lookup table using a single instruction. Thus, if we define a 256-words temporary space in the stack (it takes 1 KB) for the lookup table and copy there each palette RGB values as whole words, we will later save one instruction per input pixel when we access them. So this fragment of code:

ldrb r3, [r1], #1 @ read first pixel value
ldrb r4, [r1], #1 @ read second pixel value
ldrb r5, [r1], #1 @ read third pixel value
lsl r3, #1 @ calculate offset (1st pixel)
ldrh r3, [r11, r3] @ read first pixel RGB color
lsl r4, #1 @ calculate offset (2nd pixel)
ldrh r4, [r11, r4] @ read second pixel RGB color
lsl r5, #1 @ calculate offset (3rd pixel)
ldrh r5, [r11, r5] @ read third pixel RGB color

turns into this shorter one:

ldrb r3, [r1], #1 @ read first pixel value
ldrb r4, [r1], #1 @ read second pixel value
ldrb r5, [r1], #1 @ read third pixel value
ldr r3, [r11, r3, lsl #2] @ read first pixel RGB color
ldr r4, [r11, r4, lsl #2] @ read second pixel RGB color
ldr r5, [r11, r5, lsl #2] @ read third pixel RGB color

Since the code performs 5 lookups per loop in total, this optimization saved 5 instructions, shortening the whole loop to only 27 instructions and making the routine 98% faster than the original code.
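In C terms, the data layout behind this trick looks roughly like the sketch below: the 256 halfword palette entries get copied into a temporary table of 256 words on the stack, and the lookups then read whole words. This is only an illustration with made-up names; which addressing mode a C compiler picks is up to the compiler, whereas the real routine is hand-written ARM assembly.

```c
#include <stdint.h>
#include <stddef.h>

/* Convert 'n' 8-bit pixels into 15bpp colors through a temporary
   table of words (256 * 4 bytes = 1 KB) instead of halfwords. */
void convert_line_word_lut(const uint8_t *src, uint16_t *dst, size_t n,
                           const uint16_t palette[256])
{
    uint32_t lut[256];                  /* 1 KB of stack space */

    for (size_t i = 0; i < 256; i++)
        lut[i] = palette[i];            /* widen each entry to a word */

    for (size_t i = 0; i < n; i++)
        dst[i] = (uint16_t)lut[src[i]]; /* one word read per pixel */
}
```

The cost is twice the table size, which is a fair price for one instruction less per input pixel.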

The latter optimization uses the very well-known trick of loop unrolling. Since the results are so good, I thought it was worth spending some code space. The loop has been unrolled 8 times, thus it now processes 40 input pixels each iteration before encountering the costly (3 cycles) branch instruction. Even this simple optimization proved to be very effective in terms of performance improvement.
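If you've never seen loop unrolling, here's what it looks like in C on a much simpler loop than the real one: the body (a single table lookup here, not the whole 5-pixel conversion) is repeated 8 times per iteration, and a small tail loop handles the leftovers. Names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Table lookup for 'n' pixels with the loop body unrolled 8 times,
   so the loop branch is taken once every 8 pixels. */
void lookup_unrolled8(const uint8_t *src, uint16_t *dst, size_t n,
                      const uint32_t lut[256])
{
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        dst[i + 0] = (uint16_t)lut[src[i + 0]];
        dst[i + 1] = (uint16_t)lut[src[i + 1]];
        dst[i + 2] = (uint16_t)lut[src[i + 2]];
        dst[i + 3] = (uint16_t)lut[src[i + 3]];
        dst[i + 4] = (uint16_t)lut[src[i + 4]];
        dst[i + 5] = (uint16_t)lut[src[i + 5]];
        dst[i + 6] = (uint16_t)lut[src[i + 6]];
        dst[i + 7] = (uint16_t)lut[src[i + 7]];
    }

    for (; i < n; i++)                  /* leftovers, if any */
        dst[i] = (uint16_t)lut[src[i]];
}
```

In the actual routine no tail loop is needed, since a 320-pixel line is exactly 8 iterations of 40 pixels.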

Monday, September 03, 2012

Hardware generated smooth scaling

In my August posts I focused on the improvements done with DSx86's ARM ASM smooth scaling routine where I did my best to make it as fast as possible, knowing that every CPU cycle saved there would turn useful in the emulation main loop. Then it took me a few more months to realize that actually the same result can be achieved by properly programming the NDS 2D graphical core. So here's how I did it.

The smooth scaling routine takes groups of five 256-color pixels on the same line and turns them into four 32K-color pixels on the DS screen by performing many palette lookups and regular/weighted averages, as we've seen already. The DS 2D core, on the other hand, can perform alpha blending between two backgrounds, without requiring any effort from the CPU. This alpha blending feature can achieve nothing less than an average between each pixel of the first background and the corresponding pixel on the second background, returning a 32K-color image.(1) Additionally, the 2D core can also perform background scaling. We need to exploit both these features.

Let's define the 5 original pixels as p0-p4, and the resulting 4 output pixels as r0-r3. What we need to get is:

r0 as the sum of 3/4 p0 and 1/4 p1
r1 as the sum of 1/2 p1 and 1/2 p2
r2 as the sum of 1/4 p2 and 3/4 p3
r3 as p4

If we could blend 4 backgrounds together we could simply copy specific pixels in the 4 backgrounds to obtain this (please check that each output pixel is exactly as expected):

BG0: p0 p1 p2 p4
BG1: p0 p1 p3 p4
BG2: p0 p2 p3 p4
BG3: p1 p2 p3 p4
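If you'd rather let a program do the checking, here's a small sketch in C that averages the four hypothetical backgrounds column by column, working on a single color component. The table simply mirrors the four rows above; the names are made up.

```c
#include <stdint.h>

/* Average four layers built from 5 source values p[0..4], following
   the BG0..BG3 rows listed above, into 4 output values r[0..3]. */
void blend_four_layers(const uint16_t p[5], uint16_t r[4])
{
    static const uint8_t pick[4][4] = {
        {0, 1, 2, 4},   /* BG0: p0 p1 p2 p4 */
        {0, 1, 3, 4},   /* BG1: p0 p1 p3 p4 */
        {0, 2, 3, 4},   /* BG2: p0 p2 p3 p4 */
        {1, 2, 3, 4},   /* BG3: p1 p2 p3 p4 */
    };

    for (int x = 0; x < 4; x++) {
        unsigned sum = 0;
        for (int bg = 0; bg < 4; bg++)
            sum += p[pick[bg][x]];
        r[x] = (uint16_t)(sum / 4);
    }
}
```

Feeding it the values 4, 8, 16, 32, 64 gives 5, 12, 28, 64, which are exactly 3/4 p0 + 1/4 p1, 1/2 p1 + 1/2 p2, 1/4 p2 + 3/4 p3 and p4.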

Since the 2D core can do background scaling, we don't even need to copy specific pixels. Each background can be generated the way we need it starting from the unmodified original image stored in Video RAM using the scaling features. Thus, we program the 2D core to skip one source pixel each group of five, and choose which pixel has to be skipped.
For example, to generate each of the backgrounds (the code does that for BG2), we have to program the background affine matrix to scale a 320-pixel wide image in a 256-pixel wide background:

REG_BG2PA = (320 << 8) / 256;
REG_BG2PB = 0;
REG_BG2PC = 0;
REG_BG2PD = (1 << 8);

Then we should tell the 2D core to skip pixel p1. This is accomplished by using the reference point X coordinate register:

REG_BG2X = (3 << 8) / 4;

You can think of this register as if it was a sort of a counter of the fractional part. We initialize it to a precise value (3/4, in this case) and after each output pixel has been generated, 1/4 gets added to this counter. (It's because 320 divided by 256 gives 1 plus a fractional part of 1/4). When the counter reaches the unit, the scaling process skips one pixel of the original image, and in this case this will happen after processing one pixel. We can also tell the 2D core to use the same 320x200 bitmap for all the backgrounds, then program different reference point X coordinate values for each background.
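To see the counter at work, here's a sketch in C that simulates the horizontal sampling on the PC side, using 8 fractional bits as in the register values above. It's a simplified model with made-up names, not code that touches the DS registers.

```c
#include <stdint.h>

/* Index of the source pixel sampled for output pixel 'out_x', given
   the reference point X ('start_x') and the horizontal step ('pa'),
   both as fixed-point values with 8 fractional bits. */
int sampled_source_pixel(int32_t start_x, int32_t pa, int out_x)
{
    return (int)((start_x + pa * out_x) >> 8);
}
```

With pa = (320 << 8) / 256 and start_x = (3 << 8) / 4, the first four output pixels sample source pixels 0, 2, 3 and 4, so p1 is the one skipped. By my own calculation (this isn't taken from the DSx86 sources), starting values of 1/4, 1/2 and 1 produce the other three patterns, skipping p3, p2 and p0 respectively.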

Unfortunately, what we can't ask the 2D core is to blend all 4 backgrounds at the same time. However, we can make it blend 2 of these backgrounds each frame, and blend the other 2 backgrounds the next frame, at 60 frames per second.(2) The LCD screen and our retinas will average the 2 generated images, providing in fact the expected result.

DSx86 actually uses a slightly different implementation. It performs vertical scaling at the same time (200 lines down to 192 in VGA "Mode 13h" and 240 lines down to 192 in VGA "Mode X", using different affine matrices) in the so-called 'Jitter' mode.

(1) The DS screen output supports 18bpp color, and alpha blending is probably performed with even more precision.
(2) Since only BG2 and BG3 support bitmap backgrounds, the code will blend these two, redefining them as needed on each frame.

Saturday, August 25, 2012

a couple speed improving tricks

(This post is a follow-up on my Quick color averaging post. Please read that post first)

Developing a weighted average that uses only 7 ARM assembler instructions instead of using 8 instructions to get the same result was just the tip of the iceberg. To achieve the highest speed possible when resizing a 320 pixel wide image into a 256 pixel wide one, which effectively means converting a PC VGA "Mode 13h" (256 colors) image into a 256 pixel wide 15bpp (32K colors) image on the DS, we should try to speed up every step of the whole conversion process.

For instance, retrieving each pixel's RGB values means reading a byte from the source image that is a pixel in the VGA screen, and accessing the corresponding offset within the palette by performing a lookup table read. So the ARM assembler code for reading the first two RGB values might look like this:

ldrb r3, [r1], #1 @ read first pixel value
lsl r3, #1 @ calculate offset
ldrh r3, [r11, r3] @ read first pixel RGB color
ldrb r4, [r1], #1 @ read second pixel value
lsl r4, #1 @ calculate offset
ldrh r4, [r11, r4] @ read second pixel RGB color

The code above is correct, but it doesn't take into account register interlocking. The ARM946E processor has a 5-stage pipeline, and its load instructions require that the Memory stage be completed before you can use the target register. This means that there would be a so-called single-cycle load-use interlock if you load a word from memory to a register and you use that register right in the next instruction. In other words, the processor needs to insert a 1-cycle 'pause' before the Execute stage of each of the lsl instructions. Unfortunately, in our code we're reading a single byte from memory instead of a whole word, and things get even worse. Loading a byte (or a halfword) from memory into a register additionally requires the Write stage, thus triggering a two-cycle load-use interlock if the following instruction needs to use the register just loaded, as happens in our code. (see section 7.12.1 of the ARM9E-S Core Technical Reference, PDF)

Simply reordering the instructions will save us lots of wasted cycles:

ldrb r3, [r1], #1 @ read first pixel value
ldrb r4, [r1], #1 @ read second pixel value
ldrb r5, [r1], #1 @ read third pixel value - we need it later
lsl r3, #1 @ calculate offset (1st pixel)
ldrh r3, [r11, r3] @ read first pixel RGB color
lsl r4, #1 @ calculate offset (2nd pixel)
ldrh r4, [r11, r4] @ read second pixel RGB color

Another thing we have to take into account is that in the DS the color palette is stored in a rather slow memory, and non-sequential accesses to this memory are even slower. According to GBATek, a single 16-bit non-sequential access to palette RAM takes four 33.5 MHz cycles, which translates into eight CPU cycles, because the ARM9 runs at 67 MHz. Palette RAM isn't even cacheable (it's the default setting with DevKitARM; however, I don't suggest that you change this setting even if you actually can) and a lookup is needed for each pixel of the PC VGA "Mode 13h" screen. With a resolution of 320x200 pixels, this happens 64000 times per frame.
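To put those numbers in perspective, here's the arithmetic as a small C sketch. It assumes the worst case, where every single lookup is a non-sequential access, and a full conversion 60 times per second. The function names are made up.

```c
#include <stdint.h>

/* CPU cycles spent on palette lookups for one frame. */
uint32_t lookup_cycles_per_frame(uint32_t width, uint32_t height,
                                 uint32_t cycles_per_lookup)
{
    return width * height * cycles_per_lookup;
}

/* Share of the CPU (in percent, rounded down) taken by that many
   cycles per frame at the given frame rate and clock speed. */
uint32_t cpu_share_percent(uint32_t cycles_per_frame,
                           uint32_t frames_per_second, uint32_t cpu_hz)
{
    uint64_t per_second = (uint64_t)cycles_per_frame * frames_per_second;
    return (uint32_t)(per_second * 100u / cpu_hz);
}
```

That's 320 x 200 x 8 = 512000 cycles per frame, and under these assumptions about 45% of a 67 MHz CPU would go into palette lookups alone.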

To speed up all those lookups, we can copy the palette into a faster memory right before starting our conversion routine. DTCM (Data Tightly-Coupled Memory) is just the right choice. It's a very fast memory: it has single-cycle access time even with non-sequential accesses, but it isn't very large, being in fact only 16 KB total. The program's stack resides in it (again, it's a DevKitARM default setting, and once more I don't recommend changing it) but we actually need only 512 bytes to copy the 256 halfwords. So we temporarily allocate that half kilobyte on top of the stack and copy the palette there. Then we can perform all our lookups, sure that there will be no slowdown. Actually, this has surely been the most effective change applied to the code in terms of performance improvement.

The last code optimization uses a peculiar kind of SIMD. The ARM9 isn't a SIMD CPU, so unlike most processors in use nowadays it can't process multiple data with a single instruction. However, since we have 32-bit registers in there and we need to process 16-bit operands, we could stuff two operands per register and process double-operands as if they were normal operands. Of course, we have to be sure that we don't mix them up. This 'trick' is called SWAR - SIMD Within A Register.

Since in our code we have to perform two weighted averages for each stripe of 5 pixels that we want to convert into 4, we can actually perform the two weighted averages at the same time. Obviously, there's a little overhead: we need to move the operands together before performing the operations and separate them afterwards. This requires 4 ARM assembler instructions. So we can perform two weighted averages in just 11 instructions.
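Here's how the idea can look in C. This is a sketch, not the actual DSx86 code: it applies the (3a + b)/4 expression from my Quick color averaging post to two 15bpp colors packed into one 32-bit value, with the masks simply duplicated in both halves. It assumes the unused top bit of each color is clear, and the names are made up.

```c
#include <stdint.h>

/* Pack two 15bpp colors into one 32-bit value. */
uint32_t pack2(uint16_t lo, uint16_t hi)
{
    return (uint32_t)lo | ((uint32_t)hi << 16);
}

/* Weighted average (3a + b)/4 of two pairs of 15bpp colors at once.
   The masks clear the bits that a right shift would otherwise push
   into the neighbouring component or into the other color. */
uint32_t wavg_x2(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b;
    return ((x & ~0x0C630C63u) >> 2)
         + (((a & ~b) & ~0x04210421u) >> 1)
         + (a & b);
}
```

For example, wavg_x2(pack2(0x7FFF, 0x0000), pack2(0x0000, 0x7FFF)) returns 0x1CE75AD6: the low half is white weighted against black, and the high half is black weighted against white.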

The resulting code, after all these changes, turned out to be 79% faster. Now it processes 179 pixels in the same time that it took the previous code to process only 100 pixels.

In the next post I'll tell you how to obtain the same graphical output without virtually using any CPU resources.

Wednesday, August 08, 2012

Quick color averaging

During my vacation back in May 2011 I was stuck for 4 days between an unexpected incredible snowstorm on one side and the eruption of the Grímsvötn volcano on the other side, of course in Iceland. Well... I had a lot of time and very little to do, so I spent some time trying to figure out the fastest method of calculating a weighted average between two RGB colors, a and b, so that the result would be (3a + b)/4.

What for? Because I had already started being interested in DSx86, a PC emulator for Nintendo DS. If you've never tried this amazing homebrew, I suggest that you do so as soon as possible. DSx86 author 'Pate', in his May 15 blog post, was seeking suggestions on how to perform a faster weighted average between two colors. His method at the time was to run a normal average twice, to achieve a weighted one: tmp = (a+b)/2 then avg = (a+tmp)/2.

So what's the reason why I'm writing this post now? Well... time passes and memories start to fade, so I wanted to write down my thoughts and share them before they are gone completely. You know, I'm growing older ;)

If we can define a + b = (a ^ b) + ((a & b) << 1) as it appears in the following truth table:

a b a+b
0 0 00
0 1 01
1 0 01
1 1 10

then the average formula will be

(a + b)/2 = ((a ^ b)>>1) + (a & b).

Our colors are halfwords (16 bits) where 5 bits are reserved for each RGB component, such as xBBBBBGGGGGRRRRR. The right shift would make the least significant bit of the blue and green components fall into the bits reserved for the green and red components respectively, so we should actually mask the lsb of each component in the (a ^ b) result before shifting. Thus we will obtain

(a + b)/2 = (((a ^ b) & ~0x421) >>1) + (a & b)

which is an accurate average of two RGB colors obtained without having to calculate each component average separately (please, read the very interesting Quick colour averaging article on CompuPhase web site).
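
As a quick check, the masked formula can be written in C and compared with the obvious component-by-component average. This is a minimal sketch with made-up function names, assuming colors in the xBBBBBGGGGGRRRRR layout with the unused top bit cleared.

```c
#include <stdint.h>

/* Average of two RGB555 colors using the masked formula above. */
static uint16_t avg_rgb555(uint16_t a, uint16_t b)
{
    return (uint16_t)((((a ^ b) & ~0x0421u) >> 1) + (a & b));
}

/* Reference: average each 5-bit component separately. */
static uint16_t avg_rgb555_ref(uint16_t a, uint16_t b)
{
    uint16_t out = 0;
    for (int shift = 0; shift <= 10; shift += 5) {
        unsigned ca = (a >> shift) & 0x1Fu;
        unsigned cb = (b >> shift) & 0x1Fu;
        out |= (uint16_t)(((ca + cb) / 2) << shift);
    }
    return out;
}
```

For example, averaging white (0x7FFF) with black (0x0000) gives 0x3DEF, that is 15 for each component, with both functions.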

Similarly, we can define 3a + b as

a b 3a+b
0 0 000
0 1 001
1 0 011
1 1 100

which can be expressed as (a ^ b) + ((a & ~b)<<1) + ((a & b)<<2). To obtain the weighted average, we still have to divide it by 4, which results in

(3a + b)/4 = ((a ^ b)>>2) + ((a & ~b)>>1) + (a & b)

Again, the shifts here would make the least significant bits fall into the other components, so we have to clear the least significant bit for the 1-bit right shift and clear two least significant bits for the 2-bit right shift. Finally, we get

(3a + b)/4 = (((a ^ b) & ~0xC63) >>2) + (((a & ~b) & ~0x421) >>1) + (a & b)
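
The final expression translates directly into C. The sketch below uses made-up function names and assumes the top bit of each color is clear. One detail worth knowing: each shifted term is truncated on its own, so a component can come out one step lower than the exact floor of (3a + b)/4. The reference function is there to make that visible.

```c
#include <stdint.h>

/* Weighted average (3a + b)/4 of two RGB555 colors, as per the formula above. */
static uint16_t wavg_rgb555(uint16_t a, uint16_t b)
{
    return (uint16_t)((((a ^ b) & ~0x0C63u) >> 2)
                    + (((a & ~b) & ~0x0421u) >> 1)
                    + (a & b));
}

/* Reference: exact floor of (3a + b)/4 on each 5-bit component. */
static uint16_t wavg_rgb555_exact(uint16_t a, uint16_t b)
{
    uint16_t out = 0;
    for (int shift = 0; shift <= 10; shift += 5) {
        unsigned ca = (a >> shift) & 0x1Fu;
        unsigned cb = (b >> shift) & 0x1Fu;
        out |= (uint16_t)(((3 * ca + cb) / 4) << shift);
    }
    return out;
}
```

With a = 0x7FFF and b = 0x0000 the bitwise version returns 0x5AD6 (22 per component) while the exact one returns 0x5EF7 (23 per component). With a = 0x0000 and b = 0x7FFF both return 0x1CE7.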

The normal average was implemented using 4 ARM assembler instructions, and had to be done twice. By contrast, the weighted average calculated as per my expression can be coded using only 7 ARM instructions, which saves 1 cycle per weighted average. Not bad if you consider that all 200 320-pixel-wide lines of the VGA screen have to be converted into 256-pixel-wide lines to fit the DS screen up to 60 times per second. To do this, you need to perform two weighted averages every 5 pixels.
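
To get a feeling for the numbers, here is the arithmetic spelled out in C. It only uses the figures quoted above (320-pixel lines, 200 lines, stripes of 5 pixels, two weighted averages per stripe, up to 60 frames per second); the names are made up for the example.

```c
enum {
    SRC_WIDTH       = 320,  /* VGA line width in pixels    */
    LINES           = 200,  /* lines per frame             */
    STRIPE          = 5,    /* source pixels per stripe    */
    AVGS_PER_STRIPE = 2,    /* weighted averages per stripe */
    MAX_FPS         = 60    /* worst case refresh rate     */
};

/* Weighted averages needed to convert one full frame. */
static long weighted_avgs_per_frame(void)
{
    return (long)(SRC_WIDTH / STRIPE) * AVGS_PER_STRIPE * LINES;
}

/* Worst case: every frame converted, 60 times per second. */
static long weighted_avgs_per_second(void)
{
    return weighted_avgs_per_frame() * MAX_FPS;
}
```

That is 25600 weighted averages per frame and 1536000 per second in the worst case, so saving a single cycle on each of them adds up.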

There are some other nice tricks I used to speed up things even more... but I'll detail those in the next post because I'd prefer to keep this one on the subject of quick color averaging only.

Saturday, June 30, 2012

MP3 streaming on ARM7

Even if it's not something I would personally use in my homebrew... it's possible to program the DS 'secondary' processor (called 'ARM7' for short) to stream (I mean play) MP3s directly from the storage. The good thing is that your 'main' processor (the ARM9) will be free to work on more interesting tasks such as processing your game logic, while the music plays in the background. The bad thing is that your 33MHz ARM7 isn't exactly a super-powered beast, and has no hardware decoding capabilities... so the whole software decoding work of an MP3 can be quite demanding for its somewhat limited power.

It all started with this post on gbadev forum. Extrapolating parts of elhobbs' great work on cquake, forum user 'hacker013' created an example (that can be easily turned into a library, if this matters to you...) that makes it very easy to stream a (stereo only, be aware) MP3 audio file on a DS. The provided code, unfortunately, was playing only the left channel of the MP3 file it was streaming, so I made some changes that actually made the code play both left and right channels using two separate DS hardware channels. I've put the whole modified code here, in case you need it.

To tell the truth, I hardly see any reason why you should ever use such a thing in your homebrew. MP3 files are quite big (say 1 MB for each minute of a 44100 samples/sec, stereo, 128Kbps encoded file) and the ARM7 processor will surely choke if you try to make it decode a 44.1KHz stereo (CD sample rate) MP3 encoded at more than 128Kbps. Even a common 44.1KHz 128Kbps stereo MP3 could be enough to choke the CPU, sooner or later... at least that's what I found during my tests. Things get better with stereo MP3s having a 32KHz sample rate, which is also close to the DS audio output rate of 32768 samples per second. In my tests I could play some 32KHz stereo MP3s encoded up to 256Kbps with no problems.
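
The "1 MB for each minute" figure follows from the bitrate alone. Here is a tiny sketch of that arithmetic, assuming a constant bitrate file, 1 Kbps = 1000 bits per second, and ignoring tags and headers (the function name is made up):

```c
/* Approximate size in bytes of a constant bitrate MP3. */
static unsigned long mp3_size_bytes(unsigned long kbps, unsigned long seconds)
{
    return kbps * 1000UL / 8UL * seconds;
}
```

One minute at 128Kbps gives 960000 bytes, which is a bit less than 1 MB, and the same minute at 256Kbps takes twice as much.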

Anyway, in my opinion, 'tracked music' is the way to go on a DS. MaxMod library, or libXM7, which I wrote myself some time ago, can produce very good quality music even with very little ARM7 CPU load. MaxMod comes with the devkitARM & libnds, and it supports MOD/S3M/XM/IT formats with hardware and software audio mixing. On the other hand, libXM7 only supports MOD and XM formats with hardware mixing only, but the MOD and XM compatibility is very accurate, and it supports the whole range of effects that MOD/XM tunes can use.

Sunday, May 06, 2012

"Peaches the Wale"

A few days ago I received an e-mail from a tiny whale. Well, I have to admit it isn't something I've seen really often. They call her "Peaches the Wale" [sic], and she's a musician who recently composed some MOD tunes on her Commodore Amiga. You can see her in this video.

Now she's been invited to play some concerts around, and she realized it wouldn't be very feasible to drag the Amiga with her... so she found on the Internet the XM/MOD player I wrote for the Nintendo DS using the libXM7 library, but she needed some additional features:

the module should load and be ready for replay, instead of starting immediately
it should be possible to stop and restart the module from the current pattern
it should be possible to skip to next or to previous pattern while stopped or even while playing
the program should visually show the number of the current pattern and the total number of patterns in the module
the music output should be mono, for her DJ mixer.

I really couldn't refuse giving her my help, so I compiled my player again in a different version with the requested features. I called it XM7dj and of course it's available for everybody who might need it. Download it here. I hope you enjoy!

Thursday, April 26, 2012

about wi-fi capabilities

This took me much longer than planned, really.
First, I had to turn my wife's EeePC into a perfect wi-fi packet capturing machine. I did this by preparing a 'persistent' Live USB Ubuntu on a USB flash drive, and installing aircrack-ng and Wireshark on it. This made it possible for me to capture every wi-fi packet in the air including wi-fi management packets, which were the ones I was mainly interested in.
Then I needed a wi-fi access point (or router) because I don't have one at home. Just when I was going to visit my brother and run some tests at his home, his router broke. Lucky, huh? Fortunately I could then borrow one from a co-worker, so I could go on.

The first test I planned to run was to capture the full association process between my DS Lite and the access point using a regular DS wi-fi enabled game. During this process the DS informs the router about its wi-fi capabilities, and I wanted to gather that information. Stephen Stair (sgstair), dswifi library author, says that the DS doesn't seem capable of transmitting packets at data rates other than 1 Mbps or 2 Mbps, but admits he never investigated the receiving speed capabilities... so I decided to start from here.

So my DS Lite informs the router that it can operate at all four 802.11b data rates: 1, 2, 5.5 and 11 Mbps. Nice! I could also see from the captured packets that the DS never sends any packet to the router at data rates higher than 2 Mbps, so sgstair was right about that.
Knowing that the associated device (my DS Lite) can operate at data rates up to 11 Mbps, the router tried to communicate with it using that speed... with no luck at all. After sending some packets and receiving no acknowledgments, the router resent the packets using the lowest data rate possible (1 Mbps) and of course the DS acknowledged that. At this point the router (at least the Netgear I'm using for these tests) decided it wasn't worth continuing to send packets at the highest speed and switched to sending at 5.5 Mbps instead. However, no luck again as the DS unfortunately didn't acknowledge a single packet sent at that data rate.

So it really looks like the DS is only capable of sending and receiving packets at rates up to 2 Mbps, not faster... and it also looks like the WFC-enabled game I'm using isn't sending correct capability information to the router. But that's not simply a mistake. The reason is that the router I'm using requires that any equipment willing to communicate be able to do so using every data rate in the required subset of data rates, which are the ones with the high-order bit (0x80) set. This subset is called 'BSSBasicRateSet' in case you want to check yourself (I really had to download the whole big bloated 1233-page "IEEE 802.11-2007 Standard" document to check that!). In short, the DS lies to the router so that the router doesn't refuse the connection with the following error: "Association denied due to requesting [device] not supporting all of the data rates in the BSSBasicRateSet parameter".
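
For those who want to read the rate bytes in a capture: in the 802.11 Supported Rates element each byte carries the rate in units of 500 kbps in its lower 7 bits, and the high-order bit (0x80) marks the rate as part of the BSSBasicRateSet. The sketch below decodes such bytes and performs the check described above. The three rate lists are illustrative values only, not bytes copied from my captures, and all the names are made up.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative lists: a router with all four 802.11b rates marked as basic,
   what the DS can really handle, and what it claims when associating. */
static const uint8_t AP_RATES[4]         = { 0x82, 0x84, 0x8B, 0x96 };
static const uint8_t DS_REAL_RATES[2]    = { 0x02, 0x04 };
static const uint8_t DS_CLAIMED_RATES[4] = { 0x02, 0x04, 0x0B, 0x16 };

/* Rate in kbps: the lower 7 bits count units of 500 kbps. */
static unsigned rate_kbps(uint8_t octet)
{
    return (octet & 0x7Fu) * 500u;
}

/* The high-order bit flags a rate that belongs to the BSSBasicRateSet. */
static bool rate_is_basic(uint8_t octet)
{
    return (octet & 0x80u) != 0;
}

/* True if the station lists every basic rate advertised by the access point. */
static bool supports_basic_rate_set(const uint8_t *ap, size_t n_ap,
                                    const uint8_t *sta, size_t n_sta)
{
    for (size_t i = 0; i < n_ap; i++) {
        if (!rate_is_basic(ap[i]))
            continue;
        bool found = false;
        for (size_t j = 0; j < n_sta; j++) {
            if (rate_kbps(sta[j]) == rate_kbps(ap[i])) {
                found = true;
                break;
            }
        }
        if (!found)
            return false;
    }
    return true;
}
```

With these lists the check fails for DS_REAL_RATES and passes for DS_CLAIMED_RATES, which is exactly why claiming all four rates gets the association accepted.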

That's a pity, really. But anyway it was somewhat fun.

Thursday, February 23, 2012

The Unequivocal Answer

Countless times I've read forum topics where the same question was being asked that rarely provided the same answer... often raising a lot of other questions instead.
"What's the fastest?" (we're speaking about memory copy operations here; otherwise, refer to Wikipedia about zoology.)

Since I really believe that it all depends on your needs, I decided to write a program to help you decide what's faster (or 'better') for you. The program copies 64000 bytes allocated in main RAM to a new location that can be either in main RAM or in video RAM. You can change the target by pressing the X key. The program performs the copy many times using different methods applied to different conditions.

First, it does a standard memcpy() 4 times: one copying from a 32 byte boundary address (so that it gets 'cache aligned'), the next from the uncached mirror of the same address, the third from a 'cache unaligned address' and the last one from the uncached mirror of that unaligned address. Of course, copying data from an uncached memory location can lead to some unexpected results, so if you decide that you're going to use that, please remember that your processor cache may contain some data that is still unwritten. Use libnds' DC_FlushRange() when appropriate.

As a second test, the program does the copy using a small asm function I wrote. The function simply loads 32 bytes from memory into 8 CPU registers (a single ldmia opcode), then writes the contents of these registers to the target memory (a single stmia opcode). Loop that 2000 times and the 64000 bytes copy is done. Again, the program runs this test using the previously mentioned 4 different conditions.
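
The structure of that loop can be sketched in C. This is not the asm function itself: a compiler may or may not turn the eight loads and eight stores into single ldmia/stmia opcodes, and the function name is made up. It assumes word-aligned pointers and a size that is a multiple of 32 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Copies 'bytes' bytes (a multiple of 32) moving 8 words per iteration.
   Returns dst, like memcpy() does. */
static uint32_t *copy32(uint32_t *dst, const uint32_t *src, size_t bytes)
{
    uint32_t *d = dst;
    for (size_t n = bytes / 32; n > 0; n--) {
        /* load 8 words in a row (what a single ldmia does) */
        uint32_t r0 = src[0], r1 = src[1], r2 = src[2], r3 = src[3];
        uint32_t r4 = src[4], r5 = src[5], r6 = src[6], r7 = src[7];
        /* store the same 8 words (what a single stmia does) */
        d[0] = r0; d[1] = r1; d[2] = r2; d[3] = r3;
        d[4] = r4; d[5] = r5; d[6] = r6; d[7] = r7;
        src += 8;
        d += 8;
    }
    return dst;
}
```

Calling copy32(target, source, 64000) runs the loop body 2000 times.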

Loop unrolling is an often used technique to increase speed, so the program also runs a (cached and aligned) test with a 2x unrolled version of the above function and a test with a 4x unrolled version of it.

Third test: since memory access times are much higher for non-sequential accesses, the program uses a custom asm function that loads 10 CPU registers in a row (again with a single ldmia opcode) to take advantage of the higher sequential/total reads (and then writes) ratio. The program does the copy reading from both the cached main memory address and the uncached mirror address of the same memory location.

Fourth: going on with the previous observation, the program does the copy using a small (512 bytes) DTCM scratch area. So it loads half a kilobyte into this fast memory (thus reading sequentially 128 words from main memory) and then it copies all the scratch area contents to the target address. Similarly, the test runs reading both from the cached main memory address and the uncached mirror of the same address.

Last, the program copies those 64000 bytes using dedicated hardware - the DMA. This means that the CPU isn't even aware of the copy going on, and this might trigger some problems if there are still unwritten bytes in the cache, as already noted before. Theoretically the CPU should be able to work while the DMA does its task. However, in reality, the DMA locks the bus for its exclusive use, so the CPU can go on working only as long as it doesn't need any bus access. Then it will stall, waiting for the DMA transfer to finish.

Here are some screenshots taken while running the program on my DS Lite.

Figures could change if the program runs on a different DS model, and I would be interested in seeing those figures if they do change a lot. Pressing one of the shoulder keys will save a bitmap on your memory card: if the selected target is main RAM a 'memcpy_mainram.bmp' file will be created, while if the selected target is video RAM then the created file will be 'memcpy_videoram.bmp'.

Finally, here are some considerations you might find interesting:

copying using DMA is both the slowest option (when copying to main RAM) and the fastest one (when copying to video RAM)

memcpy() gets overtaken by almost every other method, so I guess one should use it only when prototyping or when performance is not an issue

reading from an uncached address seems to give some % of boost when using ldmia/stmia; it just makes things worse when using memcpy()

Loop unrolling doesn't give any advantage when copying to main RAM; on the contrary, it effectively speeds up a bit when copying to video RAM

reading from a cache unaligned source address can slow down things a bit, especially when reading from a cached address

using a DTCM temporary copy doesn't help

Never trust results taken from any emulator. At least I didn't find any emulator that could provide results resembling the real ones.

If you'd like to run the tests on your own DS, here's the program to download.

Monday, January 23, 2012

Telling DS models apart

Over the years the DS went through some upgrades, which made it lighter and gave it better LCD quality (the DS Lite). Then the available memory was increased as well as the CPU speed (the DSi), and bigger screens were introduced later (the DSi XL). Finally, a stereoscopic screen was implemented (the 3DS, which is actually more like a new console with DS compatibility). Nintendo also formed a joint venture with a Chinese entrepreneur to distribute its products in China under the iQue name. So we actually also have the iQue DS, iQue DS Lite and iQue DSi, all featuring additional support for the Chinese language, which is not included in Nintendo's 'own' products.

Homebrew programs are not aware of the different models. Even most of the programs that do run in DSi mode, using the ARM9 CPU at double speed (134 MHz) and the quadrupled main memory size (16 MB), aren't aware of that. A few programs, however, detect if they're running on a DS Lite or newer, so they can drive the LCD brightness, which isn't possible on the 'original' DS.

The GBAtek document gives some information that could be useful for distinguishing the DS Lite from the 'original' DS model (often called DS 'phat' nowadays). It also contains a minimum of information regarding both iQue DS and iQue DS Lite models, but it lacks anything from DSi on. So I tried to collect as much data as possible from friends and on-line DS user communities... I would have expected some more help from the latter, but... well, ok.

From the details I collected it seems that the 01Dh location ('Console type') in the firmware header could also be used to detect if the model is a DSi, but unfortunately, it doesn't give any hint to distinguish if it's a 'normal' DSi or a DSi XL... or if it's a 3DS emulating a DSi. They all appear the same (the value there is always 0x57). I then noticed that at the 02Fh location ('Wifi version') I get different values for different console models: all the DSi models I have had tested have 0x0F, the DSi XL models have 0x18, and 3DS's have 0x1C.

I thus made a program (download it here) that should tell you which DS model you're holding in your hand. In case it's a DSi / DSi XL / 3DS the program will also tell you if your homebrew cartridge is working in DSi mode or not.

Unfortunately, I still don't know anything about the iQue DSi, and I also don't know if there is any other possible value for 'Wifi version' except for the ones I collected, so I actually speculated that any different value found should be assimilated to the closest lower valid value. If you have an iQue DSi or if you're running my program and it misses the correct answer, I'd really appreciate if you could run this little program and let me know what it writes on screen and which console model the program is running on.

Finally, here's the source code of the detection routine, which has to run on the ARM7. It uses the boolean variable __dsimode that is provided directly by libnds. I wrote the code using a sort of a 'successive refinement' so that if it misses the target it shouldn't miss it by too much. Hopefully. Use it as you wish in your own program, at your own risk.

#define MODEL_NINTENDO_DS       0
#define CHINESE_SUPPORT         1
#define MODEL_IQUE_DS           (MODEL_NINTENDO_DS+CHINESE_SUPPORT)
#define MODEL_NINTENDO_DS_LITE  2
#define MODEL_IQUE_DS_LITE      (MODEL_NINTENDO_DS_LITE+CHINESE_SUPPORT)
#define MODEL_NINTENDO_DSI      4
// #define MODEL_IQUE_DSI       (MODEL_NINTENDO_DSI+CHINESE_SUPPORT)
#define MODEL_NINTENDO_DSI_XL   6
#define MODEL_NINTENDO_DSI_LL   MODEL_NINTENDO_DSI_XL
#define MODEL_NINTENDO_3DS      8

#define MODE_DS                 0
#define MODE_DSI                1

typedef union {
  struct {
    u8 model;
    u8 flags;
    u8 padding[2];
  };
  u32 packed;
} HwInfo;

extern bool __dsimode;
    
u32 getDSmodel() {

  HwInfo hwinfo;
  u8 fw1D;
  u8 fw2F;

  // reset
  hwinfo.packed=0;

  // read two firmware bytes we might need
  readFirmware (0x1D, &fw1D, 1);
  readFirmware (0x2F, &fw2F, 1);

  // check if we're in DSi mode or not
  if (__dsimode) {
    hwinfo.flags=MODE_DSI;
    hwinfo.model=MODEL_NINTENDO_DSI;  // shortcut if we're in DSi mode
  } else {
    hwinfo.flags=MODE_DS;
    hwinfo.model=MODEL_NINTENDO_DS;
  }
  
  // check if it's a DS Lite
  if ((hwinfo.model==MODEL_NINTENDO_DS) && (readPowerManagement(4) & 0x40))
    hwinfo.model=MODEL_NINTENDO_DS_LITE;

  // check if it's a DSi
  if ((hwinfo.model==MODEL_NINTENDO_DS_LITE) && (fw1D & 0x04))
    hwinfo.model=MODEL_NINTENDO_DSI;

  // check if it's an iQue DS
  if ((hwinfo.model==MODEL_NINTENDO_DS) && (fw1D!=0xFF) && (fw1D & 0x03))
    hwinfo.model=MODEL_IQUE_DS;
    
  // check if it's an iQue DS Lite
  if ((hwinfo.model==MODEL_NINTENDO_DS_LITE) && (fw1D & 0x03))
    hwinfo.model=MODEL_IQUE_DS_LITE;
  
  // NO iQUE DSi detection, yet...
  
  // check if it's a 3DS
  if ((hwinfo.model==MODEL_NINTENDO_DSI) && (fw2F>=0x1c))
    hwinfo.model=MODEL_NINTENDO_3DS;
    
  // check if it's a DSi XL
  if ((hwinfo.model==MODEL_NINTENDO_DSI) && (fw2F>=0x18))
    hwinfo.model=MODEL_NINTENDO_DSI_XL;
    
  return (hwinfo.packed);
}
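
Since the routine returns everything packed into a single u32, the caller has to take it apart again. Here is a minimal sketch of how that could look. The hwinfo_* helpers are made up for this example, and it uses the standard uint8_t/uint32_t types in place of libnds' u8/u32 so that it also compiles off the DS.

```c
#include <stdint.h>

/* Same values as in the listing above. */
#define CHINESE_SUPPORT         1
#define MODEL_NINTENDO_DS       0
#define MODEL_NINTENDO_DS_LITE  2
#define MODEL_NINTENDO_DSI      4
#define MODEL_NINTENDO_DSI_XL   6
#define MODEL_NINTENDO_3DS      8
#define MODE_DSI                1

typedef union {
  struct {
    uint8_t model;
    uint8_t flags;
    uint8_t padding[2];
  };
  uint32_t packed;
} HwInfo;

/* Hypothetical helper: builds a packed value the way getDSmodel() does. */
static uint32_t hwinfo_pack(uint8_t model, uint8_t flags) {
  HwInfo info;
  info.packed = 0;
  info.model = model;
  info.flags = flags;
  return info.packed;
}

/* Model code with the Chinese support bit removed. */
static int hwinfo_base_model(uint32_t packed) {
  HwInfo info;
  info.packed = packed;
  return info.model & ~CHINESE_SUPPORT;
}

/* Non-zero for the iQue variants. */
static int hwinfo_is_ique(uint32_t packed) {
  HwInfo info;
  info.packed = packed;
  return info.model & CHINESE_SUPPORT;
}

/* Non-zero if the program is running in DSi mode. */
static int hwinfo_dsi_mode(uint32_t packed) {
  HwInfo info;
  info.packed = packed;
  return info.flags == MODE_DSI;
}

/* Printable name of the base model. */
static const char *hwinfo_name(uint32_t packed) {
  switch (hwinfo_base_model(packed)) {
    case MODEL_NINTENDO_DS:      return "DS";
    case MODEL_NINTENDO_DS_LITE: return "DS Lite";
    case MODEL_NINTENDO_DSI:     return "DSi";
    case MODEL_NINTENDO_DSI_XL:  return "DSi XL";
    case MODEL_NINTENDO_3DS:     return "3DS";
    default:                     return "unknown";
  }
}
```

On the DS you would pass the value returned by getDSmodel() (sent over from the ARM7) straight to these helpers.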

Monday, January 09, 2012

Mandelbrot fractals

I've always loved fractals... I mean, who doesn't?

Last Wednesday, after watching this video on Vimeo (which I suggest that you check out with your speakers on) I recalled I had written a Mandelbrot set explorer program once on my 386. It was some time ago, back in 1997 I guess. It took me a while back then.

This time I started from this pseudocode provided by Wikipedia's 'Mandelbrot set' page, which implements the so called "escape time" algorithm. You set a maximum number of iterations and do some math. If after those iterations the resulting point still falls very close to the axis origin (0,0) the algorithm will color the point in black. If the point leaves the area, it will instead mark the point using a different color, depending on how many iterations were needed before leaving the origin's surrounding area.
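
In C the escape time algorithm boils down to a few lines. This is a minimal sketch, not the code of my program: the function name is made up, and it assumes the usual bailout test (the point has 'left the area' once its distance from the origin exceeds 2).

```c
/* Number of iterations before the point (cx, cy) leaves the circle of
   radius 2 around the origin, or max_iter if it never does. */
static int escape_time(float cx, float cy, int max_iter)
{
    float x = 0.0f, y = 0.0f;
    int i = 0;
    while (i < max_iter && x * x + y * y <= 4.0f) {
        float x_next = x * x - y * y + cx;  /* real part of z*z + c      */
        y = 2.0f * x * y + cy;              /* imaginary part of z*z + c */
        x = x_next;
        i++;
    }
    return i;
}
```

A returned value equal to max_iter means the point gets painted black. Any smaller value picks the color.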

You wouldn't expect to see such amazing images, really.

The program I made starts showing a small area around the origin: from (-2.5,-1.5) to (+1.5,+1.5)... this gives the first image. Clicking on any point of the touchscreen, it will start recalculating a new image, taking the touched point as a new center of the image and zooming in 2x. After 5 clicks you will see for instance what's shown on the second image. Not bad, right?
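
The mapping between touchscreen pixels and points of the plane, and the 2x zoom, can be sketched like this. The names are made up, and the sketch assumes the 256x192 DS screen with pixel (0,0) mapped to the first corner of the area, without flipping the vertical axis.

```c
typedef struct { float x_min, y_min, width, height; } View;
typedef struct { float x, y; } Point;

/* The starting area: from (-2.5,-1.5) to (+1.5,+1.5). */
static View view_initial(void)
{
    View v = { -2.5f, -1.5f, 4.0f, 3.0f };
    return v;
}

/* Point of the plane shown at screen pixel (px, py). */
static Point view_point(View v, int px, int py)
{
    Point p;
    p.x = v.x_min + (float)px * v.width / 256.0f;
    p.y = v.y_min + (float)py * v.height / 192.0f;
    return p;
}

/* New view centered on the touched pixel, zoomed in 2x. */
static View view_zoom_in(View v, int px, int py)
{
    Point c = view_point(v, px, py);
    View z;
    z.width  = v.width / 2.0f;
    z.height = v.height / 2.0f;
    z.x_min  = c.x - z.width / 2.0f;
    z.y_min  = c.y - z.height / 2.0f;
    return z;
}
```

Touching the middle of the screen (128,96) in the starting view gives a new view that is 2 units wide and still centered on (-0.5,0). After 5 touches the view is 0.125 units wide.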

Then you can press either of the two shoulder buttons to zoom out, or press the start button to reset the program to the initial setting.

Anyway, the Nintendo DS has some limitations. Of course it doesn't feature a superscalar quadcore 2GHz+ processor. It only has a 67 MHz ARM946E, which also has no floating point unit at all, so each operation on floating point variables doesn't turn into a single (co)processor opcode, but into a series of integer operations. So, to keep the image generation time acceptable, I had to limit the maximum number of iterations of the aforementioned algorithm (actually to a very small value: 32). I also had to opt for single precision floating point variables, the fastest choice available.

The following two images show both the limits. After some zooming, you won't see any other color following the 'pink, cyan, green' sequence... and if you keep on zooming in, you'll see that the calculations will start to lose precision, and the resulting image quality will become poor.

Of course I might consider working on a version that increases the number of iterations upon request OR 'when necessary'. I might also consider switching from single precision floating point variables to double precision... or, maybe even better, consider using an arbitrary precision floating point numbers library, such as GMP (The GNU Multiple Precision Arithmetic Library).

Well, let's see :)

In the meanwhile, if you want to play it yourself, you can download the program here.

Feel free to leave a comment about it if you want.

Wednesday, January 04, 2012

Welcome!

Hi! This is my first post of my first personal blog ever and, if I don't get bored too quickly, I already know I'll be smiling when reading this again in some 10 or 20 years. :)

Here I will publish some news about me and what has been my passion for some years: Nintendo DS Homebrewing. Yes, I know it's 2012 and it's already the sunset for the NDS, but over these years I've loved that console - and I still love it - so much that I feel I need to share.

Well, see you later! :)