Showing posts with label ASM. Show all posts
Showing posts with label ASM. Show all posts

Wednesday, May 13, 2015

Putting some more fuel into SMS homebrew

Here I am again... well, almost another year has passed since my last words here. "Doesn't time fly when you're enjoying yourself?"[*]

I spent most of my time these last months writing code and building tools aimed to the homebrew on the SEGA Master System (SMS for short) and SEGA Game Gear.
Writing homebrew games/programs on SMS still means writing ASM code almost from scratch every time.
Yes, of course you can reuse some of the code you've already written, but still there isn't a big deal of shared ready-to-use code, snippets apart. Even if it features background music and sound effects support, the previously only existing audio library (the very good Mod2PSG2, which plays music written on the tracker with the same name) unfortunately never provided a way to export sound effects from its tracker, thus forcing developers to build their sound effects with hex editors.

So I wrote PSGlib, and the tools to convert VGM files into tunes and sound effects that the homebrewer wants to use in their program. VGM files can be produced by some well know trackers such as DefleMask and the very same Mod2PSG2 tracker too. After you convert and compress them into the PSG format, everything you have to do is just to start them (music and SFX) at the right time. Now the majority of SMS homebrew arising these days uses it, the complete list is here.

This happened mostly in 2014, still. And even if writing Z80 ASM code can be very entertaining (seriously!), I decided to try to build a development kit to write SMS homebrew in C.
Of course, first I needed a compiler. SDCC turned out to be the choice. It's a free open source optimizing C compiler that also targets the Zilog Z80 processor, among many others. So the core component I needed was already available.
Since the processor is only one of many components of the SMS, I needed to write code and tools to make it possible to use the SMS as the target, and to write a library on top of that development kit to enable programmers to use the underlying hardware straight from their C programs.

So last January devkitSMS (the kit) and SMSlib (the library) were born.
The library includes functions to handle the display hardware features, such as hardware scrolling for example, background and sprites, and supports software sprite clipping based on a user defined window. It has functions to handle colors and palettes, tiles and tilemaps, both normal keypad and Genesis/MegaDrive 3/6 buttons pad, the pause key and it also has ROM mappers support.
PSGlib then also incarnated as a additional C library, so that it also can be used with the devkit.

And that's pretty much it.

Sunday, August 31, 2014

One year with no (posted) news

Hello!
Yes, I'm alive and kicking :)

Sorry I may have appeared MIA... I just had nothing interesting to post; it's been a year since I virtually don't have anything going on involving my lovely DS or the little GBA.

Anyway a few months ago I recovered my younger brother's SEGA Master System II and also his SEGA Game Gear, both of which were forgotten long ago... fortunately, they both are still almost perfectly working.
In case you've never heard of them, they're basically the same hardware, even if the latter is a portable system featuring a color LCD display, more colors and stereo audio. What we're talking about here are 3.5 MHz Zilog Z80 powered 8 bit systems, not exactly something you can compare to GBA 16.5 MHz and DS 66 MHz 32 bit ARM processors horsepower.

Still it's already proving to be very intriguing to write code for these consoles.

First, you have to deal with the processor. The Z80 is an 8 bit CPU, as I said, so actually it can handle only rather small numbers. It even has no instruction to perform multiplication, not to mention division. Shifting a register requires twice the time it takes to make an addition, and you can shift bits left or right by one position only. The fastest operations, such as the addition, require 4 clock cycles (here's the complete instruction set). Many basic operations are available on selected registers only.

The memory. The system features 8 KB of RAM, built into the console. However, the code runs from ROM: there is a chip inside each and every cartridge, so there are also no loading times. ROM size can be up to 1 MB, but everything bigger than 48 KB requires bank switching to access the upper part.

There's virtually no other option but to code in assembler. WLA DX is currently the assembler of choice.

Then there's the hardware responsible for the graphic, the Video Display Processor (VDP), which has very limited capability, again I mean compared to the DS/GBA. Basically here you've got a single background made up of a grid of 32x24 tiles, each with 16 colors either from the first or the second palette, which is the one that is also used by the sprites. The hardware also isn't capable of displaying more than 64 different colors, and I don't mean at the same moment. Finally, up to 64 16-colors sprites (each 8x8 or 8x16 pixels dimensions) are available, but only up to 8 will be drawn on the same scanline. Sprites unfortunately cannot be flipped neither horizontally nor vertically. The VDP still has the quite powerful feature of supporting hardware scrolling of the background in both X and Y directions.

The background tiles, the map, the sprites graphics and the Sprite Attribute Table (SAT) all share the same VRAM space, which is only 16 KB total. The most troublesome issue here is that you can write to VRAM in specific moments only (when video is disabled or during VBlank). If you write to VRAM at the wrong moment, your data will simply be discarded, so it's very easy to end up with corrupted graphics.

Finally, Japanese version of the SEGA Master System apart, the system generates music and sound effects from its PSG chip (the Texas Instruments SN76489 Programmable Sound Generator), which has 4 mono audio channels. Each of the first three channels can output a true square wave (50% duty cycle) of a given frequency, while the fourth channel can output noise only. Volume of each channel can be set to one of 16 attenuation levels on a logarithmic scale.

There are some ready-to-use tools and libraries to compose and replay modules using the PSG chip, but none of these libraries is currently supporting both music and sound effects. So I decided to try implementing a solution to be able to have sound effects over background music, even with the very limited number of available channels.

The result is PSGlib. It plays VGM tunes written for the SN76489 (the tunes need to be converted to a specific format), and it supports sound effects on the third square wave channel and/or on the noise channel, avoiding any collision with the music that would be probably trying to use the same channels. More details about the library may follow in a separate post, eventually.

And... that's pretty much everything so far.

Tuesday, September 11, 2012

...more tricks, more speed!

Very recently, I've further optimized the VGA smooth scaling that DSx86 uses, and Pate already announced on his blog the next release will feature the faster code. If you haven't read about the first wave of optimizations, please read about it in my August posts.

This time I exploited two more tricks, and the result is that the speed increased from 79% faster to 114% faster compared to the original code. Of course, this isn't bad at all.

The former optimization comes from the observation that there's no way to use a base register plus a shifted register offset addressing scheme when accessing halfwords, whereas this is very common when accessing words instead. This means we need a separate shift instruction to calculate the offset if we want to access a halfword in a lookup table while, on the contrary, we can access a word in a lookup table using a single instruction. Thus, if we define a 256-words temporary space in the stack (it takes 1 KB) for the lookup table and copy there each palette RGB values as whole words, we will later save one instruction per input pixel when we access them. So this fragment of code:

ldrb r3, [r1], #1  @ read first pixel value
ldrb r4, [r1], #1  @ read second pixel value
ldrb r5, [r1], #1  @ read third pixel value
lsl r3, #1         @ calculate offset (1st pixel)
ldrh r3, [r11, r3] @ read first pixel RGB color
lsl r4, #1         @ calculate offset (2nd pixel)
ldrh r4, [r11, r4] @ read second pixel RGB color
lsl r5, #1         @ calculate offset (3rd pixel)
ldrh r5, [r11, r5] @ read third pixel RGB color

turns into this shorter one:

ldrb r3, [r1], #1  @ read first pixel value
ldrb r4, [r1], #1  @ read second pixel value
ldrb r5, [r1], #1  @ read third pixel value
ldr r3, [r11, r3, lsl #2] @ read first pixel RGB color
ldr r4, [r11, r4, lsl #2] @ read second pixel RGB color
ldr r5, [r11, r5, lsl #2] @ read second pixel RGB color

Since the code performs 5 lookups per loop in total, this optimization saved 5 instructions, shortening the whole loop to 27 instructions only and increasing the speed to 98%.

The latter optimization done uses the very well known trick of the loop unrolling. Since the results are so good, I thought it was worth spending some code space. The loop has been unrolled 8 times, thus now it processes 40 input pixel each iteration before encountering the costly (3 cycles) branch instruction. Even this simple optimization proved to be very effective in terms of performance improvement.