Saturday, June 30, 2012

MP3 streaming on ARM7

Even if it's not something I would personally use in my homebrew... it's possible to program the DS 'secondary' processor (called 'ARM7' for short) to stream (I mean play) MP3s directly from the storage. The good thing is that your 'main' processor (the ARM9) will be free to work on more interesting tasks such as running your game logic, while the music plays in the background. The bad thing is that the 33 MHz ARM7 isn't exactly a super-powered beast, and it has no hardware decoding capabilities... so decoding an MP3 entirely in software can be quite demanding for its somewhat limited power.

It all started with this post on the gbadev forum. Extrapolating parts of elhobbs' great work on cquake, forum user 'hacker013' created an example (that can be easily turned into a library, if this matters to you...) that makes it very easy to stream a (stereo only, be aware) MP3 audio file on a DS. The provided code, unfortunately, played only the left channel of the MP3 file it was streaming, so I changed it to play both the left and right channels using two separate DS hardware channels. I've put the whole modified code here, in case you need it.
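To give an idea of what 'two separate hardware channels' implies, here's a minimal sketch in plain C. It is not the code from the example: it assumes the decoder hands back 16-bit samples interleaved as left/right pairs, and the function name split_stereo is made up for the illustration. Each output buffer would then feed its own hardware channel, panned fully to its own side.

```c
#include <stddef.h>
#include <stdint.h>

/* Split interleaved 16-bit stereo PCM (L0 R0 L1 R1 ...) into two mono
   buffers, one per hardware channel. 'frames' is the number of L/R pairs.
   Assumption: the decoder output is interleaved like this. */
void split_stereo(const int16_t *interleaved, int16_t *left,
                  int16_t *right, size_t frames)
{
    for (size_t i = 0; i < frames; i++) {
        left[i]  = interleaved[2 * i];      /* even samples: left  */
        right[i] = interleaved[2 * i + 1];  /* odd samples:  right */
    }
}
```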

To tell the truth, I hardly see any reason why you should ever use such a thing in your homebrew. MP3 files are quite big (say 1 MB for each minute of a 44100 samples/sec, stereo, 128 Kbps encoded file) and the ARM7 processor will surely choke if you try to make it decode a 44.1 KHz stereo (CD sample rate) MP3 encoded at more than 128 Kbps. Even a common 44.1 KHz 128 Kbps stereo MP3 could be enough to choke the CPU, sooner or later... at least that's what I found during my tests. Things get better with stereo MP3s that have a 32 KHz sample rate, which is also close to the DS audio output rate of 32768 samples per second. In my tests I could play some 32 KHz stereo MP3s encoded at up to 256 Kbps with no problems.
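The '1 MB per minute' figure is just the bitrate turned into bytes. As a quick check, assuming a constant bitrate and ignoring tags and headers (the helper name cbr_bytes is made up for this note):

```c
#include <stdint.h>

/* Bytes taken by 'seconds' of a constant-bitrate stream.
   Assumption: pure CBR audio data, no tags or padding counted. */
uint32_t cbr_bytes(uint32_t bits_per_second, uint32_t seconds)
{
    return bits_per_second / 8u * seconds;
}
```

For 128 Kbps, cbr_bytes(128000, 60) gives 960000 bytes, so a little less than 1 MB per minute; at 256 Kbps the size doubles.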

Anyway, in my opinion, 'tracked music' is the way to go on a DS. MaxMod library, or libXM7, which I wrote myself some time ago, can produce very good quality music even with very little ARM7 CPU load. MaxMod comes with the devkitARM & libnds, and it supports MOD/S3M/XM/IT formats with hardware and software audio mixing. On the other hand, libXM7 only supports MOD and XM formats with hardware mixing only, but the MOD and XM compatibility is very accurate, and it supports the whole range of effects that MOD/XM tunes can use.

Sunday, May 06, 2012

"Peaches the Wale"

A few days ago I received an e-mail from a tiny whale. Well, I have to admit it isn't something I've seen really often. They call her "Peaches the Wale" [sic], and she's a musician who recently composed some MOD tunes on her Commodore Amiga. You can see her in this video.


Now she's been invited to give some concerts around, and she realized it wouldn't be very feasible to drag the Amiga with her... so she found on the Internet the XM/MOD player I wrote for the Nintendo DS using the libXM7 library, but she needed some additional features:
  • the module should load and be ready for replay, instead of starting immediately
  • it should be possible to stop and restart the module from the current pattern
  • it should be possible to skip to next or to previous pattern while stopped or even while playing
  • the program should visually show the number of the current pattern and the total number of patterns in the module
  • the music output should be mono, for her DJ mixer.
I really couldn't refuse giving her my help, so I compiled my player again in a different version with the requested features. I called it XM7dj and of course it's available for everybody who might need it. Download it here. I hope you enjoy!

Thursday, April 26, 2012

about wi-fi capabilities

This took me much longer than planned, really.
First, I had to turn my wife's EeePC into a perfect wi-fi packet-capturing machine. I did this by preparing a 'persistent' Live USB Ubuntu on a USB flash drive, and installing aircrack-ng and Wireshark on it. This made it possible for me to capture every wi-fi packet in the air, including wi-fi management packets, which were the ones I was mainly interested in.
Then I needed a wi-fi access point (or router) because I don't have one at home. Just when I was going to visit my brother and run some tests at his home, his router broke. Lucky, huh? Fortunately I could then borrow one from a co-worker, so I could go on.

The first test I planned to run was to capture the full association process between my DS Lite and the access point using a regular DS wi-fi enabled game. During this process the DS informs the router about its wi-fi capabilities, and I wanted to gather that information. Stephen Stair (sgstair), the author of the dswifi library, says that the DS doesn't seem capable of transmitting packets at data rates other than 1 Mbps or 2 Mbps, but admits he never investigated its receiving speed capabilities... so I decided to start from here.


So my DS Lite informs the router that it can operate at all four 802.11b data rates: 1, 2, 5.5 and 11 Mbps. Nice! I could also see from the captured packets that the DS never sends any packet to the router at data rates higher than 2 Mbps, so sgstair was right about that.
Knowing that the associated device (my DS Lite) can operate at data rates up to 11 Mbps, the router tried to communicate with it using that speed... with no luck at all. After sending some packets and receiving no acknowledgments, the router resent the packets using the lowest data rate possible (1 Mbps) and of course the DS acknowledged that. At this point the router (at least the Netgear I'm using for these tests) decided it wasn't worth continuing to send packets at the highest speed and switched to sending at 5.5 Mbps instead. However, no luck again, as the DS unfortunately didn't acknowledge a single packet sent at that data rate.


So it really looks like the DS is only capable of sending and receiving packets at rates up to 2 Mbps, not faster... which also means that the WFC-enabled game I'm using isn't sending correct capability information to the router. That's not a mistake, though. The reason is that the router I'm using requires that any equipment willing to communicate be able to do so using every data rate in the required subset of data rates, which are the ones with the high-order bit (0x80) set. This subset is called 'BSSBasicRateSet' in case you want to check yourself (I really had to download the whole big bloated 1233-page "IEEE 802.11-2007 Standard" document to check that!) In short, the DS lies to the router so that the router doesn't refuse the connection with the following error: "Association denied due to requesting [device] not supporting all of the data rates in the BSSBasicRateSet parameter".
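The check the router performs can be sketched like this. In the rates list each byte holds the rate in units of 500 kbps in its low 7 bits (so 1, 2, 5.5 and 11 Mbps are 0x02, 0x04, 0x0B and 0x16), and the 0x80 bit marks a rate as part of the BSSBasicRateSet. The function names are made up, and which rates a given router marks as basic is an assumption of the example, not something taken from the capture.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RATE_BASIC_FLAG 0x80u  /* high-order bit: rate belongs to BSSBasicRateSet */
#define RATE_VALUE_MASK 0x7Fu  /* low 7 bits: rate in units of 500 kbps */

/* True if 'rate' (flag already stripped) appears in the station's list. */
static bool station_has_rate(const uint8_t *sta, size_t n, uint8_t rate)
{
    for (size_t i = 0; i < n; i++)
        if ((sta[i] & RATE_VALUE_MASK) == rate)
            return true;
    return false;
}

/* True if the station claims every rate the access point marks as basic. */
bool supports_all_basic_rates(const uint8_t *ap, size_t ap_n,
                              const uint8_t *sta, size_t sta_n)
{
    for (size_t i = 0; i < ap_n; i++) {
        if (!(ap[i] & RATE_BASIC_FLAG))
            continue;  /* optional rate: the station may skip it */
        if (!station_has_rate(sta, sta_n, ap[i] & RATE_VALUE_MASK))
            return false;
    }
    return true;
}
```

With a hypothetical access point that marks all four rates as basic (0x82, 0x84, 0x8B, 0x96), a station honestly declaring only 0x02 and 0x04 fails the check, while one declaring all four passes.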

That's a pity, really. But anyway it was somewhat fun.

Thursday, February 23, 2012

The Unequivocal Answer

Countless times I've read forum topics where the same question was being asked that rarely provided the same answer... often raising a lot of other questions instead.
"What's the fastest?" (we're speaking about memory copy operations here; otherwise, refer to Wikipedia about zoology.)



Since I really believe that it all depends on your needs, I decided to write a program to help you decide what's faster (or 'better') for you. The program copies 64000 bytes allocated in main RAM to a new location that can be either in main RAM or in video RAM. You can change the target by pressing the X key. The program performs the copy many times using different methods applied to different conditions.



First, it does a standard memcpy() 4 times: one copying from a 32 byte boundary address (so that it gets 'cache aligned'), the next from the uncached mirror of the same address, the third from a 'cache unaligned address' and the last one from the uncached mirror of that unaligned address. Of course, copying data from an uncached memory location can lead to some unexpected results, so if you decide that you're going to use that, please remember that your processor cache may contain some data that is still unwritten. Use libnds' DC_FlushRange() when appropriate.



As a second test, the program does the copy using a small asm function I wrote. The function simply loads 32 bytes from memory into 8 CPU registers (a single ldmia opcode), then writes these registers contents to the target memory (an stmia opcode). Loop that 2000 times and the 64000 bytes copy is done. Again, the program runs this test using the previously mentioned 4 different conditions.
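The asm source isn't listed here, but the shape of that loop can be sketched in portable C. This is only an illustration (the name copy_by_32 is made up): whether a compiler turns the eight loads and eight stores into a single ldmia and a single stmia depends on the compiler and its options, which is exactly why the real test uses asm.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy 'bytes' bytes, 32 at a time, through 8 word-sized temporaries.
   Assumptions: both buffers are word aligned and 'bytes' is a multiple
   of 32 (64000 bytes means 2000 iterations). */
void copy_by_32(uint32_t *dst, const uint32_t *src, size_t bytes)
{
    for (size_t n = bytes / 32; n > 0; n--) {
        /* load 8 words (the ldmia part) */
        uint32_t r0 = src[0], r1 = src[1], r2 = src[2], r3 = src[3];
        uint32_t r4 = src[4], r5 = src[5], r6 = src[6], r7 = src[7];
        /* store 8 words (the stmia part) */
        dst[0] = r0; dst[1] = r1; dst[2] = r2; dst[3] = r3;
        dst[4] = r4; dst[5] = r5; dst[6] = r6; dst[7] = r7;
        src += 8;
        dst += 8;
    }
}
```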



Loop unrolling is an often used technique to increase speed, so the program also runs a (cached and aligned) test with a 2x unrolled version of the above function and a test with a 4x unrolled version of it.



Third test: since memory access times are much higher for non-sequential accesses, the program uses a custom asm function that loads 10 CPU registers in a row (again with a single ldmia opcode) to take advantage of the higher ratio of sequential to total reads (and then writes). The program does the copy reading from both the cached main memory address and the uncached mirror address of the same memory location.



Fourth: going on with the previous observation, the program does the copy using a small (512 bytes) DTCM scratch area. So it loads half a kilobyte into this fast memory (thus reading 128 words sequentially from main memory) and then it copies the whole scratch area to the target address. As before, the test runs reading from both the cached main memory address and the uncached mirror of the same address.
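The staging idea looks roughly like this in portable C. It's a sketch under two loud assumptions: the scratch buffer here is an ordinary static array (on the DS it would have to be placed in DTCM, which this code doesn't do), and memcpy() stands in for the asm load/store loops. The name copy_via_scratch is made up.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SCRATCH_BYTES 512

/* Stand-in for the DTCM scratch area; here it's just a static buffer. */
static uint8_t scratch[SCRATCH_BYTES];

/* Copy in two steps per chunk: source -> scratch, then scratch -> target,
   so that each step reads or writes main memory sequentially. */
void copy_via_scratch(void *dst, const void *src, size_t bytes)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    while (bytes > 0) {
        size_t chunk = bytes < SCRATCH_BYTES ? bytes : SCRATCH_BYTES;
        memcpy(scratch, s, chunk);  /* sequential reads from the source  */
        memcpy(d, scratch, chunk);  /* sequential writes to the target   */
        s += chunk;
        d += chunk;
        bytes -= chunk;
    }
}
```

For the 64000 bytes of the test that means 125 full chunks of 512 bytes.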



Last, the program copies those 64000 bytes using dedicated hardware: the DMA. This means that the CPU isn't even aware of the copy going on, and this might trigger some problems if there are still unwritten bytes in the cache, as noted before. Theoretically the CPU should be able to work while the DMA does its task. However, in reality, the DMA locks the bus for its exclusive use, so the CPU can go on working only as long as it doesn't need any bus access. Then it will stall, waiting for the DMA transfer to finish.



Here are some screenshots taken while running the program on my DS Lite.







Figures could change if the program runs on a different DS model, and I would be interested in seeing those figures if they do change a lot. Pressing one of the shoulder keys will save a bitmap on your memory card: if the selected target is main RAM a 'memcpy_mainram.bmp' file will be created, while if the selected target is video RAM then the created file will be 'memcpy_videoram.bmp'.



Finally, here are some considerations you might find interesting:





  • copying using DMA is both the slowest option (when copying to main RAM) and the fastest one (when copying to video RAM)
  • memcpy() gets overtaken by almost every other method, so I guess one should use it only when prototyping or when performance is not an issue
  • reading from an uncached address seems to give some % of boost when using ldmia/stmia; it just makes things worse when using memcpy()
  • loop unrolling doesn't give any advantage when copying to main RAM; on the contrary, it effectively speeds up a bit when copying to video RAM
  • reading from a cache unaligned source address can slow down things a bit, especially when reading from a cached address
  • using a DTCM temporary copy doesn't help


Never trust results taken from any emulator. At least I didn't find any emulator that could provide results resembling the real ones.





If you'd like to run the tests on your own DS, here's the program to download.



Monday, January 23, 2012

Telling DS models apart

Over the years the DS went through some upgrades, which made it lighter and gave it a better LCD (the DS Lite). Then the available memory was increased, as well as the CPU speed (the DSi), and bigger screens were introduced later (the DSi XL). Finally, a stereoscopic screen was added (the 3DS, which is actually more like a new console with DS compatibility). Nintendo also formed a joint venture with a Chinese entrepreneur to distribute its products in China under the iQue name. So we actually also have the iQue DS, iQue DS Lite and iQue DSi, all featuring additional support for the Chinese language, which is not included in Nintendo's 'own' products.

Homebrew programs are generally not aware of the different models. Most of the programs that do run in DSi mode, using the ARM9 CPU at double speed (134 MHz) and the quadrupled main memory size (16 MB), aren't even aware of that. A few programs, however, detect if they're running on a DS Lite or newer, so they can drive the LCD brightness, which isn't possible on the 'original' DS.

The GBAtek document gives some information that could be useful for distinguishing the DS Lite from the 'original' DS model (often called DS 'phat' nowadays). It also contains a minimum of information regarding both iQue DS and iQue DS Lite models, but it lacks anything from DSi on. So I tried to collect as much data as possible from friends and on-line DS user communities... I would have expected some more help from the latter, but... well, ok.

From the details I collected it seems that the 01Dh location ('Console type') in the firmware header could also be used to detect if the model is a DSi, but unfortunately, it doesn't give any hint to distinguish a 'normal' DSi from a DSi XL... or from a 3DS emulating a DSi. They all appear the same (the value there is always 0x57). I then noticed that at the 02Fh location ('Wifi version') I get different values for different console models: all the DSi units I have had tested have 0x0F, the DSi XL units have 0x18, and the 3DS units have 0x1C.

I thus made a program (download it here) that should tell you which DS model you're holding in your hand. In case it's a DSi / DSi XL / 3DS the program will also tell you if your homebrew cartridge is working in DSi mode or not.

Unfortunately, I still don't know anything about the iQue DSi, and I also don't know if there is any other possible value for 'Wifi version' except for the ones I collected, so I simply assumed that any different value found should be treated as the closest lower known value. If you have an iQue DSi or if you're running my program and it gives the wrong answer, I'd really appreciate it if you could run this little program and let me know what it writes on screen and which console model it is running on.

Finally, here's the source code of the detection routine, which has to run on the ARM7. It uses the boolean variable __dsimode that is provided directly by libnds. I wrote the code using a sort of a 'successive refinement' so that if it misses the target it shouldn't miss it by too much. Hopefully. Use it as you wish in your own program, at your own risk.

#include <nds.h>

#define MODEL_NINTENDO_DS       0
#define CHINESE_SUPPORT         1
#define MODEL_IQUE_DS           (MODEL_NINTENDO_DS+CHINESE_SUPPORT)
#define MODEL_NINTENDO_DS_LITE  2
#define MODEL_IQUE_DS_LITE      (MODEL_NINTENDO_DS_LITE+CHINESE_SUPPORT)
#define MODEL_NINTENDO_DSI      4
// #define MODEL_IQUE_DSI       (MODEL_NINTENDO_DSI+CHINESE_SUPPORT)
#define MODEL_NINTENDO_DSI_XL   6
#define MODEL_NINTENDO_DSI_LL   MODEL_NINTENDO_DSI_XL
#define MODEL_NINTENDO_3DS      8

#define MODE_DS   0
#define MODE_DSI  1

// model and mode packed together in a single 32-bit value
typedef union {
    struct {
        u8 model;
        u8 flags;
        u8 padding[2];
    };
    u32 packed;
} HwInfo;

extern bool __dsimode;

u32 getDSmodel(void) {

    HwInfo hwinfo;
    u8 fw1D;   // firmware header 01Dh, 'Console type'
    u8 fw2F;   // firmware header 02Fh, 'Wifi version'

    // reset
    hwinfo.packed=0;

    // read two firmware bytes we might need
    readFirmware (0x1D, &fw1D, 1);
    readFirmware (0x2F, &fw2F, 1);

    // check if we're in DSi mode or not
    if (__dsimode) {
        hwinfo.flags=MODE_DSI;
        hwinfo.model=MODEL_NINTENDO_DSI; // shortcut if we're in DSi mode
    } else {
        hwinfo.flags=MODE_DS;
        hwinfo.model=MODEL_NINTENDO_DS;
    }

    // check if it's a DS Lite
    if ((hwinfo.model==MODEL_NINTENDO_DS) && (readPowerManagement(4) & 0x40))
        hwinfo.model=MODEL_NINTENDO_DS_LITE;

    // check if it's a DSi
    if ((hwinfo.model==MODEL_NINTENDO_DS_LITE) && (fw1D & 0x04))
        hwinfo.model=MODEL_NINTENDO_DSI;

    // check if it's an iQue DS
    if ((hwinfo.model==MODEL_NINTENDO_DS) && (fw1D!=0xFF) && (fw1D & 0x03))
        hwinfo.model=MODEL_IQUE_DS;

    // check if it's an iQue DS Lite
    if ((hwinfo.model==MODEL_NINTENDO_DS_LITE) && (fw1D & 0x03))
        hwinfo.model=MODEL_IQUE_DS_LITE;

    // NO iQue DSi detection, yet...

    // check if it's a 3DS (must come before the DSi XL check, which
    // would otherwise also match because 0x1C >= 0x18)
    if ((hwinfo.model==MODEL_NINTENDO_DSI) && (fw2F>=0x1c))
        hwinfo.model=MODEL_NINTENDO_3DS;

    // check if it's a DSi XL
    if ((hwinfo.model==MODEL_NINTENDO_DSI) && (fw2F>=0x18))
        hwinfo.model=MODEL_NINTENDO_DSI_XL;

    return (hwinfo.packed);
}
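
The routine returns the model and the mode packed in one 32-bit value, so whoever receives it has to unpack it again. Here's a small sketch of how that could be done. It isn't part of the detection program: the function names are made up, the union is redeclared with standard types so that the snippet compiles on its own (the anonymous struct needs C11 or GNU C), and the numbers mirror the MODEL_* and MODE_* defines above.

```c
#include <stdint.h>

/* Same layout as HwInfo above, using <stdint.h> types. */
typedef union {
    struct {
        uint8_t model;
        uint8_t flags;
        uint8_t padding[2];
    };
    uint32_t packed;
} HwInfoView;

/* Human-readable name for the model stored in a packed value. */
const char *model_name(uint32_t packed)
{
    HwInfoView h;
    h.packed = packed;
    switch (h.model) {
        case 0:  return "Nintendo DS";
        case 1:  return "iQue DS";
        case 2:  return "Nintendo DS Lite";
        case 3:  return "iQue DS Lite";
        case 4:  return "Nintendo DSi";
        case 6:  return "Nintendo DSi XL";
        case 8:  return "Nintendo 3DS";
        default: return "unknown";
    }
}

/* Non-zero if the packed value says the program runs in DSi mode. */
int runs_in_dsi_mode(uint32_t packed)
{
    HwInfoView h;
    h.packed = packed;
    return h.flags == 1;  /* MODE_DSI */
}
```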

Monday, January 09, 2012

Mandelbrot fractals

I've always loved fractals... I mean, who doesn't?

Last Wednesday, after watching this video on Vimeo (which I suggest that you check out with your speakers on) I recalled I had written a Mandelbrot set explorer program once on my 386. It was some time ago, back in 1997 I guess. It took me a while back then.

This time I started from the pseudocode provided by Wikipedia's 'Mandelbrot set' page, which implements the so-called "escape time" algorithm. You set a maximum number of iterations and do some math. If after those iterations the resulting point still falls close to the axis origin (0,0), the algorithm colors the point black. If the point leaves that area, it is instead colored differently depending on how many iterations were needed before it left the origin's surrounding area.
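In C, the escape time loop looks more or less like this. It's a sketch that follows the usual form of that pseudocode, not the source of my program; the function name is made up, and it uses single precision floats and a caller-supplied iteration limit.

```c
/* Escape time for the point (x0, y0): iterate z = z*z + c starting from
   z = 0 and count how many steps it takes for |z| to exceed 2.
   Returns max_iter if the point never escapes within the limit. */
int escape_time(float x0, float y0, int max_iter)
{
    float x = 0.0f, y = 0.0f;
    int i = 0;

    /* |z| <= 2 is tested as x*x + y*y <= 4 to avoid a square root */
    while (x * x + y * y <= 4.0f && i < max_iter) {
        float xt = x * x - y * y + x0;  /* real part of z*z + c      */
        y = 2.0f * x * y + y0;          /* imaginary part of z*z + c */
        x = xt;
        i++;
    }
    return i;
}
```

A return value equal to max_iter means 'paint it black'; any smaller value is used to pick the color.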

You wouldn't expect to see such amazing images, really.

The program I made starts showing a small area around the origin: from (-2.5,-1.5) to (+1.5,+1.5)... this gives the first image. Clicking on any point of the touchscreen, it will start recalculating a new image, taking the touched point as a new center of the image and zooming in 2x. After 5 clicks you will see for instance what's shown on the second image. Not bad, right?
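The touch-to-zoom logic boils down to mapping a pixel to a point of the plane and halving the visible area around it. Here's a sketch of that, assuming a 256x192 screen and a view stored as center plus size; the struct and function names are made up, and mapping the top of the screen to the lowest y value is an arbitrary choice of this example.

```c
#define SCREEN_W 256
#define SCREEN_H 192

/* Visible area: center point plus width and height in plane units. */
typedef struct {
    float cx, cy;
    float w, h;
} View;

/* Initial view: from (-2.5,-1.5) to (+1.5,+1.5). */
View view_initial(void)
{
    View v = { -0.5f, 0.0f, 4.0f, 3.0f };
    return v;
}

/* Map a screen pixel to the corresponding point of the plane. */
void view_pixel_to_point(const View *v, int px, int py, float *x, float *y)
{
    *x = v->cx - v->w / 2.0f + (float)px * v->w / SCREEN_W;
    *y = v->cy - v->h / 2.0f + (float)py * v->h / SCREEN_H;
}

/* New view centered on the touched pixel, zoomed in 2x. */
View view_zoom_in(const View *v, int px, int py)
{
    View z;
    view_pixel_to_point(v, px, py, &z.cx, &z.cy);
    z.w = v->w / 2.0f;
    z.h = v->h / 2.0f;
    return z;
}
```

After 5 touches the view is 32 times narrower than the initial one: 0.125 units wide instead of 4.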

Then you can press either of the two shoulder buttons to zoom out, or press the start button to reset the program to the initial setting.

Anyway, the Nintendo DS has some limitations. Of course it doesn't feature a superscalar quad-core 2 GHz+ processor. It only has a 67 MHz ARM946E, which also has no floating point unit at all, so each operation on floating point variables doesn't turn into a single (co)processor opcode, but into a series of integer operations. So, to keep the image generation time acceptable, I had to limit the maximum number of iterations of the aforementioned algorithm (actually to a very small value: 32). I also had to opt for single precision floating point variables, the fastest choice available.

The following two images show both the limits. After some zooming, you won't see any other color following the 'pink, cyan, green' sequence... and if you keep on zooming in, you'll see that the calculations will start to lose precision, and the resulting image quality will become poor.

Of course I might consider working on a version that increases the number of iterations upon request OR 'when necessary'. I might also consider switching from single precision floating point variables to double precision... or, maybe even better, consider using an arbitrary precision floating point numbers library, such as GMP (The GNU Multiple Precision Arithmetic Library).

Well, let's see :)

In the meanwhile, if you want to play it yourself, you can download the program here.

Feel free to leave a comment about it if you want.


Wednesday, January 04, 2012

Welcome!

Hi! This is the first post on my first personal blog ever and, if I don't get bored too quickly, I already know I'll be smiling when reading this again in some 10 or 20 years. :)

Here I will publish some news about me and what has been my passion for some years now: Nintendo DS Homebrewing. Yes, I know it's 2012 and it's already sunset for the NDS, but over these years I've loved that console (and I still love it) so much that I feel I need to share.

Well, see you later! :)