A DS Homebrewer's Diary: alignment

Countless times I've read forum topics asking the same question, and they rarely arrived at the same answer... often raising a lot of other questions instead.
"What's the fastest?" (we're speaking about memory copy operations here; otherwise, refer to Wikipedia about zoology.)

Since I really believe that it all depends on your needs, I decided to write a program to help you decide what's faster (or 'better') for you. The program copies 64000 bytes allocated in main RAM to a new location that can be either in main RAM or in video RAM. You can change the target by pressing the X key. The program performs the copy many times, using different methods under different conditions.

First, it does a standard memcpy() 4 times: once copying from an address on a 32-byte boundary (so that it's 'cache aligned'), then from the uncached mirror of the same address, then from a 'cache unaligned' address, and finally from the uncached mirror of that unaligned address. Of course, copying data from an uncached memory location can lead to unexpected results, so if you decide to use it, please remember that your processor cache may still hold data that hasn't been written back to memory yet. Use libnds' DC_FlushRange() when appropriate.
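The four source addresses used in this test can be derived from one aligned base with plain address arithmetic. The sketch below is only an illustration: the 32-byte line size comes from the text, while the mirror offset 0x00400000 and the 4-byte offset used to make an address 'unaligned' are assumptions (check how your toolchain maps main RAM before relying on them), and all function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_SIZE        32u          /* from the text: 32-byte boundary */
#define UNCACHED_MIRROR_OFFSET 0x00400000u  /* ASSUMPTION: uncached mirror of main RAM */
#define UNALIGNED_SHIFT        4u           /* ASSUMPTION: any non-multiple of 32 would do */

/* Nonzero when addr sits on a 32-byte (cache line) boundary. */
static int is_cache_aligned(uint32_t addr)
{
    return (addr & (CACHE_LINE_SIZE - 1u)) == 0u;
}

/* Same location, seen through the (assumed) uncached mirror. */
static uint32_t uncached_mirror(uint32_t addr)
{
    return addr | UNCACHED_MIRROR_OFFSET;
}

/* Fills out[] with the four source addresses of the memcpy() test. */
static void make_test_sources(uint32_t aligned_base, uint32_t out[4])
{
    assert(is_cache_aligned(aligned_base));
    out[0] = aligned_base;                                     /* cached, aligned     */
    out[1] = uncached_mirror(aligned_base);                    /* uncached, aligned   */
    out[2] = aligned_base + UNALIGNED_SHIFT;                   /* cached, unaligned   */
    out[3] = uncached_mirror(aligned_base + UNALIGNED_SHIFT);  /* uncached, unaligned */
}
```

On real hardware, remember that anything read through the uncached mirror bypasses the cache, so the source range should be flushed first.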

As a second test, the program does the copy using a small asm function I wrote. The function simply loads 32 bytes from memory into 8 CPU registers (a single ldmia opcode), then writes the contents of those registers to the target memory (an stmia opcode). Loop that 2000 times and the 64000-byte copy is done. Again, the program runs this test under the 4 conditions mentioned above.
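For readers who don't speak ARM assembly, here is a portable C stand-in for that loop. It is not the asm function from the program, just a sketch of the same idea: eight word reads in a row, then eight word writes, repeated until the buffer is done. The function name is hypothetical, and a compiler is free to turn this into something other than a single ldmia/stmia pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copies len bytes in 32-byte blocks and returns the number of loop iterations.
   len must be a multiple of 32; this sketch does not handle a tail. */
static size_t copy_blocks32(uint32_t *dst, const uint32_t *src, size_t len)
{
    size_t iterations = 0;

    assert(len % 32u == 0u);

    for (size_t remaining = len; remaining != 0u; remaining -= 32u) {
        /* "ldmia": read eight consecutive words into locals (registers, hopefully) */
        uint32_t r0 = src[0], r1 = src[1], r2 = src[2], r3 = src[3];
        uint32_t r4 = src[4], r5 = src[5], r6 = src[6], r7 = src[7];

        /* "stmia": write the eight words to the target */
        dst[0] = r0; dst[1] = r1; dst[2] = r2; dst[3] = r3;
        dst[4] = r4; dst[5] = r5; dst[6] = r6; dst[7] = r7;

        src += 8;
        dst += 8;
        iterations++;
    }
    return iterations;
}
```

Called with len = 64000, the loop body runs 2000 times, matching the count given above.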

Loop unrolling is an often used technique to increase speed, so the program also runs a (cached and aligned) test with a 2x unrolled version of the above function and a test with a 4x unrolled version of it.
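As an illustration of what '2x unrolled' means here, the following portable C sketch moves two 32-byte blocks per loop iteration, so the loop overhead (counter update and branch) is paid half as often. It is a stand-in with hypothetical names, not the asm used by the program; a 4x version would simply repeat the block copy four times and step by 128 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Moves one 32-byte block: eight word reads followed by eight word writes. */
static void copy8(uint32_t *dst, const uint32_t *src)
{
    uint32_t r0 = src[0], r1 = src[1], r2 = src[2], r3 = src[3];
    uint32_t r4 = src[4], r5 = src[5], r6 = src[6], r7 = src[7];

    dst[0] = r0; dst[1] = r1; dst[2] = r2; dst[3] = r3;
    dst[4] = r4; dst[5] = r5; dst[6] = r6; dst[7] = r7;
}

/* 2x unrolled copy: two 32-byte blocks per iteration.
   len must be a multiple of 64. Returns the number of loop iterations. */
static size_t copy_blocks32_unrolled2(uint32_t *dst, const uint32_t *src, size_t len)
{
    size_t iterations = 0;

    assert(len % 64u == 0u);

    for (size_t remaining = len; remaining != 0u; remaining -= 64u) {
        copy8(dst, src);          /* first block  */
        copy8(dst + 8, src + 8);  /* second block */
        src += 16;
        dst += 16;
        iterations++;
    }
    return iterations;
}
```

For the 64000-byte buffer this takes 1000 iterations instead of 2000; whether that pays off depends on the target memory, as the results further down show.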

Third test: since memory access times are much higher for non-sequential accesses, the program uses a custom asm function that loads 10 CPU registers in a row (again with a single ldmia opcode) to take advantage of the higher ratio of sequential to total reads (and then writes). The program does the copy reading from both the cached main memory address and the uncached mirror address of the same memory location.
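The arithmetic behind that ratio can be sketched as follows. It rests on a simplifying assumption that is not taken from the program: in a burst of N consecutive words, only the first access counts as non-sequential and the remaining N-1 are sequential. Cache effects and the write side are ignored, and the function names are hypothetical.

```c
#include <assert.h>

/* ASSUMPTION: in a burst of `words` consecutive words, all but the first
   access are sequential. */
static unsigned sequential_accesses(unsigned words)
{
    assert(words > 0u);
    return words - 1u;
}

/* Number of bursts needed to read total_bytes with `words` 32-bit registers
   per burst. total_bytes must be a whole number of bursts. */
static unsigned bursts_needed(unsigned total_bytes, unsigned words)
{
    unsigned burst_bytes = words * 4u;

    assert(words > 0u);
    assert(total_bytes % burst_bytes == 0u);
    return total_bytes / burst_bytes;
}

/* Under the assumption above, one non-sequential read is paid per burst. */
static unsigned nonsequential_reads(unsigned total_bytes, unsigned words)
{
    return bursts_needed(total_bytes, words);
}
```

With 8 registers per burst, 7 of every 8 reads are sequential and the 64000-byte copy needs 2000 bursts; with 10 registers it is 9 of every 10 reads and 1600 bursts.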

Fourth: following on from the previous observation, the program does the copy using a small (512 bytes) DTCM scratch area. It loads half a kilobyte into this fast memory (thus reading 128 words sequentially from main memory) and then copies the whole scratch area to the target address. As before, the test runs reading both from the cached main memory address and from the uncached mirror of the same address.
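The staging idea can be sketched in portable C like this. The scratch buffer here is an ordinary static array standing in for DTCM (on the DS it would have to be placed in DTCM by the toolchain, which this sketch does not attempt), and the function name is hypothetical. Note also that an optimizing compiler may merge the two inner loops, so this shows the structure of the copy rather than its performance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCRATCH_WORDS 128u  /* 512 bytes, as in the text */

/* Stand-in for the DTCM scratch area (ASSUMPTION: plain static memory here). */
static uint32_t scratch[SCRATCH_WORDS];

/* Copies `words` 32-bit words from src to dst, 128 words at a time, staging
   each chunk in the scratch buffer. `words` must be a multiple of 128.
   Returns the number of chunks moved. */
static size_t copy_via_scratch(uint32_t *dst, const uint32_t *src, size_t words)
{
    size_t chunks = 0;

    assert(words % SCRATCH_WORDS == 0u);

    while (words != 0u) {
        /* Phase 1: read 128 consecutive words from the source. */
        for (size_t i = 0; i < SCRATCH_WORDS; i++)
            scratch[i] = src[i];

        /* Phase 2: write the whole scratch area to the target. */
        for (size_t i = 0; i < SCRATCH_WORDS; i++)
            dst[i] = scratch[i];

        src += SCRATCH_WORDS;
        dst += SCRATCH_WORDS;
        words -= SCRATCH_WORDS;
        chunks++;
    }
    return chunks;
}
```

The 64000-byte buffer is 16000 words, so the copy is done in 125 chunks.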

Last, the program copies those 64000 bytes using dedicated hardware: the DMA. This means that the CPU isn't even aware of the copy going on, which might cause problems if there are still unwritten bytes in the cache, as noted before. Theoretically the CPU should be able to work while the DMA does its task. In reality, however, the DMA locks the bus for its exclusive use, so the CPU can keep working only as long as it doesn't need any bus access. Once it does, it stalls until the DMA transfer finishes.

Here are some screenshots taken while running the program on my DS Lite.

Figures could change if the program runs on a different DS model, and I would be interested in seeing those figures if they do change a lot. Pressing one of the shoulder keys will save a bitmap on your memory card: if the selected target is main RAM a 'memcpy_mainram.bmp' file will be created, while if the selected target is video RAM then the created file will be 'memcpy_videoram.bmp'.

Finally, here are some considerations you might find interesting:

Copying using DMA is both the slowest option (when copying to main RAM) and the fastest one (when copying to video RAM)

memcpy() gets overtaken by almost every other method, so I guess one should use it only when prototyping or when performance is not an issue

Reading from an uncached address seems to give a small percentage boost when using ldmia/stmia; it just makes things worse when using memcpy()

Loop unrolling doesn't give any advantage when copying to main RAM; it does, however, speed things up a bit when copying to video RAM

Reading from a cache unaligned source address can slow things down a bit, especially when reading from a cached address

Using a DTCM temporary copy doesn't help

Never trust results taken from any emulator. At least I didn't find any emulator that could provide results resembling the real ones.

If you'd like to run the tests on your own DS, here's the program to download.

Thursday, February 23, 2012

The Unequivocal Answer