Just skimmed over the new emulator discussion rev, didn't see this trick mentioned.
Under Linux (and, theoretically, to a limited degree under Win2k and up) it is possible to emulate the PRG bankswitching with the x86's MMU. With this trick, any access that hits PRG or RAM can be implemented as a direct memory access.
As an upside, it makes N. Bradley's non-bankswitched 6502 core usable for a NES emulator.
You still have to detect mapper writes, PPU/APU register accesses, and depending on how you approach things, the RAM area.
All speed advantages go out the window if you're attempting to emulate the consequences of executing code from areas that are not RAM, SRAM, or PRG.
Yes, I know this is obscenely platform dependent.
Now for the interesting parts.
On the OS end, it is basically an abuse of the mmap(2) syscall. There are two approaches: one for ancient (pre-2.3) kernels, one for newer ones. (The kernel developers removed support for the old trick around 2.2.30-something.)
The remapping mmap() calls need MAP_FIXED and, IIRC, MAP_PRIVATE. You can use mmap() with MAP_ANON to grab chunks of address space that aren't backed by a file.
The old trick:
Open /proc/self/mem and /dev/zero. Use mmap() on /dev/zero to allocate chunks of memory for the PRG, the SRAM, and 4k for the RAM. Also grab a 64k chunk of address space.
Use mmap() with /proc/self/mem to remap pages from the PRG, SRAM, and RAM areas into that 64k chunk of memory.
The new trick:
Fairly similar, but /proc/self/mem lost mmap() support. The new way requires you to link against librt to get POSIX SHM; librt has existed since at least glibc 2.1.x. IIRC, you use shm_open(3) to grab a chunk of shared memory that can hold the RAM, SRAM, and PRG. Grab a 64k chunk of address space, then use mmap() on the SHM block to map things in.
POSIX SHM is required because shm_open() returns a file descriptor, and the old SysV SHM is not compatible with mmap(). Using a regular disk file for the remapping, old-trick style, could do bad things when the kernel attempts to write the dirtied pages back.
The CPU core end:
For NB's core, I left the opcode-fetch handler alone. I had write handlers for the RAM area, the registers, and the mapper area; reads that did not hit registers went direct to main memory. I made slight modifications to the direct ZP and stack accesses so that every write also wrote to that address ^ 0x800, to handle that half of the RAM mirror.
The RAM-area write handler doubled up writes the same way the ZP and stack accesses did -- I worked under the assumption that reads are more abundant than writes, so the cost of the extra write is acceptable given that no read ever has to be masked down.
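In C, the doubled-up write looks roughly like this (a sketch; the flat `mem` array stands in for the mapped 64k window):

```c
/* Every store to the 2k internal RAM also hits address ^ 0x800, so
 * both halves of the 4k RAM page stay coherent and reads never need
 * to mask the address down. */
#include <stdint.h>

static uint8_t mem[0x10000];   /* stands in for the 64k window */

static void ram_write(uint16_t addr, uint8_t val)
{
    mem[addr] = val;
    mem[addr ^ 0x800] = val;   /* keep the 2k mirror in sync */
}

static uint8_t ram_read(uint16_t addr)
{
    return mem[addr];          /* direct read, no masking */
}
```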
Why this works:
The CPU's cache tags lines by their physical address, so two virtual aliases of the same page hit the same cache lines. The remaps most likely DO cause a TLB flush, but I'm fairly certain the savings win out.
Other ways you could go with this:
I doubt it would be an efficiency win, but if one were so inclined, you could install a SIGSEGV handler, mark the pages you want to catch accesses to as either read-only or no-access, and eliminate the address-range check in your memory-access handler.
Generally, the first 128MB or so of the process address space is unused, so you CAN map your 64k in at virtual address 0. Your core can then forego remapping pointer values entirely and just use them directly. Unless you take special care to keep the upper halves of your registers clean, you will need movzx when forming addresses, or 16-bit addressing -- but movzx is fast on the P6 and up.
The PPU pages are swapped a bit too frequently for this to win out, I *think*; I did not try it on them. Using this for mappers like the MMC3, with its 1k banking resolution, takes a bit of trickery -- probably pre-decoding the CHR to 8bpp. Yes, that's rather wasteful, given that 6 of those bits are unused, but it bumps the MMC3's minimum bank size up to an x86-friendly 4k and leaves the lines ready for a movq.