IMO, the best way to render NES graphics is to:
-set up a 32-scanline buffer in memory. This will serve as your working image buffer. It is limited to 32 scanlines to greatly increase the chances of it staying in the processor's L1 cache.
-design a tile rendering algorithm which can accept a scanline range to render in (doing all 8 scanlines yields the best performance, but the game will be in charge of this).
-to avoid the relatively large overhead in the bit assembling & palette lookup part of pattern table processing, follow these steps:
*reorganize the game's pattern table data so that sequential pixels in a tile's bitmap are now 8 bits apart, and store the bits of both bit planes together. This will allow you to store 4 pixels in a single byte. This reordering will also complicate your $2007 handling code when the game writes to pattern tables, but this is really nothing in terms of lost performance.
*become familiar with x86 MMX instructions. Since the MMX registers are 8 bytes wide, and you can operate on the data in them as if they were 8 separate byte values, this allows you to do add and compare operations on 8 bytes at a time. Additionally, now that your bitmap data is arranged in memory so that sequential bitmap pixels are byte-aligned, this means that when you load 8 bytes of your pattern into an MMX register, every byte contains the pixel data in the order it needs to be in when it's stored out to memory.
*for the innermost loop of your playfield renderer, study the following code (this code will produce 8 sequential 8-bit pixels in only 10 clocks on an Athlon!). However, one thing this code lacks is compensation for unaligned pixel storage (which will happen with any fine scroll value other than 0). The penalty is probably cheap on Athlons, but on P3s, it's pretty significant.
;appropriate pattern table data has been fetched & loaded into mm0.
;innermost loop begins here.
;these instructions duplicate the data into 3 other registers.
;these instructions use (signed) greater than comparisons to determine whether to set 0, 1, 2 or all 3 registers to either all 0's, or all 1's, based on the upper 2 bits of each byte in the registers. This will be used later to select the 4-color value to use using logical instructions.
pcmpgtb mm1,Level1; 4040404040404040
pcmpgtb mm2,Level2; 8080808080808080
pcmpgtb mm3,Level3; C0C0C0C0C0C0C0C0
;these instructions mask the selected data with the palette data of each element of the selected palette, referenced via EAX. Note that this technique requires a single byte palette entry to be duplicated across 8 sequential bytes. Also, XORs are used to merge the selected data. Since more than one palette data type may be selected, it is required that your $2007 handler XOR some of these palette elements with each other in advance so that when this routine executes, it cancels out the unwanted values.
;the rest of the code is straightforward. the next instruction shifts the master copy of the data left 2 positions, therefore shifting the next scanline of pixels in the tile bitmap into the 2 MSB positions of each byte element.
pxor mm1,mm2; a final merge operation
;the store out of the pixels, and the pixel pointer scanline increment
;Finally, this instruction simply exists to perform the address wrap-around calculation needed for a 32-scanline buffer.
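As a rough cross-check of the data layout, here is a scalar C model of the same idea. It is only a sketch under assumptions of mine: I'm assuming each 8-byte group covers 4 scanlines of a tile, that byte x of a group holds column x, and that the first scanline of the group sits in the 2 MSBs of each byte. The names (repack_tile, draw_group, BUF_WIDTH) are made up for illustration, and the palette lookup is a plain table index rather than the compare/mask/XOR trick used in the MMX code.

```c
#include <stdint.h>

#define BUF_LINES 32   /* the 32-scanline working buffer */
#define BUF_WIDTH 256  /* hypothetical buffer width in pixels */

/* Repack one 16-byte NES tile (8 rows of bit plane 0, then 8 rows of
   bit plane 1) into two 8-byte groups. Assumed layout: group g holds
   rows 4g..4g+3, byte x holds column x, and row r of the group occupies
   bits 7-2r (plane 1) and 6-2r (plane 0). */
void repack_tile(const uint8_t src[16], uint8_t dst[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = 0;
    for (int row = 0; row < 8; row++) {
        for (int x = 0; x < 8; x++) {
            int lo = (src[row] >> (7 - x)) & 1;      /* plane 0 bit */
            int hi = (src[row + 8] >> (7 - x)) & 1;  /* plane 1 bit */
            int shift = 6 - 2 * (row & 3);
            dst[(row >> 2) * 8 + x] |= (uint8_t)(((hi << 1) | lo) << shift);
        }
    }
}

/* Scalar model of the inner loop for one 8-byte group: for each scanline,
   look up the colour selected by the 2 MSBs of every byte, store 8 pixels,
   shift the next scanline into the MSBs, and step to the next buffer line
   with wrap-around. The caller must keep x0 + 8 <= BUF_WIDTH and
   nlines <= 4. */
void draw_group(uint8_t buf[BUF_LINES][BUF_WIDTH], int x0, int line,
                const uint8_t group[8], const uint8_t pal[4], int nlines)
{
    uint8_t work[8];
    for (int x = 0; x < 8; x++)
        work[x] = group[x];
    for (int n = 0; n < nlines; n++) {
        uint8_t *dst = &buf[line & (BUF_LINES - 1)][x0];
        for (int x = 0; x < 8; x++) {
            dst[x] = pal[work[x] >> 6];
            work[x] = (uint8_t)(work[x] << 2);
        }
        line++;
    }
}
```

Starting a 2-scanline draw at buffer line 31 puts the second scanline on line 0, which is the wrap-around the last instruction in the listing takes care of.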
-use 2 extra bits in each pixel byte to indicate playfield transparency status, and object present status. This is required, because of the way the PPU prioritizes OBJs & the PF. So, after you've drawn 32 scanlines worth of playfield tiles, render objects on top (those which fall within the range of the first 32 scanlines), and render them in the order from 0->63 (not 63->0, as you may have guessed; this will be explained later). Before writing object pixels, read in the pixels from the area where the object is to be placed, and use MMX logical instructions to "choose" which data to use between the playfield and the objects. Basically you can use the same code to render object pixels as is listed above for the playfield, but a few things need to be added.
*conditional byte swapping of all 8 bytes will be necessary to implement horizontal inversion of objects. Since MMX instructions don't provide an easy way of doing this, you'll be stuck with implementing your own algorithm here.
*the following lines of pseudo-code demonstrate the logic behind choosing between OBJ or PF pixel data to output to the image buffer. Although IF statements are used here, it is expected that you would convert this into the equivalent MMX code to have it operate on 8 pixels simultaneously.
DestPixel.data := SrcOBJpixel.data
DestPixel.OBJxpCond := FALSE
So, as you can see, the destination's OBJxpCond is marked as false, even if the object's pixel is not meant to be drawn. This is to prevent the pixels of lower priority (numerically higher-numbered) objects from being drawn in those locations.
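For reference, here is how I'd write the same selection rule out in scalar C for a single pixel. Treat it as a sketch only: the bit positions (PF_XP, OBJ_XP), the 6-bit colour field and the function name are assumptions of mine, and the conditions are my reading of the rule described above (an opaque object pixel always claims the location, but its colour is only written if no earlier object claimed it and either the playfield is transparent there or the object has foreground priority).

```c
#include <stdint.h>

/* Hypothetical pixel byte layout for this sketch. */
#define COLOR_MASK 0x3F  /* low 6 bits: colour */
#define PF_XP      0x40  /* set: playfield pixel is transparent here */
#define OBJ_XP     0x80  /* set: no object has claimed this pixel yet */

/* Merge one object pixel into one destination pixel. Objects are assumed
   to be processed in the order 0->63, so the first opaque object pixel at
   a location locks out every later (lower priority) object there. */
uint8_t merge_obj_pixel(uint8_t dest, uint8_t obj_color,
                        int obj_opaque, int obj_foreground)
{
    if (!obj_opaque)
        return dest;  /* transparent object pixel: nothing changes */

    if ((dest & OBJ_XP) && ((dest & PF_XP) || obj_foreground))
        dest = (uint8_t)((dest & ~COLOR_MASK) | (obj_color & COLOR_MASK));

    /* Claim the location even when the colour was not written. */
    return (uint8_t)(dest & ~OBJ_XP);
}
```

Running object 0 (background priority) and then object 1 (foreground priority) over an opaque playfield pixel leaves the playfield colour in place, which is the priority clash behaviour: the earlier object hides the later one without being visible itself.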
This may raise the question, "Why do you render objects in the order of 0->63 (effectively requiring 2 bits for transparency status), when you can render them in the opposite direction (which only requires 1 bit for transparency status)?" The answer is because of what happens on a priority clash (see the "PPU pixel priority quirk" section of the "2C02 technical operation" document). Rendering objects in order of 0->63 is the only way to emulate this PPU feature properly (and some games DO depend on the functionality of this, as it provides a way to force the playfield to hide foreground priority object pixels). Otherwise (for 63->0), it would be necessary to merge objects to a frame buffer filled with the current transparency color, and then merge playfield data with the buffer as well. Granted, this technique will only require 1 transparency (background priority) status bit per pixel, but since merge operations are slow, and this technique requires way more of them (since now they're required for the PF rendering also), this technique is inferior to the aforementioned one.
-The final tip is to buffer all data that your CPU sends to the PPU (with cycle count info), and process it at the end of the frame. Why? Because this reduces the complexity of your emulator's program control, and it also helps keep data structures in the cache more organized, which results in better performance.
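A write buffer like this could look something like the following in C. This is just a minimal sketch: the structure names, the fixed capacity, and the idea of walking the log in spans bounded by a cycle limit are my own assumptions, and the cycle stamp is in whatever unit your CPU core counts.

```c
#include <stddef.h>
#include <stdint.h>

#define PPU_LOG_CAPACITY 1024  /* arbitrary size chosen for this sketch */

/* One buffered CPU->PPU write, stamped with the cycle it happened on. */
typedef struct {
    uint32_t cycle;
    uint16_t addr;   /* e.g. $2000-$2007 */
    uint8_t  value;
} PpuWrite;

typedef struct {
    PpuWrite entries[PPU_LOG_CAPACITY];
    size_t   count;
} PpuWriteLog;

/* Empty the log; call this once the frame has been processed. */
void ppu_log_reset(PpuWriteLog *log)
{
    log->count = 0;
}

/* Append a write during CPU emulation. Returns 1 on success, or 0 if the
   log is full (a real emulator would have to flush early in that case). */
int ppu_log_write(PpuWriteLog *log, uint32_t cycle, uint16_t addr,
                  uint8_t value)
{
    if (log->count >= PPU_LOG_CAPACITY)
        return 0;
    log->entries[log->count].cycle = cycle;
    log->entries[log->count].addr = addr;
    log->entries[log->count].value = value;
    log->count++;
    return 1;
}

/* Starting at index `start`, return the index of the first entry whose
   cycle stamp is >= cycle_limit (or log->count if there is none). Entries
   in [start, result) are the writes to apply before rendering up to that
   point in the frame. */
size_t ppu_log_span_end(const PpuWriteLog *log, size_t start,
                        uint32_t cycle_limit)
{
    size_t i = start;
    while (i < log->count && log->entries[i].cycle < cycle_limit)
        i++;
    return i;
}
```

At the end of the frame, the renderer would step through the log one span at a time, apply the writes in each span, and render the scanline range that lies between one span and the next.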