NESDev and Strangulation Records messageboards
Forum Index | FAQ | New User | Login | Search

Previous ThreadView All ThreadsNext Thread*Show in Threaded Mode


SubjectPixel Writing new  
Posted byLaughy
Posted on8/19/03 03:38 AM
From IP67.121.107.37  



I was wondering how people handle the pure pixel core of their ppus. Since the instructions within this core are executing more than 3,500 times a millisecond, it is imperative they run as fast as possible. I imagine the fastest one could hope to do their pixel writing would be:

get 1st pixel color from the tables and shift it left by one
get 2nd pixel color from the tables and add it to the 1st
add the total to the color for the 8x8 pixel square the pixel is in
Get the NES color by going into memory and taking Memory[$3F00 + Color]
Index this color into an RGB Table
Set a screen pixel buffer to the Integer RGB Table value

I was wondering if anyone had any tricks or such they use to speed up one of these steps significantly, with assembly or otherwise. Is it possible to do these at any time before rendering of the pixels to speed up the drawing?

Thanks guys. :)




SubjectRe: Pixel Writing new  
Posted byBig Time
Posted on8/19/03 06:18 AM



IMO, the best way to render NES graphics is to:

-set up a 32-scanline buffer in memory. This will serve as your working image buffer. It is limited to 32 scanlines to greatly increase the chances of it staying in the processor's L1 cache.

-design a tile rendering algorithm which can accept a scanline range to render in (doing all 8 scanlines yeilds best performance, but the game will be in charge of this).

-to avoid the relatively large overhead in the bit assembling & palette lookup part of pattern table processing, follow these steps:

*reorganize the game's pattern table data so that sequential pixels in a tile's bitmap are now 8 bits apart, and store both bit planes bits together. This will allow you to store 4 pixels in a single byte. This reordering will also complicate your $2007 handling code when the game writes to pattern tables, but this is really nothing in terms of lost performance.

*become familiar with x86 MMX instructions. Since the MMX registers are 8 bytes wide, and you can operate on the data in them as if they were 8 seperate byte values, this allows you to do add and compare operations on 8 bytes at a time. Additionally, now that your bitmap data is arranged in memory so that sequential bitmap pixels are byte-aligned, this means that when you load 8 bytes of your pattern into an MMX register, every byte contains the pixel data in the order it needs to be in when it's stored out to memory.

*for the most inner loop of your playfield renderer, study the following code (this code will produce 8 sequential 8-bit pixels- in only 10 clocks on an Athlon!). However, one thing this code lacks is compensation for unaligned pixel storage (which will happen with any fine scroll value other than 0). The penalty is probably cheap on Athlons, but on P3s, it's pretty significant.


;appropriate pattern table data has been fetched & loaded into mm0.

;most inner loop begins here.
;these instructions duplicate the data into 3 other registers.
movq mm1,mm0
movq mm2,mm0
movq mm3,mm0

;these instructions use (signed) greater than comparisons to determine whether to set 0, 1, 2 or all 3 registers to either all 0's, or all 1's, based on the upper 2 bits of each byte in the registers. This will be used later to select the 4-color value to use using logical instructions.
pcmpgtpb mm1,Level1; 4040404040404040
pcmpgtpb mm2,Level2; 8080808080808080
pcmpgtpb mm3,Level3; C0C0C0C0C0C0C0C0

;these instructions mask the selected data with the palette data of each element of the selected palette, referenced via EAX. Note that this technique requires a single byte palette entry to be duplicated across 8 sequential bytes. Also, XORs are used to merge the selected data. Since more than one palette data type may be selected, it is required that your $2007 handler XOR some of these palette elements with each other in advance so that when this routine executes, it cancels out the unwanted values.
pand mm1,[eax+00]
pand mm2,[eax+08]
pand mm3,[eax+16]
pxor mm1,[eax+24]
pxor mm2,mm3

;the rest of the code is straightforward. the next instruction shifts the master copy of the data left 2 positions, therefore shifting the next scanline of pixels in the tile bitmap into the 2 MSB positions of each byte element.
psllq mm0,2
pxor mm1,mm2; a final merge operation

;the store out of the pixels, and the pixel pointer scanline increment
movq RndrBuf[ebx],mm1
add ebx,LineLen

;Finally, this instruction simply exists to perform the address wrap-around calculation needed for a 32-scanline buffer.
and ebx,BufSize-1;


-use 2 extra bits in each pixel byte to indicate playfield transparency status, and object present status. This is required, because of the way the PPU prioritizes OBJs & the PF. So, after you've drawn 32 scanlines worth of playfield tiles, render objects on top (which fall within the range of the first 32 scanlines), and render them in the order from 0->63 (not 63->0, as you may had guessed- this will be explained later). Before writing object pixels, read in the pixels overtop the area where the object is to be placed, and use MMX logical instructions to "choose" which data to use between the playfield, and the objects. Basically you can use the same code to render object pixels as is listed above for the playfield, but a few things need to be added.

*conditional byte swapping of all 8 bytes will be neccessary to implement horizontal inversion to objects. Since MMX instructions don't provide an easy way of doing this, you'll be stuck with implementing your own algorithm here.

*the following lines of pseudo-code demonstrate the logic behind choosing between OBJ or PF pixel data to output to the image buffer. Although IF statements are used here, it is expected that you would convert this into the equivelant MMX code to have it operate on 8 pixels simultaniously.

IF(SrcOBJpixel.xpCond=FALSE)THEN
IF((DestPixel.OBJxpCond=TRUE)AND((DestPixel.PFxpCond=TRUE)OR(SrcOBJpixel.Pri=foreground)))THEN
DestPixel.data := SrcOBJpixel.data
FI
DestPixel.OBJxpCond := FALSE
FI

So, as you can see, the destination's OBJxpCond is marked as false, even if the object's pixel is not meant to be drawn. This is to prevent the pixels of lower priority (numerically higher-numbered) objects from being drawn in those locations.

This may raise the question, "Why do you render objects in the order of 0->63 (effectively requiring 2 bits for transparency status), when you can render them in the opposite direction (which only requires 1 bit for transparency status)?" The answer is because of what happens on a priority clash (see the "PPU pixel priority quirk" section of the "2C02 technical operation" document). Rendering objects in order of 0->63 is the only way to emulate this PPU feature properly (and some games DO depend on the functionality of this, as it provides a way to force the playfield to hide foreground priority object pixels). Otherwise (for 63->0), it would be neccessary to merge objects to a frame buffer filled with the current transparency color, and then, merge playfield data with the buffer as well. Granted, this technique will only require 1 transparency (background priority) status bit per pixel, but since merge operations are slow, and this technique requires way more of them (since now it's required for the PF rendering also), this technique is inferior to the aforementioned one.

-The final tip is to buffer all data that your CPU sends to the PPU (with cycle count info), and process it at the end of the frame. Why? because this reduces the complexity of your emulator's program control, and it also helps keep data structures in the cache more organized, which results in better performance.




SubjectRe: Pixel Writing new  
Posted byLaughy
Posted on8/19/03 06:58 AM
From IP67.125.157.140  



Well I'm aghast to say the least, and I hope you copy/pasted that from somewhere instead of typing it all out! That is in fact extremely fast and shows great ability in rendering.

I admit I've never used MMX before, but the ability to operate on 8 bytes at a time definately gives a good argument to start :).

I handled the object priority using your second inferior method. Thanks for showing me the other method.




SubjectRe: Pixel Writing new  
Posted byTimW
Posted on8/20/03 06:38 AM
From IP209.179.54.114  



wow, I wish I knew what the hell that ment, lol. Do you happen to know of a good mmx faq??




SubjectRe: Pixel Writing new  
Posted byBig Time
Posted on8/20/03 2:33 PM



yeah. Visit the AMD & Intel homepages.




SubjectRe: Pixel Writing new  
Posted byTimW
Posted on8/28/03 6:25 PM
From IP209.178.144.18  



I've been thinking about this. why not use a lookup table?? tell me what you think of this.

you got 8 pixels per line of a tile, you have "two" bytes of pattern table data. and two bits of attribute data.
so there are only 256*256*4 permiutations of this. you could store every possible combination of bitmapped data in an array of about 3 to 4 meg. you whould have to make the array linear, ie one dimension to avoid the overhead of multiplication.

check it
struct {
dword *pIndex = &patterntabbyte1;
byte patterntabbyte1;
byte patterntabbyte2;
byte attrib;
byte bpadding = 0;
}

when it comes time to draw the tile,
fill in the fields
Byte *p = lookup[*pIndex];
now p, points to an array of 8 bytes which is the converted bitmapped data.
now you got the 8 bytes(the palette entries) with only 3 assignments, and attribute byte fetch, which could be implimented as a lookup aswell.




SubjectRe: Pixel Writing new  
Posted byBig Time
Posted on8/28/03 11:44 PM



>array of about 3 to 4 meg

A table this size wouldn't be able to stay in the cache. Even if you used cache management instructions to minimize pollution, this would only reduce the lookup speed to approximately whatever your system RAM is running at, plus the overhead for accessing a random memory address. But hey, don't take my word for it. Try it out, and let us know what the performance is like. Then compare it (time-wise) to the MMX example mentioned previously. Since it's safe to say that the MMX code example is not going to take more than 10 clocks to execute on any x86 CPU, this could be your reference time figure to beat.




SubjectRe: Pixel Writing new  
Posted byLaughy
Posted on8/29/03 00:11 AM
From IP67.127.167.83  



Your code, Big Time, although a great idea, in practice has some issues. For instance, you have:

pcmpgtpb mm1,Level1; 4040404040404040
pcmpgtpb mm2,Level2; 8080808080808080
pcmpgtpb mm3,Level3; C0C0C0C0C0C0C0C0

Is rather interesting. For instance, if mm1 = $4040404040404040 the result will be all 0s for your first instruction, and $4141414141414141 will produce all 1s (because it's a greater than), even though they should both produce the same result. If anything level1 should be $3F3F3F3F3F3F3F3F. In addition, I believe pcmpgtpb is the older version of the instruction pcmpgtb, since pcmpgtpb does not compile. Weirdly enough, intel's homepage has pcmpgtpb, but directly below it it uses pcmpgtb, as though they are exactly the same. Most likely they are.

The overhead of dealing with 8 bit clipping and fine horizontal scrolling complicates matters more, but they can be dealt with without having to worry about different hits between athlon and pentimum processors, using a trick of masking out attribute data when neccessary, and using a jump table, which one then jumps into at certain points to trick the ppu into scrolling left horizontally. Also, the problem of background sprites showing up in front when a other sprite was in front of it was solved rather easily by having sprites written to also write to an array which would have either FF or 00, depending on whether there was a sprite color there. By pxor'ing this with color0 data, one can then shift the sprite colors into the final output.

Rather than completely re-writting the $2007 handler and complicating matters, I created a 256 lengh array of 64-bit values...which is a 2kb or so array, and will stay in processor cache. By indexing into this array with tile attribute data, one converts a byte into a 64bit value (for instance 11001011 indexed into the array becomes 1000000 1000000 0000000 00000000 10000000 0000000 10000000 10000000). For the second attribute data, one index into the array, shifts right by one, and ors it with the first value, giving ua our final data to use in pcmpgtb. This takes very little clock cycles to do, and even when I took this out performance did not decrease very much at all...if anything the difference was un-detectable.

After implementing I greatly increased the speed of my ppu (speed increased by about 25%, with all thing implemented), thanks for your tips!




SubjectRe: Pixel Writing new  
Posted byBig Time
Posted on8/29/03 00:41 AM



>If anything level1 should be $3F3F3F3F3F3F3F3F.

You're right. Also, the one that says $8080... should actually be all FF's (At first, I thought this instruction did unsigned comparisons, which is why I wrote it the way I did. Later I realized that it was a signed comparison, but forgot to update the code).


> In addition, I believe pcmpgtpb is the older version of the instruction pcmpgtb, since pcmpgtpb does not compile. Weirdly enough, intel's homepage has pcmpgtpb, but directly below it it uses pcmpgtb, as though they are exactly the same. Most likely they are.

As far as I know, there are only 3 compare for greater than instructions, one for bytes, words, and doubles.


>The overhead of dealing with 8 bit clipping and fine horizontal scrolling complicates matters more

For clipping only PF, it's probably easiest just to render the transparency color overtop the PF margin before rendering OBJs.
For clipping PF+OBJ, follow the same step as mentioned, except do it after OBJs are updated.
For clipping only OBJ, simply flag all the bits in each pixel in the PF margin so that they're marked as OBJ pixels. Then when you go to draw OBJs, your OBJ processing algorithm will avoid drawing any OBJs there, because it thinks that OBJs are already there.


>After implementing I greatly increased the speed of my ppu (speed increased by about 25%, with all thing implemented), thanks for your tips!

I'm glad to hear that it was successful.




SubjectRe: Pixel Writing  
Posted byTimW
Posted on8/30/03 06:04 AM
From IP209.179.233.224  



I don't think I could beat 10 clocks, but I think it's a good way if you're using a high level language like c. how would I go about counting the clocks anyway?




SubjectRe: Pixel Writing new  
Posted byBig Time
Posted on8/30/03 5:18 PM



All MMX instructions on the first generation Pentium MMX take 1 clock cycle, even if memory is used in the instruction in any way. As far as I know, all instruction combinations are pairable, so any two adjacent instructions with no dependencies can execute simultaniously. The maximum theoretical throughput of MMX instructions here is 2 per clock.

MMX P6 architecture is similar, except that instructions cost 2 clock cycles, if a memory operand is being used with an instruction which is doing other work than a simple load. The maximum theoretical throughput of MMX instructions here is 3 per clock.

Athlon does most MMX instructions in 2 clocks, with or withouth memory access. The ones that take longer are pretty much any of them which modify general purpose registers. The maximum theoretical throughput of MMX instructions here are 3 for every 2 clocks.


So, you should be able to see where that 10 clock figure comes from. At any rate, even after considering that tile lookup, attribute lookup, and even loop overhead code can't take any more than another 10 clocks (ASM coded, of course), you're still looking at a pretty modest 20 clock cycles/ 8 pixels. Considering that the original 2C02 renders pixels at a rate of 5.37 MHz, this makes the throughput on an 1400 MHz Athlon equivelant to a 2C02 clocked at 560 MHz!




SubjectRe: Pixel Writing new  
Posted byTimW
Posted on8/31/03 11:31 AM
From IP209.178.188.74  



yeah, the lookup wasn't as great as I thought, it was even slower then the method I'm using now, which I know it's that fast =P. I'm going to have to learn those mmx instructions, it seems like the only way to get the speed up.




SubjectRe: Pixel Writing new  
Posted byBig Time
Posted on9/6/03 7:00 PM



The NES emulation discussion document has been updated (3rd release: September 6th, 2003). Besides having been practically rewritten, it now contains far more useful information in general regarding writing an NES emulator, with optimization suggestions being alot better as well. The most notable addition is that I stuck the MMX-example of the playfield renderer in there. It's been refined (it now contains the code to do the fine horizontal scrolling), and the comments should help explain better how it works. Since I've only sent the updated doc to Memblers today, have some patience if the doc hasn't been updated yet.




Previous ThreadView All ThreadsNext Thread*Show in Threaded Mode
Jump to

Memblers' homepage             Contact Me

Forums powered by WWWThreads Demo