All MMX instructions on the first-generation Pentium MMX take 1 clock cycle, even if the instruction uses a memory operand. As far as I know, all instruction combinations are pairable, so any two adjacent instructions with no dependencies can execute simultaneously. The maximum theoretical throughput of MMX instructions here is 2 per clock.
MMX on the P6 architecture is similar, except that an instruction costs 2 clock cycles if it uses a memory operand and does more work than a simple load. The maximum theoretical throughput of MMX instructions here is 3 per clock.
Athlon does most MMX instructions in 2 clocks, with or without memory access. The ones that take longer are pretty much any that modify general-purpose registers. The maximum theoretical throughput of MMX instructions here is 3 for every 2 clocks.
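To make those peak rates concrete, here is a rough sketch that turns them into a best-case clock count for a block of MMX instructions. The rates are the ones quoted above. The 12-instruction loop is a made-up number purely for illustration, and `min_clocks` is just a name I picked. It assumes perfect pairing and no stalls, so real code will be slower.

```python
from fractions import Fraction
from math import ceil

# Peak MMX issue rates quoted in the post (instructions per clock).
PEAK_RATE = {
    "Pentium MMX": Fraction(2, 1),
    "P6": Fraction(3, 1),
    "Athlon": Fraction(3, 2),
}

def min_clocks(instruction_count, cpu):
    """Best-case clock count, assuming perfect pairing and no stalls."""
    return ceil(Fraction(instruction_count) / PEAK_RATE[cpu])

# Hypothetical inner loop of 12 MMX instructions (not a figure from the post).
for cpu in PEAK_RATE:
    print(cpu, min_clocks(12, cpu))
```

Treat the result as a floor rather than a prediction: dependencies between instructions and the 2-clock cases mentioned above will push the real count up.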
So, you should be able to see where that 10 clock figure comes from. At any rate, even after adding tile lookup, attribute lookup, and loop overhead code, which can't take more than another 10 clocks (ASM coded, of course), you're still looking at a pretty modest 20 clock cycles per 8 pixels, or 2.5 clocks per pixel. Considering that the original 2C02 renders pixels at a rate of 5.37 MHz, this makes the throughput on a 1400 MHz Athlon (1400 / 2.5) equivalent to a 2C02 clocked at 560 MHz!
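The arithmetic behind that last figure can be checked with a few lines. This is only a back-of-the-envelope sketch using the numbers from the post (20 clocks per 8 pixels, 1400 MHz, 5.37 MHz). The function and variable names are my own.

```python
# Pixel rate of the original 2C02, as quoted in the post (MHz).
NES_PIXEL_CLOCK_MHZ = 5.37

def equivalent_pixel_clock_mhz(cpu_mhz, clocks_per_group, pixels_per_group=8):
    """Pixels rendered per microsecond, i.e. the equivalent pixel clock in MHz."""
    return cpu_mhz * pixels_per_group / clocks_per_group

# 1400 MHz Athlon at 20 clocks per 8 pixels.
equivalent = equivalent_pixel_clock_mhz(1400, 20)
speedup = equivalent / NES_PIXEL_CLOCK_MHZ

print(f"equivalent pixel clock: {equivalent:.0f} MHz")
print(f"speedup over a real 2C02: about {speedup:.0f}x")
```

In other words, under these assumptions the renderer has roughly a hundredfold margin over the real hardware's pixel rate.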