- history and background:
- Second Reality: you remember the head with the pentagram and the
- lens effect... then a flash and whooopsy, what's that? This is 160x100...
- Why? Otherwise it'd be too slow! But why? This is such a simple effect!
- When I (some time later) recoded this effect I noticed that the framerate
- drops at a certain angle, and the only reason could be the cache.
- The processor cache is organized in a special way to allow fast access
- to its memory. It is divided into cache lines of 16 (32) bytes on a 486
- (pentium), which are its atoms. Each line has a tag address field which
- stores the position of its 16 (32) bytes in memory. Then you have 4 (2)
- ways, which are in a way 4 (2) equal caches that are searched at the
- same time. Finally there are 128 sets of one cache line per way
- (128 sets x 4 ways x 16 bytes = 128 sets x 2 ways x 32 bytes = 8k).
- At a memory access, the address is split into 3 parts: bits 0-3 (0-4)
- determine the byte in a line, bits 4-10 (5-11) determine the set, and
- bits 11-31 (12-31) are the tag address. The tag address is then compared
- to the tag addresses of the 4 (2) lines of the set. If one matches, it is
- a cache hit. If not, you get a cache miss and 16 (32) bytes are read from
- memory into the least recently used cache line of that set. This takes
- about 23 cycles on a 486dx2-66, while a cache hit takes no extra cycles.
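
The address split described above can be sketched in a few lines of Python. This is only an illustration: the function name and its defaults are made up here, and the set count of 128 is an assumption derived from an 8k cache (8192 / (4 ways x 16 bytes) on the 486, 8192 / (2 ways x 32 bytes) on the pentium).

```python
def split_address(addr, line_bytes=16, num_sets=128):
    """Split a physical address into (tag, set index, byte offset).

    Defaults model an 8k 486 cache; pass line_bytes=32 for a pentium.
    Both geometries are assumptions, not values read from real hardware.
    """
    offset = addr % line_bytes                  # byte inside the cache line
    set_index = (addr // line_bytes) % num_sets # which set is searched
    tag = addr // (line_bytes * num_sets)       # compared against every way
    return tag, set_index, offset


if __name__ == "__main__":
    for addr in (0x0000, 0x0800, 0x1234):
        print(hex(addr), "486:", split_address(addr),
              "pentium:", split_address(addr, line_bytes=32))
```

Two addresses that are 2048 bytes apart (4096 on the pentium) get the same set index and differ only in the tag, so they compete for the same 4 (2) ways.
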
- A cache is most effective if you read the memory in linear order, like
- you do in a rotozoomer at low angles. You then get one cache miss
- out of 16 (32) memory accesses. Now imagine the angle is exactly 90°.
- With a 256 byte wide texture you would then read the memory in steps
- of 256. After 8k the first cache line is overwritten, so when you process
- the next screen line, it is a cache miss again. This results in 100%
- cache misses...
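
To see where the "after 8k" comes from, here is a small Python sketch that counts which cache lines and sets one walk down a texture column touches. It assumes a 256 byte wide texture and an 8k 486 cache with 16 byte lines, 128 sets and 4 ways. All names are illustrative.

```python
LINE_BYTES, NUM_SETS, WAYS = 16, 128, 4   # assumed 486 geometry (8k)
TEX_WIDTH = 256                           # assumed texture width in bytes


def footprint(addresses):
    """Return (distinct cache lines, distinct sets) touched by the addresses."""
    lines = {addr // LINE_BYTES for addr in addresses}
    sets = {line % NUM_SETS for line in lines}
    return len(lines), len(sets)


def texture_row(v):
    """Addresses read along one texture row (angle 0)."""
    return [v * TEX_WIDTH + u for u in range(TEX_WIDTH)]


def texture_column(u):
    """Addresses read down one texture column (angle 90, step 256)."""
    return [v * TEX_WIDTH + u for v in range(256)]


if __name__ == "__main__":
    print("row:   ", footprint(texture_row(0)))     # few lines, many sets
    print("column:", footprint(texture_column(0)))  # many lines, few sets
```

Under these assumptions a column touches 256 different lines but only 8 sets. Those 8 sets hold 8 x 4 = 32 lines, which is 32 steps of 256 bytes, i.e. 8k. Every line is gone again long before the next screen line comes back to it.
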
- How to optimize it?
- I had several discussions with Scholar / $eeN on this topic. (hiho!)
- We thought about rendering the screen in a different order, so that
- the texture is read in a linear fashion. This would be diagonal lines
- instead of h-lines. But this is not a fast solution either, and more
- complicated anyway. You could also keep prerotated versions of the
- texture, but this would require 2x or more the amount of memory,
- and you are limited to a fixed texture if you do not want to modify
- 2 textures all the time.
- The 8x8 block approach was a good compromise. :) You can write dwords,
- and do not need too much memory while the cache contents are not
- destroyed.
- You can also use this 8x8 block approach to optimize movelist-tunnels:
- keep the movelist linear while you go through the 8x8 blocks.
- And you can do other nice things with 8x8 blocks... ;))))))))))
- Which I cannot tell you yet. Probably later! =}
- Cache optimizing seems to be quite stupid for vector engines, but
- it is ESSENTIAL for fast bitmap effects.
- These little assembler fragments show you what the cache can do and
- what it cannot:
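- fastloop:
-         mov dx,12          ; 12 x 32768 passes through the inner loop
- l1:
-         mov cx,32768
- l2:
-         mov ax,[0]         ; same word every time: one miss, then only hits
-         mov ax,[0]         ; (some assemblers, e.g. MASM, want ds:[0] here)
-         mov ax,[0]
-         mov ax,[0]
-         mov ax,[0]
-         dec cx
-         jnz l2
-         dec dx
-         jnz l1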
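- slowloop:
-         mov dx,12
- l3:
-         mov cx,32768
- l4:
-         mov ax,[2047]      ; 2048 bytes apart: all five reads land in the
-         mov ax,[4095]      ; same sets, but a set has only 4 ways, so they
-         mov ax,[6143]      ; keep evicting each other. The odd addresses
-         mov ax,[8191]      ; also make every read straddle two cache lines.
-         mov ax,[10239]
-         dec cx
-         jnz l4
-         dec dx
-         jnz l3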
- On a 486dx2-66 the first loop is about 25-50 times as fast as the second
- one. If you don't believe me, try it yourself.
- On a pentium, 3 moves are enough to show the effects of the cache,
- since each set has only 2 ways:
-         mov ax,[4095]
-         mov ax,[8191]
-         mov ax,[12287]
- Use those in the slowloop instead of the 5 moves.
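
Why exactly these addresses hurt can be checked without a 486 at hand. The Python sketch below lists, per cache set, the distinct lines that one pass through a loop body touches. The geometry (128 sets, 16 byte lines and 4 ways on the 486, 32 byte lines and 2 ways on the pentium) is an assumption for 8k caches, and the helper name is made up.

```python
def lines_per_set(word_addresses, line_bytes, num_sets=128):
    """Map each set index to the distinct line tags a loop body touches."""
    usage = {}
    for addr in word_addresses:
        for byte in (addr, addr + 1):        # a 16-bit read covers 2 bytes
            line = byte // line_bytes
            usage.setdefault(line % num_sets, set()).add(line // num_sets)
    return usage


FAST = [0, 0, 0, 0, 0]                       # fastloop body
SLOW_486 = [2047, 4095, 6143, 8191, 10239]   # slowloop body
SLOW_PENTIUM = [4095, 8191, 12287]           # pentium variant

if __name__ == "__main__":
    for name, addrs, line_bytes, ways in (
        ("fastloop", FAST, 16, 4),
        ("slowloop 486", SLOW_486, 16, 4),
        ("slowloop pentium", SLOW_PENTIUM, 32, 2),
    ):
        worst = max(len(tags) for tags in lines_per_set(addrs, line_bytes).values())
        print(name, "needs", worst, "lines in one set,", ways, "ways available")
```

As soon as a loop cycles through more lines of one set than there are ways, least-recently-used replacement throws out each line just before it is needed again, so every read misses.
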
- description:
- This rotozoomer (also stretcher if you like...) does not process the
- picture like it is usually done, line by line, but block by block.
- The reason is simple: If you do it line by line you get 100% cache misses
- beyond a certain rotation angle (~60° on a 486dx 8k processor cache).
- 100% cache misses mean 64000 (mode 13h) times 23 cycles on a
- 486dx2-66 (experimental value), which is > 20ms, and your frame rate
- cannot get any better than 50Hz (ignoring all other operations).
- You might end up at 30Hz at certain angles and 100Hz at others.
- If you do it block by block you can reduce the cache misses to about
- 4000 to 8000 (2-5 ms), which has only a small effect on the framerate.
- You end up with a rotozoomer which runs at a constant 100fps on a dx2-66,
- or a constant 350fps on a p120.
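
The frame rate limit follows from simple arithmetic, sketched here in Python. The 23 cycles per miss are the experimental value from the text. The flat 66 MHz clock and the function names are assumptions of this sketch.

```python
CPU_HZ = 66_000_000   # nominal clock of a 486dx2-66 (assumption)
MISS_CYCLES = 23      # experimental cost per cache miss, from the text


def miss_time_ms(misses):
    """Time spent on cache misses alone, in milliseconds."""
    return misses * MISS_CYCLES * 1000.0 / CPU_HZ


def fps_ceiling(misses):
    """Upper bound on the frame rate if only the misses cost time."""
    return 1000.0 / miss_time_ms(misses)


if __name__ == "__main__":
    for misses in (64000, 8000, 4000):
        print(misses, "misses:", round(miss_time_ms(misses), 1), "ms,",
              "at most", round(fps_ceiling(misses)), "fps")
```

For 64000 misses this gives a bit over 22 ms per frame, so less than 50 frames per second. With a flat 23 cycles per miss, 4000 to 8000 misses come out at roughly 1.4 to 2.8 ms, the same order of magnitude as the 2-5 ms quoted above.
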
- Every 8x8 block is processed in the usual way, i.e. 8 pixels, next line,
- 8 pixels, next line... The blocks are then drawn either in horizontal or
- vertical order, i.e. row by row vs. column by column. If used correctly,
- this feature reduces the number of cache misses a bit further. You should
- use vertical order if the angle is about 90° or 270°. If it is closer to
- 0° or 180°, use horizontal order.
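
The effect of the block order can be tried out with a toy cache model in Python. This is a sketch under several assumptions, not the routine from the intro: an 8k 486 cache (16 byte lines, 128 sets, 4 ways) with plain LRU replacement, a 256x256 byte texture, no zoom, only the exact angles 0° and 90°, and only the texture reads are counted. All names are made up.

```python
from collections import OrderedDict

LINE_BYTES, NUM_SETS, WAYS = 16, 128, 4    # assumed 486 geometry (8k)
WIDTH, HEIGHT, BLOCK = 320, 200, 8         # mode 13h, 8x8 blocks


def count_misses(addresses):
    """Count line fills for a stream of byte addresses (LRU per set)."""
    sets = [OrderedDict() for _ in range(NUM_SETS)]
    misses = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        ways, tag = sets[line % NUM_SETS], line // NUM_SETS
        if tag in ways:
            ways.move_to_end(tag)          # hit: now most recently used
        else:
            misses += 1
            if len(ways) == WAYS:
                ways.popitem(last=False)   # evict least recently used line
            ways[tag] = True
    return misses


def texel(x, y, rotated):
    """Texture address for screen pixel (x, y) at 0 or 90 degrees."""
    u, v = (y, x) if rotated else (x, y)
    return (v & 255) * 256 + (u & 255)


def by_lines(rotated):
    """Classic order: one screen line after the other."""
    for y in range(HEIGHT):
        for x in range(WIDTH):
            yield texel(x, y, rotated)


def by_blocks(rotated, vertical):
    """8x8 blocks, drawn column by column (vertical) or row by row."""
    cols, rows = WIDTH // BLOCK, HEIGHT // BLOCK
    if vertical:
        order = [(bx, by) for bx in range(cols) for by in range(rows)]
    else:
        order = [(bx, by) for by in range(rows) for bx in range(cols)]
    for bx, by in order:
        for y in range(by * BLOCK, (by + 1) * BLOCK):
            for x in range(bx * BLOCK, (bx + 1) * BLOCK):
                yield texel(x, y, rotated)


if __name__ == "__main__":
    for rotated in (False, True):
        print("90 deg" if rotated else " 0 deg",
              "lines:", count_misses(by_lines(rotated)),
              "blocks horizontal:", count_misses(by_blocks(rotated, False)),
              "blocks vertical:", count_misses(by_blocks(rotated, True)))
```

In this model the line by line loop misses on every single pixel at 90°, while the blocked loops stay between roughly 3000 and 8000 misses at both angles. Vertical block order is the better choice at 90°, horizontal order at 0°.
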
- I had no problem using 4 rotozoomers at the same time in our intro
- "LASSE REINB0NG" which was still quite smooth on a 486dx2-66 (>15fps).
- This is the same routine as used in the intro, so believe me, it's fast.
- (only that you can now control the blockorder)
- Nothing was slowed down!
- On a P120 this routine nearly runs in 1 frame / retrace... (did I say
- 640x480??? :) )
- That is why there is the bytes/line parameter. You can use this routine
- with segmented screen memory. In 640x480 you should set the virtual screen
- width to 1024, so that 64 lines fill exactly one 64k segment. Then the
- bytes/line is not equal to yblocks*8. You can then process a range of
- 80x8 blocks (640x64 pixels) with this routine without segment changes.
- If you process 7.5 of these ranges with segment changes in between, you
- can fill the whole screen with this routine.
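
The segment arithmetic behind the bytes/line parameter can be checked with a short Python sketch. The function and parameter names are hypothetical and do not come from the routine itself. A 64k segment and one byte per pixel are assumed.

```python
BLOCK = 8
SEGMENT_BYTES = 65536          # one 64k real-mode segment / video bank


def block_offset(bx, by, bytes_per_line):
    """Offset of the top-left pixel of block (bx, by) inside a range."""
    return by * BLOCK * bytes_per_line + bx * BLOCK


def last_byte_of_range(xblocks, yblocks, bytes_per_line):
    """Offset of the bottom-right pixel of a range of xblocks x yblocks."""
    corner = block_offset(xblocks - 1, yblocks - 1, bytes_per_line)
    return corner + (BLOCK - 1) * bytes_per_line + (BLOCK - 1)


def ranges_per_screen(screen_height, bytes_per_line):
    """How many segment-sized ranges are needed to cover the screen."""
    lines_per_segment = SEGMENT_BYTES // bytes_per_line
    return screen_height / lines_per_segment


if __name__ == "__main__":
    print("last byte of an 80x8 block range:", last_byte_of_range(80, 8, 1024))
    print("ranges for 480 lines:", ranges_per_screen(480, 1024))
    print("lines per segment at 640 bytes/line:", SEGMENT_BYTES // 640)
```

With 1024 bytes per line, 64 lines fill a segment exactly, so a range of 80x8 blocks never crosses a segment boundary. With 640 bytes per line a segment would hold 102 lines, which is not a multiple of 8, so some block would straddle two segments.
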