Untitled

a guest
Nov 21st, 2017
history and background:
Second Reality: you remember the head with the pentagram and the
lens effect... then a flash and whooopsy, what's that? This is 160x100...
Why? Otherwise it'd be too slow! But why? This is such a simple effect!
When I (some time later) recoded this effect I noticed that the framerate
drops at a certain angle, and the only reason could be the cache.
The processor cache is organized in a special way to give fast access
to its memory. It is divided into cache lines of 16 (32) bytes on a 486
(Pentium), which are atoms: a line is always loaded as a whole. Every
line has a tag address field which stores the position of its 16 (32)
bytes in memory. Then you have 4 (2) ways, which are in a way 4 (2) equal
caches which are searched at the same time. Finally there are 128 sets of
one cache line per way (128 * 4 * 16 = 128 * 2 * 32 = 8k).
At a memory access, the address is split into 3 parts: bits 0-3 (0-4)
determine the byte in a line, bits 4-10 (5-11) determine the set, and
bits 11-31 (12-31) are the tag address. The tag address is then compared
to the tag addresses of the 4 (2) lines of the set. If one matches it is
a cache hit. If not, you get a cache miss, and 16 (32) bytes are read from
memory into the least recently used cache line of that set. This takes
about 23 cycles on a 486dx2-66, while a cache hit takes no extra cycles.
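
The address split can be written down in a few lines of C. This is only
a sketch: the struct and function names are made up for illustration, and
the number of sets is derived from the cache size, so an 8k cache with
4 ways and 16-byte lines gives 8192 / (16 * 4) = 128 sets.

```c
#include <assert.h>
#include <stdint.h>

/* Cache geometry. Field names are illustrative, not from any real API. */
typedef struct {
    uint32_t size_bytes;  /* total cache size, e.g. 8192 */
    uint32_t line_bytes;  /* bytes per cache line, must be a power of two */
    uint32_t ways;        /* lines per set */
} cache_geom;

typedef struct {
    uint32_t offset;  /* byte within the cache line */
    uint32_t set;     /* set that is searched for this address */
    uint32_t tag;     /* compared with the tags stored in that set */
} cache_addr;

static uint32_t num_sets(cache_geom g)
{
    return g.size_bytes / (g.line_bytes * g.ways);
}

/* Split a physical address into line offset, set index and tag. */
static cache_addr split_address(cache_geom g, uint32_t addr)
{
    uint32_t sets = num_sets(g);
    cache_addr a;

    assert((g.line_bytes & (g.line_bytes - 1)) == 0);
    assert(sets != 0 && (sets & (sets - 1)) == 0);

    a.offset = addr % g.line_bytes;
    a.set    = (addr / g.line_bytes) % sets;
    a.tag    = addr / (g.line_bytes * sets);
    return a;
}
```

With the 486 geometry {8192, 16, 4}, addresses 2047 and 2048 are direct
neighbours in memory but end up in sets 127 and 0, and any two addresses
that are a multiple of 2048 bytes apart share the same set.
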
A cache is most effective if you read the memory in linear order, like
you do in a rotozoomer at low angles. You then get one cache miss
out of 16 (32) memory accesses. Now imagine the angle is exactly 90°.
You would then read the memory in steps of 256. After 8k the first
cache line is overwritten, so when you process the next line, it is a
cache miss again. This results in 100% cache misses...
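
If you want to check the two cases without a 486 at hand, a tiny cache
model is enough. The sketch below assumes an 8k cache with 128 sets,
4 ways, 16-byte lines and true LRU replacement (a real 486 only
approximates LRU), and a 256x256 texture with one byte per texel. All
names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LINE = 16, WAYS = 4, SETS = 128 };

typedef struct {
    uint32_t tag[SETS][WAYS];
    uint32_t used[SETS][WAYS];  /* time of last use, 0 = line is empty */
    uint32_t clock;
    uint32_t misses;
} cache_sim;

static void cache_reset(cache_sim *c)
{
    assert(LINE * WAYS * SETS == 8192);  /* geometry adds up to 8k */
    memset(c, 0, sizeof *c);
}

/* Read one byte: on a miss, replace the least recently used line. */
static void cache_read(cache_sim *c, uint32_t addr)
{
    uint32_t set = (addr / LINE) % SETS;
    uint32_t tag = addr / (LINE * SETS);
    int victim = 0;

    c->clock++;
    for (int w = 0; w < WAYS; w++) {
        if (c->used[set][w] != 0 && c->tag[set][w] == tag) {
            c->used[set][w] = c->clock;  /* cache hit */
            return;
        }
        if (c->used[set][w] < c->used[set][victim])
            victim = w;                  /* oldest (or empty) line so far */
    }
    c->misses++;
    c->tag[set][victim] = tag;
    c->used[set][victim] = c->clock;
}

/* Read the whole texture once, row by row (0 degrees) or column by
   column (90 degrees), and return the number of cache misses. */
static uint32_t texture_misses(int by_columns)
{
    cache_sim c;

    cache_reset(&c);
    for (uint32_t a = 0; a < 256; a++)
        for (uint32_t b = 0; b < 256; b++)
            cache_read(&c, by_columns ? b * 256 + a : a * 256 + b);
    return c.misses;
}
```

In this model texture_misses(0) counts one miss per 16 reads, while
texture_misses(1) misses on every single read: each column walks through
32 different lines per set, and only 4 of them fit.
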
How to optimize it?
I had several discussions with Scholar / $eeN on this topic. (hiho!)
We thought about rendering the screen in a different order, so that
the texture is read in a linear fashion. This would mean diagonal lines
instead of h-lines. But this is not a fast solution either, and more
complicated anyway. You could also keep prerotated versions of the
texture, but this would require 2x or more the amount of memory,
and you are limited to a fixed texture if you do not want to modify
2 textures all the time.
The 8x8 block approach was a good compromise. :) You can write dwords,
and do not need too much memory, while the cache contents are not
destroyed.
You can also use this 8x8 block approach to optimize movelist-tunnels:
keep the movelist linear, while you go through the 8x8 blocks.
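
One way to read the movelist idea is sketched below. It assumes that the
movelist holds one 16-bit texture offset per screen pixel and that the
texture is 256x256 bytes. The table is reordered once into block order,
so the drawing loop can read it strictly linearly while it writes the
screen in 8x8 blocks. The function names and the "shift" animation
parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 8 };

/* Reorder a row-major movelist into the order in which an 8x8 block
   renderer visits the pixels: blocks left to right and top to bottom,
   rows inside each block. width and height must be multiples of 8. */
static void movelist_to_blocks(const uint16_t *src, uint16_t *dst,
                               int width, int height)
{
    size_t n = 0;

    assert(width % BLOCK == 0 && height % BLOCK == 0);
    for (int by = 0; by < height; by += BLOCK)
        for (int bx = 0; bx < width; bx += BLOCK)
            for (int y = 0; y < BLOCK; y++)
                for (int x = 0; x < BLOCK; x++)
                    dst[n++] = src[(size_t)(by + y) * width + bx + x];
}

/* Draw block by block. The reordered movelist is read linearly, and
   shift moves the texture (the cast wraps the offset at 64k). */
static void draw_tunnel_blocks(const uint16_t *blocklist,
                               const uint8_t *texture, uint16_t shift,
                               uint8_t *screen, int width, int height)
{
    const uint16_t *m = blocklist;

    assert(width % BLOCK == 0 && height % BLOCK == 0);
    for (int by = 0; by < height; by += BLOCK)
        for (int bx = 0; bx < width; bx += BLOCK)
            for (int y = 0; y < BLOCK; y++) {
                uint8_t *dst = screen + (size_t)(by + y) * width + bx;
                for (int x = 0; x < BLOCK; x++)
                    dst[x] = texture[(uint16_t)(*m++ + shift)];
            }
}
```

The picture is the same as with a row-major movelist. Only the order of
the memory accesses changes.
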
And you can do other nice things with 8x8 blocks... ;))))))))))
Which I cannot tell you yet. Probably later! =}
Cache optimizing seems to be quite stupid for vector engines, but
it is ESSENTIAL for fast bitmap effects.

These little assembler fragments show you what the cache can do and
what it cannot:

fastloop:                 ; always reads the same word: only cache hits
  mov dx,12               ; 12 * 32768 passes
l1:
  mov cx,32768
l2:
  mov ax,[0]              ; some assemblers need ds:[0] for a memory read
  mov ax,[0]
  mov ax,[0]
  mov ax,[0]
  mov ax,[0]
  dec cx
  jnz l2
  dec dx
  jnz l1

slowloop:                 ; 5 reads, 2048 bytes apart: same sets every time
  mov dx,12
l3:
  mov cx,32768
l4:
  mov ax,[2047]           ; odd address: the word straddles two cache lines
  mov ax,[4095]
  mov ax,[6143]
  mov ax,[8191]
  mov ax,[10239]
  dec cx
  jnz l4
  dec dx
  jnz l3

On a 486dx2-66 the first loop is about 25-50 times as fast as the second
one. If you don't believe me, try it yourself.
On a Pentium, 3 moves are enough to show the effects of the cache,
since it has only 2 ways:
  mov ax,[4095]
  mov ax,[8191]
  mov ax,[12287]
Use those in the loop.
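
Why these addresses hurt can be checked by counting how many different
cache lines fall into the same set. The sketch below assumes 128 sets
(8k divided by line size times ways) and 16-bit reads, which touch two
lines when the address is the last byte of a line. The function name is
made up for this example.

```c
#include <assert.h>
#include <stdint.h>

enum { MAX_ADDR = 16 };

/* Largest number of distinct cache lines that land in one set when
   every address in the list is read as a 16-bit word. */
static unsigned worst_set_load(const uint32_t *addr, unsigned n,
                               uint32_t line_bytes, uint32_t sets)
{
    uint32_t lines[2 * MAX_ADDR];
    unsigned count = 0, worst = 0;

    assert(n <= MAX_ADDR);

    /* Collect the distinct lines touched by the first and last byte
       of each word. */
    for (unsigned i = 0; i < n; i++) {
        for (uint32_t b = 0; b < 2; b++) {
            uint32_t line = (addr[i] + b) / line_bytes;
            unsigned k = 0;
            while (k < count && lines[k] != line)
                k++;
            if (k == count)
                lines[count++] = line;
        }
    }

    /* For every line, count how many lines share its set. */
    for (unsigned i = 0; i < count; i++) {
        unsigned load = 0;
        for (unsigned k = 0; k < count; k++)
            if (lines[k] % sets == lines[i] % sets)
                load++;
        if (load > worst)
            worst = load;
    }
    return worst;
}
```

For the five addresses of the slow loop with 16-byte lines the result is
5, one more than the 4 ways of a 486, so with LRU replacement every read
throws out a line that is needed again a moment later. For the three
Pentium addresses with 32-byte lines the result is 3, one more than its
2 ways. The fast loop gives 1.
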


description:
This rotozoomer (also stretcher if you like...) does not process the
picture line by line, like it is usually done, but block by block.
The reason is simple: If you do it line by line you get 100% cache misses
beyond a certain rotation angle (~60° on a 486dx with 8k processor cache).
100% cache misses mean 64000 (mode 13h) times 23 cycles on a
486dx2-66 (experimental value). That is about 1.47 million cycles, which
is > 20ms at 66MHz, and your frame rate cannot get any better than 50Hz
(ignoring all other operations).
You might end up at 30Hz at certain angles and 100Hz at others.
If you do it block by block you can reduce the cache misses to about
4000 to 8000 (2-5 ms), which has only little effect on the framerate.
You end up with a rotozoomer which runs at a constant 100fps on a
dx2-66, or a constant 350fps on a p120.

Every 8x8 block is processed in the usual way, i.e. 8 pixels, next line,
8 pixels, next line... The blocks are then drawn either in horizontal or
vertical order, i.e. row by row vs. column by column. If used correctly
this feature reduces the number of cache misses a bit further. You should
use vertical order if the angle is about 90° or 270°. If it is closer to
0° or 180°, use horizontal order.
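
In C the block traversal looks roughly like this. It is a sketch and not
the original assembler routine: it assumes a 256x256 byte texture, 16.16
fixed point texture coordinates that wrap around, and a 45° threshold for
switching the block order. All names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

enum { BLOCK = 8 };

typedef struct {
    uint32_t u0, v0;      /* texture position of the top left pixel */
    uint32_t dudx, dvdx;  /* step per pixel to the right (16.16) */
    uint32_t dudy, dvdy;  /* step per pixel down (16.16) */
} roto_params;

/* One 8x8 block, the usual way: 8 pixels, next line, 8 pixels... */
static void draw_block(uint8_t *screen, int pitch, const uint8_t *tex,
                       const roto_params *p, int x0, int y0)
{
    for (int y = y0; y < y0 + BLOCK; y++) {
        uint32_t u = p->u0 + (uint32_t)x0 * p->dudx + (uint32_t)y * p->dudy;
        uint32_t v = p->v0 + (uint32_t)x0 * p->dvdx + (uint32_t)y * p->dvdy;
        uint8_t *dst = screen + y * pitch + x0;

        for (int x = 0; x < BLOCK; x++) {
            dst[x] = tex[((v >> 16) & 255) * 256 + ((u >> 16) & 255)];
            u += p->dudx;
            v += p->dvdx;
        }
    }
}

/* Draw xblocks * yblocks blocks, row by row (vertical == 0) or column
   by column (vertical != 0). pitch is the bytes/line of the screen. */
static void roto_blocks(uint8_t *screen, int pitch, int xblocks,
                        int yblocks, const uint8_t *tex,
                        const roto_params *p, int vertical)
{
    int outer = vertical ? xblocks : yblocks;
    int inner = vertical ? yblocks : xblocks;

    assert(pitch >= xblocks * BLOCK);
    for (int o = 0; o < outer; o++)
        for (int i = 0; i < inner; i++) {
            int bx = vertical ? o : i;
            int by = vertical ? i : o;
            draw_block(screen, pitch, tex, p, bx * BLOCK, by * BLOCK);
        }
}

/* Vertical order near 90 or 270 degrees, horizontal order otherwise.
   The 45 degree threshold is an assumption of this sketch. */
static int use_vertical_order(int angle_deg)
{
    int a = ((angle_deg % 360) + 360) % 180;
    return a > 45 && a < 135;
}
```

Both block orders produce exactly the same picture as a line by line
loop. Only the order of the texture reads differs.
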

I had no problem using 4 rotozoomers at the same time in our intro
"LASSE REINB0NG", which was still quite smooth on a 486dx2-66 (>15fps).
This is the same routine as used in the intro, so believe me, it's fast.
(The only difference is that you can now control the block order.)
Nothing was slowed down!

On a P120 this routine nearly runs in 1 frame / retrace... (did I say
640x480??? :) )
That is why there is the bytes/line parameter. You can use this routine
with segmented screen memory. In 640x480 you should set the virtual screen
width to 1024. Then the bytes/line is not equal to yblocks*8. You can then
process a range of 80x8 blocks (640x64 pixels) with this routine without
segment changes, because 64 lines of 1024 bytes are exactly 64k. If you
process 7.5 of these ranges (480 / 64) with segment changes in between,
you can fill the screen with this routine.
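
The 7.5 ranges can be worked out like this. The sketch assumes 64k
segments and 1024 bytes per line. How the segment is actually switched
depends on the video hardware and is left out, and the type and function
names are made up.

```c
#include <assert.h>
#include <stdint.h>

enum {
    PITCH = 1024,                      /* virtual screen width in bytes */
    SEGMENT = 65536,                   /* one 64k segment */
    ROWS_PER_SEGMENT = SEGMENT / PITCH /* 64 lines = 8 rows of blocks */
};

typedef struct {
    int segment;    /* which 64k segment holds this range */
    int first_row;  /* first screen line of the range */
    int rows;       /* number of screen lines in the range */
} seg_range;

/* Split a screen of height lines into ranges which each fit into one
   segment. Returns the number of ranges written to out. */
static int split_into_ranges(int height, seg_range *out, int max_out)
{
    int n = 0;

    assert(height > 0 && max_out > 0);
    for (int row = 0; row < height && n < max_out; row += ROWS_PER_SEGMENT) {
        int rows = height - row;
        if (rows > ROWS_PER_SEGMENT)
            rows = ROWS_PER_SEGMENT;
        out[n].segment = row / ROWS_PER_SEGMENT;
        out[n].first_row = row;
        out[n].rows = rows;
        n++;
    }
    return n;
}

/* Offset of pixel (x, y) inside its segment. */
static uint32_t offset_in_segment(int x, int y)
{
    return ((uint32_t)y * PITCH + (uint32_t)x) % SEGMENT;
}
```

For a height of 480 this gives 7 ranges of 64 lines and a last one of
32 lines, i.e. 7.5 ranges. Because every range starts at offset 0 of its
segment, no 8x8 block ever crosses a segment border.
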