
tinymembench on ODROID-XU4 the big.LITTLE way

a guest
Aug 11th, 2017
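The two runs below use `taskset` to pin tinymembench to one cluster at a time: on the XU4's Exynos 5422, cores 4-7 are the Cortex-A15 "big" cluster and cores 0-3 the Cortex-A7 "LITTLE" cluster (that numbering is an assumption about this board's standard kernel topology). The same pinning can be done from inside a program; a minimal sketch:

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to cores [first, last], like `taskset -c first-last`.
 * On the XU4 (assumed topology) cores 4-7 are the A15 "big" cluster and
 * cores 0-3 the A7 "LITTLE" cluster. Returns 0 on success, -1 on error. */
int pin_to_cores(int first, int last)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = first; c <= last; c++)
        CPU_SET(c, &set);
    return sched_setaffinity(0, sizeof set, &set); /* pid 0 = this thread */
}
```

Running the benchmark under `taskset` (as done here) is simpler and needs no source changes; in-process pinning is only useful when one binary must measure both clusters in a single run.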
root@odroidxu4:/usr/local/src/tinymembench# taskset -c 4-7 ./tinymembench
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
== copied per second (adding together read and writen ==
== bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
== to first fetch data into it, and only then write it to the ==
== destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
== brackets ==
==========================================================================

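Note 3 above describes the 2-pass copy scheme. A minimal sketch of that idea (the bounce-buffer size is an assumption; tinymembench's actual implementation differs):

```c
#include <stddef.h>
#include <string.h>

/* Sketch of a "2-pass copy": stage data through a small buffer that stays
 * resident in L1, so the copy becomes source -> L1 cache, then
 * L1 cache -> destination, as described in Note 3. */
enum { BOUNCE = 4096 }; /* assumed small enough to stay in L1 */

void two_pass_copy(void *dst, const void *src, size_t n)
{
    char tmp[BOUNCE];
    char *d = dst;
    const char *s = src;
    while (n > 0) {
        size_t step = n < BOUNCE ? n : BOUNCE;
        memcpy(tmp, s, step); /* pass 1: fetch into the L1-resident buffer */
        memcpy(d, tmp, step); /* pass 2: write it out to the destination */
        s += step;
        d += step;
        n -= step;
    }
}
```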
C copy backwards : 1184.6 MB/s
C copy backwards (32 byte blocks) : 1188.5 MB/s (1.0%)
C copy backwards (64 byte blocks) : 2355.6 MB/s (2.4%)
C copy : 1216.8 MB/s
C copy prefetched (32 bytes step) : 1464.4 MB/s (0.9%)
C copy prefetched (64 bytes step) : 1462.3 MB/s
C 2-pass copy : 1135.8 MB/s (1.6%)
C 2-pass copy prefetched (32 bytes step) : 1403.2 MB/s (1.0%)
C 2-pass copy prefetched (64 bytes step) : 1406.0 MB/s (1.2%)
C fill : 4906.0 MB/s (1.5%)
C fill (shuffle within 16 byte blocks) : 1832.9 MB/s (2.4%)
C fill (shuffle within 32 byte blocks) : 1833.2 MB/s
C fill (shuffle within 64 byte blocks) : 1914.4 MB/s (1.0%)
---
standard memcpy : 2314.8 MB/s (5.3%)
standard memset : 4894.1 MB/s (1.7%)
---
NEON read : 3581.8 MB/s (2.9%)
NEON read prefetched (32 bytes step) : 4461.1 MB/s
NEON read prefetched (64 bytes step) : 4482.1 MB/s (2.0%)
NEON read 2 data streams : 3714.8 MB/s (1.6%)
NEON read 2 data streams prefetched (32 bytes step) : 4600.8 MB/s (4.0%)
NEON read 2 data streams prefetched (64 bytes step) : 4609.2 MB/s (1.4%)
NEON copy : 2848.8 MB/s (2.2%)
NEON copy prefetched (32 bytes step) : 3157.8 MB/s (3.4%)
NEON copy prefetched (64 bytes step) : 3148.1 MB/s (3.8%)
NEON unrolled copy : 2359.5 MB/s (2.0%)
NEON unrolled copy prefetched (32 bytes step) : 3421.0 MB/s (3.0%)
NEON unrolled copy prefetched (64 bytes step) : 3450.3 MB/s (3.1%)
NEON copy backwards : 1251.8 MB/s (1.7%)
NEON copy backwards prefetched (32 bytes step) : 1458.1 MB/s (1.1%)
NEON copy backwards prefetched (64 bytes step) : 1458.3 MB/s (0.8%)
NEON 2-pass copy : 2119.7 MB/s (1.9%)
NEON 2-pass copy prefetched (32 bytes step) : 2354.8 MB/s (2.8%)
NEON 2-pass copy prefetched (64 bytes step) : 2356.4 MB/s (1.3%)
NEON unrolled 2-pass copy : 1430.3 MB/s (0.8%)
NEON unrolled 2-pass copy prefetched (32 bytes step) : 1775.1 MB/s (1.2%)
NEON unrolled 2-pass copy prefetched (64 bytes step) : 1793.6 MB/s (3.1%)
NEON fill : 4868.4 MB/s (1.6%)
NEON fill backwards : 1847.2 MB/s
VFP copy : 2503.4 MB/s (2.4%)
VFP 2-pass copy : 1333.5 MB/s (2.6%)
ARM fill (STRD) : 4886.7 MB/s (1.3%)
ARM fill (STM with 8 registers) : 4879.1 MB/s (1.4%)
ARM fill (STM with 4 registers) : 4893.2 MB/s (1.5%)
ARM copy prefetched (incr pld) : 2969.6 MB/s (3.6%)
ARM copy prefetched (wrap pld) : 2809.9 MB/s (2.3%)
ARM 2-pass copy prefetched (incr pld) : 1651.8 MB/s
ARM 2-pass copy prefetched (wrap pld) : 1630.9 MB/s (1.4%)

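Many of the results above compare plain loops against variants that issue a software prefetch 32 or 64 bytes ahead of the read pointer. A sketch of the 64-byte-step idea using GCC's `__builtin_prefetch` (tinymembench's own prefetched tests are hand-tuned and use PLD/NEON instructions directly, so this is only the shape of the technique):

```c
#include <stddef.h>

/* Copy with a software prefetch issued once per 64-byte block,
 * 64 bytes ahead of the current read position. Prefetches are hints
 * and do not fault, so running past the end of src is harmless. */
void copy_prefetched_64(char *dst, const char *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if ((i & 63) == 0)
            __builtin_prefetch(src + i + 64, 0, 0); /* read, low locality */
        dst[i] = src[i];
    }
}
```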
==========================================================================
== Framebuffer read tests. ==
== ==
== Many ARM devices use a part of the system memory as the framebuffer, ==
== typically mapped as uncached but with write-combining enabled. ==
== Writes to such framebuffers are quite fast, but reads are much ==
== slower and very sensitive to the alignment and the selection of ==
== CPU instructions which are used for accessing memory. ==
== ==
== Many x86 systems allocate the framebuffer in the GPU memory, ==
== accessible for the CPU via a relatively slow PCI-E bus. Moreover, ==
== PCI-E is asymmetric and handles reads a lot worse than writes. ==
== ==
== If uncached framebuffer reads are reasonably fast (at least 100 MB/s ==
== or preferably >300 MB/s), then using the shadow framebuffer layer ==
== is not necessary in Xorg DDX drivers, resulting in a nice overall ==
== performance improvement. For example, the xf86-video-fbturbo DDX ==
== uses this trick. ==
==========================================================================

NEON read (from framebuffer) : 12209.5 MB/s
NEON copy (from framebuffer) : 7055.8 MB/s (1.5%)
NEON 2-pass copy (from framebuffer) : 4603.5 MB/s (1.0%)
NEON unrolled copy (from framebuffer) : 5726.6 MB/s (0.6%)
NEON 2-pass unrolled copy (from framebuffer) : 3791.0 MB/s (0.6%)
VFP copy (from framebuffer) : 5738.3 MB/s
VFP 2-pass copy (from framebuffer) : 3520.9 MB/s (0.4%)
ARM copy (from framebuffer) : 7588.4 MB/s (1.3%)
ARM 2-pass copy (from framebuffer) : 3783.9 MB/s

==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can't handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==
==========================================================================

block size : single random read / dual random read
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.1 ns
4096 : 0.0 ns / 0.1 ns
8192 : 0.0 ns / 0.1 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.1 ns
65536 : 4.4 ns / 6.8 ns
131072 : 6.7 ns / 9.1 ns
262144 : 9.6 ns / 12.0 ns
524288 : 11.1 ns / 13.6 ns
1048576 : 11.9 ns / 14.6 ns
2097152 : 19.8 ns / 29.9 ns
4194304 : 95.7 ns / 143.9 ns
8388608 : 134.3 ns / 182.5 ns
16777216 : 153.9 ns / 197.5 ns
33554432 : 169.3 ns / 218.2 ns
67108864 : 179.0 ns / 235.1 ns
root@odroidxu4:/usr/local/src/tinymembench# taskset -c 0-3 ./tinymembench
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
== copied per second (adding together read and writen ==
== bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
== to first fetch data into it, and only then write it to the ==
== destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
== brackets ==
==========================================================================

C copy backwards : 218.1 MB/s
C copy backwards (32 byte blocks) : 278.1 MB/s (2.3%)
C copy backwards (64 byte blocks) : 299.9 MB/s (4.7%)
C copy : 288.9 MB/s (3.8%)
C copy prefetched (32 bytes step) : 536.9 MB/s (4.0%)
C copy prefetched (64 bytes step) : 688.1 MB/s (10.6%)
C 2-pass copy : 282.2 MB/s (5.0%)
C 2-pass copy prefetched (32 bytes step) : 405.6 MB/s (7.2%)
C 2-pass copy prefetched (64 bytes step) : 423.7 MB/s (7.8%)
C fill : 801.9 MB/s (9.9%)
C fill (shuffle within 16 byte blocks) : 803.3 MB/s (10.9%)
C fill (shuffle within 32 byte blocks) : 485.8 MB/s (0.1%)
C fill (shuffle within 64 byte blocks) : 486.7 MB/s
---
standard memcpy : 363.4 MB/s (4.0%)
standard memset : 590.0 MB/s
---
NEON read : 491.7 MB/s (0.2%)
NEON read prefetched (32 bytes step) : 966.1 MB/s (0.9%)
NEON read prefetched (64 bytes step) : 1018.3 MB/s (0.4%)
NEON read 2 data streams : 470.8 MB/s
NEON read 2 data streams prefetched (32 bytes step) : 965.1 MB/s (1.2%)
NEON read 2 data streams prefetched (64 bytes step) : 1010.6 MB/s (1.1%)
NEON copy : 298.8 MB/s (5.0%)
NEON copy prefetched (32 bytes step) : 704.3 MB/s (8.8%)
NEON copy prefetched (64 bytes step) : 728.3 MB/s (9.9%)
NEON unrolled copy : 263.9 MB/s
NEON unrolled copy prefetched (32 bytes step) : 421.9 MB/s (6.8%)
NEON unrolled copy prefetched (64 bytes step) : 655.9 MB/s (7.9%)
NEON copy backwards : 296.8 MB/s (4.5%)
NEON copy backwards prefetched (32 bytes step) : 699.8 MB/s (9.9%)
NEON copy backwards prefetched (64 bytes step) : 724.9 MB/s (9.6%)
NEON 2-pass copy : 291.1 MB/s (4.9%)
NEON 2-pass copy prefetched (32 bytes step) : 352.5 MB/s
NEON 2-pass copy prefetched (64 bytes step) : 441.0 MB/s (7.3%)
NEON unrolled 2-pass copy : 272.1 MB/s (3.9%)
NEON unrolled 2-pass copy prefetched (32 bytes step) : 364.0 MB/s (5.8%)
NEON unrolled 2-pass copy prefetched (64 bytes step) : 409.6 MB/s (6.4%)
NEON fill : 803.4 MB/s (10.7%)
NEON fill backwards : 803.1 MB/s (9.9%)
VFP copy : 291.8 MB/s (4.2%)
VFP 2-pass copy : 265.1 MB/s (3.5%)
ARM fill (STRD) : 797.1 MB/s (12.0%)
ARM fill (STM with 8 registers) : 803.4 MB/s (10.9%)
ARM fill (STM with 4 registers) : 803.1 MB/s (11.0%)
ARM copy prefetched (incr pld) : 677.5 MB/s (9.3%)
ARM copy prefetched (wrap pld) : 656.5 MB/s (8.7%)
ARM 2-pass copy prefetched (incr pld) : 411.3 MB/s (6.0%)
ARM 2-pass copy prefetched (wrap pld) : 409.7 MB/s (7.5%)

==========================================================================
== Framebuffer read tests. ==
== ==
== Many ARM devices use a part of the system memory as the framebuffer, ==
== typically mapped as uncached but with write-combining enabled. ==
== Writes to such framebuffers are quite fast, but reads are much ==
== slower and very sensitive to the alignment and the selection of ==
== CPU instructions which are used for accessing memory. ==
== ==
== Many x86 systems allocate the framebuffer in the GPU memory, ==
== accessible for the CPU via a relatively slow PCI-E bus. Moreover, ==
== PCI-E is asymmetric and handles reads a lot worse than writes. ==
== ==
== If uncached framebuffer reads are reasonably fast (at least 100 MB/s ==
== or preferably >300 MB/s), then using the shadow framebuffer layer ==
== is not necessary in Xorg DDX drivers, resulting in a nice overall ==
== performance improvement. For example, the xf86-video-fbturbo DDX ==
== uses this trick. ==
==========================================================================

NEON read (from framebuffer) : 3360.0 MB/s
NEON copy (from framebuffer) : 2239.3 MB/s (6.1%)
NEON 2-pass copy (from framebuffer) : 1385.1 MB/s (3.9%)
NEON unrolled copy (from framebuffer) : 1778.4 MB/s (1.4%)
NEON 2-pass unrolled copy (from framebuffer) : 1037.3 MB/s (1.2%)
VFP copy (from framebuffer) : 1925.6 MB/s (1.4%)
VFP 2-pass copy (from framebuffer) : 1108.7 MB/s (1.2%)
ARM copy (from framebuffer) : 2970.2 MB/s (1.0%)
ARM 2-pass copy (from framebuffer) : 1438.9 MB/s (1.2%)

==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can't handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==
==========================================================================

block size : single random read / dual random read
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 4.1 ns / 7.5 ns
131072 : 6.4 ns / 10.8 ns
262144 : 7.6 ns / 12.3 ns
524288 : 10.1 ns / 15.8 ns
1048576 : 76.4 ns / 118.3 ns
2097152 : 115.2 ns / 155.2 ns
4194304 : 135.4 ns / 168.1 ns
8388608 : 147.6 ns / 175.9 ns
16777216 : 155.2 ns / 182.6 ns
33554432 : 163.6 ns / 195.3 ns
67108864 : 176.1 ns / 217.5 ns