---
title: ARM Scalable Vector Extension and Machine Learning
author:
- Francesco Petrogalli
date: Jun 2017
...

# Introduction

In this document we present some code examples in C that show how to
vectorize some of the algorithms that are part of the core
functionality of most machine learning systems.

Unlike previous vectorization techniques, this document provides examples
written with the *Vector Length Agnostic (VLA)* approach
introduced by the *Scalable Vector Extension (SVE)*.

SVE is a new vector extension of the AArch64 execution mode of the A64
instruction set of the ARMv8 architecture. The defining feature of the
extension is that it does not fix the size of the vector registers,
but instead constrains it to a minimum of 128 bits and a maximum of
2048 bits, in 128-bit wide units. Most of the instructions of the
extension also use predicate registers to mask lanes for operating on
partial vectors. The new instruction set also provides gather loads
and scatter stores, plus truncating stores and signed/unsigned
extending loads.

A variety of documents describing the architecture extension are available:

* the SVE architecture reference manual (get link or biblio), which
defines the instruction set and the new registers in detail.
* a sneak peek into SVE and Vector Length Agnostic programming, a
whitepaper with assembly examples of loops vectorized with the SVE
instructions.
* the ATG & R&D & MDC paper http://ieeexplore.ieee.org/document/7924233/

This document focuses on the C/C++ interface for SVE that is
provided via the SVE ACLE.

In particular, the paper shows how VLA techniques can be used to
efficiently vectorize *GEMM* and *low precision GEMM* computational
kernels.

# SVE ACLE

The SVE ACLE (or ACLE hereafter) is a set of functions and types to be
used in C and C++ code that expose the vectorization capabilities of
SVE.

It introduces a set of *sizeless* types and *functions* that map to
the SVE registers and instructions. The function-to-instruction
mappings are not one-to-one, as some of the architectural details of
the instruction set can be resolved by the compiler. For example, there
is no need to expose at the C level some of the addressing modes of the
loads and stores.

The ACLE defines a set of sizeless data types of the form
``sv[type]``, where ``sv`` stands for *Scalable Vector* and `type` can
be any of the scalar types supported by the lanes of the SVE
vectors. The types cover SVE vectors consisting of 8, 16, 32 and 64
bit lanes for signed and unsigned integral types, and 16, 32 and 64
bit lanes for floating point types:

* ``sv[u]int[8|16|32|64]_t``;
* ``svfloat[16|32|64]_t``.

An additional ``svbool_t`` type is defined to represent predicates for
masking operations. The predicate type carries one bit for each
byte in the data types.

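As a quick illustration, the sketch below declares a few of these
sizeless types and an all-true predicate; the function name and the
initial values are hypothetical.

```{.c .numberLines}
#include <arm_sve.h>

// A minimal sketch of the sizeless ACLE types. Their size depends on
// the hardware vector length, so they cannot be used with sizeof or
// stored in structures.
void sizeless_types_example(void) {
  svint8_t    bytes   = svdup_s8(1);     // signed 8-bit integer lanes
  svuint32_t  words   = svdup_u32(2);    // unsigned 32-bit integer lanes
  svfloat64_t doubles = svdup_f64(3.0);  // 64-bit floating point lanes
  svbool_t    pg      = svptrue_b8();    // predicate: one bit per byte
  (void)bytes; (void)words; (void)doubles; (void)pg;
}
```
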
The intrinsic functions provided by the SVE ACLE are of the form:

``svbase[_disambiguator][_type0][_type1]...[_predication]``

For example, the name of the intrinsic

``svuint16_t svadd_n_u16_m(svbool_t pg, svuint16_t op1, uint16_t op2)``

is described as a vector *addition* (``add``) of *unsigned 16-bit
integers* (``u16``), where one of the arguments is a scalar (``_n``)
and the predication mode is *merging* (``_m``).

Some of the functions, like loads and stores, have a different form
for the names, with additional parts that specify the addressing
mode. For example, the function

``svint32_t svld1_gather_u32base_offset_s32(svbool_t pg, svuint32_t bases, int64_t offset)``

is a *gather load* (``ld1_gather``) of *signed 32-bit integers*
(``_s32``) from a vector of *unsigned 32-bit integer* base addresses
(``_u32base``) plus an *offset in bytes* (``_offset``).

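A minimal (hypothetical) wrapper around this gather load could look as
follows; the function and parameter names are illustrative only.

```{.c .numberLines}
#include <arm_sve.h>

// Sketch: gather 32-bit signed integers from a vector of 32-bit base
// addresses, all displaced by the same byte offset. Inactive lanes of
// 'pg' are not loaded.
svint32_t gather_at(svbool_t pg, svuint32_t bases, int64_t byte_offset) {
  return svld1_gather_u32base_offset_s32(pg, bases, byte_offset);
}
```
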
The SVE ACLE is compatible with C++ overloading and C ``_Generic``
association, so that the names can be contracted by removing those parts
that can be derived from the argument types. For example,
``svadd_n_u16_m`` can be contracted to ``svadd_m``, and
``svld1_gather_u32base_offset_s32`` can be contracted to
``svld1_gather_offset``.

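The sketch below shows the full and the contracted spelling of the same
addition; both should map to the same instruction (the wrapper names are
hypothetical).

```{.c .numberLines}
#include <arm_sve.h>

// Sketch: full versus contracted intrinsic names. The contracted
// (overloaded) form infers the element type and the scalar (_n)
// variant from the argument types.
svuint16_t add_full(svbool_t pg, svuint16_t a, uint16_t b) {
  return svadd_n_u16_m(pg, a, b);   // full name
}
svuint16_t add_short(svbool_t pg, svuint16_t a, uint16_t b) {
  return svadd_m(pg, a, b);         // contracted name
}
```
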
The naming convention of the functions in the SVE ACLE is described in
detail in section 4 of the SVE ACLE document [@SVEACLE].

The examples in this document use the short form.

The first example shows how to generate VLA C code using the SVE ACLE.

Consider the following loop.^[This and all the examples in the text
assume that no aliasing is happening in the pointer parameters of the
functions.]

```{.c .numberLines}
void add_arrays(double *dst, double *src, double c, const int N) {
  for (int i = 0; i < N; i++)
    dst[i] = src[i] + c;
}
```

With the SVE ACLE, the loop can be rewritten as follows:

```{.c .numberLines}
#include <arm_sve.h>

void vla_add_arrays(double *dst, double *src, double c, const int N) {
  svfloat64_t vc = svdup_f64(c);
  for (int i = 0; i < N; i += svcntd()) {
    svbool_t Pg = svwhilelt_b64(i, N);
    svfloat64_t vsrc = svld1(Pg, &src[i]);
    svfloat64_t vdst = svadd_x(Pg, vsrc, vc);
    svst1(Pg, &dst[i], vdst);
  }
}
```

The vector version works as follows.

First, the constant `c` is loaded into a vector `vc` with the
``svdup_f64`` function. Notice that the ``_f64`` part of the name is
required, as standard C scalar promotion does not allow contracting
the name of a function processing only scalar arguments (see section 4.2
of [@SVEACLE] for a detailed explanation).

Then, the header of the vector loop is issued. As the number of lanes
that one iteration of the vector loop processes is unknown at compile
time, the induction variable ``i`` needs to be incremented dynamically
with the ``svcntd()`` function, which returns what we call ``VL.D``,
that is, the number of 64-bit (double-word) lanes in an SVE vector.

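For reference, the ACLE provides one such counting function per lane
width; a minimal sketch (the wrapper names are hypothetical):

```{.c .numberLines}
#include <arm_sve.h>
#include <stdint.h>

// Sketch: element-count intrinsics. On a 256-bit implementation these
// return 32, 16, 8 and 4 respectively; the values are only known at
// run time.
uint64_t lanes_b(void) { return svcntb(); } // 8-bit (byte) lanes
uint64_t lanes_h(void) { return svcnth(); } // 16-bit (half-word) lanes
uint64_t lanes_w(void) { return svcntw(); } // 32-bit (word) lanes
uint64_t lanes_d(void) { return svcntd(); } // 64-bit (double-word) lanes, VL.D
```
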
In the body of the loop, the predicate ``Pg`` is set with the
``svwhilelt_b64`` function. This function builds a predicate by testing
the ``i < N`` inequality for all values of ``i`` starting from the
entry ``i``, incremented by one, up to the number ``i + VL.D -
1``. The 64-bit lane view is chosen by the ``_b64`` part of the
intrinsic, which cannot be contracted, just as in the ``svdup`` case.

On a 256-bit implementation, the value of the predicate ``Pg`` for a
loop with ``N = 7`` would look as follows in the second iteration of the loop:

```
      MSB                           LSB
Pg = [00000000 00000001 00000001 00000001]
         7        6        5        4     64-bit lane index 'i'
```

Using the predicate ``Pg`` effectively avoids the need to take care
of the remainder of the loop that would not fit in a full vector, as
is necessary on traditional vector architectures. The following graphic
illustrates the vector loop behaviour for the 256-bit example with ``N
= 7``.

![Predication.](loop-iterations.svg)


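To make the lane counts concrete, the sketch below prints the number of
active lanes per iteration for ``N = 7``; on a 256-bit implementation it
would print 4 and then 3.

```{.c .numberLines}
#include <arm_sve.h>
#include <stdio.h>

// Sketch: count the active lanes of the loop predicate for N = 7.
// svcntp_b64 counts the true 64-bit lanes of Pg.
int main(void) {
  const int N = 7;
  for (int i = 0; i < N; i += svcntd()) {
    svbool_t Pg = svwhilelt_b64(i, N);
    printf("iteration at i = %d: %d active lanes\n",
           i, (int)svcntp_b64(svptrue_b64(), Pg));
  }
  return 0;
}
```
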
# ML algorithms with VLA SVE

## Matrix multiplications

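A minimal VLA sketch of the idea (not a tuned GEMM kernel): vectorize the
innermost loop of a naive row-major, double-precision matrix multiply over
the ``N`` dimension, accumulating with a predicated fused multiply-add.

```{.c .numberLines}
#include <arm_sve.h>

// Sketch: C[M][N] += A[M][K] * B[K][N], row-major, vectorized over N.
// svmla_x computes vc + va * vb on the active lanes.
void vla_matmul(double *C, const double *A, const double *B,
                int M, int N, int K) {
  for (int m = 0; m < M; m++)
    for (int k = 0; k < K; k++) {
      svfloat64_t va = svdup_f64(A[m * K + k]); // broadcast one A element
      for (int n = 0; n < N; n += svcntd()) {
        svbool_t Pg = svwhilelt_b64(n, N);
        svfloat64_t vb = svld1(Pg, &B[k * N + n]);
        svfloat64_t vc = svld1(Pg, &C[m * N + n]);
        svst1(Pg, &C[m * N + n], svmla_x(Pg, vc, va, vb));
      }
    }
}
```
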
## Dot products

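A minimal VLA sketch of a double-precision dot product, assuming the
reduction can be done once after the loop with a horizontal add:

```{.c .numberLines}
#include <arm_sve.h>

// Sketch: accumulate per-lane partial products, then reduce.
// svmla_m merges, so inactive lanes keep their partial sums.
double vla_dot(const double *x, const double *y, int N) {
  svfloat64_t acc = svdup_f64(0.0);
  for (int i = 0; i < N; i += svcntd()) {
    svbool_t Pg = svwhilelt_b64(i, N);
    svfloat64_t vx = svld1(Pg, &x[i]);
    svfloat64_t vy = svld1(Pg, &y[i]);
    acc = svmla_m(Pg, acc, vx, vy);
  }
  return svaddv(svptrue_b64(), acc); // horizontal sum of all lanes
}
```
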
## Examples

## Some best practices

### FDIV and FDIVR

There’s no point in using `svdivr_x` rather than `svdiv_x` if both
operands are vectors, since the `_x` forms are there to allow the
compiler to use `svdivr` in cases where that’s better.

Instead of `svdiv_f32_x(ptrue32, vcast_vf_f(1.0f), d)` one could use
`svdivr_n_f32_x(ptrue32, d, 1.0f)`.

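A sketch of the two spellings of ``1.0f / d``; ``ptrue32`` is assumed to
be an all-true 32-bit predicate, and the wrapper names are illustrative:

```{.c .numberLines}
#include <arm_sve.h>

// Sketch: reciprocal of every lane of d, written with FDIV and FDIVR.
svfloat32_t recip_div(svbool_t ptrue32, svfloat32_t d) {
  return svdiv_x(ptrue32, svdup_f32(1.0f), d);  // 1.0 / d, explicit dup
}
svfloat32_t recip_divr(svbool_t ptrue32, svfloat32_t d) {
  return svdivr_n_f32_x(ptrue32, d, 1.0f);      // reversed operands, _n form
}
```
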
### About `_n` forms

The `_n` forms are really just there for convenience; any decent
compiler should handle an explicit dup in the same way.

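For instance, the two functions below should compile to the same code;
the names and the constant are illustrative only:

```{.c .numberLines}
#include <arm_sve.h>

// Sketch: the _n convenience form and the equivalent explicit dup.
svfloat64_t scale_n(svbool_t pg, svfloat64_t v) {
  return svmul_n_f64_x(pg, v, 2.0);
}
svfloat64_t scale_dup(svbool_t pg, svfloat64_t v) {
  return svmul_f64_x(pg, v, svdup_f64(2.0));
}
```
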
### Unpacked integer vectors example

How can we provide a vector version of a function with an `int(double)` signature?

When vectorizing this function, the input/output vector types will be
`svfloat64_t` and `svint32_t` respectively.

Predication can be used to treat the 32-bit integer result vector as an
unpacked vector, in which only the even lanes hold the original data.

Suppose the function does the following:

```
int foo(double x) {
  // some code that returns an int
}
// code code code, and then a loop body that does
// ...
y[i] = foo(x[i]);
// where x, y are as needed to be
```

The vector version of the loop will be (assuming no tail loop
predication):

```
svint32_t vector_foo(svfloat64_t vx) {
  // some code that returns the int results in the even 32-bit lanes
}
// code code code, and then a vector loop body that does
// ...
svfloat64_t vx = svld1_f64(svptrue_b64(), x + i);
svint32_t vy = vector_foo(vx);
// store only the even lanes of the integer vector
svst1w_s64(svptrue_b64(), y + i, svreinterpret_s64_s32(vy));
```
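
A possible body for `vector_foo`, assuming the scalar `foo` simply
truncates the double to an int; the conversion leaves its 32-bit results
in the even lanes, which is the layout the truncating store above expects:

```{.c .numberLines}
#include <arm_sve.h>

// Hypothetical sketch: convert each 64-bit lane to a 32-bit integer.
// The results land in the even 32-bit lanes of the svint32_t vector.
svint32_t vector_foo(svfloat64_t vx) {
  return svcvt_s32_f64_x(svptrue_b64(), vx);
}
```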