---
title: ARM Scalable Vector Extension and Machine Learning
author:
- Francesco Petrogalli
date: Jun 2017
...

# Introduction

In this document we present some code examples in C that show how to
vectorize some of the algorithms that are part of the core
functionality of most machine learning systems.

Unlike code written for earlier vector extensions, the examples in this
document are written with the *Vector Length Agnostic (VLA)* approach
introduced by the *Scalable Vector Extension (SVE)*.

SVE is a new vector extension to the A64 instruction set, which is used
in the AArch64 execution state of the ARMv8 architecture. The defining
feature of the extension is that it does not fix the size of the vector
registers, but instead constrains it to a multiple of 128 bits, from a
minimum of 128 bits up to a maximum of 2048 bits. Most of the
instructions of the extension also use predicate registers to mask
lanes when operating on partial vectors. The new instruction set also
provides gather loads and scatter stores, plus truncating stores and
signed/unsigned extending loads.

A variety of documents describing the architecture extension is available:

* SVE architecture reference manual (get link or biblio), which
  defines the instruction set and the new registers in detail.
* *A sneak peek into SVE and vector length agnostic programming*, a
  whitepaper with assembly examples of loops vectorized with the SVE
  instructions.
* ATG & R&D & MDC paper http://ieeexplore.ieee.org/document/7924233/

This document focuses on the interface at C/C++ level for SVE that is
provided via the SVE ACLE.

In particular, the document shows how VLA techniques can be used to
efficiently vectorize *GEMM* and *low precision GEMM* computational
kernels.

# SVE ACLE

The SVE ACLE (or ACLE hereafter) is a set of functions and types to be
used in C and C++ code that expose the vectorization capabilities of
SVE.

The ACLE introduces a set of *sizeless* types and *functions* that map to
the SVE registers and instructions. The function-to-instruction
mappings are not one to one, as some of the architectural details of
the instruction set can be resolved by a compiler. For example, there
is no need to expose at C level some of the addressing modes of the
loads and stores.

The ACLE defines a set of sizeless data types in the form
``sv[type]``, where ``sv`` stands for *Scalable Vector* and `type` can
be any of the scalar types supported by the lanes of the SVE
vectors. The types cover SVE vectors consisting of 8, 16, 32 and 64
bit lanes for signed and unsigned integral types, and 16, 32 and 64
bit lanes for floating point types:

* ``sv[u]int[8|16|32|64]_t``;
* ``svfloat[16|32|64]_t``.

An additional ``svbool_t`` type is defined to represent predicates for
masking operations. The predicate type carries one bit for each
byte of the data vector.

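The one-bit-per-byte layout of a predicate can be illustrated with a scalar model in portable C. This is only a sketch: `pred_first_n` is a hypothetical helper, not an ACLE function, and a 64-bit integer stands in for the predicate of a vector of at most 512 bits. It assumes that each element is controlled by the lowest bit of its group of predicate bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model, not part of the ACLE: return the predicate
   bits that enable the first `active` elements of `elem_bytes` bytes each.
   There is one predicate bit per vector byte, and an element is controlled
   by the lowest bit of its group, so 64-bit elements use every 8th bit. */
uint64_t pred_first_n(unsigned active, unsigned elem_bytes) {
    assert(elem_bytes == 1 || elem_bytes == 2 ||
           elem_bytes == 4 || elem_bytes == 8);
    assert(active * elem_bytes <= 64); /* the model holds 64 predicate bits */
    uint64_t bits = 0;
    for (unsigned lane = 0; lane < active; lane++)
        bits |= (uint64_t)1 << (lane * elem_bytes);
    return bits;
}
```

With this model, enabling three 64-bit elements sets bits 0, 8 and 16, while enabling four 32-bit elements sets bits 0, 4, 8 and 12.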
The intrinsic functions provided by the SVE ACLE are in the form:

``svbase[_disambiguator][_type0][_type1]...[_predication]``

For example, the name of the intrinsic

``svuint16_t svadd_n_u16_m(svbool_t pg, svuint16_t op1, uint16_t op2)``

describes a vector *addition* (``add``) of *unsigned 16-bit
integers* (``u16``), where one of the arguments is a scalar (``_n``)
and the predication mode is *merging* (``_m``).

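The effect of merging predication can be sketched in portable C. The function below is a hypothetical scalar model of ``svadd_n_u16_m``, not the intrinsic itself: the vector is represented by an array of `lanes` elements and the predicate by one `bool` per lane. Inactive lanes take their value from the first vector operand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of svadd_n_u16_m over `lanes` elements:
   active lanes receive op1[i] + op2 (wrapping modulo 2^16),
   inactive lanes keep the value of op1[i] (merging predication). */
void add_n_u16_m_model(const bool *pg, const uint16_t *op1, uint16_t op2,
                       uint16_t *result, size_t lanes) {
    assert(pg && op1 && result);
    for (size_t i = 0; i < lanes; i++)
        result[i] = pg[i] ? (uint16_t)(op1[i] + op2) : op1[i];
}
```

For instance, with a predicate that enables only lanes 0 and 2, lanes 1 and 3 of the result are copies of the corresponding lanes of `op1`.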
Some of the functions, like loads and stores, have a different form
for the names, with additional parts that specify the addressing
mode. For example, the function

``svint32_t svld1_gather_u32base_offset_s32(svbool_t pg, svuint32_t bases, int64_t offset)``

is a *gather load* (``ld1_gather``) of *signed 32-bit integers*
(``_s32``) from a vector of *unsigned 32-bit integer* base addresses
(``_u32base``) plus an *offset in bytes* (``_offset``).

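The addressing performed by this gather load can be sketched with a scalar model in portable C. The function is hypothetical and only illustrates the semantics: the byte array `mem` stands in for the address space, so each base address is a byte index into it, and inactive lanes are set to zero, as for the other predicated loads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of svld1_gather_u32base_offset_s32.
   `mem` models the address space; bases[i] + offset is a byte index. */
void gather_u32base_offset_s32_model(const bool *pg, const uint8_t *mem,
                                     size_t mem_size, const uint32_t *bases,
                                     int64_t offset, int32_t *result,
                                     size_t lanes) {
    for (size_t i = 0; i < lanes; i++) {
        result[i] = 0;              /* inactive lanes of a load are zeroed */
        if (!pg[i])
            continue;
        int64_t addr = (int64_t)bases[i] + offset; /* offset is in bytes */
        assert(addr >= 0 && (uint64_t)addr + sizeof(int32_t) <= mem_size);
        memcpy(&result[i], mem + addr, sizeof(int32_t));
    }
}
```

Each active lane reads one 32-bit value from its own address, so the loaded elements do not need to be contiguous in memory.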
The SVE ACLE is compatible with C++ overloading and C ``_Generic``
association, so that the names can be contracted by removing those parts
that can be derived from the argument types. For example,
``svadd_n_u16_m`` can be contracted to ``svadd_m``, and
``svld1_gather_u32base_offset_s32`` can be contracted to
``svld1_gather_offset_s32``, where the ``_s32`` suffix has to stay
because the result type cannot be derived from the arguments.

The naming convention of the functions in the SVE ACLE is described in
detail in section 4 of the SVE ACLE document [@SVEACLE].

The examples in this document use the short form.

The first example shows how to write VLA C code using the SVE ACLE.
Consider the following loop.^[This and all the other examples in the text
assume that no aliasing is happening in the pointer parameters of the
functions.]

```{.c .numberLines}
void add_arrays(double *dst, double *src, double c, const int N) {
  for (int i = 0; i < N; i++)
    dst[i] = src[i] + c;
}
```

With the SVE ACLE, the loop can be rewritten as follows:

```{.c .numberLines}
#include <arm_sve.h>

void vla_add_arrays(double *dst, double *src, double c, const int N) {
  svfloat64_t vc = svdup_f64(c);            // broadcast c to all lanes
  for (int i = 0; i < N; i += svcntd()) {   // advance by VL.D lanes
    svbool_t Pg = svwhilelt_b64(i, N);      // lanes with i + lane < N
    svfloat64_t vsrc = svld1(Pg, &src[i]);
    svfloat64_t vdst = svadd_x(Pg, vsrc, vc);
    svst1(Pg, &dst[i], vdst);
  }
}
```

The vector version works as follows.

First, the constant `c` is broadcast to all the lanes of a vector `vc`
with the ``svdup_f64`` function. Notice that the ``_f64`` part in the name is
required, as standard C scalar promotion does not allow contracting
the names of functions that take only scalar arguments (see section 4.2
of [@SVEACLE] for a detailed explanation).

Then, the header of the vector loop is issued. As the number of lanes
that one iteration of the vector loop processes is unknown at compile
time, the induction variable ``i`` needs to be incremented dynamically
with the ``svcntd()`` function, which returns what we call ``VL.D``, that
is the number of 64-bit (double-word) lanes in an SVE vector.

In the body of the loop, the predicate ``Pg`` is set with the
``svwhilelt_b64`` function. This function builds a predicate by testing
the ``i < N`` inequality for all values of ``i`` starting from the
entry value of ``i``, incremented by one, up to ``i + VL.D - 1``.
The 64-bit lanes view is chosen by the ``_b64`` part of the
intrinsic name, which cannot be contracted, as for ``svdup_f64``.

On a 256-bit implementation, the value of the predicate ``Pg`` for a
loop with ``N = 7`` would look as follows in the second iteration of the loop:

```
      MSB                             LSB
Pg = [00000000 00000001 00000001 00000001]
         7        6        5        4      64-bit lanes index 'i'
```

Using the predicate ``Pg`` removes the need for a separate loop that
handles the remainder of the iterations that do not fill a full vector,
as required by traditional vector architectures. The following figure
illustrates the vector loop behaviour for the 256-bit example with
``N = 7``.

![Predication.](loop-iterations.svg)

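The vector-length-agnostic behaviour of the loop can also be checked without SVE hardware. The sketch below is a hypothetical scalar emulation of `vla_add_arrays`, in which the parameter `lanes` plays the role of ``svcntd()`` and the inner loop plays the role of one predicated vector operation. It is a model of the control flow only, not a replacement for the intrinsics.

```c
#include <assert.h>

/* Hypothetical scalar emulation of vla_add_arrays.
   `lanes` stands in for svcntd(), the number of 64-bit lanes per vector. */
void vla_add_arrays_model(double *dst, const double *src, double c,
                          int n, int lanes) {
    assert(lanes > 0);
    for (int i = 0; i < n; i += lanes) {
        for (int lane = 0; lane < lanes; lane++) {
            /* one element of the svwhilelt_b64(i, n) predicate */
            int active = (i + lane) < n;
            if (active) /* predicated load, add and store */
                dst[i + lane] = src[i + lane] + c;
        }
    }
}
```

Calling the model with `n = 7` and `lanes` equal to 2, 4 or 32 writes the same seven results and leaves the memory past the end of `dst` untouched, which is the property that makes a separate remainder loop unnecessary.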
# ML algorithms with VLA SVE

## Matrix multiplications

## Dot products

## Examples

## Some best practices

### FDIV and FDIVR

There’s no point in using `svdivr_x` rather than `svdiv_x` if both
operands are vectors, since the `_x` forms already allow the
compiler to use the reversed instruction (`FDIVR`) in cases where that’s
better.

When the dividend is a scalar, instead of
`svdiv_f32_x(ptrue32, vcast_vf_f(1.0f), d)` one could use
`svdivr_n_f32_x(ptrue32, d, 1.0f)`.

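The equivalence of the two forms follows from the operand order of the reversed division, which can be sketched with scalar models in portable C. Both functions are hypothetical illustrations that ignore predication: `div_model` computes `op1 / op2` per lane, and `divr_n_model` computes `op2 / op1` with a scalar `op2`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of svdiv on all-active lanes: op1 / op2. */
void div_model(const float *op1, const float *op2, float *res, size_t lanes) {
    assert(op1 && op2 && res);
    for (size_t i = 0; i < lanes; i++)
        res[i] = op1[i] / op2[i];
}

/* Hypothetical scalar model of svdivr with a scalar second operand:
   the operands are reversed, so each lane computes op2 / op1. */
void divr_n_model(const float *op1, float op2, float *res, size_t lanes) {
    assert(op1 && res);
    for (size_t i = 0; i < lanes; i++)
        res[i] = op2 / op1[i];
}
```

Dividing a vector of ones by `d` with `div_model` gives the same lanes as `divr_n_model(d, 1.0f, ...)`, without having to build the vector of ones first.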
### About `_n` forms

The `_n` forms are really just there for convenience; any decent
compiler should handle an explicit dup in the same way.

### Unpacked integer vectors example

How to provide a vector version of a function with `int(double)` signature?

When vectorizing this function, the input and output vector types will be
`svfloat64_t` and `svint32_t` respectively.

Predication can be used to treat the integer vector as an unpacked
vector, in which only the even 32-bit lanes hold data.

Suppose the function does the following:

```
int foo(double x) {
  // some code that returns an int
}
// code code code, and then a loop body that does
// ...
y[i] = foo(x[i]);
// where x and y point to double and int data respectively
```

The vector version of the loop will be (assuming no tail loop
predication):

```
svint32_t vector_foo(svfloat64_t vx) {
  // some code that returns an unpacked vector of ints,
  // with one result in each even 32-bit lane
}
// code code code, and then a vector loop body that does
// ...
svfloat64_t vx = svld1_f64(svptrue_b64(), x + i);
svint32_t vy = vector_foo(vx);
// store only the even lanes of the integer vector
svst1w_s64(svptrue_b64(), y + i, svreinterpret_s64_s32(vy));
```

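The data movement of the unpacked store can be sketched with a scalar model in portable C. Everything below is hypothetical and for illustration only: `foo_model` is an arbitrary stand-in for `foo` (it truncates its argument), each 64-bit container models one 64-bit lane of the vector, and the low half of the container models the even 32-bit lane that holds the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for foo: truncate a double to an int. */
static int foo_model(double x) { return (int)x; }

/* Model of vector_foo: one 64-bit container per input lane, with the
   32-bit result in its low half and the high half left at zero. */
void vector_foo_model(const double *vx, int64_t *vy, size_t lanes) {
    for (size_t i = 0; i < lanes; i++)
        vy[i] = (int64_t)(uint32_t)foo_model(vx[i]);
}

/* Model of svst1w_s64 with an all-true predicate: a truncating store that
   writes the low 32 bits of each 64-bit lane to consecutive int32_t. */
void st1w_s64_model(int32_t *base, const int64_t *data, size_t lanes) {
    for (size_t i = 0; i < lanes; i++) {
        uint32_t low = (uint32_t)data[i];
        memcpy(&base[i], &low, sizeof low);
    }
}
```

The truncating store writes one `int32_t` per 64-bit lane, so the results end up contiguous in `y` even though they occupy only every other 32-bit lane of the vector.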