---
title: ARM Scalable Vector Extension and Machine Learning
author:
- Francesco Petrogalli
date: Jun 2017
...

# Introduction

In this document we present some code examples in C that show how to
vectorize some of the algorithms that are part of the core
functionality of most machine learning systems.

Unlike previous vectorization techniques, this document provides examples
written with the *Vector Length Agnostic (VLA)* approach
introduced by the *Scalable Vector Extension (SVE)*.

SVE is a new vector extension of the AArch64 execution mode of the A64
instruction set of the ARMv8 architecture. The defining feature of the
extension is that it does not fix the size of the vector registers,
but instead constrains it to a minimum of 128 bits and a maximum of
2048 bits, in 128-bit wide units. Most of the instructions of the
extension also use predicate registers to mask lanes for operating on
partial vectors. The new instruction set also provides gather loads
and scatter stores, plus truncating stores and signed/unsigned
extending loads.

A variety of documents describing the architecture extension are available:

* the SVE architecture reference manual (get link or biblio), which
defines the instruction set and the new registers in detail.
* a sneak peek into SVE and Vector Length Agnostic programming, a
whitepaper with assembly examples of loops vectorized with the SVE
instructions.
* the ATG & R&D & MDC paper http://ieeexplore.ieee.org/document/7924233/

This document focuses on the C/C++ interface for SVE that is
provided via the SVE ACLE.

In particular, the paper shows how VLA techniques can be used to
efficiently vectorize *GEMM* and *low precision GEMM* computational
kernels.

# SVE ACLE

The SVE ACLE (or ACLE hereafter) is a set of functions and types to be
used in C and C++ code that expose the vectorization capabilities of
SVE.

It introduces a set of *sizeless* types and *functions* that map to
the SVE registers and instructions. The function-to-instruction
mappings are not one-to-one, as some of the architectural details of
the instruction set can be resolved by the compiler. For example, there
is no need to expose at the C level some of the addressing modes of the
loads and stores.

The ACLE defines a set of sizeless data types of the form
``sv[type]``, where ``sv`` stands for *Scalable Vector* and `type` can
be any of the scalar types supported by the lanes of the SVE
vectors. The types cover SVE vectors consisting of 8, 16, 32 and 64
bit lanes for signed and unsigned integral types, and 16, 32 and 64
bit lanes for floating point types:

* ``sv[u]int[8|16|32|64]_t``;
* ``svfloat[16|32|64]_t``.

An additional ``svbool_t`` type is defined to represent predicates for
masking operations. The predicate type carries one bit for each
byte in the data types.

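As a quick illustration, the sketch below declares a few of these
sizeless types and an all-true predicate; the function name and the
initial values are hypothetical.

```{.c .numberLines}
#include <arm_sve.h>

// A minimal sketch of the sizeless ACLE types. Their size depends on
// the hardware vector length, so they cannot be used with sizeof or
// stored in structures.
void sizeless_types_example(void) {
  svint8_t    bytes   = svdup_s8(1);     // signed 8-bit integer lanes
  svuint32_t  words   = svdup_u32(2);    // unsigned 32-bit integer lanes
  svfloat64_t doubles = svdup_f64(3.0);  // 64-bit floating point lanes
  svbool_t    pg      = svptrue_b8();    // predicate: one bit per byte
  (void)bytes; (void)words; (void)doubles; (void)pg;
}
```
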
The intrinsic functions provided by the SVE ACLE are of the form:

``svbase[_disambiguator][_type0][_type1]...[_predication]``

For example, the name of the intrinsic

``svuint16_t svadd_n_u16_m(svbool_t pg, svuint16_t op1, uint16_t op2)``

is described as a vector *addition* (``add``) of *unsigned 16-bit
integers* (``u16``), where one of the arguments is a scalar (``_n``)
and the predication mode is *merging* (``_m``).

Some of the functions, like loads and stores, have a different form
for the names, with additional parts that specify the addressing
mode. For example, the function

``svint32_t svld1_gather_u32base_offset_s32(svbool_t pg, svuint32_t bases, int64_t offset)``

is a *gather load* (``ld1_gather``) of *signed 32-bit integers*
(``_s32``) from a vector of *unsigned 32-bit integer* base addresses
(``_u32base``) plus an *offset in bytes* (``_offset``).

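A minimal (hypothetical) wrapper around this gather load could look as
follows; the function and parameter names are illustrative only.

```{.c .numberLines}
#include <arm_sve.h>

// Sketch: gather 32-bit signed integers from a vector of 32-bit base
// addresses, all displaced by the same byte offset. Inactive lanes of
// 'pg' are not loaded.
svint32_t gather_at(svbool_t pg, svuint32_t bases, int64_t byte_offset) {
  return svld1_gather_u32base_offset_s32(pg, bases, byte_offset);
}
```
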
The SVE ACLE is compatible with C++ overloading and C ``_Generic``
association, so that the names can be contracted by removing those parts
that can be derived from the argument types. For example,
``svadd_n_u16_m`` can be contracted to ``svadd_m``, and
``svld1_gather_u32base_offset_s32`` can be contracted to
``svld1_gather_offset``.

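The sketch below shows the full and the contracted spelling of the same
addition; both should map to the same instruction (the wrapper names are
hypothetical).

```{.c .numberLines}
#include <arm_sve.h>

// Sketch: full versus contracted intrinsic names. The contracted
// (overloaded) form infers the element type and the scalar (_n)
// variant from the argument types.
svuint16_t add_full(svbool_t pg, svuint16_t a, uint16_t b) {
  return svadd_n_u16_m(pg, a, b);   // full name
}
svuint16_t add_short(svbool_t pg, svuint16_t a, uint16_t b) {
  return svadd_m(pg, a, b);         // contracted name
}
```
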
The naming convention of the functions in the SVE ACLE is described in
detail in section 4 of the SVE ACLE document [@SVEACLE].

The examples in this document use the short form.

The first example shows how to generate VLA C code using the SVE ACLE.

Consider the following loop.^[This and all the examples in the text
assume that no aliasing is happening in the pointer parameters of the
functions.]

```{.c .numberLines}
void add_arrays(double *dst, double *src, double c, const int N) {
  for (int i = 0; i < N; i++)
    dst[i] = src[i] + c;
}
```

With the SVE ACLE, the loop can be rewritten as follows:

```{.c .numberLines}
#include <arm_sve.h>

void vla_add_arrays(double *dst, double *src, double c, const int N) {
  svfloat64_t vc = svdup_f64(c);
  for (int i = 0; i < N; i += svcntd()) {
    svbool_t Pg = svwhilelt_b64(i, N);
    svfloat64_t vsrc = svld1(Pg, &src[i]);
    svfloat64_t vdst = svadd_x(Pg, vsrc, vc);
    svst1(Pg, &dst[i], vdst);
  }
}
```

The vector version works as follows.

First, the constant `c` is loaded into a vector `vc` with the
``svdup_f64`` function. Notice that the ``_f64`` part of the name is
required, as standard C scalar promotion does not allow contracting
the name of a function processing only scalar arguments (see section 4.2
of [@SVEACLE] for a detailed explanation).

Then, the header of the vector loop is issued. As the number of lanes
that one iteration of the vector loop processes is unknown at compile
time, the induction variable ``i`` needs to be incremented dynamically
with the ``svcntd()`` function, which returns what we call ``VL.D``,
that is, the number of 64-bit (double-word) lanes in an SVE vector.

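For reference, the ACLE provides one such counting function per lane
width; a minimal sketch (the wrapper names are hypothetical):

```{.c .numberLines}
#include <arm_sve.h>
#include <stdint.h>

// Sketch: element-count intrinsics. On a 256-bit implementation these
// return 32, 16, 8 and 4 respectively; the values are only known at
// run time.
uint64_t lanes_b(void) { return svcntb(); } // 8-bit (byte) lanes
uint64_t lanes_h(void) { return svcnth(); } // 16-bit (half-word) lanes
uint64_t lanes_w(void) { return svcntw(); } // 32-bit (word) lanes
uint64_t lanes_d(void) { return svcntd(); } // 64-bit (double-word) lanes, VL.D
```
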
In the body of the loop, the predicate ``Pg`` is set with the
``svwhilelt_b64`` function. This function builds a predicate by testing
the ``i < N`` inequality for all values of ``i`` starting from the
entry ``i``, incremented by one, up to the number ``i + VL.D -
1``. The 64-bit lane view is chosen by the ``_b64`` part of the
intrinsic, which cannot be contracted, just as in the ``svdup`` case.

On a 256-bit implementation, the value of the predicate ``Pg`` for a
loop with ``N = 7`` would look as follows in the second iteration of the loop:

```
      MSB                           LSB
Pg = [00000000 00000001 00000001 00000001]
         7        6        5        4     64-bit lane index 'i'
```

Using the predicate ``Pg`` effectively avoids the need to take care
of the remainder of the loop that would not fit in a full vector, as
is necessary on traditional vector architectures. The following graphic
illustrates the vector loop behaviour for the 256-bit example with ``N
= 7``.

![Predication.](loop-iterations.svg)


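To make the lane counts concrete, the sketch below prints the number of
active lanes per iteration for ``N = 7``; on a 256-bit implementation it
would print 4 and then 3.

```{.c .numberLines}
#include <arm_sve.h>
#include <stdio.h>

// Sketch: count the active lanes of the loop predicate for N = 7.
// svcntp_b64 counts the true 64-bit lanes of Pg.
int main(void) {
  const int N = 7;
  for (int i = 0; i < N; i += svcntd()) {
    svbool_t Pg = svwhilelt_b64(i, N);
    printf("iteration at i = %d: %d active lanes\n",
           i, (int)svcntp_b64(svptrue_b64(), Pg));
  }
  return 0;
}
```
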
# ML algorithms with VLA SVE

## Matrix multiplications

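A minimal VLA sketch of the idea (not a tuned GEMM kernel): vectorize the
innermost loop of a naive row-major, double-precision matrix multiply over
the ``N`` dimension, accumulating with a predicated fused multiply-add.

```{.c .numberLines}
#include <arm_sve.h>

// Sketch: C[M][N] += A[M][K] * B[K][N], row-major, vectorized over N.
// svmla_x computes vc + va * vb on the active lanes.
void vla_matmul(double *C, const double *A, const double *B,
                int M, int N, int K) {
  for (int m = 0; m < M; m++)
    for (int k = 0; k < K; k++) {
      svfloat64_t va = svdup_f64(A[m * K + k]); // broadcast one A element
      for (int n = 0; n < N; n += svcntd()) {
        svbool_t Pg = svwhilelt_b64(n, N);
        svfloat64_t vb = svld1(Pg, &B[k * N + n]);
        svfloat64_t vc = svld1(Pg, &C[m * N + n]);
        svst1(Pg, &C[m * N + n], svmla_x(Pg, vc, va, vb));
      }
    }
}
```
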
## Dot products

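A minimal VLA sketch of a double-precision dot product, assuming the
reduction can be done once after the loop with a horizontal add:

```{.c .numberLines}
#include <arm_sve.h>

// Sketch: accumulate per-lane partial products, then reduce.
// svmla_m merges, so inactive lanes keep their partial sums.
double vla_dot(const double *x, const double *y, int N) {
  svfloat64_t acc = svdup_f64(0.0);
  for (int i = 0; i < N; i += svcntd()) {
    svbool_t Pg = svwhilelt_b64(i, N);
    svfloat64_t vx = svld1(Pg, &x[i]);
    svfloat64_t vy = svld1(Pg, &y[i]);
    acc = svmla_m(Pg, acc, vx, vy);
  }
  return svaddv(svptrue_b64(), acc); // horizontal sum of all lanes
}
```
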
## Examples

## Some best practices

### FDIV and FDIVR

There’s no point in using `svdivr_x` rather than `svdiv_x` if both
operands are vectors, since the `_x` forms are there to allow the
compiler to use `svdivr` in cases where that’s better.

Instead of `svdiv_f32_x(ptrue32, vcast_vf_f(1.0f), d)` one could use
`svdivr_n_f32_x(ptrue32, d, 1.0f)`.

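A sketch of the two spellings of ``1.0f / d``; ``ptrue32`` is assumed to
be an all-true 32-bit predicate, and the wrapper names are illustrative:

```{.c .numberLines}
#include <arm_sve.h>

// Sketch: reciprocal of every lane of d, written with FDIV and FDIVR.
svfloat32_t recip_div(svbool_t ptrue32, svfloat32_t d) {
  return svdiv_x(ptrue32, svdup_f32(1.0f), d);  // 1.0 / d, explicit dup
}
svfloat32_t recip_divr(svbool_t ptrue32, svfloat32_t d) {
  return svdivr_n_f32_x(ptrue32, d, 1.0f);      // reversed operands, _n form
}
```
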
### About `_n` forms

The `_n` forms are really just there for convenience; any decent
compiler should handle an explicit dup in the same way.

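For instance, the two functions below should compile to the same code;
the names and the constant are illustrative only:

```{.c .numberLines}
#include <arm_sve.h>

// Sketch: the _n convenience form and the equivalent explicit dup.
svfloat64_t scale_n(svbool_t pg, svfloat64_t v) {
  return svmul_n_f64_x(pg, v, 2.0);
}
svfloat64_t scale_dup(svbool_t pg, svfloat64_t v) {
  return svmul_f64_x(pg, v, svdup_f64(2.0));
}
```
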
### Unpacked integer vectors example

How can we provide a vector version of a function with an `int(double)` signature?

When vectorizing this function, the input/output vector types will be
`svfloat64_t` and `svint32_t` respectively.

Predication can be used to treat the 32-bit integer result vector as an
unpacked vector, in which only the even lanes hold the original data.

Suppose the function does the following:

```
int foo(double x) {
  // some code that returns an int
}
// code code code, and then a loop body that does
// ...
y[i] = foo(x[i]);
// where x, y are as needed to be
```

The vector version of the loop will be (assuming no tail loop
predication):

```
svint32_t vector_foo(svfloat64_t vx) {
  // some code that returns the int results in the even 32-bit lanes
}
// code code code, and then a vector loop body that does
// ...
svfloat64_t vx = svld1_f64(svptrue_b64(), x + i);
svint32_t vy = vector_foo(vx);
// store only the even lanes of the integer vector
svst1w_s64(svptrue_b64(), y + i, svreinterpret_s64_s32(vy));
```
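
A possible body for `vector_foo`, assuming the scalar `foo` simply
truncates the double to an int; the conversion leaves its 32-bit results
in the even lanes, which is the layout the truncating store above expects:

```{.c .numberLines}
#include <arm_sve.h>

// Hypothetical sketch: convert each 64-bit lane to a 32-bit integer.
// The results land in the even 32-bit lanes of the svint32_t vector.
svint32_t vector_foo(svfloat64_t vx) {
  return svcvt_s32_f64_x(svptrue_b64(), vx);
}
```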