# c - raspberry - vmlaq_f32

## How to use the multiply and accumulate intrinsics in ARM Cortex-a8? (2)

Googled for `vmlaq_f32`, which turned up the reference for the RVCT compiler tools. Here's what it says:

```
Vector multiply accumulate: vmla -> Vr[i] := Va[i] + Vb[i] * Vc[i]
...
float32x4_t vmlaq_f32 (float32x4_t a, float32x4_t b, float32x4_t c);
```

AND

The following types are defined to represent vectors. NEON vector data types are named according to the following pattern: `<type><size>x<number of lanes>_t`. For example, `int16x4_t` is a vector containing four lanes each containing a signed 16-bit integer. Table E.1 lists the vector data types.

IOW, the return value from the function will be a vector containing four 32-bit floats, and each element of the vector is calculated by multiplying the corresponding elements of `b` and `c`, and adding the corresponding element of `a`.
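To make the argument order concrete, here is a minimal sketch. The helper name `mla4` and the scalar fallback for non-NEON hosts are my own additions for illustration; only `vld1q_f32`, `vmlaq_f32` and `vst1q_f32` come from `<arm_neon.h>`.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = acc[i] + x[i] * y[i] for four floats. */
void mla4(const float acc[4], const float x[4], const float y[4], float out[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t a = vld1q_f32(acc);     /* accumulator: first argument */
    float32x4_t b = vld1q_f32(x);       /* first factor */
    float32x4_t c = vld1q_f32(y);       /* second factor */
    float32x4_t r = vmlaq_f32(a, b, c); /* r[i] = a[i] + b[i] * c[i] */
    vst1q_f32(out, r);                  /* the result is the return value */
#else
    /* Scalar fallback so the sketch also builds without NEON. */
    for (int i = 0; i < 4; i++)
        out[i] = acc[i] + x[i] * y[i];
#endif
}
```

So there are no separate "source" and "destination" parameters: all three arguments are inputs, and the destination is whatever you assign the return value to.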

HTH

How do I use the multiply-accumulate intrinsics provided by GCC?

`float32x4_t vmlaq_f32 (float32x4_t , float32x4_t , float32x4_t);`

Can anyone explain which three parameters I have to pass to this function? I mean, which are the source and destination registers, and what does the function return?

Help!!!

Simply said the vmla instruction does the following:

```
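/* Conceptual model only: the real float32x4_t comes from <arm_neon.h>. */
typedef struct
{
    float val[4];
} float32x4_t;

float32x4_t vmla (float32x4_t a, float32x4_t b, float32x4_t c)
{
    float32x4_t result;
    for (int i = 0; i < 4; i++)
    {
        /* multiply lane by lane, then add the accumulator lane */
        result.val[i] = b.val[i] * c.val[i] + a.val[i];
    }
    return result;
}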
```

And all this compiles into a single assembler instruction :-)

You can use this NEON-assembler intrinsic among other things in typical 4x4 matrix multiplications for 3D-graphics like this:

```
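#include <arm_neon.h>

/* matrix points to the four columns of a 4x4 matrix (column-major assumed) */
float32x4_t transform (float32x4_t * matrix, float32x4_t vector)
{
    /* in a perfect world this code would compile into just four instructions */
    float32x4_t result;
    /* each column is scaled by one element of the vector and accumulated */
    result = vmulq_n_f32 (matrix[0], vgetq_lane_f32 (vector, 0));
    result = vmlaq_n_f32 (result, matrix[1], vgetq_lane_f32 (vector, 1));
    result = vmlaq_n_f32 (result, matrix[2], vgetq_lane_f32 (vector, 2));
    result = vmlaq_n_f32 (result, matrix[3], vgetq_lane_f32 (vector, 3));
    return result;
}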
```

This saves a couple of cycles because you don't have to add the results after the multiplication. The addition is needed so often that multiply-accumulate instructions have become mainstream these days (even x86 has added them with its FMA instruction set extensions).
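As a rough sketch of what is being saved, the two hypothetical helpers below compute the same values, once as a separate multiply and add (`vmulq_f32` followed by `vaddq_f32`) and once as a single `vmlaq_f32`. The helper names and the scalar fallback are assumptions of mine; which instructions the compiler actually emits depends on the compiler and its flags.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Two steps: multiply, then add the accumulator. */
void mul_then_add4(const float acc[4], const float x[4], const float y[4], float out[4])
{
#if HAVE_NEON
    float32x4_t prod = vmulq_f32(vld1q_f32(x), vld1q_f32(y));
    vst1q_f32(out, vaddq_f32(vld1q_f32(acc), prod));
#else
    for (int i = 0; i < 4; i++) {
        float prod = x[i] * y[i];
        out[i] = acc[i] + prod;
    }
#endif
}

/* One step: multiply-accumulate. */
void mul_accumulate4(const float acc[4], const float x[4], const float y[4], float out[4])
{
#if HAVE_NEON
    vst1q_f32(out, vmlaq_f32(vld1q_f32(acc), vld1q_f32(x), vld1q_f32(y)));
#else
    for (int i = 0; i < 4; i++)
        out[i] = acc[i] + x[i] * y[i];
#endif
}
```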

Also worth mentioning: multiply-accumulate operations like this are **very** common in linear algebra and DSP (digital signal processing) applications. ARM was very smart and implemented a **fast-path** inside the Cortex-A8 NEON core. This fast-path kicks in if the first argument (the accumulator) of a VMLA instruction is the result of a preceding VMUL or VMLA instruction. I could go into detail, but in a nutshell such an instruction series runs four times faster than a VMUL / VADD / VMUL / VADD series.

Take a look at my simple matrix-multiply: I did exactly that. Due to this fast-path it will run roughly four times faster than an implementation written using VMUL and VADD instead of VMLA.
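The accumulator chaining described above can be sketched with a dot product, the classic DSP multiply-accumulate loop. The function name `dot_f32`, the requirement that `n` be a multiple of 4, and the scalar fallback are assumptions made for this sketch. Whether the generated code actually benefits from the fast-path depends on how the compiler schedules the instructions, and I have not measured it here.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dot product of two float arrays.
   Assumes n is a multiple of 4 to keep the sketch short. */
float dot_f32(const float *x, const float *y, size_t n)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (size_t i = 0; i < n; i += 4) {
        /* the result of each VMLA is the accumulator of the next one */
        acc = vmlaq_f32(acc, vld1q_f32(x + i), vld1q_f32(y + i));
    }
    /* horizontal sum of the four lanes */
    float32x2_t sum2 = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    sum2 = vpadd_f32(sum2, sum2);
    return vget_lane_f32(sum2, 0);
#else
    /* Scalar fallback so the sketch also builds without NEON. */
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
#endif
}
```

Note that the NEON path sums the products in a different order than the scalar loop, so for general floating-point inputs the two results can differ in the last bits.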