[LAD] vectorization

Jens M Andreasen jens.andreasen at comhem.se
Mon May 5 05:19:23 UTC 2008

I believe your declaration looks something like this:

float  // array of complex
   ffta[N][2]  __attribute__ ((aligned(16))), 
   fftb[N][2]  __attribute__ ((aligned(16))), 
   data[N][2]  __attribute__ ((aligned(16)));

.. right?

If so, then I can get the auto-vectorizer in icc to kick in with this
construct of complex multiply add:

// CC=icc -O3 -msse3

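void cmadd(void)
{
   // view the [N][2] complex arrays as flat, interleaved (re, im) floats
   float *A = (float*) ffta;
   float *B = (float*) fftb;
   float *D = (float*) data;
   int i;
   for (i = 0; i < N*2; i += 2)
   {
      D[i]   += A[i] * B[i]   - A[i+1] * B[i+1];  // real part
      D[i+1] += A[i] * B[i+1] + A[i+1] * B[i];    // imaginary part
   }
}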

No luck with gcc though :-/
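
For anyone who wants to try the loop on their own compiler, here is a
self-contained sketch with a small correctness check. N = 8 is an
assumed value purely for illustration (the real N is not given above),
and cmadd_selftest() is a hypothetical helper that is not part of any
existing code. It only verifies the arithmetic with
(1 + 2i) * (3 + 4i) = -5 + 10i; it says nothing about whether the
compiler actually vectorized the loop.

```c
#include <assert.h>

#define N 8   /* assumed size, for illustration only */

/* Arrays of N complex values, stored as interleaved (re, im) pairs. */
float ffta[N][2] __attribute__ ((aligned(16)));
float fftb[N][2] __attribute__ ((aligned(16)));
float data[N][2] __attribute__ ((aligned(16)));

/* data += ffta * fftb, element-wise complex multiply-accumulate. */
void cmadd(void)
{
   float *A = (float*) ffta;
   float *B = (float*) fftb;
   float *D = (float*) data;
   int i;
   for (i = 0; i < N*2; i += 2)
   {
      D[i]   += A[i] * B[i]   - A[i+1] * B[i+1];  /* real part */
      D[i+1] += A[i] * B[i+1] + A[i+1] * B[i];    /* imaginary part */
   }
}

/* Hypothetical helper: checks one known product in element 0. */
void cmadd_selftest(void)
{
   ffta[0][0] = 1.0f; ffta[0][1] = 2.0f;   /* 1 + 2i */
   fftb[0][0] = 3.0f; fftb[0][1] = 4.0f;   /* 3 + 4i */
   data[0][0] = 0.0f; data[0][1] = 0.0f;
   cmadd();
   assert(data[0][0] == -5.0f);            /* 1*3 - 2*4 */
   assert(data[0][1] == 10.0f);            /* 1*4 + 2*3 */
}
```

Calling cmadd_selftest() once from main() is enough; the values are
small integers, so the float comparisons are exact.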


On Wed, 2008-04-23 at 09:59 +0200, Fons Adriaensen wrote:

> I tried out vectorizing the complex multiply-and-accumulate loop in
> zita-convolver. For long convolutions, and certainly if you have a
> convolution matrix, the MAC operation dominates the FFT and IFFT
> ones.
> This requires a permutation of the complex arrays as used by
> FFTW after each FFT and before each IFFT. In each block of 4
> complex values
>  x1 y1 x2 y2 x3 y3 x4 y4
> swap y1 with x3 and y2 with x4 to get
>  x1 x3 x2 x4 y1 y3 y2 y4
> which can be handled by the vector operations.
> The results are very marginal, about 5% relative speed increase
> even in cases where the MAC operations largely outnumber any
> others. Bypassing the permutations to have an idea of their cost
> didn't change anything.
> I'm somewhat surprised by this...
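
The permutation quoted above can be sketched in plain C as follows.
This is only my reading of the description, not code from
zita-convolver: the function names permute_blocks() and
cmadd_permuted() are hypothetical, and the number of complex values is
assumed to be a multiple of 4. After the two swaps each block of 8
floats holds four real parts followed by the four matching imaginary
parts, so the MAC works on groups of four contiguous floats.

```c
#include <assert.h>
#include <stddef.h>

/* In-place permutation of each block of 4 complex values (8 floats):
     x1 y1 x2 y2 x3 y3 x4 y4  ->  x1 x3 x2 x4 y1 y3 y2 y4
   It consists of two swaps, so applying it twice restores the
   original order. ncomplex is assumed to be a multiple of 4. */
void permute_blocks(float *v, size_t ncomplex)
{
   size_t i;
   assert(ncomplex % 4 == 0);
   for (i = 0; i < 2 * ncomplex; i += 8)
   {
      float t;
      t = v[i+1]; v[i+1] = v[i+4]; v[i+4] = t;   /* swap y1 and x3 */
      t = v[i+3]; v[i+3] = v[i+6]; v[i+6] = t;   /* swap y2 and x4 */
   }
}

/* D += A * B on permuted data. In each block, floats 0..3 are real
   parts and floats 4..7 are the corresponding imaginary parts. */
void cmadd_permuted(float *D, const float *A, const float *B,
                    size_t ncomplex)
{
   size_t i, k;
   assert(ncomplex % 4 == 0);
   for (i = 0; i < 2 * ncomplex; i += 8)
   {
      for (k = 0; k < 4; k++)
      {
         D[i+k]   += A[i+k] * B[i+k]   - A[i+k+4] * B[i+k+4];
         D[i+k+4] += A[i+k] * B[i+k+4] + A[i+k+4] * B[i+k];
      }
   }
}
```

Permuting A and B, accumulating into D, and then permuting D back
should give the same result as the interleaved loop.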
