[LAD] vectorization
Jens M Andreasen
jens.andreasen at comhem.se
Mon May 5 05:19:23 UTC 2008
I believe your declaration looks something like this:
float                                     // array of complex
    ffta[N][2] __attribute__ ((aligned(16))),
    fftb[N][2] __attribute__ ((aligned(16))),
    data[N][2] __attribute__ ((aligned(16)));
.. right?
If so, then I can get the auto-vectorizer in icc to kick in with this
formulation of the complex multiply-add:
// CC=icc -O3 -msse3
void cmadd(void)
{
    float *A = (float*) ffta;
    float *B = (float*) fftb;
    float *D = (float*) data;
    int i;

    for (i = 0; i < N*2; i += 2)
    {
        D[i]   += A[i] * B[i]   - A[i+1] * B[i+1];
        D[i+1] += A[i] * B[i+1] + A[i+1] * B[i];
    }
}
No luck with gcc though :-/
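
For anyone who wants to try the loop outside of zita-convolver, here is a
minimal self-contained sketch of the same arithmetic. The name cmadd_n and
its pointer-plus-length signature are made up for illustration and are not
from the original code. Because the arrays now arrive as parameters, the
restrict qualifiers state that they do not overlap, which is one of the
things an auto-vectorizer generally needs to know. Whether that is enough
to make gcc vectorize it has not been checked here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pointer-based variant of cmadd(): d += a * b for n complex
 * values, each stored interleaved as (re, im) pairs of floats.
 * The restrict qualifiers promise that d, a and b do not overlap. */
void cmadd_n(float *restrict d, const float *restrict a,
             const float *restrict b, size_t n)
{
    size_t i;

    assert(d != NULL && a != NULL && b != NULL);

    for (i = 0; i < 2 * n; i += 2)
    {
        /* real part: re(a)*re(b) - im(a)*im(b) */
        d[i]     += a[i] * b[i]     - a[i + 1] * b[i + 1];
        /* imaginary part: re(a)*im(b) + im(a)*re(b) */
        d[i + 1] += a[i] * b[i + 1] + a[i + 1] * b[i];
    }
}
```

With the declaration above it would be called as
cmadd_n(&data[0][0], &ffta[0][0], &fftb[0][0], N).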
/j
On Wed, 2008-04-23 at 09:59 +0200, Fons Adriaensen wrote:
> I tried out vectorizing the complex multiply-and-accumulate loop in
> zita-convolver. For long convolutions, and certainly if you have a
> convolution matrix, the MAC operation dominates the FFT and IFFT
> ones.
>
> This requires a permutation of the complex arrays as used by
> FFTW after each FFT and before each IFFT. In each block of 4
> complex values
>
> x1 y1 x2 y2 x3 y3 x4 y4
>
> swap y1 with x3 and y2 with x4 to get
>
> x1 x3 x2 x4 y1 y3 y2 y4
>
> which can be handled by the vector operations.
>
> The gain is very marginal: about a 5% relative speed increase,
> even in cases where the MAC operations largely outnumber any
> others. Bypassing the permutations, to get an idea of their cost,
> didn't change anything.
>
> I'm somewhat surprised by this...
>
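
The block permutation described in the quoted message, and a
multiply-accumulate on the permuted layout, can be sketched in plain C as
below. This is only an illustration under assumptions: the function names
permute4 and cmadd_permuted are hypothetical, the number of complex values
is assumed to be a multiple of 4, and the actual zita-convolver code may
well look different (it presumably uses vector operations rather than a
scalar inner loop).

```c
#include <assert.h>
#include <stddef.h>

static void swapf(float *p, float *q)
{
    float t = *p;
    *p = *q;
    *q = t;
}

/* In each block of 4 complex values  x1 y1 x2 y2 x3 y3 x4 y4
 * swap y1 with x3 and y2 with x4 to get  x1 x3 x2 x4 y1 y3 y2 y4.
 * Applying the function twice restores the original order. */
void permute4(float *v, size_t n)
{
    size_t i;

    assert(n % 4 == 0);
    for (i = 0; i < 2 * n; i += 8)
    {
        swapf(&v[i + 1], &v[i + 4]);   /* y1 <-> x3 */
        swapf(&v[i + 3], &v[i + 6]);   /* y2 <-> x4 */
    }
}

/* d += a * b on permuted data: in every block of 8 floats the first four
 * are real parts and the last four the matching imaginary parts, so the
 * inner loop maps onto 4-wide vector operations. */
void cmadd_permuted(float *restrict d, const float *restrict a,
                    const float *restrict b, size_t n)
{
    size_t i, k;

    assert(n % 4 == 0);
    for (i = 0; i < 2 * n; i += 8)
    {
        for (k = 0; k < 4; k++)
        {
            float ar = a[i + k], ai = a[i + 4 + k];
            float br = b[i + k], bi = b[i + 4 + k];

            d[i + k]     += ar * br - ai * bi;
            d[i + 4 + k] += ar * bi + ai * br;
        }
    }
}
```

The order of the four values inside a block (1, 3, 2, 4) does not matter
for the MAC itself, as long as all three arrays use the same permutation.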