I believe your declaration looks something like this:
float // array of complex
ffta[N][2] __attribute__ ((aligned(16))),
fftb[N][2] __attribute__ ((aligned(16))),
data[N][2] __attribute__ ((aligned(16)));
.. right?
If so, I can get the icc auto-vectorizer to kick in with this
complex multiply-add construct:
// CC=icc -O3 -msse3
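void cmadd(void)
{
    /* View the [N][2] complex arrays as flat, interleaved float arrays. */
    float *A = (float*) ffta;
    float *B = (float*) fftb;
    float *D = (float*) data;
    int i;

    for (i = 0; i < N*2; i += 2)
    {
        D[i]   += A[i] * B[i]   - A[i+1] * B[i+1];  /* real part */
        D[i+1] += A[i] * B[i+1] + A[i+1] * B[i];    /* imaginary part */
    }
}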
No luck with gcc though :-/
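For gcc, one possible fallback is to write the loop with SSE3 intrinsics by hand. Below is a minimal sketch, assuming the same interleaved (re, im) layout as above and an even number of complex values. The name cmadd_sse3 and its pointer/length parameters are made up for illustration, and I have not benchmarked it. It uses unaligned loads to stay safe; with the aligned(16) declarations above, _mm_load_ps/_mm_store_ps should work as well. Without -msse3 it falls back to the scalar loop.

```c
#include <stddef.h>
#if defined(__SSE3__)
#include <pmmintrin.h>  /* SSE3: _mm_moveldup_ps, _mm_movehdup_ps, _mm_addsub_ps */
#endif

/* d += a * b for n complex values stored interleaved (re, im, re, im, ...).
   Hypothetical helper, not part of any library; n is assumed to be even. */
void cmadd_sse3(float *d, const float *a, const float *b, size_t n)
{
    size_t i;
#if defined(__SSE3__)
    for (i = 0; i < 2 * n; i += 4) {       /* two complex values per step */
        __m128 va = _mm_loadu_ps(a + i);    /* ar0 ai0 ar1 ai1 */
        __m128 vb = _mm_loadu_ps(b + i);    /* br0 bi0 br1 bi1 */
        __m128 re = _mm_moveldup_ps(va);    /* ar0 ar0 ar1 ar1 */
        __m128 im = _mm_movehdup_ps(va);    /* ai0 ai0 ai1 ai1 */
        /* swap re/im within each complex value: bi0 br0 bi1 br1 */
        __m128 sw = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(2, 3, 0, 1));
        /* even lanes: ar*br - ai*bi, odd lanes: ar*bi + ai*br */
        __m128 pr = _mm_addsub_ps(_mm_mul_ps(re, vb), _mm_mul_ps(im, sw));
        _mm_storeu_ps(d + i, _mm_add_ps(_mm_loadu_ps(d + i), pr));
    }
#else
    /* Scalar fallback when the compiler is not targeting SSE3. */
    for (i = 0; i < 2 * n; i += 2) {
        d[i]     += a[i] * b[i]     - a[i + 1] * b[i + 1];
        d[i + 1] += a[i] * b[i + 1] + a[i + 1] * b[i];
    }
#endif
}
```

With the declarations above, the call would be something like cmadd_sse3((float*) data, (float*) ffta, (float*) fftb, N).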
/j
On Wed, 2008-04-23 at 09:59 +0200, Fons Adriaensen wrote:
I tried out vectorizing the complex
multiply-and-accumulate loop in
zita-convolver. For long convolutions, and certainly if you have
a convolution matrix, the MAC operation dominates the FFT and IFFT
ones.
This requires a permutation of the complex arrays as used by
FFTW after each FFT and before each IFFT. In each block of 4
complex values
x1 y1 x2 y2 x3 y3 x4 y4
swap y1 with x3 and y2 with x4 to get
x1 x3 x2 x4 y1 y3 y2 y4
which can be handled by the vector operations.
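
The permutation described above, and a MAC loop over the permuted layout, can be sketched in plain C as follows. This is only an illustration, not the actual zita-convolver code: the names permute_blocks and cmac_permuted are hypothetical, and the number of complex values is assumed to be a multiple of 4. Because the permutation consists of two swaps, applying it a second time restores the FFTW order.

```c
#include <stddef.h>

/* In every block of 4 complex values (8 floats), swap y1 with x3 and
   y2 with x4:  x1 y1 x2 y2 x3 y3 x4 y4  ->  x1 x3 x2 x4 y1 y3 y2 y4.
   The operation is its own inverse. Hypothetical helper name. */
static void permute_blocks(float *v, size_t n_complex)
{
    size_t i;
    for (i = 0; i < 2 * n_complex; i += 8) {
        float t;
        t = v[i + 1]; v[i + 1] = v[i + 4]; v[i + 4] = t;
        t = v[i + 3]; v[i + 3] = v[i + 6]; v[i + 6] = t;
    }
}

/* d += a * b on permuted arrays. In each block the first 4 floats are
   real parts and the last 4 are the matching imaginary parts, so the
   inner loop maps directly onto 4-wide vector operations. */
static void cmac_permuted(float *d, const float *a, const float *b,
                          size_t n_complex)
{
    size_t i, k;
    for (i = 0; i < 2 * n_complex; i += 8) {
        for (k = 0; k < 4; k++) {
            float ar = a[i + k], ai = a[i + 4 + k];
            float br = b[i + k], bi = b[i + 4 + k];
            d[i + k]     += ar * br - ai * bi;  /* real part */
            d[i + 4 + k] += ar * bi + ai * br;  /* imaginary part */
        }
    }
}
```

In this sketch the accumulator stays in the permuted layout across all MAC passes and is permuted back once, just before the IFFT.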
The gain is marginal: about a 5% relative speed increase,
even in cases where the MAC operations largely outnumber any
others. Bypassing the permutations to get an idea of their cost
didn't change anything.
I'm somewhat surprised by this...
--