On Sat, Apr 19, 2008 at 12:30:43AM +0300, Jussi Laako wrote:
> For simple operations, compilers are rather good at vectorization.
> Though I don't know if there's any support for multi-arch targets in
> gcc, so that an SSE2/SSE3 optimized binary would run on hardware
> without SSE (dynamic code selection)? I haven't had time to follow the
> latest gcc developments.
>
> For more complex operations like FIR, IIR, normalized cross-correlation
> or complex multiply-accumulate, I haven't seen any compiler able
> to match hand-crafted assembly code.

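On the dynamic code selection question, one way it could be done by hand is a function pointer chosen at startup. The sketch below is only an illustration under assumptions: it relies on the `__builtin_cpu_init` / `__builtin_cpu_supports` builtins and the `target` function attribute found in later GCC releases on x86, and the names `scale_generic`, `scale_sse2` and `select_scale` are hypothetical, not taken from any existing code.

```c
#include <stddef.h>

/* Hypothetical example kernel: multiply a buffer by a gain. */
typedef void (*scale_fn)(float *, size_t, float);

/* Baseline version, built with the default target flags. */
static void scale_generic(float *d, size_t n, float g)
{
    for (size_t i = 0; i < n; i++) d[i] *= g;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Same loop, but the compiler may use SSE2 for this function only. */
__attribute__((target("sse2")))
static void scale_sse2(float *d, size_t n, float g)
{
    for (size_t i = 0; i < n; i++) d[i] *= g;
}
#endif

/* Pick an implementation once, based on what the running CPU reports. */
static scale_fn select_scale(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2")) return scale_sse2;
#endif
    return scale_generic;
}
```

The caller would store the result of `select_scale()` once and call through it, so the binary still runs on hardware without SSE2.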
I tried vectorizing the complex multiply-and-accumulate loop in
zita-convolver. For long convolutions, and certainly if you have a
convolution matrix, the MAC operations dominate the FFT and IFFT
ones.

Vectorizing requires a permutation of the complex arrays as used by
FFTW, after each FFT and before each IFFT. In each block of 4
complex values

   x1 y1 x2 y2 x3 y3 x4 y4

swap y1 with x3 and y2 with x4 to get

   x1 x3 x2 x4 y1 y3 y2 y4

which can be handled by the vector operations.
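
For illustration, the permutation and the MAC on the permuted layout could look roughly like the sketch below. This is plain portable C, not the actual zita-convolver code: the function names `permute_blocks` and `mac_permuted` are made up, single-precision interleaved data is assumed, and the number of complex values is assumed to be a multiple of 4. The inner loops run over 4 adjacent floats, which is the shape a 4-wide vector unit wants.

```c
#include <assert.h>
#include <stddef.h>

/* Permute each block of 4 interleaved complex values in place:
 *   x1 y1 x2 y2 x3 y3 x4 y4  ->  x1 x3 x2 x4 y1 y3 y2 y4
 * n is the number of complex values (assumed a multiple of 4).
 * Applying the function twice restores the original order, so the
 * same routine serves after the FFT and before the IFFT. */
static void permute_blocks(float *d, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < 2 * n; i += 8) {
        float t;
        t = d[i + 1]; d[i + 1] = d[i + 4]; d[i + 4] = t;  /* y1 <-> x3 */
        t = d[i + 3]; d[i + 3] = d[i + 6]; d[i + 6] = t;  /* y2 <-> x4 */
    }
}

/* acc += a * b (complex), with all three arrays in the permuted layout.
 * In each block, floats 0..3 are real parts and floats 4..7 are the
 * matching imaginary parts, so every line below works on 4 adjacent
 * floats at a time. */
static void mac_permuted(float *acc, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < 2 * n; i += 8) {
        const float *ar = a + i, *ai = a + i + 4;
        const float *br = b + i, *bi = b + i + 4;
        float *cr = acc + i, *ci = acc + i + 4;
        for (int k = 0; k < 4; k++) {
            cr[k] += ar[k] * br[k] - ai[k] * bi[k];
            ci[k] += ar[k] * bi[k] + ai[k] * br[k];
        }
    }
}
```

Since both inputs are permuted the same way, lane k of the real group always pairs with lane k of the imaginary group, and no shuffles are needed inside the MAC loop itself.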

The gain is marginal: about a 5% relative speed increase, even in
cases where the MAC operations largely outnumber all others.
Bypassing the permutations to get an idea of their cost didn't
change anything.

I'm somewhat surprised by this...
--
FA
Laboratorio di Acustica ed Elettroacustica
Parma, Italia
Lascia la spina, cogli la rosa.