fons at kokkinizita.net
Wed Apr 23 07:59:54 UTC 2008
On Sat, Apr 19, 2008 at 12:30:43AM +0300, Jussi Laako wrote:
> For simple operations, compilers are rather good on vectorization. Even
> though I don't know if there's any support for multi-arch targets on
> gcc, so that the SSE2/SSE3 optimized binary would run on hardware
> without SSE (dynamic code selection)? I haven't got time to follow the
> latest gcc developments.
> For more complex operations like FIR, IIR, normalized cross-correlation
> or complex multiply-accumulate, I haven't seen any compiler being able
> to match hand-crafted assembly code.
I tried out vectorizing the complex multiply-and-accumulate loop in
zita-convolver. For long convolutions, and certainly if you have a
convolution matrix, the MAC operation dominates the FFT and IFFT.
This requires a permutation of the complex arrays as used by
FFTW after each FFT and before each IFFT. In each block of 4 complex values
x1 y1 x2 y2 x3 y3 x4 y4
swap y1 with x3 and y2 with x4 to get
x1 x3 x2 x4 y1 y3 y2 y4
which can be handled by the vector operations.
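
A minimal sketch in plain C of the permutation described above and of a MAC loop over the permuted layout. This is not the zita-convolver code. The function names (permute_blocks, mac_permuted, mac_interleaved) are made up for illustration, and the arrays are assumed to be single-precision floats holding n complex values, with n a multiple of 4. The inner 4-wide loop is written so that a compiler can map it onto 4-float vector operations, but whether it actually does so depends on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>

/* Reorder each block of 4 interleaved complex values
 *   x1 y1 x2 y2 x3 y3 x4 y4  ->  x1 x3 x2 x4 y1 y3 y2 y4
 * by swapping y1 with x3 and y2 with x4. The swaps are their own
 * inverse, so calling this again restores the interleaved layout.
 * n is the number of complex values (assumed multiple of 4). */
void permute_blocks(float *d, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < 2 * n; i += 8) {
        float t;
        t = d[i + 1]; d[i + 1] = d[i + 4]; d[i + 4] = t;  /* y1 <-> x3 */
        t = d[i + 3]; d[i + 3] = d[i + 6]; d[i + 6] = t;  /* y2 <-> x4 */
    }
}

/* acc += a * b (complex), all three arrays in the permuted layout:
 * each block of 8 floats is 4 real parts followed by 4 imaginary parts.
 * The order of the values within each group of 4 does not matter here,
 * as long as all arrays use the same permutation. */
void mac_permuted(float *acc, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < 2 * n; i += 8) {
        const float *ar = a + i, *ai = a + i + 4;
        const float *br = b + i, *bi = b + i + 4;
        float *cr = acc + i, *ci = acc + i + 4;
        for (int k = 0; k < 4; k++) {
            cr[k] += ar[k] * br[k] - ai[k] * bi[k];
            ci[k] += ar[k] * bi[k] + ai[k] * br[k];
        }
    }
}

/* Reference version on interleaved (re, im) pairs, for comparison. */
void mac_interleaved(float *acc, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float ar = a[2 * i], ai = a[2 * i + 1];
        float br = b[2 * i], bi = b[2 * i + 1];
        acc[2 * i]     += ar * br - ai * bi;
        acc[2 * i + 1] += ar * bi + ai * br;
    }
}
```

In this sketch one would call permute_blocks on each spectrum after the FFT, accumulate with mac_permuted, and call permute_blocks on the accumulator again before the IFFT.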
The gain is marginal: about a 5% relative speed increase,
even in cases where the MAC operations largely outnumber any
others. Bypassing the permutations to get an idea of their cost
didn't change anything.
I'm somewhat surprised by this...
Laboratorio di Acustica ed Elettroacustica
Lascia la spina, cogli la rosa.