[LAD] vectorization

Fons Adriaensen fons at kokkinizita.net
Wed Apr 23 07:59:54 UTC 2008


On Sat, Apr 19, 2008 at 12:30:43AM +0300, Jussi Laako wrote:

> For simple operations, compilers are rather good on vectorization. Even 
> though I don't know if there's any support for multi-arch targets on 
> gcc, so that the SSE2/SSE3 optimized binary would run on hardware 
> without SSE (dynamic code selection)? I haven't got time to follow the 
> latest gcc developments.
> 
> For more complex operations like FIR, IIR, normalized cross-correlation 
> or complex multiply-accumulate, I haven't seen any compiler being able 
> to match hand-crafted assembly code.

I tried out vectorizing the complex multiply-and-accumulate loop in
zita-convolver. For long convolutions, and certainly if you have a
convolution matrix, the MAC operations dominate the FFT and IFFT
ones.

This requires a permutation of the complex arrays as used by
FFTW after each FFT and before each IFFT. In each block of 4
complex values
 
 x1 y1 x2 y2 x3 y3 x4 y4

swap y1 with x3 and y2 with x4 to get

 x1 x3 x2 x4 y1 y3 y2 y4

which can be handled by the vector operations.
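
To make the layout concrete, here is a minimal sketch in plain C of the
permutation and of a MAC loop working on the permuted data. It is not the
actual zita-convolver code: the function names permute_blocks and
mac_permuted are made up for illustration, the arrays are assumed to hold
single-precision interleaved complex values, and the number of complex
values is assumed to be a multiple of 4. The inner loop over k is written
as scalar code; each of its lines corresponds to one 4-wide vector
operation.

```c
#include <assert.h>
#include <stddef.h>

/* Permute interleaved complex data in place, one block of 4 complex
 * values (8 floats) at a time:
 *   x1 y1 x2 y2 x3 y3 x4 y4  ->  x1 x3 x2 x4 y1 y3 y2 y4
 * The two swaps are disjoint, so applying the routine twice restores
 * the original order. The same call can therefore be used after each
 * FFT and before each IFFT. */
static void permute_blocks(float *d, size_t ncomplex)
{
    assert(ncomplex % 4 == 0);
    for (size_t i = 0; i < 2 * ncomplex; i += 8) {
        float t;
        t = d[i + 1]; d[i + 1] = d[i + 4]; d[i + 4] = t;  /* y1 <-> x3 */
        t = d[i + 3]; d[i + 3] = d[i + 6]; d[i + 6] = t;  /* y2 <-> x4 */
    }
}

/* acc += a * b (complex), with all three arrays in the permuted layout:
 * floats 0..3 of each block are real parts, floats 4..7 the matching
 * imaginary parts. The order within each group of 4 does not matter
 * as long as it is the same in all arrays. */
static void mac_permuted(float *acc, const float *a, const float *b,
                         size_t ncomplex)
{
    assert(ncomplex % 4 == 0);
    for (size_t i = 0; i < 2 * ncomplex; i += 8) {
        for (size_t k = 0; k < 4; k++) {   /* one vector lane per k */
            float ar = a[i + k], ai = a[i + 4 + k];
            float br = b[i + k], bi = b[i + 4 + k];
            acc[i + k]     += ar * br - ai * bi;  /* real part */
            acc[i + 4 + k] += ar * bi + ai * br;  /* imaginary part */
        }
    }
}
```

Since the accumulator stays in the permuted layout across all MAC passes,
only one call to permute_blocks is needed per FFT output and one per IFFT
input, however many products are accumulated in between.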

The gain is marginal: about a 5% relative speed increase,
even in cases where the MAC operations largely outnumber any
others. Bypassing the permutations to get an idea of their cost
didn't change anything.

I'm somewhat surprised by this...

-- 
FA

Laboratorio di Acustica ed Elettroacustica
Parma, Italia

Lascia la spina, cogli la rosa.



