Re: [LAD] vectorization

23 Apr 2008

On Sat, Apr 19, 2008 at 12:30:43AM +0300, Jussi Laako wrote:
...
 For simple operations, compilers are rather good on vectorization. Even
 though I don't know if there's any support for multi-arch targets on
 gcc, so that the SSE2/SSE3 optimized binary would run on hardware
 without SSE (dynamic code selection)? I haven't got time to follow the
 latest gcc developments.
 For more complex operations like FIR, IIR, normalized cross-correlation
 or complex multiply-accumulate, I haven't seen any compiler being able
 to match hand-crafted assembly code.
I tried vectorizing the complex multiply-and-accumulate (MAC) loop in
zita-convolver. For long convolutions, and certainly if you have a
convolution matrix, the MAC operations dominate the FFT and IFFT
ones.
This requires a permutation of the complex arrays as used by
FFTW after each FFT and before each IFFT. In each block of 4
complex values
 x1 y1 x2 y2 x3 y3 x4 y4
swap y1 with x3 and y2 with x4 to get
 x1 x3 x2 x4 y1 y3 y2 y4
which can be handled by the vector operations.
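
As a rough illustration of the layout described above, here is a minimal
scalar sketch in C. It is not the zita-convolver code: the function names
permute_blocks and mac_permuted are hypothetical, the data is assumed to be
single-precision floats, and the number of complex values is assumed to be a
multiple of 4. The inner loop over k in mac_permuted stands in for what would
be a single 4-wide vector operation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: permute interleaved complex data in place.
 * Each block  x1 y1 x2 y2 x3 y3 x4 y4
 * becomes     x1 x3 x2 x4 y1 y3 y2 y4.
 * The permutation consists of two swaps, so applying it a second
 * time restores the original interleaved order. */
void permute_blocks(float *d, size_t ncomplex)
{
    assert(ncomplex % 4 == 0);  /* assumption: whole blocks only */
    for (size_t i = 0; i < 2 * ncomplex; i += 8) {
        float t;
        t = d[i + 1]; d[i + 1] = d[i + 4]; d[i + 4] = t;  /* y1 <-> x3 */
        t = d[i + 3]; d[i + 3] = d[i + 6]; d[i + 6] = t;  /* y2 <-> x4 */
    }
}

/* Hypothetical helper: acc += a * b (complex), with all three arrays
 * already in the permuted layout. Within a block, elements 0..3 are
 * real parts and elements 4..7 the matching imaginary parts. */
void mac_permuted(float *acc, const float *a, const float *b, size_t ncomplex)
{
    assert(ncomplex % 4 == 0);
    for (size_t i = 0; i < 2 * ncomplex; i += 8) {
        for (size_t k = 0; k < 4; k++) {
            float ar = a[i + k], ai = a[i + 4 + k];
            float br = b[i + k], bi = b[i + 4 + k];
            acc[i + k]     += ar * br - ai * bi;  /* real part */
            acc[i + 4 + k] += ar * bi + ai * br;  /* imaginary part */
        }
    }
}
```

Because the permutation is its own inverse, the same routine can serve both
after each FFT and before each IFFT. The order of the values within the real
and imaginary groups (1 3 2 4) does not matter for the MAC, as long as both
operands and the accumulator use the same layout.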
The gain is marginal: about a 5% relative speed increase,
even in cases where the MAC operations largely outnumber any
others. Bypassing the permutations to get an idea of their cost
made no measurable difference.
I'm somewhat surprised by this...
--
FA
Laboratorio di Acustica ed Elettroacustica
Parma, Italia
Lascia la spina, cogli la rosa.
