[LAD] vectorization

Jussi Laako jussi at sonarnerd.net
Wed Apr 23 19:05:09 UTC 2008

Fons Adriaensen wrote:
> I tried out vectorizing the complex multiply-and-accumulate loop in
> zita-convolver. For long convolutions and certainly if you have 
> The results are very marginal, about 5% relative speed increase
> even in cases where the MAC operations largely outnumber any

For me, the complex MAC operation written for SSE3 practically doubled 
the speed for double precision and more than doubled it for single 
precision, compared to the "-march=i686 -O3 -ffast-math" case (the code 
has to run on practically all x86 platforms).

Prior to SSE3, there was no nice way to do complex multiplication on 
SSE. Now it can be done in three instructions for two single precision 
complex numbers.
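
To illustrate the kind of SSE3 code meant here, below is a minimal sketch 
in C intrinsics. It is not the code from my library and not necessarily 
the same instruction sequence counted above, just one common way to use 
the SSE3 additions (movsldup, movshdup, addsubps). The function name 
cmac2 and the interleaved re/im layout are assumptions made for the 
example. A scalar fallback is included so it still builds without -msse3.

```c
#ifdef __SSE3__
#include <pmmintrin.h>  /* SSE3 intrinsics */
#endif

/* Hypothetical helper: acc += a * b for two single precision complex
 * numbers, stored interleaved as {re0, im0, re1, im1}. */
static void cmac2(float *acc, const float *a, const float *b)
{
#ifdef __SSE3__
    __m128 va = _mm_loadu_ps(a);           /* ar0 ai0 ar1 ai1 */
    __m128 vb = _mm_loadu_ps(b);           /* br0 bi0 br1 bi1 */
    __m128 b_re = _mm_moveldup_ps(vb);     /* br0 br0 br1 br1 */
    __m128 b_im = _mm_movehdup_ps(vb);     /* bi0 bi0 bi1 bi1 */
    /* swap real and imaginary parts of a: ai0 ar0 ai1 ar1 */
    __m128 a_sw = _mm_shuffle_ps(va, va, _MM_SHUFFLE(2, 3, 0, 1));
    /* addsub subtracts in even lanes and adds in odd lanes:
     * re = ar*br - ai*bi, im = ai*br + ar*bi */
    __m128 prod = _mm_addsub_ps(_mm_mul_ps(va, b_re),
                                _mm_mul_ps(a_sw, b_im));
    _mm_storeu_ps(acc, _mm_add_ps(_mm_loadu_ps(acc), prod));
#else
    /* Scalar fallback computing the same result. */
    for (int k = 0; k < 4; k += 2) {
        float re = a[k] * b[k] - a[k + 1] * b[k + 1];
        float im = a[k + 1] * b[k] + a[k] * b[k + 1];
        acc[k] += re;
        acc[k + 1] += im;
    }
#endif
}
```

Unaligned loads and stores are used to keep the sketch safe for any 
buffer; with 16-byte aligned data the aligned variants could be used 
instead.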

Still, one of the most elegant solutions is E3DNow on AMD, which can do 
a single precision complex multiply in four instructions.

These instruction counts are for the calculation itself. In addition, it 
of course needs the load and store operations, where SSE3 requires a few 
extra instructions compared to E3DNow.


	- Jussi
