Fons Adriaensen wrote:
I tried out vectorizing the complex
multiply-and-accumulate loop in
zita-convolver. For long convolutions and certainly if you have
The results are very marginal, about 5% relative speed increase
even in cases where the MAC operations largely outnumber any
For me, the complex MAC operation written for SSE3 practically doubled
the speed for double precision and more than doubled it for single
precision, compared to the "-march=i686 -O3 -ffast-math" case (the code
has to run on practically all x86 platforms).
Prior to SSE3, there was no nice way to do complex multiplication with
SSE. Now it can be done in three instructions for two single-precision
complex numbers.
Still, one of the most elegant solutions is E3DNow on AMD, which can do
a single-precision complex multiply in four instructions.
These instruction counts are for the calculation itself; in addition, it
of course needs the load and store operations, where SSE3 requires a few
extra instructions compared to E3DNow.
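To make the SSE3 case concrete, here is a minimal sketch of a complex MAC step for two interleaved single-precision complex numbers, written with compiler intrinsics. It is not the actual code from either project. The names cmac2 and cmac_ref, the interleaved (re, im) layout and the unaligned loads are assumptions made for illustration. One possible reading of the "three instructions" count is the two multiplies plus ADDSUBPS; under that reading the duplicate and shuffle steps fall under the extra load-side work. The accumulate adds one more ADDPS.

```c
#include <stddef.h>

/* Scalar reference: acc[k] += a[k] * b[k] for n interleaved (re, im) pairs. */
static void cmac_ref(float *acc, const float *a, const float *b, size_t n)
{
    for (size_t k = 0; k < n; ++k) {
        float ar = a[2 * k], ai = a[2 * k + 1];
        float br = b[2 * k], bi = b[2 * k + 1];
        acc[2 * k]     += ar * br - ai * bi;
        acc[2 * k + 1] += ar * bi + ai * br;
    }
}

#if defined(__x86_64__) || defined(__i386__)
#include <pmmintrin.h> /* SSE3 intrinsics */

#if defined(__GNUC__)
/* Lets GCC/Clang build this one function for SSE3 without -msse3. */
#define SSE3_FN __attribute__((target("sse3")))
#else
#define SSE3_FN
#endif

/* SSE3 version for exactly two complex numbers (four floats per array). */
SSE3_FN
static void cmac2(float *acc, const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);           /* ar0 ai0 ar1 ai1 */
    __m128 vb = _mm_loadu_ps(b);           /* br0 bi0 br1 bi1 */

    /* Data arrangement: duplicate real and imaginary parts of b, swap a. */
    __m128 b_re = _mm_moveldup_ps(vb);     /* br0 br0 br1 br1 */
    __m128 b_im = _mm_movehdup_ps(vb);     /* bi0 bi0 bi1 bi1 */
    __m128 a_sw = _mm_shuffle_ps(va, va, _MM_SHUFFLE(2, 3, 0, 1)); /* ai0 ar0 ai1 ar1 */

    /* Arithmetic: two multiplies and one ADDSUBPS. */
    __m128 t1 = _mm_mul_ps(va, b_re);      /* ar*br  ai*br  ... */
    __m128 t2 = _mm_mul_ps(a_sw, b_im);    /* ai*bi  ar*bi  ... */
    __m128 prod = _mm_addsub_ps(t1, t2);   /* ar*br-ai*bi  ai*br+ar*bi  ... */

    /* Accumulate and store. */
    _mm_storeu_ps(acc, _mm_add_ps(_mm_loadu_ps(acc), prod));
}
#else
/* Non-x86 fallback so the sketch still builds: plain scalar code. */
static void cmac2(float *acc, const float *a, const float *b)
{
    cmac_ref(acc, a, b, 2);
}
#endif
```

In a real convolver loop the accumulator would normally stay in a register across iterations, and aligned loads could be used if the buffers are 16-byte aligned.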
BR,
- Jussi