jussi at sonarnerd.net
Wed Apr 23 19:05:09 UTC 2008
Fons Adriaensen wrote:
> I tried out vectorizing the complex multiply-and-accumulate loop in
> zita-convolver. For long convolutions and certainly if you have
> The results are very marginal, about 5% relative speed increase
> even in cases where the MAC operations largely outnumber any
For me, the complex MAC operation written for SSE3 practically doubled
the speed for double precision and more than doubled it for single
precision, compared to the "-march=i686 -O3 -ffast-math" case (the code
has to run on practically all x86 platforms).
Prior to SSE3, there was no nice way to do complex multiplication on
SSE. Now it can be done in three instructions for two single precision
Still, one of the most elegant is E3DNow on AMD; it can do a single
precision complex multiply in four instructions.
These instruction counts are for the calculation itself; in addition, it
of course needs the load and store operations, where SSE3 requires a few
extra instructions compared to E3DNow.
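
As a rough illustration of the SSE3 approach, here is a minimal sketch of a
single precision complex MAC using intrinsics. It is not the code measured
above and not zita-convolver code. The function name cmac_f32 and the
interleaved (re, im) layout are assumptions made for the example, and the
instruction count of this formulation will not necessarily match the figures
quoted above. Without SSE3 enabled at compile time (e.g. no -msse3), only the
scalar loop is built.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __SSE3__
#include <pmmintrin.h>  /* SSE3: _mm_moveldup_ps, _mm_movehdup_ps, _mm_addsub_ps */
#endif

/* acc[k] += a[k] * b[k] for n complex values stored as interleaved
 * (re, im) float pairs. Hypothetical helper for illustration only. */
static void cmac_f32(float *acc, const float *a, const float *b, size_t n)
{
    size_t k = 0;

    assert(acc != NULL && a != NULL && b != NULL);

#ifdef __SSE3__
    /* Two complex values per iteration; unaligned loads keep the sketch simple. */
    for (; k + 2 <= n; k += 2) {
        __m128 va = _mm_loadu_ps(a + 2 * k);   /* ar0 ai0 ar1 ai1 */
        __m128 vb = _mm_loadu_ps(b + 2 * k);   /* br0 bi0 br1 bi1 */
        __m128 re = _mm_moveldup_ps(va);       /* ar0 ar0 ar1 ar1 */
        __m128 im = _mm_movehdup_ps(va);       /* ai0 ai0 ai1 ai1 */
        /* Swap real and imaginary parts of b: bi0 br0 bi1 br1 */
        __m128 sw = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(2, 3, 0, 1));
        __m128 p0 = _mm_mul_ps(re, vb);        /* ar*br  ar*bi ... */
        __m128 p1 = _mm_mul_ps(im, sw);        /* ai*bi  ai*br ... */
        /* Even lanes subtract, odd lanes add:
         * (ar*br - ai*bi, ar*bi + ai*br) for both complex values. */
        __m128 pr = _mm_addsub_ps(p0, p1);
        __m128 vc = _mm_loadu_ps(acc + 2 * k);
        _mm_storeu_ps(acc + 2 * k, _mm_add_ps(vc, pr));
    }
#endif

    /* Scalar tail, and the full loop when SSE3 is not available. */
    for (; k < n; k++) {
        float ar = a[2 * k], ai = a[2 * k + 1];
        float br = b[2 * k], bi = b[2 * k + 1];
        acc[2 * k]     += ar * br - ai * bi;
        acc[2 * k + 1] += ar * bi + ai * br;
    }
}
```

The moveldup/movehdup pair duplicates the real and imaginary parts of one
operand, and addsub supplies the sign pattern that a complex product needs.
The loads, the shuffle and the store are the extra work around the
calculation itself.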