Re: [LAD] vectorization

23 Apr 2008

Fons Adriaensen wrote:
...
  I tried out vectorizing the complex
multiply-and-accumulate loop in
 zita-convolver. For long convolutions and certainly if you have
 The results are very marginal, about 5% relative speed increase
 even in cases where the MAC operations largely outnumber any 
For me, the complex MAC operation written for SSE3 practically doubled
the speed for double precision, and more than doubled it for single
precision, compared to the "-march=i686 -O3 -ffast-math" case (the code
has to run on practically all x86 platforms).

Prior to SSE3, there was no nice way to do complex multiplication with
SSE. Now it can be done in three instructions for two single-precision
complex numbers.

Still, one of the most elegant solutions is E3DNow on AMD; it can do a
single-precision complex multiply in four instructions.

These instruction counts are for the calculation itself. In addition,
it of course needs the load and store operations, where SSE3 requires a
few extra instructions compared to E3DNow.
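
For illustration, here is a minimal sketch of how a single-precision
complex multiply-and-accumulate can be written with SSE3 intrinsics. It is
not the code referred to in this message. The function name cmac_f32, the
interleaved (re, im) float layout, and the requirement that n be even are
all assumptions made for the example. One plausible reading of the
"three instructions" count is the two multiplies plus ADDSUBPS, with the
duplicate and shuffle steps counted as data arrangement. A scalar fallback
is included so the file still builds when SSE3 is not enabled (for example
without -msse3 on GCC).

```c
#include <assert.h>
#include <stddef.h>
#ifdef __SSE3__
#include <pmmintrin.h>   /* SSE3 intrinsics: _mm_moveldup_ps, _mm_movehdup_ps, _mm_addsub_ps */
#endif

/* Hypothetical helper: acc[k] += a[k] * b[k] for n complex numbers.
 * Assumed layout: interleaved floats (re0, im0, re1, im1, ...).
 * Assumption: n is even, so each iteration handles two complex numbers. */
void cmac_f32(float *acc, const float *a, const float *b, size_t n)
{
    assert(n % 2 == 0);
    for (size_t k = 0; k < 2 * n; k += 4) {
#ifdef __SSE3__
        __m128 va = _mm_loadu_ps(a + k);      /* ar0 ai0 ar1 ai1 */
        __m128 vb = _mm_loadu_ps(b + k);      /* br0 bi0 br1 bi1 */
        __m128 re = _mm_moveldup_ps(va);      /* ar0 ar0 ar1 ar1 */
        __m128 im = _mm_movehdup_ps(va);      /* ai0 ai0 ai1 ai1 */
        /* Swap real and imaginary parts of b: bi0 br0 bi1 br1 */
        __m128 sw = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(2, 3, 0, 1));
        __m128 t1 = _mm_mul_ps(re, vb);       /* ar*br  ar*bi  ... */
        __m128 t2 = _mm_mul_ps(im, sw);       /* ai*bi  ai*br  ... */
        /* Subtract in even lanes, add in odd lanes:
         * (ar*br - ai*bi, ar*bi + ai*br) for both complex numbers. */
        __m128 prod = _mm_addsub_ps(t1, t2);
        _mm_storeu_ps(acc + k, _mm_add_ps(_mm_loadu_ps(acc + k), prod));
#else
        /* Scalar fallback computing the same two complex products. */
        for (size_t j = k; j < k + 4; j += 2) {
            float ar = a[j], ai = a[j + 1];
            float br = b[j], bi = b[j + 1];
            acc[j]     += ar * br - ai * bi;
            acc[j + 1] += ar * bi + ai * br;
        }
#endif
    }
}
```

Unaligned loads and stores are used here only to keep the sketch free of
alignment requirements; a real convolution loop would more likely keep its
buffers 16-byte aligned and use the aligned variants.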
BR,
        - Jussi
