[...]
Another strategy for complex multiply-add is to organize the data as
separate arrays of the same component type: one array of real parts
followed by another array of imaginary parts, perhaps something like
this:
typedef struct {
    float r[N] __attribute__ ((aligned(16)));
    float i[N] __attribute__ ((aligned(16)));
} cvec_t;

cvec_t cA, cB, cD;
It can be argued that the data is now scattered all over the place, and
that twice as many moves are needed to perform a single cmadd. This may
be true in some cases, as we shall see later, but when you are operating
on vectors you usually want to load and store (at least) four variables
at a time anyway.
The routine to calculate them all becomes:
for (i = 0; i < N; ++i)
{
    cD.r[i] += cA.r[i] * cB.r[i] - cA.i[i] * cB.i[i];
    cD.i[i] += cA.r[i] * cB.i[i] + cA.i[i] * cB.r[i];
}
This auto-vectorizes really well with icc -O3 -msse; here it is looped
a billion times and compared to the original routine:
clock: 1340 ms (cvec_t) <--
This is not a typo!
clock: 12820 ms (original array of complex)
It is not too bad with gcc -O3 -msse -ftree-vectorize either:
clock: 6490 ms (cvec_t)
clock: 14190 ms (original array of complex)
But things are not so rosy in the vector department if we also plan to
distribute this code fragment as part of a generic i386 binary package,
say, conservatively compiled with gcc -O2:
clock: 18880 ms (cvec_t) <--
ouch!
clock: 14230 ms (original array of complex)
That was pretty bad! Your trusty 100 MHz Pentium suddenly got downgraded
to 66 MHz :-/
So to conclude: a more than tenfold speedup of cmadd() on large arrays
is possible by a) rearranging the data into a format that fits modern
machines, and b) switching compilers (which perhaps not everybody is
willing to do).
/j