Jens M Andreasen wrote:
> Could you try this out with your proposed compiler options on your own
> hardware?
I also added my own asm flavor...
With "-O3 -msse3 -ffast-math -ftree-vectorize -fprefetch-loop-arrays" on gcc:
clock: 15740 ms (_Complex)
clock: 24930 ms (cvec_t)
clock: 17770 ms (original float array[N][2])
clock: 13660 ms (asm on float array)
With "-O3 -xO -fp-model fast" on icc (all variants vectorized):
clock: 1030 ms (_Complex)
clock: 520 ms (cvec_t)
clock: 16250 ms (original float array[N][2])
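
For anyone who wants to try something similar, here is a rough sketch of two
of the layouts being compared (_Complex and the plain float array[N][2]) plus
a helper that turns clock() ticks into milliseconds. The kernel (an
element-wise complex multiply) and the names mul_complex, mul_array and
clock_ms are assumptions made up for illustration. They are not the code that
produced the numbers above, and cvec_t and the asm variant are left out
because their definitions are not shown here.

```c
/* Illustrative sketch only: the kernel and all names are assumptions,
   not the benchmark that was actually timed in this thread. */
#include <complex.h>
#include <stddef.h>
#include <time.h>

/* Layout 1: C99 complex type, the compiler handles the arithmetic. */
void mul_complex(float _Complex *dst, float _Complex *a,
                 float _Complex *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] * b[i];
}

/* Layout 2: interleaved float array[N][2], [0] = real, [1] = imaginary. */
void mul_array(float (*dst)[2], float (*a)[2], float (*b)[2], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float re = a[i][0] * b[i][0] - a[i][1] * b[i][1];
        float im = a[i][0] * b[i][1] + a[i][1] * b[i][0];
        dst[i][0] = re;
        dst[i][1] = im;
    }
}

/* Convert a clock() interval to milliseconds of CPU time. */
double clock_ms(clock_t start, clock_t end)
{
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
}
```

To time a variant, take clock() before and after a loop that calls it many
times and pass both values to clock_ms(). Make sure the results are actually
used afterwards, otherwise the optimizer may drop the whole loop at -O3.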
> (although it seems to shine when paired up with icc :)
icc is very nice... :)
- Jussi