On Wed, Jul 20, 2011 at 1:27 AM, Maurizio De Cecco <jmax(a)dececco.name> wrote:
On clang/MacOSX I get an impressive improvement in performance,
around 4x on the operations, even when just using the vector types for
copying data; my impression is that the compiler uses some kind of vector
load/store instruction that makes proper use of the available memory
bandwidth, but unfortunately I do not know more about the x86 architecture.
On gcc/Linux (gcc 4.5.2), the same code produces a *slowdown* of around
2.5x.
It's possible that gcc generates code for lowest-common-denominator hardware
by default, and that you need to pass a compiler option to get it to use
vector operations. See
http://gcc.gnu.org/onlinedocs/gcc/i386-and-x86_002d64-Options.html
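
To make that concrete, here is a minimal sketch of the kind of code I have in mind. I have not seen your source, so this is only a guess at its shape: the type name v4sf and the helpers copy_floats and add_floats are hypothetical. It uses the vector_size attribute, which both gcc and clang accept, and goes through memcpy for loads and stores so that it does not assume 16-byte-aligned pointers.

```c
#include <stddef.h>
#include <string.h>

/* GCC/Clang vector extension: four packed floats, 16 bytes in total.
   The name v4sf is only an illustrative choice. */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical helper: copy n floats, four at a time where possible. */
void copy_floats(float *dst, const float *src, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        v4sf v;
        memcpy(&v, src + i, sizeof v);   /* load without assuming alignment */
        memcpy(dst + i, &v, sizeof v);   /* store without assuming alignment */
    }
    for (; i < n; i++)                   /* scalar tail for leftover elements */
        dst[i] = src[i];
}

/* Hypothetical helper: dst[i] = a[i] + b[i], using element-wise vector add. */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        v4sf va, vb, vc;
        memcpy(&va, a + i, sizeof va);
        memcpy(&vb, b + i, sizeof vb);
        vc = va + vb;                    /* one vector add when SIMD is enabled */
        memcpy(dst + i, &vc, sizeof vc);
    }
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

With gcc, something like "gcc -O2 -msse2 file.c" or "gcc -O2 -march=native file.c" should allow it to emit SSE instructions for the vector type. If the target has no vector instructions enabled (I believe this is the default for a 32-bit x86 build), gcc still accepts the code but has to lower each vector operation to scalar code, which could plausibly account for a slowdown. Comparing the output of "gcc -S" with and without the option would show whether that is what is happening in your case.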