On Wed, Jul 20, 2011 at 10:47 AM, Robin Gareus <robin(a)gareus.org> wrote:
On gcc/Linux (gcc 4.5.2), the same code produces a *slowdown* of
around 2.5x.
Does anybody have an idea why?
I am actually running Linux (Ubuntu 11.04) in a VMware virtual
machine; I do not know whether this has any implications.
Maybe. A better comparison would be: clang/Linux vs. gcc/Linux and
clang/MacOSX vs gcc/MacOSX compiled binaries.
Also as Dan already pointed out: gcc has a whole lot of optimization
flags which are not enabled by default. Try '-O3 -msse2 -ffast-math'.
'-ftree-vectorizer-verbose=2' is handy while optimizing code.
In addition... inspecting the disassembly is helpful (-S -o
myprogram.s). Rule of thumb is that you should see `movaps` (MOVe
Aligned Packed Single-precision) and `mulps` (MULtiply Packed
Single-precision) instructions for multiplying vectors of
single-precision floats.
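
To try the rule of thumb above on something small, here is a minimal
sketch of a loop that gcc's auto-vectorizer should be able to handle.
It is not the code from this thread; the function name `mul_vectors`
and the file name `mul.c` are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical example (not the original poster's code): multiply two
 * float arrays element by element.
 * `restrict` promises the compiler that the three arrays do not
 * overlap, which removes one common obstacle to vectorization. */
void mul_vectors(float *restrict out, const float *restrict a,
                 const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i];   /* candidate for a packed mulps */
}
```

No main() is needed just to look at the assembly:

$ gcc -std=c99 -O3 -msse2 -ffast-math -S -o mul.s mul.c
$ grep -n 'mulps' mul.s

If the compiler cannot prove the arrays are 16-byte aligned, you may
see unaligned loads such as `movups` instead of `movaps`; what you want
to avoid in the hot loop is the scalar `mulss`.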
In addition... profiling with valgrind/callgrind is helpful (esp. if
you have it dump instructions/assembly)...
$ valgrind --tool=callgrind --dump-instr=yes ./myprogram
Open the output file with kcachegrind and it'll save you a lot of time.
-gabriel