On 07/20/2011 10:27 AM, Maurizio De Cecco wrote:
I am playing around with GCC and Clang vector extensions, on Linux and
Mac OS X, and I am getting some strange behaviour.
I am working on jMax Phoenix, and its DSP engine, in its current state,
is very memory-bound; it is based on the aggregation of very
small-granularity operations, like vector sum or multiply, each of them
executed independently, reading its operands from and writing its result back to memory.
I tried to implement all these 'primitive' operations using the vector
types.
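Just as an illustration (not the actual jMax code), such a primitive might look roughly like the sketch below with the GCC/Clang vector extensions; the names v4sf and vec_add are made up for the example, and n is assumed to be a multiple of 4:

#include <stddef.h>

/* A 4-float SIMD type built with the GCC/Clang vector extensions. */
typedef float v4sf __attribute__((vector_size(16)));

/* Element-wise sum of two float buffers, 4 floats per iteration. */
static void vec_add(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        v4sf va, vb;
        __builtin_memcpy(&va, a + i, sizeof va);   /* unaligned-safe load */
        __builtin_memcpy(&vb, b + i, sizeof vb);
        v4sf vr = va + vb;                         /* SIMD add */
        __builtin_memcpy(dst + i, &vr, sizeof vr); /* store */
    }
}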
On clang/MacOSX I get an impressive improvement in performance,
around 4x on the operations, even just using the vector types for
copying data; my impression is that the compiler uses some kind of vector
load/store instructions that properly use the available memory bandwidth,
but unfortunately I do not know much more about the x86 architecture.
On gcc/Linux (gcc 4.5.2), the same code produces a *slowdown* of around
2.5x.
Well, does anybody have an idea why?
I am actually running Linux (Ubuntu 11.04) under a VMware virtual
machine; I do not know if this may have any implications.
Maybe. A better comparison would be: clang/Linux vs. gcc/Linux and
clang/MacOSX vs gcc/MacOSX compiled binaries.
Also, as Dan already pointed out: gcc has a whole lot of optimization
flags which are not enabled by default. Try '-O3 -msse2 -ffast-math'.
'-ftree-vectorizer-verbose=2' is handy while optimizing code.
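For example, something like this (the file name dsp.c is just a placeholder):

gcc -O3 -msse2 -ffast-math -ftree-vectorizer-verbose=2 -c dsp.c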
have fun,
robin
Or you can use LLVM to *directly* generate vector code, as in the following example,
the result of some experiments done with Faust and its LLVM backend:
block_code8: ; preds = %block_code8.block_code8_crit_edge, %block_code3
%20 = phi float* [ %15, %block_code3 ], [ %.pre11, %block_code8.block_code8_crit_edge ]
%21 = phi float* [ %14, %block_code3 ], [ %.pre10, %block_code8.block_code8_crit_edge ]
%22 = phi float* [ %16, %block_code3 ], [ %.pre9, %block_code8.block_code8_crit_edge ]
%indvar = phi i32 [ 0, %block_code3 ], [ %indvar.next, %block_code8.block_code8_crit_edge ]
%nextindex1 = shl i32 %indvar, 2
%nextindex = add i32 %nextindex1, 4
%23 = sext i32 %nextindex1 to i64
%24 = getelementptr float* %22, i64 %23
%25 = getelementptr float* %21, i64 %23
%26 = bitcast float* %25 to <4 x float>*
%27 = load <4 x float>* %26, align 1
%28 = getelementptr float* %20, i64 %23
%29 = bitcast float* %28 to <4 x float>*
%30 = load <4 x float>* %29, align 1
%31 = fadd <4 x float> %27, %30
%32 = bitcast float* %24 to <4 x float>*
store <4 x float> %31, <4 x float>* %32, align 1
%33 = icmp ult i32 %nextindex, %18
br i1 %33, label %block_code8.block_code8_crit_edge, label %exit_block6
In this block the float* pointers are "bitcast" to pointers to vectors of 4 floats,
the vectors of 4 floats are loaded, then manipulated with the LLVM IR vector versions of add,
mult, etc., then stored.
The LLVM IR is still generated with the "conservative" "align 1"
option, since the backend cannot yet be sure the data is always aligned. The resulting SSE code then
uses MOVUPS (Move Unaligned Packed Single-Precision Floating-Point Values). The next
step is to generate something like:
%27 = load <4 x float>* %26, align 16
so that MOVAPS (Move Aligned Packed Single-Precision Floating-Point Values) is used
instead.
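On the C side the same distinction can be made explicit; this is only a sketch, assuming 16-byte aligned buffers (vec_add_aligned and buffer are made-up names):

#include <stddef.h>

/* may_alias keeps the vector-pointer casts below safe under
   strict aliasing. */
typedef float v4sf __attribute__((vector_size(16), __may_alias__));

/* Direct vector loads/stores: only safe if dst, a and b are
   16-byte aligned, but then the compiler can emit MOVAPS
   instead of MOVUPS. */
static void vec_add_aligned(float *dst, const float *a,
                            const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += 4)
        *(v4sf *)(dst + i) = *(const v4sf *)(a + i) + *(const v4sf *)(b + i);
}

/* One way to guarantee the alignment of static storage: */
static float buffer[1024] __attribute__((aligned(16)));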
We already see some nice speed improvements, but the Faust vector LLVM backend version is
still not complete...
Stéphane