On 07/20/2011 10:27 AM, Maurizio De Cecco wrote:
I am playing around with GCC and Clang vector extensions, on Linux and
Mac OS X, and I am getting some strange behaviour.
I am working on jMax Phoenix, and its DSP engine, in its current state,
is very memory-bound; it is based on the aggregation of very
small-granularity operations, like vector sum or multiply, each of them
executed independently, reading its operands from and writing its result back to memory.
I tried to implement all these 'primitive' operations using the vector
types.
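Just as an illustration (not the actual jMax code), such a primitive might look roughly like the sketch below with the GCC/Clang vector extensions; the names v4sf and vec_add are made up for the example, and n is assumed to be a multiple of 4:

#include <stddef.h>

/* A 4-float SIMD type built with the GCC/Clang vector extensions. */
typedef float v4sf __attribute__((vector_size(16)));

/* Element-wise sum of two float buffers, 4 floats per iteration. */
static void vec_add(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        v4sf va, vb;
        __builtin_memcpy(&va, a + i, sizeof va);   /* unaligned-safe load */
        __builtin_memcpy(&vb, b + i, sizeof vb);
        v4sf vr = va + vb;                         /* SIMD add */
        __builtin_memcpy(dst + i, &vr, sizeof vr); /* store */
    }
}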
On clang/MacOSX I get an impressive improvement in performance,
around 4x on the operations, even just using the vector types for
copying data; my impression is that the compiler uses some kind of vector
load/store instructions that properly use the available memory bandwidth,
but unfortunately I do not know much more about the x86 architecture.
On gcc/Linux (gcc 4.5.2), the same code produces a *slowdown* of around
2.5x.
Well, does anybody have an idea why?
I am actually running Linux (Ubuntu 11.04) under a VMware virtual
machine; I do not know if this may have any implications.
Maybe. A better comparison would be: clang/Linux vs. gcc/Linux and
clang/MacOSX vs gcc/MacOSX compiled binaries.
Also, as Dan already pointed out: gcc has a whole lot of optimization
flags which are not enabled by default. Try '-O3 -msse2 -ffast-math'.
'-ftree-vectorizer-verbose=2' is handy while optimizing code.
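For example, something like this (the file name dsp.c is just a placeholder):

gcc -O3 -msse2 -ffast-math -ftree-vectorizer-verbose=2 -c dsp.c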
have fun,
robin
Or you can use LLVM to *directly* generate vector code, as in the following example,
the result of some experiments done with Faust and its LLVM backend:
block_code8: ; preds = %block_code8.block_code8_crit_edge, %block_code3
%20 = phi float* [ %15, %block_code3 ], [ %.pre11, %block_code8.block_code8_crit_edge ]
%21 = phi float* [ %14, %block_code3 ], [ %.pre10, %block_code8.block_code8_crit_edge ]
%22 = phi float* [ %16, %block_code3 ], [ %.pre9, %block_code8.block_code8_crit_edge ]
%indvar = phi i32 [ 0, %block_code3 ], [ %indvar.next, %block_code8.block_code8_crit_edge ]
%nextindex1 = shl i32 %indvar, 2
%nextindex = add i32 %nextindex1, 4
%23 = sext i32 %nextindex1 to i64
%24 = getelementptr float* %22, i64 %23
%25 = getelementptr float* %21, i64 %23
%26 = bitcast float* %25 to <4 x float>*
%27 = load <4 x float>* %26, align 1
%28 = getelementptr float* %20, i64 %23
%29 = bitcast float* %28 to <4 x float>*
%30 = load <4 x float>* %29, align 1
%31 = fadd <4 x float> %27, %30
%32 = bitcast float* %24 to <4 x float>*
store <4 x float> %31, <4 x float>* %32, align 1
%33 = icmp ult i32 %nextindex, %18
br i1 %33, label %block_code8.block_code8_crit_edge, label %exit_block6
In this block the float* pointers are "bitcast" to pointers to vectors of 4 floats,
the vectors of 4 floats are loaded, then manipulated with the LLVM IR vector versions of add,
mult, etc., then stored.
The LLVM IR is still generated with the "conservative" "align 1"
option, since the backend cannot yet be sure the data is always aligned. The resulting SSE code then
uses MOVUPS (Move Unaligned Packed Single-Precision Floating-Point Values). The next
step is to generate something like:
%27 = load <4 x float>* %26, align 16
so that MOVAPS (Move Aligned Packed Single-Precision Floating-Point Values) is used
instead.
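On the C side the same distinction can be made explicit; this is only a sketch, assuming 16-byte aligned buffers (vec_add_aligned and buffer are made-up names):

#include <stddef.h>

/* may_alias keeps the vector-pointer casts below safe under
   strict aliasing. */
typedef float v4sf __attribute__((vector_size(16), __may_alias__));

/* Direct vector loads/stores: only safe if dst, a and b are
   16-byte aligned, but then the compiler can emit MOVAPS
   instead of MOVUPS. */
static void vec_add_aligned(float *dst, const float *a,
                            const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += 4)
        *(v4sf *)(dst + i) = *(const v4sf *)(a + i) + *(const v4sf *)(b + i);
}

/* One way to guarantee the alignment of static storage: */
static float buffer[1024] __attribute__((aligned(16)));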
We already see some nice speed improvements, but the Faust vector LLVM backend version is
still not complete...
Stéphane