[LAD] GCC Vector extensions

Stéphane Letz letz at grame.fr
Thu Jul 21 12:22:06 UTC 2011


> 
> On 07/20/2011 10:27 AM, Maurizio De Cecco wrote:
>> I am playing around with GCC and Clang vector extensions, on Linux and
>> Mac OS X, and I am getting some strange behaviour.
>> 
>> I am working on jMax Phoenix, and its dsp engine, in its current state,
>> is very memory bound; it is based on the aggregation of very small
>> granularity operations, like vector sum or multiply, each of them
>> executed independently from and to memory.
>> 
>> I tried to implement all these 'primitive' operations using the vector
>> types.
>> 
>> On clang/MacOSX i get an impressive improvement in performance,
>> around 4x on the operations, even just using the vector types for
>> copying data; my impression is that the compiler uses some kind of vector
>> load/store instruction that properly use the available memory bandwidth,
>> but unfortunately I do not know much about the x86 architecture.
>> 
>> On gcc/Linux (gcc 4.5.2), the same code produces a *slowdown* of around
>> 2.5x.
>> 
>> Well, anybody have an idea of why ?
>> 
>> I am actually running Linux (Ubuntu 11.04) under a VMware virtual
>> machine; I do not know if this may have any implications.
> 
> Maybe. A better comparison would be: clang/Linux vs. gcc/Linux and
> clang/MacOSX vs gcc/MacOSX compiled binaries.
> 
> Also as Dan already pointed out: gcc has a whole lot of optimization
> flags which are not enabled by default. Try '-O3 -msse2 -ffast-math'.
> '-ftree-vectorizer-verbose=2' is handy while optimizing code.
> 
> have fun,
> robin
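For readers not familiar with them, the GCC/Clang vector extensions under discussion boil down to this kind of code (a minimal sketch; the names v4sf and vec4_add are illustrative, not from jMax):

```c
/* A type holding 4 packed floats, via the GCC/Clang vector extension. */
typedef float v4sf __attribute__((vector_size(16)));

/* Element-wise sum: the + operator works per lane, so with SSE enabled
   at -O3 this should compile down to a single packed add. */
static v4sf vec4_add(v4sf a, v4sf b)
{
    return a + b;
}
```

Built with the suggested 'gcc -O3 -msse2 -ffast-math', such primitives map to single SSE instructions rather than scalar loops.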

Or you can use LLVM to *directly* generate vector code, as in the following example, the result of some experiments done with Faust and its LLVM backend:

block_code8:                                      ; preds = %block_code8.block_code8_crit_edge, %block_code3
  %20 = phi float* [ %15, %block_code3 ], [ %.pre11, %block_code8.block_code8_crit_edge ]
  %21 = phi float* [ %14, %block_code3 ], [ %.pre10, %block_code8.block_code8_crit_edge ]
  %22 = phi float* [ %16, %block_code3 ], [ %.pre9, %block_code8.block_code8_crit_edge ]
  %indvar = phi i32 [ 0, %block_code3 ], [ %indvar.next, %block_code8.block_code8_crit_edge ]
  %nextindex1 = shl i32 %indvar, 2
  %nextindex = add i32 %nextindex1, 4
  %23 = sext i32 %nextindex1 to i64
  %24 = getelementptr float* %22, i64 %23
  %25 = getelementptr float* %21, i64 %23
  %26 = bitcast float* %25 to <4 x float>*
  %27 = load <4 x float>* %26, align 1
  %28 = getelementptr float* %20, i64 %23
  %29 = bitcast float* %28 to <4 x float>*
  %30 = load <4 x float>* %29, align 1
  %31 = fadd <4 x float> %27, %30
  %32 = bitcast float* %24 to <4 x float>*
  store <4 x float> %31, <4 x float>* %32, align 1
  %33 = icmp ult i32 %nextindex, %18
  br i1 %33, label %block_code8.block_code8_crit_edge, label %exit_block6

In this block, float* pointers are "bitcast" to pointers to vectors of 4 floats, the vectors are loaded, manipulated with the LLVM IR vector versions of add, mult, etc., then stored.
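In C terms, the IR block corresponds roughly to the following (a sketch, not the actual Faust-generated code):

```c
/* A vector of 4 packed floats (GCC/Clang vector extension). */
typedef float v4sf __attribute__((vector_size(16)));

/* Sum two float arrays 4 lanes at a time, mirroring the bitcast /
   load / fadd / store sequence in the IR. n is assumed to be a
   multiple of 4. Note: dereferencing a v4sf* promises 16-byte
   alignment to the compiler; with possibly-unaligned data, copy
   into a local vector with memcpy instead. */
static void block_add(const float *a, const float *b, float *out, unsigned n)
{
    for (unsigned i = 0; i < n; i += 4) {
        v4sf va = *(const v4sf *)(a + i);   /* load <4 x float> */
        v4sf vb = *(const v4sf *)(b + i);
        *(v4sf *)(out + i) = va + vb;       /* fadd <4 x float>, then store */
    }
}
```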

The LLVM IR is still generated with the conservative "align 1" option, since the compiler cannot yet be sure the data is always aligned. The resulting SSE code will then use MOVUPS (Move Unaligned Packed Single-Precision Floating-Point Values). The next step is to generate code like:

 %27 = load <4 x float>* %26, align 16

so that MOVAPS (Move Aligned Packed Single-Precision Floating-Point Values) is used instead.
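For that to be legal, the audio buffers themselves have to be 16-byte aligned; on POSIX systems this can be guaranteed at allocation time (a sketch; the helper name is made up):

```c
#include <stdlib.h>

/* Allocate a float buffer aligned to 16 bytes, so that aligned SSE
   loads/stores (MOVAPS) are safe on it. posix_memalign is POSIX;
   the helper name is illustrative. Free the result with free(). */
static float *alloc_floats_aligned16(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, n * sizeof(float)) != 0)
        return NULL;
    return (float *)p;
}
```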

We already see some nice speed improvements, but the Faust vector LLVM backend is still not complete...

Stéphane 

More information about the Linux-audio-dev mailing list