[LAD] GCC Vector extensions

Gabriel Beddingfield gabrbedd at gmail.com
Tue Jul 26 12:30:16 UTC 2011


On 07/26/2011 03:15 AM, Maurizio De Cecco wrote:
>> So are you now considering using some #ifdef to select float/4 instead of
>> double/8 vectors in jMax, or just changing all of them?
>
> Well, at the moment on gcc the performance with vector types is the same
> as without vector types, so i'll leave the Linux version without vector
> types (the code is #ifdef'ed).

When I was playing around with this last night, the best performance
came from your plain code, with no hand optimization and no vector types.

Why?

Because GCC turned it into optimized, vectorized code on its own.
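
For reference, the kind of plain loop GCC can vectorize by itself looks
roughly like the sketch below.  The name add3_scalar is made up for
illustration (it is not from jMax), and the build flags are an
assumption: something like -O3, which turns on -ftree-vectorize, plus
-msse or -msse2 on 32-bit x86.

```c
#include <stddef.h>

/* Hypothetical scalar version: out[i] = a[i] + b[i].
 * No vector types here; `restrict` promises the compiler that the
 * buffers do not overlap, which makes auto-vectorization easier. */
void add3_scalar(const float * restrict a, const float * restrict b,
                 float * restrict out, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Whether the loop really gets vectorized depends on the compiler version
and flags, so it is worth checking the generated assembly (gcc -S).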

> By the way, i forgot to mention that all my tests were at 64 bits;
> i'll try later on a 32 bit Ubuntu.

I was on 32-bit Ubuntu.  Also, GCC's 64-bit optimizer is known to
be better at optimizing SIMD code.

Because I'm a sucker for these kinds of diversions, I came up with a
scheme that shaved about 1 second off your test (on my machine).  It
assumes that `vecsize` is a power of two (and at least 16, since each
pass handles 16 floats).  The idea is to hold the data in processor
registers and to access each buffer one cache line at a time (a cache
line is 64 bytes on x86, i.e. 16 floats).

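/* v4sf is not defined in this message; this assumes the usual GCC
 * vector-extension typedef for four packed floats. */
typedef float v4sf __attribute__ ((vector_size (16)));

/* arg2[i] = arg0[i] + arg1[i].  vecsize must be a multiple of 16 and
 * all three buffers must be 16-byte aligned. */
static inline void add3_vec(float * restrict arg0, float * restrict arg1,
                            float * restrict arg2, unsigned int vecsize)
{
   v4sf *v0, *v1, *v2;
   v4sf c0, c1, c2, c3, c4, c5, c6, c7;
   const unsigned cache_size = 4;   /* v4sf vectors per 64-byte cache line */

   v0 = (v4sf*)arg0;
   v1 = (v4sf*)arg1;
   v2 = (v4sf*)arg2;
   vecsize /= 4*cache_size;         /* number of 16-float blocks */

   while(vecsize--) {
      /* load one cache line from each input into registers */
      c0 = *v0++;
      c1 = *v0++;
      c2 = *v0++;
      c3 = *v0++;
      c4 = *v1++;
      c5 = *v1++;
      c6 = *v1++;
      c7 = *v1++;
      /* add and store one cache line of output */
      *v2++ = c0 + c4;
      *v2++ = c1 + c5;
      *v2++ = c2 + c6;
      *v2++ = c3 + c7;
   }
}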
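
If `vecsize` is not guaranteed to be a multiple of 16, the same scheme
can be followed by a scalar loop for the leftover elements.  The block
below is only a sketch under stated assumptions: add3_any is a made-up
name, the v4sf typedef is the usual GCC vector-extension form (assumed,
not taken from jMax), and all three buffers are assumed to be 16-byte
aligned.

```c
#include <stddef.h>

/* Assumed typedef: four packed floats (GCC vector extension). */
typedef float v4sf __attribute__ ((vector_size (16)));

/* Hypothetical variant: out[i] = a[i] + b[i] for any n.
 * Full blocks of 16 floats go through the vector path, the
 * remaining n % 16 elements through a scalar loop.
 * Buffers must be 16-byte aligned and must not overlap. */
static void add3_any(const float * restrict a, const float * restrict b,
                     float * restrict out, size_t n)
{
    const v4sf *va = (const v4sf *)a;
    const v4sf *vb = (const v4sf *)b;
    v4sf *vo = (v4sf *)out;
    size_t blocks = n / 16;          /* full 64-byte cache lines */
    size_t i;

    while (blocks--) {
        /* one cache line from each input */
        v4sf c0 = va[0], c1 = va[1], c2 = va[2], c3 = va[3];
        v4sf c4 = vb[0], c5 = vb[1], c6 = vb[2], c7 = vb[3];

        vo[0] = c0 + c4;
        vo[1] = c1 + c5;
        vo[2] = c2 + c6;
        vo[3] = c3 + c7;

        va += 4;
        vb += 4;
        vo += 4;
    }

    /* scalar tail: whatever did not fill a whole block */
    for (i = n - n % 16; i < n; i++)
        out[i] = a[i] + b[i];
}
```

With power-of-two buffer sizes of 16 or more the tail loop never runs,
so it should cost next to nothing in the common case.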

-gabriel


