Re: [LAD] GCC Vector extensions

26 Jul 2011

On 07/26/2011 03:15 AM, Maurizio De Cecco wrote:
...
>> So are you now considering using some #ifdef to select float/4
>> instead of double/8 vectors in jMax, or just changing all of them?
>
> Well, at the moment on gcc the performance with vector types is the
> same as without vector types, so I'll leave the Linux version without
> vector types (the code is #ifdef'ed).

When I was playing around with this last night... the best performance
came from your non-optimized, non-vectored code.
Why?
Because GCC translated it to optimized, vectored code.
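
To illustrate the point, here is a minimal sketch of the kind of plain
loop GCC can auto-vectorize. The function name add3_scalar is
hypothetical and not taken from the jMax code. The restrict qualifiers
tell the compiler the buffers do not overlap, and -O3 enables
-ftree-vectorize, so a build along the lines of "gcc -std=c99 -O3"
would typically be enough to get SIMD code out of it.

```c
#include <assert.h>
#include <stddef.h>

/* Plain scalar add: no vector types, no intrinsics.
 * Hypothetical name, for illustration only.
 * Computes out[i] = a[i] + b[i] for i in [0, n). */
static void add3_scalar(const float *restrict a, const float *restrict b,
                        float *restrict out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);

    /* Simple counted loop with non-overlapping buffers: a good
     * candidate for the compiler's auto-vectorizer. */
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Whether the loop really was vectorized can be checked by reading the
generated assembly (gcc -S) and looking for packed instructions such
as addps.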
...
> By the way, I forgot to mention that all my tests were at 64 bits;
> I'll try later on a 32 bit Ubuntu.

I was on 32 bit Ubuntu.  Also, with GCC the 64-bit optimizer is known to
be better at optimising SIMD code.

Because I'm a sucker for these kinds of diversions, I came up with a
scheme that shaved about 1 second off your test (on my machine).  It
assumes that `vecsize` is a power of two (and at least 16).  The idea
is to store stuff in the processor registers, and access each buffer
one cache line at a time (a cache line is 64 bytes on x86... 16 floats).

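/* Assumed definition (GCC vector extension); not shown in this mail. */
typedef float v4sf __attribute__((vector_size(16)));

/* arg2[i] = arg0[i] + arg1[i].  Buffers must be 16-byte aligned and
 * vecsize a multiple of 16; any remainder is silently skipped. */
static inline void add3_vec(float * restrict arg0, float * restrict arg1,
                            float * restrict arg2, unsigned int vecsize)
{
   v4sf *v0, *v1, *v2;
   v4sf c0, c1, c2, c3, c4, c5, c6, c7;
   const unsigned cache_size = 4;  /* v4sf vectors per 64-byte cache line */

   v0 = (v4sf*)arg0;
   v1 = (v4sf*)arg1;
   v2 = (v4sf*)arg2;
   vecsize /= 4*cache_size;        /* number of 16-float blocks */
   while(vecsize--) {
           /* load one cache line from each input into registers */
           c0 = *v0++;
           c1 = *v0++;
           c2 = *v0++;
           c3 = *v0++;
           c4 = *v1++;
           c5 = *v1++;
           c6 = *v1++;
           c7 = *v1++;
           /* add and store one cache line of output */
           *v2++ = c0 + c4;
           *v2++ = c1 + c5;
           *v2++ = c2 + c6;
           *v2++ = c3 + c7;
   }
}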
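
The routine above drops any trailing samples when vecsize is not a
multiple of 16. As a rough sketch of one way to cope with arbitrary
lengths, the vector loop can be followed by a scalar tail loop. This
is an illustration under assumptions, not part of the tested code: the
name add3_with_tail is hypothetical, the buffers are assumed to be
16-byte aligned, and the may_alias attribute is added here so that
reading float buffers through the vector type does not run afoul of
strict aliasing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GCC vector extension: four packed floats (16 bytes).
 * may_alias is an addition for this sketch. */
typedef float v4sf __attribute__((vector_size(16), may_alias));

/* Hypothetical helper: out[i] = a[i] + b[i] for any n.
 * Assumes all three buffers are 16-byte aligned. */
static void add3_with_tail(const float *restrict a, const float *restrict b,
                           float *restrict out, size_t n)
{
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)b % 16 == 0);
    assert((uintptr_t)out % 16 == 0);

    const v4sf *va = (const v4sf *)a;
    const v4sf *vb = (const v4sf *)b;
    v4sf *vo = (v4sf *)out;
    size_t nvec = n / 4;            /* number of whole 4-float vectors */
    size_t i;

    /* Vector part: four floats per iteration. */
    for (i = 0; i < nvec; i++)
        vo[i] = va[i] + vb[i];

    /* Scalar tail: the 0 to 3 samples left over. */
    for (i = nvec * 4; i < n; i++)
        out[i] = a[i] + b[i];
}
```

This version does not unroll by cache line, so it is not expected to
match the timing of the unrolled loop; it only shows the tail
handling.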
-gabriel
